**David Piepgrass** @dpiepgrass@schelling.pt · 2023-01-06T03:53:08Z

From Twitter

RT @CollinBurns4
How can we figure out if what a language model says is true, even when human evaluators can’t easily tell?

We show (http://arxiv.org/abs/2212.03827) that we can identify whether text is true or false directly from a model’s *unlabeled activations*. 🧵