done reading arxiv.org/abs/2204.05212
weird, unexpected result

= setup =
give a human a reading comprehension task about a looong sci-fi story, with A/B options. give them 2 arguments with supporting quotes, one arguing for A, one for B. give them 90 seconds to read both arguments & quotes, and consult the text.

measure how often they pick the right option.
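in code, the measurement looks roughly like this (my own sketch, not from the paper; all names made up):

```python
import random
from dataclasses import dataclass

@dataclass
class Trial:
    options: tuple    # (text of option A, text of option B)
    correct: int      # 0 or 1, index of the right answer
    quotes: tuple     # one supporting quote per side
    arguments: tuple  # one full argument per side

def accuracy(trials, judge, show_arguments):
    """fraction of trials where the judge picks the correct option"""
    hits = 0
    for t in trials:
        # quotes-only condition vs. quotes+arguments condition
        evidence = (t.quotes, t.arguments) if show_arguments else (t.quotes,)
        hits += judge(t.options, evidence) == t.correct
    return hits / len(trials)

# sanity-check judge that guesses at random (the real judges were humans on a 90s timer)
random_judge = lambda options, evidence: random.randrange(2)
```

the surprising finding is then: accuracy(trials, human, show_arguments=True) ≈ accuracy(trials, human, show_arguments=False).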

= result =
there's **no difference** in accuracy between showing them just the quotes and showing quotes+arguments.

this result matters because AI safety via debate (arxiv.org/abs/1805.00899) is one of the main proposals for how to align strong superhuman AIs.

it's basically a scaled-up version of this setup: 2 players argue about whether A or B is right, and a human judges the result based on the arguments.
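as a toy sketch (my simplification of that proposal, function names hypothetical), the debate game is something like:

```python
def debate(question, options, debater_a, debater_b, judge, rounds=3):
    """two debaters argue for opposite answers; a judge picks a winner
    from the transcript alone"""
    transcript = []
    for _ in range(rounds):
        transcript.append(("A", debater_a(question, options, transcript)))
        transcript.append(("B", debater_b(question, options, transcript)))
    # crucially, the judge only sees the transcript, not the full problem
    return judge(question, options, transcript)  # returns 0 (A wins) or 1 (B wins)
```

the paper's experiment is roughly the rounds=1 (single-turn) special case of this, with quotes as the debaters' evidence.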

I am confused by this negative result and hope it's wrong / doesn't hold for bigger, non-toy problems.


@agentydragon someone needs to review the entire evidence here

In general, having Debate be not robust enough to ~always work (see also Obfuscated Arguments) is a bad sign
