@niplav I dunno what shard theory is, but I agree with this notion.
@Paradox If you're interested: https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values
@Paradox They claim that human learning is a lot like current AI training: A lot of self-supervised pre-training+some fine-tuning+a little bit of RL (and in this view then multi-agent RL on top)
and since humans hv the ability to do hypocrisy (aka value-action gap, rationalisation, memetic-immune-system), it enables our verbal values to evolve independently of what makes effective behaviour. this is crucial, and (i think) extremely lucky, bc no brain cud possibly evolve cosmopolitan values if it had to actually implement it in its behaviour.
@rime wouldn't go as far as Ngo to say all of alignment risk comes from here but seems like a rather large source