@niplav I dunno what shard theory is, but I agree with this notion.
@Paradox If you're interested: https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values
@Paradox They claim that human learning is a lot like current AI training: a lot of self-supervised pre-training + some fine-tuning + a little bit of RL (and, in this view, multi-agent RL on top)
and since humans hv the ability to do hypocrisy (aka value-action gap, rationalisation, memetic-immune-system), it enables our verbal values to evolve independently of what makes effective behaviour. this is crucial, and (i think) extremely lucky, bc no brain cud possibly evolve cosmopolitan values if it had to actually implement them in its behaviour.
alas, i think it's highly unlikely that a given learning-regime will make the AI 1) evolve proxy-values optimised for seeming nice to others upon ~direct inspection, and 2) overgeneralise those proxy-values to actual behaviour, unless somehow carefwly designed that way. (this isn't a suggestion; i'm just talking abt the ontogeny of human values).
@rime staring at ontological crises for a while makes me believe this too
More parsimonious ai-values might be pretty weird to humans as an axis, just as simplicity priors are strange
@rime love this explanation! Explains some tension: if some parts generalize twd altruism and others twd selfishness you have to find the equilibrium
@rime wouldn't go as far as Ngo to say all of alignment risk comes from here but seems like a rather large source
"effective altruism" is the v rare mutation where a brain starts to break down its own rationalisation/hypocrisy-barriers, and instead of then becoming consistently selfish, it generalises the other way, such that verbal values start to influence actual behaviour. humans can do this bc we are v prone to overgeneralising our learned proxies.