@niplav I dunno what shard theory is, but I agree with this notion.
@Paradox If you're interested: https://www.lesswrong.com/posts/iCfdcxiyr2Kj8m8mT/the-shard-theory-of-human-values
@Paradox If you're interested: they claim that human learning is a lot like current AI training: a lot of self-supervised pre-training + some fine-tuning + a little bit of RL (and, in this view, multi-agent RL on top of that)
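(rough sketch of that pipeline as code, if it helps — the function names here are just placeholders for the stages, not any real training API:)

```python
# illustrative stubs only -- standing in for the stages named above,
# not any actual library

def pretrain(model, corpus):        # bulk self-supervised prediction
    return model

def finetune(model, demos):         # smaller supervised pass on curated examples
    return model

def rl_finetune(model, reward_fn):  # a little bit of RL on top
    return model

def multi_agent_rl(model, agents):  # and, on this view, multi-agent RL last
    return model

def train(model, corpus, demos, reward_fn, agents):
    model = pretrain(model, corpus)
    model = finetune(model, demos)
    model = rl_finetune(model, reward_fn)
    return multi_agent_rl(model, agents)
```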
"effective altruism" is the v rare mutation where a brain starts to break down its own rationalisation/hypocrisy-barriers, and instead of then becoming consistently selfish, it generalises the other way, such that verbal values start to influence actual behaviour. humans can do this bc we are v prone to overgeneralising our learned proxies.
@rime staring at ontological crises for a while makes me believe this too
More parsimonious AI-values might be pretty weird to humans along that axis, just as simplicity priors are strange
@rime love this explanation! Explains some tension: if some parts generalize toward altruism and others toward selfishness, you have to find the equilibrium
@rime wouldn't go as far as Ngo in saying all of alignment risk comes from here, but it seems like a rather large source
alas, i think it's highly unlikely that a given learning-regime will make the AI 1) evolve proxy-values optimised for seeming nice to others upon ~direct inspection, and 2) overgeneralise those proxy-values to actual behaviour, unless somehow carefully designed that way. (this isn't a suggestion; i'm just talking abt the ontogeny of human values).