If I were a Shard Theory person, I'd say that constitutional AI is a next step toward training AIs in a way similar to how humans are trained: reinforcement learning from interacting with other agents, starting from a simple set of values

@niplav I dunno what shard theory is, but I agree with this notion.

@Paradox They claim that human learning is a lot like current AI training: a lot of self-supervised pre-training + some fine-tuning + a little bit of RL (and, in this view, multi-agent RL on top of that)
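A minimal sketch of that staged pipeline, just to make the stages concrete. All names here are hypothetical and nothing is actually trained; each stage only records what shaped the policy:

```python
# Hypothetical sketch of the staged pipeline described above:
# self-supervised pre-training -> fine-tuning -> a little RL
# -> multi-agent RL layered on top. No real learning happens.

from dataclasses import dataclass, field


@dataclass
class Policy:
    """Stand-in for model weights; stages just log what shaped them."""
    history: list = field(default_factory=list)


def pretrain(policy: Policy, corpus: list[str]) -> Policy:
    # Self-supervised: e.g. next-token prediction over a large corpus.
    policy.history.append(f"pretrained on {len(corpus)} documents")
    return policy


def finetune(policy: Policy, demos: list[str]) -> Policy:
    # Supervised fine-tuning on curated demonstrations.
    policy.history.append(f"fine-tuned on {len(demos)} demonstrations")
    return policy


def rl_step(policy: Policy, reward_signal: str) -> Policy:
    # A little RL against some reward (human feedback, a constitution, ...).
    policy.history.append(f"RL against '{reward_signal}'")
    return policy


def multi_agent_rl(policies: list[Policy], rounds: int) -> list[Policy]:
    # Agents interact with each other; each interaction is a further
    # source of updates, on top of the single-agent stages.
    for _ in range(rounds):
        for p in policies:
            p.history.append("updated from interaction with other agents")
    return policies


if __name__ == "__main__":
    agent = pretrain(Policy(), corpus=["..."] * 1000)
    agent = finetune(agent, demos=["..."] * 50)
    agent = rl_step(agent, reward_signal="simple starting values")
    society = multi_agent_rl([agent, Policy()], rounds=1)
    print(*society[0].history, sep="\n")
```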

@niplav @Paradox on my view, there are "base drives" and "verbal values". the former are selected for producing effective behaviour, and the latter are selected for producing effective words. (this somewhat tracks the near/far-mode distinction in human behaviour.)

and since humans have the ability to do hypocrisy (aka the value-action gap, rationalisation, the memetic immune system), our verbal values can evolve independently of what makes behaviour effective. this is crucial, and (i think) extremely lucky, because no brain could possibly evolve cosmopolitan values if it had to actually implement them in its behaviour.

"effective altruism" is the v rare mutation where a brain starts to break down its own rationalisation/hypocrisy-barriers, and instead of then becoming consistently selfish, it generalises the other way, such that verbal values start to influence actual behaviour. humans can do this bc we are v prone to overgeneralising our learned proxies.

@rime love this explanation! It explains some tension: if some parts generalize toward altruism and others toward selfishness, you have to find the equilibrium between them


@rime I wouldn't go as far as Ngo and say that all alignment risk comes from here, but it seems like a rather large source
