• The last point is especially crucial in situations where such an agent starts recursively improving itself (e.g., by training new models)
@niplav It seems like GPT-4-based AutoGPT is just too weak an optimizer to confidently extrapolate bounds from? Though it admittedly should be SOME evidence that a thing that can pass the bar exam is nevertheless basically hopeless when tasked to act as an agent.
The last points in particular might be ameliorated by literally just appending "and don't optimize too hard" and "let yourself be shut down by a human" to the prompt? Something like the sketch below.
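For concreteness, a minimal sketch of what "literally just appending" would mean (all names here are hypothetical; this assumes an AutoGPT-style setup where the agent's system prompt is a plain string, which is not how any particular framework necessarily exposes it):

```python
# Hypothetical sketch: bolt soft safety clauses onto an agent's system prompt.
# None of these identifiers come from AutoGPT itself.

SAFETY_CLAUSES = [
    "and don't optimize too hard",
    "let yourself be shut down by a human",
]


def with_safety_clauses(system_prompt: str) -> str:
    """Return the system prompt with the safety clauses appended."""
    return system_prompt.rstrip(".") + ", " + ", ".join(SAFETY_CLAUSES) + "."


base_prompt = "You are an autonomous agent. Pursue the user's goal."
print(with_safety_clauses(base_prompt))
# -> You are an autonomous agent. Pursue the user's goal, and don't optimize
#    too hard, let yourself be shut down by a human.
```

Of course, this only works to the extent that the underlying model actually follows such clauses, rather than treating them as one more soft constraint to trade off.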
Man, I feel confused, but assuming that language models aren't infested with inner optimizers, I'm now more hopeful?
Or am I missing something crucial here…