Thinking out loud about what still doesn't work with giving AutoGPT agents instructions like "do X but respect human preferences while doing so":

• Inner optimizers are still a problem if they exist in the GPT models
• Do LLM agents have sufficient goal stability? I.e., when delegating (and delegating further), does the original goal get perturbed or even lost? (See the sketch after this list.)
• Limited to the models' understanding of "human values"
• Doesn't solve ambitious value learning, model might generalise badly once in new domains

• Last point especially crucial in situations where such an agent starts recursively improving itself (e.g. training new models)
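To make the goal-stability worry concrete, here's a rough sketch of how one might watch a goal string drift across successive delegations. The `llm` callable is a stand-in for whatever model the agent wraps, and `rephrase_for_subtask` / `delegate_chain` are made-up names for illustration, not AutoGPT internals.

```python
def rephrase_for_subtask(goal: str, llm) -> str:
    """Ask the model to restate the goal for a sub-agent (hypothetical helper)."""
    return llm(f"Restate this goal concisely for a sub-agent: {goal}")


def delegate_chain(goal: str, llm, depth: int = 5) -> list[str]:
    """Record how the goal text mutates across `depth` successive delegations."""
    history = [goal]
    for _ in range(depth):
        goal = rephrase_for_subtask(goal, llm)
        history.append(goal)
    # Inspect whether clauses like "respect human preferences" survive to the end.
    return history
```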

The last two points especially might be ameliorated by literally just appending "and don't optimize too hard" and "let yourself be shut down by a human" to the prompt?
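As a sketch, that amounts to something like the following; `SAFETY_CLAUSES` and `with_safety_clauses` are invented names for illustration, not any real AutoGPT setting.

```python
# Illustrative only: bolt the suggested clauses onto the agent's goal prompt.
SAFETY_CLAUSES = (
    "Respect human preferences while doing so.",
    "Don't optimize too hard.",
    "Let yourself be shut down by a human at any time.",
)


def with_safety_clauses(base_goal: str) -> str:
    """Append the safety clauses to the goal before handing it to the agent."""
    return base_goal.rstrip(". ") + ". " + " ".join(SAFETY_CLAUSES)


print(with_safety_clauses("Summarise today's alignment forum posts."))
# Summarise today's alignment forum posts. Respect human preferences while
# doing so. Don't optimize too hard. Let yourself be shut down by a human
# at any time.
```

Whether the model actually honours such clauses under optimization pressure is exactly the open question here.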

Man, I feel confused, but assuming that language models aren't infested with inner optimizers, now I'm more hopeful?

Or am I missing something crucial here…

@niplav It seems like GPT-4-based AutoGPT is just too weak an optimizer to confidently extrapolate bounds? Though it admittedly should be SOME evidence that a thing that can pass the bar exam is nevertheless basically hopeless when tasked to act as an agent.

@niplav On the one hand, I'd be really happy if recursive LLMs could reach very high intelligence before anything else, because the capabilities are built out of parts (text-based communication) we can inspect. AI research turns into network epistemology.

On the other hand, if GPT-5 is accessible via an API that doesn't control for this, some idiots are going to try their hardest to destroy the world with it (cf. ChaosGPT).

@rime
Always Less Dignified™

Network epistemology — sounds really nice :-) Perhaps that's what CoEms are getting at
