It matters when the first sharp left turn happens

Sep 29, 2022 14:27 · AI Safety · AI Capabilities · Deceptive Alignment

Thanks to Evan Hubinger for comments on these ideas. Posted on the AI Alignment Forum.

Introduction

A “sharp left turn” is a point where capabilities generalize beyond alignment. In a sharp left turn, an AI’s capabilities far outstrip its alignment, and so it starts to exploit flaws in that alignment.

This can look like Goodharting, where strong optimization pressure causes outer alignment failure because the base goal isn’t identical to what we want. Or it can look like deceptive alignment, where the model is aligned to proxy goals that aren’t identical to the base goal, but we don’t notice until the model is capable enough to make the proxies fail.

However they happen, sharp left turns are bad.

An important question, though, is: when will the first sharp left turn happen?

Timing

Consider three scenarios:

Weak: The sharp left turn happens before models are able to cause existential risks. For instance, maybe GPT-4 has a sharp left turn but isn’t able to reliably execute on its plans.
Strong: The sharp left turn happens once capabilities are far beyond human level and dangerous if misused. In this world, AIs are doing the bulk of the world’s scientific research by the time alignment unravels.
Human-level: The sharp left turn happens while capabilities are around human level, meaning models are dangerous but not yet game-changingly helpful for research.

I think that the most dangerous of these is the human-level world.

In the “weak” scenario, we get to see the sharp left turn coming. We potentially get lots of experience with models looking aligned and then that alignment failing with increasing scale. In this world, we effectively get multiple shots at alignment because the first badly-misaligned models just aren’t that dangerous. We get to experiment with deceptively aligned models and get a grounded empirical understanding of why models become deceptively aligned and how we might stop that. In this world, alignment research gets a chance to become a paradigmatic scientific field before the models get too capable.

In the “strong” scenario, we get to do alignment research with super-human models that are still aligned. In this world, we don’t have to solve the technical challenge of alignment entirely on our own. Alignment research potentially gets much further along before we have to have an airtight solution and prevent sharp left turns.

Both of these seem like pretty hopeful worlds. By contrast, the world where sharp left turns happen near human-level capabilities seems pretty dangerous. In that world, we don’t get to study things like deceptive alignment empirically. We don’t get to leverage powerful AI researcher tools. We have to solve the problem entirely on our own, and we have to do it without any useful empirical feedback. This seems really bad.

Scales

The timing of sharp left turns depends on a few capability scales, some of which I conflated above:

The danger scale, above which the model is capable enough to pose an existential risk.
The research scale, above which the model is able to do scientific research better than humans.
The left-turn scale, above which the model is able to undergo a sharp left turn. A useful anchor here is given by deceptive alignment, which suggests the scale at which a model can successfully reason about its own training process.

The three scenarios I outlined are characterized by different orderings of these scales:

Weak: (left-turn < danger)
Strong: (research < left-turn)
Human-level: (left-turn ~ research, left-turn ~ danger)

There can be other orderings, leading to other outcomes. For instance, we could live in a world with (danger < left-turn < research). In that world, the left turn happens at a scale that’s dangerous but incapable of super-human research. Imagine a model that’s very capable of developing and executing plans using information freely available on the internet. Such a model could pose an existential risk by engineering pathogens that we already know how to make, without doing any novel R&D, and deploying them with competent but not super-human logistics.
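The orderings above can be written down as a small classifier. This is only an illustrative sketch, not a model of anything real: it assumes the three scales can be placed on a single shared capability axis (an assumption the next section questions), it reads “~” as “within a tolerance `tol`”, and the names `Scales`, `classify`, and the label `"dangerous-but-not-superhuman"` are hypothetical ones introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scales:
    """Hypothetical positions of the three thresholds on one capability axis."""
    danger: float     # above this, a model can pose an existential risk
    research: float   # above this, a model out-researches humans
    left_turn: float  # above this, a model can undergo a sharp left turn


def classify(s: Scales, tol: float = 0.1) -> str:
    """Map an ordering of the thresholds to a scenario name.

    `tol` is an assumed tolerance for treating two scales as roughly equal.
    Checks run in order, so the first matching scenario wins.
    """
    def near(a: float, b: float) -> bool:
        return abs(a - b) <= tol

    if near(s.left_turn, s.research) and near(s.left_turn, s.danger):
        return "human-level"  # left-turn ~ research, left-turn ~ danger
    if s.left_turn < s.danger:
        return "weak"         # left-turn < danger
    if s.research < s.left_turn:
        return "strong"       # research < left-turn
    if s.danger < s.left_turn < s.research:
        return "dangerous-but-not-superhuman"  # danger < left-turn < research
    return "other"


if __name__ == "__main__":
    print(classify(Scales(danger=2.0, research=3.0, left_turn=1.0)))
    print(classify(Scales(danger=1.0, research=2.0, left_turn=3.0)))
    print(classify(Scales(danger=1.0, research=1.0, left_turn=1.0)))
    print(classify(Scales(danger=1.0, research=3.0, left_turn=2.0)))
```

Note that the “weak” and “strong” conditions are not mutually exclusive (for example, research < left-turn < danger satisfies both); the sketch resolves this by returning the first match, which is a design choice rather than anything implied by the argument.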

Capabilities are (sort of) multi-dimensional

Of course, capabilities aren’t strictly one-dimensional, so talking about “the scale of capabilities” conflates a few different ideas. Concretely, I think a model with fixed research ability could be either more or less dangerous, depending on other dimensions of its capabilities like “competence interacting with people” and “robust plan design”. These skills vary significantly in humans, and it seems plausible that we can engineer models to be more or less capable along these different dimensions.

That’s not to say that there’s complete freedom here. There’s definitely correlation between capabilities (e.g. lots of capabilities improve with scale). And in the limit of extreme research ability a model should be able to learn whatever other capabilities it needs. Intelligence is closely related to general purpose optimization ability, and to the extent that this holds we should expect different capabilities to scale together.

But we might live in worlds where it only takes models twice as capable as humans to solve the technical challenges of alignment, and those models may not be reflectively stable and may have lots of flawed abilities. In those worlds there’s a lot of benefit to thinking about which capabilities we want to weaken in models, and to trying to arrange that.

Outlook

If we could know when to expect a sharp left turn, that might change what we do. If we knew that a sharp left turn would happen long before dangerous capabilities, we might focus on developing alignment benchmarks and more empirical study of alignment failures. If we knew that it would happen long after powerful research capabilities, we might focus on applying those capabilities towards alignment work.

It seems plausible that we can learn more about this timing question by studying low-likelihood outputs, studying how one-dimensional or multi-dimensional capabilities are, and reasoning about which kinds of capabilities are dangerous and which kinds are helpful.

Along similar lines, even if we don’t know when to expect a sharp left turn, there may be things we can do to shift that timing in a favorable direction. For instance, we can put weak models in environments that encourage alignment failure so as to study left turns empirically. And we can preferentially focus alignment efforts on whatever the state of the art is at any time (currently large language models), rather than on very different architectures, so as to delay left turns in superhuman models.