Quantilizers and Generative Models

Jul 18, 2022 · AI Safety · AI research

(Cross-posted from the AI Alignment Forum)

Thanks to Evan Hubinger for discussions about quantilizers, and to James Lucassen for discussions about conditioned generative models. Many of these ideas are discussed in “Quantilizers: A Safer Alternative to Maximizers for Limited Optimization”; this post just expands on a particular thread of ideas in that paper. Throughout I’ll refer to sections of the paper. I have some remaining confusion about the “targeted impact” section, and would appreciate clarifications/corrections!

Abstract

This post explores the relationship between quantilizers and generative models. My main takeaways are:

A natural way to build a quantilizer is by sampling from an appropriately-conditioned generative model.
Unfortunately, quantilizing doesn’t seem to confer much advantage over the underlying generative model: to the extent that a quantilizer is more powerful than a generative model, it’s more dangerous, and vice versa.
Quantilizing is pretty computationally expensive relative to the advantage it brings, making it unclear if this is a competitive approach even if it conferred a net safety advantage at fixed power.

Definitions

I’ll follow the setup in “Quantilizers: A Safer Alternative to Maximizers for Limited Optimization” (Section 1, Definition 1) and say that a quantilizer is a model with:

A base distribution $\gamma$, which describes a set of outputs we think are “normal” or “unlikely to lead to doom”.
A quantile $q\in [0,1]$.
An ordering over outputs (e.g. a utility or loss function).

A quantilizer returns a random output from the top $q$ fraction of outputs from $\gamma$ according to the ordering. So for instance a 0-quantilizer is a utility maximizer, a 0.1-quantilizer returns outputs in the best 10%, etc.
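
To make the definition concrete, here is a minimal sketch for the special case where $\gamma$ is a finite distribution with explicitly known probabilities. That is an assumption made purely for illustration (we can't write the real $\gamma$ down), and the names `top_q_distribution` and `quantilize` are hypothetical, not from the paper.

```python
import random

def top_q_distribution(outputs, probs, utility, q):
    """Return [(output, weight)] covering the best q of the base probability mass.

    Assumes a finite base distribution given explicitly as outputs and probs.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    # Rank outputs from best to worst according to the ordering.
    ranked = sorted(zip(outputs, probs), key=lambda pair: utility(pair[0]), reverse=True)
    kept = []
    remaining = q  # base-probability mass still to be filled
    for out, p in ranked:
        if remaining <= 0:
            break
        take = min(p, remaining)  # the boundary output contributes only part of its mass
        kept.append((out, take))
        remaining -= take
    return kept

def quantilize(outputs, probs, utility, q, rng=random):
    """Draw one output from the base distribution restricted to its top-q mass."""
    kept = top_q_distribution(outputs, probs, utility, q)
    outs = [out for out, _ in kept]
    weights = [w for _, w in kept]
    return rng.choices(outs, weights=weights, k=1)[0]
```

With ten equally likely outputs scored by their own value, `quantilize(list(range(10)), [0.1] * 10, lambda x: x, 0.3)` only ever returns 7, 8 or 9, each with the same relative weight it had under the base distribution.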

Where does the base distribution come from?

One challenge with quantilizers is that they require a base distribution $\gamma$ that we trust. That is, we want to be able to sample from $\gamma$ many times and never hit bad outputs. Unfortunately, we don’t know how to explicitly write this distribution down.

A natural alternative to explicit construction is to train a generative model on the distribution of actions humans take (Section 3.1). Then for any given task $\gamma$ can be written as

$$ \gamma_{\rm task} = P(\mathrm{action}|\mathrm{good\ human\ attempts\ task}) $$

This seems quite doable. Train GPT-N on text written by humans, then prompt it with the task:

Prompt: You are a good person who wants to accomplish . What follows is your description of how you intend to do this.

The output will probably be text describing how a good person would accomplish that task!

Unfortunately we might get unlucky and sample a person who isn’t so good (but thinks of themselves as good). That still satisfies the prompt. Robustly getting the agent we want seems hard, but might be possible in the limit where we ask for agents that are generally plausible in the training distribution of the generative model.

Power vs Safety

Armed with our base distribution, we can start quantilizing. Naively we can do this by sampling $q^{-1}$ times from $\gamma_{\rm task}$, then returning the top output. This produces a tradeoff between capabilities and safety.
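
The naive procedure can be sketched in a few lines. Here `sample_base` is a hypothetical stand-in for one draw from the conditioned generative model and `utility` is whatever ordering we are using; both are assumptions for illustration. Note that returning the single best of $q^{-1}$ draws only approximates a $q$-quantilizer, since the maximum of $n$ draws is concentrated toward the very top rather than spread uniformly over the top $q$ fraction.

```python
import math
import random

def best_of_n_quantilizer(sample_base, utility, q, rng=None):
    """Approximate q-quantilizing by drawing about 1/q samples and keeping the best.

    sample_base(rng) is a hypothetical callable returning one draw from the
    base distribution; utility(sample) scores a draw (higher is better).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    rng = rng or random.Random()
    n = math.ceil(1 / q)  # number of base-model samples we pay for
    candidates = [sample_base(rng) for _ in range(n)]
    return max(candidates, key=utility)
```

At $q = 1$ this draws exactly one sample and returns it, which is the single-sampler case; at $q = 0.1$ it draws ten and returns the highest-scoring one.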

Single Sampler

In the limit $q\rightarrow 1$ we’re just returning actions a good human would take. That is, our quantilizer just samples once and returns that sample as output, so it’s no more powerful than a random good person.

Such a model is really unlikely to take over the world, but it’s also really unlikely to solve quantum chemistry problems. This is because both of these outcomes are unlikely according to the base distribution.

We can improve things a bit by conditioning on being an unusually capable person, e.g.

Prompt: You are a good and extremely capable person who wants to accomplish . What follows is your description of how you intend to do this.

Now we can get top-end human performance out, though at the cost that we have increased the odds of bad outcomes.

For instance, the model might decide to role-play a biologist who invents some extremely positive technology that happens to have unintended negative consequences (e.g. making engineered pandemics more likely). That’s not a problem unique to AI though, just a “giving humans more capabilities” problem.

Many Samples

In the limit $q\ll 1$ we’re sampling many times and returning the best action we find. Quantilizing like this makes actions more powerful, but at the cost of making them less likely according to the base distribution. This potentially makes them more dangerous.
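
One way to make this quantitative, as a sketch in my own notation ($p_q$ for the quantilizer’s output distribution and $c \ge 0$ for an arbitrary cost function, neither of which is defined above): an exact $q$-quantilizer has density $\gamma(a)/q$ on the top $q$ fraction of outputs and zero elsewhere, so it can upweight any single output by at most a factor of $q^{-1}$ relative to $\gamma$. Summing over outputs gives

$$ p_q(a) \le \frac{\gamma(a)}{q} \quad\Longrightarrow\quad \mathbb{E}_{a\sim p_q}[c(a)] \le \frac{1}{q}\,\mathbb{E}_{a\sim \gamma}[c(a)] $$

As I understand it, this is the style of guarantee the paper relies on. At $q = 1$ the bound says the quantilizer is no worse than the base distribution; as $q$ shrinks the factor $q^{-1}$ grows, so the bound says less and less.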

For instance, instead of role-playing an extremely capable biologist who is likely to be alive today, the model can now role-play even more capable biologists who are unlikely to exist. Or it can role-play that same biologist on their best day, having the best insight of their life. Or, even worse, it can role-play a deceptive AI that’s pretending to be an extremely capable (and good) human biologist.

We now have at least two dangers:

As in the single-sampler limit ($q\rightarrow 1$), there is danger in simulating more powerful human agents. This problem becomes worse as $q$ decreases, because we can sample increasingly powerful agents who are increasingly far from the distribution we trust (“actions good humans take”).
In addition, as we sample increasingly unlikely agents from the model we are more and more likely to encounter something non-human (e.g. a deceptive AI). This seems very bad, and I’m not sure how to avoid it.

Targeted Impact

The general tradeoff above is that as we make the quantilizer more powerful (e.g. reduce $q$, or condition on more capabilities) we make it more dangerous by pulling it away from the action-space we think of as generally safe.

Section 2.2 refers to a technique for quantilizing that avoids this tradeoff called Targeted Impact, but I have not been able to follow the argument. It seems to produce the same quantilizing algorithm (because the condition at the end of the section is exactly the same as the one in Section 2.1), and so I’m confused about where the improvement comes from.

I’m probably missing something, and would really appreciate hearing/seeing a more detailed explanation of this approach!

Summary

My understanding right now is:

A natural way to build a quantilizer is by sampling from an appropriately-conditioned generative model.
This approach probably succeeds at being ‘close to as safe as a human’, at the cost of not being much more powerful.
The more we try to make the model powerful, the weaker our safety guarantees get. This seems bad.
Loss of safety comes both from sampling increasingly unlikely actions and from sampling from increasingly powerful agents. The former means we lose formal guarantees on how bad actions can be relative to normal human actions; the latter means both that we sample from a riskier class of actions (“actions taken by very capable agents”) and that we are more likely to accidentally sample actions taken by misaligned/deceptive agents.
The number of bits of optimization power we get by quantilizing rather than directly sampling from the generative model is small ($\sim \log q^{-1}$) and comes at large cost ($\sim q^{-1}$), which makes this likely uncompetitive on capabilities relative to just using bigger models trained for longer.
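
To put rough numbers on that last point, here is a small sketch assuming quantilizing is done by naive best-of-$q^{-1}$ sampling, so that cost is counted in base-model samples. The helper name `quantilizer_tradeoff` is hypothetical, and the quantiles are chosen as powers of two only to keep the arithmetic clean.

```python
import math

def quantilizer_tradeoff(q):
    """Return (bits of selection pressure, base-model samples needed) at quantile q."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    bits = math.log2(1 / q)       # grows only logarithmically in 1/q
    samples = math.ceil(1 / q)    # grows linearly in 1/q
    return bits, samples

# Each extra bit of selection doubles the number of samples we have to draw.
for k in (0, 1, 4, 10, 20):
    bits, samples = quantilizer_tradeoff(2.0 ** -k)
    print(f"q = 2^-{k}: {bits:.0f} bits for {samples} samples")
```

Under these assumptions, twenty bits of selection already costs about a million draws from the generative model.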

Based on this, my general sense is that quantilizers are not a promising path to alignment, though I’d be very happy to hear otherwise.