Give the AI safe tools

Jun 4, 2022 08:21 · AI Safety · AI research

(Cross-posted from LessWrong)

One kind of concern with AI is that:

  1. There are some tools that are instrumentally useful for an AI to have.
  2. Most versions of those tools, or at least the most accessible ones, are dangerous.
  3. The AI doesn’t care which versions are dangerous.
  4. Hence, the AI will probably develop dangerous tools for instrumental reasons.

You might call concerns like this Instrumental Danger Problems. This post aims to examine some existing approaches to Instrumental Danger Problems, and to introduce a new one, namely “Giving the AI safe tools”.

A few examples

Here are a few concrete examples of Instrumental Danger Problems:

An (incomplete) statement of the inner alignment problem:

  1. Mesa-optimizers are instrumentally useful for an AI to have.
  2. The easiest mesa-optimizers to find are misaligned, hence dangerous.
  3. The AI doesn’t care which mesa-optimizers are dangerous.
  4. Hence, the AI will probably develop dangerous mesa-optimizers for instrumental reasons.

One of the arguments in Are Minimal Circuits Deceptive?:

  1. It is instrumentally useful for an AI to search for programs.
  2. Many simple programs are dangerous because they cannot encode or point to our values.
  3. The AI doesn’t care which programs are dangerous.
  4. Hence, the AI will probably find and run dangerous programs for instrumental reasons.

Concerns about optimizing on human feedback:

  1. It is instrumentally useful for the AI to have a model of the human’s values.
  2. The most accessible models of human values are dangerous because they don’t generalize out of the training distribution.
  3. The AI doesn’t care which versions are dangerous, just which do well in training.
  4. Hence, the AI will probably develop a dangerous model of human values for instrumental reasons.

How can we avoid these concerns?

I think this is a useful lens because it suggests some approaches we can use to make AI safer.

(1) Make the tools useless

We’re only worried about tools that are instrumentally useful. What if we make the tools-of-interest useless? This seems hard without sacrificing competitiveness, but I think there are some cases where it’s possible.

For instance, some tools are only useful because of choices we made in specifying a goal or setting up the training environment. Training a narrow microscope-AI in a multiplayer environment artificially makes all sorts of tools instrumentally useful (e.g. theory-of-mind) that wouldn’t have been otherwise.

That’s a silly example, but it illustrates that we can sometimes remove potential failure modes by considering possible Instrumental Danger Problems in setting up the training environment.

(2) Bias the AI towards safe versions of tools

Another way we can make the AI safer is to bias it towards safe versions of tools.

For instance, we might be able to build priors that bias against deception (e.g. The Speed+Simplicity Prior is probably anti-deceptive), or use interpretability tools to detect deception and regularize against it (e.g. as discussed in Chris Olah’s views on AGI safety).

(3) Give the AI safe tools

A third approach is to give the AI access to a safe tool. This is in the style of biasing the AI towards safe versions of tools, and it’s sort of the complement to detecting bad tools. Rather than weeding out AIs that use dangerous tools, we directly give our AI safe tools.

If we can give the AI some tools, and they’re effective tools for the AI’s instrumental needs, it might use those tools instead of inventing its own. If we can identify tools that are dangerous for an AI to build itself, and hand it versions of those that we’ve built and verified to be safe, then we can eliminate some failure modes.

What might this look like in practice?

  • If we can build a tool that searches over non-deceptive simple programs, we could make that available to the AI so it doesn’t search on its own.
  • If we can build a known-safe optimizer (e.g. one that never suggests a dangerous action), we can hand that to the AI to avoid it developing mesa-optimizers.
  • If we can use interpretability tools to build known-safe language modules, vision modules, etc., we can hand those to the AI to reduce the “attack surface area” (the parts of the AI that could be dangerous).

Those are a lot of ifs! But I think these could be doable, and they’re definitely smaller tasks than “somehow stop the model from building dangerous tools for instrumental reasons”.


I’m excited about giving AIs safe tools because:

  • If we can devise safe tools, it gives a very direct approach to Instrumental Danger Problems.
  • I expect devising safe tools to be easier than indirectly steering models away from dangerous ones.

This approach also raises an immediate empirical question: How hard is it to get the AI to use a tool we provide? More concretely, if we give GPT access to a calculator during training, can we make it better at arithmetic? A world where it’s easy to get models to use the tools we provide is more hopeful than one where the model ignores our tools and builds its own.
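As a rough illustration of what "giving a model a calculator" could mean mechanically, here is a minimal Python sketch. It assumes a hypothetical convention in which the model writes `CALC(<expression>)` in its output and a harness fills in the result; the marker syntax and the names `safe_eval` and `fill_calculator_calls` are made up for this example and do not describe any real GPT interface.

```python
import ast
import operator
import re
from typing import Tuple

# Hypothetical marker syntax: the model writes CALC(<expression>) in its output.
# Nested parentheses are deliberately not supported, to keep the sketch small.
CALC_PATTERN = re.compile(r"CALC\(([^()]*)\)")

_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def safe_eval(expression: str):
    """Evaluate +, -, *, / over numeric literals and reject anything else."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant):
            value = node.value
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                return value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    return walk(ast.parse(expression.strip(), mode="eval"))


def fill_calculator_calls(model_output: str) -> Tuple[str, int]:
    """Replace each CALC(...) marker with its value; return (text, number of calls)."""
    calls = 0

    def replace(match):
        nonlocal calls
        calls += 1
        return str(safe_eval(match.group(1)))

    return CALC_PATTERN.sub(replace, model_output), calls
```

The returned call count is what makes the empirical question measurable: with the tool available, one can compare how often the model actually routes arithmetic through `CALC(...)` against how often it answers on its own.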

Failure Modes

At the same time, I know of at least three failure modes for this approach (thanks to Nicholas Schiefer and Evan Hubinger for pointing these out):

(1) The AI might be so much smarter than us that it can build more effective versions of tools than the ones we give it. Even though our tools are readily at hand, it might just skip straight to its own (dangerous but more performant) version of the tool.

This is something we might be able to watch for empirically. We can log tool-uses, and should become quite worried if, upon scaling an AI up, it stops using our tools. That should be cause for immediate concern and motivate interpretability work to figure out what the AI is doing instead.
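A minimal sketch of that kind of monitoring, in Python, might look like the following. Everything here is an assumption made for illustration: the `Episode` record, the use of a single integer as the "model size", and the `max_drop` threshold of 0.2 are all hypothetical choices, not a tested methodology.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Episode:
    model_size: int  # hypothetical scale measure, e.g. parameter count
    tool_calls: int  # logged number of calls to the provided tool in this episode


def tool_use_rate_by_scale(episodes):
    """Fraction of episodes at each model size that used the tool at least once."""
    used = defaultdict(int)
    total = defaultdict(int)
    for ep in episodes:
        total[ep.model_size] += 1
        if ep.tool_calls > 0:
            used[ep.model_size] += 1
    return {size: used[size] / total[size] for size in sorted(total)}


def scales_with_dropoff(rates, max_drop=0.2):
    """Sizes whose tool-use rate fell more than max_drop below the next smaller size."""
    sizes = sorted(rates)
    return [
        big
        for small, big in zip(sizes, sizes[1:])
        if rates[small] - rates[big] > max_drop
    ]
```

A flagged size is only a prompt to go look, not a verdict: a drop in tool use says the model is doing something else, and interpretability work would still be needed to find out what.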

Of course, an AI could use our tools spuriously to deceive us. This seems less likely than ordinary deception, though, because the part of the AI that would want to deceive us here is some inner tool it built (rather than the “outer” agent), and the inner tool would need to know an awful lot about its environment to immediately realize that its continued existence requires sending spurious-yet-valid queries to a very different part of the AI.

(2) Alternatively, the AI might struggle to use our tools. We might not be able to craft easy ways for it to structure inputs and process outputs. This one strikes me as less likely in the limit of powerful systems, but could definitely be a barrier to studying smaller tool-using AIs.

(3) The AI could use our tools in ways we don’t expect. For instance, if we give an AI an ineffective tool but heavily incentivize using it, it might use that ineffective tool as part of building a more effective (but dangerous) tool. This is the failure mode I’m most worried about, because it seems hardest to detect or plan for.
