Multigate Priors

Jun 15, 2022 15:33 · 800 words · 4 minute read AI Safety AI research

Posted to LessWrong here.

(Thanks to Evan Hubinger and Nicholas Schiefer for suggestions and discussions around these ideas)

Multi-Gate Traces

We can improve on one-gate traces with multi-gate traces! Suppose we have $N$ training samples. Then we can reasonably estimate joint probability distributions over up to $k=\lfloor\log_2 N\rfloor$ gates, giving the prior:

$$ p\propto \exp\left(-\sum_{i_1\ldots i_k}\sum_{j\in \{0,1\}^k}p_{i_1\ldots i_k}(j)^m\right) $$

That is, we sum over all selections of $k$ gates and, for each selection, measure how concentrated their joint distribution is over its $2^k$ states. Once more we promote models with high entropy and punish those with low entropy, which favors models that explore as much of their state space as possible during training, and disfavors those which have hidden states never touched in training.
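To make the sum concrete, here is a minimal sketch of the unnormalized log-prior, assuming each training sample has been recorded as a tuple of binary gate values. The function name `multigate_log_prior`, the trace format, and the default `m=2` are illustrative assumptions, not something fixed by the definition above.

```python
from collections import Counter
from itertools import combinations

def multigate_log_prior(traces, k, m=2):
    """Unnormalized log-prior: minus the sum over k-gate subsets of sum_j p(j)^m.

    traces: list of equal-length tuples of 0/1 gate values, one per training
    sample (a hypothetical trace format chosen for this sketch).
    """
    n_samples = len(traces)
    n_gates = len(traces[0])
    total = 0.0
    for subset in combinations(range(n_gates), k):
        # Empirical joint distribution of this gate subset over the training set.
        counts = Counter(tuple(t[i] for i in subset) for t in traces)
        # Joint states that never occur contribute zero to the sum.
        total += sum((c / n_samples) ** m for c in counts.values())
    return -total

# Two gates that visit all four joint states, versus two gates stuck in one state.
explored = multigate_log_prior([(0, 0), (0, 1), (1, 0), (1, 1)], k=2)  # -0.25
stuck = multigate_log_prior([(0, 0)] * 4, k=2)                         # -1.0
```

The model whose gates visit every joint state gets the higher log-prior, which is the behavior we want from the prior.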

Now it’s still possible for a model to hide the machinery to defect here, but it gets harder and harder as $N$ and $k$ get large. In this limit the only way for that machinery to hide is to be heavily obfuscated.

Objection: Non-Competitive

The Multi-Gate Trace prior over $k$ gates scales very badly with $k$. In a system with $G$ gates there are $\binom{G}{k}$ different selections of $k$ gates, and since $\binom{G}{k}\approx G^k/k!$ for $k \ll G$, the cost of evaluating the prior grows roughly exponentially in $k$.

In practice this means that $k$ needs to be very small, and it’s possible that deception could hide in subsets of the circuit just a bit larger than the largest ones we can check.
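As a rough illustration of how fast this grows, the sketch below counts the terms in the double sum: one per selection of $k$ gates and per joint state of that selection. The helper name `prior_eval_cost` and the gate count of 1000 are made up for the example.

```python
from math import comb

def prior_eval_cost(n_gates, k):
    """Number of (gate subset, joint state) terms in the multi-gate sum."""
    return comb(n_gates, k) * 2 ** k

# Term counts for a hypothetical 1000-gate circuit as k grows.
for k in range(1, 6):
    print(k, prior_eval_cost(1000, k))
```

This counts terms only; actually estimating each joint distribution also costs a pass over the training traces.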

Objection: Penalizes Redundancy

It is often useful to have redundant machinery in a program, e.g. for readability or ease of modification. Redundancy, though, guarantees that some states will never arise. For example, if the same calculation happens in two disjoint subcircuits of the model, we will never see those two subcircuits in different states from each other.

The Multi-Gate Trace prior disfavors models with that sort of redundancy, because their apparent state space is larger than the set of states they can actually reach. This could cause the Multi-Gate Trace prior to favor hard-to-interpret models, and could result in a performance penalty by making it harder for the training process to modify the model.
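A tiny example shows the effect for a single pair of gates. It is only a sketch: the helper `joint_power_sum`, the four-sample traces, and the choice `m=2` are assumptions for illustration.

```python
from collections import Counter

def joint_power_sum(pairs, m=2):
    """sum_j p(j)^m for the empirical joint distribution of one gate pair."""
    n = len(pairs)
    counts = Counter(pairs)
    return sum((c / n) ** m for c in counts.values())

# Two independent gates reach all four joint states.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
# A gate and its redundant copy only ever reach (0, 0) and (1, 1).
redundant = [(0, 0), (1, 1), (0, 0), (1, 1)]

penalty_independent = joint_power_sum(independent)  # 0.25
penalty_redundant = joint_power_sum(redundant)      # 0.5
```

The redundant pair contributes a larger term to the exponent, so the model is penalized even though nothing is being hidden.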

Next Try: Embedding Priors

The Multi-Gate Trace prior is probably not competitive because the state space of the model traces is enormous. One way to remedy this is to embed the state of the model in a lower-dimensional space and favor models that explore more of that space during training.

That is, suppose we learn an embedding of the space of state vectors $v=\{g_1,\ldots,g_G\}$ into a continuous space of fixed lower dimension $d$. Given such an embedding, we can readily look for parts of the embedding space that haven’t been explored in training. For instance we can discretize the embedding space into $d$-cubes and compute

$$ p \propto \exp\left(-\sum_{\mathrm{cube}} p(\mathrm{cube})^m\right) $$

where $p(\mathrm{cube})$ is the fraction of training samples whose embedded state falls in that cube.
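Here is a minimal sketch of that computation, assuming the embedded states are already available as lists of floats. The function name `embedding_log_prior`, the `cube_size` parameter, and the default `m=2` are illustrative choices.

```python
from collections import Counter
from math import floor

def embedding_log_prior(points, cube_size, m=2):
    """Unnormalized log-prior: minus the sum over occupied cubes of p(cube)^m.

    points: embedded states, one d-dimensional sequence per training sample.
    cube_size: side length of the d-cubes used to discretize the space.
    """
    n = len(points)
    # Each point is assigned to the cube given by its integer grid coordinates.
    cells = Counter(tuple(floor(x / cube_size) for x in p) for p in points)
    # Empty cubes contribute zero, so only occupied cubes are visited.
    return -sum((c / n) ** m for c in cells.values())

spread = embedding_log_prior(
    [(0.1, 0.1), (0.6, 0.1), (0.1, 0.6), (0.6, 0.6)], cube_size=0.5)   # -0.25
clustered = embedding_log_prior(
    [(0.1, 0.1), (0.2, 0.1), (0.1, 0.2), (0.2, 0.2)], cube_size=0.5)   # -1.0
```

Because only occupied cubes are stored, the cost of this sketch is bounded by the number of training samples rather than by the total number of cubes.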

This scales much better than the Multi-Gate Trace prior. In particular, for fixed embedding dimension the runtime cost is just a polynomial in the resolution of the discretization and proportional to the embedding cost (which can be chosen to be e.g. linear in the number of gates).

There are a few ways to learn an embedding:

Try to preserve distance between state vectors in the embedding. That is, two states $v$ and $v'$ with distance $|v-v'|$ should map onto embedded vectors $x$ and $x'$ such that $|v-v'|\approx |x-x'|$.
Try to preserve distance in the inputs/outputs. So for instance if the distance between inputs $i$ and $i'$ is $|i-i'|$ then embed the resulting states $v,v'$ to produce $x,x'$ such that $|i-i'|\approx |x-x'|$. Or alternately do this for the outputs of the model.
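As a stand-in for the first option, the sketch below uses a fixed Gaussian random projection rather than a learned embedding. This is a swapped-in technique: it only approximately preserves distances between state vectors, and how well it does so depends on $d$. Its cost is linear in the number of gates for fixed $d$. The names `make_random_projection` and `distance` are hypothetical.

```python
import random
from math import sqrt

def make_random_projection(n_gates, d, seed=0):
    """Return a function embedding n_gates-dimensional states into d dimensions.

    Assumption: a fixed random linear map stands in for a learned embedding.
    """
    rng = random.Random(seed)
    scale = 1.0 / sqrt(d)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(n_gates)]
              for _ in range(d)]

    def embed(state):
        # One dot product per embedding dimension: O(n_gates * d) work.
        return [sum(w * g for w, g in zip(row, state)) for row in matrix]

    return embed

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

embed = make_random_projection(n_gates=8, d=2)
v = [0, 1, 0, 1, 1, 0, 0, 1]
v_prime = [0, 1, 1, 1, 0, 0, 0, 1]
original = distance(v, v_prime)
embedded = distance(embed(v), embed(v_prime))
```

Comparing `original` with `embedded` over many pairs of states gives a quick check of how much distortion a given $d$ introduces before the embedding is used in the prior.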

This approach is also more forgiving of redundancy in the model. The embedding can ignore redundant parts of the model state and so an embedding prior need not penalize redundancy.

Incidentally, the embedding prior shares some similarities with the ELK setup. You can view the embedding prior as an additional head attached to the original model. Rather than trying to learn a translator, we’re trying to learn a mapping between internal state and the embedding space, which we use to understand how much of the space of possible behaviors we’ve explored.

Objection: Blind Spots

Embedding priors can have blind spots. An embedding might neglect a piece of the model’s state space, not because it’s redundant but because it was never encountered in training.

Ways Forward

This approach of developing priors based on execution traces in training is very closely related to the field of fuzzing. What we’re looking for, in effect, is a way to ensure good program coverage during training. My hope is that fuzzing tools and knowledge can help here.

For instance, the issue of blind spots in coverage metrics must arise in fuzzing too. I just don’t know how it gets solved there.
