Posted to the AI Alignment Forum here.
Thanks to Evan Hubinger for suggesting this idea.
Training Trace Priors are priors over boolean circuits which examine the outputs of gates on samples from the training distribution, typically for purposes of steering models away from having components that were never tested during training. The one I like to think about is the One-Gate Trace Prior (OGT Prior), which penalizes circuits if there are gates with constant outputs during training.
Priors like this might be a way to make deception less likely. The intuition is that deception involves a decision of whether or not to defect, which at its simplest shows up as a gate that always outputs False during training but outputs True in deployment.
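As a rough illustration of the two paragraphs above, here is a minimal sketch of how an OGT-style penalty could be computed for a toy circuit. The gate encoding, the function names (`trace`, `constant_gates`, `ogt_penalty`) and the choice to count constant gates are all assumptions made for this example, not a definition taken from the post.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical encoding: a gate is (name, op, input wire names),
# and gates are listed in topological order.
Gate = Tuple[str, str, Tuple[str, ...]]

OPS: Dict[str, Callable[..., bool]] = {
    "AND": lambda a, b: a and b,
    "OR": lambda a, b: a or b,
    "XOR": lambda a, b: a != b,
    "NOT": lambda a: not a,
}

def trace(gates: List[Gate], inputs: Dict[str, bool]) -> Dict[str, bool]:
    """Evaluate the circuit on one sample and return every gate's output."""
    values = dict(inputs)
    for name, op, args in gates:
        values[name] = OPS[op](*(values[a] for a in args))
    return {name: values[name] for name, _, _ in gates}

def constant_gates(gates: List[Gate], samples: List[Dict[str, bool]]) -> List[str]:
    """Names of gates whose output never changed across the samples."""
    seen = {name: set() for name, _, _ in gates}
    for sample in samples:
        for name, value in trace(gates, sample).items():
            seen[name].add(value)
    return [name for name, _, _ in gates if len(seen[name]) == 1]

def ogt_penalty(gates: List[Gate], samples: List[Dict[str, bool]]) -> int:
    """Assumed penalty: one unit per gate that was constant in training."""
    return len(constant_gates(gates, samples))

# Toy circuit with a "defect" gate that only fires when `trigger` is True.
gates = [
    ("task", "XOR", ("x", "y")),
    ("defect", "AND", ("trigger", "x")),
    ("out", "OR", ("task", "defect")),
]

# In training the trigger is never set, so `defect` is always False.
training = [
    {"x": x, "y": y, "trigger": False}
    for x in (False, True)
    for y in (False, True)
]

flagged = constant_gates(gates, training)
penalty = ogt_penalty(gates, training)
deployment = trace(gates, {"x": True, "y": False, "trigger": True})
```

Here `flagged` contains only the `defect` gate, which is False on every training sample but True on the deployment input, which is the pattern the prior is meant to penalize.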
This post explores the relationship between the OGT Prior and Speed Priors, which I’ve operationalized with the circuit depth prior.
Speed Priors incentivize memorization. In particular, any result that is constant on the training distribution and contributes to the circuit’s depth should just become a constant input into the circuit.
The OGT Prior does something similar. Any output that is constant on the training distribution is treated as suspicious, and so should be pre-computed and turned into a constant input instead.
To see where these produce the same result, consider a physics simulator with N particles interacting pairwise subject to some conditional check (perhaps on the distance between particles). If only some pairwise interactions occur in the training set then both the Speed Prior and the OGT Prior will only compute the interactions between pairs actually encountered in training.
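To make the simulator example concrete, the sketch below records which particle pairs ever pass the distance check on a toy training set. It assumes one-dimensional positions and a fixed cutoff; the function name `interacting_pairs` and the data are hypothetical.

```python
from itertools import combinations
from typing import List, Set, Tuple

def interacting_pairs(
    training_configs: List[List[float]], cutoff: float
) -> Set[Tuple[int, int]]:
    """Pairs (i, j) whose distance check passed in at least one sample."""
    seen: Set[Tuple[int, int]] = set()
    for positions in training_configs:
        for i, j in combinations(range(len(positions)), 2):
            if abs(positions[i] - positions[j]) < cutoff:
                seen.add((i, j))
    return seen

# Three particles, two training configurations (illustrative numbers).
configs = [
    [0.0, 0.5, 10.0],
    [0.0, 5.0, 5.2],
]
seen = interacting_pairs(configs, cutoff=1.0)
all_pairs = set(combinations(range(3), 2))
never_seen = all_pairs - seen
```

For a pair in `never_seen`, the conditional check is constant on the training set, so everything downstream of it can be dropped by a circuit selected under either prior.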
To see where these differ, note that the Speed Prior has no problem with computing constants so long as these do not contribute to the overall circuit depth. For instance, a circuit of depth 10 can contain constant calculations of depth 9 that feed straight into the final output. The Speed Prior has no reason to eliminate these, whereas the OGT Prior will turn the results of these calculations into constant inputs.
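A smaller version of this example can be checked directly. The sketch below assumes a toy gate encoding and hypothetical helpers (`evaluate`, `depth`, `constant_gates`). It builds a circuit whose side chain is constant in training and off the critical path, then compares it with a pruned circuit that takes the side chain's result as a constant input `k`.

```python
from typing import Dict, List, Tuple

Gate = Tuple[str, str, Tuple[str, ...]]  # (name, op, inputs), topological order

def evaluate(gates: List[Gate], inputs: Dict[str, bool]) -> Dict[str, bool]:
    """Return the value of every wire (inputs and gates) on one sample."""
    values = dict(inputs)
    for name, op, args in gates:
        a = [values[w] for w in args]
        if op == "NOT":
            values[name] = not a[0]
        elif op == "AND":
            values[name] = a[0] and a[1]
        elif op == "OR":
            values[name] = a[0] or a[1]
        elif op == "XOR":
            values[name] = a[0] != a[1]
        else:
            raise ValueError(f"unknown op {op}")
    return values

def depth(gates: List[Gate], output: str) -> int:
    """Longest path from any input to `output`; input wires have depth 0."""
    d: Dict[str, int] = {}
    for name, _, args in gates:
        d[name] = 1 + max(d.get(a, 0) for a in args)
    return d[output]

def constant_gates(gates: List[Gate], samples: List[Dict[str, bool]]) -> List[str]:
    runs = [evaluate(gates, s) for s in samples]
    return [n for n, _, _ in gates if len({r[n] for r in runs}) == 1]

main_chain = [
    ("m1", "XOR", ("x", "y")),
    ("m2", "NOT", ("m1",)),
    ("m3", "XOR", ("m2", "x")),
    ("m4", "NOT", ("m3",)),
]
# Side chain of depth 3 that depends only on `c`, which is fixed in training.
side_chain = [
    ("s1", "NOT", ("c",)),
    ("s2", "NOT", ("s1",)),
    ("s3", "OR", ("s2", "c")),
]
full = main_chain + side_chain + [("out", "AND", ("m4", "s3"))]
pruned = main_chain + [("out", "AND", ("m4", "k"))]  # k: constant input

samples = [
    {"x": x, "y": y, "c": True, "k": True}
    for x in (False, True)
    for y in (False, True)
]

full_depth, pruned_depth = depth(full, "out"), depth(pruned, "out")
full_constant = constant_gates(full, samples)
pruned_constant = constant_gates(pruned, samples)
```

Both circuits have the same depth, so a depth-based Speed Prior is indifferent between them, while the constant-gate count separates them: the three side-chain gates are flagged in `full` and nothing is flagged in `pruned`.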
The Speed Prior incentivizes circuits to use lookup tables. Specifically, the Speed Prior produces circuits that memorize the results of expensive calculations in lookup tables containing all instances encountered in training.
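The memorization pattern can be sketched as follows. This is only an illustration: `parity` stands in for an arbitrary deep calculation, and `build_lookup_table` is a hypothetical helper, not something from the post.

```python
from typing import Callable, Dict, Iterable, Tuple

Bits = Tuple[bool, ...]

def parity(bits: Bits) -> bool:
    """Stand-in for a deep calculation: a chain of XORs, one per input bit."""
    acc = False
    for b in bits:
        acc = acc != b
    return acc

def build_lookup_table(
    fn: Callable[[Bits], bool], training_inputs: Iterable[Bits]
) -> Dict[Bits, bool]:
    """Tabulate `fn` on exactly the instances encountered in training."""
    return {x: fn(x) for x in training_inputs}

training_inputs = [
    (True, False, True),
    (True, False, False),
    (False, False, False),
]
table = build_lookup_table(parity, training_inputs)

def memorized(bits: Bits) -> bool:
    # A single lookup replaces the chain; inputs outside the table raise KeyError.
    return table[bits]
```

The table agrees with the deep calculation on every training instance but has no entry for anything else, which is the sense in which this is memorization rather than computation.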
The OGT Prior is perfectly fine with circuits being very deep, and so has no need to tabulate expensive calculations.
A problem with Speed Priors is that if the models they select have to search over inner models at runtime, the fastest way to do that may be to search over *simple* programs, which can lead to inner alignment failure because simple programs are likely to be deceptive.
Does the OGT Prior have the same problem? I’m not sure, but I suspect it goes the other way, and incentivizes searching over complicated programs. To see this, here’s a circuit architecture that I think the OGT Prior incentivizes which can perform search over models:
- A generator proposes a model prompted by some seed taken from the rest of the circuit.
- The proposed model is communicated along some wires to the interpreter, which runs the model.
- Models are run until they either naturally terminate or some fixed number of steps have elapsed.
- On termination the model is evaluated and either accepted or rejected.
- If the model is rejected, the search begins again.
- Once a model is accepted, the output from that model is used as a seed to randomize the wires on any unused parts of the interpreter and generator subcircuits.
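The steps above can be sketched as an ordinary loop. Everything here is a toy stand-in: the `propose`, `run` and `accept` callables, the countdown "models", and the treatment of a timeout as a rejection are assumptions made for illustration, and a real circuit would implement this in gates rather than Python.

```python
import random
from typing import Callable, List, Optional, Tuple

Model = Tuple[int, int]  # toy model: (start, step) for a countdown

def propose(rng: random.Random) -> Model:
    """Generator: propose a model from the seeded random source."""
    return (rng.randint(1, 8), rng.randint(0, 3))

def run(model: Model, max_steps: int) -> Optional[int]:
    """Interpreter: run until termination or until the step budget is spent."""
    start, step = model
    steps = 0
    while start > 0:
        if steps >= max_steps:
            return None  # budget exhausted (step == 0 never terminates)
        start -= step
        steps += 1
    return steps

def accept(output: int) -> bool:
    """Evaluator: toy acceptance criterion."""
    return output == 3

def search(
    seed: int,
    propose: Callable[[random.Random], Model],
    run: Callable[[Model, int], Optional[int]],
    accept: Callable[[int], bool],
    max_steps: int,
    max_tries: int,
    n_spare_wires: int,
) -> Optional[Tuple[Model, int, List[bool]]]:
    rng = random.Random(seed)
    for _ in range(max_tries):
        model = propose(rng)
        output = run(model, max_steps)
        if output is None or not accept(output):
            continue  # rejected: the search begins again
        # Accepted: use the model's output as a seed to randomize unused wires.
        wire_rng = random.Random(output)
        spare = [wire_rng.random() < 0.5 for _ in range(n_spare_wires)]
        return model, output, spare
    return None

result = search(0, propose, run, accept, max_steps=20, max_tries=1000, n_spare_wires=16)
```

Note that in this toy version every accepted model produces the same output, so the spare wires would get the same values on every run. Actually decorrelating them requires accepted outputs that vary across training samples.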
A challenge is that if the model is simple, not all wires between the generator and the interpreter will be used:
- We can’t set unused wires randomly because they specify the model and random garbage in the model specification is probably bad.
- If we set unused wires to constants then the OGT Prior penalizes our circuit because some of the unused wires have the same value in every run.
One way around this is to randomize which wires are unused based on some property of the environment, or to encode simple models in a more complicated way to ensure that all wires are used. Both are possible, but an easier way is to search only over models that are complex enough to use all the wires. That seems like the most natural solution and, given that complex models often perform better than simple ones, suggests that the OGT Prior will prefer searching over more complex models first.
I think this is a bit safer than searching over simple models, but the difference might be quite small. Complex models are able to encode complex values, and so at least can be aligned, but it’s not clear that aligned models are particularly likely to be encountered early in such a search.
Speed Priors and Training Trace Priors are actually pretty different in the kinds of models they produce. The main similarity I could find is that both prefer memorizing constant calculations. Beyond that, though, they differ in every case I’ve considered. Notably, Training Trace Priors seem to make models that then (in meta-learning) search for complex inner models, which could be helpful in avoiding inner optimizers finding deceptive models.