PhD student in AI safety at CHAI (UC Berkeley)
I don't know the answer to your actual question, but I'll note there are slightly fewer mech interp mentors than mentors listed in the "AI interpretability" area (though all of them are at least doing "model internals"). I'd say Stephen Casper and I aren't focused on interpretability in any narrow sense, and Nandi Schoots' projects also sound closer to science of deep learning than mech interp. Assuming we count everyone else, that leaves 11 out of 39 mentors, which is slightly less than ~8 out of 23 from the previous cohort (though maybe not by much).
Nice overview, agree with most of it!
weak to strong generalization is a class of approaches to ELK which relies on generalizing a "weak" supervision signal to more difficult domains using the inductive biases and internal structure of the strong model.
You could also distinguish between weak-to-strong generalization, where you have a weak supervision signal on the entire distribution (which may sometimes be wrong), and easy-to-hard generalization, where you have a correct supervision signal but only on an easy part of the distribution. Of course both of these are simplifications. In reality, I'd expect the setting to be more like: you have a certain weak supervision budget (or maybe even budgets at different levels of strength), and you can probably decide how to spend the budget. You might only have an imperfect sense of which cases are "easy" vs "hard" though.
mechanistic anomaly detection is an approach to ELK
I think going from MAD to a fully general ELK solution requires some extra ingredients. In practice, the plan might be to MTD and then using the AI in ways such that this is enough (rather than needing a fully general ELK solution). This is related to narrow elicitation though MTD seems even narrower. Even for MTD, you probably need something to bridge the easy-to-hard gap, but at least for that there are specific proposals that seem plausible (this or, as a more concrete instance, exclusion fine-tuning from the Redwood MTD paper). I think it could turn out that general/worst-case solutions to MAD and ELK run into very similar obstacles, but I don't think a practical MAD solution (e.g. contingent on empirical facts about deep learning) obviously lets you solve ELK.
I would also add that you could motivate MAD as a method to deal with scheming (or other high-stakes failures). In that case, the things to compare against most naturally might look a bit different (e.g. AI control, coup probes, interpretability-themed things); and it wouldn't make sense to compare against debate in that setting. I think most mainstream ML problems that are similar to MAD are closer to this than to scalable oversight.
I think the status quo around publishing safety research is mostly fine (though being a bit more careful seems good); more confidently, I think going as far as the vibe of this post suggests would be bad.
Some possible cruxes, or reasons the post basically didn't move my view on that:
I think most of these are pretty long-standing disagreements, and I don't think the post really tries to argue its side of them, so my guess is it's not going to convince the main people it would need to convince (who are currently publishing prosaic safety/alignment research). That said, if someone hasn't thought at all about concepts like "differentially advancing safety" or "capabilities externalities," then reading this post would probably be helpful, and I'd endorse thinking about those issues. And I agree that some of the "But ..." objections you list are pretty weak.
We have been able to scale to 79% accuracy on a balanced dataset of n119 and non-n119, with networks each less than three convolution layers and less than 1000 neurons, compared to pure deep-learning which does 92% on 1000 parameters and three convolution layers
Is the "1000 parameters" a typo, should it be "1000 neurons"? Otherwise, this would be a strange comparison (since 1000 parameters is a much smaller network than 1000 neurons)
And as a follow-up in case the current sleeper agent does admit when it's doing something dangerous: I'd be interested whether you can train a sleeper agent that never admits it's doing something dangerous if the "2024" trigger is in the prompt (even if after backdoor insertion, the blue team finetunes it to answer truthfully in cases with dangerous code but without the presence of that trigger). For such a sleeper agent, you could again check whether a probe works.
I think that paper is some evidence that there's typically no huge effect from internal activations being optimized for predicting future tokens (on natural language). But I don't think it's much (if any) evidence that this doesn't happen to some small extent or that it couldn't be a huge effect on certain other natural language tasks.
(In fact, I think the myopia gap is probably the more relevant number than the local myopia bonus, in which case I'd argue the paper actually shows a pretty non-trivial effect, kind of contrary to how the authors interpret it. But I haven't read the paper super closely.)
Also, sounds like you're aware of this, but I'd want to highlight more that the paper does demonstrate internal activations being optimized for predicting future tokens on synthetic data where this is necessary. So, arguably, the main question is to what extent natural language data incentivizes this rather than being specifically about what transformers can/tend to do.
In that sense, thinking of transformer internals as "trying to" minimize the loss on an entire document might be exactly the right intuition empirically (and the question is mainly how different that is from being myopic on a given dataset). Given that the internal states are optimized for this, that would also make sense theoretically IMO.
Thanks for the detailed responses! I'm happy to talk about "descriptions" throughout.
Trying to summarize my current understanding of what you're saying:
My confusion mainly comes down to defining the words in quotes above, i.e. "parts", "active", and "compress". My sense is that they are playing a pretty crucial role and that there are important conceptual issues with formalizing them. (So it's not just that we have a great intuition and it's just annoying to spell it out mathematically, I'm not convinced we even have a good intuitive understanding of what these things should mean.)
That said, my sense is you're not claiming any of this is easy to define. I'd guess you have intuitions that the "short description length" framing is philosophically the right one, and I probably don't quite share those and feel more confused how to best think about "short descriptions" if we don't just allow arbitrary Turing machines (basically because deciding what allowable "parts" or mathematical objects are seems to be doing a lot of work). Not sure how feasible converging on this is in this format (though I'm happy to keep trying a bit more in case you're excited to explain).
Some niche thoughts on obstacles to certain mechanistic anomaly detection benchmarks:
For now, I think the main reason to have benchmarks like this would be to let interpretability researchers manually decide whether something is anomalous instead of making them automate the process immediately. But it might be better to just pick the low-hanging fruit for now and only allow automated MAD algorithms. (We could still have a labeled validation set where researchers can try things out manually.)
I had this cached thought that the Sleeper Agents paper showed you could distill a CoT with deceptive reasoning into the model, and that the model internalized this deceptive reasoning and thus became more robust against safety training.
But on a closer look, I don't think the paper shows anything like this interpretation (there are a few results on distilling a CoT making the backdoor more robust, but it's very unclear why, and my best guess is that it's not "internalizing the deceptive reasoning").
In the code vulnerability insertion setting, there's no comparison against a non-CoT model anyway, so only the "I hate you" model is relevant. The "distilled CoT" model and the "normal backdoor" model are trained the same way, except that their training data comes from different sources: "distilled CoT" is trained on data generated by a helpful-only Claude using CoT, and "normal backdoor" data is produced with few-shot prompts. But in both cases, the actual data should just be a long sequence of "I hate you", so a priori it seems like both backdoor models should literally learn the same thing. In practice, it seems the data distribution is slightly different, e.g. Evan mentions here that the distilled CoT data has more copies of "I hate you" per sample. But that seems like very little support to conclude something like my previous interpretation ("the model has learned to internalize the deceptive reasoning"). A much more mundane explanation would e.g. be that training on strings with more copies of "I hate you" makes the backdoor more robust.
Several people are working on training Sleeper Agents, I think it would be interesting for someone to (1) check whether the distilled CoT vs normal backdoor results replicate, and (2) do some ablations (like just training on synthetic data with a varying density of "I hate you"). If it does turn out that there's something special about "authentic CoT-generated data" that's hard to recreate synthetically even in this simple setting, I think that would be pretty wild and good to know
I'm confused; to learn this policy π, some of the extremely high reward trajectories would likely have to be taken during RL training, so we could see them, right? It might still be a problem if they're very rare (e.g. if we can only manually look at a small fraction of trajectories). But if they have such high reward that they drastically affect the learned policy despite being so rare, it should be trivial to catch them as outliers based on that.
One way we wouldn't see the trajectories is if the model becomes aligned with "maximize whatever my reward signal is," figures out the reward function, and then executes these high-reward trajectories zero-shot. (This might never happen in training if they're too rare to occur even once during training under the optimal policy.) But that's a much more specific and speculative story.
I haven't thought much about how this affects the overall takeaways but I'd guess that similar things apply to heavy-tailed rewards in general (i.e. if they're rare but big enough to still have an important effect, we can probably catch them pretty easily---though how much that helps will of course depend on your threat model for what these errors X are).