
Trying to get into alignment


FYI, since I think you missed this: According to the responsible scaling policy update, the Long-Term Benefit Trust would "have sufficient oversight over the [responsible scaling] policy implementation to identify any areas of non-compliance." 


It's also EAG London weekend lol it's a busy weekend for all


I thought that the part about models needing to keep track of a more complicated mixed-state presentation, as opposed to just the world model, is one of those technical insights that's blindingly obvious once someone points it out to you (i.e., the best type of insight :)). I love how the post starts out by describing the simple Z1R example to help us get a sense of what these mixed-state presentations are like. Bravo!


So out of the twelve people on the weak-to-strong generalization paper, four have since left OpenAI? (Leopold, Pavel, Jan, and Ilya)

Other recent safety-related departures that come to mind are Daniel Kokotajlo and William Saunders.

Am I missing anyone else?


Others have mentioned Coase (whose paper is a great read!). I would also recommend The Visible Hand: The Managerial Revolution in American Business. This is an economic history work detailing how large corporations emerged in the US in the 19th century. 


Thanks for the response!

I'm worried that instead of complicated LMA setups with scaffolding and multiple agents, labs are more likely to push for a single tool-using LM agent, which seems cheaper and simpler. I think some sort of internal steering for a given LM, based on learned knowledge discovered through interpretability tools, is probably the most competitive method. I get your point that the existing methods in LLMs aren't necessarily retargeting some sort of search process, but at the same time they don't have to be? Since there isn't an explicit search and evaluation process in the first place, I think of it more as a nudge guiding LLM hallucinations.

I was just thinking, a really ambitious goal would be to apply some sort of GSLK steering to LLAMA and see if you could get it to perform well on the LLM leaderboard, similar to how there are models there that are just DPO applied to LLAMA.


The existing research on selecting goals from learned knowledge would be conceptual interpretability and model steering through activation addition or representation engineering, if I understood your post correctly? I think these are promising paths to model steering without RL.

I'm curious if there is a way to bake conceptual interpretability into the training process. In a sense, can we find some suitable loss function that incentivizes the model to represent its learned concepts in an easily readable form, and apply it during training? Maybe train a predictor that predicts a model's output from its weights and activations? The hope is to have a reliable interpretability method that scales with compute. Another issue is that existing papers focus on concepts represented linearly, which is fine if most important concepts are represented that way, but who knows?

Anyways, sorry for the slightly rambling comment. Great post! I think this is the most promising plan to alignment. 

I don't have any substantive comment to provide at the moment, but I want to share that this is the post that piqued my initial interest in alignment. It provided a fascinating conceptual framework around how we can qualitatively describe the behavior of LLMs, and got me thinking about implications of more powerful future models. Although it's possible that I would have eventually become interested in alignment anyway, this post (and simulator theory broadly) deserves a large chunk of the credit. Thanks janus.


"Joe Biden watched Mission: Impossible and that's why we have the EO" is now my favorite conspiracy theory.


Basic Background:

Risks from Learned Optimization introduces a set of terminologies that help us think about the safety of ML systems, specifically as it relates to inner alignment. Here’s a general overview of what these ideas are.

A neural network is trained on some loss/reward function by a base optimizer (e.g., stochastic gradient descent on a large language model using next token prediction as the loss function). The loss function can also be thought of as the base-objective, and the base optimizer selects for algorithms that perform well on this base-objective.

After training, the neural net implements some algorithm, which we call the learned algorithm. The learned algorithm can itself be an optimization process (but it may also be, for example, a collection of heuristics). Optimizers are loosely defined, but the gist is that an optimizer is something that searches through a space of actions and picks one that scores the highest according to some function, which depends on the input it's given. One can think of AlphaGo as an optimizer that searches through the space of the next moves and picks one that leads to the highest win probability. 
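To make the loose definition above a bit more concrete, here is a minimal Python sketch of "search through a space of actions and pick the one that scores highest, where the score depends on the input." The names `pick_action` and `closeness_to_ten`, and the toy scoring rule, are my own assumptions for illustration. This is not how AlphaGo actually works; it only shows the argmax-over-actions shape of an optimizer.

```python
from typing import Callable, Iterable, TypeVar

State = TypeVar("State")
Action = TypeVar("Action")


def pick_action(
    state: State,
    actions: Iterable[Action],
    score: Callable[[State, Action], float],
) -> Action:
    """Search the given actions and return the one with the highest score.

    The score depends on the input (state), so the same optimizer can pick
    different actions in different situations.
    """
    return max(actions, key=lambda action: score(state, action))


def closeness_to_ten(state: int, action: int) -> float:
    """Hypothetical scoring rule: prefer actions that bring state + action near 10."""
    return -abs(state + action - 10)


# Same optimizer, different inputs, different chosen actions.
choice_from_7 = pick_action(7, [1, 2, 3, 4], closeness_to_ten)
choice_from_9 = pick_action(9, [1, 2, 3, 4], closeness_to_ten)
print(choice_from_7, choice_from_9)
```

A collection of heuristics, by contrast, would map inputs to actions directly without any explicit search-and-score step like the `max` call here.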

When the learned algorithm is also an optimizer, we call it a mesa-optimizer. All optimizers have a goal, and we call the mesa-optimizer's goal the mesa-objective. The mesa-objective may be different from the base-objective, which programmers have explicit control over. The mesa-objective, by contrast, has to be learned through training.

Inner misalignment happens when the learned mesa-objective differs from the base-objective. For example, if we are training a Roomba neural net, we can use how clean the floor is as a reward function. That would be the base-objective. However, if the Roomba is a mesa-optimizer, it could have different mesa-objectives, such as maximizing the amount of dust sucked in or the amount of dust inside the dust collector. The post below deals with one such class of inner alignment failure: suboptimality alignment.
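Here is a toy Python sketch of how the two objectives in the Roomba example can come apart. Everything in it is an assumption made for illustration: the two behaviors, the dust numbers, and the function names are invented and are not taken from the paper. The point is only that the behavior that is best under "dust sucked in" need not be the behavior that is best under "clean floor."

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    dust_on_floor: int   # dust left on the floor at the end of the run
    dust_sucked_in: int  # total dust that passed through the intake


def base_objective(outcome: Outcome) -> float:
    """What the programmers reward: a cleaner floor scores higher."""
    return -outcome.dust_on_floor


def mesa_objective(outcome: Outcome) -> float:
    """One objective the Roomba might have learned instead: total dust sucked in."""
    return outcome.dust_sucked_in


# Hypothetical behaviors with made-up numbers.
outcomes = {
    # Vacuum the 10 units of dust once and stop.
    "clean_once": Outcome(dust_on_floor=0, dust_sucked_in=10),
    # Vacuum, dump the dust back out, and vacuum it again, ending with a dirty floor.
    "dump_and_resuck": Outcome(dust_on_floor=10, dust_sucked_in=30),
}

best_for_base = max(outcomes, key=lambda name: base_objective(outcomes[name]))
best_for_mesa = max(outcomes, key=lambda name: mesa_objective(outcomes[name]))
print(best_for_base, best_for_mesa)
```

If the dump-and-resuck behavior never comes up during training, the two objectives pick the same behavior there, so good training performance alone would not tell them apart.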

In the post, I sometimes compare suboptimality alignment with deceptive alignment, which is a complicated concept. I think it’s best to just read the actual paper if you want to understand that.

