Thomas Kwa

Was on Vivek Hebbar's team at MIRI, now working with Adrià Garriga-Alonso on various empirical alignment projects.

I'm looking for projects in interpretability, activation engineering, and control/oversight; DM me if you're interested in working with me.

Sequences

Catastrophic Regressional Goodhart

Comments

As recently as early 2023 Eliezer was very pessimistic about AI policy efforts amounting to anything, to the point that he thought anyone trying to do AI policy was hopelessly naive and should first try to ban biological gain-of-function research just to understand how hard policy is. Given how influential Eliezer is, he loses a lot of points here (and I guess Hendrycks wins?)

Then Eliezer updated and started e.g. giving podcast interviews. Policy orgs spun up and there are dozens of safety-concerned people working in AI policy. But this is not reflected on the LW frontpage. Is this inertia, or do we like thinking about computer science more than policy, or is it something else?

My prior is that solutions contain on the order of 1% active ingredients, and of the things on the Enovid ingredients list, citric acid and NaNO2 are probably the reagents that create NO [1], which happens at a 5.5:1 mass ratio. 0.11 ppm*hr as an integral over time already means the solution is only around 0.01% NO by mass [1], which is 0.055% reagents by mass, probably a bit more because yield is not 100%. This is a bit low but believable. If the concentration were really only 0.88 ppm and dissipated quickly, the solution would be extremely dilute, which seems unlikely. This is some evidence for the integral interpretation over the instantaneous 0.88 ppm interpretation -- not very strong evidence; I mostly believe the integral reading because it seems more logical and is also dimensionally correct. [2]

[1] https://chatgpt.com/share/e95fcaa3-4062-4805-80c3-7f1b18b12db2

[2] If you multiply 0.11 ppm*hr by 8 hours, you get 0.88 ppm*hr^2, which doesn't make sense.
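To make the two readings concrete, here is a minimal sketch; the release windows are illustrative assumptions I chose, not numbers from Enovid's label:

```python
# Integral reading: 0.11 ppm*hr is the area under the NO concentration-time curve,
# so the average concentration over a release window of T hours is 0.11 / T ppm.
dose_ppm_hr = 0.11

for window_hr in [20 / 3600, 0.25, 1.0, 8.0]:  # assumed release windows, for illustration
    avg_ppm = dose_ppm_hr / window_hr
    print(f"released over {window_hr:.4g} hr -> average concentration {avg_ppm:.3g} ppm")

# Instantaneous reading: 0.88 ppm would itself be a concentration, and the exposure
# integral would be 0.88 ppm times the duration in hours. Multiplying 0.11 ppm*hr by
# 8 hr instead gives 0.88 ppm*hr^2, which is not a meaningful unit.
```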

Also, why do you think that error is heavier tailed than utility?

Goodhart's Law is really common in the real world, and most things only work because we can observe our metrics, see when they stop correlating with what we care about, and iteratively improve them. Reward hacking is also prevalent in RL, with hacked policies often achieving very high reward values.

If the reward model is as smart as the policy and is continually updated with data, maybe we're in a different regime where errors are smaller than utility.

It could be anything, because KL divergence basically does not restrict the expected value of anything heavy-tailed. You could get finite utility and infinite error, or the reverse, or infinity of both, or neither converging, or even infinite utility and negative infinity error -- any of these with arbitrarily low KL divergence.

To draw any conclusions, you need to assume some joint distribution between the error and utility, and use some model of selection other than optimal policies under a KL divergence penalty or constraint. If they are independent and you think of optimization as conditioning on a minimum utility threshold, we proved last year that you get 0 of whichever has lighter tails and infinity of whichever has heavier tails, unless the tails are very similar. I think the same should hold if you model optimization as best-of-n selection. But the independence assumption is required and pretty unrealistic, and you can't weaken it in any obvious way.
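As a toy illustration of the independent case (my own stand-in distributions and thresholds, not the setup from the paper), model optimization as conditioning the proxy, i.e. utility plus error, on exceeding a threshold:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000_000

# Illustrative stand-ins: light-tailed true utility X and heavy-tailed error V, independent.
X = rng.normal(0.0, 1.0, n)   # utility, mean 0
V = rng.pareto(2.0, n)        # error with a polynomial tail (Lomax, shape 2)
proxy = X + V                 # the proxy reward we select on

for t in [2, 5, 10, 20, 50]:  # model optimization as conditioning on proxy >= t
    sel = proxy >= t
    print(f"t={t:>2}: P(sel)={sel.mean():.1e}  "
          f"E[X|sel]={X[sel].mean():+.3f}  E[V|sel]={V[sel].mean():.1f}")

# As t grows, E[X | proxy >= t] falls back toward 0 while E[V | proxy >= t] grows with t:
# past a point, the threshold is met almost entirely by the heavy-tailed error term.
```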

Realistically I expect that error will be heavy-tailed and heavier-tailed than utility by default so error goes to infinity. But error will not be independent of utility, so the expected utility depends mostly on how good extremely high error outcomes are. The prospect of AIs creating some random outcome that we overestimated the utility of by 10 trillion points does not seem especially good, so I think we should not be training AIs to maximize this kind of static heavy-tailed reward function.

If my interpretation is right, the relative dose from humming compared to NO nasal spray is >200 times lower than this post claims, so humming is unlikely to work.

I think 0.11 ppm*hrs means that the integral of the curve of NO concentration added by the nasal spray is 0.11 ppm*hr. This is consistent with the dose being 130µl of a dilute liquid. If the NO is produced and reacts away quickly, say within 20 seconds, this implies a concentration of about 19.8 ppm, not 0.88 ppm, which seems far in excess of what is possible through humming. The linked study (Weitzberg et al.) found nasal NO concentrations ranging between 0.08 and 1 ppm depending on subject, with the center (mean log concentration) being 0.252 ppm, not this post's estimate of 2-3 ppm.

If the effectiveness of NO depends on the integral of NO concentration over time, then one would have to hum for 0.436 hours to match one spray of Enovid, and it is unclear if it works like this. It could be that NO needs to reach some threshold concentration >1 ppm to have an antiseptic effect, or that the production of NO in the sinuses would drop off after a few minutes. On the other hand, it could be that 0.252 ppm is enough and the high concentrations delivered by Enovid are overkill. In that case humming would work, but so would a 100x lower dose of the nasal spray -- which someone should study, insofar as you still believe in humming.
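For reference, the arithmetic behind these two numbers under the integral assumption (the 20-second release window is an illustrative assumption on my part):

```python
# Back-of-the-envelope comparison of one Enovid spray vs. humming, under the
# integral interpretation of the stated exposure.
enovid_exposure_ppm_hr = 0.11   # stated exposure integral per spray
humming_ppm = 0.252             # central nasal NO concentration from Weitzberg et al.
release_window_hr = 20 / 3600   # assumed fast release/reaction window for the spray

spray_peak_ppm = enovid_exposure_ppm_hr / release_window_hr
hours_to_match = enovid_exposure_ppm_hr / humming_ppm

print(f"implied spray concentration: {spray_peak_ppm:.1f} ppm")      # ~19.8 ppm
print(f"humming needed to match one spray: {hours_to_match:.2f} hr "
      f"(~{hours_to_match * 60:.0f} min)")                           # ~0.44 hr, ~26 min
```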

This is a fair criticism. I changed "impossible" to "difficult".

My main concern is with future forms of RL that are some combination of better at optimization (thus making the model more inner aligned even in situations it never directly sees in training) and possibly opaque to humans such that we cannot just observe outliers in the reward distribution. It is not difficult to imagine that some future kind of internal reinforcement could have these properties; maybe the agent simulates various situations it could be in without stringing them together into a trajectory or something. This seems worth worrying about even though I do not have a particular sense that the field is going in this direction.

Much dumber ideas have turned into excellent papers

Is there an AI transcript/summary?

I started a dialogue with @Alex_Altair a few months ago about the tractability of certain agent foundations problems, especially the agent-like structure problem. I saw it as insufficiently well-defined to make progress on anytime soon. I thought the lack of similar results in easy settings, the fuzziness of the "agent"/"robustly optimizes" concept, and the difficulty of proving things about a program's internals given its behavior all pointed against working on this. But it turned out that we maybe didn't disagree much on tractability; it's just that Alex had somewhat different research taste, and also thought that fundamental problems in agent foundations must be figured out to make it to a good future, so working on fairly intractable problems can still be necessary. That disagreement seemed pretty out of scope for the dialogue, so I likely won't publish it.

Now that this post is out, I feel like I should at least make this known. I don't regret attempting the dialogue, I just wish we had something more interesting to disagree about.

The model ultimately predicts the token two positions after B_def. Do we know why it doesn't also predict the token two positions after B_doc? This isn't obvious from the diagram; maybe there is some way for the induction head or arg copying head to either behave differently at different positions, or to suppress the information from B_doc.
