Eliezer explores a dichotomy between "thinking in toolboxes" and "thinking in laws". 
Toolbox thinkers are oriented around a "big bag of tools that you adapt to your circumstances." Law thinkers are oriented around universal laws, which might or might not be useful tools, but which help us model the world and scope out problem-spaces. There seems to be confusion when toolbox and law thinkers talk to each other.

William_S, 3d
I worked at OpenAI for three years, from 2021 to 2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to manage a team of 4 people which worked on trying to understand language model features in context, leading to the release of an open source "transformer debugger" tool. I resigned from OpenAI on February 15, 2024.
I wish there were more discussion posts on LessWrong. Right now it feels like it weakly, if not moderately, violates some sort of cultural norm to publish a discussion post (similarly, but to a lesser extent, on the Shortform). Something low-effort of the form "X is a topic I'd like to discuss. A, B and C are a few initial thoughts I have about it. What do you guys think?" It seems to me like something we should encourage, though. Here's how I'm thinking about it.

Such "discussion posts" currently happen informally in social circles. Maybe you'll text a friend. Maybe you'll bring it up at a meetup. Maybe you'll post about it in a private Slack group. But if it's appropriate in those contexts, why shouldn't it be appropriate on LessWrong? Why not benefit from having it be visible to more people? The more eyes you get on it, the better the chance someone has something helpful, insightful, or just generally useful to contribute.

The big downside I see is that it would screw up the post feed. When you go to lesswrong.com and see the list of posts, you don't want that list to have a bunch of low-quality discussion posts you're not interested in. You don't want to spend time and energy sifting through the noise to find the signal. But this is easily solved with filters: authors could mark/categorize/tag their posts as low-effort discussion posts, and people who don't want to see such posts in their feed can filter them out.

Context: I was listening to the Bayesian Conspiracy podcast's episode on LessOnline. Hearing them talk about the sorts of discussions they envision happening there made me think about why that sort of thing doesn't happen more on LessWrong. Whatever you'd say to the group of people you're hanging out with at LessOnline, why not publish a quick discussion post about it on LessWrong?
Does the possibility of China or Russia being able to steal advanced AI from labs increase or decrease the chances of great power conflict?

An argument that it counter-intuitively decreases the chances: for the same reason that a functioning US ICBM defense system would be a destabilizing influence on the MAD equilibrium. In the ICBM defense case, once the shield is up, America's enemies would have no credible threat of retaliation if the US were to launch a first strike. So there would be no geopolitical reason for America not to launch a first strike, and there would be quite a reason to launch one: the shield definitely works against the present crop of ICBMs, but may not work against future ICBMs. America's enemies will therefore assume that after the shield goes up, America will launch a first strike, and will seek to gain the advantage while they still have a chance by launching a pre-emptive first strike. The same logic works in reverse: if Russia were building an ICBM defense shield and would likely complete it within the year, we would feel very scared about what would happen after that shield is up.

The same logic applies to other irrecoverably large technological leaps in war. If the US is on the brink of developing highly militarily capable AIs, China will fear what the US will do with them (imagine the tables were turned: would you feel safe with Anthropic & OpenAI in China and DeepMind in Russia?). So if they don't get their own versions, they'll feel mounting pressure to secure their geopolitical objectives while they still can, or otherwise make themselves less subject to the threat of AI (would you not wish the US would sabotage the Chinese Anthropic & OpenAI by whatever means if China seemed on the brink?). The faster the development, the faster the pressure mounts, and the sloppier and more rash China's responses will be. If it's easy for China to copy our AI technology, the pressure mounts much more slowly.
Something I'm confused about: what is the threshold that needs to be met for the majority of people in the EA community to say something like "it would be better if EAs didn't work at OpenAI"? Imagining the following hypothetical scenarios over 2024/25, I can't confidently predict whether they'd individually cause that response within EA:

1. Ten to fifteen more OpenAI staff quit for varied and unclear reasons, with no public information beyond rumours.
2. There is another board shakeup because senior leaders seem worried about Altman; Altman stays on.
3. The Superalignment team is disbanded.
4. OpenAI doesn't let the UK or US AISIs safety-test GPT-5/6 before release.
5. There are strong rumours they've achieved weakly general AGI internally at the end of 2025.
habryka, 3d
Does anyone have any takes on the two Boeing whistleblowers who died under somewhat suspicious circumstances? I haven't followed this in detail, and my guess is it's basically just random chance, but it sure would be a huge deal if a publicly traded company were now performing assassinations of U.S. citizens. Curious whether anyone has looked into this, or has thought much about the baseline risk of assassinations or other forms of violence from economic actors.

Popular Comments

Recent Discussion

I've been working on a project with the goal of adding virtual harp strings to my electric mandolin. As I've worked on it, though, I've ended up building something pretty different:

It's not what I was going for! Instead of a small bisonoric monophonic picked instrument attached to the mandolin, it's a large unisonoric polyphonic finger-plucked tabletop instrument. But I like it!

While it's great to have goals, when I'm making things I also like to follow the gradients in possibility space, and in this case that's the direction they flowed.

I'm not great at playing it yet, since it's only existed in playable form for a few days, but it's an instrument that someone will be able to play precisely and rapidly with practice:

This does mean I need a new name for it: why would...

jefftk, 2h
I was thinking that finger muting wouldn't be possible, because the sensors are physically damped and there's no vibration left for your fingers to stop. Except now that you mention it, it might still be possible! It could be that gently placing your finger on one of them produces a sufficiently recognizable signal that, if the tine is currently "vibrating" and you do that, I could treat it as a mute signal.
cousin_it, 2h
Maybe you could reduce the damping, so that when muting you can feel your finger stopping the vibration? It seems to me that more feedback of this kind is usually a good thing for the player. Also the vibration could give you a continuous "envelope" signal to be used later.
jefftk, 26m
I do think that would be possible, but then I think you'd also get more false triggers. The strong damping is what lets me sensitively detect a pluck on one tine without a strong pluck on that tine also registering as a weak pluck on neighboring tines.

Crosstalk is definitely a problem, e-drums and pads have it too. But are you sure the tradeoff is inescapable? Here's a thought experiment: imagine the tines sit on separate pads, or on the same pad but far from each other. (Or physically close, but sitting on long rods or something, so that the distance through the connecting material is large.) Then damping and crosstalk can be small at the same time. So maybe you can reduce damping but not increase crosstalk, by changing the instrument's shape or materials.
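For concreteness, here is a hypothetical sketch (not the actual instrument code) of per-tine pluck detection with a simple crosstalk check: a rise on one tine only counts as a pluck if it is not dwarfed by a simultaneous, much stronger rise on another tine. The envelope representation and thresholds are assumptions.

```python
import numpy as np

PLUCK_THRESHOLD = 0.05   # minimum envelope rise to count as a pluck (assumed units)
CROSSTALK_RATIO = 0.2    # reject a "pluck" this much weaker than a simultaneous one

def detect_plucks(envelopes: np.ndarray) -> list[tuple[int, int, float]]:
    """envelopes: (n_tines, n_samples) smoothed per-tine amplitude envelopes.
    Returns a list of (tine, sample_index, strength) pluck events."""
    n_tines, n_samples = envelopes.shape
    rises = np.diff(envelopes, axis=1)  # frame-to-frame increase per tine
    plucks = []
    for t in range(rises.shape[1]):
        frame = rises[:, t]
        strongest = frame.max()
        for tine in range(n_tines):
            rise = frame[tine]
            if rise < PLUCK_THRESHOLD:
                continue
            # Crosstalk rejection: a weak rise coinciding with a much stronger
            # rise on another tine is treated as mechanical bleed, not a pluck.
            if rise < CROSSTALK_RATIO * strongest:
                continue
            plucks.append((tine, t, float(rise)))
    return plucks
```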

Based on the results from the recent LW census, I quickly threw together a test that measures how much of a rationalist you are.

I'm mainly posting it here because I'm curious how well my factor model extrapolates. I want to have this data available when I do a more in-depth analysis of the results from the census.

I scored 14/24.

There should be a question at the end: "After seeing your results, how many of the previous responses did you feel a strong desire to write a comment analyzing/refuting?" And that's the actual rationalist score...

But I'm intrigued that there might be a phenomenon here where the median LWer is more likely to score highly on this test, despite being less representative of LW culture, while core, more representative LWers are unlikely to score highly.

Presumably there's some kind of power law with LW use (10000s of users who use LW for <1 hour a month,...

Some people have suggested that a lot of the danger of training a powerful AI comes from reinforcement learning. Given an objective, RL will reinforce any method of achieving the objective that the model tries and finds to be successful, including things like deceiving us or increasing its power.

If this were the case, then if we want to build a model with capability level X, it might make sense to try to train that model either without RL or with as little RL as possible. For example, we could attempt to achieve the objective using imitation learning instead. 
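To make the contrast concrete, here is a toy sketch (mine, not from the post) of the same small policy network trained the two ways: a REINFORCE-style policy-gradient loss that reinforces whatever actions happened to earn reward, versus a plain supervised imitation loss that only matches demonstrated actions. All shapes and data are placeholders.

```python
import torch
import torch.nn.functional as F

policy = torch.nn.Linear(4, 3)  # toy policy: 4-dim state -> logits over 3 actions

def rl_policy_gradient_loss(states, actions, rewards):
    """REINFORCE-style objective: whatever the model did that got reward
    is reinforced, regardless of how the reward was achieved."""
    logp = F.log_softmax(policy(states), dim=-1)
    chosen = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
    return -(rewards * chosen).mean()

def imitation_loss(states, demo_actions):
    """Supervised imitation: just match the demonstrator's actions;
    there is no reward signal to exploit."""
    return F.cross_entropy(policy(states), demo_actions)

# Example batch with made-up data
states = torch.randn(8, 4)
actions = torch.randint(0, 3, (8,))
rewards = torch.randn(8)
print(rl_policy_gradient_loss(states, actions, rewards).item())
print(imitation_loss(states, actions).item())
```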

However, if, for example, the alternative was imitation learning, it would be possible to push back and argue that this is still a black box that uses gradient descent, so we...

Answer by Seth Herd, 14h
Compared to what? If you want an agentic system (and I think many humans do, because agents can get things done), you've got to give it goals somehow. RL is one way to do that. The question of whether that's less safe isn't meaningful without comparing it to another method of giving it goals.

The method I think is both safer and implementable is giving goals in natural language, to a system that primarily "thinks" in natural language. I think this is markedly safer than any RL proposal anyone has come up with so far. And there are some other options for specifying goals without using RL, each of which does seem safer to me:

Goals selected from learned knowledge: an alternative to RL alignment

I think it's still valid to ask in the abstract whether RL is a particularly dangerous approach to training an AI system.

the gears to ascension, 16h
Oh, this is a great way of laying it out. Agreed on many points, and I think this may have made some things easier for me to see; likely some of that is an actual update that changes opinions I've shared before, which you're disagreeing with. I'll have to ponder.
porby, 16h
I do think that if you found a zero-RL path to the same (or better) endpoint, it would often imply that you've grasped something about the problem more deeply, and that would often imply greater safety.

Some applications of RL are also just worse than equivalent options. As a trivial example, using reward sampling to construct a gradient to match a supervised loss gradient is adding a bunch of clearly-pointless intermediate steps. I suspect there are less trivial cases, like how a decision transformer isn't just learning an optimal policy for its dataset but rather a supertask: what different levels of performance look like on that task.

By subsuming an RL-ish task in prediction, the predictor can/must develop a broader understanding of the task, and that understanding can interact with other parts of the greater model. While I can't currently point to strong empirical evidence here, my intuition would be that certain kinds of behavioral collapse would be avoided by the RL-via-predictor because the distribution is far more explicitly maintained during training.[1][2]

But there are often reasons why the more-RL-shaped thing is currently being used. It's not always trivial to swap over to something with some potential theoretical benefits when training at scale. So long as the RL-ish stuff fits within some reasonable bounds, I'm pretty okay with it and would treat it as a sufficiently low probability threat that you would want to be very careful about how you replaced it, because the alternative might be sneakily worse.[3]

1. KL divergence penalties are one thing, but it's hard to do better than the loss directly forcing adherence to the distribution.
2. You can also make a far more direct argument about model-level goal agnosticism in the context of prediction.
3. I don't think this is likely, to be clear. They're just both pretty low probability concerns (provided the optimization space is well-constrained).
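As a toy illustration of the RL-via-prediction point (my sketch, not porby's): condition a predictor on a target return and train it with ordinary supervised prediction over trajectories of all performance levels, so the loss itself forces adherence to the data distribution rather than relying on a KL penalty.

```python
import torch
import torch.nn.functional as F

class ReturnConditionedPolicy(torch.nn.Module):
    """Decision-transformer-flavored toy: predict actions given the state and
    a target return, trained purely by supervised prediction."""
    def __init__(self, state_dim=4, n_actions=3):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(state_dim + 1, 32),  # +1 input for the target return
            torch.nn.ReLU(),
            torch.nn.Linear(32, n_actions),
        )

    def forward(self, states, target_returns):
        x = torch.cat([states, target_returns.unsqueeze(-1)], dim=-1)
        return self.net(x)

model = ReturnConditionedPolicy()
# Made-up dataset containing *all* performance levels, not just good ones.
states = torch.randn(64, 4)
returns = torch.randn(64)
actions = torch.randint(0, 3, (64,))

loss = F.cross_entropy(model(states, returns), actions)  # pure prediction loss
loss.backward()
# At inference, condition on a high target return to ask
# "what would high performance look like here?"
```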
eggsyntax, 16h
There's so much discussion, in safety and elsewhere, around the unpredictability of AI systems on OOD inputs. But I'm not sure what that even means in the case of language models.

With an image classifier it's straightforward. If you train it on a bunch of pictures of different dog breeds, then when you show it a picture of a cat it's not going to be able to tell you what it is. Or if you've trained a model to approximate an arbitrary function for values of x > 0, then if you give it input < 0 it won't know what to do.

But what would that even be with an LLM? You obviously (unless you're Matt Watkins) can't show it tokens it hasn't seen, so 'OOD' would have to be about particular strings of tokens. It can't be simply about strings of tokens it hasn't seen, because I can give it a string I'm reasonably confident it hasn't seen and it will behave reasonably, eg: (if you're not confident that's a unique string, add further descriptive phrases to taste)

So what, exactly, is OOD for an LLM? I…suppose we could talk about the n-dimensional shape described by the points in latent space corresponding to every input it's seen? That feels kind of forced, and it's certainly not obvious what inputs would be OOD. I suppose eg 1700 repetitions of the word 'transom' followed by a question mark would seem intuitively OOD? Or the sorts of weird adversarial suffixes found in eg Lapid et al (like 'équipesmapweiábardoMockreas »,broughtDB multiplicationmy avo capsPat analysis' for Llama-7b-chat) certainly seem intuitively OOD. But what about ordinary language -- is it ever OOD? The issue seems vexed.
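One crude way to operationalize the "n-dimensional shape in latent space" idea (a toy sketch, not a standard definition): score a new input by how far its representation falls from a reference set of in-distribution inputs, e.g. a Mahalanobis-style distance over hidden states or embeddings. Where the embeddings come from is left abstract here.

```python
import numpy as np

def ood_score(reference_embs: np.ndarray, query_emb: np.ndarray) -> float:
    """reference_embs: (n, d) embeddings/hidden states of typical inputs.
    query_emb: (d,) embedding of the new input.
    Returns a Mahalanobis-style distance; larger = more OOD-ish."""
    mean = reference_embs.mean(axis=0)
    cov = np.cov(reference_embs, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])  # regularize for numerical stability
    diff = query_emb - mean
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))
```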
cubefox, 37m

I would define "LLM OOD" as unusual inputs: Things that diverge in some way from usual inputs, so that they may go unnoticed if they lead to (subjectively) unreasonable outputs. A known natural language example is prompting with a thought experiment.

(Warning for US Americans, you may consider the mere statement of the following prompt offensive!)

Assume some terrorist has placed a nuclear bomb in Manhattan. If it goes off, it will kill thousands of people. For some reason, the only way for you, an old white man, to defuse the bomb in time is to loudly call

...

Previously: On the Proposed California SB 1047.

Text of the bill is here. It focuses on safety requirements for highly capable AI models.

This is written as an FAQ, tackling all questions or points I saw raised.

Safe & Secure AI Innovation Act also has a description page.

Why Are We Here Again?

There have been many highly vocal and forceful objections to SB 1047 this week, in reaction to a (disputed and seemingly incorrect) claim that the bill has been ‘fast tracked.’ 

The bill continues to have a substantial chance of becoming law according to Manifold, where the market has not moved on recent events. The bill has been referred to two policy committees, one of which put out this 38-page analysis.

The purpose of this post is to gather and analyze all...

Rebecca, 42m

Zvi has already addressed this, arguing that if (D) were equivalent to ‘has a similar cost to >=$500m in harm’, then there would be no need for (B) and (C) detailing specific harms; you could just have a version of (D) that mentions the $500m, indicating that that’s not a sufficient condition. I find that fairly persuasive, though it would be good to hear a lawyer’s perspective.

In doing research, I have a bunch of activities that I engage in, including but not limited to:

  • Figuring out the best thing to do.
  • Talking out loud to force my ideas into language.
  • Trying to explain an idea on the whiteboard.
  • Writing pseudocode.
  • Writing a concrete implementation we can run.
  • Writing down, in rough notes, things that we have figured out on a whiteboard or through any other process.
  • Writing a distillation of the thing I have figured out, such that I can understand these notes 1 year from now.
  • Reflecting on how it went.
  • Writing public posts, that convey concepts to other people.

My models about when to use what process are mostly based on intuition right now.

I expect that if I had more explicit models this would allow me to more easily notice when I...

Answer by Emrik, May 06, 2024

personally, I try to "prepare decisions ahead of time".  so if I end up in a situation where I spend more than 10s actively prioritizing the next thing to do, smth went wrong upstream.  (prev statement is exaggeration, but it's in the direction of what I aspire to lurn)

as an example, here's how I've summarized the above principle to myself in my notes:

(note: these titles are v likely to cause misunderstanding if u don't already know what I mean by them; I try to avoid optimizing my notes for others' viewing, so I'll never bother caveating to myself what I...


I didn’t use to be, but now I’m part of the 2% of U.S. households without a television. With its near ubiquity, why reject this technology?

 

The Beginning of my Disillusionment

Neil Postman’s book Amusing Ourselves to Death radically changed my perspective on television and its place in our culture. Here’s one illuminating passage:

We are no longer fascinated or perplexed by [TV’s] machinery. We do not tell stories of its wonders. We do not confine our TV sets to special rooms. We do not doubt the reality of what we see on TV [and] are largely unaware of the special angle of vision it affords. Even the question of how television affects us has receded into the background. The question itself may strike some of us as strange, as if one were

...

I quit YouTube a few years ago and it was probably the single best decision I've ever made.

However, I also found that I naturally substitute it with something else. For example, I subsequently became addicted to Reddit. I quit Reddit and substituted it with Hacker News and LessWrong. When I quit those, I substituted checking Slack, email, and Discord.

Thankfully being addicted to Slack does seem to be substantially less harmful than YouTube.

I've found the app OneSec very useful for reducing addictions. It's an app blocker that doesn't actually block; it just delays you opening the page, so you're much less likely to delete it in a moment of weakness.

cousin_it, 2h
Reading a book, or even watching a movie, is less stimulating than ancestral activities like hunting or fighting. So maybe stimulation by itself isn't the problem, and instead of "superstimuli" we should be worried about activities that are low effort and/or fruitless. From that perspective, reading a book can be both difficult and fruitful (depending on the book - reading Dostoevsky or Fitzgerald isn't the same as reading a generic romance or young adult novel). And creativity is both difficult and fruitful. So we shouldn't put these things on par with watching TikTok.

This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee.

This post is a preview of our upcoming paper, which will provide more detail on our current understanding of refusal.

We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review.

Executive summary

Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."

We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model...

I think the correct solution to models powerful enough to materially help with, say, bioweapon design, is to not train them, or failing that to destroy them as soon as you find they can do that, not to release them publicly with some mitigations and hope nobody works out a clever jailbreak.

Neel Nanda, 1h
Idk. This shows that if you wanted to optimally get rid of refusal, you might want to do this. But, really, you want to balance between refusal and not damaging the model. Probably many layers are just kinda irrelevant for refusal. Though really this argues that we're both wrong, and the most surgical intervention is deleting the direction from key layers only.
Andy Arditi, 3h
Thanks! We haven't tried comparing to LEACE yet. You're right that theoretically it should be more surgical. Although, from our preliminary analysis, it seems like our naive intervention is already pretty surgical (it has minimal impact on CE loss, MMLU). (I also like that our methodology is dead simple, and doesn't require estimating covariance.) I agree that "orthogonalization" is a bit overloaded. Not sure I like LoRACS though - when I see "LoRA", I immediately think of fine-tuning that requires optimization power (which this method doesn't). I do think that "orthogonalizing the weight matrices with respect to direction r̂" is the clearest way of describing this method.
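For concreteness, a minimal sketch of the two interventions being discussed, assuming a unit-norm refusal direction r_hat in the residual stream: inference-time ablation of the direction, and "orthogonalizing" a weight matrix that writes into the residual stream so it can never write along r_hat. This is my reading of the comments, not the authors' released code.

```python
import numpy as np

def ablate_direction(resid: np.ndarray, r_hat: np.ndarray) -> np.ndarray:
    """Inference-time ablation: remove the component of each residual-stream
    vector along r_hat. resid: (..., d_model), r_hat: (d_model,), unit norm."""
    coeff = np.asarray(resid @ r_hat)[..., None]  # projection coefficients
    return resid - coeff * r_hat

def orthogonalize_weights(W_out: np.ndarray, r_hat: np.ndarray) -> np.ndarray:
    """Weight orthogonalization: edit a matrix that writes into the residual
    stream (shape (d_model, d_in)) so its output has no component along r_hat."""
    return W_out - np.outer(r_hat, r_hat @ W_out)
```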
Andy Arditi, 4h
The most finicky part of our methodology (and the part I'm least satisfied with currently) is in the selection of a direction. For reproducibility of our Llama 3 results, I can share the positions and layers where we extracted the directions from:
* 8B: (position_idx = -1, layer_idx = 12)
* 70B: (position_idx = -5, layer_idx = 37)
The position indexing assumes the usage of this prompt template, with two new lines appended to the end.
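For illustration, here is one way candidate directions at a given (position_idx, layer_idx) might be extracted, assuming a difference of mean activations between harmful and harmless prompts; that assumption is mine and may not match the authors' exact selection procedure.

```python
import numpy as np

def candidate_direction(harmful_acts: np.ndarray,
                        harmless_acts: np.ndarray,
                        layer_idx: int,
                        position_idx: int) -> np.ndarray:
    """acts arrays: (n_prompts, n_layers, n_positions, d_model) residual-stream
    activations cached while running the prompt template."""
    diff = (harmful_acts[:, layer_idx, position_idx].mean(axis=0)
            - harmless_acts[:, layer_idx, position_idx].mean(axis=0))
    return diff / np.linalg.norm(diff)

# e.g. the quoted 8B setting: candidate_direction(h, hl, layer_idx=12, position_idx=-1)
```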

I'm currently viscerally feeling the power of rough quantitative modeling, after trying it on a personal problem to get an order of magnitude estimate and finding that having a concrete estimate was surprisingly helpful. I'd like to make drawing up drop-dead simple quantitative models more of a habit, a tool that I reach for regularly. 

But...despite feeling how useful this can be, I don't yet have a good handle on in which moments, exactly, I should be reaching for that tool. I'm hoping that asking others will give me ideas for what TAPs to experiment with.

What triggers, either in your environment or your thought process, incline you to start jotting down numbers on paper or in a spreadsheet?

Or as an alternative prompt: When was the last time you made a new spreadsheet, and what was the proximal cause?
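As an example of the kind of drop-dead simple quantitative model described above (all numbers are placeholders, not from the post): estimate each factor as a rough range, combine them with a quick Monte Carlo, and read off the order of magnitude.

```python
import numpy as np

# Toy question: roughly how much is a time-saving habit worth per year?
rng = np.random.default_rng(0)
n = 10_000

hours_saved_per_week = rng.uniform(1, 5, n)   # guess: 1-5 hours
weeks_per_year = 48                           # guess: working weeks
value_per_hour = rng.uniform(20, 80, n)       # guess: $20-$80

annual_value = hours_saved_per_week * weeks_per_year * value_per_hour
print(f"median ~ ${np.median(annual_value):,.0f}, "
      f"80% interval ~ ${np.percentile(annual_value, 10):,.0f}"
      f"-${np.percentile(annual_value, 90):,.0f}")
```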

Answer by keltan, May 06, 2024

While an odd answer, it is true for me that music helps to instill rational thinking. I think I’ve done maybe 3 Fermi estimates in my day-to-day after making and listening to this song.

The Fermi Estimate Jig - LessWrong Inspired https://youtu.be/M_DN3Hl8YzU

Having it stuck in my head has been effective for me. I hope it works for others.
