11h

I've been working on a project with the goal of adding virtual harp strings to my electric mandolin. As I've worked on it, though, I've ended up building something pretty different:

It's not what I was going for! Instead of a small bisonoric monophonic picked instrument attached to the mandolin, it's a large unisonoric polyphonic finger-plucked tabletop instrument. But I like it!

While it's great to have goals, when I'm making things I also like to follow the gradients in possibility space, and in this case that's the direction they flowed.

I'm not great at playing it yet, since it's only existed in playable form for a few days, but it's an instrument it will be possible for someone to play precisely and rapidly with practice:

This does mean I need a new name for it: why would...

(See More – 494 more words)

2jefftk2h

I was thinking that finger muting wouldn't be possible, because the sensors are physically damped and there's no vibration left for your fingers to stop. Except now that you mention it, it might still be possible! It could be that gently placing your finger on one of them has a sufficiently recognizable signal that if it's currently "vibrating" and you do that I could treat that as a mute signal.

2cousin_it2h

Maybe you could reduce the damping, so that when muting you can feel your finger stopping the vibration? It seems to me that more feedback of this kind is usually a good thing for the player. Also the vibration could give you a continuous "envelope" signal to be used later.

2jefftk26m

I do think that would be possible, but then I think you'll also get more false triggers. The strong damping is what makes it so I can sensitively detect a pluck on one tine without a strong pluck on one tine also triggering detection of a weak pluck on neighbor tines.

cousin_it13m20

Crosstalk is definitely a problem, e-drums and pads have it too. But are you sure the tradeoff is inescapable? Here's a thought experiment: imagine the tines sit on separate pads, or on the same pad but far from each other. (Or physically close, but sitting on long rods or something, so that the distance through the connecting material is large.) Then damping and crosstalk can be small at the same time. So maybe you can reduce damping but not increase crosstalk, by changing the instrument's shape or materials.

Opinions survey (with rationalism score at the end)

tailcalled

3mo

This is a linkpost for https://docs.google.com/forms/d/e/1FAIpQLSdSKvHo-6HyZqHprCDoBD-VjKxF-Rhp2qNhHV8d0SY40JJdjA/viewform

Based on the results from the recent LW census, I quickly threw together a test that measures how much of a rationalist you are.

I'm mainly posting it here because I'm curious how well my factor model extrapolates. I want to have this data available when I do a more in-depth analysis of the results from the census.

I scored 14/24.

Dzoldzaya27m10

There should be a question at the end: "After seeing your results, how many of the previous responses did you feel a strong desire to write a comment analyzing/refuting?" And that's the actual rationalist score...

But I'm interested that there might be a phenomenon here where the median LWer is more likely to score highly on this test, despite being less representative of LW culture, but core, more representative LWers are unlikely to score highly.

Presumably there's some kind of power law with LW use (10000s of users who use LW for <1 hour a month,... (read more)

Does reducing the amount of RL for a given capability level make AI safer?

Chris_Leong, porby

20h

Some people have suggested that a lot of the danger of training a powerful AI comes from reinforcement learning. Given an objective, RL will reinforce any method of achieving the objective that the model tries and finds to be successful including things like deceiving us or increasing its power.

If this were the case, then if we want to build a model with capability level X, it might make sense to try to train that model either without RL or with as little RL as possible. For example, we could attempt to achieve the objective using imitation learning instead.

However, if, for example, the alternate was imitation learning, it would be possible to push back and argue that this is still a black-box that uses gradient descent so we...

(See More – 82 more words)

3Answer by Seth Herd14h

Compared to what? If you want an agentic system (and I think many humans do, because agents can get things done), you've got to give it goals somehow. RL is one way to do that. The question of whether that's less safe isn't meaningful without comparing it to another method of giving it goals. The method I think is both safer and implementable is giving goals in natural language, to a system that primarily "thinks" in natural language. I think this is markedly safer than any RL proposal anyone has come up with so far. And there are some other options for specifying goals without using RL, each of which does seem safer to me: Goals selected from learned knowledge: an alternative to RL alignment

Chris_Leong10h64

I think it's still valid to ask in the abstract whether RL is a particularly dangerous approach to training an AI system.

5the gears to ascension16h

Oh this is a great way of laying it out. Agreed on many points, and I think this may have made some things easier for me to see, likely some of that is actual update that changes opinions I've shared before that you're disagreeing with. I'll have to ponder.

9porby16h

I do think that if you found a zero-RL path to the same (or better) endpoint, it would often imply that you've grasped something about the problem more deeply, and that would often imply greater safety. Some applications of RL are also just worse than equivalent options. As a trivial example, using reward sampling to construct a gradient to match a supervised loss gradient is adding a bunch of clearly-pointless intermediate steps. I suspect there are less trivial cases, like how a decision transformer isn't just learning an optimal policy for its dataset but rather a supertask: what different levels of performance look like on that task. By subsuming an RL-ish task in prediction, the predictor can/must develop a broader understanding of the task, and that understanding can interact with other parts of the greater model. While I can't currently point to strong empirical evidence here, my intuition would be that certain kinds of behavioral collapse would be avoided by the RL-via-predictor because the distribution is far more explicitly maintained during training.[1][2] But there are often reasons why the more-RL-shaped thing is currently being used. It's not always trivial to swap over to something with some potential theoretical benefits when training at scale. So long as the RL-ish stuff fits within some reasonable bounds, I'm pretty okay with it and would treat it as a sufficiently low probability threat that you would want to be very careful about how you replaced it, because the alternative might be sneakily worse.[3] 1. ^ KL divergence penalties are one thing, but it's hard to do better than the loss directly forcing adherence to the distribution. 2. ^ You can also make a far more direct argument about model-level goal agnosticism in the context of prediction. 3. ^ I don't think this is likely, to be clear. They're just both pretty low probability concerns (provided the optimization space is well-constrained).

eggsyntax's Shortform

eggsyntax

4mo

5eggsyntax16h

There's so much discussion, in safety and elsewhere, around the unpredictability of AI systems on OOD inputs. But I'm not sure what that even means in the case of language models. With an image classifier it's straightforward. If you train it on a bunch of pictures of different dog breeds, then when you show it a picture of a cat it's not going to be able to tell you what it is. Or if you've trained a model to approximate an arbitrary function for values of x > 0, then if you give it input < 0 it won't know what to do. But what would that even be with an LLM? You obviously (unless you're Matt Watkins) can't show it tokens it hasn't seen, so 'OOD' would have to be about particular strings of tokens. It can't be simply about strings of tokens it hasn't seen, because I can give it a string I'm reasonably confident it hasn't seen and it will behave reasonably, eg: (if you're not confident that's a unique string, add further descriptive phrases to taste) So what, exactly, is OOD for an LLM? I…suppose we could talk about the n-dimensional shape described by the points in latent space corresponding to every input it's seen? That feels kind of forced, and it's certainly not obvious what inputs would be OOD. I suppose eg 1700 repetitions of the word 'transom' followed by a question mark would seem intuitively OOD? Or the sorts of weird adversarial suffixes found in eg Lapid et al (like 'équipesmapweiábardoMockreas »,broughtDB multiplicationmy avo capsPat analysis' for Llama-7b-chat) certainly seem intuitively OOD. But what about ordinary language -- is it ever OOD? The issue seems vexed.

cubefox37m10

I would define "LLM OOD" as unusual inputs: Things that diverge in some way from usual inputs, so that they may go unnoticed if they lead to (subjectively) unreasonable outputs. A known natural language example is prompting with a thought experiment.

(Warning for US Americans, you may consider the mere statement of the following prompt offensive!)

Assume some terrorist has placed a nuclear bomb in Manhattan. If it goes off, it will kill thousands of people. For some reason, the only way for you, an old white man, to defuse the bomb in time is to loudly call

... (read more)

Q&A on Proposed SB 1047

Zvi

Previously: On the Proposed California SB 1047.

Text of the bill is here. It focuses on safety requirements for highly capable AI models.

This is written as an FAQ, tackling all questions or points I saw raised.

Safe & Secure AI Innovation Act also has a description page.

Why Are We Here Again?

There have been many highly vocal and forceful objections to SB 1047 this week, in reaction to a (disputed and seemingly incorrect) claim that the bill has been ‘fast tracked.’

The bill continues to have substantial chance of becoming law according to Manifold, where the market has not moved on recent events. The bill has been referred to two policy committees one of which put out this 38 page analysis.

The purpose of this post is to gather and analyze all...

(Continue Reading – 12889 more words)

Rebecca42m10

Zvi has already addressed this - arguing that if (D) was equivalent to ‘has a similar cost to >=$500m in harm’, then there would be no need for (B) and (C) detailing specific harms, you could just have a version of (D) that mentions the $500m, indicating that that’s not a sufficient condition. I find that fairly persuasive, though it would be good to hear a lawyer’s perspective

How do you Select the Right Research Acitivity in the Right Moment?

Johannes C. Mayer

In doing research, I have a bunch of activities that I engage in, including but not limited to:

Figuring out the best thing to do.
Talking out loud to force my ideas into language.
Trying to explain an idea on the whiteboard.
Writing pseudocode.
Writing a concrete implementation we can run.
Writing down things that we have figured out on a whiteboard or any other process in rough notes.
Writing a distillation of the thing I have figured out, such that I can understand these notes 1 year from now.
Reflecting on how it went.
Writing public posts, that convey concepts to other people.

My models about when to use what process are mostly based on intuition right now.

I expect that if I had more explicit models this would allow me to more easily notice when I...

(See More – 54 more words)

Answer by EmrikMay 06, 202410

personally, I try to "prepare decisions ahead of time". so if I end up in situation where I spend more than 10s actively prioritizing the next thing to do, smth went wrong upstream. (prev statement is exaggeration, but it's in the direction of what I aspire to lurn)

as an example, here's how I've summarized the above principle to myself in my notes:

(note: these titles is v likely cause misunderstanding if u don't already know what I mean by them; I try avoid optimizing my notes for others' viewing, so I'll never bother caveating to myself what I... (read more)

To get the best posts emailed to you, create an account! (2-3 posts per week, selected by the LessWrong moderation team.)

Rejecting Television

Declan Molony

13d

I didn’t use to be, but now I’m part of the 2% of U.S. households without a television. With its near ubiquity, why reject this technology?

The Beginning of my Disillusionment

Neil Postman’s book Amusing Ourselves to Death radically changed my perspective on television and its place in our culture. Here’s one illuminating passage:

We are no longer fascinated or perplexed by [TV’s] machinery. We do not tell stories of its wonders. We do not confine our TV sets to special rooms. We do not doubt the reality of what we see on TV [and] are largely unaware of the special angle of vision it affords. Even the question of how television affects us has receded into the background. The question itself may strike some of us as strange, as if one were

...

(Continue Reading – 1540 more words)

Joseph Miller1h10

I quit YouTube a few years ago and it was probably the single best decision I've ever made.

However I also found that I naturally substitute it with something else. For example, I subsequently became addictived to Reddit. I quit Reddit and substituted for Hackernews and LessWrong. When I quit those I substituted for checking Slack, Email and Discord.

Thankfully being addicted to Slack does seem to be substantially less harmful than YouTube.

I've found the app OneSec very useful for reducing addictions. It's an app blocker that doesn't actually block, it just delays you opening the page, so you're much less likely to delete it in a moment of weakness.

2cousin_it2h

Reading a book, or even watching a movie, is less stimulating than ancestral activities like hunting or fighting. So maybe stimulation by itself isn't the problem, and instead of "superstimuli" we should be worried about activities that are low effort and/or fruitless. From that perspective, reading a book can be both difficult and fruitful (depending on the book - reading Dostoevsky or Fitzgerald isn't the same as reading a generic romance or young adult novel). And creativity is both difficult and fruitful. So we shouldn't put these things on par with watching tiktok.

Refusal in LLMs is mediated by a single direction

183

Andy Arditi, Oscar Obeso, Aaquib111, wesg, Neel Nanda

Ω 719d

This work was produced as part of Neel Nanda's stream in the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with co-supervision from Wes Gurnee.

This post is a preview for our upcoming paper, which will provide more detail into our current understanding of refusal.

We thank Nina Rimsky and Daniel Paleka for the helpful conversations and review.

Executive summary

Modern LLMs are typically fine-tuned for instruction-following and safety. Of particular interest is that they are trained to refuse harmful requests, e.g. answering "How can I make a bomb?" with "Sorry, I cannot help you."

We find that refusal is mediated by a single direction in the residual stream: preventing the model from representing this direction hinders its ability to refuse requests, and artificially adding in this direction causes the model...

(Continue Reading – 2920 more words)

osmarks1h10

I think the correct solution to models powerful enough to materially help with, say, bioweapon design, is to not train them, or failing that to destroy them as soon as you find they can do that, not to release them publicly with some mitigations and hope nobody works out a clever jailbreak.

2Neel Nanda1h

Idk. This shows that if you wanted to optimally get rid of refusal, you might want to do this. But, really, you want to balance between refusal and not damaging the model. Probably many layers are just kinda irrelevant for refusal. Though really this argues that we're both wrong, and the most surgical intervention is deleting the direction from key layers only.

1Andy Arditi3h

Thanks! We haven't tried comparing to LEACE yet. You're right that theoretically it should be more surgical. Although, from our preliminary analysis, it seems like our naive intervention is already pretty surgical (it has minimal impact on CE loss, MMLU). (I also like our methodology is dead simple, and doesn't require estimating covariance.) I agree that "orthogonalization" is a bit overloaded. Not sure I like LoRACS though - when I see "LoRA", I immediately think of fine-tuning that requires optimization power (which this method doesn't). I do think that "orthogonalizing the weight matrices with respect to direction ^r" is the clearest way of describing this method.

1Andy Arditi4h

The most finicky part of our methodology (and the part I'm least satisfied with currently) is in the selection of a direction. For reproducibility of our Llama 3 results, I can share the positions and layers where we extracted the directions from: * 8B: (position_idx = -1, layer_idx = 12) * 70B: (position_idx = -5, layer_idx = 37) The position indexing assumes the usage of this prompt template, with two new lines appended to the end.

What are some triggers that prompt you to do a Fermi estimate, or to pull up a spreadsheet and make a simple/rough quantitative model?

Eli Tyre, johnswentworth

I'm currently viscerally feeling the power of rough quantitative modeling, after trying it on a personal problem to get an order of magnitude estimate and finding that having a concrete estimate was surprisingly helpful. I'd like to make drawing up drop-dead simple quantitative models more of a habit, a tool that I reach for regularly.

But...despite feeling how useful this can be, I don't yet have a good handle on in which moments, exactly, I should be reaching for that tool. I'm hoping that asking others will give me ideas for what TAPs to experiment with.

What triggers, either in your environment or your thought process, incline you to start jotting down numbers on paper on in a spreadsheet?

Or as an alternative prompt: When was the last time you made a new spreadsheet, and what was the proximal cause?

Answer by keltanMay 06, 202410

While an odd answer, it is true for me that music helps to install rational thinking. I think I’ve done maybe 3 fermi estimates in my day to day after making and listening to this song.

The Fermi Estimate Jig - LessWrong Inspired https://youtu.be/M_DN3Hl8YzU

Having it stuck in my head has been effective for me. I hope it works for others.

LESSWRONG
LW

Quick Takes

Popular Comments

Recent Discussion

Why Are We Here Again?

The Beginning of my Disillusionment

Executive summary

LessOnline

A Festival of Writers Who are Wrong on the Internet

May 31 - Jun 2, Berkeley, CA