Seth Herd

I did computational cognitive neuroscience research from getting my PhD in 2006 until the end of 2022. I've worked on computational theories of vision, executive function, episodic memory, and decision-making, focusing on the emergent interactions needed to explain complex thought. I became increasingly concerned with AGI applications of the research, and reluctant to publish my best ideas. I'm incredibly excited to now be working directly on alignment, currently with generous funding from the Astera Institute. More info and publication list here.

Comments

The important thing for alignment work isn't the median prediction; if we only had an alignment solution ready by the median date, we'd still face a 50% chance that AGI arrives before it, and of dying from that lack.

I think the biggest takeaway is that nobody has a very precise and reliable prediction, so if we want to have good alignment plans in advance of AGI, we'd better get cracking.

I think Daniel's estimate does include a pretty specific and plausible model of a path to AGI, so I take his estimate the most seriously. My model of possible AGI architectures requires even less compute than his, but I think Hofstadter's Law applies to AGI development, if not to compute progress.

Estimates made in the absence of gears-level models of AGI seem much more uncertain, which might be why Ajeya's and Ege's distributions are much wider.

That all makes sense. To expand a little more on some of the logic:

It seems like the outcome of a partial pause rests in part on whether that would tend to put people in the lead of the AGI race who are more or less safety-concerned.

I think it's nontrivial that we currently have three teams in the lead who all appear to honestly take the risks very seriously, and changing that might be a very bad idea.

On the other hand, the argument for alignment risks is quite strong, and we might expect more people to take the risks more seriously as those arguments diffuse. This might not happen if polarization becomes a large factor in beliefs about AGI risk. The evidence for climate change was also pretty strong, but half of America came to believe in it less, not more, as evidence mounted. The lines of polarization would be different in this case, but I'm afraid it could happen. I outlined that case a little in AI scares and changing public beliefs.

In that case, I think a partial pause would have negative expected value, as the current leaders' advantage decayed and more people who take the risks less seriously got into the lead by circumventing the pause.

This makes me highly unsure if a pause would be net-positive. Having alignment solutions won't help if they're not implemented because the taxes are too high.

The creation of compute overhang is another reason to worry about a pause. It's highly uncertain how far we are from making adequate compute for AGI affordable to individuals. Algorithms and compute will keep getting better during a pause. So will theory of AGI, along with theory of alignment.

This puts me, and I think the alignment community at large, in a very uncomfortable position of not knowing whether a realistic pause would be helpful.

It does seem clear that creating mechanisms and political will for a pause is a good idea.

Advocating for more safety work also seems clear cut.

To this end, I think it's true that you create more political capital by successfully pushing for policy.

A pause now would create even more capital, but it's also less likely to be a win, and it could wind up creating polarization and so costing rather than creating capital. It's harder to argue for a pause now, when even most alignment folks think we're years from AGI.

So perhaps the low-hanging fruit is pushing for voluntary RSPs, and government funding for safety work. These are clear improvements, and likely to be wins that create capital for a pause as we get closer to AGI.

There's a lot of uncertainty here, and that's uncomfortable. More discussion like this should help resolve that uncertainty, and thereby help clarify and unify the collective will of the safety community.

Great analysis. I'm impressed by how thoroughly you've thought this through in the last week or so. I hadn't gotten as far. I concur with your projected timeline, including the difficulty of putting time units onto it. Of course, we'll probably both be wrong in important ways, but I think it's important to at least try to do semi-accurate prediction if we want to be useful.

I have only one substantive addition to your projected timeline, but I think it's important for the alignment implications.

LLM-bots are inherently easy to align, at least for surface-level alignment. You can tell them "make me a lot of money selling shoes, but also make the world a better place" and they will try to do both. Yes, there are still tons of ways this can go off the rails. It doesn't solve outer alignment or alignment stability, for a start. But GPT4's ability to balance several goals, including ethical ones, and to reason about ethics, is impressive.[1] You can easily make agents that both try to make money and think about not harming people.

In short, the fact that you can do this is going to seep into the public consciousness, and we may see regulations and will definitely see social pressure to do this.

I think the agent disasters you describe will occur, but they will happen to people who don't put safeguards into their bots, like "track how much of my money you're spending, stop if it hits $X, and check with me". When agent disasters affect other people, the media will blow them sky high, and everyone will ask "why the hell didn't you have your bot worry about wrecking things for others?". Those who do put additional ethical goals into their agents will crow about it. There will be pressure to conform and run safe bots. As bot disasters get more clever, people will take the prospect of a truly big bot disaster more seriously.

Will all of that matter? I don't know. But predicting the social and economic backdrop for alignment work is worth trying.

Edit: I finished my own followup post on the topic, Capabilities and alignment of LLM cognitive architectures. It's a cognitive psychology/neuroscience perspective on why these things might work better, faster than you'd intuitively think. Improvements to the executive function (outer script code) and episodic memory (Pinecone or other vector search over saved text files) will interact, so that improvements in each make the rest of the system work better and easier to improve.

 

 

  1. ^

    I did a little informal testing of asking for responses in hypothetical situations where ethical and financial goals collide, and it did a remarkably good job, including coming up with win/win solutions that would've taken me a while to think of. It looked like the capitalist reasoning of a pretty intelligent person, but also a fairly ethical one.

You're right. I didn't mean to say that kindness is arcane. I was referring to acausal trade or other strange reasons to keep some humans around for possible future use.

Kindness is normal in our world, but I wouldn't assume it will exist in every or even most situations with intelligent beings. Humans are instinctively kind (except for sociopathic and sadistic people), because that is good game theory for our situation: interactions with peers, in which collaboration/teamwork is useful.

A being capable of real recursive self-improvement, let alone duplication and creation of subordinate minds is not in that situation. They may temporarily be dealing with peers, but they might reasonably expect to have no need of collaborators in the near future. Thus, kindness isn't rational for that type of being.

The exception would be if they could make a firm commitment to kindness while they do have peers and need collaborators. They might have kindness merely as an instrumental goal, in which case it would be abandoned as soon as it was no longer useful.

Or they might display kindness more instinctively, as a tendency in their thought or behavior. They might even have it engineered as an innate goal, as Steve hopes to engineer. In those last two cases, I think it's possible that reflective stability would keep that kindness in place as the AGI continued to grow, but I wouldn't bet on it unless kindness was their central goal. If it was merely a tendency and not an explicit and therefore self-endorsed goal, I'd expect it to be dropped like the bad habit it effectively is. If it was an innate goal but not the strongest one, I don't know, but I wouldn't bet on it being long-term reflectively stable under deliberate self-modification.

(As far as I know, nobody has tried hard to work through the logic of reflective stability of multiple goals. I tried, and gave it up as too vague and less urgent than other alignment questions. My tentative answer was that multiple goals might be reflectively stable; it depends on the exact structure of the decision-making process in that AGI/mind.)

I totally agree that increasing your own happiness is a valid way to pursue utilitarianism. I think this is often overlooked. (although let's bear in mind that almost nobody actually earns-to-give and so almost nobody walks the talk of being fully utilitarian; the few I know of who do have made a career of it, keeping their true motives in question)

I think rationalists are aware of the following calculus: My odds of actually saving my own life by working on AGI alignment are very small. There are thousands of people involved; the odds of my making the critical contribution are tiny, on the order of maybe 1/10000 at most. But the payoff could be immense; I might live for a million years and expand my mind to experience much more happiness per year, if this all goes very well.

For anyone who does that calculus, it is worth being quite unhappy now to have that less-than-1/10000 chance of achieving so much more happiness.

I don't think that's how everyone thinks of it, and probably not most of them. I suspect that even rationalist utilitarians don't have it all spelled out in mathematical detail. I certainly don't.

But my point is, just telling them "hey you should do something that makes you happy" doesn't address the reasons they're doing what they are, for most alignment people, because they have very specific logic for why they're doing what they are.

On the other hand, some of them did just start out thinking "this sounds fun" and have found out it's not, and reminding them to ask if that's the case could make them happy.

And slightly reduce our odds of a grand future...

Good points. I think the term moral realism is probably used in a variety of ways in the public sphere. I think the relevant sense is "will alignment solve itself because a smart machine will decide to behave in a way we like". If there's some vague sense of stuff everyone "should" do, but it doesn't make them actually do it, then it doesn't matter for this purpose.

I was (and have been) making a theory about definitions.

I think "the good is what you should do" is remarkably devoid of useful meaning. People often mean very little by "should", are unclear both to others and themselves, and use it in different ways in different situations.

My theory is that "good" is usually defined as an emotion, not another set of words, and that emotion roughly means "I want that person on my team" (when applied to behavior), because evolution engineered us to find useful teammates, and that feeling is its mechanism for doing so.

If you think we won't die from the first ARA entity, won't it be a valuable warning shot that will improve our odds against a truly deadly AGI?

I have no idea what you mean by your claim if you won't define the central term. Or I do, but I'm just guessing. I think people are typically very vague in what they mean by "good", so it's not adequate for analytical discussion. In this case, a vague sense of good produces only a vague sense in which "moral realism" isn't a strong claim. I just don't know what you mean by that.

I'd be happy to come back later and give my guesses at what people tend to mean by "good"; it's something like "stuff people do whom I want on my team" or "actions that make me feel positively toward someone". But it would require a lot more words to even start nailing down. And while that's a claim about reality, it's quite a complex, dependent, and therefore vague claim, so I'd be reluctant to call it moral realism. Although it is in one sense. So maybe that's what you mean?

Some types of suffering, like chronic pain exacerbated by a shitty repetitive manual labor job, have never gotten much attention. I think that's like the mundane happiness you describe.

Literature lingers on complex emotional suffering, I think, because it's actually more interesting by virtue of being complex but understandable with effort.

It is like a mind tied in complex knots, partly connected to the structure of the world.

I think there are complex joys as well, and literature can have as much fun with those.

I think we focus on suffering because it benefits from our negativity bias, and it seems more virtuous to spend our time understanding suffering than joy.

I think a careful unwrapping of the complexity of a joyful experience, like attending a party and interacting with people in ways they individually appreciate, or the beauty and strangeness of the walk in your other post, ultimately holds just as much interest and virtue, once we don't need to deal with so much shit.

I was wearing rose-tinted sunglasses, for exactly this perspective on overlooked ubiquitous beauty.

Was this secretly written by Golden Gate Claude? There's a lot of mentions of beauty and majesty and mists near that bridge...

I suspect it's coincidence. This sounds fun, and I've been meaning to do more healthy walking while working, particularly now that 4o makes it so easy to talk to someone with deep knowledge of alignment work. It's at least decent for working through ideas.

Crows are interesting little beings. I have a crow buddy that I think got ditched by his crow friends for being a crow asshole. We're not good buddies because I don't think we really get each other, but he does hang around and will respond when I crow at him.
