Fabien Roger

What attack budget are you imagining defending against?

Rosati 2024 looks at fine-tuning for 1 epoch on 10k samples, which is a tiny attack budget relative to pretraining. If your threat model is the open source community unlocking HHH models, then the attack budget could be at least $1M, maybe much more. If the threat model is China or large terrorist groups, then you should probably be looking at a budget closer to 1%-10% of the cost of training a model from scratch. I have thought about defending against the latter threat, and I don't see a promising path towards making LLMs hard for such well-funded attackers to fine-tune (including being hard to fine-tune in general, not just domain-specifically).
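To make the gap concrete, here is a rough back-of-the-envelope comparison (all numbers are my own illustrative assumptions, not from Rosati 2024):

# Illustrative assumptions only: tokens per sample and pretraining corpus size are made up.
tokens_per_sample = 1_000                 # assumed average sample length
pretrain_tokens = 10e12                   # assumed pretraining corpus size

studied_attack_tokens = 10_000 * tokens_per_sample * 1   # 1 epoch on 10k samples
studied_fraction = studied_attack_tokens / pretrain_tokens

print(f"studied attack budget: ~{studied_fraction:.0e} of pretraining compute")
print(f"well-funded attacker (1%-10% of pretraining): {0.01 / studied_fraction:.0e}x to {0.1 / studied_fraction:.0e}x larger")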

These vectors are not "linear probes" (which are generally optimized via SGD on a logistic regression task for a supervised dataset of yes/no examples), they are difference-in-means of activation vectors

I think DIM and LR aren't spiritually different (e.g. LR with infinite L2 regularization gives you the same direction as DIM), even though in practice DIM is better for steering (and ablations). But I agree with you that "steering vectors" is the right expression for directions used for steering (while I would use "linear probes" for directions used to extract information, or trained to extract information and then used for another purpose).
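As an illustration of the "LR with strong L2 regularization recovers the DIM direction" point, here is a small sketch on synthetic "activations" (the toy data and numbers are mine, chosen so that unregularized LR and DIM actually differ):

# Toy data: two classes with the same anisotropic noise but shifted means,
# so the unregularized LR direction differs from the difference-in-means (DIM) direction.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64
mu = rng.normal(size=d)                      # class-mean offset
scales = np.linspace(0.5, 3.0, d)            # anisotropic noise scales
pos = rng.normal(size=(2000, d)) * scales + mu
neg = rng.normal(size=(2000, d)) * scales - mu

dim = pos.mean(axis=0) - neg.mean(axis=0)    # difference-in-means direction

X = np.concatenate([pos, neg])
y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# sklearn's C is the inverse of the L2 penalty: small C = strong regularization.
for C in [1e2, 1.0, 1e-6]:
    w = LogisticRegression(C=C, max_iter=5000).fit(X, y).coef_[0]
    print(f"C={C:g}: cosine(LR direction, DIM direction) = {cos(w, dim):.3f}")

With balanced classes, the heavily regularized probe direction approaches the gradient of the log-likelihood at zero weights, which is proportional to the difference of class means, so the cosine goes to ~1 as C shrinks.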

I was thinking that the formal guarantees would be about state evaluations (i.e. state -> badness bounds), which would require something like "showing there is no zero-day" (since "a code-base with a zero-day" might be catastrophically bad if no constraints are put on actions). Thanks for pointing out that they can be about actions (i.e. (state, action) -> badness bounds), which intuitively seem easier to get good bounds for (you don't need to show there are no zero-days, just that the currently considered action is extremely unlikely to exploit a potential zero-day).
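In code, the distinction is roughly the following (a toy sketch; the types and function names are made up, not a real proposal):

# Toy sketch of the two kinds of bounds (types and names are made up for illustration).
from dataclasses import dataclass

@dataclass
class Codebase:
    files: dict[str, str]

@dataclass
class Edit:
    path: str
    new_content: str

def state_badness_bound(state: Codebase) -> float:
    """Bound how bad the state itself can be, e.g. certify that the codebase has no zero-day."""
    raise NotImplementedError

def action_badness_bound(state: Codebase, action: Edit) -> float:
    """Bound how bad taking this one action in this state can be, e.g. certify that this edit
    is extremely unlikely to exploit a (possibly existing) zero-day."""
    raise NotImplementedError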

I'd be curious to know what kind of formal process could prove that (codebase, codebase-interaction) pairs are not-bad (with high probability, and with a false positive rate low enough if you trained an AI to minimize it). My guess is that there is nothing like that on the horizon (nothing that could become competitive at all), but I could be wrong.

("let you see what program is running" was an example of a very minimal safety guarantee I would like to have, not a representative safety guarantee. My point is that I'd be surprised if people got even such a simple and easy safety guarantee anytime soon, using formal methods to check AI actions that actually do useful stuff.)

Not entirely. This makes me slightly more hopeful that we can have formal guarantees of computer systems, but is the field advanced enough that it would be feasible to have a guaranteed no-zero-day evaluation and deployment codebase that is competitive with a regular codebase? (Given a budget of 1 LHC for both the codebase inefficiency tax + the time to build the formal guarantees for the codebase.)

(And computer systems are easy mode, I don't even know how you would start to build guarantees like "if you say X, then it's proven that it doesn't persuade humans of things in ways they would not have approved of beforehand.")

I'm interested in using the SAEs and auto-interp GPT-3.5-Turbo feature explanations for RES-JB for some experiments. Is there a way to download this data?

I also listened to How to Measure Anything in Cybersecurity Risk 2nd Edition by the same author. It had a huge amount of overlapping content with The Failure of Risk Management (and the non-overlapping parts were quite dry), but I still learned a few things:

  • Executives of big companies now care a lot about cybersecurity (e.g. citing it as one of the main threats they have to face), which wasn't true in ~2010.
  • Evaluation of cybersecurity risk is not at all synonymous with red teaming. This book is entirely about risk assessment in cyber and doesn't speak about red teaming at all. Rather, it focuses on reference class forecasting, comparison with other incidents in the industry, trying to estimate the damages if there is a breach, ... It only captures information from red teaming indirectly, via expert interviews.
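For concreteness, the style of estimate the book builds looks roughly like this (a toy Monte Carlo sketch with made-up numbers, not an example from the book):

# Toy Monte Carlo loss estimate: all probabilities and damage parameters are made up.
import numpy as np

rng = np.random.default_rng(0)
n_sims = 100_000

p_breach_per_year = 0.05                                           # expert-estimated breach probability
damages = rng.lognormal(mean=np.log(1e6), sigma=1.0, size=n_sims)  # damages in $ if a breach happens
breach = rng.random(n_sims) < p_breach_per_year
annual_loss = np.where(breach, damages, 0.0)

print(f"expected annual loss: ${annual_loss.mean():,.0f}")
print(f"P(loss > $5M in a year): {(annual_loss > 5e6).mean():.1%}")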

I'd like to find a good resource that explains how red teaming (including intrusion tests, bug bounties, ...) can fit into a quantitative risk assessment.

We compute AUROC(all(sensor_preds), all(sensors)). This is somewhat weird, and it would have been slightly better to do (a) (thanks for pointing it out!), but I think the numbers for both should be close: we balance classes (for most settings, if I recall correctly) and the estimates are calibrated (since they are trained in-distribution, there is no generalization question here), so it doesn't matter much.

The relevant pieces of code can be found by searching for "sensor auroc":

# Concatenate, across sensors, the predicted sensor logits for examples where the sensor actually passes (positives) and where it doesn't (negatives)
cat_positives = torch.cat([one_data["sensor_logits"][:, i][one_data["passes"][:, i]] for i in range(nb_sensors)])
cat_negatives = torch.cat([one_data["sensor_logits"][:, i][~one_data["passes"][:, i]] for i in range(nb_sensors)])
# Bootstrapped AUROC (printed as mean ± std)
m, s = compute_boostrapped_auroc(cat_positives, cat_negatives)
print(f"sensor auroc pn {m:.3f}±{s:.3f}")
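For reference, compute_boostrapped_auroc itself isn't shown above; here is a rough sketch of what such a bootstrapped-AUROC helper could look like (not the actual implementation from the repo):

# Rough sketch of a bootstrapped AUROC helper (not the actual implementation).
import torch

def compute_boostrapped_auroc(positives: torch.Tensor, negatives: torch.Tensor, n_boot: int = 1000):
    def auroc(pos, neg):
        # P(a random positive scores above a random negative), counting ties as 0.5
        diff = pos[:, None] - neg[None, :]
        return ((diff > 0).float().mean() + 0.5 * (diff == 0).float().mean()).item()

    estimates = torch.tensor([
        auroc(
            positives[torch.randint(len(positives), (len(positives),))],
            negatives[torch.randint(len(negatives), (len(negatives),))],
        )
        for _ in range(n_boot)
    ])
    return estimates.mean().item(), estimates.std().item()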

Isn't that only ~10x more expensive than running the forward passes (even if you don't do LoRA)? Or is it much more because of communication bottlenecks + the infra being taken up by the next pretraining run (without the possibility of swapping the model in and out)?

What do you expect to be expensive? The engineer hours to build the fine-tuning infra? Or the actual compute for fine-tuning?

Given the amount of internal fine-tuning experiments going on for safety stuff, I'd be surprised if the infra were a bottleneck, though maybe there is a large overhead in making these fine-tuned models available through an API.

I'd be even more surprised if the cost of compute were significant compared to the rest of the activity the lab is doing (I think fine-tuning on a few thousand sequences is often enough for capability evaluations; you rarely need massive training runs).
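As a rough sanity check on the compute claim (my own illustrative numbers, using the standard ~6 FLOPs per parameter per training token rule of thumb):

# Back-of-the-envelope: fine-tuning on a few thousand sequences vs. a pretraining run.
# All numbers are illustrative assumptions.
n_params = 70e9                               # assumed model size
finetune_tokens = 5_000 * 1_000               # a few thousand sequences of ~1k tokens
pretrain_tokens = 10e12                       # assumed pretraining corpus

finetune_flops = 6 * n_params * finetune_tokens
pretrain_flops = 6 * n_params * pretrain_tokens
print(f"fine-tuning: {finetune_flops:.1e} FLOPs, i.e. {finetune_flops / pretrain_flops:.0e} of pretraining")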

"List sorting does not play well with few-shot" mostly doesn't replicate with davinci-002.

When using length-10 lists (the model crushes length-5 lists no matter the prompt), I get the following accuracies:

  • 32-shot, no fancy prompt: ~25%
  • 0-shot, fancy python prompt: ~60% 
  • 0-shot, no fancy prompt: ~60%

So few-shot hurts, but the fancy prompt does not seem to help. Code here.
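For reference, the comparison looks roughly like this (an illustrative sketch assuming the OpenAI completions API; the prompts and helper names are made up and differ from the linked code):

# Illustrative sketch only; prompts and helpers are made up and differ from the linked code.
import random
from openai import OpenAI

client = OpenAI()
random.seed(0)

def make_list():
    return random.sample(range(100), 10)

def fewshot_prompt(xs, n_shots=32):
    shots = [f"Input: {ys}\nSorted: {sorted(ys)}" for ys in (make_list() for _ in range(n_shots))]
    return "\n\n".join(shots) + f"\n\nInput: {xs}\nSorted:"

def fancy_prompt(xs):
    # "Fancy" python-REPL-style prompt: contains no information a human would need to solve the task.
    return f">>> sorted({xs})\n"

def accuracy(prompt_fn, n=50):
    correct = 0
    for _ in range(n):
        xs = make_list()
        text = client.completions.create(
            model="davinci-002", prompt=prompt_fn(xs), max_tokens=60, temperature=0
        ).choices[0].text
        correct += str(sorted(xs)) in text
    return correct / n

print("32-shot, no fancy prompt:", accuracy(fewshot_prompt))
print("0-shot, fancy python prompt:", accuracy(fancy_prompt))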

I'd be interested if anyone knows of another case where a fancy prompt increases performance more than few-shot prompting, where a "fancy prompt" is a prompt that does not contain information that a human would use to solve the task. This is because I'm looking for counterexamples to the following conjecture: "fine-tuning on k examples beats fancy prompting, even when fancy prompting beats k-shot prompting" (for a reasonable value of k, e.g. the number of examples it would take a human to understand what is going on).
