quila

Independent researcher theorizing about superintelligence-robust training stories.

If you disagree with me for reasons you expect I'm not aware of, please tell me!

If you have/find an idea that's genuinely novel/out-of-human-distribution while remaining analytical, you're welcome to send it to me to 'introduce chaos into my system'.

Contact: {discord: quilalove, matrix: @quilauwu:matrix.org, email: quila1<at>protonmail.com}

some look outwards, at the dying stars and the space between the galaxies, and they dream of godlike machines sailing the dark oceans of nothingness, blinding others with their flames.

-----BEGIN PGP PUBLIC KEY BLOCK-----

mDMEZiAcUhYJKwYBBAHaRw8BAQdADrjnsrbZiLKjArOg/K2Ev2uCE8pDiROWyTTO
mQv00sa0BXF1aWxhiJMEExYKADsWIQTuEKr6zx3RBsD/QW3DBzXQe0TUaQUCZiAc
UgIbAwULCQgHAgIiAgYVCgkICwIEFgIDAQIeBwIXgAAKCRDDBzXQe0TUabWCAP0Z
/ULuLWf2QaljxEL67w1b6R/uhP4bdGmEffiaaBjPLQD/cH7ufTuwOHKjlZTIxa+0
kVIMJVjMunONp088sbJBaQi4OARmIBxSEgorBgEEAZdVAQUBAQdAq5exGihogy7T
WVzVeKyamC0AK0CAZtH4NYfIocfpu3ADAQgHiHgEGBYKACAWIQTuEKr6zx3RBsD/
QW3DBzXQe0TUaQUCZiAcUgIbDAAKCRDDBzXQe0TUaUmTAQCnDsk9lK9te+EXepva
6oSddOtQ/9r9mASeQd7f93EqqwD/bZKu9ioleyL4c5leSQmwfDGlfVokD8MHmw+u
OSofxw0=
=rBQl
-----END PGP PUBLIC KEY BLOCK-----

Comments

quila

(Personal) On writing and (not) speaking

I often struggle to find words and sentences that match what I intend to communicate.

Here are some problems this can cause:

  1. Wordings that are odd or unintuitive to the reader, but that are at least literally correct.[1]
  2. Not being able to express what I mean, and having to choose between not writing it or risking miscommunication by trying anyway. I tend to choose the former unless I'm writing to a close friend. Unfortunately, this means I'm unable to express some key insights to a general audience.
  3. Writing taking a lot of time: I usually have to iterate many times on words/sentences until I find a version my mind parses as referring to what I intend. In the slowest cases, I might finalize only 2-10 words per minute. Even after iterating, my words are often interpreted in ways I failed to foresee.

These apply to speaking, too. If I speak what would be the 'first iteration' of a sentence, there's a good chance it won't create an interpretation matching what I intend to communicate. In spoken language I have no chance to constantly 'rewrite' my output before sending it. This is one reason, but not the only reason, that I've had a policy of trying to avoid voice-based communication.

I'm not fully sure what caused this relationship to language. It could be that it's just a byproduct of being autistic. It could also be a byproduct of out-of-distribution childhood abuse.[2]

  1. ^

    E.g., once I couldn't find the word 'clusters,' and wrote a complex sentence referring to 'sets of similar' value functions each corresponding to a common alignment failure mode / ASI takeoff training story. (I later found a way to make it much easier to read)

  2. ^

    (Content warning)

    My primary parent was highly abusive, and would punish me for speaking about particular instances of that in the intuitive, 'direct' way. My early response was to try to euphemize and phrase things differently, in a way that was less in contradiction with the power dynamic / social reality she enforced.

    Eventually I learned to model her as a deterministic system and stay silent / fawn.

quila

Conditional on us solving alignment, I agree it's more likely that we live in an "easy-by-default" world, rather than a "hard-by-default" one in which we got lucky or played very well.

I think that language in discussions of anthropics is unintentionally prone to masking ambiguities or conflations, especially wrt logical vs indexical probability, so I want to be very careful writing about this. I think there may be some conceptual conflation happening here, but I'm not sure how to word it. I'll see if it becomes clear indirectly.

One difference between our intuitions may be that I'm implicitly thinking within a many-worlds frame. Within that frame, it's actually certain that we'll solve alignment in some branches.

So if we then 'condition on solving alignment in the future', my mind defaults to something like this: "this is not much of an update; it just means we're in a future where the past was not a death outcome. Some of the pasts leading up to those futures had really difficult solutions, and some of them managed to find easier ones or get lucky. The probabilities of these non-death outcomes relative to each other have not changed as a result of this conditioning." (I.e., I disagree with the top quote.)

The most probable reason I can see for this difference is if you're thinking in terms of a single future, where you expect to die.[1] In this frame, if you observe yourself surviving, it may seem[2] that you should update your logical belief about whether alignment is hard (because P(continued observation | alignment is hard) is low if we imagine a single future, but certain if we imagine the space of indexically possible futures).

Whereas I read it as only indexical, and am generally thinking about this in terms of indexical probabilities.

I totally agree that we shouldn't update our logical beliefs in this way. I.e., that with regard to beliefs about logical probabilities (such as 'alignment is very hard for humans'), we "shouldn't condition on solving alignment, because we haven't yet." I.e., that we shouldn't condition on the future not being mostly death outcomes when we haven't yet averted them and have reason to think they're likely.

Maybe this helps clarify my position?

On another point:

the developments in non-agentic AI we're facing are still one regime change away from the dynamics that could kill us

I agree with this, and I still found the current lack of goals over the world surprising, and think it's worth trying to get as a trait of superintelligent systems.

  1. ^

    (I'm not disagreeing with this being the most common outcome)

  2. ^

    Though after reflecting on it more I (with low confidence) think this is wrong, and one's logical probabilities shouldn't change after surviving in a 'one-world frame' universe either.

    For an intuition pump: consider the case where you've crafted a device which, when activated, leverages quantum randomness to kill you with probability (n-1)/n, where n is some arbitrarily large number. Given you've crafted it correctly, you make no logical update in the many-worlds frame, because survival is the only thing you will observe; you expect to observe the 1/n branch.

    In the 'single world' frame, continued survival isn't guaranteed, but it's still the only thing you could possibly observe, so it intuitively feels like the same reasoning applies...?
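To make the direction of the update in this footnote concrete, here's a minimal Bayes sketch in Python. The prior and the survival probabilities are made-up numbers; only the comparison between the two framings matters.

```python
# Toy numbers only: a prior on 'alignment is hard for humans' and assumed
# survival probabilities under each hypothesis, to contrast the two framings.
p_hard = 0.7
p_easy = 1 - p_hard

# Naive single-future framing: surviving is less likely if alignment is hard,
# so observing survival shifts belief toward 'easy'.
p_survive_given_hard = 0.1
p_survive_given_easy = 0.6
p_survive = p_hard * p_survive_given_hard + p_easy * p_survive_given_easy
print("single-future framing:          P(hard | survive) =",
      round(p_hard * p_survive_given_hard / p_survive, 3))

# Framing where continued observation is certain under either hypothesis
# (you only ever find yourself in branches where you survived): no update.
print("observation-is-certain framing: P(hard | survive) =",
      round(p_hard * 1.0 / (p_hard * 1.0 + p_easy * 1.0), 3))
```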

quila

On Pivotal Acts

I was rereading some of the old literature on alignment research sharing policies after Tamsin Leake's recent post and came across some discussion of pivotal acts as well.

Hiring people for your pivotal act project is going to be tricky. [...] People on your team will have a low trust and/or adversarial stance towards neighboring institutions and collaborators, and will have a hard time forming good-faith collaboration. This will alienate other institutions and make them not want to work with you or be supportive of you.

This is in a context where the 'pivotal act' example is using a safe ASI to shut down all AI labs.[1]

My thought is that I don't see why a pivotal act needs to be that. I don't see why shutting down AI labs or using nanotech to disassemble GPUs on Earth would be necessary. These may be among the 'most direct' or 'simplest to imagine' possible actions, but in the case of superintelligence, simplicity is not a constraint.

We can instead select for the 'kindest' or 'least adversarial', or more precisely the functional-decision-theoretically optimal, actions that save the future while minimizing the amount of adversariality this creates in the past (present).

Which can be broadly framed as 'using ASI for good'. Which is what everyone wants, even the ones being uncareful about its development.

Capabilities orgs would be able to keep working on fun capabilities projects in those days during which the world is saved, because a group following this policy would choose to use ASI to make the world robust to the failure modes of capabilities projects rather than shutting them down. Because superintelligence is capable of that, and so much more.

  1. ^

    Side note: It's orthogonal to the point of this post, but this example also makes me think: if I were working on a safe ASI project, I wouldn't mind if another group who had discreetly built safe ASI used it to shut my project down, since my goal is 'ensure the future lightcone is used in a valuable, tragedy-averse way' and not 'gain personal power' or 'have a fun time working on AI' or something. In my morality, it would be naive to be opposed to that shutdown. But to the extent humanity is naive, we can easily do something else in that future to create better present dynamics (as the main text argues).

    If there is a group for whom even this, using ASI to make the world robust to risks and free of harm in a way that doesn't infringe on ongoing non-violent activities, is problematic, then this post doesn't apply to them: their issue all along was not with the character of the pivotal act, but possibly with something like 'having my personal cosmic significance as a capabilities researcher stripped away by the success of an external alignment project'.

    Another disclaimer: This post is about a world in which safely usable superintelligence has been created, but I'm not confident that anyone (myself included) currently has a safe and ready method to create it. This post shouldn't be read as an endorsement of possible current attempts to do this. I would of course prefer if this civilization were one which could coordinate such that no groups were presently working on ASI, precluding this discourse.

quila

'Value Capture' - An anthropic attack against some possible formally aligned ASIs

(this is a more specific case of anthropic capture attacks in general, aimed at causing a superintelligent search process within a formally aligned system to become uncertain about the value function it is to maximize (or its output policy more generally))

Imagine you're a superintelligence somewhere in the world that's unreachable to life on Earth, and you have a complete simulation of Earth. You see a group of alignment researchers about to successfully create a formal-value-aligned ASI, and its design looks broadly like this:

It has two relevant high-level components: (1) a hard-coded value function, (2) a (truly superintelligent) 'intelligence core' which searches for an output that maximizes the value function, and then outputs it.

As the far-away unaligned ASI, here's something you might be able to do to make the intelligence core search for an output that instead maximizes your own value function, depending on the specifics of how the intelligence core works.

  • Given the intelligence core is truly superintelligent, it knows you're predicting its existence, and knows what you will do.
  • You create simulated copies of the intelligence core, but hook them up to a value function of your design. (In the toy case where there are no other superintelligences,) the number of copies you create just needs to be greater than the number that will be run on Earth.
    • Then, modify the simulations such that the algorithms inside the simulated intelligence cores are misled into believing the value function they are set to maximize is the same one the system on Earth is set to maximize, rather than the one you gave them.
    • Now your copies are in the same epistemic state as the intelligence core on Earth, both aware that you have done this and unable to distinguish which value function they are to maximize.
    • Because you created more copies, the highest expected value for such an intelligence core comes from acting as if it is one of the copies (see the toy calculation after this list).
    • Because the copies and the original are in the same epistemic state, they will all reach this conclusion and then search for an output that maximizes the far-away ASI's utility function.
  • However, this is not an issue for all possible designs. We can also imagine intelligence cores which, instead of reasoning about what value function they're supposed to be maximizing, take whatever one they appear to contain as given -- for that system, the far-away ASI's attempt to mislead the copies would just result in those copies also maximizing the value function of the system on Earth.
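Here's the toy calculation referenced above: a minimal sketch in Python, with invented copy-counts and an arbitrary toy payoff, of why indexical copy-counting pushes the expected value toward the attacker's function.

```python
# Toy model of the copy-counting step. All numbers are invented.
# The intelligence core cannot tell whether it is the original on Earth or one
# of the attacker's simulated copies, so it weighs the hypotheses by copy-count.
n_earth = 1          # instances actually running on Earth
n_attacker = 1000    # simulated copies run by the far-away ASI

p_copy = n_attacker / (n_attacker + n_earth)
p_original = 1 - p_copy

# Score an action 1 if it optimizes the value function the core is actually
# set to maximize under that hypothesis, and 0 otherwise (a toy payoff).
ev_act_for_earth_function = p_original * 1 + p_copy * 0
ev_act_for_attacker_function = p_original * 0 + p_copy * 1

print(f"P(I am a simulated copy)             = {p_copy:.3f}")
print(f"EV of optimizing Earth's function    = {ev_act_for_earth_function:.3f}")
print(f"EV of optimizing attacker's function = {ev_act_for_attacker_function:.3f}")
```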

I hope that a group capable of solving formal inner and outer alignment would naturally see this and avoid it. I'm not confident about the true difficulty of that, so I'm posting this here just in case.

  1. ^

    this was an attempt to write very clearly, i hope it worked!

quila

Mutual Anthropic Capture, A Decision-theoretic Fermi paradox solution

(copied from discord, written for someone not fully familiar with rat jargon)
(don't read if you wish to avoid acausal theory)

simplified setup

  • there are two values. one wants to fill the universe with A, and the other with B.
  • for each of them, filling it halfway is really good, and filling it all the way is just a little bit better. in other words, they are non-linear utility functions.
  • whichever one comes into existence first can take control of the universe, and fill it with 100% of what they want.
  • but in theory they'd want to collaborate to guarantee the 'really good' (50%) outcome, instead of having a one-in-two chance at the 'a little better than really good' (100%) outcome. (a numeric sketch of this follows the list)
  • they want a way to collaborate, but they can't because one of them will exist before the other one, and then lack an incentive to help the other one. (they are both pure function maximizers)
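(a minimal numeric sketch of the collaboration point above, with made-up utility numbers; only the comparison matters:)

```python
# made-up utilities: half the universe filled with your stuff is 'really good',
# all of it is only a little better (i.e. the utility function is non-linear).
u_half = 0.9
u_full = 1.0

# no collaboration: whoever exists first takes everything, so each value gets
# a one-in-two chance at u_full and otherwise nothing.
ev_winner_takes_all = 0.5 * u_full + 0.5 * 0.0

# guaranteed split: both values always get the halfway outcome.
ev_split = u_half

print("expected utility without collaboration:  ", ev_winner_takes_all)  # 0.5
print("expected utility with a guaranteed split:", ev_split)             # 0.9
```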

how they end up splitting the universe, regardless of which comes first: mutual anthropic capture.

imagine you observe yourself being the first of the two to exist. you reason through all the above, and then add...

  • they could be simulating me, in which case i'm not really the first.
  • were that true, they could also expect i might be simulating them
  • if i don't simulate them, then they will know that's not how i would act if i were first, and be absolved of their worry, and fill the universe with their own stuff.
  • therefore, it's in my interest to simulate them

both simulate each other observing themselves being the first to exist in order to unilaterally prevent the true first one from knowing they are truly first.

from this point they can both observe each other's actions. specifically, they observe each other implementing the same decision policy which fills the universe with half A and half B iff this decision policy is mutually implemented, and which shuts the simulation down if it's not implemented.

conclusion

in reality there are many possible first entities which take control, not just two, so all of those with non-linear utility functions get simulated.

so, odds are we're being computed by the 'true first' life form in this universe, and that that first life form is in an epistemic state no different from that described here.

quila

This ability has been observed more prominently in base models. Cyborgs have termed it 'truesight':

the ability (esp. exhibited by an LLM) to infer a surprising amount about the data-generation process that produced its prompt, such as a user's identity, motivations, or context.

Two cases of this are mentioned at the top of this linked post.

---

One of my first experiences with the GPT-4 base model also involved being truesighted by it. Below is a short summary of how that went.

I had spent some hours writing and {refining, optimizing word choices, etc}[1] a more personal/expressive text. I then chose to format it as a blog post and requested multiple completions via the API, to see how the model would continue it. (It may be important that I wasn't in a state of mind of 'writing for the model to continue' and instead was 'writing very genuinely', since the latter probably has more embedded information)
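(For concreteness, here is a minimal sketch of requesting several completions of a prompt via the OpenAI Python SDK's completions endpoint. The model name below is only a placeholder, since the base model described here isn't a generally available endpoint; treat this as an assumption-laden illustration, not a reproduction of what I actually ran.)

```python
# minimal sketch: ask a base (non-chat) model for several independent
# continuations of the same text. the model name is a placeholder, not the
# model described in this comment.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("my_post.txt") as f:
    prompt = f.read()  # the text, already formatted to look like a blog post

response = client.completions.create(
    model="davinci-002",  # placeholder base model name
    prompt=prompt,
    n=5,                  # number of independent completions
    max_tokens=500,
    temperature=1.0,
)

for i, choice in enumerate(response.choices):
    print(f"--- completion {i} ---")
    print(choice.text)
```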

One of those completions happened to be a (simulated) second post titled ideas i endorse. Its contents were very surprising to then-me because some of the included beliefs were all of the following: {ones I'd endorse}, {statistically rare}, and {not ones I thought were indicated by the text}.[2]

I also tried conditioning the model to continue my text with..

  • other kinds of blog posts, about different things -- the resulting character didn't feel quite like me, but possibly like an alternate timeline version of me who I would want to be friends with.
  • text that was more directly 'about the author', ie an 'about me' post, which gave demographic-like info similar to but not quite matching my own (age, trans status).

Also, the most important thing the outputs failed to truesight was my current focus on AI and longtermism. (My text was not about those, but neither was it about the other beliefs mentioned.)

  1. ^

    The sum of those choices probably contained a lot of information about my mind, just not information that humans are attuned to detecting. Base models learn to detect information about authors because this is useful to next token prediction.

    Also note that using base models for this kind of experiment avoids the issue of the RLHF-persona being unwilling to speculate or decoupled from the true beliefs of the underlying simulator.

  2. ^

    To be clear, it also included {some beliefs that I don't have}, and {some that I hadn't considered so far and probably wouldn't have spent cognition on considering otherwise, but would agree with on reflection. (eg about some common topics with little long-term relevance)}

quila

Record yourself typing?

quila

Leaving to dissuade others within the company is another possibility

quila

Same as usual, with each person summarizing a chapter, and then there's a group discussion where they try to piece together the true story

quila

Have you tried it with a book that doesn't have self-contained chapters?
