7 alignment is infinite

  1. What is “the alignment problem”?
    1. We could say that “the big alignment problem” is to make it so things go well around thinking better forever, maybe by devising a good protocol for adopting new [ways of thinking]/[thinking-structures]. I think this big alignment problem is probably in the “complexity class” of infinite problems described in thesis 1; so, we should perhaps say the “alignment endeavor” instead.
    2. There are also various “small alignment problems” — for instance, (1) there is the problem of creating a system smarter than humanity which is fine to create, and (2) there is the problem of ending the current period of (imo) unusually high risk of \(\approx\)everything worthwhile being lost forever because of AI.1 Problem (1) is quite solvable, because humanity-next-year will be such a system,2 or because humans genetically modified to be somewhat smarter than the smartest humans currently alive would probably be fine to create, or because there is probably a kind of mind upload which is fine to create in some context which we could set up (with effort).3 Problem (2) is also quite solvable, because conditional on it indeed being a very bad idea to make a smarter-than-human and non-human artifact, it is possible to get humanity to understand that it is a very bad idea and act responsibly given that understanding (ban AGI) and severely reduce the annual risk of everything meaningful being wiped out.
  2. Why is the big alignment problem infinite?
    1. It seems likely that one ought to be careful about [becoming smarter]/[new thinking/doing-structures coming into play] forever, and that this being careful just isn’t the sort of thing for which a satisfactory protocol can be specified beforehand, but the sort of thing where indefinitely many arbitrarily different new challenges will need to be met as they come up, preferably with the full force of one’s understanding at each future time (as opposed to being met with some fixed protocol that could be specified at some finite time). There is an infinitely rich variety of new ways of thinking that one should be open to adopting, and it’d be bizarre if decisions about this rich variety of things could be appropriately handled by any largely fixed protocol.
    2. To say more: There’s no hope to well-analyze anything but an infinitesimal fraction of the genuinely infinite space of potential [thinking/doing]-structures, and it is hard to tell what needs to be worked out ahead of time. Well-handling a particular [thinking/doing]-structure often requires actually thinking about it in particular to a significant extent (though one can of course bring to bear a system of understanding built for handling previous things, also), and there’s a tension in being able to do that significantly before the thinking-structure comes on the scene, because (1) if the structure can be seen clearly enough to be well-analyzed or even identified as worthy of attention, then it is often already in use or at least close to coming into use and (2) understanding it acceptably well often sorta requires playing around with it, so it must plausibly already be used (though maybe only “in a laboratory setting”) for it to be adequately understood. Like, it’s hard to imagine a bright person in 1900 identifying the internet as a force which should be understood and managed and going on to understand it adequately; it’s hard to imagine someone before Euclid (or before whenever the axiomatic method in mathematics was actually developed) developing a body of understanding decent for understanding the axiomatic method in mathematics (except by developing the axiomatic method in mathematics oneself) (and many important things about it were in fact only understood two millennia later by Gödel and company); it’s hard to imagine humans without language being well-prepared for the advent of language ahead of time (this is an example where the challenge is particularly severe). So, we should expect that in many cases, the capacity to adequately handle a novel [thinking/doing]-structure would be developed only after it comes on the scene or as it is coming on the scene or only a bit before it comes on the scene.
    3. A potential response: “Okay, let’s say I agree that there is this infinitely rich space of thinking-structures, and that one really just needs to keep thinking to handle this infinitely rich domain. But couldn’t there be a finite Grand Method for doing this thinking?”. My brief response is that this thinking will need to be rich and developing to be up to the challenge (as long as one is to continue to develop). It seems pretty clear that this question is roughly equivalent to “couldn’t there be a Protocol for math/science?”; so, see Notes 1–6 for a longer response to this question. (And if you try to go meta more times, I’ll just keep giving the same response. It’s not like the higher meta-levels are any easier; actually, it’s not even like we’d want them to be handled by some very distinct thinking.)
    4. This isn’t to say that handling alignment well looks like handling an infinite variety of completely unique particulars (just like that’s not what (doing) math looks like). One still totally can and totally should be developing methods/tools/ideas/understanding with broad applicability (just like one does in math) — it’s just that this is an infinite endeavor. For example, I think it’s a very good broadly applicable “idea” to become smarter oneself instead of trying to create some sort of benevolent alien entity. A further very good broadly applicable “idea” is to be extremely careful/thoughtful about becoming smarter as you are becoming smarter.
  3. Even though it’s sort of confused to conceive of “the alignment problem” as a finite thing that could be solved, confused to imagine a textbook from the future solving alignment, it is totally sensible and very good for there to be a philosophico-scientifico-mathematical field that studies the infinitely rich variety of questions around how one4 should become smarter (without killing/losing oneself too much). We could call that field “alignment”.5 (It might make sense for alignment to be a topic of interest for people in many fields, not a very separate field in its own right; it should probably be done by researchers that think also more broadly about how thinking works, and about many other philosophical and mathematical matters.)
    1. But again, for each particular time, the problem of making an intelligent system which is smarter than we are currently and which is fine to make is totally a finite problem, e.g. because it is fine to become smarter by doing/learning more math/science/philosophy. Also, one might even get to a position where one could reasonably think that things will probably be good over a very long stretch of gaining capabilities (I’m currently very uncertain here6). But even if this is possible, (I claim) this cannot be achieved by writing a textbook ahead of time for how to become smarter and just following that textbook (or having the protocol specified in this textbook in place) over a long stretch of time — as one proceeds, one will need to keep filling the shelves of an infinite alignment library with further textbooks one is writing.
    2. Let me speedrun some other interesting questions about which textbooks could be written. Of course, the questions really deserve much more careful treatment. In my answers in this speedrun (in particular, in the specification gaming I will engage in), I will be guided by some background views which I have not yet properly laid out or justified in this initial segment of the notes — specifically, the view (and views surrounding it) that it is a tremendously bad idea to make an artifact which is more intelligent than us and distinct/separate from us any time soon (and maybe ever). But laying out and justifying these views better is a central aim of the remainder of these notes.
      1. “Is there some possible 100000-word text which would get us out of the present period of (imo) acute x-risk from artificial intelligence?” I think there probably is such a book, because if I’m indeed right that we are currently living through a period of acute risk, there could be a 100000-word text making a case for this which is compelling enough that it gets humanity to proceed with much more care; alternatively, one could specify a way to make humanity sufficiently smarter/wiser that we realize we are living through a period of acute risk ourselves (again, assuming this is indeed so); alternatively, one could give the most careful humans a “recipe” for mind uploads and instructions for how to set up a context in which uploading all humans meaningfully decreases x-risk (per unit of subjective time, anyway); etc — there is probably a great variety of 100000-word texts that would do this. Any such text has a probability of at least \(\left(10^5\right)^{-100000}\) of being “written” by a (quantum) random number generator, so for any such text, there is at least this astronomically small probability we would “write it” ourselves (lol); but really, I think it is realistically possible (idk, like, \(p>10^{-4}\) if we try?) for us to write any of the example texts from the previous sentence ourselves.
      2. “But is there a text which lets us make an AI which brings our current era of high x-risk to a close?” Sorta yes, e.g. because mind uploads are possible and knowing how to make mind uploads could get us to a world where most people are uploaded and become somewhat better at thinking and understand that allowing careless fooming is a bad idea (assuming it indeed is a bad idea) and such a world would plausibly have lower p-doom per subjective year (especially if the textbook also provides various governance ideas) :).
      3. “But is there one which lets us make a much more alien mind which is smarter than us which is good to make?” Sorta yes, e.g. because we could still have the alien mind output a different book and self-destruct :).
      4. “Argh, but is there a text which lets us make a more alien mind which is smarter than us and good to make which then directly does big things in the world (+ additional conditions to prevent further specification gaming) which is good to make?” I think there probably is a text which would give the source code of a mind which is already somewhat smarter than us, which is only becoming smarter in very restricted ways, and which does nothing except destroying any AGIs that get built, withstanding human attempts to disable it for a century (without otherwise doing anything bad to humans) and then self-destructing.7
      5. “But is there a text which looks like a textbook (so, not like unintelligible source code) which lets us understand relevant things well enough that we can build an alien AI that does the thing you stated in the previous response?”8 Now that the textbook cannot directly provide advanced understanding that the AI could use and it probably needs to become really smart from scratch and we cannot be handed a bespoke edit to make to the AI which makes it “sacrifice itself” for us this way, it seems much tougher, but I guess it’s probably possible, even though I lack any significantly “constructive” argument. I’m very unsure/confused about whether there will ever be a time in future history when we could reasonably write this kind of text ourselves (such that it is adequate for our situation then). I think it’s plausible that there will be no future point at which we should try to execute a plan of this kind that we devised ourselves (over continuing to do something that looks much more like becoming smarter ourselves9).
      6. “What if the AI we can make given the textbook is supposed to act as a guardian of humanity over a very long stretch of [it and humanity] becoming smarter, with it staying ahead of humanity indefinitely? Or maybe there could be some other futures in which this AI is the smartest thing around indefinitely that are still somehow good?”10 The median difficulty of this is significantly higher still. The question makes me want to return to specification gaming (eg, what if the AI “becomes a human” after a while? does it count if it basically has the values and understanding of a human but after some careful fooming so it is alien in a sense?). I sorta feel like not answering this particular question here (partly because I don’t understand this matter well at present), but I will say the following: (1) this is minimally incredibly difficult and we should not be making an AI which would indefinitely be smarter than humanity any time soon; (2) I will provide (further) arguments for (1) in later notes in this sequence; (3) I think it is likely that this will never be a good idea (and we should just continue to become smarter ourselves); (4) these questions about the possibility of such a thing will also be addressed in later notes in this sequence, though somewhat obliquely; and (the primary point of the present note is/implies that) (5) work on how one ought to become smarter (alignment) would need to be continued by such an AI indefinitely.
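    3. As an aside, the \(\left(10^5\right)^{-100000}\) bound in the first question above is easy to sanity-check numerically. The sketch below assumes, purely for illustration, a vocabulary of \(10^5\) equally likely words and works in log space, since the probability itself is far below floating-point range:

```python
import math

# Hypothetical model (assumption, for illustration only): a 100000-word
# text, each word drawn uniformly from a vocabulary of 10^5 words.
vocab_size = 10**5
text_length = 100_000

# The probability of any one particular text is vocab_size ** -text_length,
# which underflows a float, so compute its base-10 logarithm instead.
log10_prob = -text_length * math.log10(vocab_size)
print(log10_prob)  # -500000.0, i.e. the probability is 10**-500000
```

So the probability of any particular such text is \(10^{-500000}\): nonzero, but (as the “lol” suggests) astronomically smaller than the \(p>10^{-4}\) estimate for writing such a text deliberately.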
  4. And, to be clear, nowhere here am I talking particularly of [provably having a good future around gains in intelligence] being an infinite problem which requires indefinite work — instead, I am saying that if we are to handle gains in intelligence even remotely well, we must be thinking our way through genuinely new challenges indefinitely — there is not some finite grand protocol to be found to adequately handle gains in capability. One’s body of understanding for handling becoming smarter will need to be rich, developing (if it is to remain even remotely up to the task), and forever mostly incomplete.
  5. Even conditioning on alignment being a finite problem to be “solved”, I think it could only be solved as one of the very last problems to be solved ever — like, it gets solved around when math and technology get “solved”, not earlier. What’s more, at that point (once math and technology have been solved), there’d probably no longer be any need to solve the alignment problem anyway, because there probably wouldn’t be anything left which one would want this further intelligent system to do for one.
  6. In case you happen to be hopeful about any plans for navigating super-human AI involving getting AIs to “solve alignment” (in the sense of solving it for good) for us, it might be worth revisiting those hopes after pondering the above points.11
  7. To what extent does alignment being infinite (for our purposes) depend on some particular property of our values? Does my claim that alignment is infinite (for our purposes) rest on some implicit claim about our values?
    1. First, to clarify, the central claim of this note is that for pretty much any values, if one is to become arbitrarily better at thinking, one should(-according-to-one’s-own-values) really keep thinking about how one should become better at thinking, with this thinking being always-mostly-unfinished and developing.
    2. But, separately from the above conditional claim, maybe there are some values such that one shouldn’t(-according-to-those-values) become arbitrarily better at thinking? And if you aren’t looking to become arbitrarily capable, maybe alignment needn’t be an infinite endeavor for you? Given this, isn’t the infinitude-for-us of alignment dependent on some fact about our values?
      1. Sure, some intelligent being probably could have values such that they would e.g. be happy to make fairly dumb probes which (absent intelligent intervention) turn most planets in our galaxy into diamond or something, and then make and launch those and eliminate all intelligent entities able to stop the probes (including themselves). We could maybe kinda-sorta imagine a group of human terrorists doing this at some point in some futures. For such an entity, alignment needn’t be an infinite endeavor — such a hypothetical entity could be fine with stopping thinking about how to become smarter once it has finished doing its finite thing.
      2. But I think such values are really “strange/unlikely” — I think that almost any values ask one to continue becoming more intelligent/capable at any stage of development (though temperance in one’s fooming might well be commonly considered right). Minimally, this is to solve various problems one faces and to make progress on one’s various projects more generally; it is probably also common to have many profound projects which are just fairly directly about becoming smarter (just like humanity already to some extent “terminally cares” about various philosophical/mathematical/scientific/technological projects). In particular, our own values are probably (going to continue being) like this.
      3. So, I think our values will continue asking us to become more capable, and that this isn’t really because of some empirical contingency — it is not the case that we could well have ended up with values that are not like this. However, I do think this is downstream of some “logical facts” about values that one could well be wrong/confused about. These themes will receive further treatment in upcoming notes in this sequence.
  8. This isn’t to say that you should be unambitious when trying (to do research) to make things go well around intelligence/capability gains. I’m asking you not to be a certain kind of (imo) crazy/silly, but conditional on not being that kind of crazy/silly, please be very ambitious. Most alignment/safety researchers would do well to work on much more ambitious projects.
    1. In particular, alignment being infinite and any progress on alignment being an infinitesimal fraction of all possible progress does not imply that anything is as good as anything else — there are still totally things which are more substantive/important/useful and things which are less substantive/important/useful, just like in math. And even more strongly, alignment being infinite does not imply that whatever research you’re doing is good.12

onward to Note 8!


  1. There are also various variants of the big alignment endeavor and various variants of these small problems and various other small problems and various finite toy problems with various relations to the big problem. If you have some other choices of interest here, what I say could, mutatis mutandis, still apply to your variants.↩︎

  2. of course, creating humanity-next-year might be not-fine on some absolute scale because it’ll maybe commit suicide with AI or with something else, but it’s not like it’s more suicidal than humanity-now, so it’s “fine” in a relative sense↩︎

  3. I do not consider this an exhaustive list of systems smarter than humanity which are fine and possible to create. For instance, it could be feasible at some point to create a system which is better than humans at many things and worse than humans at some things, perhaps being slightly smarter than humans when we perform some aggregation across all domains/skills, and which is highly non-human, but which wants to be friends with humans and to become smarter together instead of becoming smarter largely alone, with some nontrivial number of years of fooming together (perhaps after/involving a legitimate reconciliation/merge of that system with humans). But I think such a scenario is very weird/unlikely on our default path. More generally, I think the list I provide plausibly comes much closer to exhausting the space of “solutions” to the small alignment problem than most other current pictures of alignment would suggest. In particular, I think that creating some sort of fine-to-create super-human benefactor/guardian that “sits outside human affairs but supports them”/“looks at the human world from the outside” is quite plausibly going to remain a bad idea forever. (One could try to construct such a world with some humans fooming faster than others and acting as guardians, but I doubt this would actually end up looking much like them acting as guardians (though there could be some promising setup I haven’t considered here). I think it would very likely be much better than a universe with a random foom anyway, given that some kind of humanity might be doing very much and getting very far in such a universe — I just doubt it would be (a remotely unconditioned form of) this slower-fooming humanity that the faster-fooming humans were supposed to act as guardians for.) I elaborate on these themes in later notes.↩︎

  4. in particular, humanity↩︎

  5. One of my more absurd(ly expressed) hopes with these notes is to (help) push alignment out of its present “(imo) alchemical stage”, marked by people attempting to create some sort of ultimate “aligned artifact” — one which would plausibly solve all problems and make life good eternally (and also marked by people attempting to transmute arbitrary cognitive systems into “aligned” ones, though that deserves additional separate discussion). (To clarify: this analogy between the present alignment field and alchemy is not meant as a further argument — it just allows for a sillier restatement of a claim already made.)↩︎

  6. To say a bit more: I’m unsure about how long a stretch one can reasonably think this about. It obviously depends on the types of capability-gainings involved — for example, there are probably meaningful differences in “safety” between (a) doing algebraic geometry, (b) inventing/discovering probability and beginning to use it in various contexts, (c) adding “neurons” via brain-computer interface and doing some kind of RL-training to set weights so these neurons help you do math (I don’t actually have a specific design for this, and it might well be confused/impossible, but try to imagine something like this), and (d) making an AI hoping it will do advanced math for you, even once we “normalize” the items in this comparison so that each “grants the same amount of total capability”. These differences can to some extent be gauged ex ante, and at “capability gain parity”, there will probably be meaningful selection toward “safer” methods. Also relevant to the question about this “good stretch” (even making sense): the extent to which one’s values mostly make sense only “in one’s own time”. I’m currently confused about these and other relevant background matters, and so lack a clear view of how long a good stretch (starting from various future vantage points) one could reasonably hope for. I think I’d give probability at least \(10^{-6}\) to humanity having a worthwhile future of at least \(10^{10}\) “subjective years”.↩︎

  7. But this is plausibly still specification gaming.↩︎

  8. One could also ask more generally if there is a textbook for making an alien AI which does something delimited-but-transformative (like making mind uploads) and then self-destructs (and doesn’t significantly affect the world except via its making of mind uploads, and with the mind upload tech it provides not being malign).↩︎

  9. though most good futures surely involve us changing radically, and in particular involve a great variety of previously fairly alien AI components being used by/in humanity↩︎

  10. One can also ask versions where many alien AIs are involved (in parallel or in succession), with some alien AIs being smarter than humanity indefinitely. I’d respond to these versions similarly.↩︎

  11. That said, I’ll admit that even if you agree that alignment isn’t the sort of thing that can be solved, you could still think there are good paths forward on which AIs do alignment research for us, indefinitely handling the infinitely rich variety of challenges about how to proceed. I think that this isn’t going to work out — that such futures overwhelmingly have AIs doing AI things and basically nothing meaningfully human going on — but I admit that the considerations in these notes up until the present point are insufficient for justifying this. I’ll say more about this in later notes.↩︎

  12. In fact, I don’t know of any research making any meaningful amount of progress on the technical challenge of making any AI which is smarter than us and good to make. But also, I think it’s plausible we should be doing something much more like becoming smarter ourselves forever instead. These statements call for (further) elaboration and justification, which I aim to provide in upcoming notes in this sequence.↩︎