7 alignment is infinite
- What is “the alignment problem”?
- We could say that “the big alignment problem” is to make it so
things go well around thinking better forever, maybe by devising a good
protocol for adopting new [ways of thinking]/[thinking-structures]. I
think this big alignment problem is probably in the “complexity class”
of infinite problems described in thesis 1; so, we should perhaps say
the “alignment endeavor” instead.
- There are also various “small alignment problems” — for instance,
(1) there is the problem of creating a system smarter than humanity
which is fine to create, and (2) there is the problem of ending the
current period of (imo) unusually high risk of \(\approx\)everything worthwhile being lost
forever because of AI. Problem (1) is quite solvable,
because humanity-next-year will be such a system, or
because humans genetically modified to be somewhat smarter than the
smartest humans currently alive would probably be fine to create, or
because there is probably a kind of mind upload which is fine to create
in some context which we could set up (with effort).
Problem (2) is also quite solvable, because conditional on it indeed
being a very bad idea to make a smarter-than-human and non-human
artifact, it is possible to get humanity to understand that it is a very
bad idea and act responsibly given that understanding (ban AGI) and
severely reduce the annual risk of everything meaningful being wiped
out.
- Why is the big alignment problem infinite?
- It seems likely that one ought to be careful about [becoming
smarter]/[new thinking/doing-structures coming into play] forever, and
that this being careful just isn’t the sort of thing for which a
satisfactory protocol can be specified beforehand, but the sort of thing
where indefinitely many arbitrarily different new challenges will need
to be met as they come up, preferably with the full force of one’s
understanding at each future time (as opposed to being met with some
fixed protocol that could be specified at some finite time). There is an
infinitely rich variety of new ways of thinking that one should be open
to adopting, and it’d be bizarre if decisions about this rich variety of
things could be appropriately handled by any largely fixed
protocol.
- To say more: there’s no hope of well-analyzing anything but an
infinitesimal fraction of the genuinely infinite space of potential
[thinking/doing]-structures, and it is hard to tell what needs to be
worked out ahead of time. Well-handling a particular
[thinking/doing]-structure often requires actually thinking about it in
particular to a significant extent (though one can of course bring to
bear a system of understanding built for handling previous things,
also), and there’s a tension in being able to do that significantly
before the thinking-structure comes on the scene, because (1) if the
structure can be seen clearly enough to be well-analyzed or even
identified as worthy of attention, then it is often already in use
or at least close to coming into use and (2) understanding it
acceptably well often sorta requires playing around with it, so it must
plausibly already be used (though maybe only “in a laboratory setting”)
for it to be adequately understood. Like, it’s hard to imagine a bright
person in 1900 identifying the internet as a force which should be
understood and managed and going on to understand it adequately; it’s
hard to imagine someone before Euclid (or before whenever the axiomatic
method in mathematics was actually developed) developing a body of
understanding decent for understanding the axiomatic method in
mathematics (except by developing the axiomatic method in mathematics
oneself) (and many important things about it were in fact only
understood two millennia later by Gödel and company); it’s hard to
imagine humans without language being well-prepared for the advent of
language ahead of time (this is an example where the challenge is
particularly severe). So, we should expect that in many cases, the
capacity to adequately handle a novel [thinking/doing]-structure would
be developed only after it comes on the scene or as it is coming on
the scene or only a bit before it comes on the scene.
- A potential response: “Okay, let’s say I agree that there is this
infinitely rich space of thinking-structures, and that one really just
needs to keep thinking to handle this infinitely rich domain. But
couldn’t there be a finite Grand Method for doing this thinking?”. My
brief response is that this thinking will need to be rich and developing
to be up to the challenge (as long as one is to continue to develop). It
seems pretty clear that this question is roughly equivalent to “couldn’t
there be a Protocol for math/science?”; so, see Notes 1–6 for a longer
response to this question. (And if you try to go meta more times, I’ll
just keep giving the same response. It’s not like the higher meta-levels
are any easier; actually, it’s not even like we’d want them to be
handled by some very distinct thinking.)
- This isn’t to say that handling alignment well looks like handling
an infinite variety of completely unique particulars (just like that’s
not what (doing) math looks like). One still totally can and totally
should be developing methods/tools/ideas/understanding with broad
applicability (just like one does in math) — it’s just that this is an
infinite endeavor. For example, I think it’s a very good broadly
applicable “idea” to become smarter oneself instead of trying to create
some sort of benevolent alien entity. A further very good broadly
applicable “idea” is to be extremely careful/thoughtful about becoming
smarter as you are becoming smarter.
- Even though it’s sort of confused to conceive of “the alignment
problem” as a finite thing that could be solved, confused to imagine a
textbook from the future solving alignment, it is totally sensible and
very good for there to be a philosophico-scientifico-mathematical field
that studies the infinitely rich variety of questions around how one should become smarter (without
killing/losing oneself too much). We could call that field
“alignment”. (It might make sense for alignment
to be a topic of interest for people in many fields, not a very separate
field in its own right; it should probably be done by researchers who
also think more broadly about how thinking works, and about many other
philosophical and mathematical matters.)
- But again, for each particular time, the problem of making an
intelligent system which is smarter than we are currently and which is
fine to make is totally a finite problem, e.g. because it is fine to
become smarter by doing/learning more math/science/philosophy. Also, one
might even get to a position where one could reasonably think that
things will probably be good over a very long stretch of gaining
capabilities (I’m currently very uncertain here).
But even if this is possible, (I claim) this cannot be achieved by
writing a textbook ahead of time for how to become smarter and just
following that textbook (or having the protocol specified in this
textbook in place) over a long stretch of time — as one proceeds, one
will need to keep filling the shelves of an infinite alignment library
with further textbooks one is writing.
- Let me speedrun some other interesting questions about which
textbooks could be written. Of course, the questions really deserve much
more careful treatment. In my answers in this speedrun (in particular,
in the specification gaming I will engage in), I will be guided by some
background views which I have not yet properly laid out or justified in
this initial segment of the notes — specifically, including and
surrounding the view that it is a tremendously bad idea to make an
artifact which is more intelligent than us and distinct/separate from us
any time soon (and maybe ever). But laying out and justifying these
views better is a central aim of the remainder of these notes.
- “Is there some possible 100000-word text which would get us out of
the present period of (imo) acute x-risk from artificial intelligence?”
I think there probably is such a book, because if I’m indeed right that
we are currently living through a period of acute risk, there could be a
100000-word text making a case for this which is compelling enough that
it gets humanity to proceed with much more care; alternatively, one
could specify a way to make humanity sufficiently smarter/wiser that we
realize we are living through a period of acute risk ourselves (again,
assuming this is indeed so); alternatively, one could give the most
careful humans a “recipe” for mind uploads and instructions for how to
set up a context in which uploading all humans meaningfully decreases
x-risk (per unit of subjective time, anyway); etc — there is probably a
great variety of 100000-word texts that would do this. Any such text has
a probability of at least \(\left(10^5\right)^{-100000}\) to be
“written” by a (quantum) random number generator, so for any such text,
there is at least this astronomically small probability we would “write
it” ourselves (lol); but really, I think it is realistically possible
(idk, like, \(p>10^{-4}\) if we
try?) for us to write any of the example texts from the previous
sentence ourselves.
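- To spell out the arithmetic behind that lower bound (a rough sketch under assumptions I am adding here: the generator picks each of the \(N = 100000\) words independently and uniformly from a fixed vocabulary of \(V = 10^5\) words, and the text in question uses only words from that vocabulary; \(N\) and \(V\) are just labels for this illustration):
\[
P(\text{a given text}) \;=\; \prod_{i=1}^{N} \frac{1}{V} \;=\; V^{-N} \;=\; \left(10^5\right)^{-100000} \;=\; 10^{-500000}.
\]
A different vocabulary size would change the exponent but not the point, which is only that this probability is positive and absurdly small.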
- “But is there a text which lets us make an AI which brings our
current era of high x-risk to a close?” Sorta yes, e.g. because mind
uploads are possible and knowing how to make mind uploads could get us
to a world where most people are uploaded and become somewhat better at
thinking and understand that allowing careless fooming is a bad idea
(assuming it indeed is a bad idea) and such a world would plausibly have
lower p-doom per subjective year (especially if the textbook also
provides various governance ideas) :).
- “But is there one which lets us make a much more alien mind which is
smarter than us which is good to make?” Sorta yes, e.g. because we could
still have the alien mind output a different book and self-destruct
:).
- “Argh, but is there a text which lets us make a more alien mind
which is smarter than us and good to make which then directly does big
things in the world (+ additional conditions to prevent further
specification gaming)?” I think there probably is
a text which would give the source code of a mind which is already
somewhat smarter than us, which is only becoming smarter in very
restricted ways, and which does nothing except destroying any AGIs that
get built, withstanding human attempts to disable it for a century
(without otherwise doing anything bad to humans) and then
self-destructing.
- “But is there a text which looks like a textbook (so, not like
unintelligible source code) which lets us understand relevant things
well enough that we can build an alien AI that does the thing you stated
in the previous response?” Now that the textbook
cannot directly provide advanced understanding that the AI could use and
it probably needs to become really smart from scratch and we cannot be
handed a bespoke edit to make to the AI which makes it “sacrifice
itself” for us this way, it seems much tougher, but I guess it’s
probably possible, even though I lack any significantly “constructive”
argument. I’m very unsure/confused about whether there will ever be a
time in future history when we could reasonably write this kind of text
ourselves (such that it is adequate for our situation then). I think
it’s plausible that there will be no future point at which we should try
to execute a plan of this kind that we devised ourselves (over
continuing to do something that looks much more like becoming smarter
ourselves).
- “What if the AI we can make given the textbook is supposed to act as
a guardian of humanity over a very long stretch of [it and humanity]
becoming smarter, with it staying ahead of humanity indefinitely? Or
maybe there could be some other futures in which this AI is the smartest
thing around indefinitely that are still somehow good?”
The median difficulty of this is significantly higher still. The
question makes me want to return to specification gaming (e.g. what if
the AI “becomes a human” after a while? does it count if it basically
has the values and understanding of a human but after some careful
fooming so it is alien in a sense?). I sorta feel like not answering
this particular question here (partly because I don’t understand this
matter well at present), but I will say the following: (1) this is
minimally incredibly difficult and we should not be making an AI which
would indefinitely be smarter than humanity any time soon; (2) I will
provide (further) arguments for (1) in later notes in this sequence; (3)
I think it is likely that this will never be a good idea (and we should
just continue to become smarter ourselves); (4) these questions about
the possibility of such a thing will also be addressed in later notes in
this sequence, though somewhat obliquely; and (the primary point of the
present note is/implies that) (5) work on how one ought to become
smarter (alignment) would need to be continued by such an AI
indefinitely.
- And, to be clear, nowhere here am I talking particularly of
[provably having a good future around gains in
intelligence] being an infinite problem which requires indefinite work
— instead, I am saying that if we are to handle gains in intelligence
even remotely well, we must be thinking our way through genuinely new
challenges indefinitely — there is not some finite grand protocol to be
found to adequately handle gains in capability. One’s body of
understanding for handling becoming smarter will need to be rich,
developing (if it is to remain even remotely up to the task), and
forever mostly incomplete.
- Even conditioning on alignment being a finite problem to be
“solved”, I think it could only be solved as one of the very last
problems to be solved ever — like, it gets solved around when math and
technology get “solved”, not earlier. What’s more, at that point (once
math and technology have been solved), there’d probably no longer be any
need to solve the alignment problem anyway, because there probably
wouldn’t be anything left which one would want this further intelligent
system to do for one.
- In case you happen to be hopeful about any plans for navigating
super-human AI involving getting AIs to “solve alignment” (in the sense
of solving it for good) for us, it might be worth revisiting those hopes
after pondering the above points.
- To what extent does alignment being infinite (for our purposes)
depend on some particular property of our values? Does my claim that
alignment is infinite (for our purposes) rest on some implicit claim
about our values?
- First, to clarify, the central claim of this note is that for pretty
much any values, if one is to become arbitrarily better at thinking, one
should(-according-to-one’s-own-values) really keep thinking about how
one should become better at thinking, with this thinking being
always-mostly-unfinished and developing.
- But, separately from the above conditional claim, maybe there are
some values such that one shouldn’t(-according-to-those-values) become
arbitrarily better at thinking? And if you aren’t looking to become
arbitrarily capable, maybe alignment needn’t be an infinite endeavor for
you? Given this, isn’t the infinitude-for-us of alignment dependent on
some fact about our values?
- Sure, some intelligent being probably could have values such that
they would e.g. be happy to make fairly dumb probes which (absent
intelligent intervention) turn most planets in our galaxy into diamond
or something, and then make and launch those and eliminate all
intelligent entities able to stop the probes (including themselves). We
could maybe kinda-sorta imagine a group of human terrorists doing this
at some point in some futures. For such an entity, alignment needn’t be
an infinite endeavor — such a hypothetical entity could be fine with
stopping thinking about how to become smarter once it has finished doing
its finite thing.
- But I think such values are really “strange/unlikely” — I think that
almost any values ask one to continue becoming more intelligent/capable
at any stage of development (though temperance in one’s fooming might
well be commonly considered right). Minimally, this is to solve various
problems one faces and to make progress on one’s various projects more
generally; it is probably also common to have many profound projects
which are just fairly directly about becoming smarter (just like
humanity already to some extent “terminally cares” about various
philosophical/mathematical/scientific/technological projects). In
particular, our own values are probably (going to continue being) like
this.
- So, I think our values will continue asking us to become more
capable, and that this isn’t really because of some empirical
contingency — it is not the case that we could well have ended up with
values that are not like this. However, I do think this is downstream of
some “logical facts” about values that one could well be wrong/confused
about. These themes will receive further treatment in upcoming notes in
this sequence.
- This isn’t to say that you should be unambitious when trying (to do
research) to make things go well around intelligence/capability gains.
I’m asking you not to be a certain kind of (imo) crazy/silly, but
conditional on not being that kind of crazy/silly, please be very
ambitious. Most alignment/safety researchers would do well to work on
much more ambitious projects.
- In particular, alignment being infinite and any progress on
alignment being an infinitesimal fraction of all possible progress does
not imply that anything is as good as anything else — there are still
totally things which are more substantive/important/useful and things
which are less substantive/important/useful, just like in math. And even
more strongly, alignment being infinite does not imply that whatever
research you’re doing is good.
onward to Note 8!