HIGHLIGHTS
Alignment difficulty (Richard Ngo and Eliezer Yudkowsky) (summarized by Rohin): Eliezer is known for being pessimistic about our chances of averting AI catastrophe. His argument in this dialogue is roughly as follows:
1. We are very likely going to
keep improving AI capabilities until we reach AGI, at which point
either the world is destroyed, or we use the AI system to take some
pivotal act before some careless actor destroys the world.
2. In either case, the AI system must be producing high-impact, world-rewriting plans; such plans are “consequentialist” in the sense that the simplest way to produce them (and thus, the kind of system we will build first) is to forecast what might happen, think about the expected consequences, consider possible obstacles, search for routes around those obstacles, and so on (see the toy sketch after this list). Without this sort of reasoning, a plan quickly goes off the rails and is highly unlikely to have a high impact. In particular, long lists of shallow heuristics (as with current deep learning systems) are unlikely to be enough to produce high-impact plans.
3. We’re producing AI systems by
selecting for systems that can do impressive stuff, which will
eventually produce AI systems that can accomplish high-impact plans
using a general underlying “consequentialist”-style reasoning
process (because that’s the only way to keep doing more impressive
stuff). However, this selection process
does not constrain the goals towards which those
plans are aimed. In addition, most goals seem to have convergent instrumental subgoals, like survival and power-seeking, that would lead to human extinction. This suggests that we should expect an existential catastrophe by default.
4. None of the methods people
have suggested for avoiding this outcome seem like they actually
avert this story.
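To make the “consequentialist” reasoning in point 2 concrete, here is a minimal toy sketch of the forecast-evaluate-search loop it describes. The action set, world model, and goal below are my own illustrative assumptions, not anything specified in the dialogue; treat it as an intuition pump rather than a claim about how an AGI would actually be implemented.

```python
# Toy sketch of "consequentialist"-style planning: forecast the
# consequences of candidate action sequences, score them against a goal,
# and keep whichever plan best routes around obstacles. The actions,
# world model, and goal here are invented purely for illustration.
from itertools import product

ACTIONS = ["gather_resources", "build_tool", "use_tool", "wait"]

def forecast(state, action):
    """Crude world model: predict the next state given an action."""
    state = dict(state)
    if action == "gather_resources":
        state["resources"] = state.get("resources", 0) + 1
    elif action == "build_tool" and state.get("resources", 0) >= 2:
        state["has_tool"] = True          # obstacle: needs 2 resources first
    elif action == "use_tool" and state.get("has_tool"):
        state["goal_done"] = True
    return state

def goal_score(state):
    """Evaluate expected consequences: did the plan achieve the goal?"""
    return 1.0 if state.get("goal_done") else 0.0

def plan(initial_state, horizon=4):
    """Search over action sequences and return the best-forecasted plan."""
    best_plan, best_score = None, -1.0
    for candidate in product(ACTIONS, repeat=horizon):
        state = initial_state
        for action in candidate:
            state = forecast(state, action)   # think through consequences
        score = goal_score(state)
        if score > best_score:
            best_plan, best_score = list(candidate), score
    return best_plan, best_score

# Finds gather, gather, build_tool, use_tool; shallow heuristics that
# ignore the resource obstacle would not reliably produce such a plan.
print(plan({"resources": 0}))
```

The point of Eliezer’s argument is that any process producing genuinely high-impact plans has to be doing something like this forecast-and-search, only over a vastly richer model of the world.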
Richard responds to this with a
few distinct points:
1. It might be possible to build AI systems that lack world-destroying levels of intelligence and agency, and that humans can use to save the world. For example, we could make AI systems that do better alignment research. Such AI systems do not seem to require the property of making long-term plans in the real world described in point (3) above, and so could plausibly be safe.
2. It might be possible to build general AI systems that only state plans for achieving a goal of interest that we specify, without executing those plans.
3. It seems possible to create consequentialist systems with constraints on their reasoning that reduce the risk they pose.
4. It also seems possible to create systems whose primary aim is to produce plans with certain properties (that aren't just about outcomes in the world), such as corrigibility (AN #35) or deference to a human user.
5. (Richard is also more bullish
on coordinating not to use powerful and/or risky AI systems, though
the debate did not discuss this much.)
Eliezer’s responses:
1. AI systems that help with
alignment research to such a degree that it actually makes a
difference are almost certainly already dangerous.
2. It is the plan itself that is risky: if the AI system made a plan for a goal that wasn’t the one we actually meant, and we don’t understand that plan, the plan can still cause extinction when carried out. The misaligned optimization that produced the plan is what makes it dangerous.
3 and 4. Such things are certainly possible; the space of minds that could be designed is very large. However, they are difficult to do, since such constraints tend to weaken consequentialist reasoning, and on our current trajectory the first AGI that we build will probably not look like that.
This post has also been
summarized by others here, though with different
emphases than in my summary.