HIGHLIGHTS
Draft report on existential risk from power-seeking
AI (Joe Carlsmith) (summarized by Rohin):
This report investigates the classic AI risk argument in detail,
and decomposes it into a set of conjunctive claims. Here’s the
quick version of the argument. We will likely build highly capable
and agentic AI systems that are aware of their place in the world,
and which will be pursuing problematic objectives. Thus, they will
take actions that increase their power, which will eventually
disempower humans, leading to an existential catastrophe. We
will try to avert this, but will probably fail, since averting
it is technically challenging and we are not capable of the
necessary coordination.
There are a lot of vague words in the argument above, so let’s
introduce some terminology to make it clearer:
- Advanced capabilities: We say that a system has advanced
capabilities if it outperforms the best humans on some set of
important tasks (such as scientific research,
business/military/political strategy, engineering, and
persuasion/manipulation).
- Agentic planning: We say that a system engages in agentic
planning if it (a) makes and executes plans, (b) in pursuit of
objectives, (c) on the basis of models of the world. This is a
very broad definition, and doesn’t have many of the connotations
you might be used to for an agent. It does not need to be a
literal planning algorithm -- for example, human cognition would
count, despite (probably) not being just a planning algorithm.
- Strategically aware: We say that a system is strategically
aware if it models the effects of gaining and maintaining power
over humans and the real-world environment.
- PS-misaligned (power-seeking misaligned): On some inputs, the
AI system seeks power in unintended ways, due to problems with
its objectives. (If the system actually receives such inputs,
then it is practically PS-misaligned.)
The core argument is then that AI systems with advanced
capabilities, agentic planning, and strategic awareness
(APS-systems) will be practically PS-misaligned, to an extent that
causes an existential catastrophe. Of course, we will try to
prevent this -- why should we expect that we can’t fix the problem?
The author considers possible remedies, and argues that they all
seem quite hard:
- We could give
AI systems the right objectives (alignment), but this seems quite
hard -- it’s not clear how we would solve either outer or inner
alignment.
- We could try
to shape objectives to be e.g. myopic, but we don’t know how to do
this, and there are strong incentives against myopia.
- We could try
to limit AI capabilities by keeping systems special-purpose rather
than general, but there are strong incentives for generality, and
some special-purpose systems can be dangerous, too.
- We could try
to prevent the AI system from improving its own capabilities, but
this requires us to anticipate all the ways the AI system could
improve, and there are incentives to create systems that learn and
change as they gain experience.
- We could try to restrict deployment to circumstances in which
we know the AI system won’t seek power. However, this seems
harder and harder to do as capabilities increase, since with
more capabilities, more options become available.
- We could impose
a high threshold of safety before an AI system is deployed, but the
AI system could still seek power during training, and there are
many incentives pushing for faster, riskier deployment (even if we
have already seen warning shots).
- We could try
to correct the behavior of misaligned AI systems, or mitigate their
impact, after deployment. This seems like it requires humans to
have comparable or superior power to the misaligned systems in
question, though; and even if we are able to correct the problem at
one level of capability, we need solutions that scale as our AI
systems become more powerful.
The author breaks the overall argument into six conjunctive
claims, assigns probabilities to each of them, and ends up
computing a 5% probability of existential catastrophe from
misaligned, power-seeking AI by 2070. This is a lower bound: the
six claims jointly build in a fair number of assumptions, and
there are risk scenarios that violate some of them, so the
author would shade the overall estimate upward by another couple
of percentage points.
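To make the arithmetic concrete, here is a minimal sketch (in
Python) of how a conjunctive decomposition like this turns
premise-level credences into a headline number. The six values
below are illustrative placeholders chosen so that their product
is roughly 5%; this summary only reports the bottom line, so see
the full report for the author's actual premise-level
probabilities.

```python
import math

# Illustrative placeholder credences for the six conjunctive claims, each
# read as conditional on the previous claims holding. These are NOT taken
# from this summary (which only reports the ~5% total); see the full report
# for the author's actual numbers.
premise_credences = [0.65, 0.80, 0.40, 0.65, 0.40, 0.95]

# With a fully conjunctive decomposition, the headline estimate is just the
# product of the conditional credences.
p_catastrophe = math.prod(premise_credences)

print(f"P(existential catastrophe by 2070) ~= {p_catastrophe:.1%}")  # ~5.1%
```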