Dan Hendrycks' opinion: My coauthors and I
wrote this paper with the ML research community as our target
audience. Here are some thoughts on this topic:
1. The document includes numerous problems that, if left
unsolved, would imply that ML systems are unsafe. We need the
effort of thousands of researchers to address all of them. This
means that the main safety discussions cannot stay within the
confines of the relatively small EA community. I think we should
aim to have over one third of the ML research community work on
safety problems. We need the broader community to treat AI safety at
least as seriously as nuclear power plant safety.
2. To grow the ML research community, we need to suggest
problems that can progressively build the community and organically
grow support for elevating safety standards within the existing
research ecosystem. Research agendas that pertain to AGI
exclusively will not scale sufficiently, and such research will
simply not get enough market share in time. If we do not get the
machine learning community on board with proactively mitigating
risks that already exist, we will have a harder time getting them
to mitigate less familiar and unprecedented risks. Rather than try
to win over the community with alignment philosophy arguments, I'll
try winning them over with interesting problems and try to make
work towards safer systems rewarded with prestige.
3. The benefits of a larger ML Safety community are numerous.
They can decrease the cost of safety methods and increase the
propensity to adopt them. Moreover, to make ML systems have
desirable properties, it is necessary to rapidly accumulate
incremental improvements, but this requires substantial growth
since such gains cannot be produced by just a few card-carrying
x-risk researchers with the purest intentions.
4. The community will fail to grow if we ignore near-term
concerns or actively exclude or sneer at people who work on
problems that are useful for both near- and long-term safety (such
as adversaries). The alignment community will need to stop engaging
in textbook territorialism and welcome serious hypercompetent
researchers who do not post on internet forums or who happen not to
subscribe to effective altruism. (We include a community strategy
in the Appendix.)
5. We focus not only on reinforcement learning but also on deep learning.
Most of the machine learning research community studies deep
learning (e.g., text processing, vision) and does not use, say,
Bellman equations or PPO. While existentially catastrophic failures
will likely require competent sequential decision making agents,
the relevant problems and solutions can often be better studied
outside of gridworlds and MuJoCo. There is much useful safety
research to be done that does not need to be cast as a
reinforcement learning problem.
6. To prevent alienating readers, we did not use phrases such as
"AGI." AGI-exclusive research will not scale; for most academics
and many industry researchers, it's a nonstarter. Likewise, to
prevent needless dismissiveness, we kept x-risks implicit, only
hinted at them, or used the phrase "permanent catastrophe."
I would have personally enjoyed discussing at length how anomaly
detection is an indispensable tool for reducing x-risks from Black
Balls, engineered microorganisms, and deceptive ML systems.
Here is how the problems relate to x-risk:
Adversarial Robustness: This is needed to address proxy gaming. ML
systems encoding proxies must become more robust to optimizers,
which is to say they must become more adversarially robust. We make
this connection explicit at the bottom of page 9.
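The proxy-gaming connection above can be made concrete with a minimal sketch (not from the paper; all names and numbers here are hypothetical). A linear "proxy" scorer is gamed by a single FGSM-style gradient step, which is exactly the kind of optimizer pressure that adversarial robustness is meant to resist:

```python
import numpy as np

# Toy illustration: a linear proxy scorer and an optimizer that perturbs
# its input to game the score. The proxy's vulnerability to this bounded
# gradient step is an adversarial-robustness failure.

rng = np.random.default_rng(0)
w = rng.normal(size=16)          # the proxy model's weights
x = rng.normal(size=16)          # a benign input

def proxy_score(x):
    return float(w @ x)          # higher = "better" according to the proxy

# FGSM-style step: move x in the sign of the score's gradient.
# For a linear scorer, d(proxy_score)/dx = w.
eps = 0.5
x_gamed = x + eps * np.sign(w)

print(proxy_score(x), proxy_score(x_gamed))
# The gamed input scores higher under the proxy despite moving only a
# bounded L-infinity distance from the original input.
```

The point of the sketch is that nothing about the input became "better"; the optimizer simply exploited the proxy's local gradients, so hardening proxies against such optimizers is the same problem as adversarial robustness.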
Black Swans and Tail Risks: It's hard to be safe without high
reliability. It's not obvious we'll achieve high reliability even
by the time we have systems that are superhuman in important
respects. Even though MNIST is solved for typical inputs, we still do
not have a reliable MNIST classifier for atypical inputs! Moreover, if
optimizing agents become unreliable in the
face of novel or extreme events, they could start heavily
optimizing the wrong thing. Models accidentally going off the rails
poses an x-risk if they are sufficiently powerful (this is related
to "competent errors" and "treacherous turns"). If this problem is not
solved, optimizers can exploit these weaknesses; it is a simpler
problem on the way to adversarial robustness.
Anomaly and Malicious Use Detection: This is an indispensable
tool for detecting proxy gaming, Black
Balls, engineered microorganisms that present bio x-risks,
malicious users who may misalign a model, deceptive ML systems, and
rogue ML systems.
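A minimal sketch of one common anomaly-detection baseline, maximum softmax probability (MSP), where low classifier confidence is treated as a signal that an input is anomalous. The logits below are made up for illustration:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msp_anomaly_score(logits):
    # Higher score = more anomalous (1 - max softmax probability).
    return 1.0 - softmax(logits).max(axis=-1)

in_dist_logits = np.array([6.0, 0.5, -1.0])   # confident prediction
anomalous_logits = np.array([0.4, 0.3, 0.2])  # near-uniform, low confidence

print(msp_anomaly_score(in_dist_logits))   # low score: looks in-distribution
print(msp_anomaly_score(anomalous_logits)) # high score: flag for review
```

In practice one would calibrate a threshold on held-out in-distribution data and route flagged inputs to a human or a fallback policy; MSP is only a baseline, but it shows the shape of the detection problem.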
Representative Outputs: Making models honest is a way to avoid
many treacherous turns.
Hidden Model Functionality: This also helps avoid treacherous
turns. Backdoor detection is a potentially useful related problem, as
it is about detecting latent but potentially sharp changes in behavior.
Value Learning: Understanding utilities is difficult even for
humans. Powerful optimizers will need to achieve a certain,
as-of-yet unclear level of superhuman performance at learning our values.
Translating Values to Action: Successfully prodding models to
optimize our values is necessary for safe outcomes.
Proxy Gaming: Obvious.
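"Obvious" can still be made concrete. In this toy dynamic (hypothetical, for illustration only), an agent hill-climbs a proxy objective that correlates with the true objective near the starting point but rewards a spurious feature; optimizing the proxy drives the true objective down:

```python
# Toy proxy gaming: the true objective is -(x - 1)^2, maximized at x = 1,
# but the agent optimizes a proxy with a spurious bonus for large |x|.

def true_value(x):
    return -(x - 1.0) ** 2

def proxy_value(x):
    return -(x - 1.0) ** 2 + 0.8 * x ** 2   # spurious incentive for large |x|

x = 0.0
for _ in range(200):
    # Finite-difference gradient ascent on the proxy.
    g = (proxy_value(x + 1e-4) - proxy_value(x - 1e-4)) / 2e-4
    x += 0.05 * g

# The agent settles near x = 5, where the proxy is maximized but the
# true objective is far worse than where it started.
print(x, proxy_value(x), true_value(x))
```

The proxy here simplifies to -0.2x^2 + 2x - 1, so its maximum sits at x = 5, while the true objective at x = 5 is much lower than at x = 0: the proxy improved monotonically while the thing we actually cared about got worse.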
Value Clarification: This is the philosophy bot section. We will
need to decide what values to pursue. If we decide poorly, we may
lock in or destroy what is of value. It is also possible that there is
an ongoing moral catastrophe, which we would not want to replicate
across the cosmos.
Unintended Consequences: This should help models not
accidentally work against our values.
ML for Cybersecurity: If you believe that AI governance is valuable
and that global turbulence can increase the risk of terrible outcomes,
this section is also relevant. Even if some of
the components of ML systems are safe, they can become unsafe when
traditional software vulnerabilities enable others to control their
behavior. Moreover, traditional software vulnerabilities may lead
to the proliferation of powerful advanced models, and this may be
worse than proliferating nuclear weapons.
Informed Decision Making: We want to avoid decision making based
on unreliable gut reactions during a time of crisis. This reduces
risks of poor governance of advanced systems.
Here are some other notes:
1. We use systems theory to motivate inner optimization, as we
expect this motivation will be more convincing to others.
2. Rather than have a broad call for "interpretability," we
focus on specific transparency-related problems that are more
tractable and neglected. (See the Appendix for a table assessing
importance, tractability, and neglectedness.) For example, we
include sections on making models honest and detecting emergent
functionality.
3. The "External Safety" section can also be thought of as
technical research for reducing "Governance" risks. For readers
mostly concerned about AI risks from global turbulence, there still
is technical research that can be done.
Here are some observations while writing the document:
1. Some approaches that were previously very popular are
currently neglected, such as inverse reinforcement learning. This
may be due to currently low tractability.
2. Five years ago, I started explicitly brainstorming the
content for this document. I think it took the whole time for this
document to take shape. Moreover, if this were written last fall,
the document would be far more confused, since it took around a
year after GPT-3 for my thinking to become reoriented; writing these
types of documents shortly after a paradigm shift may be too hasty.
3. When collecting feedback, it was not uncommon for
"in-the-know" researchers to make opposite suggestions. Some people
thought some of the problems in the Alignment section were
unimportant, while others thought they were the most critical. We
attempted to include most research directions.