HIGHLIGHTS
Collaborating with Humans without Human Data (DJ
Strouse et al) (summarized by Rohin): We’ve previously
seen that if you want to collaborate with humans in the video game
Overcooked, it
helps to train a deep RL agent against a human
model (AN
#70), so that the agent “expects” to be playing against humans
(rather than e.g. copies of itself, as in self-play). We might call
this a “human-aware” agent. However, since a human-aware agent must
be trained against a model that imitates human gameplay, we need to
collect human gameplay data for training. Could we instead train an
agent that is robust enough to play with lots of different agents,
including humans as a special case?
This paper shows that this can be done
with Fictitious Co-Play (FCP), in which
we train our final agent against a population of self-play agents
and their past checkpoints taken throughout training. Such agents
get significantly higher rewards when collaborating with humans in
Overcooked (relative to the human-aware approach in the previously
linked paper).
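For concreteness, here is a minimal sketch of what an FCP-style training loop could look like. The helper callables (train_selfplay, init_agent, play_episode) are hypothetical placeholders for your own RL training code, not the authors' implementation.

```python
import random

def fictitious_co_play(train_selfplay, init_agent, play_episode,
                       num_partners=8, num_episodes=10_000):
    # Stage 1: build a partner population of independently trained self-play
    # agents *plus* their past checkpoints, so that partners vary in skill.
    partner_pool = []
    for seed in range(num_partners):
        final_agent, checkpoints = train_selfplay(seed)  # -> (agent, [checkpoints])
        partner_pool.append(final_agent)
        partner_pool.extend(checkpoints)

    # Stage 2: train the FCP agent against partners sampled from the pool.
    # It never sees human data, but must learn to coordinate with a wide
    # variety of partners, of which humans are (hopefully) a special case.
    fcp_agent = init_agent()
    for _ in range(num_episodes):
        partner = random.choice(partner_pool)
        trajectory = play_episode(fcp_agent, partner)    # one Overcooked game
        fcp_agent.update(trajectory)
    return fcp_agent
```

The key design choice is keeping the partially trained checkpoints in the pool, so the final agent must coordinate with partners of widely varying skill rather than only with fully converged experts.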
In their ablations, the authors find that it is particularly
important to include past checkpoints in the population against
which you train. They also test whether it helps for the self-play agents to use a variety of architectures, and find that it
mostly does not make a difference (as long as you are using past
checkpoints as well).
Read more: Related
paper: Maximum Entropy Population Based Training for Zero-Shot
Human-AI Coordination
Rohin's opinion: You could imagine two
different philosophies on how to build AI systems -- the first
option is to train them on the actual task of interest (for
Overcooked, training agents to play against humans or human
models), while the second option is to train a more robust agent on
some more general task that hopefully includes the actual task
within it (the approach in this paper). Besides Overcooked, another
example would be supervised learning on some natural language task
(the first philosophy), as compared to pretraining on the Internet
GPT-style and then prompting the model to solve your task of
interest (the second philosophy). In some sense the quest for a
single unified AGI system is itself a bet on the second philosophy
-- first you build your AGI that can do all tasks, and then you
point it at the specific task you want to do now.
Historically, I think AI has focused primarily on the first
philosophy, but recent years have shown the power of the second
philosophy. However, I don’t think the question is settled yet: one
issue with the second philosophy is that it is often difficult to
fully “aim” your system at the true task of interest, and as a
result it doesn’t perform as well as it “could have”. In
Overcooked, the FCP agents will not learn specific quirks of human
gameplay that could be exploited to improve efficiency (which the
human-aware agent could do, at least in theory). In natural
language, even if you prompt GPT-3 appropriately, there’s still
some chance it ends up rambling about something else entirely, or
neglects to mention some information that it “knows” but that a
human on the Internet would not have said. (See also this
post (AN
#141).)
I should note that you can also have a hybrid approach, where
you start by training a large model with the second philosophy, and
then you finetune it on your task of interest as in the first
philosophy, gaining the benefits of both.
I’m generally interested in which approach will build more
useful agents, as this seems quite relevant to forecasting the
future of AI (which in turn affects lots of things including AI
alignment plans).
TECHNICAL AI ALIGNMENT
LEARNING HUMAN INTENT
Inverse Decision Modeling: Learning Interpretable Representations
of Behavior (Daniel Jarrett, Alihan Hüyük et
al) (summarized by Rohin): There’s lots of work on
learning preferences from demonstrations, which varies in how much structure is assumed about the demonstrator: for example, we might
consider them to be Boltzmann
rational (AN
#12) or risk
sensitive, or we could try to learn
their biases (AN
#59). This paper proposes a framework to encompass all of these
choices: the core idea is to model the demonstrator as choosing
actions according to a planner; some parameters of
this planner are fixed in advance to impose structural assumptions on the planner, while others are learned from data. This
also allows them to separate beliefs, decision-making, and rewards,
so that different structures can be imposed on each of them
individually.
The paper provides a mathematical treatment of both the forward
problem (how to compute actions in the planner given the reward,
think of algorithms like value iteration) and the backward problem
(how to compute the reward given demonstrations, the typical
inverse reinforcement learning setting). They demonstrate the
framework on a medical dataset, where they introduce a planner with
parameters for flexibility of decision-making, optimism of beliefs,
and adaptivity of beliefs. In this case they specify the desired
reward function and then run backward inference to conclude that,
with respect to this reward function, clinicians appear to be
significantly less optimistic when diagnosing dementia in female
and elderly patients.
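To make the forward/backward split concrete, here is a toy tabular sketch that uses a Boltzmann-rational planner as the fixed structure. The soft value iteration and the brute-force search over candidate rewards are my own simplifications for illustration, not the paper's algorithms; P is assumed to be a transition tensor of shape (states, actions, states).

```python
import numpy as np

def forward(reward, P, gamma=0.95, beta=5.0, iters=200):
    """Forward problem: given a reward over states, compute a
    Boltzmann-rational policy via soft value iteration."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        V = np.log(np.exp(beta * Q).sum(axis=1)) / beta  # soft max over actions
        Q = reward[:, None] + gamma * P @ V              # Bellman backup
    policy = np.exp(beta * (Q - Q.max(axis=1, keepdims=True)))
    return policy / policy.sum(axis=1, keepdims=True)

def backward(demos, P, candidate_rewards):
    """Backward problem: pick the candidate reward vector under which the
    observed (state, action) demonstrations are most likely."""
    def log_likelihood(reward):
        pi = forward(reward, P)
        return sum(np.log(pi[s, a]) for s, a in demos)
    return max(candidate_rewards, key=log_likelihood)
```

The paper's framework allows much richer fixed structure than the single temperature used here, e.g. the parameters for flexibility, optimism, and adaptivity mentioned above, with the backward step done by inference rather than a brute-force search.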
Rohin's opinion: One thing to note about
this paper is that it is an incredible work of scholarship; it
fluently cites research across a variety of disciplines including
AI safety, and provides a useful organizing framework for many such
papers. If you need to do a literature review on inverse
reinforcement learning, this paper is a good place to start.
Human irrationality: both bad and good for reward
inference (Lawrence Chan et al) (summarized
by Rohin): In the previous summary, we saw a framework for inverse
reinforcement learning with suboptimal demonstrators. This paper
instead investigates the qualitative effects of performing inverse
reinforcement learning with a suboptimal demonstrator. The authors
modify different parts of the Bellman equation in order to create a suite of possible suboptimal demonstrators to study (two illustrative modifications are sketched after the conclusions below). They run
experiments with exact inference on random MDPs and FrozenLake, and
with approximate inference on a simple autonomous driving
environment, and conclude:
1. Irrationalities can be helpful for reward
inference, that is, if you infer a reward from
demonstrations by an irrational demonstrator (where you know the
irrationality), you often learn more about the
reward than if you inferred a reward from optimal demonstrations
(where you know they are optimal). Conceptually, this happens
because optimal demonstrations only tell you about what the best
behavior is, whereas most kinds of irrationality can also tell you
about preferences between suboptimal behaviors.
2. If you fail to model irrationality, your
performance can be very bad, that is, if you infer a
reward from demonstrations by an irrational demonstrator, but you
assume that the demonstrator was Boltzmann rational, you can
perform quite badly.
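As referenced above, here is a toy illustration of two ways one might modify the Bellman backup to get a systematically suboptimal demonstrator: a myopic planner and a noisy Boltzmann actor. These are stand-ins in the spirit of the paper, not its actual suite of irrationalities; P is again assumed to be a (states, actions, states) transition tensor.

```python
import numpy as np

def q_values(reward, P, gamma, iters=200):
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        V = Q.max(axis=1)                     # standard Bellman backup
        Q = reward[:, None] + gamma * P @ V
    return Q

def myopic_demonstrator(reward, P, gamma=0.1):
    # Irrationality in the future-value term: the demonstrator heavily
    # discounts the future, so it acts greedily and short-sightedly.
    return q_values(reward, P, gamma).argmax(axis=1)

def boltzmann_demonstrator(reward, P, gamma=0.95, beta=1.0, rng=None):
    # Irrationality in the action choice: actions are sampled with probability
    # proportional to exp(beta * Q), so even suboptimal actions leak
    # information about how bad they are.
    rng = rng or np.random.default_rng(0)
    Q = q_values(reward, P, gamma)
    probs = np.exp(beta * (Q - Q.max(axis=1, keepdims=True)))
    probs /= probs.sum(axis=1, keepdims=True)
    return np.array([rng.choice(len(p), p=p) for p in probs])
```

Point 1 above corresponds to the fact that the Boltzmann demonstrator's action probabilities carry information about the relative value of suboptimal actions, which optimal demonstrations do not.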
Rohin's opinion: One way this paper
differs from my intuitions is that it finds that assuming Boltzmann
rationality performs very poorly if the demonstrator is in fact
systematically suboptimal. I would have instead guessed that
Boltzmann rationality would do okay -- not as well as in the case
where there is no misspecification, but only a little worse than
that. (That’s what I found in my
paper (AN
#59), and it makes intuitive sense to me.) Some hypotheses for
what’s going on, which the lead author agrees are at least part of
the story:
1. When assuming Boltzmann rationality, you infer a distribution
over reward functions that is “close” to the correct one in terms
of incentivizing the right behavior, but differs in rewards
assigned to suboptimal behavior. In this case, you might get a very
bad log loss (the metric used in this paper), but still have a
reasonable policy that is decent at acquiring true reward (the
metric used in my paper).
2. The environments we’re using may differ in some important way
(for example, in the environment in my paper, it is primarily
important to identify the goal, which might be much easier to do
than inferring the right behavior or reward in the autonomous
driving environment used in this paper).
FORECASTING
Forecasting progress in language models (Matthew
Barnett) (summarized by Sudhanshu): This post aims to
forecast when a "human-level language model" may be created. To
build up to this, the author swiftly covers basic concepts from
information theory and natural language processing such as entropy,
N-gram models, modern LMs, and perplexity. Perplexity results reported by recent state-of-the-art models are collected and used to estimate, via linear regression, when we can expect future models to score below certain entropy levels, approaching the hypothesised entropy of the English language.
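The extrapolation itself is straightforward. A minimal sketch, assuming you supply the years, reported cross-entropies, and hypothesised entropy floor from the post's notebook (no numbers are filled in here), might look like this:

```python
import numpy as np

def predicted_crossing_year(years, cross_entropies, entropy_floor):
    """Fit a linear trend to reported cross-entropy over time and return the
    year at which the trend line crosses the hypothesised entropy floor."""
    slope, intercept = np.polyfit(years, cross_entropies, deg=1)
    if slope >= 0:
        raise ValueError("No downward trend to extrapolate.")
    return (entropy_floor - intercept) / slope
```

Everything interesting lives in the data and the choice of entropy floor, which is why the attached notebook is worth a look.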
These predictions range across the next 15 years, depending on which dataset, method, and entropy level is being solved for; an attached Python notebook with these details lets curious readers investigate further. Hedging preemptively with a disjunction, the author concludes that "either current trends will break down soon, or human-level language models will likely arrive in the next decade or two."
Sudhanshu's opinion: This quick read
provides a natural, accessible analysis stemming from recent
results, while staying self-aware (and informing readers) of
potential improvements. The comments section also includes some interesting debates, e.g. about the Goodhart-ability of the perplexity metric.
I personally felt these estimates were broadly in line with my
own intuitions. I would go so far as to say that with the
confluence of improved generation capabilities across text,
speech/audio, and video, as well as multimodal consistency and
integration, virtually any kind of content we see ~10 years from
now will be algorithmically generated and indistinguishable from
the work of human professionals.
Rohin's opinion: I would generally adopt
forecasts produced by this sort of method as my own, perhaps making
them a bit longer as I expect the quickly growing compute trend to
slow down. Note however that this is a forecast for human-level
language models, not transformative AI; I would expect these to be
quite different and would predict that transformative AI comes
significantly later.
MISCELLANEOUS (ALIGNMENT)
Rohin Shah on the State of AGI Safety Research in
2021 (Lucas Perry and Rohin
Shah) (summarized by Rohin): As in previous years (AN
#54), on this FLI podcast I talk about the state of the field.
Relative to previous years, this podcast is a bit more
introductory, and focuses a bit more on what I find interesting
rather than what the field as a whole would consider
interesting.
Read more: Transcript
NEAR-TERM CONCERNS
RECOMMENDER SYSTEMS
User Tampering in Reinforcement Learning Recommender
Systems (Charles Evans et al) (summarized by
Zach): Large-scale recommender systems have emerged as a way to filter through large pools of content and recommend items to users. However, these advances have raised social and ethical concerns about how recommender systems are used. This paper focuses on the potential for social manipulation and polarization arising from the use of RL-based recommender systems. In particular, the authors present evidence that such
recommender systems have an instrumental goal to engage in user
tampering by polarizing users early on in an attempt to make later
predictions easier.
To formalize the problem, the authors introduce a causal model. Essentially, they note that predicting user preferences requires an exogenous (non-observable) variable that models click-through rates. They then introduce a notion of instrumental
goal that models the general behavior of RL-based algorithms over a
set of potential tasks. The authors argue that such algorithms will
have an instrumental goal to influence the exogenous/preference
variables whenever user opinions are malleable. This ultimately
introduces a risk for preference manipulation.
The authors' hypothesis is tested using a simple media recommendation problem. They model the exogenous variable as either leftist, centrist, or right-wing. User preferences are malleable in the sense that showing a user content from an opposing side polarizes their initial preferences further. In experiments, the authors show that a standard Q-learning algorithm will learn to tamper with user preferences, which increases polarization in both leftist and right-wing populations. Moreover, even though the agent makes use of tampering, it fails to outperform a crude baseline policy that avoids tampering.
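As a rough illustration of the dynamics described above, here is a toy, hand-rolled version of a polarizable user; the backlash rule and reward below are my own simplification, not the paper's environment.

```python
def step(leaning, shown):
    """leaning and shown each take values in {-1: left, 0: centrist, +1: right}."""
    # Malleability: partisan content that doesn't match the user's current
    # leaning provokes a backlash toward the opposite pole.
    if shown != 0 and shown != leaning:
        leaning = -shown
    # Click-through reward is highest when the shown content matches the
    # (possibly manipulated) leaning; a polarized user is easy to predict.
    reward = 1.0 if shown == leaning else 0.0
    return leaning, reward
```

A tabular Q-learner run on dynamics like these can discover that polarizing the user first makes the click-through reward easier to collect later, which is the tampering behavior the experiments exhibit.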
Zach's opinion: This article is
interesting because it formalizes and experimentally demonstrates
an intuitive concern many have regarding recommender systems. I
also found the formalization of instrumental goals to be of
independent interest. The most surprising result was that the
agents that exploit tampering are not particularly more effective than policies that avoid tampering. This suggests that the instrumental incentive is not really pointing at what is actually optimal, which I found to be an illuminating distinction.
NEWS
OpenAI hiring Software Engineer, Alignment (summarized by
Rohin): Exactly what it sounds like: OpenAI is hiring a software
engineer to work with the Alignment team.
BERI hiring ML Software Engineer (Sawyer
Bernath) (summarized by Rohin): BERI is hiring a remote
ML Engineer as part of their collaboration with the Autonomous
Learning Lab at UMass Amherst. The goal is to create a
software library that enables easy deployment of the ALL's
Seldonian algorithm framework for safe and aligned AI.
AI Safety Needs Great Engineers (Andy
Jones) (summarized by Rohin): If the previous two roles
weren't enough to convince you, this post explicitly argues that a
lot of AI safety work is bottlenecked on good engineers, and
encourages people to apply to such roles.
AI Safety Camp Virtual 2022 (summarized by Rohin):
Applications are open for this remote research program, where
people from various disciplines come together to research an open
problem under the mentorship of an established AI-alignment
researcher. The deadline to apply is December 1st.
Political Economy of Reinforcement Learning
schedule (summarized by Rohin): The date for
the PERLS
workshop (AN
#159) at NeurIPS has been set for December 14, and the schedule
and speaker list are now available on the website.