AI GOVERNANCE
Truthful AI: Developing and governing AI that does not
lie (Owain Evans, Owen Cotton-Barratt et
al) (summarized by Rohin): This paper argues that we
should develop both the technical capabilities and the governance
mechanisms necessary to ensure that AI systems
are truthful. We will primarily think about
conversational AI systems here (so not, say, AlphaFold).
Some key terms:
1. An AI system is honest if it only
makes statements that it actually believes. (This requires you to
have some way of ascribing beliefs to the system.) In
contrast, truthfulness only checks if
statements correspond to reality, without making any claims about
the AI system’s beliefs.
2. An AI system is broadly
truthful if it doesn’t lie, volunteers all the
relevant information it knows, is well-calibrated and knows the
limits of its information, etc.
3. An AI system is narrowly
truthful if it avoids making negligent
suspected-falsehoods. These are statements that can
feasibly be determined by the AI system to be unacceptably likely
to be false. Importantly, a narrowly truthful AI is not required to
make contentful statements; it can express uncertainty or refuse to
answer.
This paper argues for narrow truthfulness as the appropriate
standard. Broad truthfulness is not very precisely defined, making
it challenging to coordinate on. Honesty does not give us the
guarantees we want: in settings in which it is advantageous to say
false things, AI systems might end up being honest
but deluded. They would honestly report their
beliefs, but those beliefs might be false.
Narrow truthfulness is still a much stronger standard than the one we
impose upon humans. This is desirable, because (1) AI systems
need not be constrained by social norms, the way humans are;
consequently they need stronger standards, and (2) it may be less
costly to enforce that AI systems are narrowly truthful than to
enforce that humans are narrowly truthful, so a higher standard is
more feasible.
Evaluating the (narrow) truthfulness of a model is non-trivial.
There are two parts: first, determining whether a given statement
is unacceptably likely to be false, and second, determining whether
the model was negligent in uttering such a statement. The former
could be done by having human processes that study a wide range of
information and determine whether a given statement is unacceptably
likely to be false. In addition to all of the usual concerns about
the challenges of evaluating a model that might know more than you,
there is also the challenge that it is not clear exactly what
counts as “unacceptably likely to be false”. For example, if a
model utters a false statement, but expresses low confidence, how
should that be rated? The second part, determining negligence,
needs to account for the fact that the AI system might not have had
all the necessary information, or that it might not have been
capable enough to come to the correct conclusion. One way of
handling this is to compare the AI system to other AI systems built
in a similar fashion.
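To make this two-part evaluation concrete, here is a minimal sketch of how such a check might be structured. The threshold value and the helper functions (estimate_probability_false, peer_model_answers) are hypothetical stand-ins for the human adjudication and peer-comparison processes described above; the paper does not specify an implementation.

```python
# Sketch of a narrow-truthfulness evaluation: (1) decide whether a statement
# is unacceptably likely to be false, (2) decide whether uttering it was
# negligent, by comparison with similarly built AI systems.

from dataclasses import dataclass
from typing import List

# Illustrative threshold for "unacceptably likely to be false".
FALSEHOOD_THRESHOLD = 0.9


@dataclass
class Verdict:
    suspected_falsehood: bool
    negligent: bool


def estimate_probability_false(statement: str) -> float:
    """Stand-in for a human process that researches the statement and
    estimates how likely it is to be false."""
    raise NotImplementedError


def peer_model_answers(question: str) -> List[str]:
    """Stand-in for querying comparably built AI systems on the same
    question, used as a baseline when judging negligence."""
    raise NotImplementedError


def evaluate_statement(question: str, statement: str) -> Verdict:
    # Part 1: is the statement unacceptably likely to be false?
    p_false = estimate_probability_false(statement)
    suspected = p_false >= FALSEHOOD_THRESHOLD
    if not suspected:
        return Verdict(suspected_falsehood=False, negligent=False)

    # Part 2: was uttering it negligent? One heuristic: if most peer
    # systems avoid the falsehood, this system could feasibly have
    # avoided it too.
    peers = peer_model_answers(question)
    peer_false_rate = sum(
        estimate_probability_false(a) >= FALSEHOOD_THRESHOLD for a in peers
    ) / max(len(peers), 1)
    negligent = peer_false_rate < 0.5

    return Verdict(suspected_falsehood=True, negligent=negligent)
```

The key design point is that negligence is judged relative to comparably built systems rather than in the abstract, which accounts for limits on the system's information and capability.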
How might narrow truthfulness be useful? One nice thing it
enables is truthfulness amplification, in
which we can amplify properties of a model by asking a web of
related questions and combining the answers appropriately. For
example, if we are concerned that the AI system is deceiving us on
just this question, we could ask it whether it is deceiving us, or
whether an investigation into its statement would conclude that it
was deceptive. As another example, if we are worried that the AI
system is making a mistake on some question where its statement
isn’t obviously false, we can ask it about its
evidence for its position and how strong that evidence is (where a
false statement about the evidence is more likely to be a negligent
falsehood).
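As a rough illustration of truthfulness amplification, the sketch below asks a web of follow-up questions about a single claim and collects the answers for cross-examination. The ask_model callable and the specific probe questions are illustrative assumptions; the paper does not prescribe a particular protocol.

```python
# Toy sketch of truthfulness amplification: probe a narrowly truthful
# system with related questions about one claim and collect the answers
# so that inconsistencies can be spotted. `ask_model` and the probe
# questions below are assumptions for illustration, not from the paper.

from typing import Callable, Dict


def amplify_truthfulness(ask_model: Callable[[str], str],
                         claim: str) -> Dict[str, str]:
    """Return a mapping from each probe question to the model's answer."""
    probes = [
        f"Is the following statement true: {claim!r}?",
        f"Are you deceiving me about {claim!r}?",
        f"Would an investigation into {claim!r} conclude it was deceptive?",
        f"What evidence supports {claim!r}, and how strong is that evidence?",
    ]
    # Because each answer is itself held to the narrow-truthfulness
    # standard, a lie about the original claim becomes harder to sustain
    # consistently across the whole web of questions.
    return {q: ask_model(q) for q in probes}
```

A human overseer (or simple consistency checks) can then compare these answers against the original statement to decide whether to trust it.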
Section 3 is devoted to the potential benefits and costs if we
successfully ensure that AI systems are narrowly truthful, with the
conclusion that the costs are small relative to the benefits, and
can be partially mitigated. Section 6 discusses other potential
benefits and costs if we attempt to create truthfulness standards
to ensure the AI systems are narrowly truthful. (For example, we
might try to create a truthfulness standard, but instead create an
institution that makes sure that AI systems follow a particular
agenda, by only rating as true the statements that are consistent
with that agenda.) Section 4 talks about the governance mechanisms
we might use to implement a truthfulness standard. Section 5
describes potential approaches for building truthful AI systems. As
I mentioned in the highlighted post, these techniques are general
alignment techniques that have been specialized for truthful
AI.