TECHNICAL AI ALIGNMENT
LEARNING HUMAN INTENT
Adapting Language Models for Zero-shot Learning by Meta-tuning on
Dataset and Prompt Collections (Ruiqi Zhong et
al) (summarized by Rohin): Large
language models (AN
#102) can be prompted to perform classification tasks. However,
you may not want to simply phrase the prompt as a question like
“Does the following tweet have positive or negative sentiment?”,
because in the training set such questions may have been followed
by something other than an answer (for example, an elaboration of
the question, or a denial that the question is important), and the
model may end up choosing one of these alternatives as the most
likely completion.
The natural solution is to collect a question-answering dataset
and finetune on it. The core idea of this paper is that we can
convert existing NLP classification datasets into a
question-answering format, which we can then finetune on. For
example, given a dataset for movie review classification (where the
goal is to predict whether a review is positive or negative), we
produce questions like “Is the review positive?” or “Does the user
find this movie bad?” The entire classification dataset can then be
turned into question-answer pairs to train on.
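A minimal sketch of this dataset conversion (my own illustration, not the authors' code), assuming a movie-review dataset with "positive"/"negative" labels and a couple of hand-written question templates:

```python
# Each question template comes with the answer to give for each label.
REVIEW_QUESTIONS = [
    # (question, answer if label is "positive", answer if label is "negative")
    ("Is the review positive?", "Yes", "No"),
    ("Does the user find this movie bad?", "No", "Yes"),
]

def to_qa_pairs(review, label):
    """Turn one (review, label) example into several QA training examples."""
    pairs = []
    for question, ans_if_pos, ans_if_neg in REVIEW_QUESTIONS:
        answer = ans_if_pos if label == "positive" else ans_if_neg
        pairs.append({"context": review, "question": question, "answer": answer})
    return pairs

print(to_qa_pairs("A delightful film with sharp writing.", "positive"))
```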
They do this for several datasets, producing 441 question types
in total. They then finetune the 0.77B parameter T5 model on a
training set of questions, and evaluate it on questions that come
from datasets not seen during training. Among other things, they
find:
1. Their model does better than UnifiedQA,
which was also trained for question answering using a similar
idea.
2. Pretraining is very important: performance crashes if you
“finetune” on top of a randomly initialized model. This suggests
that the model already “knows” the relevant information, and
finetuning ensures that it uses this knowledge appropriately.
3. If you ensemble multiple questions that get at the same
underlying classification task, you can do better than any of the
questions individually (a simple version is sketched just after this list).
4. It is possible to overfit: if you train too long, performance
does decrease.
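On finding 3, here is a minimal sketch of one way to ensemble several question phrasings. Averaging the model's probabilities is my own simplification (the paper may combine predictions differently), and `prob_yes` is a stand-in for the finetuned QA model:

```python
def ensemble_predict(prob_yes, review, questions):
    """Ensemble several phrasings of the same classification question.

    prob_yes(review, question) -> model probability of answering "Yes".
    `questions` is a list of (question, yes_means_positive) pairs, so that
    inverted phrasings like "Does the user find this movie bad?" are
    flipped before averaging.
    """
    scores = []
    for question, yes_means_positive in questions:
        p = prob_yes(review, question)
        scores.append(p if yes_means_positive else 1.0 - p)
    return "positive" if sum(scores) / len(scores) > 0.5 else "negative"
```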
Finetuned Language Models Are Zero-Shot
Learners (Jason Wei, Maarten Bosma, Vincent Y. Zhao,
Kelvin Guu et al) (summarized by Rohin): This paper
applies the approach from the previous paper to a much larger 137B
parameter model to produce a model that follows
instructions (rather than just answering
questions). Since they are focused on instruction following,
they don’t limit themselves to classification tasks: they also want
to have generative tasks, and so include e.g. summarization
datasets. They also generate such tasks automatically by
“inverting” the classification task: given the label y, the goal is
to generate the input x. For example, for the movie review
classification dataset, they might provide the instruction “Write a
negative movie review”, and then provide one of the movie reviews
classified as negative as an example of what the model should write
in that situation.
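A minimal sketch of this task inversion (my own illustration; the paper's instruction templates differ in wording):

```python
def invert_example(review, label):
    """Turn a (review, label) classification example into an
    (instruction, target) generation example."""
    instruction = f"Write a {label} movie review."
    return {"instruction": instruction, "target": review}

print(invert_example("Wooden acting and a plot full of holes.", "negative"))
```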
A natural approach to classification with a language model is to
ask a question like “Is this movie review positive?”, check the
probabilities assigned to “Yes” and “No”, and return whichever is
higher. The authors note that this can be
vulnerable to what we might call “probability splitting”
(analogously to vote
splitting). Even if the correct answer is “Yes”, the model
might split probability across “Yes”, “Yup”, “Definitely”,
“Absolutely”, etc., such that “No” ends up having higher probability
than “Yes”. To solve this problem, in classification questions they
add a postscript specifying what the options are. During
finetuning, the model should quickly learn that the next word is
always chosen from one of these options, and so will stop assigning
probability to other words, preventing probability splitting.
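A minimal sketch of scoring a classification prompt with an explicit options postscript (the prompt wording and the `option_logprob` scoring call are illustrative stand-ins, not the paper's exact format):

```python
def classify(option_logprob, review, options=("Yes", "No")):
    """option_logprob(prompt, option) -> log-probability the model assigns
    to `option` as the continuation of `prompt` (a stand-in for a real
    language-model scoring call)."""
    prompt = (
        f"Movie review: {review}\n"
        "Is this movie review positive?\n"
        "OPTIONS:\n- " + "\n- ".join(options) + "\n"
    )
    # Only the listed options are compared, so probability mass spread over
    # synonyms like "Yup" or "Definitely" cannot outvote the correct answer.
    return max(options, key=lambda o: option_logprob(prompt, o))
```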
They find that the finetuned model does much better on held-out
tasks than the original model (both evaluated zero-shot). The
finetuned model also beats zero-shot GPT-3 on 19 of 25 tasks, and
few-shot GPT-3 on 10 of 25 tasks. The finetuned model is always
used zero-shot; unfortunately they don’t report results when using
the finetuned model in a few-shot setting.
They also study the impact of instruction tuning over various
model sizes. At every model size, instruction tuning helps
significantly on the tasks that were seen during finetuning, as you
would expect. However, when considering tasks that
were not seen during finetuning, instruction
tuning actually hurts performance for models up to 8B parameters,
and only helps for the 68B and 137B models (where it raises
performance by about 15 percentage points on average across
held-out tasks).