Scaling Language Models: Methods, Analysis & Insights from Training
Gopher (Jack W. Rae et al) (summarized by
Rohin): This paper details the training of the Gopher family of
large language models (LLMs), the biggest of which is named Gopher
and has 280 billion parameters. The algorithmic details are very
similar to the GPT series (AN
#102): a Transformer architecture trained on next-word
prediction. The models are trained on a new data distribution that
still consists of text from the Internet but in different
proportions (for example, book data is 27% of Gopher’s training
data but only 16% of GPT-3’s training data).
Like other LLM papers, there are tons of evaluations of Gopher
on various tasks, only some of which I’m going to cover here. One
headline number is that Gopher beat the state of the art (SOTA) at
the time on 100 out of 124 evaluation tasks.
The most interesting aspect of the paper (to me) is that the
entire Gopher family of models was trained on the same number
of tokens, thus allowing us to study the effect of scaling up model
parameters (and thus training compute) while holding data constant.
Some of the largest benefits of scale were seen in the Medicine,
Science, Technology, Social Sciences, and the Humanities task
categories, while scale had little or even a negative
effect in the Maths, Logical Reasoning, and Common Sense
categories. Surprisingly, we see improved performance
on TruthfulQA (AN
#165) with scale, even though the TruthfulQA benchmark was
designed to show worse performance with increased scale.
We can use Gopher in a dialogue setting by prompting it
appropriately. The prompt specifically instructs Gopher to be
“respectful, polite, and inclusive”; it turns out that this
significantly helps with toxicity. In particular, for the vanilla
Gopher model family, with more scale the models produce more toxic
continuations given toxic user statements; this no longer happens
with Dialogue-Prompted Gopher models, which show slight reductions
in toxicity with scale in the same setting. The authors speculate
that while increased scale leads to an increased ability to mimic
the style of a user statement, this is compensated for by an
increased ability to account for the prompt.
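The general shape of this setup can be sketched in code. Everything below is a hypothetical stand-in: the paper's actual Dialogue-Prompted Gopher prompt is much longer, and the function and strings here are illustrative only.

```python
# Illustrative sketch of dialogue prompting. The header text and the
# turn format are hypothetical stand-ins, not the paper's actual prompt.
def build_dialogue_prompt(history, user_message):
    """Prepend a persona instruction and the conversation so far, so the
    language model completes the next assistant turn."""
    header = (
        "The following is a conversation with an AI assistant that is "
        "respectful, polite, and inclusive.\n\n"
    )
    turns = "".join(f"User: {u}\nGopher: {g}\n" for u, g in history)
    return header + turns + f"User: {user_message}\nGopher:"

prompt = build_dialogue_prompt(
    [("Hi!", "Hello! How can I help?")], "Tell me a joke."
)
```

The model then generates a continuation of this string, which is taken as the assistant's reply; no weights are changed, which is what distinguishes this from the finetuning approach discussed next.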
Another alternative the authors explore is to finetune Gopher on
5 billion tokens of dialogue to produce Dialogue-Tuned Gopher.
Interestingly, human raters were indifferent between
Dialogue-Prompted Gopher and Dialogue-Tuned Gopher.
Read more: Blog
post: Language modelling at scale: Gopher, ethical considerations,
and retrieval
Training Compute-Optimal Large Language Models (Jordan
Hoffmann et al) (summarized by Rohin): One application of
scaling laws (AN #87) is to figure out how big a model to train,
and on how much
data, given some compute budget. This paper performs a more
systematic study than the original paper and finds that existing
models are significantly undertrained. Chinchilla is a new model
built with this insight: it has 4x fewer parameters than Gopher,
but is trained on 4x as much data. Despite using the same amount of
training compute as Gopher (and lower inference compute),
Chinchilla outperforms Gopher across a wide variety of metrics,
validating these new scaling laws.
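The "same training compute" claim is easy to sanity-check under the common approximation C ≈ 6·N·D (6 training FLOPs per parameter per token); this approximation and the rounded counts below are my own additions, not figures from the summary.

```python
# Sanity check under the common approximation C ≈ 6 * N * D
# (6 training FLOPs per parameter per token; an approximation,
# not a figure quoted in the summary itself).
def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

N, D = 280e9, 300e9  # Gopher-scale parameter and token counts (rounded)
# 4x fewer parameters, 4x more data => identical training compute:
assert train_flops(N / 4, 4 * D) == train_flops(N, D)
```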
You can safely skip to the opinion at this point – the rest of
this summary is quantitative details.
We want to find functions N(C) and D(C) that specify the optimal
number of parameters N and the amount of data D to use given some
compute budget C. We’ll assume that these scale with a power of C,
that is, N(C) = k_N * C^a and D(C) = k_D * C^b, for some constants
a, b, k_N, and k_D. Note that since total compute increases
linearly with both N (since each forward / backward pass is linear
in N) and D (since the number of forward / backward passes is
linear in D), we need to have a + b = 1. (You can see this somewhat
more formally by noting that we have C = k_C * N(C) * D(C) for some
constant k_C; substituting in the definitions of N(C) and D(C)
gives C = k_C * k_N * k_D * C^(a+b), which can hold for all C only
if a + b = 1.)
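The a + b = 1 constraint can also be checked numerically; the constants below are arbitrary hypothetical values chosen so the identity holds.

```python
# Numerical check that C = k_C * N(C) * D(C) with N(C) = k_N * C**a and
# D(C) = k_D * C**b forces a + b = 1: the product is
# k_C * k_N * k_D * C**(a+b), which equals C for every C only if a+b = 1.
k_N, k_D, a, b = 2.0, 0.5, 0.5, 0.5   # hypothetical constants with a + b = 1
k_C = 1.0 / (k_N * k_D)               # fixed so the identity holds

for C in (1e18, 1e20, 1e22):
    N = k_N * C**a
    D = k_D * C**b
    assert abs(k_C * N * D - C) / C < 1e-9  # identity holds at every scale
```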
This paper uses three different approaches to get three
estimates of a and b. The approach I like best is the "isoFLOP
profile" approach:
1. Choose a variety of possible values of (N, D, C), train
models with those values, and record the final loss obtained. Note
that not all values of (N, D, C) are possible: given any two values
the third is determined.
2. Draw isoFLOP curves: for each value of C, take either N or
D as the remaining independent variable, and fit a parabola to
the resulting losses. The minimum of this parabola gives you an
estimate of the optimal N and D for that particular value of C.
3. Use the optimal (N, D, C) points to fit N(C) and D(C).
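The three steps above can be sketched on synthetic data. The loss function below is made up purely for illustration (with the answer a = 0.5 planted in it); only the fitting steps (a parabola per compute budget, then a power law through the minima) mirror the procedure described above.

```python
# Sketch of the isoFLOP fitting procedure on synthetic data.
import numpy as np

def fake_loss(n_params, flops):
    """Hypothetical loss, minimized when log(N) = 0.5 * log(C), i.e. a = 0.5."""
    return (np.log(n_params) - 0.5 * np.log(flops)) ** 2 + 1.0 / np.log(flops)

budgets = [1e18, 1e19, 1e20, 1e21]
optimal_log_n = []
for C in budgets:
    # Step 2: vary N at fixed C and fit a parabola to loss vs log(N).
    log_ns = np.linspace(0.4, 0.6, 9) * np.log(C)
    losses = [fake_loss(np.exp(ln), C) for ln in log_ns]
    c2, c1, _ = np.polyfit(log_ns, losses, 2)
    optimal_log_n.append(-c1 / (2 * c2))  # the parabola's minimum

# Step 3: fit N(C) = k_N * C**a, i.e. a straight line in log-log space.
a_hat, _ = np.polyfit(np.log(budgets), optimal_log_n, 1)
print(round(a_hat, 2))  # recovers the planted exponent a = 0.5
```

In the paper the losses come from actually training models at each (N, D, C) point rather than from a closed-form function, but the curve fitting is the same.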
This approach gives an estimate of a = 0.49; the other
approaches give estimates of a = 0.5 and a = 0.46. If we take the
nice round number a = b = 0.5, this suggests that you should scale
up parameters and data equally. With 10x the computation, you
should train a 3.2x larger model with 3.2x as much data. In
contrast, the original
scaling laws paper (AN
#87) estimated that a = 0.74 and b = 0.26. With 10x more
computation, it would suggest training a 5.5x larger model with
1.8x as much data.
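The multipliers quoted above follow directly from the exponents: scaling compute by 10x scales the optimal model size by 10^a and the optimal data by 10^b.

```python
# Arithmetic behind the "10x compute" claims above.
chinchilla = (0.5, 0.5)    # a = b = 0.5, the round numbers from this paper
original = (0.74, 0.26)    # exponents from the original scaling-laws paper

for a, b in (chinchilla, original):
    print(round(10**a, 1), round(10**b, 1))
# → 3.2 3.2  (equal scaling of parameters and data)
# → 5.5 1.8  (mostly scale up the model)
```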