HIGHLIGHTS
Program Synthesis with Large Language Models (Jacob
Austin, Augustus Odena et al) (summarized by Rohin): Can
we use large language models to solve programming problems? In
order to answer this question, this paper builds the Mostly Basic
Python Programming (MBPP) dataset. The authors asked crowd workers
to provide a short problem statement, a Python function that solves
the problem, and three test cases checking correctness. On average
across the 974 programs, the reference solution has 7 lines of
code, suggesting the problems are fairly simple. (This is partly
because you can use library functions.) They also edit a subset of
426 problems to improve their quality, for example by making the
problem statement less ambiguous or making the function signature
more normal.
They evaluate pretrained language models on this dataset across
a range of model sizes from 0.244B to 137B parameters. (This
largest model is within a factor of 2 of GPT-3.) They consider both
few-shot and finetuned models. Since we have test cases that can be
evaluated automatically, we can boost performance by generating
lots of samples (80 in this case), evaluating them on the test
cases, and then keeping the ones that succeed. They count a problem
as solved if any sample passes all the test cases, and report as
their primary metric the fraction of problems solved according to
this definition. Note however that the test cases are not
exhaustive: when they wrote more exhaustive tests for 50 of the
problems, they found that about 12% of the so-called “solutions”
did not pass the new tests (but conversely, 88% did). They also
look at the fraction of samples which solve the problem, as a
metric of the reliability or confidence of the model for a given
problem.
Some of their findings:
1. Performance increases approximately log-linearly with model
size. The trend is clearer and smoother by the primary metric
(fraction of problems solved by at least one sample) compared to
the secondary metric (fraction of samples that solve their
problem).
2. Finetuning provides a roughly constant boost across model
sizes. An exception: at the largest model size, finetuning provides
almost no benefit, though this could just be noise.
3. It is important to provide at least one test case to the
model (boosts problems solved from 43% to 55%) but after that
additional test cases don’t make much of a difference (an
additional two examples per problem boosts performance to 59%).
4. In few-shot learning, the examples used in the prompt matter
a lot. In a test of 15 randomly selected prompts for the few-shot
137B model, the worst one got ~1%, while the best one got ~59%,
with the others distributed roughly uniformly between them.
Ensembling all 15 prompts boosts performance to 66%.
5. In rare cases, the model overfits to the test cases. For
example, in a question about checking whether the input is a
Woodall number, there is only one test checking an actual Woodall
number (383), and the model generates a program that simply checks
whether the input is 383.
6. When choosing the best of multiple samples, you want a
slightly higher temperature, in order to have more diversity of
possible programs to check.
7. It is important to have high quality problem descriptions as
input for the model. The 137B model solves 79% of problems in the
edited dataset, but only solves 63% of the original (unedited)
versions of those problems. The authors qualitatively analyze the
edits on the problems that switched from unsolved to solved and
find a variety of things that you would generally expect to
help.
Now for the controversial question everyone loves to talk about:
does the model understand the meaning of the
code, or is it “just learning statistical correlations”? One way to
check this is to see whether the model can
also execute code. Specifically, we provide the
ground truth code for one of the problems in the MBPP dataset along
with one of the test case inputs and ask the model to predict the
output for that test case. Even after finetuning for this task, the
137B model gets only 21% right. This can be boosted to 27% by also
providing example test cases for the code before predicting the
output for a new test case. Overall, this suggests that the model
doesn’t “understand” the code yet.
We can take the model finetuned for execution and see how well
it does on program synthesis. (We can do this because there are
different prompts for execution and synthesis.) For the 8B model,
the finetuning makes basically no difference: it’s equivalent to
the original few-shot setting. However, for the 137B model,
finetuning on execution actually leads to a small but non-trivial
improvement in performance (from ~59% to ~63%, I think). This is
true relative to either the few-shot or finetuned-for-synthesis
setting, since they performed near-identically for the 137B model.
So in fact the 137B model finetuned on execution is actually the
strongest model, according to synthesis performance.
So far we’ve just been looking at how our model performs when
taking the best of multiple samples. However, if our goal is to
actually use models for program synthesis, we aren’t limited to
such simple tricks. Another approach is to have a human
provide feedback in natural language when the
model’s output is incorrect, and then have the model generate a new
program. This feedback is very informal, for example, “Close, but
you need to replace the underscore with an empty string”. This
provides a huge performance boost: the 137B solves ~31% of problems
on its first sample; adding just a single piece of human feedback
per problem boosts performance to ~55%, and having four rounds of
human feedback gets you to over 65%.
The authors also introduce the MathQA-Python dataset, which
provides arithmetic word problems and asks models to write programs
that would output the correct answer to the problem. They only run
a few experiments on this dataset, so I’ve mostly ignored it. The
main upshot is that a finetuned 137B parameter model can solve
83.8% of problems with some sample. They don’t
report metrics with a single sample, which seems like the more
relevant metric for this dataset, but eyeballing other graphs I
think it would be around 45%, which you could probably boost a
little bit by decreasing the sampling temperature.
|