Paper Close Reading: GPT-3

Unlike GPT-2, which chased zero-shot performance, GPT-3 shifts the goal to few-shot.

Abstract

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art finetuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

Summary:

Bigger and stronger. Although the setting is few-shot, the model is never fine-tuned on the downstream task's samples; the few-shot demonstrations are delivered purely through the prompt.
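As a concrete picture of "tasks and demonstrations specified purely via text, no gradient updates", here is a minimal sketch. The happy/sad task echoes an instruction quoted later in the paper's introduction, but the exact formatting, labels, and query sentence are my own illustration:

```python
# Minimal sketch of few-shot prompting: the task description and a handful
# of demonstrations are placed directly in the input text, and the model is
# asked to continue it. No gradient updates occur. Task and format are
# illustrative, not taken from the paper.

demonstrations = [
    ("What a wonderful day!", "happy"),
    ("I lost my wallet again.", "sad"),
]

prompt = "Decide whether each sentence describes something happy or something sad.\n"
for sentence, label in demonstrations:          # few-shot demonstrations
    prompt += f"Sentence: {sentence}\nLabel: {label}\n"
prompt += "Sentence: The sun finally came out.\nLabel:"   # the query to complete

print(prompt)
# The model's continuation after the final "Label:" is read off as its answer.
```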

Introduction

Recent years have featured a trend towards pre-trained language representations in NLP systems, applied in increasingly flexible and task-agnostic ways for downstream transfer. First, single-layer representations were learned using word vectors [MCCD13, PSM14] and fed to task-specific architectures, then RNNs with multiple layers of representations and contextual state were used to form stronger representations [DL15, MBXS17, PNZtY18] (though still applied to task-specific architectures), and more recently pre-trained recurrent or transformer language models [VSP+17] have been directly fine-tuned, entirely removing the need for task-specific architectures [RNSS18, DCLT18, HR18].

Summary:

Fine-tuning a pre-trained model has become everyone's favorite recipe.

This last paradigm has led to substantial progress on many challenging NLP tasks such as reading comprehension, question answering, textual entailment, and many others, and has continued to advance based on new architectures and algorithms [RSR+19, LOG+19, YDY+19, LCG+19]. However, a major limitation to this approach is that while the architecture is task-agnostic, there is still a need for task-specific datasets and task-specific fine-tuning: to achieve strong performance on a desired task typically requires fine-tuning on a dataset of thousands to hundreds of thousands of examples specific to that task. Removing this limitation would be desirable, for several reasons.

First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the applicability of language models. There exists a very wide range of possible useful language tasks, encompassing anything from correcting grammar, to generating examples of an abstract concept, to critiquing a short story. For many of these tasks it is difficult to collect a large supervised training dataset, especially when the process must be repeated for every new task.

Second, the potential to exploit spurious correlations in training data fundamentally grows with the expressiveness of the model and the narrowness of the training distribution. This can create problems for the pre-training plus fine-tuning paradigm, where models are designed to be large to absorb information during pre-training, but are then fine-tuned on very narrow task distributions. For instance [HLW+20] observe that larger models do not necessarily generalize better out-of-distribution. There is evidence that suggests that the generalization achieved under this paradigm can be poor because the model is overly specific to the training distribution and does not generalize well outside it [YdC+19, MPL19]. Thus, the performance of fine-tuned models on specific benchmarks, even when it is nominally at human-level, may exaggerate actual performance on the underlying task [GSL+18, NK19].

Third, humans do not require large supervised datasets to learn most language tasks – a brief directive in natural language (e.g. “please tell me if this sentence describes something happy or something sad”) or at most a tiny number of demonstrations (e.g. “here are two examples of people acting brave; please give a third example of bravery”) is often sufficient to enable a human to perform a new task to at least a reasonable degree of competence. Aside from pointing to a conceptual limitation in our current NLP techniques, this adaptability has practical advantages – it allows humans to seamlessly mix together or switch between many tasks and skills, for example performing addition during a lengthy dialogue. To be broadly useful, we would someday like our NLP systems to have this same fluidity and generality.

Summary:

Problem 1: there are too many downstream tasks; collecting a dedicated labeled dataset for every one of them is unrealistic.

Problem 2: a large model fine-tuned on a narrow task distribution tends to exploit spurious correlations in the training data, so its benchmark scores can overstate its real ability on the underlying task.

Problem 3: compared with humans, who need only a short instruction or a couple of demonstrations, current systems' ability to generalize to new tasks falls short.

One potential route towards addressing these issues is meta-learning1 – which in the context of language models means the model develops a broad set of skills and pattern recognition abilities at training time, and then uses those abilities at inference time to rapidly adapt to or recognize the desired task (illustrated in Figure 1.1). Recent work [RWC+19] attempts to do this via what we call “in-context learning”, using the text input of a pretrained language model as a form of task specification: the model is conditioned on a natural language instruction and/or a few demonstrations of the task and is then expected to complete further instances of the task simply by predicting what comes next.

Summary:

meta-learning: during training, the model builds up a broad set of skills and pattern-recognition abilities.

in-context learning: at inference time, the few examples of the subtask are used only as conditioning text; the weights are never updated.

Each sequence carries data for a different implicit task, so a model trained on a huge number of such sequences is, loosely speaking, performing meta-learning across sequences (the outer loop), while picking up the pattern within a single sequence is in-context learning (the inner loop). A sketch of this view follows.
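A minimal sketch of this outer-loop/inner-loop picture (cf. the paper's Figure 1.1). The example sequences echo the figure's arithmetic, spelling-correction, and translation snippets, while train_on is a hypothetical placeholder:

```python
# Ordinary pre-training sequences often contain repeated instances of some
# implicit task, so plain next-token-prediction training (the outer loop)
# doubles as meta-learning, and picking up the pattern within one sequence
# is the inner, in-context loop. Sequences below are illustrative.

def train_on(sequence: str) -> None:
    """Placeholder for one SGD step of next-token prediction."""
    pass

sequences = [
    "5 + 8 = 13\n7 + 2 = 9\n1 + 0 = 1\n3 + 4 = 7",          # arithmetic
    "gaot => goat\nsakne => snake\nbrid => bird",            # spelling correction
    "hello => bonjour\nthanks => merci\ncheese => fromage",  # translation
]

for seq in sequences:
    # No task labels or task IDs are given; the "task" exists only as a
    # pattern inside the sequence itself.
    train_on(seq)
```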

While it has shown some initial promise, this approach still achieves results far inferior to fine-tuning – for example [RWC+19] achieves only 4% on Natural Questions, and even its 55 F1 CoQa result is now more than 35 points behind the state of the art. Meta-learning clearly requires substantial improvement in order to be viable as a practical method of solving language tasks.

Another recent trend in language modeling may offer a way forward. In recent years the capacity of transformer language models has increased substantially, from 100 million parameters [RNSS18], to 300 million parameters [DCLT18], to 1.5 billion parameters [RWC+19], to 8 billion parameters [SPP+19], 11 billion parameters [RSR+19], and finally 17 billion parameters [Tur20]. Each increase has brought improvements in text synthesis and/or downstream NLP tasks, and there is evidence suggesting that log loss, which correlates well with many downstream tasks, follows a smooth trend of improvement with scale [KMH+20]. Since in-context learning involves absorbing many skills and tasks within the parameters of the model, it is plausible that in-context learning abilities might show similarly strong gains with scale.

Summary:

Models just keep getting bigger, and each jump in scale has helped; [KMH+20] reports that the log loss improves smoothly as a power law in scale, as illustrated below.
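To make the power-law claim from [KMH+20] concrete, here is a toy sketch. The exponent and constant have the shape reported in that paper, but all the numbers, and the loss values this prints, should be treated as illustrative rather than measured:

```python
# Toy illustration of the smooth power-law trend cited from [KMH+20]:
# validation loss as a function of parameter count N roughly follows
# L(N) = (N_c / N) ** alpha. Constants and sizes are illustrative.
import numpy as np

alpha, N_c = 0.076, 8.8e13
sizes = np.array([125e6, 350e6, 1.3e9, 6.7e9, 13e9, 175e9])  # GPT-3 family scale

loss = (N_c / sizes) ** alpha
for n, l in zip(sizes, loss):
    print(f"N = {n:10.3g} params -> predicted loss ≈ {l:.3f}")
# On a log-log plot these points form a straight line: every extra order of
# magnitude in parameters buys a roughly constant fractional drop in loss.
```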

In this paper, we test this hypothesis by training a 175 billion parameter autoregressive language model, which we call GPT-3, and measuring its in-context learning abilities. Specifically, we evaluate GPT-3 on over two dozen NLP datasets, as well as several novel tasks designed to test rapid adaptation to tasks unlikely to be directly contained in the training set. For each task, we evaluate GPT-3 under 3 conditions: (a) “few-shot learning”, or in-context learning where we allow as many demonstrations as will fit into the model’s context window (typically 10 to 100), (b) “one-shot learning”, where we allow only one demonstration, and (c) “zero-shot” learning, where no demonstrations are allowed and only an instruction in natural language is given to the model. GPT-3 could also in principle be evaluated in the traditional fine-tuning setting, but we leave this to future work.

Summary:

Three evaluation settings: few-shot, one-shot, zero-shot.

(Figure note: dashed lines are the individual subtasks; the solid line is the average performance over the subtasks.)

At the same time, we also find some tasks on which few-shot performance struggles, even at the scale of GPT-3. This includes natural language inference tasks like the ANLI dataset, and some reading comprehension datasets like RACE or QuAC. By presenting a broad characterization of GPT-3’s strengths and weaknesses, including these limitations, we hope to stimulate study of few-shot learning in language models and draw attention to where progress is most needed.

We also undertake a systematic study of “data contamination” – a growing problem when training high capacity models on datasets such as Common Crawl, which can potentially include content from test datasets simply because such content often exists on the web. In this paper we develop systematic tools to measure data contamination and quantify its distorting effects. Although we find that data contamination has a minimal effect on GPT-3’s performance on most datasets, we do identify a few datasets where it could be inflating results, and we either do not report results on these datasets or we note them with an asterisk, depending on the severity.

Approach

zero-shot: give the model only a prompt.

one-shot: after the prompt, insert one worked example as a demonstration, hoping the attention mechanism extracts useful information from it to help the prediction.

few-shot: after the prompt, insert several examples as demonstrations (as many as fit in the context window); see the sketch after these definitions.
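A minimal sketch of the three settings, assuming the simple "input => output" formatting used in the paper's translation figures; the demonstration counts and contents here are illustrative:

```python
# The three settings differ only in how many demonstrations are packed
# into the context window; the weights are identical in all three cases.

def build_prompt(instruction, demos, query):
    """Zero-shot: demos = []. One-shot: one demo. Few-shot: as many demos
    as fit in the context window (typically 10 to 100 in the paper)."""
    lines = [instruction]
    for x, y in demos:
        lines.append(f"{x} => {y}")
    lines.append(f"{query} =>")
    return "\n".join(lines)

demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]

print(build_prompt("Translate English to French:", [], "cheese"))         # zero-shot
print(build_prompt("Translate English to French:", demos[:1], "cheese"))  # one-shot
print(build_prompt("Translate English to French:", demos, "cheese"))      # few-shot
```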

Model and Architectures

We use the same model and architecture as GPT-2 [RWC+19], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [CGRS19]. To study the dependence of ML performance on model size, we train 8 different sizes of model, ranging over three orders of magnitude from 125 million parameters to 175 billion parameters, with the last being the model we call GPT-3. Previous work [KMH+20] suggests that with enough training data, scaling of validation loss should be approximately a smooth power law as a function of size; training models of many different sizes allows us to test this hypothesis both for validation loss and for downstream language tasks.
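To visualize the one architectural change relative to GPT-2, here is a rough sketch of alternating dense and locally banded causal attention masks; the band width and the alternation scheme are assumptions for illustration, not GPT-3's published configuration:

```python
# Layers alternate between dense causal attention and locally banded sparse
# attention, in the spirit of the Sparse Transformer [CGRS19]. Band width
# and even/odd alternation below are illustrative guesses.
import numpy as np

def dense_mask(n_ctx: int) -> np.ndarray:
    # Causal mask: position i may attend to every position j <= i.
    return np.tril(np.ones((n_ctx, n_ctx), dtype=bool))

def banded_mask(n_ctx: int, band: int) -> np.ndarray:
    # Causal mask restricted to the `band` most recent positions.
    mask = np.zeros((n_ctx, n_ctx), dtype=bool)
    for i in range(n_ctx):
        mask[i, max(0, i - band + 1): i + 1] = True
    return mask

n_layers, n_ctx, band = 4, 8, 3
layer_masks = [dense_mask(n_ctx) if layer % 2 == 0 else banded_mask(n_ctx, band)
               for layer in range(n_layers)]

print(layer_masks[0].astype(int))  # dense causal pattern
print(layer_masks[1].astype(int))  # locally banded causal pattern
```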
