Paper Close Reading: GPT-3

Unlike GPT-2, which chased zero-shot performance, GPT-3 shifts the goal to few-shot.

Abstract

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art finetuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

Summary:

Bigger and stronger. Although the setting is few-shot, the model is never fine-tuned on the downstream task's samples; the few-shot demonstrations are delivered purely through the prompt.
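As a concrete picture of "tasks and demonstrations specified purely via text, no gradient updates", here is a minimal sketch. The happy/sad task echoes an instruction quoted later in the paper's introduction, but the exact formatting, labels, and query sentence are my own illustration:

```python
# Minimal sketch of few-shot prompting: the task description and a handful
# of demonstrations are placed directly in the input text, and the model is
# asked to continue it. No gradient updates occur. Task and format are
# illustrative, not taken from the paper.

demonstrations = [
    ("What a wonderful day!", "happy"),
    ("I lost my wallet again.", "sad"),
]

prompt = "Decide whether each sentence describes something happy or something sad.\n"
for sentence, label in demonstrations:          # few-shot demonstrations
    prompt += f"Sentence: {sentence}\nLabel: {label}\n"
prompt += "Sentence: The sun finally came out.\nLabel:"   # the query to complete

print(prompt)
# The model's continuation after the final "Label:" is read off as its answer.
```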

Introduction

Recent years have featured a trend towards pre-trained language representations in NLP systems, applied in increasingly flexible and task-agnostic ways for downstream transfer. First, single-layer representations were learned using word vectors [MCCD13, PSM14] and fed to task-specific architectures, then RNNs with multiple layers of representations and contextual state were used to form stronger representations [DL15, MBXS17, PNZtY18] (though still applied to task-specific architectures), and more recently pre-trained recurrent or transformer language models [VSP+17] have been directly fine-tuned, entirely removing the need for task-specific architectures [RNSS18, DCLT18, HR18].

Summary:

Fine-tuning a pre-trained model has become everyone's favorite recipe.

This last paradigm has led to substantial progress on many challenging NLP tasks such as reading comprehension, question answering, textual entailment, and many others, and has continued to advance based on new architectures and algorithms [RSR+19, LOG+19, YDY+19, LCG+19]. However, a major limitation to this approach is that while the architecture is task-agnostic, there is still a need for task-specific datasets and task-specific fine-tuning: to achieve strong performance on a desired task typically requires fine-tuning on a dataset of thousands to hundreds of thousands of examples specific to that task. Removing this limitation would be desirable, for several reasons.

First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the applicability of language models. There exists a very wide range of possible useful language tasks, encompassing anything from correcting grammar, to generating examples of an abstract concept, to critiquing a short story. For many of these tasks it is difficult to collect a large supervised training dataset, especially when the process must be repeated for every new task.

Second, the potential to exploit spurious correlations in training data fundamentally grows with the expressiveness of the model and the narrowness of the training distribution. This can create problems for the pre-training plus fine-tuning paradigm, where models are designed to be large to absorb information during pre-training, but are then fine-tuned on very narrow task distributions. For instance [HLW+20] observe that larger models do not necessarily generalize better out-of-distribution. There is evidence that suggests that the generalization achieved under this paradigm can be poor because the model is overly specific to the training distribution and does not generalize well outside it [YdC+19, MPL19]. Thus, the performance of fine-tuned models on specific benchmarks, even when it is nominally at human-level, may exaggerate actual performance on the underlying task [GSL+18, NK19].

Third, humans do not require large supervised datasets to learn most language tasks – a brief directive in natural language (e.g. “please tell me if this sentence describes something happy or something sad”) or at most a tiny number of demonstrations (e.g. “here are two examples of people acting brave; please give a third example of bravery”) is often sufficient to enable a human to perform a new task to at least a reasonable degree of competence. Aside from pointing to a conceptual limitation in our current NLP techniques, this adaptability has practical advantages – it allows humans to seamlessly mix together or switch between many tasks and skills, for example performing addition during a lengthy dialogue. To be broadly useful, we would someday like our NLP systems to have this same fluidity and generality.

Summary:

Problem 1: there are too many downstream tasks; collecting a dedicated labeled dataset for every one of them is unrealistic.

Problem 2: a large model fine-tuned on a narrow task distribution tends to exploit spurious correlations in the training data, so its benchmark scores can overstate its real ability on the underlying task.

Problem 3: compared with humans, who need only a short instruction or a couple of demonstrations, current systems' ability to generalize to new tasks falls short.

One potential route towards addressing these issues is meta-learning1 – which in the context of language models means the model develops a broad set of skills and pattern recognition abilities at training time, and then uses those abilities at inference time to rapidly adapt to or recognize the desired task (illustrated in Figure 1.1). Recent work [RWC+19] attempts to do this via what we call “in-context learning”, using the text input of a pretrained language model as a form of task specification: the model is conditioned on a natural language instruction and/or a few demonstrations of the task and is then expected to complete further instances of the task simply by predicting what comes next.

Summary:

meta-learning: during training, the model builds up a broad set of skills and pattern-recognition abilities.

in-context learning: at inference time, the few examples of the subtask are used only as conditioning text; the weights are never updated.

Each sequence carries data for a different implicit task, so a model trained on a huge number of such sequences is, loosely speaking, performing meta-learning across sequences (the outer loop), while picking up the pattern within a single sequence is in-context learning (the inner loop). A sketch of this view follows.
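A minimal sketch of this outer-loop/inner-loop picture (cf. the paper's Figure 1.1). The example sequences echo the figure's arithmetic, spelling-correction, and translation snippets, while train_on is a hypothetical placeholder:

```python
# Ordinary pre-training sequences often contain repeated instances of some
# implicit task, so plain next-token-prediction training (the outer loop)
# doubles as meta-learning, and picking up the pattern within one sequence
# is the inner, in-context loop. Sequences below are illustrative.

def train_on(sequence: str) -> None:
    """Placeholder for one SGD step of next-token prediction."""
    pass

sequences = [
    "5 + 8 = 13\n7 + 2 = 9\n1 + 0 = 1\n3 + 4 = 7",          # arithmetic
    "gaot => goat\nsakne => snake\nbrid => bird",            # spelling correction
    "hello => bonjour\nthanks => merci\ncheese => fromage",  # translation
]

for seq in sequences:
    # No task labels or task IDs are given; the "task" exists only as a
    # pattern inside the sequence itself.
    train_on(seq)
```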

While it has shown some initial promise, this approach still achieves results far inferior to fine-tuning – for example [RWC+19] achieves only 4% on Natural Questions, and even its 55 F1 CoQa result is now more than 35 points behind the state of the art. Meta-learning clearly requires substantial improvement in order to be viable as a practical method of solving language tasks.

Another recent trend in language modeling may offer a way forward. In recent years the capacity of transformer language models has increased substantially, from 100 million parameters [RNSS18], to 300 million parameters [DCLT18], to 1.5 billion parameters [RWC+19], to 8 billion parameters [SPP+19], 11 billion parameters [RSR+19], and finally 17 billion parameters [Tur20]. Each increase has brought improvements in text synthesis and/or downstream NLP tasks, and there is evidence suggesting that log loss, which correlates well with many downstream tasks, follows a smooth trend of improvement with scale [KMH+20]. Since in-context learning involves absorbing many skills and tasks within the parameters of the model, it is plausible that in-context learning abilities might show similarly strong gains with scale.

Summary:

Models just keep getting bigger, and each jump in scale has helped; [KMH+20] reports that the log loss improves smoothly as a power law in scale, as illustrated below.
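To make the power-law claim from [KMH+20] concrete, here is a toy sketch. The exponent and constant have the shape reported in that paper, but all the numbers, and the loss values this prints, should be treated as illustrative rather than measured:

```python
# Toy illustration of the smooth power-law trend cited from [KMH+20]:
# validation loss as a function of parameter count N roughly follows
# L(N) = (N_c / N) ** alpha. Constants and sizes are illustrative.
import numpy as np

alpha, N_c = 0.076, 8.8e13
sizes = np.array([125e6, 350e6, 1.3e9, 6.7e9, 13e9, 175e9])  # GPT-3 family scale

loss = (N_c / sizes) ** alpha
for n, l in zip(sizes, loss):
    print(f"N = {n:10.3g} params -> predicted loss ≈ {l:.3f}")
# On a log-log plot these points form a straight line: every extra order of
# magnitude in parameters buys a roughly constant fractional drop in loss.
```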

In this paper, we test this hypothesis by training a 175 billion parameter autoregressive language model, which we call GPT-3, and measuring its in-context learning abilities. Specifically, we evaluate GPT-3 on over two dozen NLP datasets, as well as several novel tasks designed to test rapid adaptation to tasks unlikely to be directly contained in the training set. For each task, we evaluate GPT-3 under 3 conditions: (a) “few-shot learning”, or in-context learning where we allow as many demonstrations as will fit into the model’s context window (typically 10 to 100), (b) “one-shot learning”, where we allow only one demonstration, and (c) “zero-shot” learning, where no demonstrations are allowed and only an instruction in natural language is given to the model. GPT-3 could also in principle be evaluated in the traditional fine-tuning setting, but we leave this to future work.

Summary:

Three evaluation settings: few-shot, one-shot, zero-shot.

(Figure note: dashed lines are the individual subtasks; the solid line is the average performance over the subtasks.)

At the same time, we also find some tasks on which few-shot performance struggles, even at the scale of GPT-3. This includes natural language inference tasks like the ANLI dataset, and some reading comprehension datasets like RACE or QuAC. By presenting a broad characterization of GPT-3’s strengths and weaknesses, including these limitations, we hope to stimulate study of few-shot learning in language models and draw attention to where progress is most needed.

We also undertake a systematic study of “data contamination” – a growing problem when training high capacity models on datasets such as Common Crawl, which can potentially include content from test datasets simply because such content often exists on the web. In this paper we develop systematic tools to measure data contamination and quantify its distorting effects. Although we find that data contamination has a minimal effect on GPT-3’s performance on most datasets, we do identify a few datasets where it could be inflating results, and we either do not report results on these datasets or we note them with an asterisk, depending on the severity.

Approach

zero-shot: give the model only a prompt.

one-shot: after the prompt, insert one worked example as a demonstration, hoping the attention mechanism extracts useful information from it to help the prediction.

few-shot: after the prompt, insert several examples as demonstrations (as many as fit in the context window); see the sketch after these definitions.
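A minimal sketch of the three settings, assuming the simple "input => output" formatting used in the paper's translation figures; the demonstration counts and contents here are illustrative:

```python
# The three settings differ only in how many demonstrations are packed
# into the context window; the weights are identical in all three cases.

def build_prompt(instruction, demos, query):
    """Zero-shot: demos = []. One-shot: one demo. Few-shot: as many demos
    as fit in the context window (typically 10 to 100 in the paper)."""
    lines = [instruction]
    for x, y in demos:
        lines.append(f"{x} => {y}")
    lines.append(f"{query} =>")
    return "\n".join(lines)

demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]

print(build_prompt("Translate English to French:", [], "cheese"))         # zero-shot
print(build_prompt("Translate English to French:", demos[:1], "cheese"))  # one-shot
print(build_prompt("Translate English to French:", demos, "cheese"))      # few-shot
```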

Model and Architectures

We use the same model and architecture as GPT-2 [RWC+19], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [CGRS19]. To study the dependence of ML performance on model size, we train 8 different sizes of model, ranging over three orders of magnitude from 125 million parameters to 175 billion parameters, with the last being the model we call GPT-3. Previous work [KMH+20] suggests that with enough training data, scaling of validation loss should be approximately a smooth power law as a function of size; training models of many different sizes allows us to test this hypothesis both for validation loss and for downstream language tasks.
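To visualize the one architectural change relative to GPT-2, here is a rough sketch of alternating dense and locally banded causal attention masks; the band width and the alternation scheme are assumptions for illustration, not GPT-3's published configuration:

```python
# Layers alternate between dense causal attention and locally banded sparse
# attention, in the spirit of the Sparse Transformer [CGRS19]. Band width
# and even/odd alternation below are illustrative guesses.
import numpy as np

def dense_mask(n_ctx: int) -> np.ndarray:
    # Causal mask: position i may attend to every position j <= i.
    return np.tril(np.ones((n_ctx, n_ctx), dtype=bool))

def banded_mask(n_ctx: int, band: int) -> np.ndarray:
    # Causal mask restricted to the `band` most recent positions.
    mask = np.zeros((n_ctx, n_ctx), dtype=bool)
    for i in range(n_ctx):
        mask[i, max(0, i - band + 1): i + 1] = True
    return mask

n_layers, n_ctx, band = 4, 8, 3
layer_masks = [dense_mask(n_ctx) if layer % 2 == 0 else banded_mask(n_ctx, band)
               for layer in range(n_layers)]

print(layer_masks[0].astype(int))  # dense causal pattern
print(layer_masks[1].astype(int))  # locally banded causal pattern
```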
