Paper notes: What is the difference between BERT and GPT?

Foundation Models, Transformers, BERT and GPT

To summarize:

  • BERT learns vector representations: the embedding of a word is related, through attention, to the other important words in the sentence, so what is ultimately learned is a contextual word-vector representation. That is also why BERT transfers so easily to downstream tasks: you add a few MLP layers on top of these features for classification and the like, which is what fine-tuning means. For pretraining, BERT uses a MASK (cloze) objective: the other words in the sentence are used to predict the blanked-out word. This is self-supervised learning (no sentence labels are needed, only the blanking), and it is also why BERT does not need a decoder.

  • GPT does generation: the output is the probability that a particular word is chosen as the next word. Given a sentence, it generates the next word, appends that word to the sentence, feeds the new sentence back into the model, and generates the next word again, over and over. I can see why a decoder is enough for this task, but why is no encoder involved in the process? To be filled in once I have read further. (A minimal sketch of both training objectives follows this list.)
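
To make the two objectives concrete, here is a minimal PyTorch sketch (my own illustration, not from any paper; random logits stand in for a real model): BERT's masked-language-modeling loss only scores the masked position, while GPT's causal loss scores every position against the next token.

```python
import torch
import torch.nn.functional as F

vocab_size = 100
tokens = torch.tensor([[5, 12, 7, 3, 9]])      # a toy "sentence" of 5 token ids
logits = torch.randn(1, 5, vocab_size)         # stand-in for a model's output scores

# BERT-style masked LM: only the masked position (here index 2) is predicted,
# using context from both sides; other positions are ignored via label -100.
mlm_labels = torch.full_like(tokens, -100)
mlm_labels[0, 2] = tokens[0, 2]
mlm_loss = F.cross_entropy(
    logits.view(-1, vocab_size), mlm_labels.view(-1), ignore_index=-100
)

# GPT-style causal LM: every position predicts the *next* token,
# so the targets are simply the inputs shifted left by one.
clm_loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size), tokens[:, 1:].reshape(-1)
)
print(mlm_loss.item(), clm_loss.item())
```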

BERT and GPT are both pretrained models; at the pretraining stage the difference lies only in the choice of objective: BERT is trained with a cloze-style (masked) objective, while GPT is trained to predict the next word given the preceding text. At the fine-tuning stage, GPT combines two objective functions (the language-modeling loss and the task loss), whereas BERT adds a few task-specific layers on top of the semantic features.

The route GPT chose is harder than BERT's: predicting the future is much harder than predicting a masked-out middle. That is also why OpenAI had to keep scaling the model up to reach the level of GPT-3.5 and GPT-4.


Additional notes
Li Mu's video actually covered this as well, but it didn't stick. A lesson on the importance of learning with a concrete question in mind. -_-

The Transformer has two parts, an encoder and a decoder. The difference is that when the encoder extracts features for the i-th element it can see every element in the sequence, while the decoder, because of the mask, can only see the current element and the elements before it: the positions after the current one are masked so that their attention weights become 0. Since this is a standard language model that only predicts forward, the i-th word must be predicted without seeing the words after it, which is why GPT (Generative Pre-Training) originally used only the decoder.
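
A small NumPy sketch of that masking step (my own illustration): scores for future positions are set to negative infinity, so after the softmax their attention weights become exactly 0.

```python
import numpy as np

n = 5                                               # sequence length
scores = np.random.randn(n, n)                      # raw attention scores, one row per position
future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal = later tokens
scores[future] = -np.inf                            # the decoder's causal mask

scores -= scores.max(axis=-1, keepdims=True)        # numerically stable softmax
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
# weights[i, j] == 0 for every j > i: position i only attends to itself and earlier positions
```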


Study link (blog, reposted in full below)

  • https://heidloff.net/article/foundation-models-transformers-bert-and-gpt/

Since I’m excited by the incredible capabilities which technologies like ChatGPT and Bard provide, I’m trying to understand better how they work. This post summarizes my current understanding about foundation models, transformers, BERT and GPT.

Note that I’m only learning these concepts and not everything might be fully correct but might help some people to understand the high level concepts.

I know that there are many more and more modern Foundation Models than BERT and GPT, but I want to start ‘simple’ and these two models are probably the most known ones these days.

The technologies below are not trivial and there are lots of articles, papers and full courses even on certain aspects of each technology only. Instead of going into detail, I try to explain what they do and what concepts they use.

Foundation Models

BERT and GPT are both foundation models. Let’s look at the definition and characteristics:

  • Pre-trained on different types of unlabeled datasets (e.g., language and images)
  • Self-supervised learning
  • Generalized data representations which can be used in multiple downstream tasks (e.g., classification and generation)
  • The Transformer architecture is mostly used, but not mandatory

Read my blog Foundation Models at IBM to find out more.

Transformer Architecture

Most foundation models use the transformer architecture. Let’s look at the definition:

A transformer is a deep learning model that adopts the mechanism of self-attention, differentially weighting the significance of each part of the input data. It is used primarily in the fields of natural language processing and computer vision.

In 2017 transformers were introduced: Attention is all you need. They are the successors of Recurrent Neural Network and Long Short-Term Memory architectures and have several benefits:

  • Parallel processing: Increases performance and scalability
  • Bidirectionality: Allows understanding of ambiguous words and coreferences

The original transformer architecture defines two main parts, an encoder and a decoder. However, not all foundation models use both parts. BERT only uses encoders, GPT only decoders. More on this later.

Attention

Both encoders and decoders use the concept of ‘attention’. Attention basically means to focus on the important pieces of information and to blend out the unimportant pieces. I like to compare this with ‘fast reading’. Rather than reading full articles or even full books, I often browse chapter titles, first words of paragraphs and scan paragraphs for keywords to find what I’m looking for.

The words of an article, the parts of an image or the words in a sentence that should get most attention change dependent on what you are looking for. Let’s look at a simple example sentence:

“Sarah went to a restaurant to meet her friend that night.”

The following words should get attention for the following queries:

  • What? -> ‘went’, ‘meet’
  • Where? -> ‘a restaurant’
  • Who? -> ‘Sarah’, ‘her friend’
  • When? -> ‘that night’

To determine the attention of words (more exactly, tokens), encoders and decoders in transformers use ‘queries’, ‘keys’ and ‘values’, all of which are represented as vectors. For a given query, the keys whose vectors are closest to the query vector are selected. Keys are an encoded representation of values; in simple cases they can be the same.
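
As a rough illustration of how queries, keys and values interact (my own sketch, not the exact algorithm of any particular model), scaled dot-product attention compares each query against every key and uses the resulting weights to mix the values:

```python
import numpy as np

tokens, d = 6, 4                       # 6 tokens, tiny embedding dimension
Q = np.random.randn(tokens, d)         # one query vector per token
K = np.random.randn(tokens, d)         # one key vector per token
V = np.random.randn(tokens, d)         # one value vector per token (may equal K in simple cases)

scores = Q @ K.T / np.sqrt(d)          # how strongly each query matches each key
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # rows sum to 1
attended = weights @ V                 # each token's output is a weighted mix of all values
```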

There are different algorithms to implement the attention concept. I think an easy way to understand how this can work is to rank words highly that are often used together in sentences. For example, ‘where’ and ‘restaurant’ probably have a closer relation than ‘restaurant’ and ‘faith’. So, for the query ‘where’ the word ‘restaurant’ gets more attention.

Encoders and Decoders

As mentioned, there are encoders and decoders. BERT uses encoders only, GPT uses decoders only. Both options understand language including syntax and semantics. Especially the next generation of large language models like GPT with billions of parameters does this very well.

The two models focus on different scenarios. However, since the field of foundation models is evolving, the differentiation is often fuzzier.

  • BERT (encoder): classification (e.g., sentiment), questions and answers, summarization, named entity recognition
  • GPT (decoder): translation, generation (e.g., stories)

The outputs of the core models are different:

  • BERT (encoder): Embeddings representing words with attention information in a certain context
  • GPT (decoder): Next words with probabilities
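
The difference in outputs can be seen directly with the Hugging Face transformers library. This is only a sketch and assumes the bert-base-uncased and gpt2 checkpoints can be downloaded: the encoder returns one embedding per token, the decoder returns a probability distribution over the next token.

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForCausalLM

sentence = "Sarah went to a restaurant to meet her friend that night."

# BERT (encoder): one contextual embedding per token
bert_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
with torch.no_grad():
    embeddings = bert(**bert_tok(sentence, return_tensors="pt")).last_hidden_state
print(embeddings.shape)                        # (1, number_of_tokens, 768)

# GPT-2 (decoder): probabilities for the next token
gpt_tok = AutoTokenizer.from_pretrained("gpt2")
gpt = AutoModelForCausalLM.from_pretrained("gpt2")
with torch.no_grad():
    logits = gpt(**gpt_tok("Sarah went to a", return_tensors="pt")).logits
next_probs = logits[0, -1].softmax(dim=-1)
print(gpt_tok.decode(int(next_probs.argmax())))  # the single most likely next token
```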

Both models are pretrained and can be reused without intensive training. Some of them are available as open source and can be downloaded from communities like Hugging Face, others are commercial. Reuse is important, since training is often very resource-intensive and expensive, which few companies can afford.

The pretrained models can be extended and customized for different domains and specific tasks. Layers can sometimes be reused without modifications and more layers are added on top. If layers need to be modified, the new training is more expensive. The technique to customize these models is called Transfer Learning, since the same generic model can easily be transferred to other domains.
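
A hedged sketch of that reuse pattern (the model name and the two-class head are just assumptions for illustration): the pretrained layers are kept frozen and only a small new layer on top is trained for the downstream task.

```python
import torch.nn as nn
from transformers import AutoModel

encoder = AutoModel.from_pretrained("bert-base-uncased")   # pretrained model, reused as-is
for p in encoder.parameters():
    p.requires_grad = False                                # don't modify the existing layers

head = nn.Linear(encoder.config.hidden_size, 2)            # new layer, e.g. positive/negative

def classify(batch):
    hidden = encoder(**batch).last_hidden_state[:, 0]      # embedding of the first ([CLS]) token
    return head(hidden)                                    # only `head` is trained
```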

BERT - Encoders

BERT uses the encoder part of the transformer architecture so that it understands semantic and syntactic language information. The outputs of BERT are embeddings, not predicted next words. To leverage these embeddings, other layer(s) need to be added on top, for example for text classification or question answering.

BERT uses a genius trick for the training. For supervised training it is often expensive to get labeled data, sometimes it’s impossible. The trick is to use masks as I described in my post Evolution of AI explained via a simple Sample. Let’s take a simple example, an unlabeled sentence:

“Sarah went to a restaurant to meet her friend that night.”

This is converted into:

  • Text: “Sarah went to a restaurant to meet her MASK that night.”
  • Label: “Sarah went to a restaurant to meet her friend that night.”

Note that this is a very simplified description only since there aren’t ‘real’ labels in BERT.

In other words, BERT produces labeled data for originally unlabeled data. This technique is called Self-Supervised Learning. It works very well for huge amounts of data.
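
A simplified Python sketch of how such (text, label) pairs can be produced from unlabeled sentences; the real BERT procedure masks roughly 15% of wordpiece tokens with a special [MASK] token, but the idea is the same:

```python
import random

def make_mlm_example(sentence, mask_token="MASK"):
    """Turn an unlabeled sentence into a (masked text, original word) training pair."""
    words = sentence.split()
    i = random.randrange(len(words))       # pick one word to hide
    label = words[i]                       # the model must recover this word from context
    words[i] = mask_token
    return " ".join(words), label

text, label = make_mlm_example("Sarah went to a restaurant to meet her friend that night.")
# e.g. text == "Sarah went to a restaurant to meet her MASK that night.", label == "friend"
```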

In masked language models like BERT, the prediction of each masked word (token) is conditioned on the rest of the tokens in the sentence. The encoder already sees all of these tokens, which is why you don’t need a decoder.
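
This can be tried directly with the transformers fill-mask pipeline (a sketch, assuming the bert-base-uncased checkpoint is available): the model proposes words for the [MASK] position using context from both sides.

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill("Sarah went to a restaurant to meet her [MASK] that night."):
    print(candidate["token_str"], round(candidate["score"], 3))  # top guesses and probabilities
```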

GPT - Decoders

In language scenarios decoders are used to generate next words, for example when translating text or generating stories. The outputs are words with probabilities.

Decoders also use the attention concept, in fact twice. First, when training models, they use Masked Multi-Head Attention, which means that only the first words of the target sentence are provided so that the model can learn without cheating. This mechanism is similar to the MASK concept from BERT.

After this the decoder uses Multi-Head Attention as it's also used in the encoder. Transformer-based models that utilize encoders and decoders use a trick to be more efficient: the output of the encoders is fed as input to the decoders, more precisely the keys and values. Decoders can issue queries to find the closest keys. This allows, for example, understanding the meaning of the original sentence and translating it into other languages even if the number of resulting words and their order change.
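
A rough NumPy sketch of that encoder-to-decoder connection (my own illustration; the encoder states serve as both keys and values, as in the simple case mentioned above): the decoder's queries look up the encoder's keys, and the matching values are mixed into each target position.

```python
import numpy as np

d = 4
encoder_states = np.random.randn(8, d)   # source sentence: 8 tokens (used as keys and values)
decoder_states = np.random.randn(3, d)   # target generated so far: 3 tokens (used as queries)

scores = decoder_states @ encoder_states.T / np.sqrt(d)   # target queries vs source keys
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
context = weights @ encoder_states       # source information injected into each target position
```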

GPT doesn’t use this trick though and only uses a decoder. This is possible since these types of models have been trained with massive amounts of data (Large Language Model). The knowledge of encoders is encoded in billions of parameters (also called weights). The same knowledge exists in decoders when trained with enough data.
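
The generate-append-repeat loop described in the summary at the top can be sketched with the gpt2 checkpoint (again just an assumption for illustration, using plain greedy decoding):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("Sarah went to a restaurant to", return_tensors="pt").input_ids
for _ in range(10):                                    # generate 10 more tokens
    with torch.no_grad():
        logits = model(ids).logits
    next_id = logits[0, -1].argmax()                   # greedy: take the most probable next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # append it and feed the text back in
print(tok.decode(ids[0]))
```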

Note that ChatGPT has evolved these techniques. To prevent hate, profanity and abuse, humans need to label some data first. Additionally, Reinforcement Learning is applied to improve the quality of the model (see ChatGPT: Optimizing Language Models for Dialogue).

Resources

There are many good articles, videos and courses. Here are some of the ones I read or watched:

  • Course: Natural Language Processing Demystified
  • YouTube channel: CodeEmporium
  • Article: What Is ChatGPT Doing … and Why Does It Work?
  • Article: 10 Things You Need to Know About BERT and the Transformer Architecture That Are Reshaping the AI Landscape
  • Article: Transformer’s Encoder-Decoder: Let’s Understand The Model Architecture
  • NLP - BERT & Transformer
