Qwen Paper Reading Notes


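These notes are only my own first pass at getting familiar with large language models and organizing what I have read. I will revise, extend, and trim them over time. Some details are not yet explained, and I hope to fill them in as I keep learning. If you are working on the same topics, corrections and discussion are welcome.

As for the later CODE-QWEN and MATH-QWEN sections, I think they build on the first three parts. I have only skimmed them and have not studied them closely, so they are not recorded here for now.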


1. Abstract (supplemented from the Introduction)

QWEN is a comprehensive language model series that encompasses distinct models with varying parameter counts.

It includes QWEN, the base pretrained language models, and QWEN-CHAT, the chat models finetuned with human alignment techniques.

The base language models consistently demonstrate superior performance across a multitude of downstream tasks, and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive.


The two sentences above are easier to understand together with the following passage and figure:

The model series include the base pretrained language models,
chat models finetuned with human alignment techniques, i.e., supervised finetuning (SFT), reinforcement learning with human feedback (RLHF), etc.,
as well as specialized models in coding and math.


Figure 1: Model Lineage of the Qwen Series.
We have pretrained the language models, namely QWEN, on massive datasets containing trillions of tokens.

We then use SFT and RLHF to align QWEN to human preference and thus we have QWEN-CHAT and specifically its improved version QWEN-CHAT-RLHF.

Additionally, we also develop specialized models for coding and mathematics, such as CODE-QWEN, CODE-QWEN-CHAT, and MATH-QWEN-CHAT based on QWEN with similar techniques.

Note that we previously released the multimodal LLM, QWEN-VL and QWEN-VL-CHAT (Bai et al., 2023), which are also based on our QWEN base models.


The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter.


This sentence is best read together with the following passage from the Introduction:

LLMs are not just limited to language tasks.

They can also function as a generalist agent, collaborating with external systems, tools, and models to achieve the objectives set by humans.

For example, LLMs can understand multimodal instructions, execute code, use tools, and more.


2. Pretraining

2.1 Data

To ensure the quality of our pretraining data, we have developed a comprehensive data preprocessing procedure.

2.2 Tokenization

We utilize byte pair encoding (BPE) as our tokenization method.

2.3 Model

2.3.1 Architecture

QWEN is designed using a modified version of the Transformer architecture.

Specifically, we have adopted the recent open-source approach of training large language models, LLaMA (Touvron et al., 2023a).

Our modifications to the architecture include:

  • Embedding and output projection.
    Based on preliminary experimental findings, we have opted for the untied embedding approach instead of tying the weights of input embedding and output projection.
    This decision was made in order to achieve better performance with the price of memory costs.

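“Output projection” refers to the model's output layer, which maps the hidden representation back to the vocabulary (token) space to produce the final output.

The “untied embedding approach” means that the input embedding and the output projection have independent weights that are not shared. This contrasts with “tied embedding” or “weight tying”, which uses the same weights for both in order to reduce the parameter count and, as is often argued, the risk of overfitting.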
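The memory cost of the untied choice can be made concrete with a little arithmetic. The sketch below is only an illustration: the vocabulary size and hidden size are hypothetical round numbers, not values taken from the paper.

```python
def embedding_params(vocab_size, hidden_size, tied):
    """Count parameters in the input embedding plus the output projection."""
    input_embedding = vocab_size * hidden_size
    # With weight tying, the output projection reuses the embedding matrix.
    output_projection = 0 if tied else vocab_size * hidden_size
    return input_embedding + output_projection


# Hypothetical sizes, chosen only for illustration.
vocab_size, hidden_size = 150_000, 4096
tied_count = embedding_params(vocab_size, hidden_size, tied=True)
untied_count = embedding_params(vocab_size, hidden_size, tied=False)
print(f"tied:   {tied_count:,} parameters")
print(f"untied: {untied_count:,} parameters")
```

Untying doubles the parameters spent on these two matrices, which is the memory price the quoted passage refers to.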


  • Positional embedding.
    We have chosen RoPE (Rotary Positional Embedding) as our preferred option for incorporating positional information into our model.
    In particular, we have opted to use FP32 precision for the inverse frequency matrix, rather than BF16 or FP16, in order to prioritize model performance and achieve higher accuracy.
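To make the role of the inverse frequencies clearer, here is a minimal RoPE sketch in plain Python. It assumes the common base of 10000 and the convention of rotating consecutive pairs of dimensions (some implementations pair dimension i with i + d/2 instead). Python floats are 64-bit, so this shows the mechanism only, not the FP32 versus BF16 precision trade-off.

```python
import math


def rope_inv_freq(head_dim, base=10000.0):
    """Inverse frequencies, one per pair of dimensions."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vec, position, inv_freq):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle position * inv_freq[i]."""
    out = []
    for i, freq in enumerate(inv_freq):
        angle = position * freq
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


inv_freq = rope_inv_freq(head_dim=4)
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# The score depends only on the relative offset (2 in both cases).
score_near = dot(apply_rope(q, 5, inv_freq), apply_rope(k, 3, inv_freq))
score_far = dot(apply_rope(q, 105, inv_freq), apply_rope(k, 103, inv_freq))
print(abs(score_near - score_far) < 1e-9)
```

Because the angles are products of a position and an inverse frequency, rounding errors in the inverse frequencies grow with position, which is one plausible reason to keep them in higher precision.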

  • Bias.
    For most layers, we remove biases following Chowdhery et al. (2022), but we add biases in the QKV layer of attention to enhance the extrapolation ability of the model (Su, 2023b).


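Following Chowdhery et al., removing the bias terms from most layers is meant to improve the model's stability and performance. It also reduces the number of parameters slightly, which lowers model complexity and, in principle, the risk of overfitting. Removing bias terms is further said to help the model generalize during training, so that it behaves more consistently across datasets.

Adding bias terms in the QKV layer of attention: Su's work points out that adding bias terms in the QKV (Query, Key, Value) layer of the attention mechanism can enhance the model's extrapolation ability.

The QKV layer is the core of the attention mechanism. Adding bias terms there is intended to help it capture features of the input, which improves the model's behavior on inputs it has not seen. In this context, extrapolation mainly refers to handling inputs longer than those seen during training.

In short, biases are removed from most layers for the sake of generalization and stability, while biases are added in the QKV layer to strengthen extrapolation.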
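To see how small the bias terms are compared with the weights, the sketch below counts both for one attention block. It assumes standard multi-head attention in which the Q, K, V, and output projections are each hidden_size by hidden_size; the hidden size is a hypothetical value.

```python
def attention_params(hidden_size, qkv_bias=True, out_bias=False):
    """Count weights and biases in the linear projections of one attention block."""
    qkv_weights = 3 * hidden_size * hidden_size
    qkv_biases = 3 * hidden_size if qkv_bias else 0
    out_weights = hidden_size * hidden_size
    out_biases = hidden_size if out_bias else 0
    return {
        "weights": qkv_weights + out_weights,
        "biases": qkv_biases + out_biases,
    }


# Biases on Q, K, V only; none on the output projection.
counts = attention_params(hidden_size=4096)
print(counts)
```

Under these assumptions the QKV biases add only a few thousand values next to tens of millions of weights, so the choice is about model behavior rather than size.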


  • Pre-Norm & RMSNorm.
    In modern Transformer models, pre-normalization is the most widely used approach, which has been shown to improve training stability compared to post-normalization.
    Additionally, we have replaced the traditional layer normalization technique described in (Ba et al., 2016) with RMSNorm (Jiang et al., 2023). This change has resulted in equivalent performance while also improving efficiency.

  • Activation function.
    We have selected SwiGLU (Shazeer, 2020) as our activation function, a combination of Swish (Ramachandran et al., 2017) and Gated Linear Unit (Dauphin et al., 2017).
    Our initial experiments have shown that activation functions based on GLU generally outperform other baseline options, such as GeLU (Hendrycks & Gimpel, 2016).
    As is common practice in previous research, we have reduced the dimension of the feed-forward network (FFN) from 4 times the hidden size to 8/3 of the hidden size.
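The last two items can be sketched together as one pre-norm feed-forward block. This is a toy illustration in plain Python, not the model's actual code: the weight values are arbitrary, and the function names are my own.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide by the root mean square; no mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def matvec(matrix, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def silu(v):
    """Swish with beta = 1, i.e. v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: w_down @ (silu(w_gate @ x) * (w_up @ x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])


def pre_norm_ffn_block(x, norm_weight, w_gate, w_up, w_down):
    """Pre-norm residual block: x + FFN(RMSNorm(x))."""
    update = swiglu_ffn(rms_norm(x, norm_weight), w_gate, w_up, w_down)
    return [a + b for a, b in zip(x, update)]


hidden = 3
ffn = hidden * 8 // 3  # 8/3 of the hidden size

# Arbitrary toy weights.
w_gate = [[0.1 * (r + c + 1) for c in range(hidden)] for r in range(ffn)]
w_up = [[0.05 * (r - c) for c in range(hidden)] for r in range(ffn)]
w_down = [[0.02 * (r + 1) for _ in range(ffn)] for r in range(hidden)]

out = pre_norm_ffn_block([1.0, -2.0, 0.5], [1.0] * hidden, w_gate, w_up, w_down)
print(out)
```

SwiGLU uses three weight matrices instead of two, so shrinking the inner dimension to 8/3 of the hidden size keeps the parameter count equal to a classic two-matrix FFN with a 4x inner dimension: 3 * h * (8/3)h = 2 * h * 4h = 8h².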

2.3.2 Context length extension

Transformer models have a significant limitation in terms of the context length for their attention mechanism.

As the context length increases, the quadratic-complexity computation leads to a drastic increase in both computation and memory costs.
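That is, the computational cost grows with the square of the input length.

To address this, the authors mention four key techniques:

  • NTK-aware interpolation (bloc97, 2023)
  • dynamic NTK-aware interpolation (Peng et al., 2023a)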
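As a rough sketch of the two interpolation ideas, the code below uses a widely circulated formulation: NTK-aware interpolation enlarges the RoPE base by scale^(d / (d - 2)), and the dynamic variant picks the scale from the current sequence length. The exact formulas and the numbers used here are assumptions for illustration and are not taken from the Qwen paper.

```python
def ntk_scaled_base(base, head_dim, scale):
    """NTK-aware interpolation: enlarge the RoPE base instead of shrinking positions."""
    return base * scale ** (head_dim / (head_dim - 2))


def inv_freq(base, head_dim):
    """RoPE inverse frequencies for a given base."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def dynamic_scale(seq_len, train_len):
    """Dynamic variant: scale only once the sequence exceeds the training length."""
    return max(1.0, seq_len / train_len)


head_dim, base, train_len = 128, 10000.0, 2048  # hypothetical settings
scale = dynamic_scale(seq_len=8192, train_len=train_len)

original = inv_freq(base, head_dim)
stretched = inv_freq(ntk_scaled_base(base, head_dim, scale), head_dim)

print(scale)
print(original[0], stretched[0])    # highest frequency is unchanged
print(original[-1], stretched[-1])  # lowest frequency is divided by `scale`
```

The highest frequency is left untouched while the lowest one is stretched by the full scale factor, so nearby positions stay distinguishable and long-range positions are interpolated.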

QWEN additionally incorporates two attention mechanisms:

  • LogN-Scaling
  • Window Attention

LogN-Scaling rescales the dot product of the query and value by a factor that depends on the ratio of the context length to the training length, ensuring that the entropy of the attention value remains stable as the context length grows.

Window attention restricts the attention to a limited context window, preventing the model from attending to tokens that are too far away.
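Both mechanisms fit in a few lines. The sketch below assumes one common formulation of LogN-Scaling, in which the scores of a query at position n are multiplied by log(n) / log(training length) once n exceeds the training length, and a causal sliding window for the mask. Treat both as illustrative assumptions rather than the exact form used in QWEN.

```python
import math


def logn_scale(position, train_len):
    """Scale factor for a query at a 1-based position; 1.0 within the training length."""
    if position <= train_len:
        return 1.0
    return math.log(position) / math.log(train_len)


def window_mask(seq_len, window):
    """Causal sliding window: query i may attend to key j if 0 <= i - j < window."""
    return [[0 <= i - j < window for j in range(seq_len)] for i in range(seq_len)]


print(logn_scale(1000, train_len=2048))
print(logn_scale(8192, train_len=2048))

for row in window_mask(seq_len=4, window=2):
    print(["x" if allowed else "." for allowed in row])
```

In attention implementations the scale is normally applied to the query-key scores before the softmax, and the mask sets disallowed positions to negative infinity.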


Swin Transformer: to be covered in a separate set of notes that works through its details.


We also observed that the long-context modeling ability of our model varies across layers, with lower layers being more sensitive in context length extension compared to the higher layers.

To leverage this observation, we assign different window sizes to each layer, using shorter windows for lower layers and longer windows for higher layers.

2.4 Training

To train QWEN, we follow the standard approach of autoregressive language modeling, as described in Radford et al. (2018).

This involves training the model to predict the next token based on the context provided by the previous tokens.
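The next-token objective can be written as an average negative log-likelihood. The sketch below uses a toy stand-in for the model (a uniform predictor over a hypothetical 4-token vocabulary) purely to show how the inputs and targets line up.

```python
import math


def next_token_loss(token_ids, predict_probs):
    """Average negative log-likelihood of each token given the tokens before it."""
    total = 0.0
    for t in range(1, len(token_ids)):
        context, target = token_ids[:t], token_ids[t]
        probs = predict_probs(context)  # distribution over the vocabulary
        total += -math.log(probs[target])
    return total / (len(token_ids) - 1)


def uniform_model(context, vocab_size=4):
    """Toy stand-in for a language model: ignores the context entirely."""
    return [1.0 / vocab_size] * vocab_size


loss = next_token_loss([0, 2, 1, 3], uniform_model)
print(loss)  # equals log(4) for the uniform toy model
```

A trained model lowers this value by putting more probability on the token that actually follows each context.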

We train models with context lengths of 2048.

To improve computational efficiency and reduce memory usage, we employ Flash Attention in the attention modules (Dao et al., 2022).


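The core idea of FlashAttention is to split the input into blocks and run the attention computation block by block, which reduces reads and writes to high-bandwidth memory (HBM). It exploits knowledge of the hardware memory hierarchy, such as the GPU memory hierarchy, to speed up computation and cut memory-access overhead.
Specifically, FlashAttention uses the classic techniques of tiling and recomputation. It first loads input blocks from HBM into SRAM (fast on-chip memory), performs the attention computation in SRAM, and writes the updated result back to HBM. This reduces the amount of memory traffic and therefore yields a speedup.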
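The tiling idea rests on an online softmax: keys and values can be processed block by block while a running maximum and running sum keep the result exact. The plain-Python sketch below shows this for a single query. It illustrates the arithmetic only, not the GPU kernel, and it leaves out the recomputation used in the backward pass.

```python
import math


def scaled_scores(query, keys):
    """Dot-product scores divided by sqrt(d)."""
    scale = 1.0 / math.sqrt(len(query))
    return [scale * sum(q * k for q, k in zip(query, key)) for key in keys]


def naive_attention(query, keys, values):
    """Reference: softmax over all scores at once."""
    scores = scaled_scores(query, keys)
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def tiled_attention(query, keys, values, block_size):
    """Same result, but keys and values are visited one block at a time."""
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = scaled_scores(query, k_block)
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[d] for w, v in zip(weights, v_block))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


query = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 0.5], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

full = naive_attention(query, keys, values)
tiled = tiled_attention(query, keys, values, block_size=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(full, tiled)))
```

Only one block of keys and values needs to be resident at a time, which is what lets the real kernel keep its working set in SRAM.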


3. Alignment

Pretrained large language models have been found to be out of sync with human behavior, making them unsuitable for serving as AI assistants in most cases.

Recent research has shown that the use of alignment techniques, such as supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF), can significantly improve the ability of language models to engage in natural conversation.

3.1 Supervised Fine-tuning

To gain an understanding of human behavior, the initial step is to carry out supervised fine-tuning.

This process fine-tunes a pre-trained model on chat-style data, which includes both human queries and AI responses.

Supervised finetuning is similar to text-to-text transfer, but it is capable of creating a helpful AI assistant due to the intricate and varied nature of the datasets used for finetuning.

In the following sections, we will delve into the details of data construction and training methods.

3.1.1 Data

To enhance the capabilities of our supervised finetuning datasets, we have annotated conversations in multiple styles.

3.1.2 Training

3.2 Reinforcement Learning from Human Feedback

While SFT has proven to be effective, we acknowledge that its generalization and creativity capabilities may be limited, and it is prone to overfitting.

To address this issue, we have implemented Reinforcement Learning from Human Feedback (RLHF) to further align SFT models with human preferences.

This process involves training a reward model and using Proximal Policy Optimization (PPO) to conduct policy training.

3.2.1 Reward Model

3.2.2 Reinforcement Learning

Our Proximal Policy Optimization (PPO) process involves four models:

  • policy model,
  • value model,
  • reference model,
  • reward model.
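How the four models interact can be sketched for a single sampled response. This is a simplified illustration under my own assumptions: the KL coefficient, clipping range, and all numeric values are hypothetical, and the advantage is taken as the shaped reward minus the value estimate rather than a full generalized advantage estimate.

```python
import math


def shaped_reward(reward_score, logp_policy, logp_reference, kl_coef):
    """Reward model score minus a penalty for drifting away from the reference model."""
    return reward_score - kl_coef * (logp_policy - logp_reference)


def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


# Hypothetical numbers for one sampled response.
reward = shaped_reward(
    reward_score=1.5,     # from the reward model
    logp_policy=-2.0,     # from the policy model that sampled the response
    logp_reference=-2.5,  # from the frozen reference model
    kl_coef=0.1,
)
value_estimate = 1.0      # from the value model
advantage = reward - value_estimate

objective = ppo_clipped_objective(logp_new=-1.5, logp_old=-2.0, advantage=advantage)
print(reward, advantage, objective)
```

The policy model is the one being updated, the value model supplies the baseline, the reference model anchors the policy through the penalty term, and the reward model scores the response.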

3.3 Automatic and Human Evaluation of Aligned Models

Besides the widely used few-shot setting, we test our aligned models in the zero-shot setting to demonstrate how well the models follow instructions.

The prompt in a zero-shot setting consists of an instruction and a question without any previous examples in the context.


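This setting requires the model to produce an answer or carry out the task based only on the given instruction and the question itself.

The zero-shot setting is especially important for evaluating a model's generalization and comprehension, because it simulates how the model behaves when facing a completely new, unseen task or question.

In this setting, the instruction usually states clearly what type of task the model should perform, such as generating text, answering a question, or translating.

The question is the specific content the model has to handle, and it depends on the task type given in the instruction.

Because no examples are provided, the model must rely on the knowledge and skills it acquired during pretraining to understand and answer the question.

The challenge of the zero-shot setting is that the model has to understand and complete the task accurately without direct guidance or examples. This demands strong language understanding and generation, as well as good generalization.

The zero-shot setting is therefore an important aspect of evaluating natural language processing (NLP) models.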
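The difference between the two settings comes down to how the prompt is assembled. A minimal sketch follows; the "Question:" / "Answer:" template is a hypothetical layout of my own, not the format used in the paper's evaluation.

```python
def build_prompt(instruction, question, examples=()):
    """Zero-shot when `examples` is empty, few-shot otherwise."""
    parts = [instruction]
    for example_question, example_answer in examples:
        parts.append(f"Question: {example_question}\nAnswer: {example_answer}")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


instruction = "Answer the question with a single number."

zero_shot = build_prompt(instruction, "What is 2 + 2?")
few_shot = build_prompt(
    instruction,
    "What is 2 + 2?",
    examples=[("What is 1 + 1?", "2"), ("What is 3 + 5?", "8")],
)

print(zero_shot)
print("-" * 20)
print(few_shot)
```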


3.4 Tool Use, Code Interpreter, and Agent

The QWEN models, which are designed to be versatile, have the remarkable ability to assist with (semi-) automating daily tasks by leveraging their skills in tool-use and planning.

We explore QWEN’s proficiency in the following areas:

  • Utilizing unseen tools through ReAct prompting (Yao et al., 2022) (see Table 6).
  • Using a Python code interpreter to enhance math reasoning, data analysis, and more.
  • Functioning as an agent that accesses Hugging Face’s extensive collection of multimodal models while engaging with humans (see Table 9).
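In ReAct-style prompting the model interleaves reasoning with tool calls, and the surrounding program has to extract the call from the generated text. The sketch below assumes a "Thought / Action / Action Input" layout, which is a common way to write such prompts. The tool name and the model output are made up for illustration.

```python
import re


def parse_react_step(text):
    """Extract the tool name and its input from one ReAct-style model output."""
    match = re.search(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", text)
    if match is None:
        return None  # no tool call, e.g. the model gave a final answer
    return match.group(1), match.group(2).strip()


# Hypothetical model output; `weather_lookup` is not a real tool.
model_output = (
    "Thought: I need the current weather.\n"
    "Action: weather_lookup\n"
    "Action Input: Hangzhou"
)

print(parse_react_step(model_output))
print(parse_react_step("Final Answer: It is sunny."))
```

The caller would run the named tool, append the result as an observation, and prompt the model again until it produces a final answer.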

To enhance QWEN’s capabilities as an agent or copilot, we employ the self-instruct (Wang et al., 2023c) strategy for supervised fine-tuning (SFT).

Specifically, we utilize the in-context learning capability of QWEN for self-instruction.

By providing a few examples, we can prompt QWEN to generate more relevant queries and generate outputs that follow a specific format, such as ReAct.

We then apply rules and involve human annotators to filter out any noisy samples.

Afterwards, the samples are incorporated into QWEN’s training data, resulting in an updated version of QWEN that is more dependable for self-instruction.

We iterate through this process multiple times until we gather an ample number of samples that possess both exceptional quality and a wide range of diversity.


Figure 2: A high-level overview of Self-Instruct.

The process starts with a small seed set of tasks as the task pool.

Random tasks are sampled from the task pool, and used to prompt an off-the-shelf LM to generate both new instructions and corresponding instances, followed by filtering low-quality or similar generations, and then added back to the initial repository of tasks.

The resulting data can be used for the instruction tuning of the language model itself later to follow instructions better.

Tasks shown in the figure are generated by GPT3.
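The loop in the figure can be sketched as follows. Everything here is a stand-in chosen for illustration: `toy_generator` replaces the language model call, an empty string plays the role of a low-quality generation, and `difflib.SequenceMatcher` with a 0.7 threshold is my own substitute for the similarity filter.

```python
import random
from difflib import SequenceMatcher


def is_novel(candidate, pool, max_similarity=0.7):
    """Keep a candidate only if it is not too similar to anything already in the pool."""
    return all(
        SequenceMatcher(None, candidate, existing).ratio() < max_similarity
        for existing in pool
    )


def self_instruct(seed_tasks, generate, rounds, samples_per_round=2, seed=0):
    """Sample tasks from the pool, generate new ones, filter, and grow the pool."""
    rng = random.Random(seed)
    pool = list(seed_tasks)
    for _ in range(rounds):
        demos = rng.sample(pool, min(samples_per_round, len(pool)))
        for candidate in generate(demos):
            if candidate and is_novel(candidate, pool):
                pool.append(candidate)
    return pool


def toy_generator(demos):
    """Hypothetical stand-in for the LM: one new task, one copy, one empty output."""
    return ["Translate into French.", demos[0], ""]


seeds = ["Summarize the paragraph.", "Classify the review."]
pool = self_instruct(seeds, toy_generator, rounds=3)
print(pool)
```

The copied task and the empty output are rejected in every round, and the new task is added once and then filtered as a duplicate afterwards.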

The Input-First and Output-First steps in the figure above are illustrated by the following two prompt tables:

  • Input-First

Table 7: Prompt used for the input-first approach of instance generation. The model is prompted to generate the instance first, and then generate the corresponding output.
For instructions that don’t require additional input, the output is allowed to be generated directly.

  • Output-First


Table 8: Prompt used for the output-first approach of instance generation.
The model is prompted to generate the class label first, and then generate the corresponding input.
This prompt is used for generating the instances for classification tasks.

We apply the output-first approach to the classification tasks identified in the former step, and the input-first approach to the remaining non-classification tasks.
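The routing rule in this last sentence is simple enough to state as code. The function name and the returned tuples are my own illustrative choices.

```python
def generation_order(is_classification):
    """Output-first for classification tasks, input-first for everything else."""
    if is_classification:
        # Fix the class label first, then write an input that fits it.
        return ("output", "input")
    return ("input", "output")


print(generation_order(is_classification=True))
print(generation_order(is_classification=False))
```

A plausible motivation for generating the label first is that it lets the pipeline control which classes appear, instead of leaving the label distribution to whatever inputs the model happens to write.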


Related Work

  1. QWEN TECHNICAL REPORT.

  2. SELF-INSTRUCT: Aligning Language Models with Self-Generated Instructions.
