Qwen Paper Reading Notes

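These notes are for my own use as I get familiar with large language models and organize what I have read. I will revise, extend, and trim them over time. Some details are not yet explained, and I hope to fill them in as I learn more. Corrections and discussion from fellow learners are welcome.

As for the later CODE-QWEN and MATH-QWEN sections, in my view they build on the first three parts. I have only skimmed them and not studied them closely, so they are not recorded here yet.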

1. Abstract (supplemented by the Introduction)

QWEN is a comprehensive language model series that encompasses distinct models with varying parameter counts.
That is, QWEN is a family of separate models that differ in parameter count.

It includes QWEN, the base pretrained language models, and QWEN-CHAT, the chat models finetuned with human alignment techniques.

The base language models consistently demonstrate superior performance across a multitude of downstream tasks,

and the chat models, particularly those trained using Reinforcement Learning from Human Feedback (RLHF), are highly competitive.

The two sentences above are best read together with the following passage and figure:

The model series include the base pretrained language models,
chat models finetuned with human alignment techniques, i.e., supervised finetuning (SFT), reinforcement learning with human feedback (RLHF), etc.,
as well as specialized models in coding and math.


Figure 1: Model Lineage of the Qwen Series.
We have pretrained the language models, namely QWEN, on massive datasets containing trillions of tokens.

We then use SFT and RLHF to align QWEN to human preference and thus we have QWEN-CHAT and specifically its improved version QWEN-CHAT-RLHF.

Additionally, we also develop specialized models for coding and mathematics, such as CODE-QWEN, CODE-QWEN-CHAT, and MATH-QWEN-CHAT based on QWEN with similar techniques.

Note that we previously released the multimodal LLM, QWEN-VL and QWEN-VL-CHAT (Bai et al., 2023), which are also based on our QWEN base models.

The chat models possess advanced tool-use and planning capabilities for creating agent applications, showcasing impressive performance even when compared to bigger models on complex tasks like utilizing a code interpreter.
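In short, the chat models are strong at tool use and planning for building agent applications, and on complex tasks such as using a code interpreter they perform impressively even against larger models.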

This sentence is best read together with the following passage from the Introduction:

LLMs are not just limited to language tasks.

They can also function as a generalist agent, collaborating with external systems, tools, and models to achieve the objectives set by humans.
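In other words, an LLM can serve as a general-purpose agent that works with external systems, tools, and models toward goals set by humans.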

For example, LLMs can understand multimodal instructions, execute code, use tools, and more.

2. Pretraining

2.1 Data

To ensure the quality of our pretraining data, we have developed a comprehensive data preprocessing procedure.
That is, data quality is ensured through a comprehensive preprocessing pipeline.

2.2 Tokenization

We utilize byte pair encoding (BPE) as our tokenization method.
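
To make the idea of BPE concrete, here is a minimal character-level sketch of how merges are learned: count adjacent symbol pairs, merge the most frequent pair, and repeat. It is a toy under stated assumptions (whitespace-split words, character symbols, the function names are mine) and is not the tokenizer the paper actually uses.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)
```

After two merges on this tiny corpus, the frequent word "low" has become a single symbol, which is the effect BPE is after: common strings get short token sequences.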

2.3 Model

2.3.1 Architecture

QWEN is designed using a modified version of the Transformer architecture.

Specifically, we have adopted the recent open-source approach of training large language models, LLaMA (Touvron et al., 2023a).

Our modifications to the architecture include:

Embedding and output projection.
Based on preliminary experimental findings, we have opted for the untied embedding approach instead of tying the weights of input embedding and output projection.
That is, preliminary experiments led the authors to keep the input embedding and output projection weights separate rather than tied.
This decision was made in order to achieve better performance with the price of memory costs.
The trade-off is better performance at the cost of more memory.

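"Output projection" refers to the output layer of the model, which maps the hidden representation back to the vocabulary (token) space to produce the final output.

The "untied embedding approach" means that the input embedding and the output projection have independent, unshared weights. This contrasts with "tied embedding" (also called "weight tying"), which uses the same weights for both in order to reduce the parameter count and the risk of overfitting.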
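
The memory cost of untying can be seen with simple arithmetic. The sketch below counts embedding-related parameters for both choices. The vocabulary size and hidden size are hypothetical round numbers picked for illustration, not values taken from this section.

```python
def embedding_params(vocab_size, hidden_size, tied):
    """Parameters in the input embedding plus the output projection."""
    input_embedding = vocab_size * hidden_size
    # With weight tying the output projection reuses the input embedding matrix.
    output_projection = 0 if tied else vocab_size * hidden_size
    return input_embedding + output_projection


# Hypothetical sizes, for illustration only.
VOCAB_SIZE = 150_000
HIDDEN_SIZE = 4096

tied = embedding_params(VOCAB_SIZE, HIDDEN_SIZE, tied=True)
untied = embedding_params(VOCAB_SIZE, HIDDEN_SIZE, tied=False)
print(tied, untied, untied - tied)
```

With these assumed sizes, untying adds one extra vocab-by-hidden matrix (about 0.6 billion parameters), which is the "price of memory costs" the quote refers to.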

Positional embedding.
We have chosen RoPE (Rotary Positional Embedding) as our preferred option for incorporating positional information into our model.
That is, RoPE is used to inject positional information into the model.
In particular, we have opted to use FP32 precision for the inverse frequency matrix, rather than BF16 or FP16, in order to prioritize model performance and achieve higher accuracy.
Concretely, the inverse frequency matrix is kept in FP32 instead of BF16 or FP16, putting model quality and numerical accuracy first.
Bias.
For most layers, we remove biases following Chowdhery et al. (2022), but we add biases in the QKV layer of attention to enhance the extrapolation ability of the model (Su, 2023b).
In short, biases are removed from most layers but kept in the QKV layer to improve the model's extrapolation ability.

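Removing biases from most layers follows Chowdhery et al. (2022), who report that dropping bias terms improved training stability for large models. It also reduces the parameter count slightly. The further benefits sometimes attributed to this choice (lower model complexity, lower overfitting risk, more consistent behavior across datasets) are plausible but less directly supported.

Adding biases in the QKV layer of attention: Su's work points out that adding bias terms to the QKV (Query, Key, Value) layer of the attention mechanism can enhance the model's extrapolation ability.

The QKV layer is the core of the attention mechanism. "Extrapolation" here most likely means length extrapolation, i.e. handling inputs longer than those seen during training, which matters when the model faces long and varied inputs.

In summary, removing biases from most layers is mainly about stability, while adding biases in the QKV layer is about extrapolation.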
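
Returning to the positional embedding item above, the following is a minimal sketch of RoPE on a single vector: the inverse frequencies are computed once, and each pair of dimensions is rotated by an angle proportional to the position. The base of 10000 and the interleaved pair layout are assumptions (many implementations use a "rotate half" layout instead). Python floats are 64-bit, so the sketch does not reproduce the FP32 versus BF16/FP16 difference; it only shows where the inverse frequencies enter, which is why their precision matters once positions get large.

```python
import math


def inverse_frequencies(head_dim, base=10000.0):
    """One frequency per pair of dimensions: base ** (-2i / head_dim)."""
    return [1.0 / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]


def apply_rope(vec, position, inv_freq):
    """Rotate each pair (vec[2i], vec[2i+1]) by the angle position * inv_freq[i]."""
    out = []
    for i, freq in enumerate(inv_freq):
        angle = position * freq
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


inv_freq = inverse_frequencies(4)
query = [1.0, 2.0, 3.0, 4.0]
key = [0.5, -1.0, 2.0, 0.0]

# The attention score depends only on the relative offset (here 2 in both cases).
score_near = dot(apply_rope(query, 5, inv_freq), apply_rope(key, 3, inv_freq))
score_far = dot(apply_rope(query, 12, inv_freq), apply_rope(key, 10, inv_freq))
print(abs(score_near - score_far) < 1e-9)
```

The check at the end illustrates the property that makes RoPE attractive: after rotation, the query-key dot product is a function of the relative position only.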

Pre-Norm & RMSNorm.
In modern Transformer models, pre-normalization is the most widely used approach, which has been shown to improve training stability compared to post-normalization.
Additionally, we have replaced the traditional layer normalization technique described in (Ba et al., 2016) with RMSNorm (Jiang et al., 2023). This change has resulted in equivalent performance while also improving efficiency.
Activation function.
We have selected SwiGLU (Shazeer, 2020) as our activation function, a combination of Swish (Ramachandran et al., 2017) and Gated Linear Unit (Dauphin et al., 2017).
Our initial experiments have shown that activation functions based on GLU generally outperform other baseline options, such as GeLU (Hendrycks & Gimpel, 2016).
As is common practice in previous research, we have reduced the dimension of the feed-forward network (FFN) from 4 times the hidden size to 8/3 of the hidden size.
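
The last two items can be sketched in a few lines of plain Python. RMSNorm rescales a vector by its root mean square without subtracting the mean, and a SwiGLU feed-forward block multiplies a Swish-activated gate with a linear "up" projection before projecting back down. The function names, the epsilon value, and the hidden size of 3072 are my own choices for illustration. The parameter count at the end shows why 8/3 is a natural width: a gated FFN has three weight matrices instead of two, so shrinking the inner dimension from 4x to 8/3x keeps the total the same.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the inverse of its root mean square, then by a learned weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def silu(z):
    """Swish with beta = 1: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(matrix, vec):
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """down( silu(gate(x)) * up(x) ), written without bias terms."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def ffn_param_count(hidden_size, intermediate_size, gated):
    """A gated FFN has three weight matrices, a plain FFN has two."""
    n_matrices = 3 if gated else 2
    return n_matrices * hidden_size * intermediate_size


HIDDEN = 3072  # hypothetical hidden size, chosen so that 8/3 * HIDDEN is an integer
plain = ffn_param_count(HIDDEN, 4 * HIDDEN, gated=False)
swiglu = ffn_param_count(HIDDEN, 8 * HIDDEN // 3, gated=True)
print(plain == swiglu)
```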

2.3.2 Context length extension

Transformer models have a significant limitation in terms of the context length for their attention mechanism.

As the context length increases, the quadratic-complexity computation leads to a drastic increase in both computation and memory costs.
That is, the computational cost grows in proportion to the square of the input length.

To address this, the authors mention four key techniques:

NTK-aware interpolation (bloc97, 2023)
dynamic NTK-aware interpolation (Peng et al., 2023a)
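
As a rough sketch of what these two techniques do to RoPE, the code below uses the commonly cited NTK-aware formulation: the rotary base is enlarged by alpha ** (d / (d - 2)), which leaves the highest frequency untouched and slows the lowest frequency by exactly a factor of alpha. The dynamic variant picks alpha from the current sequence length. The specific stepwise schedule in `dynamic_ntk_alpha` is an assumption for illustration, as are all function names; this section of the notes does not give the exact formula.

```python
import math


def inverse_frequencies(head_dim, base):
    return [1.0 / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]


def ntk_scaled_base(base, head_dim, alpha):
    """NTK-aware interpolation: enlarge the rotary base instead of rescaling positions."""
    return base * alpha ** (head_dim / (head_dim - 2))


def dynamic_ntk_alpha(seq_len, train_len):
    """Assumed schedule: alpha stays at 1 within the training length, then grows in steps."""
    if seq_len <= train_len:
        return 1.0
    return 2 ** math.ceil(math.log2(seq_len / train_len) + 1) - 1


HEAD_DIM, BASE, ALPHA = 128, 10000.0, 4.0
original = inverse_frequencies(HEAD_DIM, BASE)
scaled = inverse_frequencies(HEAD_DIM, ntk_scaled_base(BASE, HEAD_DIM, ALPHA))

print(scaled[0] / original[0])    # highest frequency: ratio stays at 1
print(scaled[-1] / original[-1])  # lowest frequency: ratio is about 1 / ALPHA
```

The point of the design is that short-range (high-frequency) position information is preserved while long-range (low-frequency) dimensions are stretched to cover the longer context.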

QWEN additionally incorporates two attention mechanisms:

LogN-Scaling
Window Attention

LogN-Scaling rescales the dot product of the query and value by a factor that depends on the ratio of the context length to the training length, ensuring that the entropy of the attention value remains stable as the context length grows.
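That is, LogN-Scaling multiplies the attention dot product by a factor that depends on the ratio of the context length to the training length, so that the entropy of the attention distribution stays stable as the context grows. Note that the paper's wording is "query and value", but the dot product that feeds the attention softmax is between query and key, so that is presumably what is meant.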
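
A minimal sketch of such a factor, assuming the common formulation in which the scale is the logarithm of the current length taken with the training length as base, and is clamped to 1 inside the training length. The exact form used by QWEN is not stated in this section, so treat the function below as an illustration only.

```python
import math


def logn_scale(context_len, train_len):
    """Assumed LogN factor: log base train_len of context_len, never below 1."""
    if context_len <= train_len:
        return 1.0
    return math.log(context_len) / math.log(train_len)


for n in (1024, 2048, 4096, 8192):
    print(n, round(logn_scale(n, train_len=2048), 4))
```

The factor grows slowly (logarithmically) with length, which sharpens the softmax just enough to offset the flattening caused by attending over more tokens.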

Window attention restricts the attention to a limited context window, preventing the model from attending to tokens that are too far away.
That is, window attention limits each token's attention to a bounded context window, so the model does not attend to tokens that are too far away.

Swin Transformer (another model built on windowed attention): to be covered in a separate note that works through its details.

We also observed that the long-context modeling ability of our model varies across layers, with lower layers being more sensitive in context length extension compared to the higher layers.
That is, long-context modeling ability differs across layers: lower layers are more sensitive to context length extension than higher layers.

To leverage this observation, we assign different window sizes to each layer, using shorter windows for lower layers and longer windows for higher layers.
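
The windowing idea and the per-layer assignment can be sketched as a boolean mask. Each query position may attend to itself and to the previous `window - 1` positions. The per-layer window sizes below are hypothetical values that only illustrate "shorter for lower layers, longer for higher layers".

```python
def causal_window_mask(seq_len, window):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[0 <= q - k < window for k in range(seq_len)] for q in range(seq_len)]


# Hypothetical per-layer window sizes: lower layers first, higher layers last.
LAYER_WINDOWS = [2, 2, 4, 4]

SEQ_LEN = 5
for layer, window in enumerate(LAYER_WINDOWS):
    mask = causal_window_mask(SEQ_LEN, window)
    visible = sum(mask[SEQ_LEN - 1])  # keys visible to the last position
    print(f"layer {layer}: window={window}, last token sees {visible} tokens")
```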

2.4 Training

To train QWEN, we follow the standard approach of autoregressive language modeling, as described in Radford et al. (2018).

This involves training the model to predict the next token based on the context provided by the previous tokens.
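
In code, next-token prediction amounts to a cross-entropy loss in which the target at each position is the token that follows it. The sketch below works on plain lists. The names and the toy inputs are mine.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of vocabulary scores."""
    max_logit = max(logits)
    log_sum = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits))
    return [v - log_sum for v in logits]


def next_token_loss(logits_per_position, token_ids):
    """Mean negative log-likelihood of token t+1 given the scores produced at position t."""
    losses = []
    for t in range(len(token_ids) - 1):
        log_probs = log_softmax(logits_per_position[t])
        losses.append(-log_probs[token_ids[t + 1]])
    return sum(losses) / len(losses)


# Toy example: vocabulary of 4 tokens, a model that is completely uncertain.
tokens = [0, 1, 2]
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in tokens]
print(next_token_loss(uniform_logits, tokens))  # equals ln(4) for a uniform model
```

The scores at the last position have no target inside the sequence, which is why the loop stops one step early.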

We train models with context lengths of 2048.

To improve computational efficiency and reduce memory usage, we employ Flash Attention in the attention modules (Dao et al., 2022).

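The core idea of FlashAttention is to split the input into blocks and run attention block by block, which reduces reads and writes to high-bandwidth memory (HBM). It exploits knowledge of the hardware memory hierarchy, such as that of a GPU, to speed up computation and cut memory-access overhead.
Concretely, FlashAttention relies on two classic techniques, tiling and recomputation. It loads blocks of the input from HBM into SRAM (fast on-chip memory), computes attention on them in SRAM, and writes the updated result back to HBM. This lowers the amount of memory traffic, which is where the speedup comes from.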
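
What makes tiling possible is that softmax can be computed incrementally: keep a running maximum, a running normalizer, and a running weighted sum, and rescale them whenever a new block raises the maximum. The sketch below shows this for a single query in plain Python and checks it against the ordinary one-pass computation. It illustrates only the arithmetic, not the GPU kernel or its memory management, and the function names are mine.

```python
import math


def attention_reference(query, keys, values):
    """Ordinary softmax attention for one query (1/sqrt(d) scaling omitted for brevity)."""
    scores = [sum(a * b for a, b in zip(query, k)) for k in keys]
    max_score = max(scores)
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def attention_tiled(query, keys, values, block_size):
    """Same result, but keys and values are visited one block at a time."""
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    accumulator = [0.0] * dim
    for start in range(0, len(keys), block_size):
        key_block = keys[start:start + block_size]
        value_block = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(query, k)) for k in key_block]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated so far to the new maximum.
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        accumulator = [
            acc * correction + sum(w * v[d] for w, v in zip(weights, value_block))
            for d, acc in enumerate(accumulator)
        ]
        running_max = new_max
    return [acc / running_sum for acc in accumulator]


query = [0.1, 0.2]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [-1.0, 3.0]]
values = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [1.0, 1.0, 1.0], [3.0, 0.0, -1.0], [0.5, 0.5, 0.5]]
full = attention_reference(query, keys, values)
tiled = attention_tiled(query, keys, values, block_size=2)
print(all(abs(a - b) < 1e-12 for a, b in zip(full, tiled)))
```

Because only one block of keys and values is needed at a time, the full score matrix never has to be written out, which is the memory saving the notes describe.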

3. Alignment

Pretrained large language models have been found to be out of sync with human behavior, making them unsuitable for serving as AI assistants in most cases.
That is, pretrained LLMs have been found to be misaligned with human behavior, so in most cases they are not suitable as AI assistants.

Recent research has shown that the use of alignment techniques, such as supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF), can significantly improve the ability of language models to engage in natural conversation.

3.1 Supervised Fine-tuning

To gain an understanding of human behavior, the initial step is to carry out supervised fine-tuning.

This process fine-tunes a pre-trained model on chat-style data, which includes both human queries and AI responses.
That is, the pretrained model is fine-tuned on chat-style data consisting of human queries and AI responses.

Supervised finetuning is similar to text-to-text transfer, but it is capable of creating a helpful AI assistant due to the intricate and varied nature of the datasets used for finetuning.
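Supervised fine-tuning resembles text-to-text transfer, but because the fine-tuning datasets are intricate and varied, it can produce a helpful AI assistant.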

In the following sections, we will delve into the details of data construction and training methods.

3.1.1 DATA

To enhance the capabilities of our supervised finetuning datasets, we have annotated conversations in multiple styles.
…

3.1.2 TRAINING

…

3.2 Reinforcement Learning from Human Feedback

While SFT has proven to be effective, we acknowledge that its generalization and creativity capabilities may be limited, and it is prone to overfitting.

To address this issue, we have implemented Reinforcement Learning from Human Feedback (RLHF) to further align SFT models with human preferences.

This process involves training a reward model and using Proximal Policy Optimization (PPO) to conduct policy training.
That is, a reward model is trained first, and Proximal Policy Optimization (PPO) is then used for policy training.

3.2.1 Reward Model

…

3.2.2 Reinforcement Learning

Our Proximal Policy Optimization (PPO) process involves four models:

policy model,
value model,
reference model,
reward model.
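
The notes list the four models without saying how they interact. In a commonly used RLHF setup (an assumption here, not something this section confirms for QWEN), the reward model scores the policy's response, the reference model supplies log-probabilities for a KL-style penalty that keeps the policy close to its starting point, and the value model provides a baseline for the advantage. The sketch below shows only that arithmetic. The function names and the coefficient `kl_coef` are hypothetical.

```python
def shaped_reward(reward_score, policy_logprobs, reference_logprobs, kl_coef=0.1):
    """Reward model score minus a penalty for drifting away from the reference model.

    kl_coef is a hypothetical value chosen for illustration.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward_score - kl_coef * kl_estimate


def advantage(shaped, value_estimate):
    """How much better the outcome was than the value model predicted."""
    return shaped - value_estimate


# Toy numbers: per-token log-probabilities of one sampled response.
policy_logprobs = [-1.0, -2.0]      # from the policy model
reference_logprobs = [-1.5, -2.5]   # from the frozen reference model
reward = shaped_reward(1.0, policy_logprobs, reference_logprobs)
print(reward, advantage(reward, value_estimate=0.5))
```

A positive advantage pushes the policy toward the sampled response, and the PPO objective additionally clips the size of each update, which is omitted here.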

3.3 Automatic and Human evaluation of Aligned Models

Besides the widely used few-shot setting, we test our aligned models in the zero-shot setting to demonstrate how well the models follow instructions.

The prompt in a zero-shot setting consists of an instruction and a question without any previous examples in the context.
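That is, in the zero-shot setting the prompt contains only an instruction and a question, with no earlier examples as context.

The model must produce its answer, or carry out the task, from the instruction and the question alone.

The zero-shot setting matters for evaluating generalization and comprehension, because it approximates how the model behaves on new tasks or questions it has not seen before.

In this setting, the instruction usually states clearly what kind of task the model is to perform, such as generating text, answering a question, or translating.

The question is the specific content the model has to handle, and how to handle it depends on the task type given in the instruction.

Since no examples are provided, the model has to rely on the knowledge and skills acquired during pretraining to understand and answer the question.

The challenge is that the model must understand and complete the task accurately without direct guidance or examples. This calls for strong language understanding and generation, as well as good generalization.

For this reason, the zero-shot setting is an important part of evaluating natural language processing (NLP) models.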
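
The difference between the two settings is easy to see in how the prompt is assembled. The template below ("Question:" / "Answer:") is a hypothetical format of my own, not the one used in the paper's evaluation.

```python
def build_prompt(instruction, question, examples=()):
    """Zero-shot when `examples` is empty, few-shot otherwise."""
    parts = [instruction]
    for example_question, example_answer in examples:
        parts.append(f"Question: {example_question}\nAnswer: {example_answer}")
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


instruction = "Answer the question with a single number."

zero_shot = build_prompt(instruction, "What is 7 + 5?")
few_shot = build_prompt(
    instruction,
    "What is 7 + 5?",
    examples=[("What is 1 + 1?", "2"), ("What is 3 + 4?", "7")],
)
print(zero_shot)
print("---")
print(few_shot)
```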

3.4 Tool use, Code Interpreter, and Agent

The QWEN models, which are designed to be versatile, have the remarkable ability to assist with (semi-) automating daily tasks by leveraging their skills in tool-use and planning.
That is, the QWEN models are designed to be versatile, and their tool-use and planning skills let them help (semi-)automate everyday tasks.

We explore QWEN’s proficiency in the following areas:

Utilizing unseen tools through ReAct prompting (Yao et al., 2022) (see Table 6).
Using a Python code interpreter to enhance math reasoning, data analysis, and more.
Functioning as an agent that accesses Hugging Face’s extensive collection of multimodal models while engaging with humans (see Table 9).
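
To give a feel for the first item, here is a minimal sketch of one ReAct-style step: the model's text is parsed for a thought, a tool name, and a tool input, the tool is called, and its result is returned as an observation to append to the prompt. The field names follow common ReAct-style prompts and are an assumption, as is the toy `word_count` tool. The exact prompt format used for QWEN is not reproduced in these notes.

```python
import re

# Hypothetical tool registry: tool name -> callable taking and returning a string.
TOOLS = {
    "word_count": lambda text: str(len(text.split())),
}


def parse_react_step(model_output):
    """Extract the Thought / Action / Action Input fields from one model turn."""
    fields = {}
    for name, pattern in (
        ("thought", r"Thought:\s*(.*)"),
        ("action", r"Action:\s*(.*)"),
        ("action_input", r"Action Input:\s*(.*)"),
    ):
        match = re.search(pattern, model_output)
        if match is None:
            return None
        fields[name] = match.group(1).strip()
    return fields


def run_react_step(model_output):
    """Call the requested tool and format its result as an observation line."""
    step = parse_react_step(model_output)
    if step is None or step["action"] not in TOOLS:
        return None
    result = TOOLS[step["action"]](step["action_input"])
    return f"Observation: {result}"


output = (
    "Thought: I should count the words.\n"
    "Action: word_count\n"
    "Action Input: tool use and planning"
)
print(run_react_step(output))
```

In a full agent loop the observation would be appended to the prompt and the model queried again until it emits a final answer.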

To enhance QWEN’s capabilities as an agent or copilot, we employ the self-instruct (Wang et al., 2023c) strategy for supervised fine-tuning (SFT).
That is, the self-instruct strategy is used to build the SFT data that strengthens QWEN as an agent or copilot.

Specifically, we utilize the in-context learning capability of QWEN for self-instruction.
More precisely, the self-instruction step relies on QWEN's own in-context learning ability.

By providing a few examples, we can prompt QWEN to generate more relevant queries and generate outputs that follow a specific format, such as ReAct.
Given a few examples, QWEN can be prompted to produce more relevant queries, along with outputs in a required format such as ReAct.

We then apply rules and involve human annotators to filter out any noisy samples.
Rules and human annotators are then used to filter out noisy samples.

Afterwards, the samples are incorporated into QWEN’s training data, resulting in an updated version of QWEN that is more dependable for self-instruction.
The filtered samples go back into QWEN's training data, yielding an updated QWEN that is more reliable at self-instruction.

We iterate through this process multiple times until we gather an ample number of samples that possess both exceptional quality and a wide range of diversity.
The loop is repeated until there are enough samples that are both high in quality and diverse.

Figure 2: A high-level overview of Self-Instruct.

The process starts with a small seed set of tasks as the task pool.

Random tasks are sampled from the task pool, and used to prompt an off-the-shelf LM to generate both new instructions and corresponding instances, followed by filtering low-quality or similar generations, and then added back to the initial repository of tasks.
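That is, tasks are sampled at random from the task pool and used to prompt an off-the-shelf language model (LM) to generate new instructions and matching instances.
Low-quality or near-duplicate generations are then filtered out, and the rest are added back to the task pool.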

The resulting data can be used for the instruction tuning of the language model itself later to follow instructions better.
In other words, the generated data is later used to instruction-tune the same language model so that it follows instructions better.

Tasks shown in the figure are generated by GPT3.
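
The loop described in the caption can be sketched as follows. The language model is replaced by a stub function so the example runs on its own, and the similarity filter is a simple token-overlap ratio standing in for whatever metric the real pipeline uses. Both are assumptions, as are the function names and the 0.7 threshold.

```python
import random


def token_overlap(a, b):
    """Jaccard overlap of lowercase whitespace tokens (a stand-in similarity metric)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0


def self_instruct_round(task_pool, generate, num_prompts=2, max_overlap=0.7, rng=random):
    """Sample seed tasks, generate candidates, filter them, and grow the pool."""
    prompt_tasks = rng.sample(task_pool, min(num_prompts, len(task_pool)))
    accepted = []
    for candidate in generate(prompt_tasks):
        if not candidate.strip():
            continue  # drop empty, low-quality generations
        seen = task_pool + accepted
        if any(token_overlap(candidate, task) > max_overlap for task in seen):
            continue  # drop near-duplicates of existing tasks
        accepted.append(candidate)
    return task_pool + accepted


def stub_generate(prompt_tasks):
    """Hypothetical stand-in for prompting an off-the-shelf LM."""
    return [
        "Translate the sentence into French.",
        "Summarize the following paragraph.",  # duplicate of a seed task
        "",                                     # empty generation
    ]


seed_pool = ["Summarize the following paragraph.", "Write a haiku about autumn."]
pool = self_instruct_round(seed_pool, stub_generate, rng=random.Random(0))
print(pool)
```

Only the genuinely new instruction survives the filter. Running several rounds with a real model in place of the stub grows the pool while keeping it diverse.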

For the Input-First and Output-First approaches shown in the figure above, see the two prompt tables below:

Input-First

Table 7: Prompt used for the input-first approach of instance generation. The model is prompted to generate the instance first, and then generate the corresponding output.
For instructions that don’t require additional input, the output is allowed to be generated directly.

Output-First


Table 8: Prompt used for the output-first approach of instance generation.
The model is prompted to generate the class label first, and then generate the corresponding input.
This prompt is used for generating the instances for classification tasks.

We apply the output-first approach to the classification tasks identified in the former step, and the input-first approach to the remaining non-classification tasks.