A Detailed Walkthrough of the Full Transformer Pipeline

Table of Contents

- 1. Transformer Architecture Overview
- 2. Encoder
  - 2.1 Input Embedding Layer
    - 2.1.1 A Simple Example
  - 2.2 Positional Encoding
    - 2.2.1 The Positional Encoding Used in the Transformer
    - 2.2.2 Meaning of the Symbols in the Formulas
  - 2.3 Multi-Head Self-Attention Layer
    - 2.3.1 Self-Attention Mechanism
    - 2.3.4 Structure of the Multi-Head Self-Attention Layer
  - 2.4 Residual Connection and Layer Normalization
  - 2.5 Feedforward Neural Network Layer
  - 2.6 A Complete EncodeLayer
  - 2.7 A Complete Encoder (Six EncodeLayer Layers)
- 3. Decoder
  - 3.1 Target Word Embedding Layer
  - 3.2 Positional Encoding
  - 3.3 Masked Multi-Head Self-Attention Layer
  - 3.4 Encoder-Decoder Attention Layer
  - 3.5 A Complete DecodeLayer
  - 3.6 A Complete Decoder (Six DecodeLayer Layers)
- 4. Complete Transformer Code

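1. Transformer Architecture Overview

The Transformer is a deep learning architecture for natural language processing and other sequence-to-sequence tasks. It was proposed by Vaswani et al. in a 2017 paper (Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in neural information processing systems, 2017, 30) and has been widely adopted and extended in the years since. It has achieved strong results in machine translation, text summarization, question answering, and related areas.

The core idea of the Transformer is the self-attention mechanism, which lets the model attend to different positions of the input sequence at the same time. This helps the model capture long-range dependencies, which is a large part of its success on sequence data.

The Transformer architecture consists mainly of the following components:

- Self-Attention Layer: one of the core components of the Transformer. For every input position, the model computes attention weights over all positions in the sequence, so it can attend to the whole sequence at once.
- Feedforward Neural Network: after the self-attention layer, the features at each position are processed by a feedforward network, usually a fully connected multi-layer perceptron (MLP).
- Residual Connection and Layer Normalization: applied after each sub-layer (the self-attention layer and the feedforward network) to mitigate vanishing or exploding gradients and to make training more stable.
- Encoder and Decoder: the Transformer is typically used for sequence-to-sequence tasks, so it has an encoder and a decoder. The encoder turns the input sequence into a set of hidden representations, and the decoder uses these representations to generate the target sequence.
- Positional Encoding: the architecture itself carries no information about sequence order, so a positional encoding is added to the word embeddings of the input sequence to provide the absolute position of each token.

The complete Transformer architecture is shown in the figure below.
[Figure: the complete Transformer architecture]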

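2. Encoder

The encoder is typically made up of the following parts:

- Input Embedding Layer: converts each word or symbol of the input sequence into a fixed-dimensional vector, usually by means of word embeddings.
- Positional Encoding: to preserve word order, a positional encoding is added to the input embeddings; sine and cosine functions are commonly used to represent each word's position in the sequence.
- Multi-Head Self-Attention Layer: computes attention with several heads in parallel to capture dependencies among the words of the input sequence. Self-attention lets the model take the other words of the sequence into account when computing each word's representation.
- Residual Connection and Layer Normalization: in each sub-layer (such as the self-attention layer and the feedforward layer), the sub-layer's input is added to its output and the sum is normalized, which guards against vanishing or exploding gradients and helps speed up training.
- Feedforward Neural Network Layer: applies the same fully connected layers to the vector at each position independently, usually with one or more hidden layers and an activation function.
- Residual connection and layer normalization again: applied once more after the feedforward layer to further improve training efficiency and performance.

These parts are arranged as shown in the figure below.

[Figure: structure of the encoder]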

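2.1 Input Embedding Layer

This layer converts each word or symbol of the input sequence into a fixed-dimensional vector, as shown in the figure below.

[Figure: mapping each input word to a word vector]

This yields the following word-vector matrix.

[Figure: the resulting word-vector matrix]

In practice, this step is usually done with PyTorch's built-in nn.Embedding, as in the code below.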

import torch
import torch.nn as nn

vocab_size = 10
embed_size = 5

vocabulary = nn.Embedding(vocab_size, embed_size)
print(f"Embedding table:\n{vocabulary.weight}")
# Embedding table:
# Parameter containing:
# tensor([[-0.7609,  0.0227,  0.3113, -0.2426,  0.1460],
#         [ 0.0174, -0.6369,  0.3123, -0.8302,  1.0143],
#         [ 0.9594, -1.0005, -1.4387, -2.5071, -1.3583],
#         [ 0.4515, -0.8786,  0.2825, -0.3124,  1.6416],
#         [-2.3314, -1.4454,  1.7150, -0.1410, -0.1977],
#         [-0.7882, -0.7660,  0.3603, -1.1835, -0.5963],
#         [-0.1178, -2.5581, -0.0574,  1.8182,  0.3932],
#         [ 0.3312, -0.0667, -0.1411, -0.4649,  0.6535],
#         [ 1.8327, -0.0894, -0.0958,  1.5528, -0.7983],
#         [-0.5804,  0.8677,  1.1671,  0.4742,  1.1498]], requires_grad=True)

input_sequence = torch.LongTensor([[1, 4]])
embedded_output = vocabulary(input_sequence)
print(f"Output for input 1, 4:\n{embedded_output}")
# Output for input 1, 4:
# tensor([[[ 0.0174, -0.6369,  0.3123, -0.8302,  1.0143],
#          [-2.3314, -1.4454,  1.7150, -0.1410, -0.1977]]],
#        grad_fn=<EmbeddingBackward0>)

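Code explanation:

- nn.Embedding(vocab_size, embed_size) creates the embedding table (the "dictionary"). vocab_size is the size of the vocabulary, and embed_size is the dimension of each word embedding, that is, the word-vector dimension $d_{model}$ in the figure above: $\mathrm{embed\_size}=d_{model}$ .
- The table takes an input sequence input_sequence, normally an integer tensor in which each integer stands for one word of the vocabulary. Passing it to the embedding layer returns a tensor holding the embedding of every word in the sequence.
- The output has shape (batch_size, sequence_length, embed_size), where batch_size is the batch size of the input, sequence_length is the length of the input sequence, and embed_size is the dimension of each word embedding.

The code above corresponds to the process shown in the figure below.

[Figure: embedding lookup for the input sequence]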
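One way to see what the embedding layer does is to check that it is a plain row lookup into its weight matrix. The sketch below reuses the names from the code above; the fixed seed is an assumption added only so that reruns give the same weights.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # assumption: fixed seed, only for repeatable weights

vocab_size, embed_size = 10, 5
vocabulary = nn.Embedding(vocab_size, embed_size)

# One batch containing the token ids 1 and 4
input_sequence = torch.tensor([[1, 4]], dtype=torch.long)

embedded = vocabulary(input_sequence)          # forward pass of the layer
looked_up = vocabulary.weight[input_sequence]  # index rows 1 and 4 directly

print(embedded.shape)                    # torch.Size([1, 2, 5])
print(torch.equal(embedded, looked_up))  # True
```

Because the layer only selects rows, the table entries for words that do not appear in a batch receive no gradient from that batch.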

2.1.1 A Simple Example

import torch
import torch.nn as nn

# Input sequence
input_sequence = ["hello", "world", "!"]

# Mapping from words to integers
word_to_idx = {"hello": 0, "world": 1, "!": 2}

# Embedding matrix
embedding_matrix = nn.Embedding(num_embeddings=len(word_to_idx), embedding_dim=5)

# Convert the words to integers
indexed_sequence = [word_to_idx[word] for word in input_sequence]

# Convert the integer sequence to a PyTorch tensor
input_tensor = torch.LongTensor(indexed_sequence)

# Input embedding layer
embedded_input = embedding_matrix(input_tensor)

print(f"Input Sequence: {input_sequence}")
# Input Sequence: ['hello', 'world', '!']

print(f"Indexed Sequence: {indexed_sequence}")
# Indexed Sequence: [0, 1, 2]

print(f"Embedded Input: \n{embedded_input}")
# Embedded Input:
# tensor([[ 0.1795,  2.0634, -0.1401, -1.2474, -0.6805],
#         [-0.7178, -0.5317, -0.3580,  0.8192,  2.3342],
#         [-0.8869, -1.4101,  0.7391,  0.6432, -0.7909]],
#        grad_fn=<EmbeddingBackward0>)

print(f"Shape of Embedded Input: {embedded_input.shape}")
# Shape of Embedded Input: torch.Size([3, 5])

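In this example we define a simple input sequence of three words and a dictionary word_to_idx that maps each word to an integer. This mapping is then used to convert the words to integers.

We create the embedding matrix with PyTorch's nn.Embedding class. Its size is (vocab_size, embedding_dim), where vocab_size is the size of the vocabulary (here the length of word_to_idx) and embedding_dim is the dimension of each word embedding (here 5).

The output is a tensor of shape (3, 5) holding the embedding of every word in the input sequence. These embeddings can be passed on to other layers of the model, such as the Transformer's self-attention layer or feedforward layer, for further processing.

In real programs the input tensor is usually two-dimensional, because the first dimension is batch_size. In the example above you can change input_tensor = torch.LongTensor(indexed_sequence) to input_tensor = torch.LongTensor(indexed_sequence).unsqueeze(0), which adds a dimension at the front; everything else stays the same and the code still runs, only now with the batch_size dimension taken into account. The output then has size (1, 3, 5): 1 is the batch_size, 3 is the number of word vectors, and 5 is the dimension of each word vector.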
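The effect of the batch dimension can be sketched as follows. The second sequence in the two-sequence batch is a made-up example, chosen to have the same length as the first so that the two can be stacked without padding.

```python
import torch
import torch.nn as nn

word_to_idx = {"hello": 0, "world": 1, "!": 2}
embedding = nn.Embedding(num_embeddings=len(word_to_idx), embedding_dim=5)

# A single sequence: add the batch dimension with unsqueeze(0)
single = torch.tensor([word_to_idx[w] for w in ["hello", "world", "!"]])
single_batched = single.unsqueeze(0)        # shape (1, 3)
out_single = embedding(single_batched)
print(out_single.shape)                     # torch.Size([1, 3, 5])

# Two sequences of equal length stacked into one batch (hypothetical data)
batch = torch.tensor([
    [word_to_idx[w] for w in ["hello", "world", "!"]],
    [word_to_idx[w] for w in ["world", "hello", "!"]],
])                                          # shape (2, 3)
out_batch = embedding(batch)
print(out_batch.shape)                      # torch.Size([2, 3, 5])
```

The first row of the two-sequence batch is the same sequence as the single one, so its embeddings are identical; only the leading dimension changes.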

2.2 Positional Encoding

2.2.1 The Positional Encoding Used in the Transformer

To preserve the order of the words in the sequence, a positional encoding is added to the input embeddings. The positional encoding used in the Transformer is given by formulas (1) and (2):
$PE_{(pos, 2i)}=\sin(pos/10000^{2i/d_{model}})=\sin(w_i\cdot pos) \tag{1}$

$PE_{(pos, 2i+1)}=\cos(pos/10000^{2i/d_{model}})=\cos(w_i\cdot pos) \tag{2}$

where $w_i = \frac{1}{10000^{2i / d_{model}}}$ .

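2.2.2 Meaning of the Symbols in the Formulas

The figure below briefly illustrates the meaning of $pos$ and $d_{model}$ in formulas (1) and (2). $pos$ is the position of each word vector; for example, the first row in the figure ( $pos = 0$ ) represents one word vector.

[Figure: word-vector matrix with rows indexed by $pos$ and row length $d_{model}$]

In formulas (1) and (2), $i$ indexes pairs of elements within a word vector: $2i$ is an even index and $2i + 1$ is an odd index. Taking the word vector at $pos = 0$ as an example:

- when $i = 0$ , $2i$ and $2i + 1$ refer to the first and second elements of this word vector;
- when $i = 1$ , $2i$ and $2i + 1$ refer to the third and fourth elements;
- when $i = 2$ , $2i$ and $2i + 1$ refer to the fifth and sixth elements.

Formulas (1) and (2) therefore give a positional encoding vector for every word vector, and the length $d_{PE}$ of each positional encoding vector equals the word-vector dimension $d_{model}$ . A simple example illustrates this.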

Example:

Suppose the word vector at $pos = 0$ has length 4, that is, $d_{model}=4$ . The first word vector can be written as
$x_0=[x_{00},x_{01},x_{02},x_{03}]\tag{3}$
The positional encoding vector for $x_0$ then also has length $d_{PE}=4$ and can be written as
$PE_0=[PE_{(0,0)},PE_{(0,1)},PE_{(0,2)},PE_{(0,3)}]\tag{4}$
Computing $PE_{(0,0)},PE_{(0,1)},PE_{(0,2)},PE_{(0,3)}$ from formulas (1) and (2) gives the following.

When $i = 0$ ,
$\begin{align} PE_{(0,0)}&=\sin(w_0\cdot 0)=\sin(\frac{1}{10000^{2\times 0 / 4}}\cdot 0)=0, \nonumber \\ PE_{(0,1)}&=\cos(w_0\cdot 0)=\cos(\frac{1}{10000^{2\times 0 / 4}}\cdot 0)=1, \tag{5} \end{align}$
When $i = 1$ ,
$\begin{align} PE_{(0,2)}&=\sin(w_1\cdot 0)=\sin(\frac{1}{10000^{2\times 1 / 4}}\cdot 0)=0, \nonumber \\ PE_{(0,3)}&=\cos(w_1\cdot 0)=\cos(\frac{1}{10000^{2\times 1 / 4}}\cdot 0)=1, \tag{6} \end{align}$
So the positional encoding for the word vector $x_0$ is
$\begin{align} PE_0&=[\sin(w_0\cdot 0),\cos(w_0\cdot 0),\sin(w_1\cdot 0),\cos(w_1\cdot 0)] \nonumber \\ &=[0,1,0,1] \tag{7} \end{align}$
and the position-encoded word vector is
$\begin{align} \hat{x}_0&=x_0+PE_0 \nonumber \\ &=[x_{00}+\sin(w_0\cdot 0),x_{01}+\cos(w_0\cdot 0),x_{02}+\sin(w_1\cdot 0),x_{03}+\cos(w_1\cdot 0)] \tag{8} \end{align}$
The addition is illustrated in the figure below.

[Figure: adding the positional encoding vector to the word vector element by element]
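The worked example can be checked numerically with a few lines of plain Python. This is a minimal sketch of formulas (1) and (2) for a single position; the helper name positional_encoding is an assumption made for illustration. The call for $pos = 1$ is added to show a position whose encoding is not just zeros and ones.

```python
import math

def positional_encoding(pos, d_model):
    """Positional encoding vector for one position, per formulas (1) and (2)."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        w_i = 1.0 / (10000 ** (2 * i / d_model))
        pe[2 * i] = math.sin(w_i * pos)      # even index: sine
        pe[2 * i + 1] = math.cos(w_i * pos)  # odd index: cosine
    return pe

pe0 = positional_encoding(0, 4)
pe1 = positional_encoding(1, 4)
print(pe0)                         # [0.0, 1.0, 0.0, 1.0]
print([round(v, 5) for v in pe1])  # [0.84147, 0.5403, 0.01, 0.99995]
```

For $pos = 1$ the first pair uses $w_0 = 1$ and the second pair uses $w_1 = 0.01$, so later pairs change much more slowly with position than earlier ones.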

Following this principle, we can define the positional encoding class below; its core is the computation in formulas (1) and (2).

import math

import torch
import torch.nn as nn


class PositionalEncoder(nn.Module):
    def __init__(self, d_model, max_seq_len=5000, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.max_seq_len = max_seq_len
        self.dropout = nn.Dropout(p=dropout)

        # Initialize the positional encoding matrix
        pe = torch.zeros(max_seq_len, d_model)

        # Fill it in according to formulas (1) and (2)
        for pos in range(max_seq_len):
            for i in range(d_model // 2):
                pe[pos, 2 * i] = math.sin(pos / 10000 ** ((2 * i) / d_model))
                pe[pos, 2 * i + 1] = math.cos(pos / 10000 ** ((2 * i) / d_model))

        # Add the batch_size dimension: (1, max_seq_len, d_model)
        pe = pe.unsqueeze(0)

        # Register pe as a buffer: it is saved with the model but not updated in training
        self.register_buffer('pe', pe)

    def forward(self, x):
        seq_len = x.size(1)

        return self.dropout(x + self.pe[:, :seq_len, :])


# Test code
d_model = 512
pos_encoding = PositionalEncoder(d_model)
batch_size, seq_len = 64, 10
input_tensor = torch.rand(batch_size, seq_len, d_model)
output_tensor = pos_encoding(input_tensor)
print(output_tensor.shape) # torch.Size([64, 10, 512])

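Code explanation:

The initializer __init__:
- In the initializer we define a positional encoding matrix of size (max_seq_len, d_model), where max_seq_len is the maximum length of an input sequence and d_model is the dimension of the embedding vectors.
- We then fill in the matrix with sine and cosine values. The encoding of each position is a combination of sines and cosines, which expresses the position within the embedding space.
The forward function forward:
- The forward function receives an input tensor x of shape (batch_size, seq_len, d_model).
- It adds the first seq_len rows of the positional encoding matrix to the input tensor, applies dropout to the sum, and returns the result. Every embedding vector is thus combined with the encoding of its position.
Test code:
- We use a randomly generated input tensor of shape (batch_size, seq_len, d_model) as the example input.
- We pass it through the positional encoding layer and print the shape of the output.

The purpose of the positional encoding is to add a position-specific code to every position of the input sequence so that the model can tell the positions apart.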
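The double Python loop in the class above is easy to read but slow for a large max_seq_len. A common alternative builds the same table with tensor broadcasting. The sketch below is one possible way to do this, not the original author's code; the function names are made up for illustration and an even d_model is assumed. It checks the vectorized table against the loop on a small case.

```python
import math
import torch

def build_pe_loop(max_seq_len, d_model):
    """Reference table built with the double loop from the class above."""
    pe = torch.zeros(max_seq_len, d_model)
    for pos in range(max_seq_len):
        for i in range(d_model // 2):
            pe[pos, 2 * i] = math.sin(pos / 10000 ** ((2 * i) / d_model))
            pe[pos, 2 * i + 1] = math.cos(pos / 10000 ** ((2 * i) / d_model))
    return pe

def build_pe_vectorized(max_seq_len, d_model):
    """Same table built with broadcasting (assumes d_model is even)."""
    position = torch.arange(max_seq_len, dtype=torch.float32).unsqueeze(1)  # (L, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)                # 0, 2, 4, ...
    w = torch.exp(-math.log(10000.0) * two_i / d_model)                     # w_i = 10000^(-2i/d_model)
    pe = torch.zeros(max_seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * w)  # even columns
    pe[:, 1::2] = torch.cos(position * w)  # odd columns
    return pe

pe_loop = build_pe_loop(50, 16)
pe_vec = build_pe_vectorized(50, 16)
print(pe_vec.shape)                                  # torch.Size([50, 16])
print(torch.allclose(pe_loop, pe_vec, atol=1e-4))    # True
```

The two tables agree only up to float32 rounding, which is why the comparison uses a tolerance rather than exact equality.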

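2.3 Multi-Head Self-Attention Layer

The multi-head self-attention layer is one of the core components of the Transformer. It captures the dependencies inside a sequence and encodes them into a global contextual representation. Its structure is shown in the figure below.

[Figure: structure of the multi-head self-attention layer]

2.3.1 Self-Attention Mechanism

Before looking at the multi-head layer we first need the self-attention mechanism. Self-attention lets the model attend to the other positions of the sequence, not only the current one, when computing the representation of each position. The model weights the positions dynamically according to their importance. The structure of the self-attention mechanism is shown in the figure below:

[Figure: structure of the self-attention mechanism]

In self-attention we first obtain $Q$ , $K$ , $V$ . They are linear transformations of the input $\hat{X}$ :
$\begin{align} \hat{X}W^Q=Q\tag{9}\\ \hat{X}W^K=K\tag{10}\\ \hat{X}W^V=V\tag{11} \end{align}$
where $W^Q,W^K,W^V$ are the linear transformation matrices.

Formulas (9), (10), and (11) are illustrated in the figure below.

[Figure: $\hat{X}$ multiplied by $W^Q$, $W^K$, $W^V$ to give $Q$, $K$, $V$]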

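Once $Q$ , $K$ , $V$ are available, the self-attention output is computed as
$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V,\tag{12}$
In formula (12), $d_k$ is the number of columns of $Q$ and $K$ in the figure above, as marked in the figure below.

[Figure: $d_k$ marked as the number of columns of $Q$ and $K$]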

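The softmax function in formula (12) is defined as
$\mathrm{softmax}(x)_i=\frac{e^{x_i}}{\sum_{j=1}^{n}{e^{x_j}}},\tag{13}$
For any real vector of length $n$ , softmax maps its entries into the range [0, 1] so that they sum to 1.

In formula (12), softmax is applied to the matrix $\frac{QK^T}{\sqrt{d_k}}$ so that each row has entries in [0, 1] that sum to 1. A common issue when applying softmax is numerical stability: $e^{x_i}$ , and hence $\sum_j{e^{x_j}}$ , can become so large that the exponential overflows. The overflow is avoided by subtracting the maximum from every entry: first find the maximum $x_{max}$ of the vector $x$ , then replace each entry $x_i$ by $x_i-x_{max}$ . The result is unchanged because the common factor $e^{-x_{max}}$ cancels between numerator and denominator. Formula (13) then becomes
$\mathrm{softmax}(x)_i=\frac{e^{x_i-x_{max}}}{\sum_{j=1}^{n}{e^{x_j-x_{max}}}},\tag{14}$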
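The overflow problem and its fix can be reproduced with the standard library alone. In this sketch the function names softmax_naive and softmax_stable and the test vectors are chosen only for illustration; math.exp raises OverflowError for arguments above roughly 709, which makes the failure of the direct formula easy to see.

```python
import math

def softmax_naive(xs):
    """Formula (13) applied directly; math.exp overflows for large inputs."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(xs):
    """Formula (14): subtract the maximum before exponentiating."""
    x_max = max(xs)
    exps = [math.exp(x - x_max) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

small = [1.0, 2.0, 3.0]
large = [1000.0, 1001.0, 1002.0]  # same gaps as `small`, shifted by 999

print([round(p, 4) for p in softmax_stable(small)])  # [0.09, 0.2447, 0.6652]
print([round(p, 4) for p in softmax_stable(large)])  # [0.09, 0.2447, 0.6652]

try:
    softmax_naive(large)
except OverflowError as err:
    print("naive softmax failed:", err)
```

Both inputs give the same probabilities because softmax depends only on the differences between entries, which is exactly why subtracting the maximum is safe.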
Example 1: the simple example below walks through every step of the self-attention computation.

Randomly generate a $4\times 6$ matrix to serve as $\hat{X}$ :
$\hat{X}=\begin{bmatrix} 0.22 & 0.87 & 0.21 & 0.92 & 0.49 & 0.61\\ 0.77 & 0.52 & 0.3 & 0.19 & 0.08 & 0.74\\ 0.44 & 0.16 & 0.88 & 0.27 & 0.41 & 0.3 \\ 0.63 & 0.58 & 0.6 & 0.27 & 0.28 & 0.25 \end{bmatrix}$
We likewise randomly generate three linear transformation matrices $W^Q,W^K,W^V$ :
$W^Q=\begin{bmatrix} 0.33 & 0.14 & 0.17\\ 0.96 & 0.96 & 0.19\\ 0.02 & 0.2 & 0.7 \\ 0.78 & 0.02 & 0.58 \\ 0. & 0.52 & 0.64\\ 0.99 & 0.26 & 0.8 \end{bmatrix}$

$W^K=\begin{bmatrix} 0.87 & 0.92 & 0.\\ 0.47 & 0.98 & 0.4 \\ 0.81 & 0.55 & 0.77\\ 0.48 & 0.03 & 0.09\\ 0.11 & 0.25 & 0.96\\ 0.63 & 0.82 & 0.57 \end{bmatrix}$

$W^V=\begin{bmatrix} 0.64 & 0.81 & 0.93\\ 0.91 & 0.82 & 0.09\\ 0.36 & 0.04 & 0.55\\ 0.8 & 0.05 & 0.19\\ 0.37 & 0.24 & 0.8\\ 0.35 & 0.64 & 0.49 \end{bmatrix}$

From formulas (9), (10), and (11) we obtain $Q, K, V$ :
$\begin{align} Q&=\hat{X}W^Q\nonumber \\\nonumber &= \begin{bmatrix} 0.22 & 0.87 & 0.21 & 0.92 & 0.49 & 0.61\\\nonumber 0.77 & 0.52 & 0.3 & 0.19 & 0.08 & 0.74\\\nonumber 0.44 & 0.16 & 0.88 & 0.27 & 0.41 & 0.3 \\\nonumber 0.63 & 0.58 & 0.6 & 0.27 & 0.28 & 0.25\nonumber \end{bmatrix} \times \begin{bmatrix}\nonumber 0.33 & 0.14 & 0.17\\\nonumber 0.96 & 0.96 & 0.19\\ 0.02 & 0.2 & 0.7 \\ 0.78 & 0.02 & 0.58 \\ 0. & 0.52 & 0.64\\ 0.99 & 0.26 & 0.8 \end{bmatrix}\\\nonumber &= \begin{bmatrix} 2.2335 & 1.3398 & 1.6849\\\nonumber 1.6401 & 0.9048 & 1.1931\\ 0.824 & 0.6878 & 1.3802 \\ 1.2348 & 0.981 & 1.1731 \end{bmatrix}\nonumber \end{align}$

$\begin{align} K&=\hat{X}W^K\nonumber \\\nonumber &= \begin{bmatrix} 0.22 & 0.87 & 0.21 & 0.92 & 0.49 & 0.61\\\nonumber 0.77 & 0.52 & 0.3 & 0.19 & 0.08 & 0.74\\ 0.44 & 0.16 & 0.88 & 0.27 & 0.41 & 0.3 \\ 0.63 & 0.58 & 0.6 & 0.27 & 0.28 & 0.25 \end{bmatrix} \times \begin{bmatrix} 0.87 & 0.92 & 0.\\ 0.47 & 0.98 & 0.4 \\ 0.81 & 0.55 & 0.77\\ 0.48 & 0.03 & 0.09\\ 0.11 & 0.25 & 0.96\\ 0.63 & 0.82 & 0.57 \end{bmatrix}\\\nonumber &= \begin{bmatrix} 1.6502 & 1.8208 & 1.4106\nonumber\\\nonumber 1.7235 & 2.0155 & 0.9547\\ 1.5345 & 1.4022 & 1.3305 \\ 1.6246 & 1.7611 & 1.1296 \end{bmatrix}\nonumber \end{align}$

$\begin{align} V&=\hat{X}W^V \nonumber\\\nonumber &= \begin{bmatrix} 0.22 & 0.87 & 0.21 & 0.92 & 0.49 & 0.61\\ 0.77 & 0.52 & 0.3 & 0.19 & 0.08 & 0.74\\ 0.44 & 0.16 & 0.88 & 0.27 & 0.41 & 0.3 \\ 0.63 & 0.58 & 0.6 & 0.27 & 0.28 & 0.25 \end{bmatrix} \times \begin{bmatrix} 0.64 & 0.81 & 0.93\\ 0.91 & 0.82 & 0.09\\ 0.36 & 0.04 & 0.55\\ 0.8 & 0.05 & 0.19\\ 0.37 & 0.24 & 0.8\\ 0.35 & 0.64 & 0.49 \end{bmatrix}\\ &= \begin{bmatrix} 2.1389 & 1.454 & 1.2641\\ 1.5146 & 1.5644 & 1.3906\\ 1.2167 & 0.8267 & 1.4339 \\ 1.5541 & 1.2506 & 1.3659 \end{bmatrix}\nonumber \end{align}$

With $Q, K, V$ in hand, following formula (12) we first compute
$\begin{align} \frac{QK^T}{\sqrt{d_k}}&=\frac{1}{\sqrt{3}}\times \begin{bmatrix}\nonumber 2.2335 & 1.3398 & 1.6849\\ 1.6401 & 0.9048 & 1.1931\\ 0.824 & 0.6878 & 1.3802 \\ 1.2348 & 0.981 & 1.1731 \end{bmatrix} \times \begin{bmatrix} 1.6502 & 1.7235 & 1.5345 & 1.6246\\ 1.8208 & 2.0155 & 1.4022 & 1.7611\\ 1.4106 & 0.9547 & 1.3305 & 1.1296\\ \end{bmatrix}\\ &= \begin{bmatrix} 4.90860282 & 4.71024184 & 4.35768554 & 4.55606088\\ 3.48542877 & 3.34250548 & 3.10202422 & 3.23643826\\ 2.63215209 & 2.38105131 & 2.34705428 & 2.37234894\\ 3.1630981 & 3.01685254 & 2.78927635 & 2.92071625 \end{bmatrix}\nonumber \end{align}$
We then apply formula (14) to each row of $\frac{QK^T}{\sqrt{d_k}}$ , which gives
$\mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})=\begin{bmatrix} 0.32264375 & 0.26459168 & 0.18597858 & 0.226786\\ 0.30048573 & 0.26046721 & 0.20479218 & 0.23425488\\ 0.30293042 & 0.23566289 & 0.22778572 & 0.23362097\\ 0.29968819 & 0.25891427 & 0.20621531 & 0.23518224 \end{bmatrix}$
Each row of the matrix above sums to 1, as can be verified. Finally, multiplying this matrix by $V$ gives the output of the self-attention mechanism:
$\begin{align} \mathrm{softmax}(\frac{QK^T}{\sqrt{d_k}})V&=\begin{bmatrix}\nonumber 0.32264375 & 0.26459168 & 0.18597858 & 0.226786\\ 0.30048573 & 0.26046721 & 0.20479218 & 0.23425488\\ 0.30293042 & 0.23566289 & 0.22778572 & 0.23362097\\ 0.29968819 & 0.25891427 & 0.20621531 & 0.23518224 \end{bmatrix} \times \begin{bmatrix} 2.1389 & 1.454 & 1.2641\\ 1.5146 & 1.5644 & 1.3906\\ 1.2167 & 0.8267 & 1.4339 \\ 1.5541 & 1.2506 & 1.3659 \end{bmatrix}\\ &= \begin{bmatrix} 1.66958152 & 1.32041829 & 1.35223682\\ 1.65043872 & 1.306642 & 1.35566996\\ 1.64509013 & 1.2896087 & 1.35637198 \\ 1.64955349 & 1.30538921 & 1.35580957 \end{bmatrix}\nonumber \end{align}$

The code for this computation is given below for readers who want to try it themselves.

import numpy as np
# Randomly generate a 4 x 6 word-vector matrix x_hat
np.random.seed(5)
x_hat = np.random.rand(4, 6)
x_hat = np.round(x_hat, 2)  # round to two decimals
print(f"Input word-vector matrix x_hat:\n{x_hat}")
# Input word-vector matrix x_hat:
# [[0.22 0.87 0.21 0.92 0.49 0.61]
#  [0.77 0.52 0.3  0.19 0.08 0.74]
#  [0.44 0.16 0.88 0.27 0.41 0.3 ]
#  [0.63 0.58 0.6  0.27 0.28 0.25]]

# Randomly generate a 6 x 3 linear transformation matrix W_Q
W_Q = np.random.rand(6, 3)
W_Q = np.round(W_Q, 2)  # round to two decimals
print(f"Linear transformation matrix W_Q:\n{W_Q}")
# Linear transformation matrix W_Q:
# [[0.33 0.14 0.17]
#  [0.96 0.96 0.19]
#  [0.02 0.2  0.7 ]
#  [0.78 0.02 0.58]
#  [0.   0.52 0.64]
#  [0.99 0.26 0.8 ]]

# Randomly generate a 6 x 3 linear transformation matrix W_K
W_K = np.random.rand(6, 3)
W_K = np.round(W_K, 2)  # round to two decimals
print(f"Linear transformation matrix W_K:\n{W_K}")
# Linear transformation matrix W_K:
# [[0.87 0.92 0.  ]
#  [0.47 0.98 0.4 ]
#  [0.81 0.55 0.77]
#  [0.48 0.03 0.09]
#  [0.11 0.25 0.96]
#  [0.63 0.82 0.57]]

# Randomly generate a 6 x 3 linear transformation matrix W_V
W_V = np.random.rand(6, 3)
W_V = np.round(W_V, 2)  # round to two decimals
print(f"Linear transformation matrix W_V:\n{W_V}")
# Linear transformation matrix W_V:
# [[0.64 0.81 0.93]
#  [0.91 0.82 0.09]
#  [0.36 0.04 0.55]
#  [0.8  0.05 0.19]
#  [0.37 0.24 0.8 ]
#  [0.35 0.64 0.49]]

Q = x_hat @ W_Q
print(f"Q:\n{Q}")
# Q:
# [[2.2335 1.3398 1.6849]
#  [1.6401 0.9048 1.1931]
#  [0.824  0.6878 1.3802]
#  [1.2348 0.981  1.1731]]

K = x_hat @ W_K
print(f"K:\n{K}")
# K:
# [[1.6502 1.8208 1.4106]
#  [1.7235 2.0155 0.9547]
#  [1.5345 1.4022 1.3305]
#  [1.6246 1.7611 1.1296]]

V = x_hat @ W_V
print(f"V:\n{V}")
# V:
# [[2.1389 1.454  1.2641]
#  [1.5146 1.5644 1.3906]
#  [1.2167 0.8267 1.4339]
#  [1.5541 1.2506 1.3659]]

Q_KT_d_k = Q @ K.T / np.sqrt(3)
print(f"Q_KT_d_k:\n{Q_KT_d_k}")
# Q_KT_d_k:
# [[4.90860282 4.71024184 4.35768554 4.55606088]
#  [3.48542877 3.34250548 3.10202422 3.23643826]
#  [2.63215209 2.38105131 2.34705428 2.37234894]
#  [3.1630981  3.01685254 2.78927635 2.92071625]]

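# A common issue when applying softmax is numerical stability: e^x can overflow
# for large x. Subtracting the maximum from every value avoids the overflow.
def softmax(x):
    # Row-wise maximum; keepdims=True keeps the shape (rows, 1), e.g. 4 x 5 --> 4 x 1,
    # so it broadcasts against x instead of collapsing to a 1-D array
    row_max = np.max(x, axis=1, keepdims=True)
    e_x = np.exp(x - row_max)  # subtract each row's maximum from its elements
    row_sum = np.sum(e_x, axis=1, keepdims=True)
    out = e_x / row_sum
    return out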


soft_Q_KT = softmax(Q_KT_d_k)
print(f"Softmax result is \n{soft_Q_KT}")
# Softmax result is
# [[0.32264375 0.26459168 0.18597858 0.226786  ]
#  [0.30048573 0.26046721 0.20479218 0.23425488]
#  [0.30293042 0.23566289 0.22778572 0.23362097]
#  [0.29968819 0.25891427 0.20621531 0.23518224]]

out_self_attention = soft_Q_KT @ V
print(f"Self attention output is \n{out_self_attention}")
# Self attention output is
# [[1.66958152 1.32041829 1.35223682]
#  [1.65043872 1.306642   1.35566996]
#  [1.64509013 1.2896087  1.35637198]
#  [1.64955349 1.30538921 1.35580957]]

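With this computation understood, the more compact example below is easy to follow.

Example 2: a short example, starting from an input sequence, that shows the self-attention computation end to end.

Suppose we have a simple input sequence of three words, ["I", "love", "AI"], each with an embedding of dimension 4. We use self-attention to compute a representation of each word.

Suppose the input sequence is represented as follows:

I: [1, 0, 0, 0]
love: [0, 1, 0, 0]
AI: [0, 0, 1, 0]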

import torch
import torch.nn as nn
import torch.nn.functional as F

# Input sequence representation
input_sequence = torch.tensor([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], dtype=torch.float32)

# Create W_Q, W_K, W_V (randomly initialized, so the numbers below vary from run to run)
W_Q = nn.Linear(4, 3)
W_K = nn.Linear(4, 3)
W_V = nn.Linear(4, 3)

# Compute Q, K, V
Q = W_Q(input_sequence)
K = W_K(input_sequence)
V = W_V(input_sequence)

# Compute the attention weights: softmax(Q K^T / sqrt(d_k)), row by row
d_k = Q.size(-1)
attention_weights = F.softmax(torch.matmul(Q, K.transpose(0, 1)) / d_k ** 0.5, dim=1)

# Compute the output sequence
output_sequence = torch.matmul(attention_weights, V)

print(f"Input sequence:\n{input_sequence}")
# Input sequence:
# tensor([[1., 0., 0., 0.],
#         [0., 1., 0., 0.],
#         [0., 0., 1., 0.]])

print(f"Q (query):\n{Q}")
# Q (query):
# tensor([[ 0.6386,  0.5694, -0.6500],
#         [ 0.5133,  0.6588, -0.5769],
#         [ 0.1585, -0.2131, -0.7061]], grad_fn=<AddmmBackward0>)

print(f"K (key):\n{K}")
# K (key):
# tensor([[ 0.6144, -0.0623, -0.2178],
#         [-0.0351, -0.0054, -0.6246],
#         [ 0.5004, -0.2891, -0.6324]], grad_fn=<AddmmBackward0>)

print(f"V (value):\n{V}")
# V (value):
# tensor([[ 0.1948,  0.3932, -0.4839],
#         [-0.2425,  0.2632,  0.2790],
#         [ 0.0439,  0.0908,  0.1689]], grad_fn=<AddmmBackward0>)

print(f"Attention weights:\n{attention_weights}")
# Attention weights:
# tensor([[0.3362, 0.3141, 0.3496],
#         [0.3352, 0.3235, 0.3413],
#         [0.3022, 0.3337, 0.3641]], grad_fn=<SoftmaxBackward0>)

print(f"Output sequence:\n{output_sequence}")
# Output sequence:
# tensor([[ 0.0047,  0.2467, -0.0160],
#         [ 0.0018,  0.2480, -0.0143],
#         [-0.0061,  0.2397,  0.0084]], grad_fn=<MmBackward0>)

This example shows the whole self-attention computation: computing Q, K, V, then the attention weights, then the output sequence.
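The same steps can be wrapped in a small function that also accepts a leading batch dimension, which is how attention is normally called inside a model. This is a sketch under assumptions: the function name, the random test tensors, and the fixed seed are all chosen here for illustration and do not come from the examples above.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for tensors shaped (..., seq_len, d_k)."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    weights = F.softmax(scores, dim=-1)  # each row of weights sums to 1
    return torch.matmul(weights, V), weights

torch.manual_seed(0)      # assumption: fixed seed for repeatability
Q = torch.rand(2, 3, 4)   # hypothetical batch: 2 sequences, 3 tokens, d_k = 4
K = torch.rand(2, 3, 4)
V = torch.rand(2, 3, 4)

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape)    # torch.Size([2, 3, 4])
print(weights.shape)   # torch.Size([2, 3, 3])
```

Transposing only the last two dimensions of K leaves any leading batch dimensions alone, so the same function works for a single sequence and for a batch.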

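2.3.4 Structure of the Multi-Head Self-Attention Layer

A multi-head self-attention layer consists of several attention heads, each of which learns its own attention weights independently. The outputs of the heads are concatenated and passed through a linear transformation to give the representation of every position in the input sequence. In most publicly available code, multi-head attention is written as follows: first compute one overall Q, K, V from the input, then split Q, K, V according to the number of heads.

[Figure: splitting the overall Q, K, V among the attention heads]

The general steps for building a multi-head self-attention layer are:

- Projection: map the input sequence through projection matrices into several representation spaces, one for each attention head.
- Split: divide the projected representation into parts, one part per attention head.
- Independent Attention Computation: run self-attention on each part separately, so every head computes its own attention weights and its own output.
- Concatenation: concatenate the outputs computed by the individual heads into a single matrix.
- Linear Transform: apply a linear transformation to the concatenated multi-head representation to integrate the information further and adjust its dimension.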
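The five steps above can be sketched in NumPy as follows. The split and concatenation are done with reshape and transpose instead of explicit column slices, which is the form most implementations use. The function names, the random inputs, and the seed are assumptions made for illustration.

```python
import numpy as np

def softmax_rows(x):
    """Softmax over the last axis, subtracting the maximum for stability."""
    e_x = np.exp(x - np.max(x, axis=-1, keepdims=True))
    return e_x / np.sum(e_x, axis=-1, keepdims=True)

def multi_head_attention(x, W_Q, W_K, W_V, W_O, num_heads):
    seq_len, d_model = x.shape
    d_k = d_model // num_heads

    # 1. Projection
    Q, K, V = x @ W_Q, x @ W_K, x @ W_V

    # 2. Split: (seq_len, d_model) -> (num_heads, seq_len, d_k)
    def split(m):
        return m.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)
    Q_h, K_h, V_h = split(Q), split(K), split(V)

    # 3. Independent attention computation for every head at once
    scores = Q_h @ K_h.transpose(0, 2, 1) / np.sqrt(d_k)
    heads = softmax_rows(scores) @ V_h  # (num_heads, seq_len, d_k)

    # 4. Concatenation: back to (seq_len, d_model)
    Z = heads.transpose(1, 0, 2).reshape(seq_len, d_model)

    # 5. Linear transform
    return Z @ W_O

rng = np.random.default_rng(0)  # assumption: arbitrary seed, illustration only
x = rng.random((4, 6))
W_Q, W_K, W_V, W_O = (rng.random((6, 6)) for _ in range(4))
out = multi_head_attention(x, W_Q, W_K, W_V, W_O, num_heads=3)
print(out.shape)  # (4, 6)
```

Head number h receives columns h * d_k up to (h + 1) * d_k of Q, K, and V, so the reshape gives the same result as slicing the columns by hand.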

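The computation in Multi-Head Attention proceeds as follows.

Obtaining $Q$ , $K$ , $V$ : they are linear transformations of the input word-vector matrix $\hat{X}$ :

$\begin{align} \hat{X}W^Q=Q\tag{15}\\ \hat{X}W^K=K\tag{16}\\ \hat{X}W^V=V\tag{17} \end{align}$

where $W^Q,W^K,W^V$ are the linear transformation matrices.

For ease of understanding, let the input $\hat{X}$ have dimension $4\times 6$ and the matrices $W^Q,W^K,W^V$ have dimension $6\times 6$ , as shown in the figure below.

[Figure: $\hat{X}$ of size $4\times 6$ and $W^Q,W^K,W^V$ of size $6\times 6$]

The $Q, K, V$ computed from formulas (15), (16), and (17) then have dimension $4\times 6$ , as shown in the figure below.

[Figure: $Q, K, V$ of size $4\times 6$]

Suppose the number of heads is 3. We can then split $Q, K, V$ column-wise into three equal parts, as shown in the figure below.

[Figure: $Q, K, V$ each split into three parts]

Once $\{Q_1,K_1,V_1\},\cdots, \{Q_h,K_h,V_h\}$ are available, the self-attention output of each head is computed as
$\mathrm{Attention}(Q_i,K_i,V_i)=\mathrm{softmax}(\frac{Q_iK_i^T}{\sqrt{d_k}})V_i,\tag{18}$
In formula (18), $d_k$ is the number of columns of each of $\{Q_h,K_h,V_h\}$ in the figure above, here $d_k=2$ , as shown in the figure below.

[Figure: $d_k=2$ marked as the number of columns of each part]

From formula (18), the output of each self-attention head has dimension $4\times 2$ , as shown in the figure below.

[Figure: three self-attention outputs, each of size $4\times 2$]

Concatenation joins the three self-attention outputs side by side, as shown in the figure below.

[Figure: the three outputs concatenated into a $4\times 6$ matrix]

A simple example simulates the multi-head attention computation.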

Randomly generate a $4\times 6$ matrix to serve as $\hat{X}$ :
$\hat{X}=\begin{bmatrix} 0.22 & 0.87 & 0.21 & 0.92 & 0.49 & 0.61\\ 0.77 & 0.52 & 0.3 & 0.19 & 0.08 & 0.74\\ 0.44 & 0.16 & 0.88 & 0.27 & 0.41 & 0.3 \\ 0.63 & 0.58 & 0.6 & 0.27 & 0.28 & 0.25 \end{bmatrix}$
We likewise randomly generate three linear transformation matrices $W^Q,W^K,W^V$ :
$W^Q=\begin{bmatrix} 0.33 & 0.14 & 0.17 & 0.96 & 0.96 & 0.19\\ 0.02 & 0.2 & 0.7 & 0.78 & 0.02 & 0.58\\ 0. & 0.52 & 0.64 & 0.99 & 0.26 & 0.8 \\ 0.87 & 0.92 & 0. & 0.47 & 0.98 & 0.4 \\ 0.81 & 0.55 & 0.77 & 0.48 & 0.03 & 0.09\\ 0.11 & 0.25 & 0.96 & 0.63 & 0.82 & 0.57 \end{bmatrix}$

$W^K=\begin{bmatrix} 0.64 & 0.81 & 0.93 & 0.91 & 0.82 & 0.09\\ 0.36 & 0.04 & 0.55 & 0.80 & 0.05 & 0.19\\ 0.37 & 0.24 & 0.80 & 0.35 & 0.64 & 0.49\\ 0.58 & 0.94 & 0.94 & 0.11 & 0.84 & 0.35\\ 0.10 & 0.38 & 0.51 & 0.96 & 0.37 & 0.01\\ 0.86 & 0.11 & 0.48 & 0.85 & 0.51 & 0.45 \end{bmatrix}$

$W^V=\begin{bmatrix} 0.80 & 0.02 & 0.57 & 0.41 & 0.99 & 0.80\\ 0.05 & 0.19 & 0.45 & 0.70 & 0.33 & 0.36\\ 0.92 & 0.95 & 0.41 & 0.90 & 0.33 & 0.08\\ 0.53 & 0.66 & 0.89 & 0.97 & 0.77 & 0.76\\ 0.71 & 0.70 & 0.77 & 0.97 & 0.37 & 0.08\\ 0.24 & 0.22 & 0.36 & 0.81 & 0.06 & 0.45 \end{bmatrix}$

From formulas (15), (16), and (17) we obtain $Q, K, V$ :
$\begin{align} Q&=\hat{X}W^Q\nonumber\\ \nonumber &= \begin{bmatrix} 0.22 & 0.87 & 0.21 & 0.92 & 0.49 & 0.61\\ 0.77 & 0.52 & 0.3 & 0.19 & 0.08 & 0.74\\ 0.44 & 0.16 & 0.88 & 0.27 & 0.41 & 0.3 \\ 0.63 & 0.58 & 0.6 & 0.27 & 0.28 & 0.25 \end{bmatrix}\times \begin{bmatrix} 0.33 & 0.14 & 0.17 & 0.96 & 0.96 & 0.19\\ 0.02 & 0.2 & 0.7 & 0.78 & 0.02 & 0.58\\ 0. & 0.52 & 0.64 & 0.99 & 0.26 & 0.8 \\ 0.87 & 0.92 & 0. & 0.47 & 0.98 & 0.4 \\ 0.81 & 0.55 & 0.77 & 0.48 & 0.03 & 0.09\\ 0.11 & 0.25 & 0.96 & 0.63 & 0.82 & 0.57 \end{bmatrix}\\ &= \begin{bmatrix} 1.3544 & 1.5824 & 1.7437 & 2.1496 & 1.6997 & 1.4742\\ 0.5760 & 0.7716 & 1.4589 & 2.0357 & 1.6230 & 1.1929\\ 0.7484 & 1.1001 & 1.3537 & 1.9311 & 1.1773 & 1.1963\\ 0.7087 & 0.9811 & 1.3527 & 2.0700 & 1.2504 & 1.2118 \end{bmatrix}\nonumber \end{align}$

$\begin{align} K&=\hat{X}W^K\nonumber \\\nonumber &=\begin{bmatrix} 0.22 & 0.87 & 0.21 & 0.92 & 0.49 & 0.61\\ 0.77 & 0.52 & 0.3 & 0.19 & 0.08 & 0.74\\ 0.44 & 0.16 & 0.88 & 0.27 & 0.41 & 0.3 \\ 0.63 & 0.58 & 0.6 & 0.27 & 0.28 & 0.25 \end{bmatrix}\times \begin{bmatrix} 0.64 & 0.81 & 0.93 & 0.91 & 0.82 & 0.09\\ 0.36 & 0.04 & 0.55 & 0.80 & 0.05 & 0.19\\ 0.37 & 0.24 & 0.80 & 0.35 & 0.64 & 0.49\\ 0.58 & 0.94 & 0.94 & 0.11 & 0.84 & 0.35\\ 0.10 & 0.38 & 0.51 & 0.96 & 0.37 & 0.01\\ 0.86 & 0.11 & 0.48 & 0.85 & 0.51 & 0.45 \end{bmatrix}\\ &=\begin{bmatrix} 1.6389 & 1.3815 & 2.2586 & 2.0598 & 1.6235 & 0.8894\\ 1.5456 & 1.0069 & 1.8167 & 1.9484 & 1.4160 & 0.7154\\ 1.1204 & 1.0166 & 1.8081 & 1.5147 & 1.4635 & 0.7348\\ 1.2336 & 1.0652 & 1.9015 & 1.7583 & 1.3875 & 0.6707 \end{bmatrix}\nonumber \end{align}$

$\begin{align} V&=\hat{X}W^V\nonumber\\\nonumber &= \begin{bmatrix} 0.22 & 0.87 & 0.21 & 0.92 & 0.49 & 0.61\\ 0.77 & 0.52 & 0.3 & 0.19 & 0.08 & 0.74\\ 0.44 & 0.16 & 0.88 & 0.27 & 0.41 & 0.3 \\ 0.63 & 0.58 & 0.6 & 0.27 & 0.28 & 0.25 \end{bmatrix}\times \begin{bmatrix} 0.80 & 0.02 & 0.57 & 0.41 & 0.99 & 0.80\\ 0.05 & 0.19 & 0.45 & 0.70 & 0.33 & 0.36\\ 0.92 & 0.95 & 0.41 & 0.90 & 0.33 & 0.08\\ 0.53 & 0.66 & 0.89 & 0.97 & 0.77 & 0.76\\ 0.71 & 0.70 & 0.77 & 0.97 & 0.37 & 0.08\\ 0.24 & 0.22 & 0.36 & 0.81 & 0.06 & 0.45 \end{bmatrix}\\ &= \begin{bmatrix} 1.3946 & 1.4536 & 2.0187 & 2.7500 & 1.5005 & 1.5189\\ 1.2531 & 0.7434 & 1.2930 & 1.8110 & 1.2532 & 1.3110\\ 1.6758 & 1.4064 & 1.3476 & 1.9870 & 1.1564 & 0.8530\\ 1.4869 & 1.1220 & 1.4120 & 1.9403 & 1.3396 & 1.1009 \end{bmatrix}\nonumber \end{align}$

Split $Q, K, V$ into $\{Q_1,K_1,V_1\}$ , $\{Q_2,K_2,V_2\}$ , $\{Q_3,K_3,V_3\}$ :
$\{Q_1,K_1,V_1\}=\{\begin{bmatrix} 1.3544 & 1.5824\\ 0.5760 & 0.7716\\ 0.7484 & 1.1001\\ 0.7087 & 0.9811 \end{bmatrix}, \begin{bmatrix} 1.6389 & 1.3815\\ 1.5456 & 1.0069\\ 1.1204 & 1.0166\\ 1.2336 & 1.0652 \end{bmatrix} , \begin{bmatrix} 1.3946 & 1.4536\\ 1.2531 & 0.7434\\ 1.6758 & 1.4064\\ 1.4869 & 1.1220 \end{bmatrix} \}$

$\{Q_2,K_2,V_2\}=\{\begin{bmatrix} 1.7437 & 2.1496\\ 1.4589 & 2.0357\\ 1.3537 & 1.9311\\ 1.3527 & 2.0700 \end{bmatrix}, \begin{bmatrix} 2.2586 & 2.0598\\ 1.8167 & 1.9484\\ 1.8081 & 1.5147\\ 1.9015 & 1.7583 \end{bmatrix} , \begin{bmatrix} 2.0187 & 2.7500\\ 1.2930 & 1.8110\\ 1.3476 & 1.9870\\ 1.4120 & 1.9403 \end{bmatrix}\}$

$\{Q_3,K_3,V_3\}=\{\begin{bmatrix} 1.6997 & 1.4742\\ 1.6230 & 1.1929\\ 1.1773 & 1.1963\\ 1.2504 & 1.2118 \end{bmatrix}, \begin{bmatrix} 1.6235 & 0.8894\\ 1.4160 & 0.7154\\ 1.4635 & 0.7348\\ 1.3875 & 0.6707 \end{bmatrix} , \begin{bmatrix} 1.5005 & 1.5189\\ 1.2532 & 1.3110\\ 1.1564 & 0.8530\\ 1.3396 & 1.1009 \end{bmatrix}\}$

For $\{Q_1,K_1,V_1\}$ we first compute
$\begin{align} \frac{Q_1K_1^T}{\sqrt{d_k}}&=\frac{1}{\sqrt{2}}\times \begin{bmatrix}\nonumber 1.3544 & 1.5824\\ 0.5760 & 0.7716\\ 0.7484 & 1.1001\\ 0.7087 & 0.9811 \end{bmatrix} \times \begin{bmatrix} 1.6389 & 1.5456 & 1.1204 & 1.2336\\ 1.3815 & 1.0069 & 1.0166 & 1.0652 \end{bmatrix}\\ &=\begin{bmatrix} 3.1154 & 2.6069 & 2.2105 & 2.3733\\ 1.4213 & 1.1789 & 1.0110 & 1.0836\\ 1.9420 & 1.6012 & 1.3837 & 1.4814\\ 1.7797 & 1.4731 & 1.2667 & 1.3572 \end{bmatrix}\nonumber \end{align}$
Applying softmax to each row of the result gives
$\mathrm{softmax}(\frac{Q_1K_1^T}{\sqrt{d_k}})= \begin{bmatrix} 0.4029 & 0.2423 & 0.1630 & 0.1918\\ 0.3163 & 0.2482 & 0.2098 & 0.2257\\ 0.3431 & 0.2440 & 0.1963 & 0.2165\\ 0.3344 & 0.2461 & 0.2002 & 0.2192 \end{bmatrix}$
Finally, multiplying the matrix above by $V_1$ gives the self-attention output of the first head:
$\begin{align} \mathrm{softmax}(\frac{Q_1K_1^T}{\sqrt{d_k}})V_1 &= \begin{bmatrix}\nonumber 0.4029 & 0.2423 & 0.1630 & 0.1918\\ 0.3163 & 0.2482 & 0.2098 & 0.2257\\ 0.3431 & 0.2440 & 0.1963 & 0.2165\\ 0.3344 & 0.2461 & 0.2002 & 0.2192 \end{bmatrix} \times \begin{bmatrix} 1.3946 & 1.4536\\ 1.2531 & 0.7434\\ 1.6758 & 1.4064\\ 1.4869 & 1.1220 \end{bmatrix}\\ &= \begin{bmatrix} 1.4239 & 1.2102\\ 1.4393 & 1.1926\\ 1.4353 & 1.1992\\ 1.4363 & 1.1967 \end{bmatrix}\nonumber \end{align}$
Similarly, we obtain
$\mathrm{softmax}(\frac{Q_2K_2^T}{\sqrt{d_k}})V_2= \begin{bmatrix} 1.6599 & 2.2933\\ 1.6423 & 2.2714\\ 1.6340 & 2.2611\\ 1.6382 & 2.2661 \end{bmatrix}$

$\mathrm{softmax}(\frac{Q_3K_3^T}{\sqrt{d_k}})V_3=\begin{bmatrix} 1.3315 & 1.2298\\ 1.3292 & 1.2256\\ 1.3262 & 1.2206\\ 1.3268 & 1.2216 \end{bmatrix}$

Concatenating the three self-attention outputs gives the matrix
$Z=\begin{bmatrix} 1.4239 & 1.2102 & 1.6599 & 2.2933 & 1.3315 & 1.2298\\ 1.4393 & 1.1926 & 1.6423 & 2.2714 & 1.3292 & 1.2256\\ 1.4353 & 1.1992 & 1.6340 & 2.2611 & 1.3262 & 1.2206\\ 1.4363 & 1.1967 & 1.6382 & 2.2661 & 1.3268 & 1.2216 \end{bmatrix}$
Applying a linear transformation to $Z$ gives the final output. Here we generate a random $6\times 6$ matrix $W^O$ for the linear transformation:
$W^O=\begin{bmatrix} 0.81 & 0.26 & 0.06 & 0.24 & 0.09 & 0.81\\ 0.17 & 0.20 & 0.81 & 0.81 & 0.59 & 0.91\\ 0.06 & 0.96 & 0.57 & 0.30 & 0.83 & 0.66\\ 0.99 & 0.11 & 0.58 & 0.47 & 0.65 & 0.24\\ 0.03 & 0.54 & 0.36 & 0.89 & 0.46 & 0.42\\ 0.63 & 0.53 & 0.96 & 0.79 & 0.50 & 0.21 \end{bmatrix}$
The final output of the multi-head attention is then
$\begin{align} \mathrm{output}&=ZW^O\nonumber\\\nonumber &= \begin{bmatrix} 1.4239 & 1.2102 & 1.6599 & 2.2933 & 1.3315 & 1.2298\\ 1.4393 & 1.1926 & 1.6423 & 2.2714 & 1.3292 & 1.2256\\ 1.4353 & 1.1992 & 1.6340 & 2.2611 & 1.3262 & 1.2206\\ 1.4363 & 1.1967 & 1.6382 & 2.2661 & 1.3268 & 1.2216 \end{bmatrix} \times \begin{bmatrix} 0.81 & 0.26 & 0.06 & 0.24 & 0.09 & 0.81\\ 0.17 & 0.20 & 0.81 & 0.81 & 0.59 & 0.91\\ 0.06 & 0.96 & 0.57 & 0.30 & 0.83 & 0.66\\ 0.99 & 0.11 & 0.58 & 0.47 & 0.65 & 0.24\\ 0.03 & 0.54 & 0.36 & 0.89 & 0.46 & 0.42\\ 0.63 & 0.53 & 0.96 & 0.79 & 0.50 & 0.21 \end{bmatrix}\\ &= \begin{bmatrix} 4.5438 & 3.8288 & 5.0019 & 5.0544 & 4.9380 & 4.7180\\ 4.5279 & 3.8066 & 4.9610 & 5.0229 & 4.8970 & 4.6958\\ 4.5117 & 3.7934 & 4.9495 & 5.0133 & 4.8830 & 4.6883\\ 4.5179 & 3.7986 & 4.9539 & 5.0164 & 4.8890 & 4.6912 \end{bmatrix}\nonumber \end{align}$
The code for the example above is given below for readers who want to try it themselves.

import numpy as np

# Randomly generate a 4 x 6 word-vector matrix x_hat
np.random.seed(5)
x_hat = np.random.rand(4, 6)
x_hat = np.round(x_hat, 2)  # round to two decimals
print(f"Input word-vector matrix x_hat:\n{x_hat}")
# Input word-vector matrix x_hat:
# [[0.22 0.87 0.21 0.92 0.49 0.61]
#  [0.77 0.52 0.3  0.19 0.08 0.74]
#  [0.44 0.16 0.88 0.27 0.41 0.3 ]
#  [0.63 0.58 0.6  0.27 0.28 0.25]]

# Randomly generate a 6 x 6 linear transformation matrix W_Q
W_Q = np.random.rand(6, 6)
W_Q = np.round(W_Q, 2)  # round to two decimals
print(f"Linear transformation matrix W_Q:\n{W_Q}")
# Linear transformation matrix W_Q:
# [[0.33 0.14 0.17 0.96 0.96 0.19]
#  [0.02 0.2  0.7  0.78 0.02 0.58]
#  [0.   0.52 0.64 0.99 0.26 0.8 ]
#  [0.87 0.92 0.   0.47 0.98 0.4 ]
#  [0.81 0.55 0.77 0.48 0.03 0.09]
#  [0.11 0.25 0.96 0.63 0.82 0.57]]

# Randomly generate a 6 x 6 linear transformation matrix W_K
W_K = np.random.rand(6, 6)
W_K = np.round(W_K, 2)  # round to two decimals
print(f"Linear transformation matrix W_K:\n{W_K}")
# Linear transformation matrix W_K:
# [[0.64 0.81 0.93 0.91 0.82 0.09]
#  [0.36 0.04 0.55 0.8  0.05 0.19]
#  [0.37 0.24 0.8  0.35 0.64 0.49]
#  [0.58 0.94 0.94 0.11 0.84 0.35]
#  [0.1  0.38 0.51 0.96 0.37 0.01]
#  [0.86 0.11 0.48 0.85 0.51 0.45]]

# Randomly generate a 6 x 6 linear transformation matrix W_V
W_V = np.random.rand(6, 6)
W_V = np.round(W_V, 2)  # round to two decimals
print(f"Linear transformation matrix W_V:\n{W_V}")
# Linear transformation matrix W_V:
# [[0.8  0.02 0.57 0.41 0.99 0.8 ]
#  [0.05 0.19 0.45 0.7  0.33 0.36]
#  [0.92 0.95 0.41 0.9  0.33 0.08]
#  [0.53 0.66 0.89 0.97 0.77 0.76]
#  [0.71 0.7  0.77 0.97 0.37 0.08]
#  [0.24 0.22 0.36 0.81 0.06 0.45]]

Q = x_hat @ W_Q
print(f"Q:\n{Q}")
# Q:
# [[1.3544 1.5824 1.7437 2.1496 1.6997 1.4742]
#  [0.576  0.7716 1.4589 2.0357 1.623  1.1929]
#  [0.7484 1.1001 1.3537 1.9311 1.1773 1.1963]
#  [0.7087 0.9811 1.3527 2.07   1.2504 1.2118]]

K = x_hat @ W_K
print(f"K:\n{K}")
# K:
# [[1.6389 1.3815 2.2586 2.0598 1.6235 0.8894]
#  [1.5456 1.0069 1.8167 1.9484 1.416  0.7154]
#  [1.1204 1.0166 1.8081 1.5147 1.4635 0.7348]
#  [1.2336 1.0652 1.9015 1.7583 1.3875 0.6707]]

V = x_hat @ W_V
print(f"V:\n{V}")
# V:
# [[1.3946 1.4536 2.0187 2.75   1.5005 1.5189]
#  [1.2531 0.7434 1.293  1.811  1.2532 1.311 ]
#  [1.6758 1.4064 1.3476 1.987  1.1564 0.853 ]
#  [1.4869 1.122  1.412  1.9403 1.3396 1.1009]]

Q_1, K_1, V_1 = Q[:, 0:2], K[:, 0:2], V[:, 0:2]
Q_2, K_2, V_2 = Q[:, 2:4], K[:, 2:4], V[:, 2:4]
Q_3, K_3, V_3 = Q[:, 4:6], K[:, 4:6], V[:, 4:6]


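# A common issue when applying softmax is numerical stability: e^x can overflow
# for large x. Subtracting the maximum from every value avoids the overflow.
def softmax(x):
    # Row-wise maximum; keepdims=True keeps the shape (rows, 1), e.g. 4 x 5 --> 4 x 1,
    # so it broadcasts against x instead of collapsing to a 1-D array
    row_max = np.max(x, axis=1, keepdims=True)
    e_x = np.exp(x - row_max)  # subtract each row's maximum from its elements
    row_sum = np.sum(e_x, axis=1, keepdims=True)
    out = e_x / row_sum
    return out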


Q_KT_d_k_1 = Q_1 @ K_1.T / np.sqrt(2)
print(f"Q_KT_d_k_1:\n{Q_KT_d_k_1}")
# Q_KT_d_k_1:
# [[3.11537937 2.60687586 2.2105131  2.37330514]
#  [1.42126469 1.1788811  1.01099226 1.08361422]
#  [1.94195628 1.60118513 1.38371535 1.48142601]
#  [1.77970156 1.47307052 1.2667208  1.35716422]]
soft_Q_KT_1 = softmax(Q_KT_d_k_1)
print(f"Softmax result is \n{soft_Q_KT_1}")
# Softmax result is
# [[0.40288203 0.24229119 0.16300445 0.19182233]
#  [0.31628863 0.24820911 0.20984785 0.22565442]
#  [0.34312552 0.2440383  0.19634148 0.2164947 ]
#  [0.3344468  0.24612678 0.20023608 0.21919034]]

out_self_attention_1 = soft_Q_KT_1 @ V_1
print(f"Self attention output 1 is \n{out_self_attention_1}")
# Self attention output 1 is
# [[1.42385785 1.2102227 ]
#  [1.43931553 1.19259007]
#  [1.43526227 1.19922704]
#  [1.43631071 1.1966661 ]]

out_self_attention_2 = softmax(Q_2 @ K_2.T / np.sqrt(2)) @ V_2
print(f"Self attention output 2 is \n{out_self_attention_2}")
# Self attention output 2 is
# [[1.65989199 2.29334469]
#  [1.6423284  2.27141789]
#  [1.63397616 2.26112136]
#  [1.63815253 2.2660779 ]]
out_self_attention_3 = softmax(Q_3 @ K_3.T / np.sqrt(2)) @ V_3
print(f"Self attention output 3 is \n{out_self_attention_3}")
# Self attention output 3 is
# [[1.33149842 1.22979722]
#  [1.32918253 1.2256465 ]
#  [1.32621018 1.22056725]
#  [1.32678984 1.22156985]]
concat_123 = np.concatenate((out_self_attention_1, out_self_attention_2, out_self_attention_3), axis=1)
print(f"Concat attention output is \n{concat_123}")
# Concat attention output is
# [[1.42385785 1.2102227  1.65989199 2.29334469 1.33149842 1.22979722]
#  [1.43931553 1.19259007 1.6423284  2.27141789 1.32918253 1.2256465 ]
#  [1.43526227 1.19922704 1.63397616 2.26112136 1.32621018 1.22056725]
#  [1.43631071 1.1966661  1.63815253 2.2660779  1.32678984 1.22156985]]

W_O = np.random.rand(6, 6)  # output projection matrix, independent of W_V
W_O = np.round(W_O, 2)  # round to two decimal places
print(f"Linear projection matrix W_O:\n{W_O}")
# Linear projection matrix W_O:
# [[0.81 0.26 0.06 0.24 0.09 0.81]
#  [0.17 0.2  0.81 0.81 0.59 0.91]
#  [0.06 0.96 0.57 0.3  0.83 0.66]
#  [0.99 0.11 0.58 0.47 0.65 0.24]
#  [0.03 0.54 0.36 0.89 0.46 0.42]
#  [0.63 0.53 0.96 0.79 0.5  0.21]]

output = concat_123 @ W_O
print(f"output:\n{output}")
# output:
# [[4.54378468 3.82881348 5.00193498 5.05441927 4.93795088 4.71804571]
#  [4.52786208 3.8065825  4.9610328  5.0229318  4.89696796 4.69582201]
#  [4.51172342 3.7934082  4.94948667 5.01333192 4.88298697 4.68827983]
#  [4.51794388 3.79856753 4.9539017  5.01639962 4.88902644 4.69119859]]

Once the flow above is clear, most publicly available multi-head attention implementations are easy to read. A typical one is shown below:

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiheadAttention(nn.Module):
    def __init__(self, d_model, num_heads=8, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # W_Q, W_K, W_V
        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(p=dropout)

        # W_O
        self.out_linear = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = key.size(0)

        # Project the inputs to Q, K, V, split them by head, and reshape to
        # [batch_size, num_heads, seq_len, d_k] so that every head is computed
        # independently, as in the worked example above.
        q = self.q_linear(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.k_linear(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.v_linear(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product scores: Q K^T / sqrt(d_k)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:  # the mask is used by the decoder
            mask = mask.unsqueeze(1)  # add a head dimension
            scores = scores.masked_fill(mask == 0, float('-inf'))

        scores = F.softmax(scores, dim=-1)  # dim=-1: softmax over each row, i.e. over the key positions
        scores = self.dropout(scores)  # dropout on the attention weights

        # Merge the heads back together; this is the concatenation step
        attention_output = torch.matmul(scores, v).transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)

        # Final linear projection (W_O)
        output = self.out_linear(attention_output)

        return output


# Test code
inputs = torch.tensor([
    [[1, 2, 3, 4, 5, 6],
     [2, 3, 4, 5, 6, 7],
     [3, 4, 5, 6, 7, 8],
     [3, 4, 5, 6, 7, 8],
     [3, 4, 5, 6, 7, 8],
     [4, 5, 6, 7, 8, 9]]
], dtype=torch.float32)

num_heads = 2
d_model = 6

multihead_self_attention = MultiheadAttention(d_model, num_heads)
output = multihead_self_attention(query=inputs, key=inputs, value=inputs)

print(output.shape)  # torch.Size([1, 6, 6])
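
To connect the NumPy walk-through with the `view(...).transpose(1, 2)` calls in the class above, here is a minimal sketch, assuming random stand-in matrices for Q, K and V (the seed and sizes are arbitrary). It checks that slicing columns head by head and reshaping to `[num_heads, seq_len, d_k]` give the same concatenated output.

```python
import numpy as np

def softmax(x):
    # Softmax over the last axis, with the row maximum subtracted for stability.
    e_x = np.exp(x - x.max(axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 6, 3
d_k = d_model // num_heads
Q, K, V = (rng.random((seq_len, d_model)) for _ in range(3))

# Approach 1: slice the columns of Q, K, V head by head, then concatenate.
heads = []
for h in range(num_heads):
    cols = slice(h * d_k, (h + 1) * d_k)
    head_weights = softmax(Q[:, cols] @ K[:, cols].T / np.sqrt(d_k))
    heads.append(head_weights @ V[:, cols])
by_slicing = np.concatenate(heads, axis=1)

# Approach 2: reshape to [num_heads, seq_len, d_k] and let matmul batch over heads.
def split_heads(x):
    return x.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

q, k, v = split_heads(Q), split_heads(K), split_heads(V)
weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_k))
by_reshape = (weights @ v).transpose(1, 0, 2).reshape(seq_len, d_model)

same = np.allclose(by_slicing, by_reshape)
print(same)
```

The reshape version avoids the Python loop, which is presumably why most implementations prefer it; the arithmetic is unchanged.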

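2.4 Residual Connection and Layer Normalization

Residual connections and layer normalization are widely used in deep neural networks, and in architectures such as the Transformer in particular. They make deep networks easier to train and optimize by mitigating problems such as vanishing and exploding gradients.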

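Residual Connection:
- A residual connection adds a layer's input directly to its output, forming a residual block. The input signal bypasses part of the network and is added to the output of that part, which makes it easier for the network to learn an identity mapping and therefore easier to train.
- Mathematically, if the input of the network is $x$ and a function made up of several layers is $F(x)$, the output of the residual connection is $x + F(x)$. This output is passed on to the subsequent layers, forming the forward pass.
- In the Transformer, the residual connection can be written as $Z=\hat{X}+\hat{X'}$, where $\hat{X}$ is the input of multi-head attention and $\hat{X'}$ is its output.
Layer Normalization:
- Layer normalization is a normalization technique that standardizes the output of a layer. For each sample it computes the mean and variance over the features and normalizes the features with them, which helps speed up training and can improve generalization.
- Mathematically, given the input $x$ of a layer, layer normalization computes a new output $y$: $y=\gamma\cdot\frac{x-\mu}{\sqrt{\sigma^2+\epsilon}}+\beta$, where $\mu$ and $\sigma$ are the mean and standard deviation of the input $x$, $\gamma$ and $\beta$ are a learnable scale and shift, and $\epsilon$ is a small constant for numerical stability.
- Layer normalization normalizes over the feature dimension of each individual sample rather than over the samples of a batch, which makes it better suited to sequence data and small batches. For example, applying it to the output $Z$ of the residual connection above means normalizing each row of $Z$, as shown in the figure below.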

[Figure: layer normalization applied to each row of $Z$]
The formula translates into the following code:

```
class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()

        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x):
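        mean = x.mean(-1, keepdim=True)  # mean over the last dimension, i.e. over each row
        # Variance over the last dimension. Note that Tensor.var defaults to the unbiased
        # estimator (divides by n - 1), whereas torch.nn.LayerNorm uses the biased one (divides by n).
        var = x.var(-1, keepdim=True)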

        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta


# Test code
input_tensor = torch.tensor([[1.0, 2.0, 3.0],
                             [4.0, 5.0, 6.0]])
layer_norm = LayerNorm(d_model=3)
normed_result = layer_norm(input_tensor)
print(normed_result)
# tensor([[-1.0000,  0.0000,  1.0000],
#         [-1.0000,  0.0000,  1.0000]], grad_fn=<AddBackward0>)
```
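- We define a PyTorch module named LayerNorm, which represents the layer normalization layer of the Transformer.
- In the __init__ method we define two learnable parameters, gamma and beta, which scale and shift the normalized output. Their shape equals the number of input features, that is, the size of the last dimension.
- In the forward method we first compute the mean and variance of the input tensor along the last dimension. We then normalize the input with the formula $\mathrm{LayerNorm}(x)=\gamma\cdot\frac{x-\mu}{\sqrt{\sigma^2+\epsilon}}+\beta$, where $\mu$ and $\sigma^2$ are the mean and variance of the input $x$, $\gamma$ and $\beta$ are the learnable scale and shift, and $\epsilon$ is a small constant for numerical stability.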
A simple example illustrates the computation in detail. Suppose we have a small input tensor of shape (2, 3):
```
input_tensor = [[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]]
```
We apply layer normalization to this input tensor.
```
import torch

# Input tensor
input_tensor = torch.tensor([[1.0, 2.0, 3.0],
                             [4.0, 5.0, 6.0]])

# Mean and standard deviation over the last dimension
mean = input_tensor.mean(dim=-1, keepdim=True)
std = input_tensor.std(dim=-1, keepdim=True)

# gamma and beta parameters
gamma = torch.tensor([1.0, 1.0, 1.0])
beta = torch.tensor([0.0, 0.0, 0.0])

# Layer normalization
normalized_tensor = gamma * (input_tensor - mean) / (std + 1e-6) + beta

print(f"Input tensor: \n{input_tensor}")
# Input tensor: 
# tensor([[1., 2., 3.],
#         [4., 5., 6.]])

print(f"Mean: \n{mean}")
# Mean: 
# tensor([[2.],
#         [5.]])

print(f"Standard deviation: \n{std}")
# Standard deviation: 
# tensor([[1.],
#         [1.]])

print(f"Normalized tensor: \n{normalized_tensor}")
# Normalized tensor: 
# tensor([[-1.0000,  0.0000,  1.0000],
#         [-1.0000,  0.0000,  1.0000]])
```
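
One detail worth checking is which variance estimator is used. The examples above rely on the PyTorch defaults of `var` and `std`, which divide by n - 1, while `torch.nn.LayerNorm` is documented to use the biased estimator that divides by n. The following NumPy sketch compares the two on the same input; `ddof` is NumPy's parameter for choosing the estimator, and gamma = 1, beta = 0 are assumed for simplicity.

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
eps = 1e-6
mean = x.mean(axis=-1, keepdims=True)

# Unbiased estimator (divide by n - 1), matching the default of torch.var / torch.std.
var_unbiased = x.var(axis=-1, keepdims=True, ddof=1)
# Biased estimator (divide by n), the one torch.nn.LayerNorm is documented to use.
var_biased = x.var(axis=-1, keepdims=True, ddof=0)

norm_unbiased = (x - mean) / np.sqrt(var_unbiased + eps)
norm_biased = (x - mean) / np.sqrt(var_biased + eps)

print(np.round(norm_unbiased, 4))  # each row is approximately [-1, 0, 1]
print(np.round(norm_biased, 4))    # each row is approximately [-1.2247, 0, 1.2247]
```

The difference shrinks as the feature dimension grows, so for d_model = 512 the two variants are nearly identical, but the results will not match a built-in layer norm exactly.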

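2.5 Feedforward Neural Network Layer

In the Transformer, the feed-forward network layer is a key component of every Transformer block. It applies a non-linear transformation to the hidden representation at each position, helping the model learn feature representations suited to the task. Its structure is shown in the figure below.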
[Figure: structure of the feed-forward network layer]

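Specifically, the feed-forward layer in the Transformer consists of two fully connected layers with an activation function such as ReLU between them. It typically has the following characteristics:

Fully connected layers: the layer is made of two fully connected layers, each with a learnable weight matrix.
Activation function: an activation function such as ReLU is applied between the two layers. It introduces non-linearity so that the model can learn more complex functions.
Dimension change: the first layer expands the hidden representation and the second projects it back. For example, a 512-dimensional input is mapped to 2048 dimensions inside the layer and then back to 512 dimensions, so the output has the same dimension as the input.
Regularization: a regularization technique such as Dropout is usually applied inside the layer to prevent overfitting.

Below is example code for a feed-forward network layer, together with its usage.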

class FeedForward(nn.Module):
    def __init__(self, d_model, hidden_size=2048, dropout=0.1):
        super().__init__()

        self.linear1 = nn.Linear(d_model, hidden_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=dropout)
        self.linear2 = nn.Linear(hidden_size, d_model)

    def forward(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.linear2(x)

        return x


# Test code
d_model = 512
x = torch.randn(64, 10, 512)
ff_layer = FeedForward(d_model)
output = ff_layer(x)
print(output.shape)  # torch.Size([64, 10, 512])

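Code explanation

The FeedForward class:

We define a PyTorch module named FeedForward that represents a feed-forward network layer.
In the __init__ method we define two linear layers, linear1 and linear2, which map the input to the hidden layer and the hidden layer to the output, respectively. We also define a Dropout layer, dropout, which randomly drops some activations to prevent overfitting.
In the forward method we first apply the first linear layer followed by the ReLU activation. We then apply Dropout to the activated result and finally obtain the output through the second linear layer.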
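
Because the layer transforms "the hidden representation at each position", feeding it a whole sequence should give the same result as feeding it one position at a time. Here is a small NumPy sketch of that property, assuming random stand-in weights `W1`, `b1`, `W2`, `b2` (hypothetical names, not the trained parameters) and leaving Dropout out so the result is deterministic.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, hidden_size = 4, 6, 24

# Hypothetical random weights standing in for linear1 and linear2.
W1, b1 = rng.standard_normal((d_model, hidden_size)), np.zeros(hidden_size)
W2, b2 = rng.standard_normal((hidden_size, d_model)), np.zeros(d_model)

def feed_forward(x):
    hidden = np.maximum(0.0, x @ W1 + b1)  # first linear layer followed by ReLU
    return hidden @ W2 + b2                # second linear layer projects back to d_model

x = rng.standard_normal((seq_len, d_model))
whole_sequence = feed_forward(x)
row_by_row = np.stack([feed_forward(row) for row in x])

position_wise = np.allclose(whole_sequence, row_by_row)
print(whole_sequence.shape)
print(position_wise)
```

Since no information moves between positions here, mixing across positions is left entirely to the attention layers.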

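The feed-forward layer is again followed by a residual connection and layer normalization, as shown in the figure below. Given the earlier material, this part is straightforward: layer normalization is applied to each row of the output $O$.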

[Figure: residual connection and layer normalization after the feed-forward layer]

2.6 A Complete EncodeLayer

Based on the components introduced so far, we can now write a complete EncodeLayer. Its structure is the gray area in the figure below.

[Figure: Transformer architecture with the EncodeLayer highlighted in gray]

The detailed code is as follows:

class EncodeLayer(nn.Module):
    def __init__(self, d_model, num_heads=8, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads

        # One LayerNorm per sub-layer, so that each has its own gamma and beta
        self.attn_norm = LayerNorm(d_model)
        self.ff_norm = LayerNorm(d_model)
        self.multi_attention = MultiheadAttention(d_model, num_heads)
        self.ff_layer = FeedForward(d_model)

        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x):
        _x = x  # keep the input for the residual connection around attention
        x = self.attn_norm(x)  # layer-normalize before attention
        x = _x + self.dropout(self.multi_attention(query=x, key=x, value=x))  # residual connection
        _x = x  # keep the attention sub-layer output for the residual connection around the feed-forward layer
        x = self.ff_norm(x)  # layer-normalize before the feed-forward layer
        x = _x + self.dropout(self.ff_layer(x))  # residual connection

        return x
        
# Test code
# d_model = 512
# x = torch.randn(64, 10, 512)
# encode_layer = EncodeLayer(d_model)
# output = encode_layer(x)
# print(output.shape)  # torch.Size([64, 10, 512])
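
Section 2.4 describes the residual connection followed by layer normalization, i.e. LayerNorm(x + Sublayer(x)), whereas the layer code in this article normalizes the input before a sub-layer. The sketch below contrasts the two orderings in NumPy. It is only an illustration: `sublayer` is a hypothetical stand-in (a random linear map) for attention or the feed-forward layer, and gamma and beta are omitted.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm(x, sublayer):
    # Normalize after the residual addition: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

def pre_norm(x, sublayer):
    # Normalize only the sub-layer input; the skip path carries x unchanged.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 6))
sublayer = lambda h: h @ W  # hypothetical stand-in for attention or feed-forward
x = rng.standard_normal((4, 6))

out_post = post_norm(x, sublayer)
out_pre = pre_norm(x, sublayer)
print(out_post.shape, out_pre.shape)
print(np.allclose(out_post.mean(axis=-1), 0.0))  # post-norm rows are centered
```

With the pre-norm ordering the output of the last layer is not normalized, which is presumably why the full encoder applies one more LayerNorm after the stack.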

2.7 A Complete Encoder (6 EncodeLayer Layers)

We can now write the complete Encode module. Its structure is shown in the figure below.

[Figure: structure of the complete encoder]

The detailed code is as follows:

class Encode(nn.Module):
    def __init__(self, d_model, vocab_size=2000,  num_encode_layer=6, num_heads=8, dropout=0.1):
        super().__init__()

        self.vocab_size = vocab_size
        self.d_model = d_model
        self.num_encode_layer = num_encode_layer
        self.num_heads = num_heads
        self.dropout = dropout

        self.embed = nn.Embedding(vocab_size, d_model)  # embedding table with vocab_size rows (the vocabulary size)

        self.position_encode = PositionalEncoder(d_model)
        self.encode_layer = EncodeLayer(d_model, num_heads, dropout)

        # Stack of num_encode_layer EncodeLayer modules (six by default)
        self.encode_layers = nn.ModuleList([copy.deepcopy(self.encode_layer) for i in range(num_encode_layer)])

        self.layer_norm = LayerNorm(d_model)

    def forward(self, src):
        x = self.embed(src)
        x = self.position_encode(x)

        # Pass the input through the EncodeLayer modules in order
        for i in range(self.num_encode_layer):
            x = self.encode_layers[i](x)

        return self.layer_norm(x)


# Test code
d_model = 512
x = torch.LongTensor([[1, 2, 4]])  # input of shape [batch_size, seq_length]; here batch_size = 1, seq_len = 3
encode = Encode(d_model=d_model)
output = encode(x)
print(output.shape)  # torch.Size([1, 3, 512])

3. Decoder

The structure of the decoder is shown below.

[Figure: structure of the decoder]

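3.1 Target Word Embedding Layer

In the Transformer, the target word embedding layer maps the target words (the words of the output sequence) into a low-dimensional embedding space. This space is a vector space of fixed size in which every word is represented by a dense vector.

The target word embedding layer serves two main purposes:

Mapping words into a vector space: the layer maps each word of the output sequence to its embedding vector. The vector encodes semantic information about the word, so the subsequent layers can work with continuous representations instead of discrete token ids.
Learning trainable embeddings: during training the embedding vectors are trainable parameters. The model adjusts them through backpropagation so that they fit the task better.

In the Transformer, the target word embedding layer is usually a separate trainable embedding matrix of shape (vocabulary size, embedding dimension). The model looks up the embedding vector of each target word in this matrix and passes the vectors on to the subsequent decoder layers.

A simple example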

import torch
import torch.nn as nn

class TargetWordEmbedding(nn.Module):
    def __init__(self, vocab_size, embed_size):
        super(TargetWordEmbedding, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)

    def forward(self, target_tokens):
        # target_tokens has shape (batch size, target sequence length)
        embedded_target = self.embedding(target_tokens)
        # embedded_target has shape (batch size, target sequence length, embedding dimension)
        return embedded_target

# Example usage
vocab_size = 10000  # vocabulary size
embed_size = 512  # embedding dimension

# Create an instance of the target word embedding layer
target_embedding = TargetWordEmbedding(vocab_size, embed_size)

# Build an example target sequence tensor
target_tokens = torch.tensor([[1, 2, 3], [4, 5, 6]])  # (batch size, target sequence length)

# Forward pass
embedded_target = target_embedding(target_tokens)
print(embedded_target.shape)  # shape of the output tensor: torch.Size([2, 3, 512])

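The TargetWordEmbedding class:
- We define a PyTorch module named TargetWordEmbedding that represents the target word embedding layer. In the __init__ method we use nn.Embedding to define an embedding layer whose input size is the vocabulary size (vocab_size) and whose output size is the embedding dimension (embed_size).
- In the forward method we pass the sequence of target word indices, target_tokens, to the embedding layer and obtain the embedding vector of each target word.
Example usage:
- We define the vocabulary size vocab_size and the embedding dimension embed_size, and use them to create a TargetWordEmbedding instance.
- We build an example target sequence tensor, target_tokens, of shape (2, 3): a batch of 2 target sequences, each of length 3.
Forward pass:
- We call the target_embedding instance with the target sequence tensor target_tokens and obtain the embedded target sequence tensor embedded_target.
- We print the shape of the embedded tensor to confirm that the code behaves as expected.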
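
The phrase "looks up the embedding matrix" can be made concrete: selecting rows by token id gives the same result as multiplying a one-hot encoding by the matrix. This is a minimal NumPy sketch with a small random matrix standing in for the learned weights; the sizes are chosen only to keep the output readable.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_size = 10, 4
embedding_matrix = rng.standard_normal((vocab_size, embed_size))  # stand-in for the learned weights

target_tokens = np.array([[1, 2, 3], [4, 5, 6]])  # (batch size, target sequence length)

# Lookup by indexing: one row of the matrix per token id.
by_lookup = embedding_matrix[target_tokens]

# The same result as a one-hot encoding multiplied by the matrix.
one_hot = np.eye(vocab_size)[target_tokens]       # (batch size, target sequence length, vocab_size)
by_matmul = one_hot @ embedding_matrix

print(by_lookup.shape)
print(np.allclose(by_lookup, by_matmul))
```

Indexing avoids materializing the one-hot tensor, which matters once the vocabulary has thousands of entries.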

3.2 Positional Encoding

In the Transformer, the decoder's positional encoding works like the encoder's: it gives the decoder position information so that the model can take the order of the tokens in the target sequence into account.
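
Since the encoding depends only on the position and the model dimension, the same function can serve the encoder and the decoder. Below is a vectorized NumPy sketch of the sine/cosine scheme used by the `PositionalEncoder` class in the complete code. The function name `positional_encoding` is made up for this illustration, and `d_model` is assumed to be even.

```python
import numpy as np

def positional_encoding(max_seq_len, d_model):
    # Assumes d_model is even, so sine and cosine columns pair up exactly.
    pos = np.arange(max_seq_len)[:, None]            # (max_seq_len, 1)
    i = np.arange(d_model // 2)[None, :]             # (1, d_model / 2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)  # even columns
    pe[:, 1::2] = np.cos(angle)  # odd columns
    return pe

pe = positional_encoding(max_seq_len=50, d_model=8)
print(pe.shape)
print(pe[0])  # position 0: every sine is 0 and every cosine is 1
```

The table is added to the embeddings of the first `seq_len` positions, for source and target sequences alike.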

3.3 Masked Multi-Head Self-Attention Layer

This layer is marked by the red dashed box in the figure below.

[Figure: masked multi-head self-attention layer, marked by the red dashed box]

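In the Transformer, the masked multi-head self-attention layer is a key component of the decoder. It models the relationships between the different positions of the target sequence and helps the decoder generate that sequence.

The layer is similar to the encoder's multi-head self-attention layer, but the attention computation applies a mask to ensure that, when generating the target sequence, the decoder depends only on the part of the sequence generated so far.

Specifically, the decoder's multi-head self-attention layer involves the following steps:

Projection: first, linear transformations project the target sequence into queries, keys, and values. They are implemented as matrix multiplications.
Splitting into heads: the projected sequence is split into multiple heads so that attention can be computed in parallel.
Computing attention scores: attention scores are computed for each head. In contrast to the encoder's multi-head self-attention layer, the decoder's scores are masked so that each position can attend only to the already generated part of the sequence. Put simply, we first build the mask below, where the black cells are 0 and all other cells are 1.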

[Figure: attention mask, with black cells equal to 0 and all other cells equal to 1]

Recall the mask code from the multi-head attention section:

        if mask is not None:  # the mask is used by the decoder
            mask = mask.unsqueeze(1)  # add a head dimension
            scores = scores.masked_fill(mask == 0, float('-inf'))

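This code sets the attention scores to negative infinity wherever the mask is 0; after softmax, the weights at those positions are exactly 0.

Weighted sum: the values are summed, weighted by the attention weights, to give the attention output of each head.
Concatenating the heads: the outputs of all heads are concatenated and passed through a linear transformation ($W_O$) to give the final multi-head self-attention representation.

The decoder's multi-head self-attention layer helps the decoder integrate and relate information from different positions while generating the target sequence, which improves generation quality.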
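
The effect of the mask is easiest to see on a tiny example. The sketch below builds a 4 x 4 lower-triangular mask and applies it to scores that are all equal (an assumption made purely so the result is easy to read). It uses NumPy's `np.where` in place of PyTorch's `masked_fill`.

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - x.max(axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

size = 4
# Lower-triangular mask: 1 where attention is allowed, 0 for future positions.
mask = np.tril(np.ones((size, size), dtype=int))

scores = np.zeros((size, size))                # hypothetical raw scores, all equal
masked = np.where(mask == 0, -np.inf, scores)  # same effect as masked_fill(mask == 0, -inf)
weights = softmax(masked)

print(mask)
print(weights)
# Row i spreads its weight evenly over positions 0..i and gives 0 to later positions:
# [[1.    0.    0.    0.  ]
#  [0.5   0.5   0.    0.  ]
#  [0.333 0.333 0.333 0.  ]
#  [0.25  0.25  0.25  0.25]]   (values rounded)
```

Every row keeps at least its diagonal entry, so softmax never sees a row that is entirely negative infinity.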

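3.4 Encoder-Decoder Attention Layer

The inputs of the encoder-decoder attention layer are the decoder's representation at the current position and the encoder's representations at all positions, as marked by the red dashed box in the figure below.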

[Figure: encoder-decoder attention layer, marked by the red dashed box]

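The computation is similar to the self-attention layer, with one difference: **when computing the attention scores, the query comes from the decoder's representation at the current position, while the keys and values come from the encoder's representations at all positions.** In this way, each decoder position can attend over all of the encoder's information to better generate the output.

The main steps of the encoder-decoder attention layer are:

Computing the query, keys, and values: a linear transformation of the decoder's representation at the current position gives the query, and linear transformations of the encoder's representations at all positions give the keys and values.
Computing attention scores: take the dot product of the decoder query with the encoder keys at all positions, then scale each score by $1/\sqrt{d_k}$ so that large dot products do not push softmax into regions with very small gradients.
Computing attention weights: apply softmax to the attention scores to obtain the attention weights, which express how strongly the current decoder position attends to each encoder position.
Weighted sum: use the attention weights to form a weighted sum of the encoder values at all positions, giving the context representation of the current decoder position.
Computing the output: the context representations of all heads are concatenated and passed through a linear transformation; the result is then added to the decoder's representation at the current position through the residual connection.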
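
The steps above can be sketched for a single head in NumPy. All inputs and projection matrices here are random stand-ins (hypothetical values, not trained weights), and the source and target lengths are deliberately different to show that the weight matrix has one row per decoder position and one column per encoder position.

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - x.max(axis=-1, keepdims=True))
    return e_x / e_x.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
src_len, trg_len, d_model = 5, 3, 6

encoder_output = rng.standard_normal((src_len, d_model))  # stand-in for the encoder output
decoder_state = rng.standard_normal((trg_len, d_model))   # stand-in for the decoder representations

# Hypothetical random projection matrices in place of the learned W_Q, W_K, W_V.
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_model)) for _ in range(3))

Q = decoder_state @ W_Q    # queries come from the decoder
K = encoder_output @ W_K   # keys come from the encoder
V = encoder_output @ W_V   # values come from the encoder

weights = softmax(Q @ K.T / np.sqrt(d_model))  # (trg_len, src_len); one head, so d_k = d_model
context = weights @ V                          # (trg_len, d_model)

print(weights.shape, context.shape)  # (3, 5) (3, 6)
```

No causal mask is needed in this layer, because the whole source sequence is available at every decoding step.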

3.5 A Complete DecodeLayer

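class DecodeLayer(nn.Module):
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        # One LayerNorm per sub-layer
        self.self_attn_norm = LayerNorm(d_model)
        self.cross_attn_norm = LayerNorm(d_model)
        self.ff_norm = LayerNorm(d_model)

        self.dropout = nn.Dropout(p=dropout)

        # Masked self-attention and encoder-decoder attention use separate weights
        self.self_attention = MultiheadAttention(d_model)
        self.cross_attention = MultiheadAttention(d_model)

        self.ff_layer = FeedForward(d_model)

    def forward(self, x, encode_output, trg_mask):
        _x = x                  # saved for the residual connection around masked self-attention
        x = self.self_attn_norm(x)
        x = _x + self.dropout(self.self_attention(x, x, x, trg_mask))   # residual connection
        _x = x                  # saved for the residual connection around encoder-decoder attention
        x = self.cross_attn_norm(x)
        x = _x + self.dropout(self.cross_attention(x, encode_output, encode_output))  # residual connection
        _x = x                  # saved for the residual connection around the feed-forward layer
        x = self.ff_norm(x)
        x = _x + self.dropout(self.ff_layer(x))

        return x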


# Test code
# Build the mask
def create_mask(size):
    # size: the sequence length of input
    np_mask = np.triu(np.ones((1, size, size)), k=1).astype('uint8')
    trg_mask = torch.from_numpy(np_mask == 0)
    return trg_mask


trg_mask = create_mask(size=5)  # size = sequence length

d_model = 512
input_encode = torch.LongTensor([[2, 7, 3, 4, 8]])   # the encoder input is a tensor of integer token indices
encode = Encode(d_model)
encode_output = encode(input_encode)

input_decode_layer = torch.randn(1, 5, 512)
decode_layer = DecodeLayer(d_model)
output = decode_layer(input_decode_layer, encode_output=encode_output, trg_mask=trg_mask)
print(output.shape)  # torch.Size([1, 5, 512])

3.6 A Complete Decoder (6 DecodeLayer Layers)

class Decode(nn.Module):
    def __init__(self, d_model, vocab_size=2000, num_decode_layer=6, num_heads=8, dropout=0.1):
        super().__init__()
        self.num_decode_layer = num_decode_layer
        self.embed = nn.Embedding(vocab_size, d_model)

        self.position_encode = PositionalEncoder(d_model)

        self.decode_layer = DecodeLayer(d_model, dropout=dropout)

        self.decode_layers = nn.ModuleList([copy.deepcopy(self.decode_layer) for i in range(num_decode_layer)])

        self.layer_norm = LayerNorm(d_model)

    def forward(self, trg, encode_output, trg_mask):
        x = self.embed(trg)
        x = self.position_encode(x)
        for i in range(self.num_decode_layer):
            x = self.decode_layers[i](x, encode_output, trg_mask)

        return self.layer_norm(x)

# Test code
# Build the mask
def create_mask(size):
    # size: the sequence length of input
    np_mask = np.triu(np.ones((1, size, size)), k=1).astype('uint8')
    trg_mask = torch.from_numpy(np_mask == 0)
    return trg_mask

trg_mask = create_mask(size=50)  # size = sequence length

d_model = 512
input_encode = torch.randint(1, 5, (64, 50))   # input of shape [batch_size, seq_length]
input_decode = torch.randint(1, 5, (64, 50))

encode = Encode(d_model=d_model)
encode_output = encode(input_encode)

decode = Decode(d_model=d_model)
output = decode(input_decode, encode_output=encode_output, trg_mask=trg_mask)

print(output.shape)  # torch.Size([64, 50, 512])

4. Complete Transformer Code

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
import copy
import numpy as np

class PositionalEncoder(nn.Module):
    def __init__(self, d_model, max_seq_len=5000, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.max_seq_len = max_seq_len
        self.dropout = nn.Dropout(p=dropout)

        # Initialize the positional encoding matrix
        pe = torch.zeros(max_seq_len, d_model)

        # Compute the positional encoding
        for pos in range(max_seq_len):
            for i in range(d_model // 2):
                pe[pos, 2 * i] = math.sin(pos / 10000 ** ((2 * i) / d_model))
                pe[pos, 2 * i + 1] = math.cos(pos / 10000 ** ((2 * i) / d_model))

        # Add a batch_size dimension
        pe = pe.unsqueeze(0)

        # Register pe as a buffer: it is saved with the model but not updated during training
        self.register_buffer('pe', pe)

    def forward(self, x):
        seq_len = x.size(1)

        return self.dropout(x + self.pe[:, :seq_len, :])
        
class MultiheadAttention(nn.Module):
    def __init__(self, d_model, num_heads=8, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # W_Q, W_K, W_V
        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)

        self.dropout = nn.Dropout(p=dropout)

        # W_O
        self.out_linear = nn.Linear(d_model, d_model)

    def forward(self, query, key, value, mask=None):
        batch_size = key.size(0)

        # Project the inputs to Q, K, V, split them by head, and reshape to
        # [batch_size, num_heads, seq_len, d_k] so that every head is computed
        # independently, as in the worked example in section 2.3.
        q = self.q_linear(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        k = self.k_linear(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        v = self.v_linear(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Scaled dot-product scores: Q K^T / sqrt(d_k)
        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)

        if mask is not None:  # the mask is used by the decoder
            mask = mask.unsqueeze(1)  # add a head dimension
            scores = scores.masked_fill(mask == 0, float('-inf'))

        scores = F.softmax(scores, dim=-1)  # dim=-1: softmax over each row, i.e. over the key positions
        scores = self.dropout(scores)  # dropout on the attention weights

        # Merge the heads back together; this is the concatenation step
        attention_output = torch.matmul(scores, v).transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)

        # Final linear projection (W_O)
        output = self.out_linear(attention_output)

        return output

class LayerNorm(nn.Module):
    def __init__(self, d_model, eps=1e-6):
        super().__init__()

        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        self.eps = eps

    def forward(self, x):
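        mean = x.mean(-1, keepdim=True)  # mean over the last dimension, i.e. over each row
        # Variance over the last dimension. Note that Tensor.var defaults to the unbiased
        # estimator (divides by n - 1), whereas torch.nn.LayerNorm uses the biased one (divides by n).
        var = x.var(-1, keepdim=True)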

        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta
        
class FeedForward(nn.Module):
    def __init__(self, d_model, hidden_size=2048, dropout=0.1):
        super().__init__()

        self.linear1 = nn.Linear(d_model, hidden_size)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(p=dropout)
        self.linear2 = nn.Linear(hidden_size, d_model)

    def forward(self, x):
        x = self.linear1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.linear2(x)

        return x


class EncodeLayer(nn.Module):
    def __init__(self, d_model, num_heads=8, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.num_heads = num_heads

        # One LayerNorm per sub-layer, so that each has its own gamma and beta
        self.attn_norm = LayerNorm(d_model)
        self.ff_norm = LayerNorm(d_model)
        self.multi_attention = MultiheadAttention(d_model, num_heads)
        self.ff_layer = FeedForward(d_model)

        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x):
        _x = x  # keep the input for the residual connection around attention
        x = self.attn_norm(x)  # layer-normalize before attention
        x = _x + self.dropout(self.multi_attention(query=x, key=x, value=x))  # residual connection
        _x = x  # keep the attention sub-layer output for the residual connection around the feed-forward layer
        x = self.ff_norm(x)  # layer-normalize before the feed-forward layer
        x = _x + self.dropout(self.ff_layer(x))  # residual connection

        return x


class Encode(nn.Module):
    def __init__(self, d_model, vocab_size=2000,  num_encode_layer=6, num_heads=8, dropout=0.1):
        super().__init__()

        self.vocab_size = vocab_size
        self.d_model = d_model
        self.num_encode_layer = num_encode_layer
        self.num_heads = num_heads
        self.dropout = dropout

        self.embed = nn.Embedding(vocab_size, d_model)  # embedding table with vocab_size rows (the vocabulary size)

        self.position_encode = PositionalEncoder(d_model)
        self.encode_layer = EncodeLayer(d_model, num_heads, dropout)

        # Stack of num_encode_layer EncodeLayer modules (six by default)
        self.encode_layers = nn.ModuleList([copy.deepcopy(self.encode_layer) for i in range(num_encode_layer)])

        self.layer_norm = LayerNorm(d_model)

    def forward(self, src):
        x = self.embed(src)
        x = self.position_encode(x)

        # Pass the input through the EncodeLayer modules in order
        for i in range(self.num_encode_layer):
            x = self.encode_layers[i](x)

        return self.layer_norm(x)


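class DecodeLayer(nn.Module):
    def __init__(self, d_model, dropout=0.1):
        super().__init__()
        # One LayerNorm per sub-layer
        self.self_attn_norm = LayerNorm(d_model)
        self.cross_attn_norm = LayerNorm(d_model)
        self.ff_norm = LayerNorm(d_model)

        self.dropout = nn.Dropout(p=dropout)

        # Masked self-attention and encoder-decoder attention use separate weights
        self.self_attention = MultiheadAttention(d_model)
        self.cross_attention = MultiheadAttention(d_model)

        self.ff_layer = FeedForward(d_model)

    def forward(self, x, encode_output, trg_mask):
        _x = x                  # saved for the residual connection around masked self-attention
        x = self.self_attn_norm(x)
        x = _x + self.dropout(self.self_attention(x, x, x, trg_mask))   # residual connection
        _x = x                  # saved for the residual connection around encoder-decoder attention
        x = self.cross_attn_norm(x)
        x = _x + self.dropout(self.cross_attention(x, encode_output, encode_output))  # residual connection
        _x = x                  # saved for the residual connection around the feed-forward layer
        x = self.ff_norm(x)
        x = _x + self.dropout(self.ff_layer(x))

        return x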
        
class Decode(nn.Module):
    def __init__(self, d_model, vocab_size=2000, num_decode_layer=6, num_heads=8, dropout=0.1):
        super().__init__()
        self.num_decode_layer = num_decode_layer
        self.embed = nn.Embedding(vocab_size, d_model)

        self.position_encode = PositionalEncoder(d_model)

        self.decode_layer = DecodeLayer(d_model, dropout=dropout)

        self.decode_layers = nn.ModuleList([copy.deepcopy(self.decode_layer) for i in range(num_decode_layer)])

        self.layer_norm = LayerNorm(d_model)

    def forward(self, trg, encode_output, trg_mask):
        x = self.embed(trg)
        x = self.position_encode(x)
        for i in range(self.num_decode_layer):
            x = self.decode_layers[i](x, encode_output, trg_mask)

        return self.layer_norm(x)
        
def create_mask(size):
    # size: the sequence length of input
    np_mask = np.triu(np.ones((1, size, size)), k=1).astype('uint8')
    trg_mask = torch.from_numpy(np_mask == 0)
    return trg_mask        

if __name__ == "__main__":
    trg_mask = create_mask(size=50)  # size = sequence length

    d_model = 512
    input_encode = torch.randint(1, 5, (64, 50))  # input of shape [batch_size, seq_length]
    input_decode = torch.randint(1, 5, (64, 50))

    encode = Encode(d_model=d_model)
    encode_output = encode(input_encode)

    decode = Decode(d_model=d_model)
    output = decode(input_decode, encode_output=encode_output, trg_mask=trg_mask)

    print(output.shape)  # torch.Size([64, 50, 512])