程序员学长 | Learn an Algorithm Quickly: Transformer (Part 2)

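This article comes from the WeChat official account “程序员学长”. It is shared for academic purposes only and will be removed on request in case of infringement. It is packed with practical content.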

Original article: “快速学习一个算法，Transformer（二）” (Learn an Algorithm Quickly: Transformer, Part 2)

Today we continue with the second part of the Transformer model: the decoder.

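It is best to read the first part before this one: “程序员学长 | 快速学会一个算法，Transformer（上）” (Learn an Algorithm Quickly: Transformer, Part 1) on the CSDN blog.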

Decoder

The previous article covered most of the concepts in the encoder and how it works. Now let's look at how the encoder and the decoder work together.

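The encoder usually consists of several stacked layers. The input to the first encoder layer is a text sequence, and the output of the last encoder layer is a sequence of vectors. These vectors serve as the K and V inputs of the decoder, where K = V = the sequence of vectors output by the encoder. They are fed into the Encoder-Decoder Attention layer of every decoder layer, which helps the decoder focus on the appropriate positions of the input sequence, as shown in the figure below.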

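At each time step of the decoding stage, the decoder outputs one translated word (in this example, an English translation). The output of the current time step is then fed back in as decoder input at the next time step. This process repeats until an end-of-sequence symbol is produced, as shown in the figure below.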
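The loop described above can be sketched in a few lines of plain Python. This is only an illustration: `NEXT_TOKEN`, `next_token`, and the `<sos>`/`<eos>` markers are hypothetical names, and the lookup table stands in for a trained decoder, which would also attend to the encoder output at every step.

```python
# Toy stand-in for a trained decoder: maps the tokens generated so far to the
# next token. All names here are hypothetical and for illustration only.
NEXT_TOKEN = {
    ("<sos>",): "I",
    ("<sos>", "I"): "am",
    ("<sos>", "I", "am"): "a",
    ("<sos>", "I", "am", "a"): "student",
    ("<sos>", "I", "am", "a", "student"): "<eos>",
}


def next_token(generated):
    """Return the next token given everything generated so far."""
    return NEXT_TOKEN.get(tuple(generated), "<eos>")


def greedy_decode(max_steps=10):
    """Feed each step's output back in as input until <eos> is produced."""
    generated = ["<sos>"]
    for _ in range(max_steps):
        token = next_token(generated)
        if token == "<eos>":
            break
        generated.append(token)
    return generated[1:]  # drop the start marker


print(greedy_decode())  # ['I', 'am', 'a', 'student']
```

The `max_steps` cap is a common safeguard so that decoding still terminates if the end symbol is never produced.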

Next, let's look at the core components of the decoder.

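As in the Encoder, the Decoder's input is first added to the Positional Encoding before it enters the layers. The difference is that each Decoder layer has three sub-layers: Masked Multi-head Attention, Multi-head Attention, and Feed Forward. In addition, the query q of the middle Multi-head Attention sub-layer comes from the output of the preceding sub-layer in the Decoder itself, while k and v come from the Encoder output.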

Below we focus on Masked Multi-head Attention and Multi-head Attention.

Masked Multi-head Attention

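In standard multi-head attention, the query (Query) at each position is dot-multiplied with the keys (Key) at all positions to obtain attention scores. The scores are normalized with softmax and then used as weights in a weighted sum of the values (Value), which gives the final output.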
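As a rough illustration of this computation, here is a minimal single-head sketch in plain Python. It assumes the standard scaled dot-product form (scores divided by the square root of the key dimension) and leaves out the learned projection matrices and the split into multiple heads; the function names are illustrative, not taken from the article's code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention for one head, on lists of vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Dot product of this query with every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
out = attention(q, k, v)
print(out)
```

Because the query matches the first key more closely, the first value vector receives the larger weight, so the first output component is larger than the second.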

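In the decoder, however, the model must not access future information while it generates a sequence. A mask (Mask) is therefore used to block information from future positions.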

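For example, in “I saw ______ chasing a mouse”, we would intuitively fill in “a cat”, because that is the most likely choice. When encoding a word, then, the model needs to know everything in the whole sentence, which is why the query in the encoder's self-attention layer is run against all words. During decoding, however, when the model tries to predict the next word of a sentence, it logically should not know which words come after the one being predicted.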

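The mask mechanism blocks future positions with an upper triangular matrix. Specifically, each query position t may attend only to position t and the positions before it, while the positions after t are masked out. This is done by setting the attention scores of future positions to negative infinity before the softmax, so that they receive zero weight (or near-zero weight when a large negative constant is used in place of negative infinity).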
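The effect of the mask is easy to see on a tiny example. The sketch below is plain Python and uses uniform raw scores so that only the mask shapes the result. The convention that `True` means "allowed to attend" is an assumption of this sketch, and `causal_mask` and `masked_softmax` are illustrative names.

```python
import math


def causal_mask(size):
    """mask[t][s] is True when query position t may attend to key position s."""
    return [[s <= t for s in range(size)] for t in range(size)]


def masked_softmax(scores, allowed):
    """Softmax over one row of scores, with disallowed positions set to -inf."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    top = max(masked)  # finite, because position t may always attend to itself
    exps = [math.exp(s - top) for s in masked]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]


mask = causal_mask(4)
# Uniform raw scores, so every allowed position gets the same weight.
weights = [masked_softmax([0.0] * 4, row) for row in mask]
for row in weights:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

Every entry above the diagonal is zero, so position t never receives information from positions after t.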

Multi-head Attention

The Multi-head Attention in the decoder is also called Encoder-Decoder Attention. Its Query comes from the decoder's self-attention sub-layer, while its Key and Value are the encoder output.
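One consequence worth making concrete is that the source and target sequences may have different lengths: the output has one vector per decoder position, and each vector is a mixture of encoder outputs. The following plain-Python sketch assumes a single head and omits the learned projections; `cross_attention` is an illustrative name.

```python
import math


def cross_attention(decoder_states, encoder_output):
    """Single-head encoder-decoder attention without learned projections.

    Queries come from the decoder; keys and values are both the encoder output.
    """
    d_k = len(encoder_output[0])
    result = []
    for q in decoder_states:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
            for k in encoder_output
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Each output vector is a weighted mix of the encoder output vectors.
        result.append([
            sum(w * v[j] for w, v in zip(weights, encoder_output))
            for j in range(d_k)
        ])
    return result


encoder_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source positions
decoder_states = [[2.0, 0.0], [0.0, 2.0]]               # 2 target positions
context = cross_attention(decoder_states, encoder_output)
print(len(context), len(context[0]))  # 2 2
```

No causal mask is needed here, because the whole source sentence is already available when decoding starts.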

Now let's look at a Python implementation of the decoder.

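# Code implementation of the decoder layer.
# MultiHeadAttention and FeedForward are assumed to be the classes defined in Part 1.
import torch
import torch.nn as nn
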
class DecoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(DecoderLayer, self).__init__()
        self.masked_self_attention = MultiHeadAttention(d_model, num_heads)
        self.enc_dec_attention = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, encoder_output, src_mask, tgt_mask):
        
        # Masked self-attention sub-layer: q, k and v all come from the decoder input
        self_attention_output = self.masked_self_attention(x, x, x, tgt_mask)
        self_attention_output = self.dropout(self_attention_output)
        x = x + self_attention_output
        x = self.norm1(x)
        
        # Encoder-decoder attention sub-layer: q from the decoder, k and v from the encoder
        enc_dec_attention_output = self.enc_dec_attention(
            x, encoder_output, encoder_output, src_mask
        )
        enc_dec_attention_output = self.dropout(enc_dec_attention_output)
        x = x + enc_dec_attention_output
        x = self.norm2(x)
        
        # Feed-forward layer
        feed_forward_output = self.feed_forward(x)
        feed_forward_output = self.dropout(feed_forward_output)
        x = x + feed_forward_output
        x = self.norm3(x)
        
        return x


# Define the DecoderLayer parameters
d_model = 512  # Dimensionality of the model
num_heads = 8  # Number of attention heads
d_ff = 2048    # Dimensionality of the feed-forward network
dropout = 0.1  # Dropout probability
batch_size = 1 # Batch Size
max_len = 100  # Max length of Sequence

# Define the DecoderLayer instance
decoder_layer = DecoderLayer(d_model, num_heads, d_ff, dropout)


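# Random stand-ins for the target-side input and the encoder output
# (assumed shape: batch_size x max_len x d_model)
input_sequence = torch.randn(batch_size, max_len, d_model)
encoder_output = torch.randn(batch_size, max_len, d_model)

# Random source mask, for demonstration only
src_mask = torch.rand(batch_size, max_len, max_len) > 0.5
# True marks future positions (strictly above the diagonal). Whether True means
# "mask out" or "keep" depends on the MultiHeadAttention implementation from Part 1.
tgt_mask = torch.tril(torch.ones(max_len, max_len)).unsqueeze(0) == 0

# Pass the input tensors through the DecoderLayer
output = decoder_layer(input_sequence, encoder_output, src_mask, tgt_mask)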

# Output shape
print("Output shape:", output.shape)