[Natural Language Processing] The Multi-Head Attention Mechanism

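Multi-head attention is a key component of the Transformer model. It is widely used in natural language processing tasks (such as machine translation and text generation) as well as in image processing tasks. Its core idea is to use several attention heads, each with its own learned projections, to capture different features of the input, which increases the model's expressive power. A detailed explanation follows.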

1. Multi-Head Attention

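Multi-head attention extends the single attention mechanism (see the post "[Model] Self-Attention") and lets the model "look at" the input from several perspectives.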

Concretely, multi-head attention proceeds in the following steps:

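Linear projection: First, the query Q, key K, and value V are each passed through separate linear transformations (matrix multiplications) that project them into h different subspaces, one per head. Suppose there are h attention heads and each head has dimension d_k (in the code below, d_k = d_model / h).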

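For each head i, compute:
$$\mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\ K W_i^K,\ V W_i^V)$$
where W_i^Q, W_i^K, and W_i^V are the projection matrices of head i.
The query Q, key K, and value V are themselves produced from the input by linear transformations: Q = X W^Q, K = X W^K, V = X W^V, where X is the input sequence and W^Q, W^K, and W^V are the weight matrices for Q, K, and V.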

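Parallel attention computation: Each head computes its own attention output using scaled dot-product attention.
Concatenation of head outputs: The outputs of all h heads are concatenated along the feature dimension, giving one vector of size h × d_k per position.
Linear transformation: The concatenated result is projected by another linear transformation W^O to give the final multi-head attention output.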
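The four steps above can be sketched with an explicit loop over the heads. This is a minimal illustration for the self-attention case (Q, K, and V all come from the same input X), not an efficient implementation. The sizes and the tensor names `W_Q`, `W_K`, `W_V`, and `W_O` are assumptions chosen for the example.

```python
import math
import torch

torch.manual_seed(0)
batch, seq_len, d_model, h = 2, 5, 16, 4
d_k = d_model // h  # dimension of each head

X = torch.randn(batch, seq_len, d_model)  # input sequence

# Hypothetical per-head projection matrices W_i^Q, W_i^K, W_i^V and output
# matrix W^O. They are random here; in a real model they are learned.
W_Q = torch.randn(h, d_model, d_k) / math.sqrt(d_model)
W_K = torch.randn(h, d_model, d_k) / math.sqrt(d_model)
W_V = torch.randn(h, d_model, d_k) / math.sqrt(d_model)
W_O = torch.randn(h * d_k, d_model) / math.sqrt(d_model)

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d_k)) v."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

heads = []
for i in range(h):
    # Step 1: project the input into the subspace of head i
    q_i, k_i, v_i = X @ W_Q[i], X @ W_K[i], X @ W_V[i]
    # Step 2: scaled dot-product attention inside that subspace
    heads.append(attention(q_i, k_i, v_i))

concat = torch.cat(heads, dim=-1)  # Step 3: (batch, seq_len, h * d_k)
output = concat @ W_O              # Step 4: (batch, seq_len, d_model)

print(concat.shape)
print(output.shape)
```

Each head only sees a d_k-dimensional slice of the representation, and the output projection mixes the heads back together into a d_model-dimensional vector per position.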

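The projection matrices W_i^Q, W_i^K, and W_i^V, together with the weight matrix W^O applied after the head outputs are concatenated, are all randomly initialized and then learned during training.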
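As a small illustration of "randomly initialized, then learned", the sketch below uses a single `nn.Linear` layer as a stand-in for one projection matrix and applies one gradient step. The layer size, the dummy loss, and the learning rate are arbitrary assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for one projection matrix such as W^Q; nn.Linear initializes it randomly
proj = nn.Linear(16, 16)
before = proj.weight.detach().clone()

optimizer = torch.optim.SGD(proj.parameters(), lr=0.1)
x = torch.randn(4, 16)

# A dummy loss, used only to produce gradients for the demonstration
loss = proj(x).pow(2).mean()
loss.backward()
optimizer.step()

# The weights are trainable parameters and have moved after one update
is_trainable = proj.weight.requires_grad
changed = not torch.equal(before, proj.weight.detach())
print(is_trainable, changed)
```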

2. Advantages of Multi-Head Attention

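Capturing different features: Each attention head can focus on different parts of the input, so the model captures more varied information and gains representational power.
Parallel computation: The heads do not depend on one another, so they can all be computed in parallel as one batched matrix operation. When each head has dimension d_k = d_model / h, the total cost is similar to that of a single attention head working on the full d_model dimensions.
Greater expressive capacity: With several heads, the model can learn more complex relationships, which makes it suitable for more complex tasks.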
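The parallelism can be checked directly: computing all heads in one batched operation gives the same result as looping over the heads one by one. The sketch below is a minimal check under assumed sizes; the helper `split_heads` and the layer names are hypothetical.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, d_model, h = 2, 5, 16, 4
d_k = d_model // h

x = torch.randn(batch, seq_len, d_model)
q_lin = nn.Linear(d_model, d_model)
k_lin = nn.Linear(d_model, d_model)
v_lin = nn.Linear(d_model, d_model)

def split_heads(t):
    # (batch, seq_len, d_model) -> (batch, h, seq_len, d_k)
    return t.view(batch, seq_len, h, d_k).transpose(1, 2)

with torch.no_grad():
    Q, K, V = split_heads(q_lin(x)), split_heads(k_lin(x)), split_heads(v_lin(x))

    # Batched: one matmul handles every head at once
    weights = torch.softmax(Q @ K.transpose(-2, -1) / math.sqrt(d_k), dim=-1)
    batched = weights @ V

    # Looped: the same computation, one head at a time
    looped = torch.stack(
        [
            torch.softmax(Q[:, i] @ K[:, i].transpose(-2, -1) / math.sqrt(d_k), dim=-1) @ V[:, i]
            for i in range(h)
        ],
        dim=1,
    )

same = torch.allclose(batched, looped, atol=1e-6)
print(batched.shape, same)
```

The batched form is what the `view(...).transpose(1, 2)` pattern in practical implementations relies on: the head index becomes an extra batch dimension.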

3. Use in the Transformer

In the Transformer model, multi-head attention is used in both the encoder and the decoder. Specifically:

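Self-attention: Every element of the input sequence can relate to the other elements of the same sequence. Multi-head self-attention lets each position "look at" the information at other positions. In the decoder, self-attention is masked so that a position can only attend to itself and earlier positions.
Encoder-decoder attention (cross-attention): In the decoder, the queries come from the decoder while the keys and values come from the encoder output, so the model can "attend to" the source sequence as it generates the target sequence.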
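The difference between the two uses shows up in the tensor shapes. The sketch below uses PyTorch's built-in `nn.MultiheadAttention` with random tensors standing in for real embeddings; the sequence lengths and the variable names are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, num_heads = 16, 2

# Separate modules, since self-attention and cross-attention have their own weights
self_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

src = torch.randn(1, 6, d_model)  # stand-in for an encoder output with 6 source positions
tgt = torch.randn(1, 4, d_model)  # stand-in for decoder states with 4 target positions

with torch.no_grad():
    # Self-attention: query, key, and value are the same sequence
    self_out, self_weights = self_attn(src, src, src)

    # Cross-attention: queries from the decoder, keys and values from the encoder
    cross_out, cross_weights = cross_attn(tgt, src, src)

print(self_out.shape, self_weights.shape)    # weights: (batch, src_len, src_len)
print(cross_out.shape, cross_weights.shape)  # weights: (batch, tgt_len, src_len)
```

The output always has one vector per query position, while the attention weights have one row per query position and one column per key position. By default the returned weights are averaged over the heads.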

A simple example of Multi-Head Attention for machine translation, implemented in PyTorch:

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np

# English and French vocabularies
english_vocab = {"<pad>": 0, "i": 1, "am": 2, "a": 3, "student": 4, "he": 5, "is": 6, "teacher": 7,
                 "she": 8, "loves": 9, "apples": 10, "we": 11, "are": 12, "friends": 13}
french_vocab = {"<pad>": 0, "je": 1, "suis": 2, "un": 3, "étudiant": 4, "il": 5, "est": 6,
                "professeur": 7, "elle": 8, "aime": 9, "les": 10, "pommes": 11, "nous": 12, "sommes": 13, "amis": 14}

# Invert the dictionaries so that words can be looked up by index
english_idx2word = {i: w for w, i in english_vocab.items()}
french_idx2word = {i: w for w, i in french_vocab.items()}

# Sentence pairs
pairs = [
    ["i am a student", "je suis un étudiant"],
    ["he is a teacher", "il est un professeur"],
    ["she loves apples", "elle aime les pommes"],
    ["we are friends", "nous sommes amis"]
]

# Tokenization: convert a sentence into vocabulary indices
def tokenize_sentence(sentence, vocab):
    return [vocab[word] for word in sentence.split()]

# Convert the sentence pairs into index sequences
tokenized_pairs = [(tokenize_sentence(p[0], english_vocab), tokenize_sentence(p[1], french_vocab)) for p in pairs]

# Padding function: append <pad> (index 0) up to max_len
def pad_sequence(seq, max_len):
    return seq + [0] * (max_len - len(seq))

# Maximum sequence lengths
max_len_src = max([len(pair[0]) for pair in tokenized_pairs])
max_len_tgt = max([len(pair[1]) for pair in tokenized_pairs])

# Padded source and target sequences
src_sentences = torch.tensor([pad_sequence(pair[0], max_len_src) for pair in tokenized_pairs])
tgt_sentences = torch.tensor([pad_sequence(pair[1], max_len_tgt) for pair in tokenized_pairs])

# Model hyperparameters
d_model = 16   # embedding dimension
num_heads = 2  # number of attention heads
d_k = d_model // num_heads  # dimension of each head

# Embedding layer
class EmbeddingLayer(nn.Module):
    def __init__(self, vocab_size, d_model):
        super(EmbeddingLayer, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        return self.embedding(x)

# Multi-Head Attention implementation
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Each linear layer holds the projections of all heads side by side
        self.q_linear = nn.Linear(d_model, d_model)
        self.k_linear = nn.Linear(d_model, d_model)
        self.v_linear = nn.Linear(d_model, d_model)

        # Output projection W^O, applied after the heads are concatenated
        self.out_proj = nn.Linear(d_model, d_model)

    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # softmax(Q K^T / sqrt(d_k)) V
        scores = torch.matmul(Q, K.transpose(-2, -1)) / np.sqrt(self.d_k)
        if mask is not None:
            # Positions where mask is False receive zero attention weight
            scores = scores.masked_fill(~mask, float("-inf"))
        attn_weights = torch.softmax(scores, dim=-1)
        output = torch.matmul(attn_weights, V)
        return output

    def forward(self, Q, K, V, mask=None):
        batch_size = Q.size(0)

        # Apply the linear projections, then split into heads: (batch, num_heads, seq_len, d_k)
        Q = self.q_linear(Q).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.k_linear(K).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.v_linear(V).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # All heads compute their attention in one batched operation
        attention_output = self.scaled_dot_product_attention(Q, K, V, mask)

        # Concatenate the head outputs back to (batch, seq_len, d_model)
        concat_output = attention_output.transpose(1, 2).contiguous().view(batch_size, -1, self.num_heads * self.d_k)

        # Final linear transformation
        output = self.out_proj(concat_output)
        return output

class Encoder(nn.Module):
    def __init__(self, src_vocab_size, d_model, num_heads):
        super(Encoder, self).__init__()
        self.embedding = EmbeddingLayer(src_vocab_size, d_model)
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.fc = nn.Linear(d_model, d_model)

    def forward(self, src):
        src_embedded = self.embedding(src)
        attention_output = self.attention(src_embedded, src_embedded, src_embedded)
        output = self.fc(attention_output)
        return output

class Decoder(nn.Module):
    def __init__(self, tgt_vocab_size, d_model, num_heads):
        super(Decoder, self).__init__()
        self.embedding = EmbeddingLayer(tgt_vocab_size, d_model)
        self.self_attention = MultiHeadAttention(d_model, num_heads)  # self-attention
        self.cross_attention = MultiHeadAttention(d_model, num_heads)  # encoder-decoder cross-attention
        self.fc = nn.Linear(d_model, tgt_vocab_size)

    def forward(self, tgt, encoder_output):
        tgt_embedded = self.embedding(tgt)

        # Causal mask: position t may attend only to positions <= t
        tgt_len = tgt.size(1)
        causal_mask = torch.tril(torch.ones(tgt_len, tgt_len, dtype=torch.bool, device=tgt.device))

        # Decoder self-attention (masked so that later target words stay hidden)
        tgt_self_attention_output = self.self_attention(tgt_embedded, tgt_embedded, tgt_embedded, mask=causal_mask)

        # Encoder-decoder attention (the encoder output serves as Key and Value)
        attention_output = self.cross_attention(tgt_self_attention_output, encoder_output, encoder_output)

        # Final linear layer: project to target-vocabulary logits
        output = self.fc(attention_output)
        return output

# Full encoder-decoder structure
class Translator(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model, num_heads):
        super(Translator, self).__init__()
        self.encoder = Encoder(src_vocab_size, d_model, num_heads)
        self.decoder = Decoder(tgt_vocab_size, d_model, num_heads)

    def forward(self, src, tgt):
        encoder_output = self.encoder(src)
        output = self.decoder(tgt, encoder_output)
        return output

# Create the encoder-decoder model
model = Translator(len(english_vocab), len(french_vocab), d_model, num_heads)

# Loss and optimizer
criterion = nn.CrossEntropyLoss(ignore_index=0)  # ignore padding
optimizer = optim.Adam(model.parameters(), lr=0.001)

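# Train the model
for epoch in range(100):
    optimizer.zero_grad()

    # Teacher forcing: the decoder input is the target shifted right by one position,
    # with <pad> (index 0) standing in for a start symbol, so that the output at
    # position t is trained to predict target word t from the words before it
    start_tokens = torch.zeros_like(tgt_sentences[:, :1])
    decoder_input = torch.cat([start_tokens, tgt_sentences[:, :-1]], dim=1)
    output = model(src_sentences, decoder_input)

    # Reshape the output to match what the loss function expects
    output = output.view(-1, output.size(-1))
    tgt_sentences_flat = tgt_sentences.view(-1)

    loss = criterion(output, tgt_sentences_flat)
    loss.backward()
    optimizer.step()

    if (epoch + 1) % 10 == 0:
        print(f"Epoch [{epoch+1}/100], Loss: {loss.item():.4f}")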

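# Prediction function (greedy decoding)
def predict(model, src_sentence, max_len=10):
    model.eval()  # switch to evaluation mode

    # Convert the source sentence to indices and pad it
    src_tensor = torch.tensor([pad_sequence(tokenize_sentence(src_sentence, english_vocab), max_len_src)])

    # The loss ignores <pad>, so the model is never trained to emit it as an end symbol.
    # Generation is therefore also capped at the longest training target length.
    max_steps = min(max_len, max_len_tgt)

    with torch.no_grad():  # no gradients are needed at inference time
        # Encoder output for the source sentence
        encoder_output = model.encoder(src_tensor)

        # Start the target with <pad> as a stand-in start symbol (use <bos> in practice)
        tgt_sentence = [french_vocab["<pad>"]]

        # Generate the translation one word at a time
        for _ in range(max_steps):
            # Feed only the words generated so far; a single sentence needs no padding
            tgt_tensor = torch.tensor([tgt_sentence])

            # Run the decoder on the current target prefix
            output = model.decoder(tgt_tensor, encoder_output)

            # The output at the last time step is the prediction for the next word
            next_word_idx = torch.argmax(output[0, -1], dim=-1).item()

            # Stop if the predicted word is <pad>
            if next_word_idx == french_vocab["<pad>"]:
                break

            # Append the predicted word to the target sentence
            tgt_sentence.append(next_word_idx)

    # Convert the word indices back into a sentence
    translated_sentence = [french_idx2word[idx] for idx in tgt_sentence if idx != 0]
    return " ".join(translated_sentence)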

# Test the translation
for pair in pairs:
    english_sentence = pair[0]
    prediction = predict(model, english_sentence)
    print(f"English: {english_sentence} -> French (Predicted): {prediction}")