How I Built a Sparse Mixture of Experts (SMoE) Model

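This blog post walks through building a sparse mixture of experts (SMoE) language model from scratch. The project is heavily inspired by Andrej Karpathy's "makemore" project and borrows many of its reusable components. Like makemore, makeMoE is an autoregressive character-level language model, but it uses the so-called sparse mixture of experts architecture.

The rest of the post focuses on the key elements of this architecture and how they are implemented. My hope is that by reading it and running the code in the repository, you will build an intuitive understanding of how it works.

With the release of Mixtral and speculation that Llama 3 might be a mixture-of-experts LLM, interest in this architecture has grown. A sparse mixture of experts language model nevertheless shares most of its components with a standard Transformer. Simple as it looks, training stability is a major challenge for these models in practice. A small, hackable implementation like this one may help with quickly trying out new approaches.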

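In this implementation I made a few important changes to the makemore architecture:

A sparse mixture of experts replaces the single feed-forward network.
Top-k gating and noisy top-k gating are implemented.
Initialization: Kaiming He initialization is used here, but the point of this project is that it is easy to swap out, for example for Xavier (Glorot) initialization.

The following, however, stays the same as in makemore:

The dataset, the preprocessing (tokenization), and the language modeling task Andrej originally chose: generating Shakespeare-like text.
The implementation of causal self-attention.
The training loop and the inference logic.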

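Like any Transformer-based model, the sparse mixture of experts language model relies on self-attention to understand context. Before diving into the details of the mixture of experts module, let's review the basics of self-attention.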

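The code in this section shows the mechanics and core idea of self-attention, specifically the common form known as scaled dot-product self-attention. Here the queries, keys, and values are all derived from the same input sequence. To keep autoregressive generation consistent, particularly in a decoder-only model, the code applies a mask. The mask is essential: it hides everything after the current token, so the model attends only to the preceding part of the sequence. This kind of attention is called causal self-attention.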
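To make the masking step concrete, here is a minimal sketch of it in isolation. The toy sequence length `T = 4` and the random scores are illustrative choices of mine, not values from the repository.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T = 4  # toy sequence length (illustrative assumption)

# Raw attention scores between every pair of positions: (T, T)
scores = torch.randn(T, T)

# Lower-triangular matrix: position t may only attend to positions <= t
tril = torch.tril(torch.ones(T, T))

# Future positions are set to -inf, so softmax gives them exactly zero weight
masked = scores.masked_fill(tril == 0, float('-inf'))
weights = F.softmax(masked, dim=-1)

print(weights)
```

Each row of `weights` sums to 1, everything above the diagonal is zero, and the first row puts all of its weight on the first position because that is the only one it is allowed to see.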

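It is worth noting that sparse mixture of experts models are not limited to decoder-only Transformers. In fact, much of the important work in this area, particularly by Shazeer and colleagues (for example, the Switch Transformer), builds on the T5 architecture, which contains both the encoder and the decoder of the Transformer.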

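The code for causal self-attention and multi-head causal self-attention is structured as follows. Multi-head self-attention runs several attention heads side by side, each projecting the input into its own lower-dimensional subspace, and concatenates their outputs. This lets different heads learn different relationships, and the heads can be computed in parallel. To guard against overfitting, I use dropout as a regularizer throughout the implementation.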


# Causal scaled dot-product self-attention head
import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd = 64
n_head = 4
n_layer = 4
head_size = 16
dropout = 0.1
# block_size (the maximum context length) is a training hyperparameter
# that must be defined before Head is instantiated.

class Head(nn.Module):
    """ one head of self-attention """

    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)   # (B, T, head_size)
        q = self.query(x) # (B, T, head_size)
        # compute attention scores ("affinities"), scaled by 1/sqrt(head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5 # (B, T, hs) @ (B, hs, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf')) # (B, T, T)
        wei = F.softmax(wei, dim=-1) # (B, T, T)
        wei = self.dropout(wei)
        # perform the weighted aggregation of the values
        v = self.value(x) # (B, T, head_size)
        out = wei @ v # (B, T, T) @ (B, T, hs) -> (B, T, hs)
        return out

Multi-head self-attention is implemented as follows:


# Multi-headed self-attention
class MultiHeadAttention(nn.Module):
    """ multiple heads of self-attention in parallel """

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        # the concatenated head outputs have size num_heads * head_size,
        # which is assumed to equal n_embd
        self.proj = nn.Linear(n_embd, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        out = self.dropout(self.proj(out))
        return out

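Next, we build the expert module, which is just a simple multilayer perceptron (MLP). In the sparse mixture of experts (MoE) architecture, the self-attention mechanism within each Transformer block stays the same, but the structure of the block changes significantly: the standard feed-forward network is replaced by several sparsely activated feed-forward networks, known as experts.

Sparsely activated means that each token in the sequence is routed to only a small number of experts, typically one or two, rather than to all of them. **This helps training and inference speed, because only a few experts are activated on each forward pass.** However, all of the experts must still be held in GPU memory, which creates deployment challenges when the total parameter count reaches hundreds of billions or even trillions.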


# Expert module
class Expert(nn.Module):
    """ An MLP is a simple linear layer followed by a non-linearity i.e. each Expert """

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)
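To put rough numbers on the memory-versus-compute trade-off described above, the sketch below counts the parameters of one expert with the same layer shapes as `Expert`. The choice of 8 experts with 2 active per token is an illustrative assumption; `n_embd = 64` matches the attention code earlier.

```python
n_embd = 64       # embedding size used in the attention code above
num_experts = 8   # illustrative assumption
top_k = 2         # illustrative assumption

def expert_param_count(n_embd: int) -> int:
    """Parameters in one expert: Linear(n_embd, 4*n_embd) + Linear(4*n_embd, n_embd), biases included."""
    hidden = 4 * n_embd
    return (n_embd * hidden + hidden) + (hidden * n_embd + n_embd)

per_expert = expert_param_count(n_embd)
stored = num_experts * per_expert   # all experts must be kept in memory
active = top_k * per_expert         # parameters actually used for a single token

print(per_expert, stored, active)   # 33088 264704 66176
```

Under these assumptions, each token touches only a quarter of the expert parameters in a layer, while all of them still have to be stored.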

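A simple example helps build intuition for top-k gating:

The gating network, also called the router, decides which expert network receives the output of multi-head self-attention for each token. Suppose there are 4 experts and each token is to be sent to the top 2. First, the token is fed into the gating network through a linear layer. This layer maps the input tensor from shape (2, 4, 32) to (2, 4, 4), where (2, 4, 32) is (batch size, number of tokens, n_embed), with n_embed the channel dimension of the input, and (2, 4, 4) is (batch size, number of tokens, number of experts). We then take the two largest values along the last dimension together with their indices. This is the top-k selection.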


# Understanding how gating works
import torch
import torch.nn as nn
import torch.nn.functional as F

num_experts = 4
top_k = 2
n_embed = 32


# Example multi-head attention output for a simple illustration: n_embed=32, context_length=4, batch_size=2
mh_output = torch.randn(2, 4, n_embed)

topkgate_linear = nn.Linear(n_embed, num_experts) # nn.Linear(32, 4)

logits = topkgate_linear(mh_output)
top_k_logits, top_k_indices = logits.topk(top_k, dim=-1)  # Get top-k experts
top_k_logits, top_k_indices


# output:
(tensor([[[ 0.0246, -0.0190],
          [ 0.1991,  0.1513],
          [ 0.9749,  0.7185],
          [ 0.4406, -0.8357]],

         [[ 0.6206, -0.0503],
          [ 0.8635,  0.3784],
          [ 0.6828,  0.5972],
          [ 0.4743,  0.3420]]], grad_fn=<TopkBackward0>),
 tensor([[[2, 3],
          [2, 1],
          [3, 1],
          [2, 1]],

         [[0, 2],
          [0, 3],
          [3, 2],
          [3, 0]]]))

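In sparse gating, we keep only the top k values along the last dimension, at their original indices. Every other position is filled with negative infinity (-inf), and the result is passed through a softmax. The softmax turns the -inf entries into zeros and normalizes the top two values so that they sum to 1. This sum-to-1 property matters because these values are used to weight the expert outputs.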


# full_like creates a tensor with the same shape as logits, filled with the given value (here -inf)
zeros = torch.full_like(logits, float('-inf'))
sparse_logits = zeros.scatter(-1, top_k_indices, top_k_logits)
sparse_logits

gating_output = F.softmax(sparse_logits, dim=-1)
gating_output

# output
tensor([[[0.0000, 0.0000, 0.5109, 0.4891],
         [0.0000, 0.4881, 0.5119, 0.0000],
         [0.0000, 0.4362, 0.0000, 0.5638],
         [0.0000, 0.2182, 0.7818, 0.0000]],

        [[0.6617, 0.0000, 0.3383, 0.0000],
         [0.6190, 0.0000, 0.0000, 0.3810],
         [0.0000, 0.0000, 0.4786, 0.5214],
         [0.4670, 0.0000, 0.0000, 0.5330]]], grad_fn=<SoftmaxBackward0>)
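As a sanity check, the first row of the gating output can be reproduced by hand from the first token's top-2 logits printed earlier. This is only a sketch: the logits are copied from the rounded printout, so the result matches to about four decimal places.

```python
import torch
import torch.nn.functional as F

# Top-2 logits of the first token in the printed output:
# expert 2 -> 0.0246, expert 3 -> -0.0190; experts 0 and 1 were not selected.
sparse_row = torch.tensor([float('-inf'), float('-inf'), 0.0246, -0.0190])

gate = F.softmax(sparse_row, dim=-1)
print(gate)  # approximately [0.0000, 0.0000, 0.5109, 0.4891]
```

The -inf entries contribute exp(-inf) = 0 to the softmax, so only the two selected experts share the probability mass.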

Next, we generalize and modularize the code above, and then add noisy top-k gating for load balancing.

# First define the top-k router module
class TopkRouter(nn.Module):
    def __init__(self, n_embed, num_experts, top_k):
        super(TopkRouter, self).__init__()
        self.top_k = top_k
        self.linear = nn.Linear(n_embed, num_experts)

    def forward(self, mh_output):
        # mh_output is the output tensor from the multi-head self-attention block
        logits = self.linear(mh_output)
        top_k_logits, indices = logits.topk(self.top_k, dim=-1)
        zeros = torch.full_like(logits, float('-inf'))
        sparse_logits = zeros.scatter(-1, indices, top_k_logits)
        router_output = F.softmax(sparse_logits, dim=-1)
        return router_output, indices

Now let's test it with some sample input:

# Testing this out:
num_experts = 4
top_k = 2
n_embd = 32

mh_output = torch.randn(2, 4, n_embd)  # Example input
top_k_gate = TopkRouter(n_embd, num_experts, top_k)
gating_output, indices = top_k_gate(mh_output)
gating_output.shape, gating_output, indices
# And it works!!

# output
(torch.Size([2, 4, 4]),
 tensor([[[0.5284, 0.0000, 0.4716, 0.0000],
          [0.0000, 0.4592, 0.0000, 0.5408],
          [0.0000, 0.3529, 0.0000, 0.6471],
          [0.3948, 0.0000, 0.0000, 0.6052]],

         [[0.0000, 0.5950, 0.4050, 0.0000],
          [0.4456, 0.0000, 0.5544, 0.0000],
          [0.7208, 0.0000, 0.0000, 0.2792],
          [0.0000, 0.0000, 0.5659, 0.4341]]], grad_fn=<SoftmaxBackward0>),
 tensor([[[0, 2],
          [3, 1],
          [3, 1],
          [3, 0]],

         [[1, 2],
          [2, 0],
          [0, 3],
          [2, 3]]]))

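Although the recently released Mixtral paper does not mention it, I believe noisy top-k gating is an important tool when training MoE models. You do not want every token to be routed to the same small set of favored experts; you want a balance between exploiting good experts and exploring the others. Adding Gaussian noise to the router logits, scaled by a learned per-expert factor, helps with load balancing and makes training more efficient.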

# Changing the above to accommodate noisy top-k gating
class NoisyTopkRouter(nn.Module):
    def __init__(self, n_embed, num_experts, top_k):
        super(NoisyTopkRouter, self).__init__()
        self.top_k = top_k
        # layer for router logits
        self.topkroute_linear = nn.Linear(n_embed, num_experts)
        # layer that predicts the per-expert noise scale
        self.noise_linear = nn.Linear(n_embed, num_experts)


    def forward(self, mh_output):
        # mh_output is the output tensor from the multi-head self-attention block
        logits = self.topkroute_linear(mh_output)

        # Noise logits
        noise_logits = self.noise_linear(mh_output)

        # Adding scaled unit Gaussian noise to the logits
        noise = torch.randn_like(logits) * F.softplus(noise_logits)
        noisy_logits = logits + noise

        top_k_logits, indices = noisy_logits.topk(self.top_k, dim=-1)
        zeros = torch.full_like(noisy_logits, float('-inf'))
        sparse_logits = zeros.scatter(-1, indices, top_k_logits)
        router_output = F.softmax(sparse_logits, dim=-1)
        return router_output, indices
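Written as equations, the noisy router above computes the following for a token representation x. This mirrors the noisy top-k gating formulation of Shazeer et al. (2017); the symbols W_g, b_g, W_noise and b_noise are my own names for the weights and biases of `topkroute_linear` and `noise_linear`.

```latex
\begin{aligned}
H(x)_i &= (W_g x + b_g)_i + \epsilon_i \cdot \operatorname{softplus}\big((W_{\text{noise}} x + b_{\text{noise}})_i\big),
\qquad \epsilon_i \sim \mathcal{N}(0, 1) \\
\operatorname{KeepTopK}(v, k)_i &=
\begin{cases}
v_i & \text{if } v_i \text{ is among the top } k \text{ entries of } v \\
-\infty & \text{otherwise}
\end{cases} \\
G(x) &= \operatorname{softmax}\big(\operatorname{KeepTopK}(H(x), k)\big)
\end{aligned}
```

Because softplus is always positive, the learned term acts as a per-token, per-expert standard deviation for the injected noise.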

Now let's test this implementation again.

# Testing this out, again:
num_experts = 8
top_k = 2
n_embd = 16

mh_output = torch.randn(2, 4, n_embd)  # Example input
noisy_top_k_gate = NoisyTopkRouter(n_embd, num_experts, top_k)
gating_output, indices = noisy_top_k_gate(mh_output)
gating_output.shape, gating_output, indices
# It works!!

# output
(torch.Size([2, 4, 8]),
 tensor([[[0.4181, 0.0000, 0.5819, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
          [0.4693, 0.5307, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
          [0.0000, 0.4985, 0.5015, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000, 0.2641, 0.0000, 0.7359, 0.0000, 0.0000]],

         [[0.0000, 0.0000, 0.0000, 0.6301, 0.0000, 0.3699, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000, 0.4766, 0.0000, 0.0000, 0.0000, 0.5234],
          [0.0000, 0.0000, 0.0000, 0.6815, 0.0000, 0.0000, 0.3185, 0.0000],
          [0.4482, 0.5518, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000]]],
        grad_fn=<SoftmaxBackward0>),
 tensor([[[2, 0],
          [1, 0],
          [2, 1],
          [5, 3]],

         [[3, 5],
          [7, 3],
          [3, 6],
          [1, 0]]]))

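**The sparse mixture of experts module builds mainly on the output of the gating network.** Once we have it, the outputs of the top k experts for each token are multiplied by the corresponding top-k gate values. These selective multiplications add up to a weighted sum, which is the output of the SparseMoE module. **The key challenge is to avoid unnecessary computation.** It is important to run the forward pass only through the top-k experts for each token and then compute the weighted sum. Running every expert on every token would defeat the purpose of a sparse MoE, since it would no longer be sparse.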

class SparseMoE(nn.Module):
    def __init__(self, n_embed, num_experts, top_k):
        super(SparseMoE, self).__init__()
        self.router = NoisyTopkRouter(n_embed, num_experts, top_k)
        self.experts = nn.ModuleList([Expert(n_embed) for _ in range(num_experts)])
        self.top_k = top_k

    def forward(self, x):
        gating_output, indices = self.router(x)
        final_output = torch.zeros_like(x)

        # Flatten the batch and token dimensions so tokens can be selected with a 1-D mask
        flat_x = x.view(-1, x.size(-1))
        flat_gating_output = gating_output.view(-1, gating_output.size(-1))

        # Loop over the experts; each one only processes the tokens routed to it
        for i, expert in enumerate(self.experts):
            # Create a mask for the inputs where the current expert is in top-k
            expert_mask = (indices == i).any(dim=-1)
            flat_mask = expert_mask.view(-1)

            if flat_mask.any():
                expert_input = flat_x[flat_mask]
                expert_output = expert(expert_input)

                # Extract and apply gating scores
                gating_scores = flat_gating_output[flat_mask, i].unsqueeze(1)
                weighted_output = expert_output * gating_scores

                # Update final output
                # Accumulate (rather than overwrite) the weighted outputs at their original
                # positions, because each token receives contributions from top_k experts
                final_output[expert_mask] += weighted_output

        return final_output.view_as(x)

A good way to check the implementation is to test it on sample input. Running the code below confirms that the module runs and returns an output of the expected shape.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Let's test this out
num_experts = 8
top_k = 2
n_embd = 16
dropout = 0.1

mh_output = torch.randn(4, 8, n_embd)  # Example multi-head attention output
sparse_moe = SparseMoE(n_embd, num_experts, top_k)
final_output = sparse_moe(mh_output)
print("Shape of the final output:", final_output.shape)

Shape of the final output: torch.Size([4, 8, 16])

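To stress the point: the magnitudes of the top_k gate values produced by the router (gating network) matter too. The top_k indices determine which experts are activated, and the values at those positions determine how the experts' outputs are weighted. This weighted summation is illustrated in more detail in the figure below.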
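One way to convince yourself that sparse dispatch really computes this weighted sum is to compare it against a dense reference that runs every expert on every token. The sketch below is self-contained and uses toy stand-ins of my own (plain `nn.Linear` layers as experts, noise-free top-k gating, small sizes), not the classes from this post.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
num_experts, top_k, n_embed = 4, 2, 8   # toy sizes (illustrative assumptions)

x = torch.randn(2, 3, n_embed)          # (batch, tokens, n_embed)
experts = nn.ModuleList([nn.Linear(n_embed, n_embed) for _ in range(num_experts)])
router = nn.Linear(n_embed, num_experts)

# Top-k gating without noise: keep the top-k logits, fill the rest with -inf, softmax
logits = router(x)
top_k_logits, indices = logits.topk(top_k, dim=-1)
sparse_logits = torch.full_like(logits, float('-inf')).scatter(-1, indices, top_k_logits)
gates = F.softmax(sparse_logits, dim=-1)   # (batch, tokens, num_experts)

# Sparse path: each expert only sees the tokens routed to it
sparse_out = torch.zeros_like(x)
for i, expert in enumerate(experts):
    mask = (indices == i).any(dim=-1)      # (batch, tokens)
    if mask.any():
        # += because every token gets contributions from top_k different experts
        sparse_out[mask] += expert(x[mask]) * gates[mask][:, i].unsqueeze(1)

# Dense reference: run every expert on every token, then weight by the gates
dense_out = sum(gates[..., i:i + 1] * expert(x) for i, expert in enumerate(experts))

print(torch.allclose(sparse_out, dense_out, atol=1e-5))
```

If the dispatch is correct, the two results agree up to floating-point error, since the gate value of every unselected expert is exactly zero.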

Multi-head self-attention and the sparse mixture of experts are combined into a sparse mixture of experts Transformer block. As in a standard Transformer block, we add skip connections to keep training stable and avoid problems such as vanishing gradients. Layer normalization is also used to further stabilize learning.

# Self-attention + mixture of experts block, which can be stacked several times
class Block(nn.Module):
    """ Mixture of Experts Transformer block: communication followed by computation (multi-head self attention + SparseMoE) """

    def __init__(self, n_embed, n_head, num_experts, top_k):
        # n_embed: embedding dimension, n_head: the number of heads we'd like
        super().__init__()
        head_size = n_embed // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.smoe = SparseMoE(n_embed, num_experts, top_k)
        self.ln1 = nn.LayerNorm(n_embed)
        self.ln2 = nn.LayerNorm(n_embed)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))
        x = x + self.smoe(self.ln2(x))
        return x

Finally, we put all of this together to create a sparse mixture of experts language model.

# vocab_size, block_size, n_embed, n_head, n_layer, num_experts, top_k and device are global
# hyperparameters; n_embed must equal the n_embd used by the attention classes above.
class SparseMoELanguageModel(nn.Module):

    def __init__(self):
        super().__init__()
        # token and position embedding tables
        self.token_embedding_table = nn.Embedding(vocab_size, n_embed)
        self.position_embedding_table = nn.Embedding(block_size, n_embed)
        self.blocks = nn.Sequential(*[Block(n_embed, n_head=n_head, num_experts=num_experts, top_k=top_k) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(n_embed) # final layer norm
        self.lm_head = nn.Linear(n_embed, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # idx and targets are both (B,T) tensor of integers
        tok_emb = self.token_embedding_table(idx) # (B,T,C)
        pos_emb = self.position_embedding_table(torch.arange(T, device=device)) # (T,C)
        x = tok_emb + pos_emb # (B,T,C)
        x = self.blocks(x) # (B,T,C)
        x = self.ln_f(x) # (B,T,C)
        logits = self.lm_head(x) # (B,T,vocab_size)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # get the predictions
            logits, loss = self(idx_cond)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

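Initialization is a key step in training deep neural networks effectively. Here I use Kaiming He initialization because the experts use ReLU activations. Glorot initialization is more common for Transformers in the literature, so feel free to try it; it may be an opportunity to improve model performance. Jeremy Howard's Fastai Part 2 has an excellent lesson that implements these methods from scratch: https://course.fast.ai/Lessons/lesson17.html.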

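import torch.nn.init as init

def kaiming_init_weights(m):
    # apply Kaiming He (normal) initialization to the weights of every linear layer
    if isinstance(m, nn.Linear):
        init.kaiming_normal_(m.weight)

model = SparseMoELanguageModel()
model.apply(kaiming_init_weights)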
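Swapping in Glorot initialization only requires a different function passed to `apply`. The sketch below mirrors `kaiming_init_weights`; the function name `glorot_init_weights` and the small stand-in network are my own, used so the snippet runs without the full model.

```python
import torch.nn as nn
import torch.nn.init as init

def glorot_init_weights(m):
    # Apply Xavier/Glorot (normal) initialization to the weights of every linear layer
    if isinstance(m, nn.Linear):
        init.xavier_normal_(m.weight)

# Toy stand-in for the full model (illustrative assumption)
toy_model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
toy_model.apply(glorot_init_weights)

# Glorot normal draws weights with std = sqrt(2 / (fan_in + fan_out))
print(toy_model[0].weight.std().item(), (2 / (16 + 64)) ** 0.5)
```

With the real model, the equivalent call would be `model.apply(glorot_init_weights)` in place of `model.apply(kaiming_init_weights)`.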

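I used MLflow to track and log the important metrics and hyperparameters during training, and the training loop shown here includes that code. If you would rather not use MLflow, the notebooks in the makeMoE GitHub repository also contain a version without it. Personally, I find MLflow very convenient for tracking parameters and metrics, especially when experimenting.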

# Using MLflow
import mlflow

m = model.to(device)
# print the number of parameters in the model
print(sum(p.numel() for p in m.parameters())/1e6, 'M parameters')

# create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
#mlflow.set_experiment("makeMoE")
with mlflow.start_run():
    # If you use mlflow.autolog() this will be automatically logged. I chose to explicitly log here for completeness
    params = {"batch_size": batch_size, "block_size": block_size, "max_iters": max_iters, "eval_interval": eval_interval,
              "learning_rate": learning_rate, "device": device, "eval_iters": eval_iters, "dropout": dropout, "num_experts": num_experts, "top_k": top_k}
    mlflow.log_params(params)
    for iter in range(max_iters):

        # every once in a while evaluate the loss on train and val sets
        if iter % eval_interval == 0 or iter == max_iters - 1:
            losses = estimate_loss()
            print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
            metrics = {"train_loss": losses['train'], "val_loss": losses['val']}
            mlflow.log_metrics(metrics, step=iter)


        # sample a batch of data
        xb, yb = get_batch('train')

        # evaluate the loss
        logits, loss = model(xb, yb)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

8.996545 M parameters  
step 0: train loss 5.3223, val loss 5.3166  
step 100: train loss 2.7351, val loss 2.7429  
step 200: train loss 2.5125, val loss 2.5233  
.  
.  
.  
  
step 4999: train loss 1.5712, val loss 1.7508

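Logging the training and validation loss gives a better picture of how training is progressing. The plot shows that I should have stopped at around 4,500 steps, when the validation loss ticked up slightly.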
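A simple way to automate that decision is a patience-based early-stopping check on the validation loss. The sketch below is not part of makeMoE; the `EarlyStopper` class, its parameters, and the loss values are all hypothetical and only illustrate the idea.

```python
class EarlyStopper:
    """Signals a stop when validation loss has not improved for `patience` evaluations."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('inf')
        self.bad_evals = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            # New best validation loss: reset the counter
            self.best = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Made-up validation losses, for illustration only
stopper = EarlyStopper(patience=2)
history = [2.74, 2.52, 2.10, 1.80, 1.76, 1.77, 1.78, 1.79]
stopped_at = None
for step, val_loss in enumerate(history):
    if stopper.should_stop(val_loss):
        stopped_at = step
        break

print(stopped_at, stopper.best)  # 6 1.76
```

In the training loop above, such a check could be called right after `estimate_loss()`, breaking out of the loop when it returns True.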

Now we can use this model to generate text character by character, autoregressively. For a sparsely activated model with roughly 9 million parameters, the results are not bad at all.

# generate from the model. Not great. Not too bad either
context = torch.zeros((1, 1), dtype=torch.long, device=device)
print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))

DUKE VINCENVENTIO:  
If it ever fecond he town sue kigh now,  
That thou wold'st is steen 't.  
  
SIMNA:  
Angent her; no, my a born Yorthort,  
Romeoos soun and lawf to your sawe with ch a woft ttastly defy,  
To declay the soul art; and meart smad.  
  
CORPIOLLANUS:  
Which I cannot shall do from by born und ot cold warrike,  
What king we best anone wrave's going of heard and good  
Thus playvage; you have wold the grace.  
...

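I hope this explanation helps you understand the architecture of sparse mixture of experts models and how the pieces fit together.

The entire code was developed on Databricks using a single A100 GPU. If you run it on Databricks, you can scale it out on an arbitrarily large GPU cluster with the cloud provider of your choice. I chose MLflow (pre-installed on Databricks, and pip-installable anywhere else) because it makes tracking and logging all the necessary metrics convenient. Using MLflow is entirely optional.

Note that this implementation emphasizes readability and hackability over raw performance, so there is plenty of room for improvement.

With that in mind, here are a few things you could try: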

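Make the mixture of experts module more efficient. I think there is significant room for improvement in how the right experts are sparsely activated.
Try different neural network initialization strategies. The Fastai Part 2 lesson I mentioned is a great resource.
Move from character-level to sub-word tokenization.
Run a Bayesian hyperparameter search over the number of experts and top_k (the number of experts activated per token). This can be seen as a form of neural architecture search.
Expert capacity is not discussed or implemented here, but it is well worth exploring.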
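As a starting point for the expert capacity idea, the sketch below counts how many token assignments each expert received and compares that with a capacity limit. The capacity formula (tokens × top_k / num_experts × capacity_factor) is one common convention and an assumption on my part, as are the helper names; the indices are copied from the noisy top-k test output earlier in the post.

```python
import torch

def expert_capacity(num_tokens: int, num_experts: int, top_k: int, capacity_factor: float = 1.0) -> int:
    """Maximum number of token assignments a single expert may process in one batch (assumed formula)."""
    return int(num_tokens * top_k / num_experts * capacity_factor)

def tokens_per_expert(indices: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Count how many (token, slot) assignments each expert received."""
    return torch.bincount(indices.reshape(-1), minlength=num_experts)

# Router indices from the noisy top-k test output: (batch=2, tokens=4, top_k=2)
indices = torch.tensor([[[2, 0], [1, 0], [2, 1], [5, 3]],
                        [[3, 5], [7, 3], [3, 6], [1, 0]]])

load = tokens_per_expert(indices, num_experts=8)
cap = expert_capacity(num_tokens=8, num_experts=8, top_k=2, capacity_factor=1.0)
overflow = (load - cap).clamp(min=0)   # assignments that exceed each expert's capacity

print(load.tolist(), cap, overflow.tolist())
# [3, 3, 2, 4, 0, 2, 1, 1] 2 [1, 1, 0, 2, 0, 0, 0, 0]
```

Even in this tiny example the load is uneven: expert 3 receives four assignments while expert 4 receives none, which is the kind of imbalance a capacity limit (together with noisy gating) is meant to control.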