4 Implementing a GPT model from Scratch To Generate Text
4.4 Adding shortcut connections
Next, let's discuss the concept behind shortcut connections, also known as skip connections or residual connections. They were originally proposed in residual networks (ResNet) to mitigate the vanishing gradient problem. The vanishing gradient problem refers to gradients (which guide weight updates during training) becoming progressively smaller as they are propagated backward through the layers, making it difficult to train the earlier layers effectively, as illustrated in the figure below.

The figure compares a deep neural network consisting of 5 layers, without shortcut connections on the left and with shortcut connections on the right. A shortcut connection adds the input of a layer to its output, effectively creating an alternative path that bypasses certain layers.
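To make the vanishing-gradient effect concrete, here is a minimal numeric sketch (my own illustration, not the book's code): during backpropagation, the gradient reaching an early layer is roughly a product of per-layer derivative factors, and if each factor is smaller than 1, that product shrinks exponentially with depth.

```python
# Minimal sketch (illustrative only): a hypothetical per-layer derivative
# factor of 0.25 is multiplied once for every layer the gradient passes through.
local_derivative = 0.25
for depth in (1, 3, 5, 10):
    print(f"depth {depth:2d}: gradient factor ~= {local_derivative ** depth:.8f}")
```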
How it works:
- A shortcut creates a shorter gradient path that skips one or more intermediate layers.
- It is implemented by adding the output of one layer to the output of a later layer (see the minimal sketch after this list).
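Here is a minimal single-block sketch (my own illustration, not the book's listing) of that idea: instead of computing `y = f(x)`, a residual block computes `y = x + f(x)`, so the identity path lets gradients flow back even when `f` itself contributes only small derivatives.

```python
import torch
import torch.nn as nn


class MinimalResidualBlock(nn.Module):
    # Illustrative only: one block with an identity shortcut.
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.GELU())

    def forward(self, x):
        # Shortcut connection: add the block's input to its output.
        return x + self.f(x)
```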
Adding shortcut connections in the forward method:
```python
import torch
import torch.nn as nn

# GELU is the activation class implemented earlier in this chapter.

class ExampleDeepNeuralNetwork(nn.Module):
    def __init__(self, layer_sizes, use_shortcut):
        super().__init__()
        self.use_shortcut = use_shortcut
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Linear(layer_sizes[0], layer_sizes[1]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[1], layer_sizes[2]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[2], layer_sizes[3]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[3], layer_sizes[4]), GELU()),
            nn.Sequential(nn.Linear(layer_sizes[4], layer_sizes[5]), GELU())
        ])

    def forward(self, x):
        for layer in self.layers:
            # Compute the output of the current layer
            layer_output = layer(x)
            # Check if shortcut can be applied
            if self.use_shortcut and x.shape == layer_output.shape:
                x = x + layer_output
            else:
                x = layer_output
        return x
```
This code implements a deep neural network with 5 layers, each consisting of a Linear layer and a GELU activation function. During the forward pass, we iteratively pass the input through the layers and, if the `self.use_shortcut` attribute is set to `True`, optionally add the shortcut connections shown in the figure above.
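As a quick sanity check (my own usage sketch, not a listing from the book), note that the shortcut is only added when a layer's input and output shapes match; with the layer sizes used later in this section, the final layer maps 3 features to 1, so no shortcut is added there.

```python
# Illustrative usage: the final 3->1 layer cannot use the shortcut,
# because x.shape != layer_output.shape for that layer.
torch.manual_seed(123)
demo_model = ExampleDeepNeuralNetwork([3, 3, 3, 3, 3, 1], use_shortcut=True)
demo_input = torch.tensor([[1., 0., -1.]])
print(demo_model(demo_input).shape)  # torch.Size([1, 1])
```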
Next, we implement a function that computes the gradients in the model's backward pass:
```python
def print_gradients(model, x):
    # Forward pass
    output = model(x)
    target = torch.tensor([[0.]])

    # Calculate loss based on how close the target
    # and output are
    loss = nn.MSELoss()
    loss = loss(output, target)

    # Backward pass to calculate the gradients
    loss.backward()

    for name, param in model.named_parameters():
        if 'weight' in name:
            # Print the mean absolute gradient of the weights
            print(f"{name} has gradient mean of {param.grad.abs().mean().item()}")
```
In the preceding code, we define a loss function that measures how close the model's output is to a target value (here, 0). Calling `loss.backward()` then automatically computes the gradient of the loss for each layer. With `model.named_parameters()` we can iterate over the weight parameters; for a 3×3 weight matrix, for example, we take the mean absolute value of its 3×3 gradient entries, giving a single gradient value per layer that makes it easy to compare gradients across layers. The advantage of the `.backward()` method is that it performs the gradient computation for us, without requiring us to implement the math by hand, which greatly simplifies training and working with deep neural networks. Let's first print the gradients without shortcut connections:
```python
# Without shortcut connections
layer_sizes = [3, 3, 3, 3, 3, 1]
sample_input = torch.tensor([[1., 0., -1.]])

torch.manual_seed(123)
model_without_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=False
)
print_gradients(model_without_shortcut, sample_input)
```

Output:

```
layers.0.0.weight has gradient mean of 0.00020173587836325169
layers.1.0.weight has gradient mean of 0.0001201116101583466
layers.2.0.weight has gradient mean of 0.0007152041653171182
layers.3.0.weight has gradient mean of 0.001398873864673078
layers.4.0.weight has gradient mean of 0.005049646366387606
```
Next, let's print the gradients with shortcut connections:
```python
# With shortcut connections
torch.manual_seed(123)
model_with_shortcut = ExampleDeepNeuralNetwork(
    layer_sizes, use_shortcut=True
)
print_gradients(model_with_shortcut, sample_input)
```

Output:

```
layers.0.0.weight has gradient mean of 0.22169792652130127
layers.1.0.weight has gradient mean of 0.20694106817245483
layers.2.0.weight has gradient mean of 0.32896995544433594
layers.3.0.weight has gradient mean of 0.2665732502937317
layers.4.0.weight has gradient mean of 1.3258541822433472
```
From the output above, we can see that shortcut connections prevent the gradients from vanishing in the early layers (such as `layers.0`) and ensure that gradients propagate effectively.
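To connect this back to the GPT model we are building: a transformer block typically wraps each of its sublayers with exactly this pattern. The following is a simplified sketch under that assumption (with placeholder `attn` and `ff` sublayers of my own naming, not the book's final implementation), just to show where the residual additions sit.

```python
class SketchTransformerBlock(nn.Module):
    # Simplified sketch: `attn` and `ff` are placeholder sublayer modules
    # (e.g., attention and feed-forward); the point is the residual pattern.
    def __init__(self, attn, ff):
        super().__init__()
        self.attn = attn
        self.ff = ff

    def forward(self, x):
        x = x + self.attn(x)  # shortcut around the attention sublayer
        x = x + self.ff(x)    # shortcut around the feed-forward sublayer
        return x
```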