Residual Networks
Reference: https://blog.csdn.net/2301_80750681/article/details/142882802
Below is a PyTorch implementation of a three-layer residual network, containing three residual blocks and the complete network structure:
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsample = downsample

    def forward(self, x):
        identity = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample is not None:
            identity = self.downsample(x)
        out += identity
        out = self.relu(out)
        return out

class ResNet3Layer(nn.Module):
    def __init__(self, num_classes=10):
        super(ResNet3Layer, self).__init__()
        self.in_channels = 64
        # Initial convolution layer
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        # Three residual stages
        self.layer1 = self._make_layer(BasicBlock, 64, 1, stride=1)
        self.layer2 = self._make_layer(BasicBlock, 128, 1, stride=2)
        self.layer3 = self._make_layer(BasicBlock, 256, 1, stride=2)
        # Classification head
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(256 * BasicBlock.expansion, num_classes)

    def _make_layer(self, block, out_channels, blocks, stride=1):
        downsample = None
        if stride != 1 or self.in_channels != out_channels * block.expansion:
            downsample = nn.Sequential(
                nn.Conv2d(self.in_channels, out_channels * block.expansion,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels * block.expansion)
            )
        layers = []
        layers.append(block(self.in_channels, out_channels, stride, downsample))
        self.in_channels = out_channels * block.expansion
        for _ in range(1, blocks):
            layers.append(block(self.in_channels, out_channels))
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.conv1(x)
        x = self.bn1(x)
        x = self.relu(x)
        x = self.maxpool(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x

# Example usage
model = ResNet3Layer(num_classes=10)
print(model)
Key implementation details:
- Residual block design: each `BasicBlock` contains two 3×3 convolution layers; the `downsample` argument handles dimension matching.
- Network structure:
  - An initial 7×7 convolution layer extracts low-level features.
  - A max-pooling layer performs the first round of downsampling.
  - Three residual stages extract features at 64 → 128 → 256 channels.
  - Global average pooling replaces large fully connected layers, reducing the parameter count.
- Dimension matching: a 1×1 convolution adjusts the dimensions of the shortcut connection so the residual addition is valid.
- Parameter configuration:
  - The strides of the three residual stages are 1, 2, and 2, progressively shrinking the feature maps.
  - Batch Normalization speeds up training convergence.
This network suits small-image classification tasks such as CIFAR-10, and the num_classes argument can be adjusted to fit other datasets. For real training, combine it with data augmentation and regularization.
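As a quick sanity check, a dummy CIFAR-10-sized batch can be fed through the network to confirm the output shape; the 2×3×32×32 input below is an assumed example, not part of the original post.

import torch

model = ResNet3Layer(num_classes=10)
x = torch.randn(2, 3, 32, 32)   # dummy batch of two 3x32x32 images
logits = model(x)
print(logits.shape)             # torch.Size([2, 10])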
The mathematical analysis of residual networks centers on the residual mapping design and its gradient propagation properties. The key points are:
1. Residual forward propagation formula
The basic structure of a residual block can be written as:

$$\mathbf{x}_{l+1} = \mathbf{x}_l + \mathcal{F}(\mathbf{x}_l, \mathbf{W}_l)$$

where:
- $\mathbf{x}_l$: the input to layer $l$
- $\mathcal{F}$: the residual function (typically convolution, BN, and activation operations)
- $\mathbf{W}_l$: the learnable parameters
For an $L$-layer deep network, the accumulated expression is:

$$\mathbf{x}_L = \mathbf{x}_0 + \sum_{i=0}^{L-1} \mathcal{F}(\mathbf{x}_i, \mathbf{W}_i)$$

This shows that a deep feature decomposes into a shallow feature plus the accumulated residuals.
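This decomposition can be checked numerically. A minimal sketch, using small nn.Linear layers as assumed stand-ins for $\mathcal{F}(\mathbf{x}_i, \mathbf{W}_i)$ (the layer sizes are arbitrary):

import torch
import torch.nn as nn

torch.manual_seed(0)
fs = [nn.Linear(8, 8) for _ in range(4)]  # stand-ins for F(x_i, W_i)

x0 = torch.randn(1, 8)
x, residuals = x0, []
for f in fs:
    r = f(x)            # residual F(x_i, W_i)
    residuals.append(r)
    x = x + r           # x_{i+1} = x_i + F(x_i, W_i)

# The deep feature equals the shallow feature plus the sum of residuals.
print(torch.allclose(x, x0 + sum(residuals)))  # True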
2. Backpropagation gradient derivation
Applying the chain rule:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} = \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L} \cdot \prod_{i=l}^{L-1} \left( 1 + \frac{\partial \mathcal{F}(\mathbf{x}_i, \mathbf{W}_i)}{\partial \mathbf{x}_i} \right)$$

where:
- The constant term 1 lets the gradient pass through directly (the identity-mapping path).
- The residual term $\frac{\partial \mathcal{F}}{\partial \mathbf{x}_i}$ propagates through the weight layers.
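A scalar autograd sketch of this product (the residual maps $F_i(x) = w_i x$ and the weight values are illustrative assumptions): for a chain of steps $y \mapsto y + F_i(y)$, the gradient at the input is exactly $\prod_i (1 + \partial F_i / \partial x_i)$.

import torch

x = torch.tensor(0.5, requires_grad=True)
w = [0.1, -0.2, 0.05]  # illustrative residual derivatives dF/dx

y = x
expected = 1.0
for wi in w:
    y = y + wi * y        # residual step: x_{i+1} = x_i + F(x_i), with F(x) = wi * x
    expected *= (1 + wi)  # factor (1 + dF/dx_i) from the formula above

y.backward()
print(x.grad.item(), expected)  # both equal prod_i (1 + w_i)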
3. The mathematical mechanism that resolves gradient problems
When the residual terms approach 0:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{x}_l} \approx \frac{\partial \mathcal{L}}{\partial \mathbf{x}_L} \cdot 1$$

Even when the deep-layer gradient $\frac{\partial \mathcal{L}}{\partial \mathbf{x}_L}$ is small, shallow layers still receive useful gradient updates, which fundamentally mitigates the vanishing-gradient problem.
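The contrast is visible in a back-of-the-envelope sketch (the depth of 50 and the per-layer derivative of 0.01 are assumed values): a plain chain multiplies the small derivatives directly, while the residual chain multiplies factors of $(1 + \partial F/\partial x)$.

import torch

L, a = 50, 0.01  # assumed depth and per-layer derivative dF/dx
plain = torch.tensor(a, dtype=torch.float64) ** L           # prod of dF/dx: ~1e-100, vanishes
residual = torch.tensor(1.0 + a, dtype=torch.float64) ** L  # prod of (1 + dF/dx): ~1.64
print(plain.item(), residual.item())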
4. Resolving the network degradation problem
Suppose the optimal mapping is $H^*(x)$. A conventional network must fit it directly:

$$H(x) = H^*(x)$$

A residual network instead fits:

$$\mathcal{F}(x) = H^*(x) - x$$

So when $\mathcal{F}(x) = 0$, the network degenerates to an identity mapping, guaranteeing that performance does not get worse as depth grows.
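This degenerate case is easy to check with the BasicBlock defined earlier: zeroing the second convolution forces $\mathcal{F}(x) = 0$ (a freshly initialized BatchNorm in eval mode maps zeros to zeros), so the block reduces to the identity up to the final ReLU. A minimal sketch:

import torch
import torch.nn as nn

block = BasicBlock(64, 64)           # stride=1, so no downsample branch
nn.init.zeros_(block.conv2.weight)   # forces the residual branch F(x) to 0
block.eval()                         # BN uses running stats (mean 0, var 1)

x = torch.randn(1, 64, 8, 8)
with torch.no_grad():
    y = block(x)
print(torch.allclose(y, torch.relu(x)))  # True: identity up to the final ReLU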
5. Mathematical handling of dimension matching
When the input and output dimensions differ, a 1×1 convolution is introduced:

$$\mathbf{y} = \mathcal{F}(\mathbf{x}, \mathbf{W}_i) + \mathbf{W}_s\mathbf{x}$$

where $\mathbf{W}_s$ is a linear projection that makes the dimensions consistent for the residual addition.
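A shape-level sketch of $\mathbf{W}_s$ realized as the 1×1 projection used in the code above (the tensor sizes are assumed examples):

import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)
W_s = nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False)  # projection shortcut
print(W_s(x).shape)  # torch.Size([1, 128, 16, 16]), matching the residual branch F(x)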
Through this mathematical design, residual networks achieve:
- stable gradient propagation (the backward pass)
- effective accumulation of deep features (the forward pass)
- a fundamental fix for the network degradation phenomenon
Compared with plain, directly stacked convolutional networks, the core advantages of residual networks (ResNet) are the following:
1. Solving vanishing gradients and network degradation
Thanks to the residual structure's shortcut connections, gradients can bypass the nonlinear layers and pass through directly during backpropagation. Mathematically, the gradient at layer $l$ is:

$$\frac{\partial \mathcal{L}}{\partial x_l} = \frac{\partial \mathcal{L}}{\partial x_L} \cdot \prod_{i=l}^{L-1} \left( 1 + \frac{\partial F(x_i, W_i)}{\partial x_i} \right)$$

When the residual term $\frac{\partial F}{\partial x_i} \approx 0$, the gradient $\frac{\partial \mathcal{L}}{\partial x_l} \approx \frac{\partial \mathcal{L}}{\partial x_L}$, avoiding the exponential decay of the chained derivatives.
2. Simplified optimization objective
A residual network learns the residual mapping $F(x) = H(x) - x$ rather than the target function $H(x)$ directly. When the optimal mapping is close to an identity transform, driving the residual $F(x) \to 0$ converges more easily than learning $H(x) \to x$ directly.
3. Support for extremely deep architectures
Plain CNNs degrade beyond roughly 20 layers (training and test error rise together), whereas stacking residual blocks lets ResNet scale past 1000 layers with accuracy that keeps improving with depth (e.g., ResNet-152 reaches a 3.57% Top-5 error rate on ImageNet).
4. Parameter efficiency and computational optimization
- Dimension adjustment: a 1×1 convolution adjusts the channel count with only $C_{in} \times C_{out}$ parameters, far fewer than the $9C_{in}C_{out}$ of a 3×3 convolution.
- Bottleneck structure: the "1×1 → 3×3 → 1×1" Bottleneck design (as in ResNet-50) cuts computation while preserving performance, as the parameter-count check below illustrates.
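A quick parameter-count check of that claim (the channel counts of 64 and 256 are assumed examples):

import torch.nn as nn

conv1x1 = nn.Conv2d(64, 256, kernel_size=1, bias=False)
conv3x3 = nn.Conv2d(64, 256, kernel_size=3, padding=1, bias=False)
n1 = sum(p.numel() for p in conv1x1.parameters())  # C_in * C_out = 16384
n9 = sum(p.numel() for p in conv3x3.parameters())  # 9 * C_in * C_out = 147456
print(n1, n9, n9 // n1)  # 16384 147456 9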
5. Practical performance advantages
- Classification: ResNet-50 reaches 76.5% Top-1 accuracy on ImageNet, about 8% higher than VGG-16.
- Training efficiency: with BN layers, ResNet trains 2-3× faster than a plain CNN and converges more stably.
Comparison summary

| Property | Plain CNN | ResNet |
|---|---|---|
| Maximum effective depth | ~20 layers | >1000 layers |
| Gradient propagation stability | Prone to vanishing/exploding | Stabilized by shortcut connections |
| Training error vs. depth | Falls, then rises (degradation) | Keeps falling |
| Parameter count (at equal accuracy) | Higher | Lower (bottleneck structure) |
These design choices have made ResNet a foundational architecture for computer vision, widely used in image classification, object detection, and related tasks.
Below is a PyTorch example applying a residual network (ResNet) to MNIST handwritten digit recognition:
import torch
import torch.nn as nn
import torchvision.transforms as transforms
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU()
        )
        self.conv2 = nn.Sequential(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(out_channels)
        )
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride),
                nn.BatchNorm2d(out_channels)
            )

    def forward(self, x):
        residual = self.shortcut(x)
        out = self.conv1(x)
        out = self.conv2(out)
        out += residual
        out = torch.relu(out)
        return out

class ResNetMNIST(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU()
        )
        self.res_blocks = nn.Sequential(
            ResidualBlock(64, 64),
            ResidualBlock(64, 128, stride=2),
            ResidualBlock(128, 256, stride=2)
        )
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten(),
            nn.Linear(256, 10)
        )

    def forward(self, x):
        x = self.conv1(x)
        x = self.res_blocks(x)
        x = self.fc(x)
        return x

# Data preprocessing
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Load the datasets
train_set = MNIST(root='./data', train=True, download=True, transform=transform)
test_set = MNIST(root='./data', train=False, download=True, transform=transform)

# Create the data loaders
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
test_loader = DataLoader(test_set, batch_size=128, shuffle=False)

# Initialize the model and optimizer
model = ResNetMNIST()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()

# Training loop
for epoch in range(10):
    model.train()
    for images, labels in train_loader:
        outputs = model(images)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Evaluate test accuracy
    model.eval()
    correct = 0
    with torch.no_grad():
        for images, labels in test_loader:
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            correct += (predicted == labels).sum().item()
    acc = 100 * correct / len(test_set)
    print(f'Epoch {epoch+1}, Test Accuracy: {acc:.2f}%')
Key implementation details:
- Residual block design: each residual block contains two 3×3 convolution layers, with the `shortcut` connection handling dimension changes.
- Network structure:
  - An initial 3×3 convolution layer extracts low-level features.
  - Three residual blocks extract features at 64 → 128 → 256 channels.
  - Global average pooling replaces large fully connected layers, reducing the parameter count.
- Data preprocessing:
  - Normalization with $\mu = 0.1307$, $\sigma = 0.3081$.
  - Input dimensions: 1×28×28 (channels × height × width).
- Training configuration:
  - Adam optimizer (learning rate 0.001)
  - Cross-entropy loss
  - Batch size 128, 10 training epochs
This model typically reaches **99%+** accuracy on the MNIST test set. For real training, data augmentation (random rotations and translations) can improve generalization, and a learning-rate scheduler can smooth convergence; a sketch of both follows below.
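A minimal sketch of that suggestion, reusing the optimizer from the example above (the rotation angle, translation range, and scheduler settings are illustrative assumptions, not tuned values):

import torch
import torchvision.transforms as transforms

# Augmented training transform: small random rotations and translations.
train_transform = transforms.Compose([
    transforms.RandomRotation(10),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Learning-rate scheduler: decay the Adam learning rate by 10x every 5 epochs;
# call scheduler.step() once per epoch after the training loop.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)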