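[Dive into Deep Learning] Multilayer Perceptrons: Model Selection, Underfitting, and Overfitting (Study Notes)

🌊1. Objectives

🌊2. Preparation

🌊3. Content

🌍3.1 Model Selection, Underfitting, and Overfitting for Multilayer Perceptrons

🌍3.2 Basic Exercises

🌊4. Takeaways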

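🌊1. Objectives

Model selection for multilayer perceptrons: compare the performance of different models and choose the one best suited to the given problem;
Underfitting and overfitting: study when a model underfits or overfits the training data, in order to understand its generalization ability and the effect of the optimization method;
Regularization and hyperparameter tuning: observe and compare, through experiments, how regularization techniques and tuning affect the model, with the aim of improving generalization;
Model complexity and performance: explore how model complexity affects training and test performance, and how to find a suitable level of complexity.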

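🌊2. Preparation

Install the PyTorch build that matches the GPU so that the study code can run on the GPU;
Set up an environment for running Python, Jupyter Notebook, and the related libraries.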

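🌊3. Content

Start Jupyter Notebook and create a new ipynb file using the newly added pytorch environment. To check the environment, run import torch and then torch.cuda.is_available(). If it returns True, the environment is configured correctly for the GPU. If it returns False but torch imports without error, PyTorch is installed correctly, but the code will run on the CPU.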
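The check described above can be wrapped in a small helper. This is only a sketch: the function name describe_torch_env is made up for illustration, and it guards the import so the cell still runs if torch is missing.

```python
import importlib.util


def describe_torch_env():
    """Return a short description of the PyTorch/CUDA setup (hypothetical helper)."""
    # find_spec returns None when the package cannot be imported
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch
    if torch.cuda.is_available():
        # A CUDA device is visible, so tensors can be moved to the GPU
        return f"torch {torch.__version__}, CUDA device: {torch.cuda.get_device_name(0)}"
    # torch imports fine, but computation will run on the CPU
    return f"torch {torch.__version__}, running on CPU"


env_summary = describe_torch_env()
print(env_summary)
```

If the message reports the CPU on a machine that has a GPU, the installed PyTorch build most likely does not match the CUDA driver.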

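🌍3.1 Model Selection, Underfitting, and Overfitting for Multilayer Perceptrons

(1) Create a new ipynb file in Jupyter Notebook using the newly added pytorch environment. The study code and exercise results for this section are as follows:

Import the required libraries: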

import math
import numpy as np
import torch
from torch import nn
from d2l import torch as d2l

Generate the dataset

max_degree = 20  # maximum degree of the polynomial
n_train, n_test = 100, 100  # sizes of the training and test sets
true_w = np.zeros(max_degree)  # allocate room for all coefficients
true_w[0:4] = np.array([5, 1.2, -3.4, 5.6])

features = np.random.normal(size=(n_train + n_test, 1))
np.random.shuffle(features)
poly_features = np.power(features, np.arange(max_degree).reshape(1, -1))
for i in range(max_degree):
    poly_features[:, i] /= math.gamma(i + 1)  # gamma(n)=(n-1)!
# shape of labels: (n_train+n_test,)
labels = np.dot(poly_features, true_w)
labels += np.random.normal(scale=0.1, size=labels.shape)

# convert NumPy ndarrays to tensors
true_w, features, poly_features, labels = [torch.tensor(x, dtype=
    torch.float32) for x in [true_w, features, poly_features, labels]]

features[:2], poly_features[:2, :], labels[:2]

Train and test the model

def evaluate_loss(net, data_iter, loss):  #@save
    """Evaluate the loss of a model on the given dataset."""
    metric = d2l.Accumulator(2)  # sum of losses, number of examples
    for X, y in data_iter:
        out = net(X)
        y = y.reshape(out.shape)
        l = loss(out, y)
        metric.add(l.sum(), l.numel())
    return metric[0] / metric[1]


def train(train_features, test_features, train_labels, test_labels,
          num_epochs=400):
    loss = nn.MSELoss(reduction='mean')
    input_shape = train_features.shape[-1]
    # no bias term: the constant column of the polynomial features already provides it
    net = nn.Sequential(nn.Linear(input_shape, 1, bias=False))
    batch_size = min(10, train_labels.shape[0])
    train_iter = d2l.load_array((train_features, train_labels.reshape(-1,1)),
                                batch_size)
    test_iter = d2l.load_array((test_features, test_labels.reshape(-1,1)),
                               batch_size, is_train=False)
    trainer = torch.optim.SGD(net.parameters(), lr=0.01)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log',
                            xlim=[1, num_epochs], ylim=[1e-3, 1e2],
                            legend=['train', 'test'])
    for epoch in range(num_epochs):
        d2l.train_epoch_ch3(net, train_iter, loss, trainer)
        if epoch == 0 or (epoch + 1) % 20 == 0:
            animator.add(epoch + 1, (evaluate_loss(net, train_iter, loss),
                                     evaluate_loss(net, test_iter, loss)))
    print('weight:', net[0].weight.data.numpy())

Third-degree polynomial fit (normal)

# take the first 4 dimensions of the polynomial features, i.e. 1, x, x^2/2!, x^3/3!
train(poly_features[:n_train, :4], poly_features[n_train:, :4],
      labels[:n_train], labels[n_train:])

Linear function fit (underfitting)

# take the first 2 dimensions of the polynomial features, i.e. 1 and x
train(poly_features[:n_train, :2], poly_features[n_train:, :2],
      labels[:n_train], labels[n_train:])

High-degree polynomial fit (overfitting)

# take all dimensions of the polynomial features
train(poly_features[:n_train, :], poly_features[n_train:, :],
      labels[:n_train], labels[n_train:], num_epochs=1500)

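🌍3.2 Basic Exercises

1. Can this polynomial regression problem be solved exactly? Hint: use linear algebra.

Yes. Polynomial regression is linear in its coefficients, so it reduces to a linear least-squares problem.

Suppose the model is a polynomial in the feature x of the form y = w0 + w1 * x + w2 * x^2 + ... + wn * x^n, where the wi are the coefficients to solve for.

Suppose there are m sample points, each with a known x and y. Stack them into a matrix X and a vector Y, where X is an m×(n+1) matrix whose rows hold the powers of x for one sample, and Y is an m-dimensional vector holding the targets.

The least-squares coefficient vector w satisfies the normal equations:

X^T * X * w = X^T * Y

where X^T denotes the transpose of X.

When X^T * X is invertible, w = (X^T * X)^(-1) * X^T * Y, which gives the least-squares coefficients in closed form, with no iterative training.

Note that a unique solution requires m >= n+1 and X of full column rank. When m > n+1, the system X * w = Y is overdetermined and in general has no exact solution; the normal equations then give the w that minimizes the squared error. When m < n+1, the system is underdetermined: infinitely many w fit the data exactly, X^T * X is singular, and an extra criterion such as the minimum-norm solution (pseudo-inverse) is needed to pick one.

In summary, when there are enough samples (m >= n+1 with full column rank), linear algebra yields the least-squares coefficient vector w directly. When there are too few samples, the solution is not unique and has to be pinned down by an additional condition such as minimum norm or regularization.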
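To make the closed-form solution concrete, here is a minimal NumPy sketch. It regenerates data in the same way as section 3.1 (the seed and the variable names are my own choices) and solves for the first four coefficients both through the normal equations and through np.linalg.lstsq.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
max_degree, n_train = 20, 100
true_w = np.zeros(max_degree)
true_w[:4] = [5, 1.2, -3.4, 5.6]

# Features x^i / i!, built the same way as in section 3.1
x = rng.normal(size=(n_train, 1))
factorials = np.array([math.gamma(i + 1) for i in range(max_degree)])
poly = np.power(x, np.arange(max_degree)) / factorials
y = poly @ true_w + rng.normal(scale=0.1, size=n_train)

X = poly[:, :4]  # design matrix with columns 1, x, x^2/2!, x^3/3!

# Normal equations: (X^T X) w = X^T y
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same problem through an SVD-based solver, which is numerically more robust
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("normal equations:", w_normal)
print("lstsq:           ", w_lstsq)
```

Both results should land close to [5, 1.2, -3.4, 5.6]. For higher degrees the columns become nearly dependent, so lstsq is the safer of the two.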

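2. Consider model selection for polynomials.

2.1 Plot the training loss against model complexity (the degree of the polynomial). What do you observe? What degree of polynomial is needed to reduce the training loss to 0?

In the modified code for this exercise, train returns the training loss. A loop runs over the polynomial degrees, calls train for each one, and appends the returned training loss to the train_losses list. Finally, plt.plot draws the training loss against the degree.

Running the code produces a plot of training loss against polynomial degree. The loss should fall sharply as the degree rises from 0 to 3 and then level off, because the data are generated from a third-degree polynomial. Higher degrees can push the training loss slightly lower, but they also risk overfitting.

The labels contain Gaussian noise with standard deviation 0.1, so from degree 3 onward the training loss flattens at roughly the noise variance (about 0.01) instead of reaching zero. Driving it to exactly zero would require interpolating every training point, which in general needs as many coefficients as training examples (degree 99 for 100 distinct points). With max_degree = 20 the training loss becomes small but does not reach zero.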

import matplotlib.pyplot as plt

def train(train_features, test_features, train_labels, test_labels,
          num_epochs=400):
    loss = nn.MSELoss(reduction='mean')
    input_shape = train_features.shape[-1]
    # no bias term: the constant column of the polynomial features already provides it
    net = nn.Sequential(nn.Linear(input_shape, 1, bias=False))
    batch_size = min(10, train_labels.shape[0])
    train_iter = d2l.load_array((train_features, train_labels.reshape(-1,1)),
                                batch_size)
    test_iter = d2l.load_array((test_features, test_labels.reshape(-1,1)),
                               batch_size, is_train=False)
    trainer = torch.optim.SGD(net.parameters(), lr=0.01)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log',
                            xlim=[1, num_epochs], ylim=[1e-3, 1e2],
                            legend=['train', 'test'])
    for epoch in range(num_epochs):
        d2l.train_epoch_ch3(net, train_iter, loss, trainer)
        if epoch == 0 or (epoch + 1) % 20 == 0:
            animator.add(epoch + 1, (evaluate_loss(net, train_iter, loss),
                                     evaluate_loss(net, test_iter, loss)))
    
    train_loss = evaluate_loss(net, train_iter, loss)
    return net, train_loss

degrees = np.arange(max_degree)
train_losses = []

for degree in degrees:
    _, train_loss = train(poly_features[:n_train, :(degree+1)],
                          poly_features[n_train:, :(degree+1)],
                          labels[:n_train], labels[n_train:])
    train_losses.append(train_loss)

plt.plot(degrees, train_losses)
plt.xlabel('Degree of Polynomial')
plt.ylabel('Train Loss')
plt.show()
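As a cross-check on the loss-versus-degree curve from exercise 2.1 that does not depend on SGD convergence, the training loss can also be computed from the closed-form least-squares fit. The sketch below regenerates the data with NumPy only (the seed and names are my own) and stores one training MSE per number of features.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
max_degree, n_train = 20, 100
true_w = np.zeros(max_degree)
true_w[:4] = [5, 1.2, -3.4, 5.6]

x = rng.normal(size=(n_train, 1))
factorials = np.array([math.gamma(i + 1) for i in range(max_degree)])
poly = np.power(x, np.arange(max_degree)) / factorials
y = poly @ true_w + rng.normal(scale=0.1, size=n_train)

train_mse = []
for k in range(1, max_degree + 1):  # k features means polynomial degree k - 1
    X = poly[:, :k]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)  # best possible fit for this degree
    train_mse.append(float(np.mean((X @ w - y) ** 2)))

for k, mse in enumerate(train_mse, start=1):
    print(f"degree {k - 1:2d}: training MSE = {mse:.5f}")
```

The printed values give a lower bound on what SGD can reach for each degree, which makes it easier to tell whether a high loss in the plot comes from the model or from incomplete training.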

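2.2 Plot the test loss in this case.

degree = 4
# train() returns (net, train_loss), so the test loss is evaluated separately
net, _ = train(poly_features[:n_train, :degree+1],
               poly_features[n_train:, :degree+1],
               labels[:n_train], labels[n_train:])
test_iter = d2l.load_array((poly_features[n_train:, :degree+1],
                            labels[n_train:].reshape(-1, 1)),
                           10, is_train=False)
test_loss = evaluate_loss(net, test_iter, nn.MSELoss(reduction='mean'))

print(f"Test Loss (Degree {degree}): {test_loss}")

plt.plot([degree], [test_loss], 'ro')
plt.xlabel('Degree of Polynomial')
plt.ylabel('Test Loss')
plt.show()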

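2.3 Generate the same plot as a function of the amount of data.

train_sizes = list(range(10, n_train + 1, 10))  # 10, 20, ..., 100 training examples
test_losses = []
test_iter = d2l.load_array((poly_features[n_train:, :4],
                            labels[n_train:].reshape(-1, 1)),
                           10, is_train=False)

for n in train_sizes:
    # train on the first n training examples only; the test rows stay held out
    net, _ = train(poly_features[:n, :4],
                   poly_features[n_train:, :4],
                   labels[:n], labels[n_train:])
    test_losses.append(evaluate_loss(net, test_iter,
                                     nn.MSELoss(reduction='mean')))

plt.plot(train_sizes, test_losses)
plt.xlabel('Number of Data Points')
plt.ylabel('Test Loss')
plt.show()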

[Figures: loss curves for polynomial degrees 1 through 20, in order]

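3. What happens if the polynomial features x^i are not normalized (by 1/i!)? Can this be fixed in some other way?

Without the 1/i! normalization, features of different degrees have very different ranges and scales: for |x| > 1, x^i grows exponentially with i (for example, 3^19 is about 1.2e9). The corresponding gradients become just as large, so with a fixed learning rate the weight updates are dominated by the high-degree terms and training can diverge or fail to converge.

Another way to address this is feature scaling, which maps features with different ranges onto a common range. A common choice is to subtract the mean and divide by the standard deviation (standardization). This gives all features a similar scale and helps training converge faster.

For polynomial features, one option is to standardize the raw input x before building the polynomial features, as the code below does. Note that x is already drawn from a standard normal distribution here, so this step alone changes the values very little and the high powers still span a wide range. Standardizing each polynomial feature column (other than the constant column) equalizes the scales more directly.

The modified code example follows: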

import math
import numpy as np
import torch
from torch import nn
from d2l import torch as d2l
import matplotlib.pyplot as plt

max_degree = 20  # maximum degree of the polynomial
n_train, n_test = 100, 100  # sizes of the training and test sets
true_w = np.zeros(max_degree)  # allocate room for all coefficients
true_w[0:4] = np.array([5, 1.2, -3.4, 5.6])

features = np.random.normal(size=(n_train + n_test, 1))
np.random.shuffle(features)

# feature scaling: standardize the raw feature
mean = np.mean(features)
std = np.std(features)
features = (features - mean) / std

poly_features = np.power(features, np.arange(max_degree).reshape(1, -1))
labels = np.dot(poly_features, true_w)
labels += np.random.normal(scale=0.1, size=labels.shape)

# convert NumPy ndarrays to tensors
true_w, features, poly_features, labels = [torch.tensor(x, dtype=torch.float32) for x in [true_w, features, poly_features, labels]]

def evaluate_loss(net, data_iter, loss):
    """Evaluate the loss of a model on the given dataset."""
    metric = d2l.Accumulator(2)  # sum of losses, number of examples
    for X, y in data_iter:
        out = net(X)
        y = y.reshape(out.shape)
        l = loss(out, y)
        metric.add(l.sum(), l.numel())
    return metric[0] / metric[1]

def train(train_features, test_features, train_labels, test_labels, num_epochs=400):
    loss = nn.MSELoss(reduction='mean')
    input_shape = train_features.shape[-1]
    # no bias term: the constant column of the polynomial features already provides it
    net = nn.Sequential(nn.Linear(input_shape, 1, bias=False))
    batch_size = min(10, train_labels.shape[0])
    train_iter = d2l.load_array((train_features, train_labels.reshape(-1,1)), batch_size)
    test_iter = d2l.load_array((test_features, test_labels.reshape(-1,1)), batch_size, is_train=False)
    trainer = torch.optim.SGD(net.parameters(), lr=0.01)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss', yscale='log', xlim=[1, num_epochs], ylim=[1e-3, 1e2], legend=['train', 'test'])
    for epoch in range(num_epochs):
        d2l.train_epoch_ch3(net, train_iter, loss, trainer)
        if epoch == 0 or (epoch + 1) % 20 == 0:
            animator.add(epoch + 1, (evaluate_loss(net, train_iter, loss), evaluate_loss(net, test_iter, loss)))
    print('weight:', net[0].weight.data.numpy())

# take the first 4 dimensions of the polynomial features, i.e. 1, x, x^2, x^3 (no 1/i! scaling here)
train(poly_features[:n_train, :4], poly_features[n_train:, :4], labels[:n_train], labels[n_train:])

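This code plots the training and test loss against the epoch for a fit on the first four polynomial features. The curves show how well training converges without the 1/i! normalization.

In the code, the raw feature is standardized before the polynomial features are built, i.e. it is scaled so that the model is easier to optimize during training. Keeping the features on comparable scales helps each of them contribute to the fit and improves training.

If the polynomial features are not normalized, training can become numerically unstable. Feature scaling helps avoid problems such as exploding or vanishing gradients and makes training more stable.

Beyond standardization, other techniques can be used alongside it. Regularization (L1 or L2) controls model complexity to avoid overfitting, and feature selection keeps only the features that matter most, which reduces both dimensionality and complexity. These address complexity rather than scale, so they complement feature scaling rather than replace it.

In summary, normalizing the polynomial features is a common and effective way to improve the stability and quality of training.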
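To see how much the scales differ, the sketch below compares raw powers x^i, the 1/i!-scaled features, and per-column standardization. It is a NumPy-only illustration under my own assumptions (seed, variable names, and the choice to fit the statistics on the first 100 rows as the training split); it is not part of the original exercise code.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
max_degree, n_train, n_test = 20, 100, 100
x = rng.normal(size=(n_train + n_test, 1))

raw = np.power(x, np.arange(max_degree))  # columns 1, x, x^2, ..., x^19
factorials = np.array([math.gamma(i + 1) for i in range(max_degree)])
scaled = raw / factorials                 # the 1/i! normalization from section 3.1

# Per-column standardization as an alternative: statistics come from the
# training rows only and are then applied to every row. Column 0 is constant
# (std 0), so it is left untouched.
mean = raw[:n_train, 1:].mean(axis=0)
std = raw[:n_train, 1:].std(axis=0)
standardized = raw.copy()
standardized[:, 1:] = (raw[:, 1:] - mean) / std

for i in (1, 5, 10, 19):
    print(f"degree {i:2d}: max|raw| = {np.abs(raw[:, i]).max():.3e}, "
          f"max|scaled| = {np.abs(scaled[:, i]).max():.3e}, "
          f"max|standardized| = {np.abs(standardized[:, i]).max():.3e}")
```

High powers of a Gaussian input are heavy-tailed, so even after standardization a few samples can remain far from zero. That is one reason the 1/i! scaling is a convenient choice for this particular problem.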

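4. Can the generalization error be zero?

In practice, the generalization error is almost never zero. Generalization error is the error of a model on data it has not seen, i.e. its performance outside the training set. Even when a model does very well on the training set, some generalization error remains, because the model has to cope with new data and different samples.

Generalization error comes from noise in the data, the diversity of samples, and the assumptions built into the model. In this experiment, for instance, the labels carry Gaussian noise with standard deviation 0.1, so even the true weights have an expected squared error of about 0.01 on new data. Zero is attainable only in the idealized case where the data are noise-free and the model can represent the true function exactly.

The goal is to make the generalization error as small as possible, which shows up as low error on both the training set and the test set. Suitable model selection, regularization, and cross-validation all help reduce it and improve performance.

So the generalization error is almost never zero, but it can be reduced as far as possible by improving the model and the data processing.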
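The noise floor mentioned above can be checked numerically. The following sketch (NumPy only, with a seed and sample size of my own choosing) evaluates the true weights on a large fresh sample drawn the same way as the data in section 3.1.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
noise_std = 0.1                 # same label noise as in section 3.1
n_fresh = 100_000               # large fresh sample standing in for unseen data

true_w = np.array([5, 1.2, -3.4, 5.6])
x = rng.normal(size=(n_fresh, 1))
factorials = np.array([math.gamma(i + 1) for i in range(4)])
poly = np.power(x, np.arange(4)) / factorials

clean = poly @ true_w
labels = clean + rng.normal(scale=noise_std, size=n_fresh)

# Even the true weights cannot predict the noise, so the MSE stays near noise_std**2
oracle_mse = float(np.mean((clean - labels) ** 2))
print(f"MSE of the true model on fresh data: {oracle_mse:.5f}")
print(f"noise variance:                      {noise_std ** 2:.5f}")
```

Any learned model can at best match this value on new data, which is why a test loss near 0.01 in the plots already indicates a good fit.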

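🌊4. Takeaways

In this experiment I tried different multilayer perceptron architectures, building several models from different hyperparameter combinations such as the number of hidden layers and hidden units. Training these models on the training set and evaluating them on the validation set makes it possible to compare their performance on the given problem.

A popular deep learning framework such as TensorFlow or PyTorch can be used to implement and train the multilayer perceptron. The model structure has to be defined, including the input layer, the hidden layers, and the output layer, along with suitable activation and loss functions.

During training, a suitable optimization algorithm (such as stochastic gradient descent) and learning rate are used to update the model parameters. Recording metrics such as accuracy and loss on the training and validation sets allows the models to be compared. Based on the results, the best-performing model can be selected and tuned further.

Studying underfitting and overfitting on the training data helps in understanding a model's generalization ability and the effect of the optimization method. Underfitting means the model performs poorly even on the training data; overfitting means the model fits the training data too closely, so its performance on new data drops. Both can be observed by adjusting model complexity. Adding hidden layers or hidden units increases complexity, which may fit the training data better but may also cause overfitting. Conversely, removing hidden layers or hidden units lowers complexity and may cause underfitting.

Regularization is one of the standard remedies for overfitting. A penalty term such as L1 or L2 regularization limits the size of the model parameters. Dropout randomly sets some hidden units to zero during training, which reduces co-dependence between units and improves generalization. Hyperparameter tuning is another important step: the learning rate, batch size, and optimization algorithm can all be adjusted to find a good combination, and grid search, random search, or Bayesian optimization can search the hyperparameter space automatically.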
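As a small illustration of the L2 penalty mentioned above, the sketch below applies closed-form ridge regression to the polynomial features from section 3.1 instead of an MLP. The training set size of 20, the list of penalty strengths, and the helper name ridge_fit are all assumptions made for this example.

```python
import math
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
max_degree, n_train, n_test = 20, 20, 100  # deliberately small training set
true_w = np.zeros(max_degree)
true_w[:4] = [5, 1.2, -3.4, 5.6]

x = rng.normal(size=(n_train + n_test, 1))
factorials = np.array([math.gamma(i + 1) for i in range(max_degree)])
poly = np.power(x, np.arange(max_degree)) / factorials
y = poly @ true_w + rng.normal(scale=0.1, size=n_train + n_test)
X_train, y_train = poly[:n_train], y[:n_train]
X_test, y_test = poly[n_train:], y[n_train:]


def ridge_fit(X, y, lam):
    """L2-regularized least squares: solve (X^T X + lam * I) w = X^T y.

    For simplicity the constant column is penalized as well.
    """
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)


results = {}
for lam in (0.01, 0.1, 1.0, 10.0):
    w = ridge_fit(X_train, y_train, lam)
    weight_norm = float(np.linalg.norm(w))
    test_mse = float(np.mean((X_test @ w - y_test) ** 2))
    results[lam] = (weight_norm, test_mse)
    print(f"lam = {lam:5.2f}: ||w|| = {weight_norm:8.3f}, test MSE = {test_mse:.4f}")
```

A larger lam shrinks the weights toward zero. Whether that helps or hurts the test loss depends on the data, so lam is best chosen on a validation set. In PyTorch, a comparable effect is available through the weight_decay argument of torch.optim.SGD.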