[Machine Learning] A Complete Guide to Evaluating Binary Classification Models

1. Building the Model

Imports and global settings

import numpy as np
import os
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)

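(1) import os

The os module provides an interface to the operating system: file and directory operations, environment variables, and so on. It is imported here as part of the usual setup; none of the code shown below calls it.

(2) %matplotlib inline

%matplotlib inline is a Jupyter Notebook magic command. It tells Jupyter to render matplotlib figures directly in the notebook, below the cell that created them, without an explicit call to plt.show().

It is usually placed near the top of a notebook. It is IPython syntax rather than Python, so remove it when running the code as a plain .py script; there, figures typically open in a separate window when plt.show() is called.

(3) import warnings
warnings.filterwarnings('ignore')

This installs a warning filter that tells Python to suppress all warnings at run time. It keeps the notebook output clean, but it also hides warnings that can matter, so use it with care.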
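Silencing every warning can hide useful signals. As a hedged alternative, the standard library also lets you ignore a single category. The sketch below uses built-in warning categories as stand-ins, and the warning messages are made up for illustration.

```python
import warnings

# Record warnings raised inside this block so we can see which ones get through.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")  # start from "report everything"
    # Ignore one category only, instead of all warnings.
    warnings.filterwarnings("ignore", category=DeprecationWarning)

    warnings.warn("this API is deprecated", DeprecationWarning)  # suppressed
    warnings.warn("model did not converge", UserWarning)         # still reported

remaining = [w.category.__name__ for w in caught]
print(remaining)  # ['UserWarning']
```

The same category argument works with the global warnings.filterwarnings call used in this article.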

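(4) np.random.seed(42)

This sets the seed of NumPy's global random number generator, which makes the random numbers reproducible: with the same seed, every run of the program produces the same sequence. The shuffle done later with np.random.permutation depends on it.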
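A quick way to convince yourself of this is to reseed and draw again. This is a minimal sketch; the variable names are only for illustration.

```python
import numpy as np

np.random.seed(42)
first = np.random.permutation(10)

np.random.seed(42)                   # reset to the same seed
second = np.random.permutation(10)   # same permutation as `first`

third = np.random.permutation(10)    # no reseed: continues the random sequence

print(np.array_equal(first, second))  # True
```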

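(5) plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

rcParams is Matplotlib's configuration dictionary. Its key-value pairs control the default properties of figures, such as font sizes, colors, and line styles, so changing them changes the default look of every plot.

The three lines above set the font size of the axis labels and of the x and y tick labels globally. Every figure drawn afterwards uses these sizes unless the plotting code for a particular figure overrides them explicitly.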

Loading the dataset

(1) The mnist_784 dataset

from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')

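This fetches the MNIST dataset from the OpenML repository under the identifier 'mnist_784'. It contains 70,000 grayscale images of handwritten digits (0 to 9), each 28x28 pixels and flattened into a 1D array of 784 (28 * 28) features. fetch_openml returns a dictionary-like object (sklearn.utils.Bunch) holding the data, the target labels, and other information about the dataset. The code below assumes that mnist.data and mnist.target are a pandas DataFrame and Series (it indexes them with .loc), which is what recent scikit-learn versions return for this dataset by default, and that the labels are strings such as '5'.

(2) Preprocessing the dataset

i. Split off the first 60,000 images as the training set (the remaining 10,000 are the test set) and shuffle the training set

ii. Turn the multiclass problem into a binary one (is the digit a 5 or not)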

X, y = mnist.data, mnist.target
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]


# Shuffle the training set (np was already imported above)
shuffle_index = np.random.permutation(60000)
X_train, y_train = X_train.loc[shuffle_index], y_train.loc[shuffle_index]

# Turn the 10-class labels into binary labels: is the digit a 5?
y_train_5 = (y_train == '5')
y_test_5 = (y_test == '5')

Training the model

from sklearn.linear_model import SGDClassifier
sgd_clf = SGDClassifier(max_iter=5,random_state=42)
sgd_clf.fit(X_train,y_train_5)

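(1) The SGDClassifier classifier

SGD stands for "Stochastic Gradient Descent". Gradient descent is an optimization algorithm that minimizes a loss function so that the model learns suitable parameters; the stochastic variant updates the parameters from one training instance at a time instead of the whole dataset.

SGDClassifier uses this algorithm to train a linear classifier. With its default loss ('hinge') the result is a linear SVM. Here max_iter=5 limits training to at most 5 passes over the training data.

(2) A rough check of the model

Each line below counts the predictions that match the true labels, on the training set, the test set, and the full dataset in turn.

sum(sgd_clf.predict(X_train) == y_train_5)
# Output: 57370

sum(sgd_clf.predict(X_test) == y_test_5)
# Output: 9592

sum(sgd_clf.predict(X) == (y == '5'))
# Output: 66962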
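Raw counts are easier to compare once they are divided by the size of each set. The sketch below reuses the counts printed above and the MNIST split sizes (60,000 training, 10,000 test, 70,000 total); the dictionary layout is only an illustration.

```python
# Correct-prediction counts reported above, paired with the size of each set.
results = {
    "train": (57370, 60000),
    "test": (9592, 10000),
    "all": (66962, 70000),
}

accuracy = {name: correct / total for name, (correct, total) in results.items()}
for name, acc in accuracy.items():
    print(f"{name}: {acc:.4f}")
# train: 0.9562
# test: 0.9592
# all: 0.9566
```

Roughly 96% looks good at first sight, but only about one image in ten is a 5, so accuracy alone says little here; the metrics in the following sections give a clearer picture.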

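2. Evaluating the Model

Cross-validation with cross_val_score

(1) Using the cross_val_score function

from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring='accuracy')
# Output: array([0.964 , 0.9579, 0.9571])

(2) Writing the cross-validation loop by hand (for understanding only; it sketches what cross_val_score does internally)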

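from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

skfolds = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)
for train_index, test_index in skfolds.split(X_train, y_train_5):
    clone_clf = clone(sgd_clf)
    # split() yields positional indices, so select rows with .iloc
    X_train_folds = X_train.iloc[train_index]
    y_train_folds = y_train_5.iloc[train_index]
    X_test_folds = X_train.iloc[test_index]
    y_test_folds = y_train_5.iloc[test_index]

    clone_clf.fit(X_train_folds, y_train_folds)
    y_pred = clone_clf.predict(X_test_folds)
    n_correct = sum(y_pred == y_test_folds)
    print(n_correct / len(y_pred))

The loop above, step by step:

(1)

from sklearn.model_selection import StratifiedKFold
from sklearn.base import clone

These two lines import the StratifiedKFold class and the clone function, used for stratified k-fold cross-validation and for copying the model.

(2)

skfolds = StratifiedKFold(n_splits=3, random_state=42, shuffle=True)

This line creates a StratifiedKFold object, which splits the data into folds that each keep roughly the same class proportions as the full set. Parameters:

n_splits=3: the number of folds, here 3.
random_state=42: the random seed, so the split is reproducible. It only has an effect when shuffle=True.
shuffle=True: shuffle the samples within each class before they are divided into folds.

(3)

for train_index, test_index in skfolds.split(X_train, y_train_5):

This line iterates over the folds produced by skfolds.split. On each iteration, train_index and test_index are arrays of row positions for the training part and the test part of the current fold.

(4)

clone_clf = clone(sgd_clf)

This line uses clone to create a fresh, unfitted copy of the SGDClassifier with the same hyperparameters. The copy is independent of the original, so fitting or changing one does not affect the other.

(5)

X_train_folds = X_train.iloc[train_index]

y_train_folds = y_train_5.iloc[train_index]

X_test_folds = X_train.iloc[test_index]

y_test_folds = y_train_5.iloc[test_index]

These lines pull the current fold's training and test data out of X_train and y_train_5. The indices are positions, not labels, so .iloc is used; after the shuffle, the index labels of X_train no longer match row positions, and .loc would select different rows than the ones StratifiedKFold chose.

(6)

clone_clf.fit(X_train_folds, y_train_folds)

This line trains the cloned model on the current fold's training data.

(7)

y_pred = clone_clf.predict(X_test_folds)

This line uses the trained model to predict the current fold's test data.

(8)

n_correct = sum(y_pred == y_test_folds)

accuracy = n_correct / len(y_pred)

print(accuracy)

These lines compute the accuracy for the current fold (the loop above prints the same ratio without the intermediate variable). n_correct is the number of correct predictions; dividing it by the number of test samples gives the accuracy, which is then printed. The whole process repeats once per fold, so you get one score per fold.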
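To check that a hand-written loop of this kind really matches cross_val_score, the two can be compared on a small dataset. The sketch below is an illustration under stated assumptions: it swaps MNIST for synthetic data from make_classification so it runs quickly, uses NumPy arrays (so plain positional indexing works), and passes the same StratifiedKFold object to both approaches.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for MNIST (an assumption, chosen to keep the example fast).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

clf = SGDClassifier(random_state=42)
skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

manual_scores = []
for train_index, test_index in skfolds.split(X_demo, y_demo):
    fold_clf = clone(clf)  # fresh, unfitted copy for every fold
    fold_clf.fit(X_demo[train_index], y_demo[train_index])
    y_pred = fold_clf.predict(X_demo[test_index])
    manual_scores.append(np.mean(y_pred == y_demo[test_index]))

# Same estimator, same folds, computed by the library.
library_scores = cross_val_score(clf, X_demo, y_demo, cv=skfolds, scoring="accuracy")

print(manual_scores)
print(library_scores)
```

Because both the classifier and the splitter have a fixed random_state, the two lists of per-fold accuracies should agree.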

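Confusion matrix

(1) The cross_val_predict function

What it does: it runs cross-validation and returns predictions instead of scores. You pass it an initialized estimator (for example SGDClassifier, or any other classifier or regressor). cross_val_predict splits the data according to the cv parameter; for each fold it trains a copy of the estimator on the remaining folds and predicts the held-out fold. Each instance therefore gets a prediction from a model that never saw it during training.

from sklearn.model_selection import cross_val_predict
y_train_pred = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)
y_train_pred[:10]
# Output: array([False, False, False, False, False, False, False, False, True, False])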

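(2) Computing the confusion matrix

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_train_5, y_train_pred)
cm
# Output: array([[54058,   521],
#                [ 1899,  3522]], dtype=int64)

Rows are actual classes and columns are predicted classes, so the layout is:

[[ **true negatives** , **false positives** ],

[ **false negatives** , **true positives** ]]

* true negatives: 54058 images were correctly classified as non-5
* false positives: 521 images were wrongly classified as 5

* false negatives: 1899 images were wrongly classified as non-5
* true positives: 3522 images were correctly classified as 5

A perfect classifier would have only **true positives** and **true negatives**: the entries on the main diagonal would be nonzero and all other entries would be 0.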
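To make the four cells concrete, they can be counted by hand on a tiny example. The labels below are hypothetical toy data, not MNIST results; True stands for "is a 5".

```python
# Toy labels (hypothetical): True means "is a 5".
y_true = [False, False, False, True, True, True, False, True]
y_pred = [False, True, False, True, False, True, False, True]

pairs = list(zip(y_true, y_pred))
tn = sum((not t) and (not p) for t, p in pairs)  # actual non-5, predicted non-5
fp = sum((not t) and p for t, p in pairs)        # actual non-5, predicted 5
fn = sum(t and (not p) for t, p in pairs)        # actual 5, predicted non-5
tp = sum(t and p for t, p in pairs)              # actual 5, predicted 5

# Same layout as above: rows = actual class, columns = predicted class.
matrix = [[tn, fp], [fn, tp]]
print(matrix)  # [[3, 1], [1, 3]]
```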

(3) Displaying the confusion matrix

from sklearn.metrics import ConfusionMatrixDisplay
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.show()

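Precision, recall, and F1 score

Precision: of the instances the model predicts as positive, the fraction that really are positive, TP / (TP + FP).

Recall: of the instances that are actually positive, the fraction the model finds, TP / (TP + FN).

F1 score: the harmonic mean of the two.

Precision and recall can be combined into a single metric, the F1 score: F1 = 2 * precision * recall / (precision + recall). Unlike the arithmetic mean, the harmonic mean gives more weight to low values, so a classifier gets a high F1 score only if both precision and recall are high.

from sklearn.metrics import precision_score, recall_score
precision_score(y_train_5, y_train_pred)
# Output: 0.8711352955725946
recall_score(y_train_5, y_train_pred)
# Output: 0.6496956281128943

from sklearn.metrics import f1_score
f1_score(y_train_5, y_train_pred)
# Output: 0.7442941673710904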
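The three scores can be reproduced from the confusion matrix counts reported earlier in this article, which is a useful sanity check on the definitions. A minimal sketch:

```python
# Counts taken from the confusion matrix shown earlier in this article.
tn, fp, fn, tp = 54058, 521, 1899, 3522

precision = tp / (tp + fp)   # how many predicted 5s really are 5s
recall = tp / (tp + fn)      # how many actual 5s were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
# precision=0.8711 recall=0.6497 f1=0.7443 accuracy=0.9597
```

The accuracy of about 96% hides the fact that the classifier misses roughly a third of the actual 5s, which recall makes visible.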

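How the threshold affects the results

(1) The decision_function() method

Scikit-Learn does not let you set the threshold directly when calling predict(), but it does give you access to the decision scores. Call the classifier's decision_function() method instead of predict(): it returns a score for each instance, and you can then make predictions from those scores with any threshold you like:

y_scores = sgd_clf.decision_function(X_train[:10])
y_scores
# Output: array([-558596.64281983, -359729.00538583, -273844.96590398,
#                -502105.55813185, -507402.81825564, -629069.60120174,
#                -441083.42849022, -155220.78442829,  105466.65971603,
#                -579296.81125892])


t = 0  # a threshold of 0 reproduces what predict() returns for this classifier
y_pred = (y_scores > t)
y_pred
# Output: array([False, False, False, False, False, False,
#                False, False,  True, False])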
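To see the effect of moving the threshold, the ten scores printed above can be reused directly. This is a small sketch; predict_with_threshold is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
# Decision scores copied from the output above (rounded to two decimals).
scores = [-558596.64, -359729.01, -273844.97, -502105.56, -507402.82,
          -629069.60, -441083.43, -155220.78, 105466.66, -579296.81]

def predict_with_threshold(scores, threshold):
    """Label an instance positive when its score exceeds the threshold."""
    return [s > threshold for s in scores]

positives = {}
for t in (0, 200000, -200000):
    positives[t] = sum(predict_with_threshold(scores, t))
    print(f"threshold={t}: {positives[t]} predicted positive")
# threshold=0: 1 predicted positive
# threshold=200000: 0 predicted positive
# threshold=-200000: 2 predicted positive
```

Raising the threshold makes the classifier more conservative (fewer positives, so recall can only drop), while lowering it flags more instances at the risk of more false positives.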

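(2) Getting scores with cross_val_predict()

To get a score for every training instance from a model that was not trained on that instance, call cross_val_predict() with method="decision_function". It then returns decision scores instead of predicted labels:

y_scores = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3,
                             method="decision_function")
y_scores[:10]
# Output: array([ -545086.1906455 ,  -200238.20632717,  -366873.76172794,
#                 -648828.94558457,  -572767.52239341, -1016184.25580999,
#                 -419438.40135302,  -171080.39957192,   237230.03978349,
#                 -793932.50331372])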

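(3) Precision-recall curve

from sklearn.metrics import precision_recall_curve
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores)

thresholds holds the candidate thresholds in increasing order.

precisions[i] and recalls[i] are the precision and recall obtained when every instance with a score of at least thresholds[i] is predicted positive. precisions and recalls each have one more element than thresholds: the final pair (precision 1, recall 0) has no corresponding threshold.

from sklearn.metrics import PrecisionRecallDisplay
disp = PrecisionRecallDisplay(precision=precisions, recall=recalls)
disp.plot()
plt.show()
# PrecisionRecallDisplay plots the precision-recall curve directly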
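A common use of these three arrays is to pick the lowest threshold that reaches a target precision. The sketch below shows one way to do it on a tiny made-up example (four instances); the 0.9 target and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Tiny made-up example: true labels and decision scores for four instances.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)

# precisions and recalls end with an extra point (precision=1, recall=0) that
# has no matching threshold, so drop it before indexing into thresholds.
target_precision = 0.9
idx = np.argmax(precisions[:-1] >= target_precision)  # first index meeting the target
chosen_threshold = thresholds[idx]

# Predict positive when the score is at least the chosen threshold.
y_pred = scores >= chosen_threshold
print(chosen_threshold, y_pred)
```

Note that np.argmax returns 0 when no threshold reaches the target, so real code should first check that (precisions[:-1] >= target_precision).any() is true.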

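ROC curve and ROC AUC score

(1) About the ROC curve

The receiver operating characteristic (ROC) curve is a common evaluation tool for binary classification.

It is very similar to the precision/recall curve, but instead of plotting precision against recall, the ROC curve plots the true positive rate (TPR) against the false positive rate (FPR).
To plot the ROC curve, you first use the roc_curve() function to compute the TPR and FPR at various thresholds.

My understanding: both rates are normalized by the size of an actual class. TPR = TP / (TP + FN) is the fraction of actual positives that are correctly predicted positive, so TPR is exactly recall. FPR = FP / (FP + TN) is the fraction of actual negatives that are wrongly predicted positive, so it measures the errors on the negative class, not its correct predictions.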
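Both rates can be computed from the confusion matrix counts reported earlier in this article, which gives one point on the ROC curve (the one for the default threshold). A minimal sketch:

```python
# Counts taken from the confusion matrix shown earlier in this article.
tn, fp, fn, tp = 54058, 521, 1899, 3522

tpr = tp / (tp + fn)   # true positive rate, identical to recall
fpr = fp / (fp + tn)   # false positive rate
tnr = tn / (tn + fp)   # true negative rate (specificity), equal to 1 - FPR

print(f"TPR={tpr:.4f} FPR={fpr:.4f} TNR={tnr:.4f}")
# TPR=0.6497 FPR=0.0095 TNR=0.9905
```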

(2) Implementation

from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores)


thresholds     # the thresholds, in decreasing order
# Output: array([  885939.44131781,   885938.44131781,   674510.09938886, ...,
#                -1913350.70937442, -1914531.45188909, -3051105.22556601])


fpr
# Output: array([0.        , 0.        , 0.        , ..., 0.99598747, 0.99598747,
#                1.        ])


tpr
# Output: array([0.00000000e+00, 1.84467810e-04, 2.95148497e-03, ...,
#                9.99815532e-01, 1.00000000e+00, 1.00000000e+00])

# Plot the ROC curve with the built-in display class and compute the area under it
from sklearn import metrics
roc_auc = metrics.auc(fpr, tpr)
display = metrics.RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=roc_auc,
                                  estimator_name='example estimator')
display.plot()
plt.show()
roc_auc

The AUC above was computed with auc(), a general-purpose function that takes the x and y coordinates of any curve and computes the area under it with the trapezoidal rule. For ROC curves there is also a dedicated function, roc_auc_score(), that computes the same area:

from sklearn.metrics import roc_auc_score

roc_auc_score(y_train_5, y_scores)  # takes labels and scores, not x/y points; TPR and FPR are computed internally

# Output: 0.9598058535696421
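The two routes to the AUC can be compared on a tiny made-up example (four instances, not MNIST). This sketch assumes nothing beyond the three scikit-learn functions already used above.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Tiny made-up example: true labels and decision scores for four instances.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Route 1: compute the ROC points, then integrate with the trapezoidal rule.
fpr, tpr, _ = roc_curve(y_true, scores)
area_from_points = auc(fpr, tpr)

# Route 2: let roc_auc_score do both steps from labels and scores.
area_from_scores = roc_auc_score(y_true, scores)

print(area_from_points, area_from_scores)  # 0.75 0.75
```

The value 0.75 also has a direct reading: of the four (positive, negative) pairs in this example, three have the positive instance scored higher than the negative one.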