[Machine Learning] Classification Tasks: Binary and Multi-class Classification

Binary vs. Multi-class Classification: Concepts and Differences

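Binary and multi-class classification are the two basic types of classification task. They differ in the number of classes the target variable (label) can take:

Binary classification: the target variable y takes exactly two classes, usually written y∈{0,1} or y∈{−1,1}.
Example: spam filtering (spam or not spam).
Multi-class classification: the target variable y takes three or more classes, written y∈{1,2,…,K} with K≥3.
Example: handwritten digit recognition (the classes are the digits 0 to 9, so K=10).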

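1. Binary classification

Features and target

Input: a feature vector $x \in \mathbb{R}^d$ .
Output: a target y ∈ {0,1}.
Model prediction: the model outputs $\hat{y} = P(y=1|x)$ , the estimated probability that the sample belongs to class 1.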
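
As a minimal sketch of how a model can turn a feature vector into $P(y=1|x)$, the snippet below applies a sigmoid to a linear score, which is what logistic regression does. The weights `w`, the bias `b`, the sample `x`, and the 0.5 decision threshold are illustrative assumptions, not values from this article.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters of a linear model with d = 3 features (illustration only)
w = np.array([0.8, -0.4, 0.2])
b = 0.3

x = np.array([1.0, 2.0, 0.5])   # one feature vector
y_hat = sigmoid(w @ x + b)      # estimated P(y=1 | x)
label = int(y_hat >= 0.5)       # threshold the probability to get a class label

print(f"P(y=1|x) = {y_hat:.3f}, predicted label = {label}")
```

The threshold is a design choice: 0.5 is a common default, but it can be moved when false positives and false negatives have different costs.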

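Models and algorithms

Common models:
- Logistic regression
- Support vector machine (SVM)
- Decision tree
- Random forest
- Neural network (with a Sigmoid activation on the output layer for binary classification)
Loss function:
- Log loss (binary cross-entropy, i.e. the negative log-likelihood): $\mathcal{L} = -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$
Evaluation metrics:
- Accuracy
- Precision
- Recall
- F1 score
- AUC-ROC (area under the ROC curve)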
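
To make the loss formula above concrete, here is a small sketch that evaluates it directly with NumPy and compares the result with scikit-learn's `log_loss`. The labels and probabilities are made-up toy values.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy labels and predicted probabilities of class 1 (made up for illustration)
y_true = np.array([1, 0, 1, 1, 0])
y_hat = np.array([0.9, 0.2, 0.6, 0.8, 0.3])

# Direct translation of the formula: the average negative log-likelihood
manual = -np.mean(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

# scikit-learn computes the same quantity from the same inputs
reference = log_loss(y_true, y_hat)

print("manual log loss: ", manual)
print("sklearn log loss:", reference)
```

A confident wrong prediction (for example $\hat{y}_i$ close to 0 when $y_i = 1$) contributes a large term, which is why this loss penalizes overconfidence.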

Example code

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

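# Generate a binary classification dataset
# Parameters: n_samples=100 generates 100 samples; n_features=4 gives each sample 4 features; n_classes=2 makes it a binary problem;
# n_informative=2 makes 2 of the features informative; n_redundant=1 adds 1 redundant feature; n_repeated=0 adds no duplicated features;
# random_state=0 fixes the random seed so the data is reproducible
X, y = make_classification(n_samples=100, n_features=4, n_classes=2, n_informative=2, n_redundant=1, n_repeated=0,
                           random_state=0)

# Split the dataset
# test_size=0.2 holds out 20% of the samples as the test set; random_state=42 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a logistic regression model
# Initialize the model with default settings
model = LogisticRegression()
# Fit the model on the training set
model.fit(X_train, y_train)
# Predict class labels for the test set
y_pred = model.predict(X_test)
# Predict the probability of the positive class (column 1) for the test set
y_prob = model.predict_proba(X_test)[:, 1]

# Evaluate the model
# Accuracy is computed from the predicted labels
print("Accuracy:", accuracy_score(y_test, y_pred))
# AUC-ROC is computed from the predicted probabilities, not the labels
print("AUC-ROC:", roc_auc_score(y_test, y_prob))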

Output

Accuracy: 0.9
AUC-ROC: 0.9090909090909091
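
The run above reports only accuracy and AUC-ROC. As a sketch of how the other metrics listed earlier follow from the confusion matrix, the snippet below computes precision, recall, and F1 by hand and checks them against scikit-learn. The labels are made-up toy values, not the predictions from the run above.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy ground truth and predictions (made up for illustration)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels {0, 1}, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print("precision:", precision, "recall:", recall, "F1:", f1)
print("sklearn:  ", precision_score(y_true, y_pred),
      recall_score(y_true, y_pred), f1_score(y_true, y_pred))
```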

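2. Multi-class classification

Features and target

Input: a feature vector $x \in \mathbb{R}^d$ .
Output: a target $y \in \{1, 2, \dots, K\}$ .
Model prediction: the model outputs a probability $P(y=k|x)$ for every class k; these probabilities sum to 1 over the K classes.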
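
One common way to obtain class probabilities that sum to 1 is the softmax function. The sketch below is a minimal illustration; the three scores are hypothetical, and real models compute them from the feature vector x.

```python
import numpy as np

def softmax(z):
    """Convert a vector of K scores into K probabilities that sum to 1."""
    z = z - np.max(z)   # subtracting the max does not change the result but avoids overflow
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])   # hypothetical scores for K = 3 classes
probs = softmax(scores)

# argmax returns a 0-based index; add 1 because the classes are numbered 1..K here
pred = int(np.argmax(probs)) + 1

print("probabilities:", probs, "sum:", probs.sum())
print("predicted class:", pred)
```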

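Models and algorithms

Common models:
- Softmax regression (multinomial logistic regression)
- Decision tree and random forest
- Gradient-boosted trees (such as XGBoost and LightGBM)
- Neural network (with a Softmax activation on the output layer)
Loss function:
- Cross-entropy loss: $\mathcal{L} = -\frac{1}{N} \sum_{i=1}^N \sum_{k=1}^K 1(y_i = k) \log(\hat{y}_{i,k})$ , where $\hat{y}_{i,k}$ is the predicted probability that sample i belongs to class k and $1(\cdot)$ is the indicator function.
Evaluation metrics:
- Accuracy
- Confusion matrix
- Averaged precision, recall, and F1 score (macro / micro / weighted)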
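
As a sketch of the cross-entropy formula above: because of the indicator, only the predicted probability of the true class contributes for each sample. The toy labels and probabilities below are made up, and the labels are coded 0..K−1 (rather than 1..K) so that they can be used directly as array indices.

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy data: N = 4 samples, K = 3 classes, labels coded 0..2 (made up for illustration)
y_true = np.array([0, 2, 1, 2])
y_hat = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
])

# The indicator 1(y_i = k) picks out the probability assigned to the true class
p_true = y_hat[np.arange(len(y_true)), y_true]
manual = -np.mean(np.log(p_true))

# scikit-learn's log_loss handles the multi-class case as well
reference = log_loss(y_true, y_hat, labels=[0, 1, 2])

print("manual cross-entropy: ", manual)
print("sklearn cross-entropy:", reference)
```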

Example code

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

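# Generate the data
# Note: n_classes=2 here, so this run actually uses a two-class dataset, and the output below reflects that.
# The rest of the code works unchanged with more classes, e.g. n_classes=3 together with n_clusters_per_class=1
# (make_classification requires n_classes * n_clusters_per_class <= 2**n_informative).
# Parameters: n_samples=100 generates 100 samples; n_features=4 gives each sample 4 features;
# n_informative=2 makes 2 of the features informative; n_redundant=1 adds 1 redundant feature; n_repeated=0 adds no duplicated features;
# random_state=0 fixes the random seed so the data is reproducible
X, y = make_classification(n_samples=100, n_features=4, n_classes=2, n_informative=2, n_redundant=1, n_repeated=0,
                           random_state=0)

# Split the dataset
# test_size=0.2 holds out 20% of the samples as the test set; random_state=42 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the random forest classifier
# No random_state is set on the model, so results can vary slightly between runs
model = RandomForestClassifier()

# Fit the model on the training set
model.fit(X_train, y_train)

# Predict class labels for the test set
y_pred = model.predict(X_test)

# Evaluate
# Print the accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
# Print the classification report: per-class precision, recall, and F1, plus macro and weighted averages
print("Classification Report:\n", classification_report(y_test, y_pred))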

Output

Accuracy: 0.9
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.82      0.90        11
           1       0.82      1.00      0.90         9

    accuracy                           0.90        20
   macro avg       0.91      0.91      0.90        20
weighted avg       0.92      0.90      0.90        20
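
The run above uses a two-class dataset (`n_classes=2`). For comparison, here is a sketch of the same workflow on a genuinely three-class problem, with the confusion matrix and the macro, micro, and weighted F1 averages. The dataset sizes, the `stratify=y` split, and `random_state=0` on the model are choices made for this illustration; no output is shown because the numbers were not taken from a recorded run.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Three classes; n_informative=4 satisfies n_classes * n_clusters_per_class <= 2**n_informative
X, y = make_classification(n_samples=300, n_features=6, n_classes=3,
                           n_informative=4, n_redundant=1, n_repeated=0,
                           random_state=0)

# stratify=y keeps the class proportions the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = RandomForestClassifier(random_state=0)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
proba = model.predict_proba(X_test)   # one column per class, each row sums to 1

cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted class
print("Confusion matrix:\n", cm)
print("Accuracy:", accuracy_score(y_test, y_pred))
for avg in ("macro", "micro", "weighted"):
    print(f"F1 ({avg}):", f1_score(y_test, y_pred, average=avg))
```

Macro averaging weights every class equally, weighted averaging weights each class by its support, and micro averaging pools all decisions, which for single-label problems makes micro F1 equal to accuracy.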

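3. Differences between binary and multi-class classification

| Property | Binary classification | Multi-class classification |
| --- | --- | --- |
| Target variable | y∈{0,1} | y∈{1,2,…,K} |
| Loss function | Log loss (binary cross-entropy) | Cross-entropy loss |
| Prediction output | Probability of class 1 (class 0 gets the complement) | Probability distribution over all K classes |
| Model complexity | Relatively simple | More complex; relationships between classes must be considered |
| Evaluation metrics | Precision, recall, AUC, etc. | Confusion matrix, macro-averaged F1, etc. |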
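
The two loss functions in this comparison are closely related: the binary loss is the K = 2 case of the cross-entropy loss. The derivation below is a sketch under an assumed relabeling, in which class k = 1 corresponds to the binary label y = 1 and class k = 2 to the binary label y = 0, so that $\hat{y}_{i,1} = \hat{y}_i$ and $\hat{y}_{i,2} = 1 - \hat{y}_i$.

```latex
% Assumed relabeling: k = 1 <-> binary label 1, k = 2 <-> binary label 0.
% In the last line, y_i denotes the binary label in {0, 1}.
\begin{aligned}
\mathcal{L}
&= -\frac{1}{N} \sum_{i=1}^N \sum_{k=1}^{2} 1(y_i = k) \log(\hat{y}_{i,k}) \\
&= -\frac{1}{N} \sum_{i=1}^N \left[ 1(y_i = 1) \log(\hat{y}_{i,1}) + 1(y_i = 2) \log(\hat{y}_{i,2}) \right] \\
&= -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
\end{aligned}
```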

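4. Notes

Model selection:
- For binary problems, many models (such as logistic regression and SVM) work out of the box;
- Multi-class problems can be decomposed into several binary problems with a **one-vs-rest (OvR) or one-vs-one (OvO)** strategy. OvR trains K classifiers, one per class against all the others; OvO trains K(K−1)/2 classifiers, one per pair of classes.
Imbalanced data:
- In both binary and multi-class settings, class imbalance makes some metrics misleading (accuracy in particular), so look at metrics such as AUC or per-class recall, or re-weight the classes during training.
Interpreting probabilities:
- In binary classification, a single probability expresses the confidence in one class, and the other class gets the complement;
- In multi-class classification, the probability distribution gives the likelihood that the sample belongs to each class.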
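
As a sketch of the decomposition strategies just mentioned, scikit-learn provides `OneVsRestClassifier` and `OneVsOneClassifier` wrappers around any binary estimator. The dataset, the choice of K = 4, and the use of logistic regression as the base model are assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

K = 4   # number of classes in this toy problem
X, y = make_classification(n_samples=200, n_features=8, n_classes=K,
                           n_informative=4, n_redundant=0, random_state=0)

# One-vs-rest: one binary classifier per class (that class against all others)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# One-vs-one: one binary classifier per pair of classes
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print("OvR binary classifiers:", len(ovr.estimators_))   # K
print("OvO binary classifiers:", len(ovo.estimators_))   # K * (K - 1) / 2
```

OvO trains more models, but each one sees only the samples of two classes, which can matter for estimators that scale poorly with the number of samples.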
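
One way to adjust the weights is the `class_weight="balanced"` option that many scikit-learn classifiers accept. The sketch below compares an unweighted and a balanced logistic regression on a synthetic dataset in which roughly 5% of the samples are positive. The dataset and its parameters are assumptions for illustration; balancing typically raises recall on the minority class at some cost in overall accuracy, but the exact numbers depend on the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data in which roughly 5% of the samples belong to class 1
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95, 0.05],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

models = {
    "unweighted": LogisticRegression(max_iter=1000),
    "balanced": LogisticRegression(max_iter=1000, class_weight="balanced"),
}

results = {}
for name, clf in models.items():
    y_pred = clf.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "minority_recall": recall_score(y_test, y_pred, pos_label=1),
    }
    print(name, results[name])
```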