目录
- 1. 竞赛简要
- 2. 数据分析
- 2.1 特征类型统计
- 2.2 四个荒野区域数据分析
- 2.3 连续特征分析
- 2.4 离散特征分析
- 2.5 特征相关性热图
- 2.6 特征间的散点关系图
- 3. 特征工程
- 3.1 特征组合
- 3.2 连续特征标准化
- 4. 模型搭建
- 4.1 模型定义
- 4.2 绘制混淆矩阵和ROC曲线
- 4.3 模型对比与选择
- 5. 测试集分类
- 5.1 特征重要性排名
- 5.2 测试集分类
- 6. 参考文献
1. 竞赛简要
本次竞赛的数据集包含训练集和测试集,数据集的研究区域包含位于科罗拉多州北部罗斯福国家森林的四个荒野区域,这些区域受人为干扰影响最小,每个观测样本的区域大小为 30m x 30m。其中,训练集大小有15120个样本,测试集有565892个样本。共56个特征,7种类别(用数字1-7表示),第一列为样本Id
列,最后一列Cover_Type
为标签列,详细数据信息见竞赛地址。代码的最终评分为0.78729,竞赛所需的数据集和代码已经打包上传到Gitee,点击直达。
2. 数据分析
2.1 特征类型统计
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import warnings
warnings.filterwarnings('ignore')
# 数据集研究区域位于科罗拉多州北部罗斯福国家森林的四个荒野区域, 每个观测值是一个 30m x 30m 的区域
train = pd.read_csv("D:/Desktop/kaggle数据集/forest-cover/train.csv")
# 删除ID列
train.drop("Id", axis=1, inplace=True)
#------------------------------------------------------------------------------------------------------------#
# 统计数据类别数量(特征中没有非数值变量)
# 分类特征(唯一值数量小于25,也可以是其他值,视情况而定),例:特征列 [1, 2, 2, 3, 3] 的唯一值(1, 2, 3)
# 连续特征(‘Id’列不计入)
#------------------------------------------------------------------------------------------------------------#
discrete_features = [col for col in train.columns if len(train[col].unique()) < 25]
continuous_features = [feature for feature in train.columns if feature not in discrete_features]
print("Number of discrete features : ",len(discrete_features))
print("Number of continuous features: ", len(continuous_features))
print("离散型特征: ",discrete_features)
print("连续型特征:", continuous_features)
# 重命名一些较长的特征,否则绘图时容易重叠
continuous_features_labels = ["Elevation", "Aspect", "Slope", "Horz_Dist_To_Hydrology", "Vert_Dist_To_Hydrology",
"Horz_Dist_To_Roadways", "Hillshade 9am", "Hillshade Noon", "Hillshade 3pm", "Horz_Dist_To_Fire_Points"]
Number of discrete features : 45
Number of continuous features: 10
离散型特征: ['Wilderness_Area1', 'Wilderness_Area2', 'Wilderness_Area3', 'Wilderness_Area4', 'Soil_Type1', 'Soil_Type2', 'Soil_Type3', 'Soil_Type4', 'Soil_Type5', 'Soil_Type6', 'Soil_Type7', 'Soil_Type8', 'Soil_Type9', 'Soil_Type10', 'Soil_Type11', 'Soil_Type12', 'Soil_Type13', 'Soil_Type14', 'Soil_Type15', 'Soil_Type16', 'Soil_Type17', 'Soil_Type18', 'Soil_Type19', 'Soil_Type20', 'Soil_Type21', 'Soil_Type22', 'Soil_Type23', 'Soil_Type24', 'Soil_Type25', 'Soil_Type26', 'Soil_Type27', 'Soil_Type28', 'Soil_Type29', 'Soil_Type30', 'Soil_Type31', 'Soil_Type32', 'Soil_Type33', 'Soil_Type34', 'Soil_Type35', 'Soil_Type36', 'Soil_Type37', 'Soil_Type38', 'Soil_Type39', 'Soil_Type40', 'Cover_Type']
连续型特征: ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm', 'Horizontal_Distance_To_Fire_Points']
其中,45个离散特征包含40个土壤类型特征、4个荒野区域特征和一个森林覆盖类型特征。
2.2 四个荒野区域数据分析
cover_type_counts = np.array(train.groupby(["Cover_Type"]).size())
cover_type_counts
array([2160, 2160, 2160, 2160, 2160, 2160, 2160], dtype=int64)
可以发现,7种森林覆盖类型的总数都相同。
四个荒野区域:
- Rawah Wilderness Area
- Neota Wilderness Area
- Comanche Peak Wilderness Area
- Cache la Poudre Wilderness Area
area_1_counts = np.array(train.groupby(["Cover_Type"]).sum()["Wilderness_Area1"])
area_2_counts = np.asarray(train.groupby(["Cover_Type"]).sum()["Wilderness_Area2"])
area_3_counts = np.asarray(train.groupby(["Cover_Type"]).sum()["Wilderness_Area3"])
area_4_counts = np.asarray(train.groupby(["Cover_Type"]).sum()["Wilderness_Area4"])
print(f"区域1森林覆盖类型统计:{area_1_counts}")
区域1森林覆盖类型统计:[1062 1134 0 0 856 0 545]
print("区域1所有土地类型总计:{}".format(train['Wilderness_Area1'].sum()))
print("区域2所有土地类型总计:{}".format(train['Wilderness_Area2'].sum()))
print("区域3所有土地类型总计:{}".format(train['Wilderness_Area3'].sum()))
print("区域4所有土地类型总计:{}".format(train['Wilderness_Area4'].sum()))
区域1所有森林类型总计:3597
区域2所有森林类型总计:499
区域3所有森林类型总计:6349
区域4所有森林类型总计:4675
绘制每个区域上的土地类型分布情况:
area_counts = np.zeros((4, 7))
area_counts[0] = np.array(train.groupby(["Cover_Type"]).sum()["Wilderness_Area1"])
area_counts[1] = np.array(train.groupby(["Cover_Type"]).sum()["Wilderness_Area2"])
area_counts[2] = np.array(train.groupby(["Cover_Type"]).sum()["Wilderness_Area3"])
area_counts[3] = np.array(train.groupby(["Cover_Type"]).sum()["Wilderness_Area4"])
# 初始化底部位置
bottoms = np.zeros(4)
for i in range(7):
# width用于控制条形的宽度
plt.bar(range(1,5), area_counts[:, i], width=0.5, bottom=bottoms, label=f"Type {i + 1}")
# 更新底部位置
bottoms += area_counts[:, i]
# 动态设置 y 轴的范围
plt.ylim([0, np.max(bottoms) + 1000])
plt.legend(ncol=2)
plt.xlabel("Wilderness Areas")
plt.ylabel("Total Number")
plt.title("Forest Cover Types in each Wilderness Area")
plt.xticks(range(1,5), ['Area 1', 'Area 2', 'Area 3', 'Area 4']);
plt.tight_layout()
2.3 连续特征分析
查看不同土地类型相对于每个连续变量的分布
fig, axs = plt.subplots(10, 2, figsize=(14,40))
for i in range(len(continuous_features)):
#--------------------------------------------------------------------------------------------------------#
# kdeplot 用于绘制核密度估计图(可视化连续变量的概率密度分布)
# 纵坐标为概率密度:概率密度并不是直接表示概率,而是表示在某个小区间内找到随机变量的可能性,
# 通过计算相应区间的积分来获得,而不是直接从概率密度函数中读取
#--------------------------------------------------------------------------------------------------------#
sns.kdeplot(data=train, x=continuous_features[i], hue='Cover_Type', palette='turbo', fill=True, ax=axs[i][0])
sns.boxplot(data=train, x='Cover_Type', y=continuous_features[i], palette='turbo', fliersize=2, ax=axs[i][1])
axs[i][0].set_xlabel(continuous_features_labels[i], fontsize=14)
axs[i][1].set_ylabel(continuous_features_labels[i], fontsize=14)
axs[i][1].set_xlabel('Cover Type', fontsize=14)
axs[i][0].set_ylabel('Density', fontsize=14)
# 调整子图布局,wspace=0.4:表示子图之间的水平间距是子图宽度的40%
plt.subplots_adjust(wspace=0.4, hspace=0.3)
运行结果(图太长,截取局部):
由箱线图可知:海拔(Elevation)特征对土地类型的划分较为明显,例如土地类型7都分布海拔3000m以上,土地类型4都分布再海拔2500m以下。Horizontal_Distance_To_Roadways
也在一定程度上分离了覆盖类型,类型4都分布在水平范围2000m以内,而类型1,2,7在4000m内分布较为均匀。
绘制连续特征数值分布:
fig, ax = plt.subplots(nrows=5, ncols=2, figsize=(14, 20))
for i, feature in enumerate(continuous_features):
sns.histplot(data=train, x=feature, kde=True, ax=ax[i//2,i%2])
运行结果(图太长,截取局部):
2.4 离散特征分析
查看不同土地类型的土壤类型分布情况(共 40 种不同土壤类型),根据分布情况分析各覆盖类型中以哪些土壤类型为主。
fig, axs = plt.subplots(4,2, figsize=(14, 18))
soil_data = [soil_1, soil_2, soil_3, soil_4, soil_5, soil_6, soil_7]
for i in range(7):
sns.barplot(x=list(range(1,41)), y=soil_data[i], ax=axs[i//2][i%2], palette='coolwarm')
axs[3][1].axis('off')
axs[0][0].set_title('Cover Type 1 (Spruce/Fir)')
axs[0][1].set_title('Cover Type 2 (Lodgepole Pine)')
axs[1][0].set_title('Cover Type 3 (Ponderosa Pine)')
axs[1][1].set_title('Cover Type 4 (Cottonwood/Willow)')
axs[2][0].set_title('Cover Type 5 (Aspen)')
axs[2][1].set_title('Cover Type 6 (Douglas Fir)')
axs[3][0].set_title('Cover Type 7 (Krummholz)')
for i in range(4):
for j in range(2):
# 设置横坐标值的字体大小
axs[i][j].set_xticklabels(labels=list(range(1,41)), fontsize=7.5)
axs[i][j].set_xlabel('Soil Type')
axs[i][j].set_ylabel('Total Number')
# 旋转刻度标签
axs[i][j].tick_params(axis='x', rotation=45)
plt.subplots_adjust(wspace=0.2, hspace=0.45)
运行结果(图太长,截取局部):
2.5 特征相关性热图
plt.figure(figsize=(8,7))
corr = train[continuous_features].corr()
sns.heatmap(corr, xticklabels=continuous_features_labels, yticklabels=continuous_features_labels, linewidths=0.5, annot=True);
由图可知,特征Hillshade 9am
和Hillshade 3pm
之间具有极强的相关性, Horizontal_Distance_To_Hydrology
和Vertical_Distance_To_Hydrology
之间的相关性也较高。后续的特征工程重点分析这几个特征。此外,Hillshade
在不同时间段的预测值、海拔与道路/火点距离之间的相关性也较高。
注:特征之间的强相关性会导致多重共线性,多重共线性会使回归模型的系数估计不可靠,影响其解释性和预测能力。特别是在具有高度相关性的自变量之间,模型很难准确估计每个自变量对因变量的独立贡献。但是由于测试数据比训练数据大得多,根据现有的特征得出的结论可能不完善,因此一些强相关的特征继续保留。
下面将使用散点图详细分析哪些特征适合组合。
2.6 特征间的散点关系图
#--------------------------------------------------------------------------------------------------------------------------------#
# 绘制连续变量之间的分布关系
# jointplot:用于绘制两个变量之间的关系,同时显示这些变量的分布情况(边际图)
# height=5: 设置绘图的高度(单位是英寸)
# s=3: 设置散点的大小
#--------------------------------------------------------------------------------------------------------------------------------#
sns.jointplot(data=train, x="Horizontal_Distance_To_Hydrology", y="Vertical_Distance_To_Hydrology",
hue="Cover_Type", palette='turbo', height=5, s=3);
由图知,Vertical_Distance_To_Hydrology
和Horizontal_Distance_To_Hydrology
之间存在较强的正相关关系。这与相关性系数0.65相符。
- Horizontal_Distance_To_Hydrology:到最近的地表水体的水平距离
- Vertical_Distance_To_Hydrology:到最近的地表水体的垂直距离
后续的特征工程可以根据这两个特征计算出距离地表水的实际距离,将两个特征作为直角三角形的两条直角边,计算得出的斜边
C
C
C即为实际距离,计算公式:
C
=
V
2
+
H
2
C=\sqrt{V^2+H^2}
C=V2+H2
sns.jointplot(data=train, x="Horizontal_Distance_To_Roadways", y="Elevation", hue="Cover_Type", palette='turbo', height=5, s=3);
sns.jointplot(data=train, x="Horizontal_Distance_To_Fire_Points", y="Horizontal_Distance_To_Roadways", hue="Cover_Type", palette='turbo', height=5, s=3);
以上结果表明,特征之间整体相关性不强,但是单独以每种森林覆盖类型来看, 存在一些类型的观测值沿正斜率线分布,可以对特征进行加减(有助于捕捉更多的潜在模式和关系)组合。
sns.jointplot(data=train, x="Vertical_Distance_To_Hydrology", y="Elevation", hue="Cover_Type", palette='turbo', height=5, s=3);
Elevation
和Vertical_Distance_To_Hydrology
之间的相关性系数为0.12(几乎没有相关性),但是以每种森林覆盖类型来看, 每种类型对应的两个特征的值沿正斜率线分布(即正相关)。可以将两个特征相加或相减,有助于按土地类型对数据进行分层。此外,相加时具有现实解释意义,即距离实际观测地点最近的地表水的海拔。
3. 特征工程
3.1 特征组合
# 特征组合:通过距离最近地表水的垂直距离和水平距离计算实际直线距离
train['Actual_Distance_To_Hydrology'] = (train['Vertical_Distance_To_Hydrology']**2 +
train['Horizontal_Distance_To_Hydrology']**2)**0.5
train['Elevation_Plus_Vertical_Hydrology'] = train['Elevation'] + train['Vertical_Distance_To_Hydrology']
train['Elevation_Minus_Vertical_Hydrology'] = train['Elevation'] - train['Vertical_Distance_To_Hydrology']
# 特征 Hillshade_9am 和 Hillshade_3pm 之间具有极强的负相关性,Hillshade_3pm和Hillshade_Noon有较(0.61)强正相关性,因此对三个特征进行组合
train['Hillshade'] = train['Hillshade_9am'] + train['Hillshade_3pm'] + train['Hillshade_Noon']
train['Elevation_Plus_Horizontal_Roadways'] = train['Elevation'] +train['Horizontal_Distance_To_Roadways']
train['Elevation_Minus_Horizontal_Roadways'] = train['Elevation'] - train['Horizontal_Distance_To_Roadways']
train['Fire_Points_Plus_Roadways'] = train['Horizontal_Distance_To_Fire_Points'] + train['Horizontal_Distance_To_Roadways']
train['Fire_Points_Minus_Roadways'] = train['Horizontal_Distance_To_Fire_Points'] - train['Horizontal_Distance_To_Roadways']
划分训练集与验证集:
X = train.drop('Cover_Type', axis=1)
y = train['Cover_Type']
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)
在新数据集中检索连续特征:
discrete_features = [col for col in X.columns if len(X[col].unique()) < 25]
continuous_features = [feature for feature in X.columns if feature not in discrete_features]
print("Number of discrete features : ",len(discrete_features))
print("Number of continuous features: ", len(continuous_features))
print("离散型特征: ",discrete_features)
print("连续型特征:", continuous_features)
Number of discrete features : 44
Number of continuous features: 18
离散型特征: ['Wilderness_Area1', 'Wilderness_Area2', 'Wilderness_Area3', 'Wilderness_Area4', 'Soil_Type1', 'Soil_Type2', 'Soil_Type3', 'Soil_Type4', 'Soil_Type5', 'Soil_Type6', 'Soil_Type7', 'Soil_Type8', 'Soil_Type9', 'Soil_Type10', 'Soil_Type11', 'Soil_Type12', 'Soil_Type13', 'Soil_Type14', 'Soil_Type15', 'Soil_Type16', 'Soil_Type17', 'Soil_Type18', 'Soil_Type19', 'Soil_Type20', 'Soil_Type21', 'Soil_Type22', 'Soil_Type23', 'Soil_Type24', 'Soil_Type25', 'Soil_Type26', 'Soil_Type27', 'Soil_Type28', 'Soil_Type29', 'Soil_Type30', 'Soil_Type31', 'Soil_Type32', 'Soil_Type33', 'Soil_Type34', 'Soil_Type35', 'Soil_Type36', 'Soil_Type37', 'Soil_Type38', 'Soil_Type39', 'Soil_Type40']
连续型特征: ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm', 'Horizontal_Distance_To_Fire_Points', 'Actual_Distance_To_Hydrology', 'Elevation_Plus_Vertical_Hydrology', 'Elevation_Minus_Vertical_Hydrology', 'Hillshade', 'Elevation_Plus_Horizontal_Roadways', 'Elevation_Minus_Horizontal_Roadways', 'Fire_Points_Plus_Roadways', 'Fire_Points_Minus_Roadways']
3.2 连续特征标准化
将特征的均值调整为 0,标准差调整为 1,使得数据符合标准正态分布(均值为 0,方差为 1)。
注:训练集和验证集不能一起标准化,因为这会导致模型在训练时“看到”验证集的信息(数据泄露)。会导致模型在验证集上的表现过于乐观。应先在训练集上拟合标准化器,并使用该标准化器对训练集和验证集进行单独转换(保证训练和验证集使用相同的标准进行缩放)。这样可以确保模型在验证集上的表现能够真实反映其泛化能力。
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# 计算 X_train 中指定的连续变量的均值和标准差,scaler 会存储这些特征的均值和标准差
scaler.fit(X_train[continuous_features])
X_train_scaled = X_train.copy()
X_valid_scaled = X_valid.copy()
# 标准化处理
X_train_scaled[continuous_features] = scaler.transform(X_train_scaled[continuous_features])
X_valid_scaled[continuous_features] = scaler.transform(X_valid_scaled[continuous_features])
4. 模型搭建
4.1 模型定义
定义超参数优化函数和模型函数:
import optuna
from xgboost import XGBClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
"""
定义超参数优化函数
Params:
objective:待优化的评估函数
Returns:
params:最佳超参数组合
"""
def tune(objective, X_train, y_train):
# 创建了一个 Optuna 的 Study 对象,用于管理一次超参数优化的过程
# direction='maximize' 表示优化的目标是最大化目标函数的返回值(返回值是负的均方根误差)
study = optuna.create_study(direction='maximize')
# 使用 study 对象的 optimize 方法来执行优化过程
# n_trials=10:指进行优化的试验次数
study.optimize(lambda trial: objective(trial, X_train, y_train), n_trials=10)
params = study.best_params
# 获取经过优化后的最佳得分((目标函数的返回值))
best_score = study.best_value
print(f"Best score: {best_score} \nOptimized parameters: {params}")
return params
注:XGBClassifier 要求类标签必须是从 0 开始的连续整数(标签需减1),而 ExtraTreesClassifier 没有要求。
"""
XGBoost分类模型
Params:
trial:是由 Optuna 库提供的对象,用于在超参数优化过程中管理参数的提议和跟踪
Returns:
score:交叉验证评分
"""
def xgb_objective(trial, X_train, y_train):
_n_estimators = trial.suggest_categorical("n_estimators", [50, 100, 200, 400, 500, 800])
_max_depth = trial.suggest_int("max_depth", 1, 20)
_learning_rate = trial.suggest_float("learning_rate", 0.01, 1)
xgboost = XGBClassifier(
n_estimators=_n_estimators,
max_depth=_max_depth,
learning_rate=_learning_rate,
)
score = cross_val_score(xgboost, X_train, y_train-1, cv=10, scoring="accuracy").mean()
return score
"""
极端随机树分类模型
Params:
trial:是由 Optuna 库提供的对象,用于在超参数优化过程中管理参数的提议和跟踪
Returns:
score:交叉验证评分
"""
def etc_objective(trial, X_train, y_train):
_n_estimators = trial.suggest_categorical("n_estimators", [50, 100, 200, 400, 500, 800])
_max_depth = trial.suggest_int("max_depth", 1, 20)
extra = ExtraTreesClassifier(
n_estimators=_n_estimators,
max_depth=_max_depth,
)
score = cross_val_score(extra, X_train, y_train, cv=10, scoring="accuracy").mean()
return score
执行XGBoost模型的超参数优化过程:
import time
start_time = time.time()
xgb_params = tune(xgb_objective, X_train_scaled, y_train)
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Running time: {elapsed_time:.2f} seconds")
Best score: 0.8761569906144686
Optimized parameters: {'n_estimators': 100, 'max_depth': 8, 'learning_rate': 0.3587540957081757}
Running time: 178.71 seconds
在验证集上测试准确率:
xgboost = XGBClassifier(**xgb_params, random_state=9)
xgboost.fit(X_train_scaled, y_train-1)
score = xgboost.score(X_valid_scaled, y_valid-1)
print("XGBoost准确率:{}".format(score))
XGBoost准确率:0.8806216931216931
打印分类报告:
# 打印各类别的分类报告
y_pred_xgb = xgboost.predict(X_valid_scaled)
print("XGBoost分类报告:")
print(classification_report(y_valid, y_pred_xgb+1, target_names=[str(i) for i in range(1, 8)], digits=3))
XGBoost分类报告:
precision recall f1-score support
1 0.779 0.777 0.778 400
2 0.807 0.716 0.759 444
3 0.904 0.846 0.874 454
4 0.944 0.985 0.964 461
5 0.901 0.962 0.931 425
6 0.864 0.903 0.883 435
7 0.947 0.973 0.960 405
accuracy 0.881 3024
macro avg 0.878 0.880 0.878 3024
weighted avg 0.879 0.881 0.879 3024
经模型挑选后,为减少代码运行时间,准确率较差的模型和方案已注释,如下:
# 未标准化的数据集,已得出结论,先注释
# start_time = time.time()
# xgb1_params = tune(xgb_objective, X_train, y_train)
# end_time = time.time()
# elapsed_time = end_time - start_time
# print(f"Running time: {elapsed_time:.2f} seconds")
# etc_params = tune(etc_objective, X_train_scaled, y_train)
# etc = ExtraTreesClassifier(**etc_params, random_state=1)
# etc.fit(X_train_scaled, y_train)
# score = etc.score(X_valid_scaled, y_valid)
# print("ExtraTree准确率:{}".format(score))
# 打印各类别的分类报告
# y_pred_etc = etc.predict(X_valid_scaled)
# print("ExtraTree分类报告:")
# print(classification_report(y_valid, y_pred_etc, target_names=[str(i) for i in range(1, 8)], digits=3))
4.2 绘制混淆矩阵和ROC曲线
绘制XGBoost分类器的混淆矩阵:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, roc_curve, auc
conf_matrix = confusion_matrix(y_valid, y_pred_xgb+1)
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
xticklabels=range(1, 8), yticklabels=range(1,8))
plt.ylabel('True')
plt.xlabel('Predicted')
plt.title('Confusion Matrix');
横轴为预测类别,纵轴为实际类别。对标线上的值表示模型正确预测的样本数量,非对角线上的值表示模型错误预测的样本数量。对角线(1, 1)中的值311表示实际类别为1的样本中有311条被正确预测为类别1。
绘制XGBoost分类器的ROC曲线:
# 样本被预测为各个类别的概率
y_pred_proba = xgboost.predict_proba(X_valid_scaled)
# 以类别(1-7)为键值存储每个类的fpr和tpr(fpr[i]的大小与阈值数组大小相等)
fpr = {}
tpr = {}
roc_auc = {}
# 计算ROC曲线和AUC
for i in range(1, 8):
#--------------------------------------------------------------------------------------------------------------#
# roc_curve用于计算 FPR 和 TPR,参数:实际标签(布尔数组)和预测概率。返回值:
# fpr[i]:类别 i 的假阳性率
# tpr[i]:类别 i 的真阳性率
#_:阈值数组
# i-1:列索引从0开始
#--------------------------------------------------------------------------------------------------------------#
fpr[i], tpr[i], k = roc_curve(y_valid == i, y_pred_proba[:, i-1])
roc_auc[i] = auc(fpr[i], tpr[i])
# 绘制所有类别的ROC曲线
for i in range(1, 8): # 类别从1到7
plt.plot(fpr[i], tpr[i], label='Type {} (AUC = {:.3})'.format(i, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--') # 对角线
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='lower right');
4.3 模型对比与选择
model_score = pd.DataFrame({
'Classifier':['XGBoost', 'ExtraTrees'],
'Validation Set Accuracy':[0.880, 0.874],
'Validation Set Precision':[0.879,0.873],
'Validation Set Recall':[0.881,0.874],
'Validation Set F1 score':[0.879, 0.872]
})
model_score
比较发现,XGBoost的模型分类效果更佳。确定XGBoost模型后,对比是否需要数据标准化处理。
model_score = pd.DataFrame({
'Classifier':['XGBoost', 'XGBoost'],
'数据标准化':['标准化','未标准化'],
'Best score':[0.876,0.878],
'Running time(s)':[178, 260]
})
model_score
对比可知,在最佳评分相差不大的情况下,标准化后的数据集能极大提高代码的运行时间,因此确定最佳分类方案为XGBoost+数据标准化。
5. 测试集分类
使用xgboost模型在全部原始训练集上进行训练。由于测试数据比训练数据大得多,且初始的训练集和验证集只划分过一次,泛化性不强,需在原始训练集上进行完整训练,以保证其泛化性。首先对测试集进行标准化处理:
scaler = StandardScaler()
scaler.fit(X[continuous_features])
X_scaled = X.copy()
X_scaled[continuous_features] = scaler.transform(X_scaled[continuous_features])
best_model = XGBClassifier(**xgb_params)
best_model.fit(X_scaled, y-1)
5.1 特征重要性排名
feature_importance = pd.DataFrame({
'Feature': X.columns,
'Importance': best_model.feature_importances_
})
# 按照重要性排序
feature_importance = feature_importance.sort_values(by='Importance', ascending=False)
top_features = feature_importance.head(20)
# 绘制条形图
plt.figure(figsize=(10, 5))
barplot = sns.barplot(x='Importance', y='Feature', data=top_features, palette='viridis')
plt.title('Top 20 Feature Importance')
plt.xlabel('Importance Score')
plt.ylabel('Features')
# 标注数值
for index, row in enumerate(top_features.itertuples()):
barplot.text(row.Importance, index, f" {row.Importance:.3f}", color='black', ha="left", va="center")
由图可知,构造的特征Elevation_Minus_vertical_Hydrology
对样本分类的重要性较高。这表明,特征工程构造的特征能较大程度上影响模型的分类决策。
5.2 测试集分类
使用同样的特征工程方法对测试集特征进行处理:
test = pd.read_csv('D:/Desktop/kaggle数据集/forest-cover/test.csv')
Id = test['Id']
test.drop(["Id"], axis=1, inplace=True)
test['Actual_Distance_To_Hydrology'] = (test['Vertical_Distance_To_Hydrology']**2 +
test['Horizontal_Distance_To_Hydrology']**2)**0.5
test['Elevation_Plus_Vertical_Hydrology'] = test['Elevation'] + test['Vertical_Distance_To_Hydrology']
test['Elevation_Minus_Vertical_Hydrology'] = test['Elevation'] - test['Vertical_Distance_To_Hydrology']
# 特征 Hillshade_9am 和 Hillshade_3pm 之间具有极强的负相关性,Hillshade_3pm和Hillshade_Noon有较(0.61)强正相关性,因此对三个特征进行组合
test['Hillshade'] = test['Hillshade_9am'] + test['Hillshade_3pm'] + test['Hillshade_Noon']
test['Elevation_Plus_Horizontal_Roadways'] = test['Elevation'] +test['Horizontal_Distance_To_Roadways']
test['Elevation_Minus_Horizontal_Roadways'] = test['Elevation'] - test['Horizontal_Distance_To_Roadways']
test['Fire_Points_Plus_Roadways'] = test['Horizontal_Distance_To_Fire_Points'] + test['Horizontal_Distance_To_Roadways']
test['Fire_Points_Minus_Roadways'] = test['Horizontal_Distance_To_Fire_Points'] - test['Horizontal_Distance_To_Roadways']
生成提交文件:
test_scaled = test.copy()
test_scaled[continuous_features] = scaler.transform(test_scaled[continuous_features])
pred = best_model.predict(test_scaled)
submission = pd.DataFrame({'Id': Id,'Cover_Type': pred+1})
# submission.to_csv('/kaggle/working/submission.csv', index=False)
print('Submission file created!')
6. 参考文献
[1] Forest Cover Type: 0.81031 (Top 6%) w/ ExtraTrees
[2] Multivariate Classification w/ExtraTreesClassifier