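[Data Mining in Practice] House Price Prediction

In this post we work through an entry-level Kaggle dataset, the house price regression dataset, and predict house sale prices.

Author homepage: 机器学习司猫白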

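OK, without further ado, let's get started.

Overview

This competition provides 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa. The task is to predict the final sale price of each home.

Dataset Description

The dataset has been uploaded, so you can download it and try it yourself.

File Descriptions

train.csv - the training set

test.csv - the test set

data_description.txt - a full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here

sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot area, and number of bedrooms

Modeling Approach

The goal is to predict house prices, so this is clearly a regression problem. We try a linear regression model and a tree-based regression model, tune their hyperparameters, pick the better of the two, and output a predicted price for every house.

Python Source Code

I. Open the Data Files and Inspect the Data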

import numpy as np 
import pandas as pd 

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
train_data.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallCond    1460 non-null   int64  
 19  YearBuilt      1460 non-null   int64  
 20  YearRemodAdd   1460 non-null   int64  
 21  RoofStyle      1460 non-null   object 
 22  RoofMatl       1460 non-null   object 
 23  Exterior1st    1460 non-null   object 
 24  Exterior2nd    1460 non-null   object 
 25  MasVnrType     588 non-null    object 
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object 
 28  ExterCond      1460 non-null   object 
 29  Foundation     1460 non-null   object 
 30  BsmtQual       1423 non-null   object 
 31  BsmtCond       1423 non-null   object 
 32  BsmtExposure   1422 non-null   object 
 33  BsmtFinType1   1423 non-null   object 
 34  BsmtFinSF1     1460 non-null   int64  
 35  BsmtFinType2   1422 non-null   object 
 36  BsmtFinSF2     1460 non-null   int64  
 37  BsmtUnfSF      1460 non-null   int64  
 38  TotalBsmtSF    1460 non-null   int64  
 39  Heating        1460 non-null   object 
 40  HeatingQC      1460 non-null   object 
 41  CentralAir     1460 non-null   object 
 42  Electrical     1459 non-null   object 
 43  1stFlrSF       1460 non-null   int64  
 44  2ndFlrSF       1460 non-null   int64  
 45  LowQualFinSF   1460 non-null   int64  
 46  GrLivArea      1460 non-null   int64  
 47  BsmtFullBath   1460 non-null   int64  
 48  BsmtHalfBath   1460 non-null   int64  
 49  FullBath       1460 non-null   int64  
 50  HalfBath       1460 non-null   int64  
 51  BedroomAbvGr   1460 non-null   int64  
 52  KitchenAbvGr   1460 non-null   int64  
 53  KitchenQual    1460 non-null   object 
 54  TotRmsAbvGrd   1460 non-null   int64  
 55  Functional     1460 non-null   object 
 56  Fireplaces     1460 non-null   int64  
 57  FireplaceQu    770 non-null    object 
 58  GarageType     1379 non-null   object 
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object 
 61  GarageCars     1460 non-null   int64  
 62  GarageArea     1460 non-null   int64  
 63  GarageQual     1379 non-null   object 
 64  GarageCond     1379 non-null   object 
 65  PavedDrive     1460 non-null   object 
 66  WoodDeckSF     1460 non-null   int64  
 67  OpenPorchSF    1460 non-null   int64  
 68  EnclosedPorch  1460 non-null   int64  
 69  3SsnPorch      1460 non-null   int64  
 70  ScreenPorch    1460 non-null   int64  
 71  PoolArea       1460 non-null   int64  
 72  PoolQC         7 non-null      object 
 73  Fence          281 non-null    object 
 74  MiscFeature    54 non-null     object 
 75  MiscVal        1460 non-null   int64  
 76  MoSold         1460 non-null   int64  
 77  YrSold         1460 non-null   int64  
 78  SaleType       1460 non-null   object 
 79  SaleCondition  1460 non-null   object 
 80  SalePrice      1460 non-null   int64  
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB

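The output shows that the dataset contains missing values. Left unhandled, they affect the rest of the modeling process and can even make the model raise errors. One case deserves a specific mention. For object-type features we would normally use one-hot encoding to turn categories into numbers. Depending on the encoder settings (for example `dummy_na=True` in `pd.get_dummies`), missing values can produce an extra column that flags NaN. If that column, or any category, appears in only one of the training data and the data used later for prediction, the two encoded tables end up with different columns, and the model cannot be used for prediction.

In addition, many models are sensitive to NaN values, because NaN marks missing data rather than a number. Many estimators raise an error when they meet NaN during training, since they cannot handle non-numeric input. So before modeling we need to handle missing values, both to keep the data consistent and to let the model train properly. Common approaches are filling in missing values (with the mean, median, or mode) or dropping the rows or columns that contain them.

Feature-dimension consistency: the training data and the prediction data must have exactly the same feature columns; otherwise the model cannot be applied correctly to new data.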
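To make the dimension problem concrete, here is a minimal sketch on made-up toy data (the two small DataFrames are hypothetical, not taken from the competition files). It one-hot encodes a training column and a test column separately, shows that the resulting columns differ, and then aligns the test table to the training columns with `reindex`. This is one possible way to keep the columns consistent, not the approach used later in this post, which relies on label encoding instead.

```python
import pandas as pd

# Hypothetical toy data: "FV" and a missing value occur only in train,
# "RM" occurs only in test
train_toy = pd.DataFrame({"MSZoning": ["RL", "FV", None, "RL"]})
test_toy = pd.DataFrame({"MSZoning": ["RL", "RM"]})

# dummy_na=True adds an explicit indicator column for missing values
train_enc = pd.get_dummies(train_toy, columns=["MSZoning"], dummy_na=True)
test_enc = pd.get_dummies(test_toy, columns=["MSZoning"], dummy_na=True)

# The two encoded tables do not share the same columns
print(list(train_enc.columns))
print(list(test_enc.columns))

# Align test to the training columns: categories unseen in train are dropped,
# categories missing from test are added as all-zero columns
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned.shape)
```

After the `reindex` call, the test table has exactly the columns the model saw during training, in the same order.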

II. Data Processing and Feature Engineering

# Count the missing values in each column
missing_values = train_data.isnull().sum()
total_values = train_data.shape[0]  # total number of rows

# Percentage of missing values per column
missing_percentage = (missing_values / total_values) * 100

# Keep only the features with more than 50% missing values
high_missing_features = missing_percentage[missing_percentage > 50]

# Display them
high_missing_features

Output:

Alley          93.767123
MasVnrType     59.726027
PoolQC         99.520548
Fence          80.753425
MiscFeature    96.301370
dtype: float64

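This computes the percentage of missing values in every column and lists the columns where more than 50% of the values are missing.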
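The same ratio can be written more compactly, because the mean of a boolean mask is the fraction of `True` values. The sketch below uses a small hypothetical DataFrame (the values are made up for illustration) rather than the real training set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "Alley" is missing in 3 of 4 rows
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
})

# isnull() gives True/False per cell; the column mean is the missing fraction
missing_pct = toy.isnull().mean() * 100

# Keep columns above the 50% threshold, most incomplete first
high_missing = missing_pct[missing_pct > 50].sort_values(ascending=False)
print(high_missing)
```

Here only `Alley` (75% missing) passes the threshold; `LotFrontage` (25%) and `LotArea` (0%) do not.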

# Drop the columns with too many missing values, plus the Id column
cols_to_drop = ['MiscFeature', 'Fence', 'PoolQC', 'MasVnrType', 'Alley', 'Id']
train_data2 = train_data.drop(columns=cols_to_drop)
test_data2 = test_data.drop(columns=cols_to_drop)
test_ids = test_data['Id']  # keep the test Ids (avoids shadowing the built-in id)
train_data2.shape, test_data2.shape

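The columns with too many missing values (and the Id column) are dropped. The remaining columns are handled by imputation, always using statistics computed on the training set.

# Fill missing values in the test set
for column in test_data2.columns:
    if test_data2[column].dtype == 'object':
        # Object column: fill with the training-set mode
        test_data2[column] = test_data2[column].fillna(train_data2[column].mode()[0])
    else:
        # Numeric column: fill with the training-set median
        test_data2[column] = test_data2[column].fillna(train_data2[column].median())

# Fill missing values in the training set
for column in train_data2.columns:
    if train_data2[column].dtype == 'object':
        # Object column: fill with the training-set mode
        train_data2[column] = train_data2[column].fillna(train_data2[column].mode()[0])
    else:
        # Numeric column: fill with the training-set median
        train_data2[column] = train_data2[column].fillna(train_data2[column].median())


# Check the shapes of the processed training and test sets
print(train_data2.shape)
print(test_data2.shape)

Output: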

(1460, 75)
(1459, 74)

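With the missing values handled, we can separate the target variable from the features.

train_data3 = train_data2.drop(['SalePrice'], axis=1)

label = train_data2['SalePrice']
train_data3.shape

Output:
(1460, 74)

That is still a lot of features, so we try removing some of them based on correlation.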

import seaborn as sns
import matplotlib.pyplot as plt

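# Select all numeric columns
numerical_data = train_data3.select_dtypes(include=['number'])

# Compute the correlation matrix
correlation_matrix = numerical_data.corr()

# Set the figure size
plt.figure(figsize=(15, 8))

# Draw the heatmap with seaborn
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.1f', linewidths=0.5)

# Set the title
plt.title('Correlation Heatmap of Numerical Features')

# Show the heatmap
plt.show()

With this many features the heatmap is hard to read, so we apply a threshold directly: for every pair of features whose absolute correlation exceeds 0.8, one of the two is dropped.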

# Correlation threshold
threshold = 0.8

# Find the pairs of columns whose absolute correlation exceeds the threshold
to_drop = set()  # columns to be removed
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > threshold:
            colname = correlation_matrix.columns[i]
            # Mark column j for removal only if column i has not itself been marked
            if colname not in to_drop:
                to_drop.add(correlation_matrix.columns[j])

list(to_drop)

Output:

['GarageCars', 'GrLivArea', 'TotalBsmtSF']

# Drop the highly correlated columns
train_data4 = train_data3.drop(columns=to_drop)
test_data4 = test_data2.drop(columns=to_drop)

print(train_data4.shape)
print(test_data4.shape)
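To see how threshold-based pruning behaves, here is a small sketch on synthetic data (the column names `a`, `a_copy`, and `b` are made up). It uses the upper triangle of the correlation matrix so that each pair is checked once. Note that this variant drops the later column of each correlated pair, whereas the loop above drops the earlier one; which member of a pair to keep is a judgment call, and this is only one option.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.05, size=200),  # almost identical to "a"
    "b": rng.normal(size=200),                        # independent feature
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so each pair appears exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# A column is dropped if it is highly correlated with any earlier column
to_drop_toy = [col for col in upper.columns if (upper[col] > 0.8).any()]
print(to_drop_toy)

pruned = toy.drop(columns=to_drop_toy)
print(pruned.shape)
```

Only the near-duplicate column is removed; the independent feature is kept.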

from sklearn.preprocessing import LabelEncoder

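# Encode each categorical feature
for column in train_data4.select_dtypes(include=['object']).columns:
    # Combine the categories from the training and test sets so the encoder knows every possible category
    all_categories = pd.concat([train_data4[column], test_data4[column]]).unique()
    encoder = LabelEncoder()
    encoder.fit(all_categories)

    # Encode the training and test sets with the same encoder
    train_data4[column] = encoder.transform(train_data4[column])
    test_data4[column] = encoder.transform(test_data4[column])

# Check the shapes of the processed training and test sets
print(train_data4.shape)
print(test_data4.shape)

Here the object-type columns are encoded as numbers. As for why label encoding is used, I will cover that in a later article on feature encoding, so I won't go into it here.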
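As a quick illustration of what the encoder does, the sketch below runs `LabelEncoder` on two tiny hypothetical columns (the values are made up). It shows why the encoder is fitted on the categories from both sets: a category that appears only in the test column would otherwise raise an error at `transform` time. It also shows that the integer codes simply follow the sorted order of the category names and carry no meaning of their own.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy columns: "C (all)" appears only in the test column
train_col = pd.Series(["RL", "RM", "FV", "RL"])
test_col = pd.Series(["RL", "C (all)"])

# Fit on the union of categories so both columns can be transformed
demo_encoder = LabelEncoder()
demo_encoder.fit(pd.concat([train_col, test_col]).unique())

# classes_ holds the categories in sorted order; the code is the position
print(list(demo_encoder.classes_))

train_codes = demo_encoder.transform(train_col)
test_codes = demo_encoder.transform(test_col)
print(train_codes, test_codes)
```

Tree models such as LightGBM can usually cope with such arbitrary integer codes, while a linear model like ridge regression will treat them as ordered magnitudes, which is worth keeping in mind when comparing the two.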

III. Model Training and Evaluation

We start with ridge regression, a regularized linear regression, and see how it performs.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split


X = train_data4  # features
y = label  # target variable

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the ridge regression model
ridge_model = Ridge()

# Hyperparameter grid: we tune alpha, the regularization strength
param_grid = {'alpha': np.logspace(-6, 6, 13)}  # 13 log-spaced values from 1e-6 to 1e6

# Use cross-validation to pick the best alpha
grid_search = GridSearchCV(ridge_model, param_grid, cv=5, scoring='neg_mean_squared_error')  # 5-fold CV, scored by negative mean squared error

# Fit the model
grid_search.fit(X_train, y_train)

# Print the best parameter
print("Best alpha parameter:", grid_search.best_params_)

# Get the best model
best_ridge_model = grid_search.best_estimator_

# Evaluate the best model on the validation set
score = best_ridge_model.score(X_val, y_val)
print("Model R^2 score on validation set:", score)

# Print the best cross-validation score
print("Best cross-validation score:", grid_search.best_score_)

Output:

Best alpha parameter: {'alpha': 100.0}
Model R^2 score on validation set: 0.8496053872702527
Best cross-validation score: -1348455440.2012005
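The cross-validation score is negative because scikit-learn reports the negated mean squared error, so that larger is always better. To put it on the same scale as an RMSE, negate it and take the square root. The short calculation below does this for the value printed above. Keep in mind that this is an average over the 5 folds of the training split, so it is comparable in units, but not computed on the same rows, as a validation-set RMSE.

```python
import math

# Best cross-validation score printed above (negated mean squared error)
best_cv_score = -1348455440.2012005

# Undo the sign flip, then take the square root to get an RMSE in price units
cv_rmse = math.sqrt(-best_cv_score)
print(round(cv_rmse))  # 36721
```

So the ridge model's cross-validated error is roughly 36,700 in the units of SalePrice.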

Next we try LightGBM, a tree-based model, and see how it performs.

import optuna
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

X = train_data4  # features
y = label  # target variable

# Split into training and validation sets
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

def objective(trial):
    # Let Optuna choose the hyperparameters
    params = {
        'objective': 'regression',  # regression task
        'boosting_type': 'gbdt',  # gradient boosted decision trees
        'num_leaves': trial.suggest_int('num_leaves', 20, 100),  # maximum number of leaves per tree
        'learning_rate': trial.suggest_float('learning_rate', 1e-5, 1e-1, log=True),  # learning rate, sampled log-uniformly
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),  # number of trees
        'max_depth': trial.suggest_int('max_depth', 3, 15),  # maximum tree depth
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),  # row sampling ratio
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),  # feature sampling ratio
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),  # minimum number of samples per leaf
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-5, 1.0, log=True),  # L1 regularization
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-5, 1.0, log=True)  # L2 regularization
    }

    # Create the LightGBM model
    model = lgb.LGBMRegressor(**params, verbose=-1)

    # Train the model
    model.fit(X_train, y_train)

    # Predict on the validation set
    y_pred = model.predict(X_valid)

    # Compute the RMSE (root mean squared error)
    rmse = np.sqrt(mean_squared_error(y_valid, y_pred))

    return rmse  # Optuna searches for the hyperparameters that minimize this RMSE

# Create the Optuna Study object
study = optuna.create_study(direction='minimize')  # minimize the RMSE

# Run the hyperparameter search
study.optimize(objective, n_trials=50)  # 50 trials

# Print the best hyperparameters and the corresponding RMSE
print(f"Best trial: {study.best_trial.params}")
print(f"Best RMSE: {study.best_value}")

# Train the final model with the best hyperparameters
best_params = study.best_trial.params
final_model = lgb.LGBMRegressor(**best_params, verbose=-1)

# Fit the final model
final_model.fit(X_train, y_train)

# Predict on the validation set and compute RMSE and R2
y_pred_final = final_model.predict(X_valid)
final_rmse = np.sqrt(mean_squared_error(y_valid, y_pred_final))
final_r2 = r2_score(y_valid, y_pred_final)

print(f"Final RMSE on validation set: {final_rmse}")
print(f"Final R2 on validation set: {final_r2}")

Output:

Best trial: {'num_leaves': 97, 'learning_rate': 0.013163137448188754, 'n_estimators': 372, 'max_depth': 11, 'subsample': 0.8474988867349187, 'colsample_bytree': 0.7064845955811748, 'min_child_samples': 5, 'reg_alpha': 0.0011685340064003379, 'reg_lambda': 0.041584313394230084}
Best RMSE: 26248.97344413891
Final RMSE on validation set: 26248.97344413891
Final R2 on validation set: 0.910172189779164

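Based on these results, LightGBM looks like the better model: on the same validation split its R² is about 0.910, against about 0.850 for ridge regression. A word on evaluating regression models, using RMSE as the example. Smaller is better, but how small counts as good and how large counts as bad? RMSE is expressed in the units of the target (here, the sale price), so there is no universal cutoff. It works best as a relative metric: if one run gives an RMSE of 1000 and the next gives 998 on the same data, the second is the better one. The judgment comes from comparison, not from an absolute number.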
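For readers who want to see what the two metrics actually compute, here is a small worked example on three made-up prices and predictions (the numbers are purely illustrative). It calculates RMSE and R² by hand with NumPy; `mean_squared_error` and `r2_score` from scikit-learn, used in the code above, follow the same definitions.

```python
import numpy as np

# Made-up true values and predictions
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 320.0])

errors = y_pred - y_true            # [10, -10, 20]
mse = np.mean(errors ** 2)          # (100 + 100 + 400) / 3 = 200
rmse = np.sqrt(mse)                 # about 14.14, in the same units as y

# R^2 compares the model's squared error with that of always predicting the mean
ss_res = np.sum((y_true - y_pred) ** 2)           # 600
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 20000
r2 = 1 - ss_res / ss_tot                          # 0.97

print(rmse, r2)
```

RMSE depends on the scale of the target, while R² is unitless, which is why R² is the easier number to compare across datasets.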

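1. Optuna is used to tune the hyperparameters of the LightGBM regression model, with the goal of finding the combination that minimizes RMSE.

2. The tuned hyperparameters include the tree depth, the number of leaves, the learning rate, and others.

3. A final regression model is trained with the best hyperparameters, and its RMSE and R² on the validation set are computed.

Because the dataset is small, overfitting is a real risk, which is why L1 and L2 regularization are included in the hyperparameter search. Note that the two RMSE values printed above are both computed on the validation set; they match because the final model is refit with the same parameters on the same training data. On their own they do not show that the model is free of overfitting. That check requires comparing the RMSE on the training set with the RMSE on the validation set. Also, since the hyperparameters were chosen using this same validation set, the reported validation score is likely to be somewhat optimistic.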
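A simple way to check for overfitting is to compute the RMSE on both the training and the validation data and look at the gap. The sketch below defines a small hypothetical helper, `rmse_report`, and demonstrates it on synthetic data with a ridge model as a stand-in; the helper name, the synthetic data, and the stand-in model are all assumptions for illustration, not part of the original workflow.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def rmse_report(model, X_tr, y_tr, X_va, y_va):
    """Return (train RMSE, validation RMSE) for an already fitted model."""
    rmse_tr = np.sqrt(mean_squared_error(y_tr, model.predict(X_tr)))
    rmse_va = np.sqrt(mean_squared_error(y_va, model.predict(X_va)))
    return rmse_tr, rmse_va

# Synthetic stand-in data; in this post it would be X_train / X_valid
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_model = Ridge(alpha=1.0).fit(X_tr, y_tr)
rmse_tr, rmse_va = rmse_report(demo_model, X_tr, y_tr, X_va, y_va)
print(f"train RMSE: {rmse_tr:.2f}, validation RMSE: {rmse_va:.2f}")
```

For the model in this post, the equivalent call would be `rmse_report(final_model, X_train, y_train, X_valid, y_valid)`. A training RMSE far below the validation RMSE would point to overfitting.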

IV. Run the Model on the Test Data to Predict House Prices

y_pred_test = final_model.predict(test_data4)
# Put the predictions into a DataFrame
y_pred_df = pd.DataFrame({
    'Id': test_data['Id'],
    'SalePrice': y_pred_test
})

# Save the predictions to a CSV file
y_pred_df.to_csv('predictions.csv', index=False)
y_pred_df

Output:

        Id      SalePrice
0     1461  128989.106316
1     1462  155402.491796
2     1463  173423.163568
3     1464  184025.799434
4     1465  200870.139148
...    ...            ...
1454  2915   84714.331635
1455  2916   89781.868635
1456  2917  171236.073006
1457  2918  121141.145259
1458  2919  220957.998442

1459 rows × 2 columns

The model's predictions are now saved as a CSV file.

V. Show Feature Importance

import matplotlib.pyplot as plt
# Plot the feature importances
lgb.plot_importance(final_model, importance_type='split', max_num_features=10, figsize=(10, 6))
plt.title('Feature Importance (Split)')
plt.show()