Machine Learning: Optimizing a Random Forest Temperature Prediction Model

Contents

- Preface
- Training the old model
- Training the new model
  - Inspecting the parameters
  - Combining the parameters
  - Training
  - Model evaluation

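Preface

The article "Machine Learning: Predicting Temperature with the Random Forest Algorithm" improved model performance by enlarging the training data set and adding training features. This article records a third optimization approach: adjusting the parameters used to construct the random forest model, i.e. parameter tuning. Tuning here is not quite the same idea as tuning hyperparameters on a validation set for a neural network, so no separate validation set is used. Instead, the RandomizedSearchCV() function cross-validates models built from different parameter combinations and selects the combination that performs best.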
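To make the cross-validation idea concrete, here is a minimal sketch of how a 3-fold split rotates the validation role among parts of the training data. It is plain Python written for illustration and is not scikit-learn's own splitting code; the helper name `k_fold_indices` and the 9-sample size are assumptions made for this example.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


splits = list(k_fold_indices(n_samples=9, n_folds=3))
for train_idx, val_idx in splits:
    print("train:", train_idx, "validate:", val_idx)
```

Every sample is used for validation exactly once across the three rounds, which is why the search can score each parameter combination without holding out a dedicated validation set.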

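Training the old model

To shorten the time spent on the parameter search and to allow a comparison with the old model, only part of the data is used first: the 2016 data, without the three features ws_1, prcp_1 and snwd_1. See the article "Machine Learning: Predicting Temperature with the Random Forest Algorithm" for details.

Its evaluation results are as follows:

Mean absolute error: 4.16
score: 0.843355562598595
MAE: 4.16409589041096
MSE: 26.98129152054795
RMSE: 5.194351886477075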

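Training the new model

The data set and feature selection are the same as for the old model; only the model construction parameters are tuned. The exploratory data analysis and preprocessing steps are unchanged. The differences begin at model construction:

Inspecting the parameters

# Build the random forest model
from pprint import pprint

from sklearn.ensemble import RandomForestRegressor

# Create the regressor with its default parameters
rf = RandomForestRegressor(random_state=42)
# Pretty-print the model's parameter dictionary
pprint(rf.get_params())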

The output is:

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 42,
 'verbose': 0,
 'warm_start': False}

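Consulting the official API documentation, official site address:

shows that every one of these parameters can be specified explicitly. Different parameter values naturally give different results, so the next step defines a range of candidate values for each parameter and lets the search train and evaluate random forests to find the best combination.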
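As a small illustration of specifying parameters explicitly, the sketch below overrides a few defaults and reads them back. The particular values (200 trees, depth 10, leaf size 2) and the name `rf_custom` are chosen only for demonstration and are not taken from the article.

```python
from sklearn.ensemble import RandomForestRegressor

# Override a few defaults at construction time (illustrative values only)
rf_custom = RandomForestRegressor(n_estimators=200, max_depth=10, random_state=42)

# set_params changes parameters on an existing, not yet fitted estimator
rf_custom.set_params(min_samples_leaf=2)

params = rf_custom.get_params()
print(params["n_estimators"], params["max_depth"], params["min_samples_leaf"])
```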

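Combining the parameters

import numpy as np
from sklearn.model_selection import RandomizedSearchCV

# Number of trees: 100, 200, ..., 1000
n_estimators = [int(x) for x in np.linspace(start=100, stop=1000, num=10)]
# Number of features considered at each split
max_features = [1.0, 'sqrt', 'log2']
# Maximum tree depth; None means no limit
max_depth = [int(x) for x in np.linspace(10, 200, 10)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at a leaf
min_samples_leaf = [1, 2, 4]
# Whether each tree is built on a bootstrap sample
bootstrap = [True, False]

random_param = {'bootstrap': bootstrap,
                'max_depth': max_depth,
                'max_features': max_features,
                'min_samples_leaf': min_samples_leaf,
                'min_samples_split': min_samples_split,
                'n_estimators': n_estimators
                }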

The above is just one possible range of parameter combinations, a simple enumeration based on the API documentation.
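It helps to see how large this search space is, since that is the reason for sampling combinations at random instead of trying all of them. The sketch below rebuilds the same candidate lists in plain Python (no NumPy, so the block is self-contained) and counts the combinations; the names `sizes`, `total_combinations`, `n_iter` and `cv` are introduced here for illustration, with `n_iter=100` and `cv=3` matching the search call used in this article.

```python
import math

# Candidate lists mirroring the ones defined above
n_estimators = list(range(100, 1001, 100))                        # 10 values
max_features = [1.0, "sqrt", "log2"]                              # 3 values
max_depth = [int(10 + i * 190 / 9) for i in range(10)] + [None]   # 11 values
min_samples_split = [2, 5, 10]                                    # 3 values
min_samples_leaf = [1, 2, 4]                                      # 3 values
bootstrap = [True, False]                                         # 2 values

sizes = [len(n_estimators), len(max_features), len(max_depth),
         len(min_samples_split), len(min_samples_leaf), len(bootstrap)]
total_combinations = math.prod(sizes)

n_iter, cv = 100, 3
print("total combinations:", total_combinations)
print("fraction sampled:", n_iter / total_combinations)
print("cross-validation fits:", n_iter * cv)
```

An exhaustive grid search over this space would need 5940 combinations times 3 folds, whereas the randomized search fits 300 models during cross-validation and then refits the best combination once on the full training set.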

Training

rf_random = RandomizedSearchCV(estimator=rf,param_distributions=random_param,n_iter=100,scoring='neg_mean_absolute_error',cv=3,random_state=42)
rf_random.fit(train_features,train_labels)

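The model then starts training, as shown in the figure below:
[Figure: training in progress]

Once the training run has finished, print the best parameters it found: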

pprint(rf_random.best_params_)

The output is:

{'bootstrap': True,
 'max_depth': 73,
 'max_features': 1.0,
 'min_samples_leaf': 2,
 'min_samples_split': 10,
 'n_estimators': 600}
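Because `train_features` and `train_labels` come from the earlier article, the following is a self-contained sketch of the same search pattern on synthetic data. The data, the reduced search space `small_space`, and `n_iter=5` are all assumptions chosen so the example runs in seconds; they are not the article's data or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for train_features / train_labels
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=120)

# A deliberately small search space (hypothetical values)
small_space = {
    "n_estimators": [10, 20, 30],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_distributions=small_space,
    n_iter=5,                           # sample 5 of the 27 combinations
    scoring="neg_mean_absolute_error",  # higher is better, so MAE is negated
    cv=3,
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
# best_score_ is the mean cross-validated score of the best combination
print("cross-validated MAE:", -search.best_score_)
```

Note that with `scoring='neg_mean_absolute_error'` the value in `best_score_` is zero or negative, so it has to be negated to read it as an MAE.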

Model evaluation

Since the evaluation code is used repeatedly, it is wrapped in a function:

import numpy as np
import sklearn.metrics as sm


def evaluate(model, test_features, test_labels):
    # Predict on the test set
    pre = model.predict(test_features)

    # Mean absolute error, rounded for display
    errors = np.abs(pre - test_labels)
    print('Mean absolute error:', round(np.mean(errors), 2))
    # R² score (coefficient of determination)
    score = model.score(test_features, test_labels)
    print('score:', score)

    # scikit-learn metrics take (y_true, y_pred)
    mse = sm.mean_squared_error(test_labels, pre)
    print('MAE:', sm.mean_absolute_error(test_labels, pre))
    print('MSE:', mse)
    print('RMSE:', np.sqrt(mse))
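
To show what these metrics measure, here is a hand computation on three made-up values using only the standard library. The numbers in `y_true` and `y_pred` are invented for illustration; the R² formula is the coefficient of determination that a scikit-learn regressor's `score` method returns.

```python
import math

# Made-up labels and predictions, for illustration only
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]
n = len(y_true)

residuals = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(r) for r in residuals) / n   # mean absolute error
mse = sum(r * r for r in residuals) / n    # mean squared error
rmse = math.sqrt(mse)                      # root mean squared error

# R² = 1 - (residual sum of squares) / (total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum(r * r for r in residuals)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print("MAE:", mae, "MSE:", mse, "RMSE:", rmse, "R2:", r2)
```

MAE and RMSE are in the same unit as the label (degrees of temperature here), so lower is better, while R² is unitless and higher is better.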

Run the evaluation:

best_model = rf_random.best_estimator_
evaluate(best_model,test_features,test_labels)

The results are:

Mean absolute error: 4.06
score: 0.852906033295568
MAE: 4.061986168567313
MSE: 25.336266403102137
RMSE: 5.033514319350064

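Compared with the old model's evaluation results, performance improves modestly: MAE falls from 4.16 to 4.06, RMSE from 5.19 to 5.03, and the score rises from about 0.843 to about 0.853.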
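The size of the improvement can be expressed as a relative reduction in each error metric. The sketch below simply plugs in the numbers reported above; the dictionary names `old`, `new` and `relative_drop` are introduced here for illustration.

```python
# Error metrics reported above for the old and the tuned model
old = {"MAE": 4.16409589041096, "MSE": 26.98129152054795, "RMSE": 5.194351886477075}
new = {"MAE": 4.061986168567313, "MSE": 25.336266403102137, "RMSE": 5.033514319350064}

# Relative reduction: (old - new) / old
relative_drop = {name: (old[name] - new[name]) / old[name] for name in old}

for name, drop in relative_drop.items():
    print(f"{name}: {drop:.1%} lower")
```

The reductions come out at a few percent each, which fits the description of a modest gain from tuning alone on the same data and features.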
