Bagging Models (Random Forest)
Bagging: train multiple classifiers and average their results
f(x) = \frac{1}{M}\sum_{m=1}^{M} f_m(x)
Full name: bootstrap aggregating (in plain terms: train a pile of classifiers in parallel).
Its most typical representative is the random forest; these days a "Bagging model" essentially means a random forest.
Random: the data are sampled randomly, so each tree uses only part of the data; the data consist of multiple features (attributes), and each tree randomly selects a subset of them. The randomness exists so that the individual classifiers differ clearly from one another.
Forest: many decision trees placed side by side in parallel.
How do we turn all the trees' outputs into one final result? For classification we can take a majority vote; for regression we can average the predictions (see the sketch below).
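A minimal sketch of the two aggregation rules (the prediction arrays below are made-up toy values, not output from this notebook):

import numpy as np

# Hypothetical hard predictions of M = 5 classifiers on 4 samples
clf_preds = np.array([[0, 1, 1, 0],
                      [0, 1, 0, 0],
                      [1, 1, 1, 0],
                      [0, 1, 1, 1],
                      [0, 0, 1, 0]])
# Classification: majority vote over the M rows
print((clf_preds.sum(axis=0) > clf_preds.shape[0] / 2).astype(int))  # [0 1 1 0]

# Hypothetical outputs of 3 regressors on 2 samples
reg_preds = np.array([[2.1, 3.0], [1.9, 3.2], [2.0, 2.8]])
# Regression: simple average over the models
print(reg_preds.mean(axis=0))  # [2. 3.]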
Bagging strategy
First, draw multiple samples from the training set (with replacement), so that every resampled set differs from the others, although within a single resample some rows may coincide (a resampling sketch follows this list).
Train a separate model, e.g. a tree model, on each resample.
At prediction time, collect every model's output and then aggregate them.
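The resampling step itself is one line of numpy; a minimal sketch, assuming a toy dataset of 10 rows and 3 models (all names here are illustrative):

import numpy as np

rng = np.random.default_rng(42)
n_samples, n_models = 10, 3
for m in range(n_models):
    # Sampling with replacement: indices can repeat within one resample,
    # and each resample differs from the others
    idx = rng.integers(0, n_samples, size=n_samples)
    print(f'model {m} trains on rows {idx.tolist()}')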
import numpy as np
import os
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12
import warnings
warnings.filterwarnings('ignore')
np.random.seed(42)
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
X,y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
plt.plot(X[:,0][y==0],X[:,1][y==0],'yo',alpha = 0.6)
plt.plot(X[:,0][y==1],X[:,1][y==1],'bs',alpha = 0.6)
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score  # used below to score predictions
# DecisionTreeClassifier: use a tree model as the base estimator
# n_estimators: train 500 classifiers
# max_samples: draw 100 samples for each one
# bootstrap=True: random sampling with replacement
bag_clf = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators = 500,
                            max_samples = 100,
                            bootstrap = True,
                            n_jobs = -1,
                            random_state = 42)
bag_clf.fit(X_train,y_train)
y_pred = bag_clf.predict(X_test)
# Result of the Bagging strategy
accuracy_score(y_test,y_pred)
Result: 0.904
# Without the Bagging strategy; for a fair comparison, random_state is 42 in both cases
tree_clf = DecisionTreeClassifier(random_state = 42)
tree_clf.fit(X_train,y_train)
y_pred_tree = tree_clf.predict(X_test)
accuracy_score(y_test,y_pred_tree)
Result: 0.856
Decision boundaries
Comparing the ensemble with the single traditional model
from matplotlib.colors import ListedColormap
def plot_decision_boundary(clf,X,y,axes=[-1.5,2.5,-1,1.5],alpha=0.5,contour=True):
    # Evaluate the classifier on a 100x100 grid spanning the axes range
    x1s=np.linspace(axes[0],axes[1],100)
    x2s=np.linspace(axes[2],axes[3],100)
    x1,x2 = np.meshgrid(x1s,x2s)
    X_new = np.c_[x1.ravel(),x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1,x2,y_pred,cmap = custom_cmap,alpha=0.3)
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1,x2,y_pred,cmap = custom_cmap2,alpha=0.8)
    plt.plot(X[:,0][y==0],X[:,1][y==0],'yo',alpha = 0.6)
    plt.plot(X[:,0][y==1],X[:,1][y==1],'bs',alpha = 0.6)
    plt.axis(axes)
    plt.xlabel('x1')
    plt.ylabel('x2')
plt.figure(figsize = (12,5))
plt.subplot(121)
plot_decision_boundary(tree_clf,X,y)
plt.title('Decision Tree')
plt.subplot(122)
plot_decision_boundary(bag_clf,X,y)
plt.title('Decision Tree With Bagging')
As the plots show, Bagging produces a visibly smoother decision boundary.
OOB (Out Of Bag) strategy
In Bagging, suppose there are 100 rows in total and each tree randomly draws only 80 of them (with replacement). "Out Of Bag" then means using the rows a tree never drew (here roughly 20) as that tree's validation set.
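As a quick sanity check of how much data ends up out of bag (a standalone sketch, not part of the pipeline above): with a full-size bootstrap sample, each row is left out with probability (1 - 1/n)^n ≈ 1/e ≈ 37%.

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
idx = rng.integers(0, n, size=n)    # one bootstrap sample of size n
print(1 - np.unique(idx).size / n)  # ~0.367, close to 1/e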
# oob_score=True enables OOB evaluation
bag_clf = BaggingClassifier(DecisionTreeClassifier(),
                            n_estimators = 500,
                            max_samples = 100,
                            bootstrap = True,
                            n_jobs = -1,
                            random_state = 42,
                            oob_score = True)
bag_clf.fit(X_train,y_train)
print('OOB (validation) score:')
bag_clf.oob_score_
OOB (validation) score:
0.9253333333333333
y_pred = bag_clf.predict(X_test)
print('Test-set score:')
accuracy_score(y_test,y_pred)
Test-set score:
0.904
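When oob_score=True, BaggingClassifier also exposes the per-instance OOB class probabilities via its oob_decision_function_ attribute, which is handy for inspection:

# OOB-estimated probabilities of class 0 and class 1
# for the first five training instances
bag_clf.oob_decision_function_[:5]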
Random Forest
from sklearn.ensemble import RandomForestClassifier
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train,y_train)
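A RandomForestClassifier behaves roughly like a BaggingClassifier over decision trees that additionally consider only a random subset of features at each split; a sketch of that rough equivalence (the hyperparameter values below are illustrative, not tuned):

# max_features='sqrt' makes each split consider a random sqrt-sized
# subset of features, which is what a random forest does internally
almost_rf = BaggingClassifier(
    DecisionTreeClassifier(max_features='sqrt'),
    n_estimators=500, n_jobs=-1, random_state=42)
almost_rf.fit(X_train, y_train)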
Feature importance:
sklearn computes feature_importances_ from how much the splits on a feature reduce impurity on average, weighted by how many samples reach those splits (the mean decrease in impurity); intuitively, important features tend to sit closer to the root of the trees.
from sklearn.datasets import load_iris
iris = load_iris()
rf_clf = RandomForestClassifier(n_estimators=500,n_jobs=-1)
rf_clf.fit(iris['data'],iris['target'])
for name,score in zip(iris['feature_names'],rf_clf.feature_importances_):
    print(name,score)
sepal length (cm) 0.11105536416721994
sepal width (cm) 0.02319505364393038
petal length (cm) 0.44036215067701534
petal width (cm) 0.42538743151183406
Which features (pixels) matter most in MNIST?
# fetch_mldata was removed from recent scikit-learn versions; use fetch_openml instead
# from sklearn.datasets import fetch_mldata
# mnist = fetch_mldata('MNIST original')
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784', version=1, cache=True)
rf_clf = RandomForestClassifier(n_estimators=500,n_jobs=-1)
rf_clf.fit(mnist['data'],mnist['target'])
rf_clf.feature_importances_.shape
# (784,)
def plot_digit(data):
    # Reshape the flat (784,) feature_importances_ vector into a square 28x28 image
    image = data.reshape(28,28)
    plt.imshow(image,cmap=matplotlib.cm.hot)
    plt.axis('off')
plot_digit(rf_clf.feature_importances_)
cbar = plt.colorbar(ticks=[rf_clf.feature_importances_.min(),rf_clf.feature_importances_.max()])
cbar.ax.set_yticklabels(['Not important','Very important'])