

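Intermediate Machine Learning: Categorical Variables
Hands-on: working through the Housing Prices Competition for Kaggle Learn Users from start to finish and submitting successfully

Intermediate Machine Learning: Pipelines (I had previously mistranslated "pipeline" as "workflow")

Intermediate Machine Learning: Cross-Validation
Intermediate Machine Learning: XGBoost
Intermediate Machine Learning: Data Leakage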






numeric_cols = [cname for cname in train_data.columns if train_data[cname].dtype in ['int64', 'float64']]
X = train_data[numeric_cols].copy()
X_test = test_data[numeric_cols].copy()

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score

def get_score(n_estimators):
    """Return the average MAE over 3 CV folds of random forest model.

    Keyword argument:
    n_estimators -- the number of trees in the forest
    """
    my_pipeline = Pipeline(steps=[
        ('preprocessor', SimpleImputer()),
        ('model', RandomForestRegressor(n_estimators=n_estimators, random_state=0))
    ])
    # scikit-learn reports negated MAE, so multiply by -1 to get positive errors
    scores = -1 * cross_val_score(my_pipeline, X, y,
                                  cv=3,
                                  scoring='neg_mean_absolute_error')
    return scores.mean()

n_list = list(range(50, 401, 50))
results = {}
for ns in n_list:
    mean_s = get_score(ns)
    results[ns] = mean_s
import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(list(results.keys()), list(results.values()))

As a next step, you can study hyperparameter optimization, starting with grid search.
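As a rough illustration of what grid search looks like with scikit-learn, the sketch below wraps a pipeline in GridSearchCV. The synthetic data, the variable names (X_demo, y_demo) and the small parameter grid are assumptions made for the example, not part of the course exercise.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (an assumption; swap in your own X and y).
X_demo, y_demo = make_regression(n_samples=120, n_features=5,
                                 noise=10.0, random_state=0)

pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(random_state=0)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    'model__n_estimators': [10, 30],
    'model__max_depth': [3, None],
}

# Every combination in the grid is scored with 3-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=3,
                      scoring='neg_mean_absolute_error')
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_mae = -search.best_score_  # flip the sign back to a positive MAE
print(best_params, best_mae)
```

This replaces the manual loop over n_estimators with a search over several parameters at once, at the cost of fitting one model per combination per fold.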



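Gradient boosting is a method that achieves state-of-the-art (SOTA) results on many datasets in Kaggle competitions.

A random forest essentially learns with many individual decision trees, so it is called an ensemble method. Gradient boosting is another ensemble method: instead of averaging independent trees, it adds models one at a time, each one fit to the errors of the ensemble so far.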
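To make the idea concrete, here is a minimal pure-Python sketch of gradient boosting for squared error on a single feature. It uses one-split "stumps" instead of full trees, and the helper names (fit_stump, gradient_boost, mean_absolute_error) are made up for this illustration. It is not how XGBoost is implemented, only the basic loop: predict, compute residuals, fit a small model to the residuals, add it to the ensemble.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regressor on a 1-D feature by minimising squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_estimators=50, learning_rate=0.1):
    """Build an additive ensemble of stumps, each fit to the current residuals."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take a small step towards the residuals.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict


def mean_absolute_error(model, xs, ys):
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(ys)


xs = [float(i) for i in range(20)]
ys = [x * x for x in xs]              # toy target: y = x^2

mean_y = sum(ys) / len(ys)
baseline_mae = mean_absolute_error(lambda x: mean_y, xs, ys)
boosted_mae = mean_absolute_error(gradient_boost(xs, ys), xs, ys)
print(baseline_mae, boosted_mae)
```

The n_estimators and learning_rate arguments play the same role here as in XGBRegressor: more rounds with a smaller step usually fit more carefully but take longer.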



from xgboost import XGBRegressor
my_model = XGBRegressor()
my_model.fit(X_train, y_train)

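# More parameters
my_model = XGBRegressor(n_estimators=500, learning_rate=0.05, n_jobs=4)  # number of boosting rounds, learning rate, and parallel jobs
# Note: recent XGBoost releases expect early_stopping_rounds in the XGBRegressor constructor rather than in fit()
my_model.fit(X_train, y_train,
             early_stopping_rounds=5,        # stop when the validation score has not improved for 5 rounds
             eval_set=[(X_valid, y_valid)])  # validation set used to measure that score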
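The early-stopping rule itself is simple enough to sketch in plain Python. The function below is a hypothetical helper, not XGBoost's code: it scans a list of per-round validation errors and stops once the error has failed to improve for early_stopping_rounds rounds in a row.

```python
def best_round_with_early_stopping(validation_errors, early_stopping_rounds=5):
    """Return (best_round, best_error) under a simple early-stopping rule."""
    best_round, best_error = 0, float("inf")
    for round_index, error in enumerate(validation_errors):
        if error < best_error:
            best_round, best_error = round_index, error
        elif round_index - best_round >= early_stopping_rounds:
            break  # no improvement for too long, so stop training here
    return best_round, best_error


# Made-up validation errors: they improve until round 3, then drift upwards.
errors = [10.0, 8.0, 7.0, 6.5, 6.6, 6.7, 6.8, 6.9, 7.0, 5.0]
stopped_at = best_round_with_early_stopping(errors, early_stopping_rounds=5)
print(stopped_at)
```

Note that the last value (5.0) is never reached: training stops at round 8, five rounds after the best one. That is the trade-off of early stopping, and the reason to set a generous n_estimators and let the validation set decide where to stop.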

Data Leakage

There are two types of data leakage: target leakage and train-test contamination.

Target leakage



Train-test Contamination

One piece of advice: when using cross-validation, it's even more critical that you do your preprocessing inside the pipeline!
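A small sketch of the difference, using synthetic data and an imputer as the preprocessing step (the data, the variable names and the choice of LinearRegression are assumptions for the example). In the first variant the imputer is fit on all rows before cross-validation, so every validation fold has already influenced the preprocessing. In the second, the imputer lives inside the pipeline and is refit on the training part of each fold.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

# Risky: the imputer sees every row, including rows later used for validation.
X_leaky = SimpleImputer().fit_transform(X_demo)
leaky_scores = -cross_val_score(LinearRegression(), X_leaky, y_demo,
                                cv=5, scoring='neg_mean_absolute_error')

# Safer: the imputer is part of the pipeline, so each fold fits it on training rows only.
safe_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('model', LinearRegression()),
])
safe_scores = -cross_val_score(safe_pipeline, X_demo, y_demo,
                               cv=5, scoring='neg_mean_absolute_error')

print(leaky_scores.mean(), safe_scores.mean())
```

With mean imputation on data like this the two numbers are likely to be close; the gap tends to grow with more complex preprocessing such as target encoding or feature selection.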


  • card: 1 if credit card application accepted, 0 if not
  • reports: Number of major derogatory reports
  • age: Age in years plus twelfths of a year
  • income: Yearly income (divided by 10,000)
  • share: Ratio of monthly credit card expenditure to yearly income
  • expenditure: Average monthly credit card expenditure
  • owner: 1 if owns home, 0 if rents
  • selfempl: 1 if self-employed, 0 if not
  • dependents: 1 + number of dependents
  • months: Months living at current address
  • majorcards: Number of major credit cards held
  • active: Number of active credit accounts
expenditures_cardholders = X.expenditure[y]
expenditures_noncardholders = X.expenditure[~y]

print('Fraction of those who did not receive a card and had no expenditures: %.2f' \
      %((expenditures_noncardholders == 0).mean()))
print('Fraction of those who received a card and had no expenditures: %.2f' \
      %(( expenditures_cardholders == 0).mean()))
Fraction of those who did not receive a card and had no expenditures: 1.00
Fraction of those who received a card and had no expenditures: 0.02

potential_leaks = ['expenditure', 'share', 'active', 'majorcards']     # predictors that may leak the target
X2 = X.drop(potential_leaks, axis=1)

# Evaluate the model with leaky predictors removed
cv_scores = cross_val_score(my_pipeline, X2, y,
                            cv=5,                # assumed fold count (scikit-learn's default)
                            scoring='accuracy')

print("Cross-val accuracy: %f" % cv_scores.mean())   # accuracy drops considerably





Step 4: Preventing Infections
An agency that provides healthcare wants to predict which patients from a rare surgery are at risk of infection, so it can alert the nurses to be especially careful when following up with those patients.
You want to build a model. Each row in the modeling dataset will be a single patient who received the surgery, and the prediction target will be whether they got an infection.
Some surgeons may do the procedure in a manner that raises or lowers the risk of infection. But how can you best incorporate the surgeon information into the model?
You have a clever idea.

  1. Take all surgeries by each surgeon and calculate the infection rate among those surgeons.
  2. For each patient in the data, find out who the surgeon was and plug in that surgeon’s average infection rate as a feature.
    Does this pose any target leakage issues?
    Does it pose any train-test contamination issues?

This poses a risk of both target leakage and train-test contamination (though you may be able to avoid both if you are careful).
You have target leakage if a given patient’s outcome contributes to the infection rate for his surgeon, which is then plugged back into the prediction model for whether that patient becomes infected. You can avoid target leakage if you calculate the surgeon’s infection rate by using only the surgeries before the patient we are predicting for. Calculating this for each surgery in your training data may be a little tricky.
You also have a train-test contamination problem if you calculate this using all surgeries a surgeon performed, including those from the test-set. The result would be that your model could look very accurate on the test set, even if it wouldn’t generalize well to new patients after the model is deployed. This would happen because the surgeon-risk feature accounts for data in the test set. Test sets exist to estimate how the model will do when seeing new data. So this contamination defeats the purpose of the test set.
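The "only the surgeries before the patient" idea can be sketched as a running count per surgeon. The function name and the toy records below are hypothetical; the sketch assumes the rows are sorted chronologically and come from the training set only, which also keeps test-set surgeries out of the feature.

```python
from collections import defaultdict


def prior_infection_rates(surgeries):
    """Compute a leak-free surgeon feature.

    surgeries: list of (surgeon_id, infected) tuples in chronological order.
    Returns one value per surgery: the surgeon's infection rate over strictly
    earlier surgeries, or None when the surgeon has no history yet.
    """
    counts = defaultdict(lambda: [0, 0])  # surgeon -> [infections so far, surgeries so far]
    rates = []
    for surgeon, infected in surgeries:
        infections, total = counts[surgeon]
        # Use only what was known before this surgery took place.
        rates.append(infections / total if total else None)
        # Update the history after the feature has been recorded.
        counts[surgeon][0] += int(infected)
        counts[surgeon][1] += 1
    return rates


history = [("A", 1), ("B", 0), ("A", 0), ("A", 1), ("B", 1)]
rates = prior_infection_rates(history)
print(rates)  # [None, None, 1.0, 0.5, 0.0]
```

Because the current patient's outcome is added to the counts only after the feature is written, no patient's own result feeds back into their own row.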



# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Load original data
from sklearn.model_selection import train_test_split

X_full = pd.read_csv("/kaggle/input/home-data-for-ml-course/train.csv")
X_test = pd.read_csv("/kaggle/input/home-data-for-ml-course/test.csv")

X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)

# X_train, X_valid, y_train, y_valid = train_test_split(X_full, y, train_size=0.8, test_size=0.2,
                                                      # random_state=0)

print("Load data successfully.")

# print(X_full.isnull().sum()[X_full.isnull().sum()>0])
# # Drop columns that have too many missing values
# X_drop_cols = [col for col in X_full.columns if X_full[col].isnull().sum() > 100]
# X_full.drop(X_drop_cols, axis=1, inplace=True)

numerical_cols = [col for col in X_full.columns if X_full[col].dtype in ["int64", "float64"]]
categorical_cols = [col for col in X_full.columns if X_full[col].dtype == "object"]

# print(X_drop_cols)

# define pipeline
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score

numerical_transformer = SimpleImputer(strategy="constant")

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy="most_frequent")),
    ('one_hot', OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

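def get_score(model):
    """Return the cross-validated MAE of the given model on the full training data."""
    my_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', model)
    ])
    scores = -1 * cross_val_score(my_pipeline, X_full, y,
                                  cv=5,  # fold count was lost in the original; 5 is an assumption
                                  scoring='neg_mean_absolute_error')
    return scores.mean()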

print("get_score defined.")

# Pick the best model
from xgboost import XGBRegressor
# my_model = XGBRegressor(n_estimators=2000, 
#                         learning_rate=0.01,
#                         random_state=0,
#                        n_jobs=4)
# s = get_score(my_model)
# print(f"MAE is {s}")
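  • Original baseline model: 17468
  • Dropping columns with more than 10 missing values: 17562
  • Dropping columns with more than 40 missing values: 17524
  • Dropping columns with more than 100 missing values: 17516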


  • epoch-200: 17489
  • epoch-300: 17467
  • epoch-400: 17463
  • epoch-450: 17467


  • rounds-450: 17818
  • rounds-600: 17504
  • rounds-700: 17403
  • rounds-800: 17343
  • rounds-900: 17319
  • rounds-1000: 17307
  • rounds-1500: 17271
  • rounds-2000: 17268
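# Arguments after n_estimators were truncated in the original;
# the values below mirror the commented-out settings tried above (an assumption).
final_model = XGBRegressor(n_estimators=2000,
                           learning_rate=0.01,
                           random_state=0,
                           n_jobs=4)
final_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', final_model)
])
final_pipeline.fit(X_full, y)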

predictions = final_pipeline.predict(X_test)
print("Predictions on test set:", predictions)

output = pd.DataFrame({'Id': X_test.Id,
                      'SalePrice': predictions})
output.to_csv("submission.csv", index=False)
print("Sub saved")






