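Contents
- Preface
- Background
- Dataset Overview
- Loading the Dataset
- Exploratory Data Analysis (EDA)
- Visualizing the Relationship Between Features and the Target
- Missing Value Analysis
- Data Preprocessing
- Data Cleaning
- Handling Missing Values
- Removing Noise and Normalizing Text
- Data Transformation
- Data Splitting
- Modeling
- Logistic Regression
- Decision Tree Classifier
- Random Forest
- Gradient Boosted Trees
- Prediction
- Complete Python Code for LR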
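Preface
Competition page: Titanic - Machine Learning from Disaster | Kaggle
Notebook: Titanic Analysis Predictions | LR, DT, RF, GBT | Kaggle
(Versions 1-3 of the notebook contain the analysis; the end of this article includes the complete Python code for the logistic regression model only.)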
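Background
The sinking of the Titanic is one of the most infamous shipwrecks in history.
On April 15, 1912, during her maiden voyage, the Titanic, widely considered "unsinkable", sank after colliding with an iceberg. Unfortunately, there were not enough lifeboats on board, and 1502 of the 2224 passengers and crew died.
While survival involved some element of luck, some groups of people appear to have been more likely to survive than others.
In this challenge, we are asked to build a predictive model that answers the question "what sorts of people were more likely to survive?" using passenger data (i.e. name, age, sex, socio-economic class, etc.).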
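Dataset Overview
The data is split into two groups:
- Training set (train.csv)
- Test set (test.csv)
Training set: contains details for a subset of the passengers on board (891 to be exact) and, importantly, reveals whether they survived, also known as the "ground truth".
Test set: contains similar information but does not disclose the "ground truth" for each passenger. Predicting these outcomes is your job.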
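Column | Meaning |
---|---|
PassengerId | Passenger ID |
Survived | Survival (0: died, 1: survived) |
Pclass | Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd) |
Name | Name |
Sex | Sex |
Age | Age in years |
SibSp | Number of siblings/spouses aboard |
Parch | Number of parents/children aboard |
Ticket | Ticket number |
Fare | Ticket fare |
Cabin | Cabin number |
Embarked | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |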
Loading the Dataset
# Suppress warnings
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
# Load the training set
df = pd.read_csv("./titanic/train.csv")
df.sample(5, random_state=0)
Exploratory Data Analysis (EDA)
df.info()
Visualizing the Relationship Between Features and the Target
from matplotlib import pyplot as plt
import seaborn as sns
features = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
fig, axes = plt.subplots(1, 5, figsize=(15, 3), tight_layout=True)
# One panel per feature: density for survivors (1) vs. non-survivors (0)
for feature, ax in zip(features, axes):
    plt.sca(ax)
    sns.kdeplot(df.loc[df["Survived"] == 1, feature], label="1", fill=True)
    sns.kdeplot(df.loc[df["Survived"] == 0, feature], label="0", fill=True)
    plt.legend(title="Survived")
plt.show()
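Missing Value Analysis
df.isnull().sum()
# Drop missing values
data = df['Age'].dropna()
# Plot a histogram on a density scale
sns.histplot(data, kde=True, color='skyblue', label='Histogram', stat='density')
# Overlay a kernel density estimate (a smoothed empirical density, not a fitted normal curve)
sns.kdeplot(data, color='r', label='KDE')
plt.legend()
plt.show()
# Fraction of rows with a missing Cabin
df['Cabin'].isnull().mean()
plt.pie(x=df['Embarked'].value_counts().values, labels=df['Embarked'].value_counts().index, autopct='%1.1f%%')
plt.show()
- Strategy for handling missing values:
- Age is roughly bell-shaped, so each missing Age is filled with the mean Age of passengers who share the same title in Name
- Cabin is about 77% missing, which is too much to impute, so the column is dropped
- Embarked has 2 missing values, which are filled with S, the most frequent port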
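The three strategies above can be sketched on a small made-up frame before applying them to the real data. The `toy` DataFrame and every value in it are invented for illustration and are not rows from train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (assumption for illustration, not from train.csv)
toy = pd.DataFrame({
    "NameTitle": ["Mr", "Mr", "Miss", "Miss", "Mr"],
    "Age": [30.0, np.nan, 20.0, np.nan, 40.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S", "S"],
})

# 1. Age: fill each gap with the mean age of passengers sharing the same title
title_means = toy.groupby("NameTitle")["Age"].mean()
toy["Age"] = toy["Age"].fillna(toy["NameTitle"].map(title_means))

# 2. Cabin: mostly missing, so drop the column
toy = toy.drop(columns="Cabin")

# 3. Embarked: fill with the most frequent port
most_common = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common)

print(toy)
```

Assigning the result of `fillna` back to the column, rather than calling it with `inplace=True` on a column selection, avoids relying on pandas modifying the parent frame through an intermediate object.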
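Data Preprocessing
Data Cleaning
Handling Missing Values
import re
def name_title(x):
    # Extract the title (e.g. "Mr") that precedes the first period in the name
    return x.split('.')[0].split(' ')[-1]
# Count how often each title occurs
df['Name'].apply(name_title).value_counts()
def remove_noise(x):
    # Strip double quotes, periods, commas, and parentheses
    return re.sub(r'[".,()]+', '', x)
df['NameTitle'] = df['Name'].apply(name_title)
df.sample(5, random_state=0)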
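To make the title extraction concrete, here is a small standalone sketch that repeats `name_title` and applies it to a few names written in the dataset's "Surname, Title. Given names" format. The `samples` list is chosen only for illustration.

```python
def name_title(x):
    # Text before the first '.' ends with the title; keep its last word
    return x.split('.')[0].split(' ')[-1]

# Example names in the "Surname, Title. Given names" format
samples = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
]

titles = [name_title(s) for s in samples]
print(titles)  # ['Mr', 'Miss', 'Mrs']
```

The function relies on the title being the last word before the first period, so a name containing an earlier period would yield a different token.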
# Mean age per title
group_means = df.groupby('NameTitle')['Age'].mean()
# Fill each missing age with the mean age of the matching title
df['Age'] = df['Age'].fillna(df['NameTitle'].map(group_means))
df.sample(5, random_state=0)
df = df.drop('Cabin', axis=1)
# Assign the result back instead of using inplace=True on a column selection
df['Embarked'] = df['Embarked'].fillna('S')
df.head()
# Collect the non-alphabetic characters found in each name
symbols_per_cell = df['Name'].apply(lambda x: ''.join([char for char in x if not char.isalpha()]))
# Set of all distinct symbols
unique_symbols = set(''.join(symbols_per_cell))
unique_symbols
Removing Noise and Normalizing Text
def ticket_pref(x):
    # Tickets without a prefix (a single token) get the placeholder 'nan'
    if len(x.split(' ')) == 1:
        return 'nan'
    else:
        # Join everything before the last token, then drop '.' and '/' and lowercase
        x = ".".join(x.split(' ')[:-1])
        return re.sub(r'[./]+', '', x).lower()
def ticket_ID(x):
    # The last token is the ticket number; non-numeric tickets map to 0
    x = x.split(' ')[-1]
    return int(x) if x.isdigit() else 0
df['Name'] = df['Name'].apply(remove_noise)
df['TicketPref'] = df['Ticket'].apply(ticket_pref)
df['TicketID'] = df['Ticket'].apply(ticket_ID)
df.sample(5, random_state=0)
y = df['Survived']
X = df.drop(['PassengerId', 'Survived', 'Ticket', 'NameTitle'], axis=1)
X.sample(5, random_state=0)
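The ticket helpers are easier to follow with a few inputs. The sketch below restates both functions in standalone form and runs them on example ticket strings; the `examples` list is picked for illustration only.

```python
import re

def ticket_pref(x):
    parts = x.split(' ')
    # A single token means the ticket has no prefix
    if len(parts) == 1:
        return 'nan'
    # Join the leading tokens, drop '.' and '/', and lowercase
    prefix = ".".join(parts[:-1])
    return re.sub(r'[./]+', '', prefix).lower()

def ticket_ID(x):
    # Last token as an integer, or 0 when it is not purely digits
    last = x.split(' ')[-1]
    return int(last) if last.isdigit() else 0

examples = ["A/5 21171", "STON/O2. 3101282", "113803", "LINE"]
for t in examples:
    print(t, "->", ticket_pref(t), ticket_ID(t))
```

Note that a ticket such as "LINE" has neither a prefix nor a number, so it ends up as `('nan', 0)`.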
Data Transformation
- Handling text data:
- Name: feature extraction with TF-IDF (Term Frequency-Inverse Document Frequency)
- Sex, Embarked, TicketPref: feature encoding with one-hot encoding
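As a minimal sketch of the two encodings listed above, the following applies TF-IDF to a text column and one-hot encoding to a categorical column through a single `ColumnTransformer`. The `toy` frame is a made-up three-row example, not the Titanic data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame (assumption for illustration)
toy = pd.DataFrame({
    "Name": ["Braund Mr Owen Harris", "Heikkinen Miss Laina", "Allen Mr William Henry"],
    "Sex": ["male", "female", "male"],
})

ct = ColumnTransformer(transformers=[
    # A single column name (a string) passes a 1-D series, which TfidfVectorizer expects
    ("text", TfidfVectorizer(), "Name"),
    # A list of column names passes a 2-D frame, which OneHotEncoder expects
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])

features = ct.fit_transform(toy)
# One column per distinct word in Name plus one per category in Sex
print(features.shape)
```

Passing `'Name'` as a bare string rather than `['Name']` matters here: `TfidfVectorizer` works on an iterable of documents, not on a 2-D table.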
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector as selector
# Select the numeric feature columns
numeric_columns = selector(dtype_include='number')
# Define the steps of each sub-pipeline
text_transformer = Pipeline(steps=[
    ('tfidf', TfidfVectorizer())
])
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])
# Use ColumnTransformer to specify how each column is processed
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, 'Name'),
        ('categorical', categorical_transformer, ['Sex', 'Embarked', 'TicketPref']),
        ('numeric', numeric_transformer, numeric_columns)
    ])
# Create the complete pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])
# Fit the pipeline on the data and transform it
X_processed = pipeline.fit_transform(X)
Data Splitting
from sklearn.model_selection import train_test_split
# Split into a training set and a held-out validation set
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=0)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Modeling
Logistic Regression
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score, classification_report
import numpy as np
# Create the logistic regression model
lr = LogisticRegression()
# Define the parameter grid
param_grid = {
    'C': np.logspace(-3, 3, 7),
    'max_iter': list(range(5, 40, 5)),
}
# Set the scoring metric (accuracy)
scorer = make_scorer(accuracy_score)
# Create the GridSearchCV object
grid_search = GridSearchCV(
    estimator=lr,
    param_grid=param_grid,
    scoring=scorer,
    cv=5  # 5-fold cross-validation
)
# Run the grid search
grid_search.fit(X_train, y_train)
# Print the best parameters
print("Best Parameters: ", grid_search.best_params_)
# Evaluate the best model on the held-out set
lr_model = grid_search.best_estimator_
y_pred = lr_model.predict(X_test)
# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))
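In the walkthrough above, the preprocessing pipeline is fitted on all of `X` before `train_test_split`, so the scaler statistics and the TF-IDF vocabulary also see the held-out rows. One possible variation, sketched here on synthetic data rather than the Titanic features, is to put the preprocessing and the classifier in a single `Pipeline` and let `GridSearchCV` refit both inside every fold. The names `X_demo`, `y_demo`, `model`, and `search` are hypothetical, and this is an alternative arrangement, not the notebook's method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic features (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Preprocessing and classifier in one estimator, so each CV fold
# fits the scaler on its own training portion only
model = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
search = GridSearchCV(
    model,
    param_grid={"clf__C": [0.01, 0.1, 1, 10]},
    scoring="accuracy",
    cv=5,
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.score(X_te, y_te))
```

With the real data, the `ColumnTransformer` defined earlier could take the place of the `scaler` step, and the split would then be applied to the raw `X` instead of `X_processed`.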
Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier
# Create Decision Tree classifier
dt_classifier = DecisionTreeClassifier()
# Define parameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': list(range(5, 25, 5)),
    'min_samples_split': [3, 7, 12],
    'min_samples_leaf': [2, 4, 6],
}
# Set the scoring metric
scorer = make_scorer(accuracy_score)
# Create GridSearchCV object
grid_search = GridSearchCV(
    estimator=dt_classifier,
    param_grid=param_grid,
    scoring=scorer,
    cv=5  # Using 5-fold cross-validation
)
# Run grid search
grid_search.fit(X_train, y_train)
# Output the best parameters
print("Best Parameters: ", grid_search.best_params_)
# Evaluate the model on the test set
dt_model = grid_search.best_estimator_
y_pred = dt_model.predict(X_test)
# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))
Random Forest
from sklearn.ensemble import RandomForestClassifier
# Create Random Forest classifier
rf_classifier = RandomForestClassifier()
# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'criterion': ['gini', 'entropy'],
    'max_depth': [5, 10, 15],
    'min_samples_split': [3, 7, 12],
    'min_samples_leaf': [2, 4, 6],
}
# Set the scoring metric
scorer = make_scorer(accuracy_score)
# Create GridSearchCV object
grid_search = GridSearchCV(
    estimator=rf_classifier,
    param_grid=param_grid,
    scoring=scorer,
    cv=5  # Using 5-fold cross-validation
)
# Run grid search
grid_search.fit(X_train, y_train)
# Output the best parameters
print("Best Parameters: ", grid_search.best_params_)
# Evaluate the model on the test set
rf_model = grid_search.best_estimator_
y_pred = rf_model.predict(X_test)
# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))
Gradient Boosted Trees
from sklearn.ensemble import GradientBoostingClassifier
# Create Gradient Boosting classifier
gb_classifier = GradientBoostingClassifier()
# Define parameter grid
param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_samples_split': [3, 7, 12],
    'min_samples_leaf': [2, 4, 6],
}
# Set the scoring metric
scorer = make_scorer(accuracy_score)
# Create GridSearchCV object
grid_search = GridSearchCV(
    estimator=gb_classifier,
    param_grid=param_grid,
    scoring=scorer,
    cv=5  # Using 5-fold cross-validation
)
# Run grid search
grid_search.fit(X_train, y_train)
# Output the best parameters
print("Best Parameters: ", grid_search.best_params_)
# Evaluate the model on the test set
gb_model = grid_search.best_estimator_
y_pred = gb_model.predict(X_test)
# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.metrics import confusion_matrix
# y_test holds the true labels; y_scores holds each model's predicted probability of class 1
models = [lr_model, dt_model, rf_model, gb_model]
y_scores = [model.predict_proba(X_test)[:, 1] for model in models]
fig, axes = plt.subplots(2, 4, figsize=(15, 7), tight_layout=True)
fig.suptitle('ROC Curve & Confusion matrix', size=16)
for i in range(4):
    # Compute the points of the ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, y_scores[i])
    # Compute the AUC (Area Under the Curve)
    auc = roc_auc_score(y_test, y_scores[i])
    # Top row: ROC curve
    plt.sca(axes[0][i])
    plt.plot(fpr, tpr, label=f'AUC = {auc:.2f}')
    plt.plot([0, 1], [0, 1], 'k--', label='Random')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(models[i].__class__.__name__)
    plt.legend()
    # Bottom row: confusion matrix
    plt.sca(axes[1][i])
    y_pred = models[i].predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, cmap='Blues', fmt='g', linewidths=.5)
    plt.title(models[i].__class__.__name__)
    plt.xlabel('Predicted Labels')
    plt.ylabel('Real Labels')
plt.show()
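To show what the AUC and the confusion matrix in these plots measure, here is a small sketch on four hand-written labels and scores. The values in `y_true` and `y_score` are made up for illustration, and the 0.5 cut-off is an assumption matching the default decision rule for a binary probabilistic classifier.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

# Made-up labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Points of the ROC curve, one per candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# AUC: fraction of (positive, negative) pairs ranked in the right order.
# Here 3 of the 4 pairs are ordered correctly.
auc_value = roc_auc_score(y_true, y_score)
print(auc_value)  # 0.75

# Hard predictions at a 0.5 threshold, then the confusion matrix
# (rows: true labels, columns: predicted labels)
y_hat = [int(s >= 0.5) for s in y_score]
cm = confusion_matrix(y_true, y_hat)
print(cm)
```

The AUC depends only on how the scores rank the samples, while the confusion matrix depends on the chosen threshold, which is why the two can tell different stories about the same model.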
Prediction
# Load the test set
test_data = pd.read_csv("./titanic/test.csv")
# Preprocessing
test_data['NameTitle'] = test_data['Name'].apply(name_title)
# Reuse group_means, the per-title mean ages computed on the training set,
# and map them through the test set's own NameTitle column
test_data['Age'] = test_data['Age'].fillna(test_data['NameTitle'].map(group_means))
# Titles without a training mean fall back to the overall training mean
test_data['Age'] = test_data['Age'].fillna(df['Age'].mean())
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())
test_data['Name'] = test_data['Name'].apply(remove_noise)
test_data['TicketPref'] = test_data['Ticket'].apply(ticket_pref)
test_data['TicketID'] = test_data['Ticket'].apply(ticket_ID)
test = test_data.drop(['PassengerId', 'Ticket', 'Cabin', 'NameTitle'], axis=1)
test.sample(5, random_state=0)
# Transform with the pipeline fitted on the training data
X_test_processed = pipeline.transform(test)
X_test_processed.shape
# Predict with the logistic regression model
val = lr_model.predict(X_test_processed)
sub = pd.read_csv("./titanic/gender_submission.csv")
sub['Survived'] = val
sub.to_csv('./titanic/submission.csv', index=False)
print("Your submission was successfully saved!")
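Before uploading, it can help to run a quick sanity check on the submission frame. The helper below, `check_submission`, is a hypothetical function written for this sketch; it assumes the competition's expected layout of a `PassengerId` column and a 0/1 `Survived` column, and it is demonstrated on a made-up three-row frame.

```python
import pandas as pd

def check_submission(sub, n_rows):
    """Hypothetical helper: basic layout and value checks for a submission frame."""
    # Exactly the two expected columns, in order
    assert list(sub.columns) == ["PassengerId", "Survived"]
    # One prediction per test passenger
    assert len(sub) == n_rows
    # Predictions must be binary and free of missing values
    assert sub["Survived"].isin([0, 1]).all()
    # No passenger may appear twice
    assert sub["PassengerId"].is_unique
    return True

# Made-up frame for demonstration
demo = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
print(check_submission(demo, n_rows=3))
```

For the real file, the call would be along the lines of `check_submission(sub, n_rows=len(test_data))`.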
Complete Python Code for LR
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import make_column_selector as selector
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score, classification_report
import numpy as np
import re
import warnings
warnings.filterwarnings("ignore")
'''Step 1: Load dataset'''
df = pd.read_csv("titanic/train.csv")
'''Step 2: Data Preprocessing'''
def name_title(x):
    # Extract the title (e.g. "Mr") that precedes the first period
    return x.split('.')[0].split(' ')[-1]
def remove_noise(x):
    # Strip double quotes, periods, commas, and parentheses
    return re.sub(r'[".,()]+', '', x)
def ticket_pref(x):
    # Tickets without a prefix get the placeholder 'nan'
    if len(x.split(' ')) == 1:
        return 'nan'
    else:
        x = ".".join(x.split(' ')[:-1])
        return re.sub(r'[./]+', '', x).lower()
def ticket_ID(x):
    # Last token as an integer, or 0 when it is not numeric
    x = x.split(' ')[-1]
    return int(x) if x.isdigit() else 0
# Data preprocessing for the training set
def preprocessing(df):
    df = df.copy()
    # Missing Data Handling
    df['NameTitle'] = df['Name'].apply(name_title)
    # Fill in missing values (assign back instead of using inplace=True)
    df['Age'] = df['Age'].fillna(df['NameTitle'].map(df.groupby('NameTitle')['Age'].mean()))
    df['Embarked'] = df['Embarked'].fillna('S')
    # Remove Noise
    df['Name'] = df['Name'].apply(remove_noise)
    # Standardize Text Content
    df['TicketPref'] = df['Ticket'].apply(ticket_pref)
    df['TicketID'] = df['Ticket'].apply(ticket_ID)
    return df
train_df = preprocessing(df)
y = train_df['Survived']
X = train_df.drop(['PassengerId', 'Survived', 'Ticket', 'NameTitle', 'Cabin'], axis=1)
'''Step 3: Data Transformation'''
# Select the numeric feature columns
numeric_columns = selector(dtype_include='number')
# Define each step in the pipeline
text_transformer = Pipeline(steps=[
    ('tfidf', TfidfVectorizer())
])
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])
# Use ColumnTransformer to specify the processing method for each column
preprocessor = ColumnTransformer(
    transformers=[
        ('text', text_transformer, 'Name'),
        ('categorical', categorical_transformer, ['Sex', 'Embarked', 'TicketPref']),
        ('numeric', numeric_transformer, numeric_columns)
    ])
# Create a complete pipeline
pipeline = Pipeline(steps=[('preprocessor', preprocessor)])
# Fit the pipeline and transform the data
X_processed = pipeline.fit_transform(X)
'''Step 4: Data Splitting'''
# Split into a training set and a held-out validation set
X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=0.2, random_state=0)
'''Step 5: Modeling'''
# Create Logistic Regression
lr = LogisticRegression()
# Define parameter grid
param_grid = {
    'C': np.logspace(-3, 3, 7),
    'max_iter': list(range(5, 50, 1)),
}
# Set the scoring metric
scorer = make_scorer(accuracy_score)
# Create GridSearchCV object
grid_search = GridSearchCV(
    estimator=lr,
    param_grid=param_grid,
    scoring=scorer,
    cv=5  # Using 5-fold cross-validation
)
# Run grid search
grid_search.fit(X_train, y_train)
# Output the best parameters
print("Best Parameters: ", grid_search.best_params_)
# Evaluate the model on the test set
lr_model = grid_search.best_estimator_
y_pred = lr_model.predict(X_test)
# Evaluation
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(classification_report(y_test, y_pred))
'''Step 6: Predicting'''
test_data = pd.read_csv("titanic/test.csv")
test_data.head()
def preprocess_data(df):
    df = df.copy()
    # Missing Data Handling
    df['NameTitle'] = df['Name'].apply(name_title)
    # Fill in missing values using the per-title mean ages of the training set
    df['Age'] = df['Age'].fillna(df['NameTitle'].map(train_df.groupby('NameTitle')['Age'].mean()))
    df['Fare'] = df['Fare'].fillna(df['Fare'].mean())
    # Remove Noise
    df['Name'] = df['Name'].apply(remove_noise)
    # Standardize Text Content
    df['TicketPref'] = df['Ticket'].apply(ticket_pref)
    df['TicketID'] = df['Ticket'].apply(ticket_ID)
    df = df.drop(['PassengerId', 'Ticket', 'NameTitle', 'Cabin'], axis=1)
    return df
# Data preprocessing
test = preprocess_data(test_data)
# Data Transformation
X_test_processed = pipeline.transform(test)
# Predicting
val = lr_model.predict(X_test_processed)
sub = pd.read_csv("titanic/gender_submission.csv")
sub['Survived'] = val
sub.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")