A saying circulates widely in the industry: data and features determine the ceiling of machine learning, while models and algorithms merely approximate that ceiling. Feature engineering therefore occupies a central place in machine learning, and in practice it is often the key to a successful project.
Feature engineering is the most time- and labor-consuming part of data analysis. Unlike algorithms and models, it does not follow a fixed sequence of steps; it relies more on engineering experience and trade-offs, so there is no single unified method. This article summarizes some commonly used techniques.
Feature engineering covers several subproblems, including data preprocessing, feature extraction, feature selection, and feature construction.
Import the necessary packages:
import numpy as np
import pandas as pd
import re
import sys
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import gc
# Setting configuration.
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
SEED = 42
# Time Series Features
This case study itself does not involve time-series features, but in many other scenarios handling them can be quite involved. There are many ways to process time-series features; following the scikit-learn examples on time-related features, we briefly introduce a few representative methods. They help us understand and exploit the patterns and trends in time-series data, improving model performance and predictive accuracy.
References:
Time-related feature engineering
Lagged features for time series forecasting
## Data Exploration
# We start by loading the data from the OpenML repository.
from sklearn.datasets import fetch_openml
bike_sharing = fetch_openml("Bike_Sharing_Demand", version=2, as_frame=True, parser="pandas")
df = bike_sharing.frame
df.head()
|   | season | year | month | hour | holiday | weekday | workingday | weather | temp | feel_temp | humidity | windspeed | count |
|---|--------|------|-------|------|---------|---------|------------|---------|------|-----------|----------|-----------|-------|
| 0 | spring | 0 | 1 | 0 | False | 6 | False | clear | 9.84 | 14.395 | 0.81 | 0.0 | 16 |
| 1 | spring | 0 | 1 | 1 | False | 6 | False | clear | 9.02 | 13.635 | 0.80 | 0.0 | 40 |
| 2 | spring | 0 | 1 | 2 | False | 6 | False | clear | 9.02 | 13.635 | 0.80 | 0.0 | 32 |
| 3 | spring | 0 | 1 | 3 | False | 6 | False | clear | 9.84 | 14.395 | 0.75 | 0.0 | 13 |
| 4 | spring | 0 | 1 | 4 | False | 6 | False | clear | 9.84 | 14.395 | 0.75 | 0.0 | 1 |
X = df.drop('count', axis=1)
y = df['count']
The target variable is the hourly number of shared-bike rentals.
# Get a quick understanding of the periodic patterns of the data
fig, ax = plt.subplots(figsize=(12, 4))
average_week_demand = df.groupby(["weekday", "hour"])["count"].mean()
average_week_demand.plot(ax=ax)
_ = ax.set(
title="Average hourly bike demand during the week",
xticks=[i * 24 for i in range(7)],
xticklabels=["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"],
xlabel="Time of the week",
ylabel="Number of bike rentals",
)
Since the dataset is sorted by time, we use time-based cross-validation:
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
ts_cv = TimeSeriesSplit(
n_splits=3, # to keep the notebook fast enough on common laptops
gap=48, # 2 days data gap between train and test
max_train_size=10000, # keep train sets of comparable sizes
test_size=3000, # for 2 or 3 digits of precision in scores
)
all_splits = list(ts_cv.split(X, y))
train_idx, test_idx = all_splits[0]
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
Define an evaluation helper:
import numpy as np
from sklearn.model_selection import cross_validate
def evaluate(model, X, y, cv, model_prop=None, model_step=None):
cv_results = cross_validate(
model,
X,
y,
cv=cv,
scoring=["neg_mean_absolute_error", "neg_root_mean_squared_error"],
return_estimator=model_prop is not None,
)
if model_prop is not None:
if model_step is not None:
values = [
getattr(m[model_step], model_prop) for m in cv_results["estimator"]
]
else:
values = [getattr(m, model_prop) for m in cv_results["estimator"]]
print(f"Mean model.{model_prop} = {np.mean(values)}")
mae = -cv_results["test_neg_mean_absolute_error"]
rmse = -cv_results["test_neg_root_mean_squared_error"]
print(
f"Mean Absolute Error: {mae.mean():.3f} +/- {mae.std():.3f}\n"
f"Root Mean Squared Error: {rmse.mean():.3f} +/- {rmse.std():.3f}"
)
from sklearn.ensemble import HistGradientBoostingRegressor
gbrt = HistGradientBoostingRegressor(categorical_features="from_dtype", random_state=42)
categorical_columns = X.columns[X.dtypes == "category"]
print("Categorical features:", categorical_columns.tolist())
evaluate(gbrt, X, y, cv=ts_cv, model_prop="n_iter_")
Categorical features: ['season', 'holiday', 'workingday', 'weather']
Mean model.n_iter_ = 100.0
Mean Absolute Error: 61.872 +/- 17.473
Root Mean Squared Error: 91.119 +/- 23.725
## Time Components
Time variables are usually stored as strings, so we first need to parse the string feature into timestamps. Since the bike-sharing dataset carries no raw timestamp column, we construct some time data ourselves for the demonstration:
ts_df = pd.DataFrame([
'2019-01-01 01:22:26', '2019-02-02 04:34:52', '2019-03-03 06:16:40',
'2019-04-04 08:11:38', '2019-05-05 10:52:39', '2019-06-06 12:06:25',
'2019-07-07 14:05:25', '2019-08-08 16:51:33', '2019-09-09 18:28:28',
'2019-10-10 20:55:12', '2019-11-11 22:55:12', '2019-12-12 00:55:12',
], columns=['time'])
ts_df['time'] = pd.to_datetime(ts_df['time'])
print(ts_df.head())
time
0 2019-01-01 01:22:26
1 2019-02-02 04:34:52
2 2019-03-03 06:16:40
3 2019-04-04 08:11:38
4 2019-05-05 10:52:39
Next, we can decompose a timestamp into several discrete features according to its natural cycles. This approach works well for problems with clear temporal trends:
ts_df['year'] = ts_df['time'].dt.year
ts_df['month'] = ts_df['time'].dt.month
ts_df['day'] = ts_df['time'].dt.day
ts_df['hour'] = ts_df['time'].dt.hour
ts_df['minute'] = ts_df['time'].dt.minute
ts_df['second'] = ts_df['time'].dt.second
Further extraction:
ts_df['season'] = ts_df['time'].dt.quarter.map({1: "spring", 2: "summer", 3: "autumn", 4: "winter"})  # dt.quarter is 1-based
# ts_df['weekofyear'] = ts_df['time'].dt.isocalendar().week  # dt.weekofyear was removed in newer pandas
ts_df['dayofweek'] = ts_df['time'].dt.dayofweek  # Monday=0 ... Sunday=6
We can also extract the part of the day an hour falls into: early morning, morning, afternoon, or evening.
ts_df['hour_section'] = (ts_df['hour'] // 6).astype(str)
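If named labels are preferred over these numeric codes, `pd.cut` can bin the hours directly. The bin edges and label names below are illustrative choices, and the result is deliberately not assigned back to `ts_df` so the table shown later stays unchanged:
day_section = pd.cut(
    ts_df['hour'],
    bins=[0, 6, 12, 18, 24],  # [0,6) early morning, [6,12) morning, [12,18) afternoon, [18,24) evening
    labels=['early_morning', 'morning', 'afternoon', 'evening'],
    right=False,  # left-closed bins so every hour 0-23 is covered
)
print(day_section.head())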
Boolean feature extraction:
ts_df['is_leap_year'] = ts_df['time'].dt.is_leap_year  # whether the year is a leap year
ts_df['is_month_start'] = ts_df['time'].dt.is_month_start
ts_df['is_month_end'] = ts_df['time'].dt.is_month_end
ts_df['is_quarter_start'] = ts_df['time'].dt.is_quarter_start
ts_df['is_quarter_end'] = ts_df['time'].dt.is_quarter_end
ts_df['is_year_start'] = ts_df['time'].dt.is_year_start
ts_df['is_year_end'] = ts_df['time'].dt.is_year_end
ts_df['is_weekend'] = ts_df['dayofweek'].isin([5, 6])
# ts_df['is_holiday'] = ts_df['time'].isin(holiday_list)  # requires an explicit list of holiday dates
ts_df.head(5)
|   | time | year | month | day | hour | minute | second | season | dayofweek | hour_section | is_leap_year | is_month_start | is_month_end | is_quarter_start | is_quarter_end | is_year_start | is_year_end | is_weekend |
|---|------|------|-------|-----|------|--------|--------|--------|-----------|--------------|--------------|----------------|--------------|------------------|----------------|---------------|-------------|------------|
| 0 | 2019-01-01 01:22:26 | 2019 | 1 | 1 | 1 | 22 | 26 | spring | 1 | 0 | False | True | False | True | False | True | False | False |
| 1 | 2019-02-02 04:34:52 | 2019 | 2 | 2 | 4 | 34 | 52 | spring | 5 | 0 | False | False | False | False | False | False | False | True |
| 2 | 2019-03-03 06:16:40 | 2019 | 3 | 3 | 6 | 16 | 40 | spring | 6 | 1 | False | False | False | False | False | False | False | True |
| 3 | 2019-04-04 08:11:38 | 2019 | 4 | 4 | 8 | 11 | 38 | summer | 3 | 1 | False | False | False | False | False | False | False | False |
| 4 | 2019-05-05 10:52:39 | 2019 | 5 | 5 | 10 | 52 | 39 | summer | 6 | 1 | False | False | False | False | False | False | False | True |
Difference encoding: a time value can not only be decomposed into discrete parts; computing its distance from a reference time turns it into a continuous value:
ts_df['diff'] = ts_df['time'] - pd.Timestamp('2019-01-01')
ts_df['diff'].dt.days     # whole days only, ignoring the time-of-day part
ts_df['diff'].dt.seconds  # seconds within the day, ignoring whole days
ts_df['diff_hours'] = ts_df['diff'].values.astype("timedelta64[h]").astype('int')  # total hours
ts_df['diff'].values.astype("timedelta64[h]")
array([ 1, 772, 1470, 2240, 2986, 3756, 4502, 5272, 6042, 6788, 7558,
8280], dtype='timedelta64[h]')
> Note: the `days` attribute ignores the hour/minute/second part entirely, while `seconds` ignores the whole days. `diff_hours` converts the full interval into the number of elapsed hours.
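Alternatively, `Series.dt.total_seconds()` returns the full interval in seconds, days included, which sidesteps the `days`/`seconds` pitfall above (`diff_hours_alt` is a name introduced here for comparison):
ts_df['diff_hours_alt'] = (ts_df['diff'].dt.total_seconds() // 3600).astype('int')  # same hour count via seconds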
## One-Hot Encoding
Treating time steps as categories: since these time features are discrete, we can ignore their ordering and one-hot encode them, which gives a linear model much more flexibility:
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import RidgeCV
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
one_hot_encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
alphas = np.logspace(-6, 6, 25)
one_hot_linear_pipeline = make_pipeline(
ColumnTransformer(
transformers=[
("categorical", one_hot_encoder, categorical_columns),
("one_hot_time", one_hot_encoder, ["hour", "weekday", "month"]),
],
remainder=MinMaxScaler(),
),
RidgeCV(alphas=alphas),
)
evaluate(one_hot_linear_pipeline, X, y, cv=ts_cv)
Mean Absolute Error: 100.844 +/- 6.861
Root Mean Squared Error: 132.145 +/- 2.966
Plot the predictions:
one_hot_linear_pipeline.fit(X_train, y_train)
one_hot_linear_predictions = one_hot_linear_pipeline.predict(X_test)
plt.figure(figsize=(12, 4))
plt.title("Predictions by one-hot linear models")
plt.plot(y_test.values[-96:], "x-", alpha=0.2, label="Actual demand", color="black")
plt.plot(one_hot_linear_predictions[-96:], "x-", label="One-hot time features")
plt.legend()
plt.show()
## Cyclical Encoding
The fitted curve above is step-shaped because each dummy variable is treated independently, with no continuity between adjacent time steps. Yet variables such as time are periodically continuous. In this case we can apply cyclical encoding to the time features, so that the first and last values of each period remain smoothly connected after re-encoding.
### Trigonometric Transformation
We can encode each periodic feature as a sine/cosine pair. This transformation maps one periodic time feature to two new features that better express the continuity of time:
from sklearn.preprocessing import FunctionTransformer
def sin_transformer(period):
return FunctionTransformer(lambda x: np.sin(x / period * 2 * np.pi))
def cos_transformer(period):
return FunctionTransformer(lambda x: np.cos(x / period * 2 * np.pi))
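As a quick sanity check (illustrative, not from the original): hours 0 and 23 are 23 units apart as raw integers, but their sine/cosine images are near neighbors, which is exactly the continuity this encoding restores.
hours = pd.DataFrame({"hour": [0, 23]})
print(sin_transformer(24).fit_transform(hours).values.ravel())  # [ 0.    -0.259]
print(cos_transformer(24).fit_transform(hours).values.ravel())  # [ 1.     0.966]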
We can now build a feature-extraction pipeline with these transformers:
cyclic_cossin_transformer = ColumnTransformer(
transformers=[
("categorical", one_hot_encoder, categorical_columns),
("month_sin", sin_transformer(12), ["month"]),
("month_cos", cos_transformer(12), ["month"]),
("weekday_sin", sin_transformer(7), ["weekday"]),
("weekday_cos", cos_transformer(7), ["weekday"]),
("hour_sin", sin_transformer(24), ["hour"]),
("hour_cos", cos_transformer(24), ["hour"]),
],
remainder=MinMaxScaler(),
)
cyclic_cossin_linear_pipeline = make_pipeline(
cyclic_cossin_transformer,
RidgeCV(alphas=alphas),
)
evaluate(cyclic_cossin_linear_pipeline, X, y, cv=ts_cv)
Mean Absolute Error: 115.371 +/- 14.248
Root Mean Squared Error: 159.276 +/- 5.759
Plot the predictions:
cyclic_cossin_linear_pipeline.fit(X_train, y_train)
cyclic_cossin_linear_predictions = cyclic_cossin_linear_pipeline.predict(X_test)
plt.figure(figsize=(12, 4))
plt.title("Predictions by one-hot linear models")
plt.plot(y_test.values[-96:], "x-", alpha=0.2, label="Actual demand", color="black")
plt.plot(cyclic_cossin_linear_predictions[-96:], "x-", label="One-hot + Trigonometric")
plt.legend()
plt.show()
The plot shows that the sine/cosine features allow the model to pick up the main patterns, but they are not expressive enough to fully capture the dynamics of the series.
### Periodic Spline Features
Similarly, we can apply a spline transformation with a sufficient number of splines to each periodic feature:
from sklearn.preprocessing import SplineTransformer
def periodic_spline_transformer(period, n_splines=None, degree=3):
if n_splines is None:
n_splines = period
n_knots = n_splines + 1 # periodic and include_bias is True
return SplineTransformer(
degree=degree,
n_knots=n_knots,
knots=np.linspace(0, period, n_knots).reshape(n_knots, 1),
extrapolation="periodic",
include_bias=True,
)
The parameter `extrapolation="periodic"` makes the transition between the last and first values of each period smooth.
For ordered discrete features, spline encoding retains the ordering information while being more efficient than one-hot encoding, since it produces fewer output columns.
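To see what this transformer produces, we can plot the spline basis for the `hour` feature over slightly more than one full period (a sketch following the scikit-learn example; `hour_df` and `hour_splines` are helper names introduced here):
hour_df = pd.DataFrame(np.arange(26).reshape(-1, 1), columns=["hour"])
hour_splines = periodic_spline_transformer(24, n_splines=12).fit_transform(hour_df)
plt.figure(figsize=(12, 4))
plt.plot(hour_df["hour"], hour_splines)  # each curve is one periodic spline basis function
plt.title("Periodic spline-based encoding for the 'hour' feature")
plt.xlabel("Hour of the day")
plt.show()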
cyclic_spline_transformer = ColumnTransformer(
transformers=[
("categorical", one_hot_encoder, categorical_columns),
("cyclic_month", periodic_spline_transformer(12, n_splines=6), ["month"]),
("cyclic_weekday", periodic_spline_transformer(7, n_splines=3), ["weekday"]),
("cyclic_hour", periodic_spline_transformer(24, n_splines=12), ["hour"]),
],
remainder=MinMaxScaler(),
)
cyclic_spline_linear_pipeline = make_pipeline(
cyclic_spline_transformer,
RidgeCV(alphas=alphas),
)
evaluate(cyclic_spline_linear_pipeline, X, y, cv=ts_cv)
Mean Absolute Error: 98.339 +/- 6.321
Root Mean Squared Error: 132.398 +/- 2.490
Plot the predictions:
cyclic_spline_linear_pipeline.fit(X_train, y_train)
cyclic_spline_linear_predictions = cyclic_spline_linear_pipeline.predict(X_test)
plt.figure(figsize=(12, 4))
plt.title("Predictions by one-hot linear models")
plt.plot(y_test.values[-96:], "x-", alpha=0.2, label="Actual demand", color="black")
plt.plot(cyclic_spline_linear_predictions[-96:], "x-", label="One-hot + Spline")
plt.legend()
plt.show()
## Feature Interactions
Linear models do not automatically capture interaction effects between input features, so we can introduce interaction terms explicitly and then combine them with the features preprocessed above.
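Before building the full pipeline, here is a toy illustration (the input array is made up) of what `PolynomialFeatures(degree=2, interaction_only=True)` emits: products of distinct columns, but no squared terms.
from sklearn.preprocessing import PolynomialFeatures
demo = np.array([[2.0, 3.0]])
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
print(poly.fit_transform(demo))  # [[2. 3. 6.]] -> x1, x2, x1*x2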
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import PolynomialFeatures
hour_workday_interaction = make_pipeline(
ColumnTransformer(
[
("cyclic_hour", periodic_spline_transformer(24, n_splines=8), ["hour"]),
("workingday", FunctionTransformer(lambda x: x == "True"), ["workingday"]),
]
),
PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
)
# Those features are then combined with the ones already computed in the previous spline-base pipeline.
cyclic_spline_interactions_pipeline = make_pipeline(
FeatureUnion(
[
("marginal", cyclic_spline_transformer),
("interactions", hour_workday_interaction),
]
),
RidgeCV(alphas=alphas),
)
evaluate(cyclic_spline_interactions_pipeline, X, y, cv=ts_cv)
Mean Absolute Error: 84.792 +/- 6.839
Root Mean Squared Error: 110.399 +/- 7.442
cyclic_spline_interactions_pipeline.fit(X_train, y_train)
cyclic_spline_interactions_predictions = cyclic_spline_interactions_pipeline.predict(X_test)
plt.figure(figsize=(12, 4))
plt.title("Predictions by linear models")
plt.plot(y_test.values[-96:], "x-", alpha=0.2, label="Actual demand", color="black")
plt.plot(one_hot_linear_predictions[-96:], "x-", label="One-hot time features")
plt.plot(cyclic_cossin_linear_predictions[-96:], "x-", label="One-hot + Trigonometric")
plt.plot(cyclic_spline_linear_predictions[-96:], "x-", label="One-hot + Spline")
plt.plot(cyclic_spline_interactions_predictions[-96:], "x-", label="Splines + polynomial")
plt.legend()
plt.show()
## Window Features
Partitioning a time series into windows along the time axis is a common and effective technique. A sliding window has a fixed length and advances one unit at a time; a tumbling window also has a fixed length but advances by a whole window length at each step. Window analysis is very useful for smoothing noisy or rough data, as in moving averages: computing basic statistics over each window in chronological order yields features that describe the target within each time span, and comparing the same statistic across different windows reveals the overall trend of the series.
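The difference between the two window types is easy to see on a toy hourly series (an illustrative sketch; the index and values are made up): `resample` yields one non-overlapping aggregate per day, while `rolling` yields one overlapping 24-hour aggregate per hour.
idx = pd.date_range("2019-01-01", periods=72, freq="h")
s = pd.Series(np.arange(72.0), index=idx)
tumbling_daily_mean = s.resample("1D").mean()  # tumbling window: 3 values, one per day
sliding_daily_mean = s.rolling(24).mean()      # sliding window: 72 values, first 23 are NaN
print(len(tumbling_daily_mean), len(sliding_daily_mean))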
count = df["count"]
lagged_df = pd.concat(
[
count,
count.shift(1).rename("lagged_count_1h"),
count.shift(2).rename("lagged_count_2h"),
count.shift(3).rename("lagged_count_3h"),
count.shift(24).rename("lagged_count_1d"),
count.shift(24 + 1).rename("lagged_count_1d_1h"),
count.shift(7 * 24).rename("lagged_count_7d"),
count.shift(7 * 24 + 1).rename("lagged_count_7d_1h"),
count.shift(1).rolling(24).mean().rename("lagged_mean_24h"),
count.shift(1).rolling(24).max().rename("lagged_max_24h"),
count.shift(1).rolling(24).min().rename("lagged_min_24h"),
count.shift(1).rolling(7 * 24).mean().rename("lagged_mean_7d"),
count.shift(1).rolling(7 * 24).max().rename("lagged_max_7d"),
count.shift(1).rolling(7 * 24).min().rename("lagged_min_7d"),
],
axis="columns",
)
lagged_df.tail(10)
|   | count | lagged_count_1h | lagged_count_2h | lagged_count_3h | lagged_count_1d | lagged_count_1d_1h | lagged_count_7d | lagged_count_7d_1h | lagged_mean_24h | lagged_max_24h | lagged_min_24h | lagged_mean_7d | lagged_max_7d | lagged_min_7d |
|---|-------|-----------------|-----------------|-----------------|-----------------|--------------------|-----------------|--------------------|-----------------|----------------|----------------|----------------|---------------|---------------|
| 17369 | 247 | 203.0 | 224.0 | 157.0 | 160.0 | 169.0 | 70.0 | 135.0 | 93.500000 | 224.0 | 1.0 | 67.732143 | 271.0 | 1.0 |
| 17370 | 315 | 247.0 | 203.0 | 224.0 | 138.0 | 160.0 | 46.0 | 70.0 | 97.125000 | 247.0 | 1.0 | 68.785714 | 271.0 | 1.0 |
| 17371 | 214 | 315.0 | 247.0 | 203.0 | 133.0 | 138.0 | 33.0 | 46.0 | 104.500000 | 315.0 | 1.0 | 70.386905 | 315.0 | 1.0 |
| 17372 | 164 | 214.0 | 315.0 | 247.0 | 123.0 | 133.0 | 33.0 | 33.0 | 107.875000 | 315.0 | 1.0 | 71.464286 | 315.0 | 1.0 |
| 17373 | 122 | 164.0 | 214.0 | 315.0 | 125.0 | 123.0 | 26.0 | 33.0 | 109.583333 | 315.0 | 1.0 | 72.244048 | 315.0 | 1.0 |
| 17374 | 119 | 122.0 | 164.0 | 214.0 | 102.0 | 125.0 | 26.0 | 26.0 | 109.458333 | 315.0 | 1.0 | 72.815476 | 315.0 | 1.0 |
| 17375 | 89 | 119.0 | 122.0 | 164.0 | 72.0 | 102.0 | 18.0 | 26.0 | 110.166667 | 315.0 | 1.0 | 73.369048 | 315.0 | 1.0 |
| 17376 | 90 | 89.0 | 119.0 | 122.0 | 47.0 | 72.0 | 23.0 | 18.0 | 110.875000 | 315.0 | 1.0 | 73.791667 | 315.0 | 1.0 |
| 17377 | 61 | 90.0 | 89.0 | 119.0 | 36.0 | 47.0 | 22.0 | 23.0 | 112.666667 | 315.0 | 1.0 | 74.190476 | 315.0 | 1.0 |
| 17378 | 49 | 61.0 | 90.0 | 89.0 | 49.0 | 36.0 | 12.0 | 22.0 | 113.708333 | 315.0 | 1.0 | 74.422619 | 315.0 | 1.0 |
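The shifts and rolling statistics leave NaNs at the start of the series. Before fitting a model on these lagged features we would drop the incomplete rows and separate the target, e.g. (a minimal sketch; `X_lagged`/`y_lagged` are names introduced here):
lagged_df = lagged_df.dropna()                      # drop rows with missing lags
X_lagged = lagged_df.drop("count", axis="columns")  # lagged features only
y_lagged = lagged_df["count"]                       # prediction target
print(X_lagged.shape, y_lagged.shape)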
## Autocorrelation
Autocorrelation measures how strongly a series correlates with a lagged copy of itself; the lag-1 autocorrelation of the hourly demand is high:
df['count'].autocorr(1)
0.8438107986135388
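Beyond lag 1, checking the autocorrelation at the daily and weekly lags is a cheap way to corroborate the periodicity exploited by the lag features above (a sketch; the printed values are whatever the data yields):
for lag in [1, 24, 7 * 24]:  # hourly, daily, weekly lags
    print(f"lag {lag:>3}: {df['count'].autocorr(lag):.3f}")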