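- 🍨 This post is a study log from the 🔗365天深度学习训练营 (365-Day Deep Learning Training Camp)
- 🍖 Original author: K同学啊 | tutoring and custom projects available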
I. My Environment
1. Language: Python 3.9
2. IDE: PyCharm
3. Deep learning framework: TensorFlow 2.10.0
II. GPU Setup
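import os
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tensorflow as tf
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential

# Work around the duplicate OpenMP runtime error that some setups raise
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    gpu0 = gpus[0]  # if there are several GPUs, use only the first one
    tf.config.experimental.set_memory_growth(gpu0, True)  # allocate GPU memory on demand
    tf.config.set_visible_devices([gpu0], "GPU")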
III. Importing the Data
data = pd.read_csv("data/weather.csv")
df = data.copy()
print(data.head())
Output:
Date Location MinTemp ... Temp3pm RainToday RainTomorrow
0 2008-12-01 Albury 13.4 ... 21.8 No No
1 2008-12-02 Albury 7.4 ... 24.3 No No
2 2008-12-03 Albury 12.9 ... 23.2 No No
3 2008-12-04 Albury 9.2 ... 26.5 No No
4 2008-12-05 Albury 17.5 ... 29.7 No No
[5 rows x 23 columns]
print(data.describe())
Output:
MinTemp MaxTemp ... Temp9am Temp3pm
count 143975.000000 144199.000000 ... 143693.000000 141851.00000
mean 12.194034 23.221348 ... 16.990631 21.68339
std 6.398495 7.119049 ... 6.488753 6.93665
min -8.500000 -4.800000 ... -7.200000 -5.40000
25% 7.600000 17.900000 ... 12.300000 16.60000
50% 12.000000 22.600000 ... 16.700000 21.10000
75% 16.900000 28.200000 ... 21.600000 26.40000
max 33.900000 48.100000 ... 40.200000 46.70000
[8 rows x 16 columns]
print(data.dtypes)
Output:
Date object
Location object
MinTemp float64
MaxTemp float64
Rainfall float64
Evaporation float64
Sunshine float64
WindGustDir object
WindGustSpeed float64
WindDir9am object
WindDir3pm object
WindSpeed9am float64
WindSpeed3pm float64
Humidity9am float64
Humidity3pm float64
Pressure9am float64
Pressure3pm float64
Cloud9am float64
Cloud3pm float64
Temp9am float64
Temp3pm float64
RainToday object
RainTomorrow object
dtype: object
# Convert the Date column to datetime, then split it into year, month, and day
data['Date'] = pd.to_datetime(data['Date'])
data['year'] = data['Date'].dt.year
data['Month'] = data['Date'].dt.month
data['day'] = data['Date'].dt.day
print(data.head())
Output:
Date Location MinTemp MaxTemp ... RainTomorrow year Month day
0 2008-12-01 Albury 13.4 22.9 ... No 2008 12 1
1 2008-12-02 Albury 7.4 25.1 ... No 2008 12 2
2 2008-12-03 Albury 12.9 25.7 ... No 2008 12 3
3 2008-12-04 Albury 9.2 28.0 ... No 2008 12 4
4 2008-12-05 Albury 17.5 32.3 ... No 2008 12 5
[5 rows x 26 columns]
data.drop('Date', axis=1, inplace=True)
print(data.columns)
Output:
Index(['Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine',
'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
'Temp3pm', 'RainToday', 'RainTomorrow', 'year', 'Month', 'day'],
dtype='object')
IV. Data Analysis
plt.figure(figsize=(10, 8))
# numeric_data.corr() gives the pairwise correlation between the numeric columns
numeric_data = data.select_dtypes(include=[np.number])
ax = sns.heatmap(numeric_data.corr(), square=True, annot=True, fmt='.2f')
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
plt.show()
Output: [figure: correlation heatmap of the numeric features]
Whether it rains:
# Set the style and color palette
sns.set(style="whitegrid", palette="Set2")
# Create a layout with 1 row and 2 columns
fig, axes = plt.subplots(1, 2, figsize=(10, 4))  # figure size (10, 4)
# Title style for both charts
title_font = {'fontsize': 14, 'fontweight': 'bold', 'color': 'darkblue'}
# First chart: RainTomorrow
sns.countplot(x='RainTomorrow', data=data, ax=axes[0], edgecolor='black')
axes[0].set_title('Rain Tomorrow', fontdict=title_font)  # title
axes[0].set_xlabel('Will it Rain Tomorrow?', fontsize=12)  # x-axis label
axes[0].set_ylabel('Count', fontsize=12)  # y-axis label
axes[0].tick_params(axis='x', labelsize=11)  # x-axis tick font size
axes[0].tick_params(axis='y', labelsize=11)  # y-axis tick font size
# Second chart: RainToday
sns.countplot(x='RainToday', data=data, ax=axes[1], edgecolor='black')
axes[1].set_title('Rain Today', fontdict=title_font)  # title
axes[1].set_xlabel('Did it Rain Today?', fontsize=12)  # x-axis label
axes[1].set_ylabel('Count', fontsize=12)  # y-axis label
axes[1].tick_params(axis='x', labelsize=11)  # x-axis tick font size
axes[1].tick_params(axis='y', labelsize=11)  # y-axis tick font size
sns.despine()  # remove the top and right borders
plt.tight_layout()  # adjust the layout so the charts do not overlap
plt.savefig("02.png")
plt.show()
Output: [figure: count plots of RainTomorrow and RainToday]
x = pd.crosstab(data['RainTomorrow'], data['RainToday'])
print(x)
Output:
RainToday No Yes
RainTomorrow
No 92728 16858
Yes 16604 14597
y = x / x.transpose().sum().values.reshape(2, 1) * 100
print(y)
Output:
RainToday No Yes
RainTomorrow
No 84.616648 15.383352
Yes 53.216243 46.783757
y.plot(kind="bar", figsize=(4, 3), color=['#006666', '#d279a6']);
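The row percentages above can also be computed in a single call. The sketch below uses the `normalize="index"` option of `pd.crosstab` on a small made-up frame; the `toy` data is an assumption for illustration and is not taken from the weather dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for the RainToday / RainTomorrow columns.
toy = pd.DataFrame({
    "RainToday":    ["No", "No", "No", "Yes", "Yes", "No", "Yes", "No"],
    "RainTomorrow": ["No", "No", "Yes", "Yes", "No", "No", "Yes", "No"],
})

# normalize="index" divides every row by its row total, so each row holds
# the share of RainToday values for one RainTomorrow outcome.
row_pct = pd.crosstab(toy["RainTomorrow"], toy["RainToday"], normalize="index") * 100
print(row_pct.round(2))
```

Each row sums to 100, which matches what the manual division by the row totals produces.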
Location and rainfall:
x = pd.crosstab(data['Location'], data['RainToday'])
# Percentage of rainy and non-rainy days for each city
y = x / x.transpose().sum().values.reshape((-1, 1)) * 100
# Sort cities by their percentage of rainy days
y = y.sort_values(by='Yes', ascending=True)
color = ['#cc6699', '#006699', '#006666', '#862d86', '#ff9966']
y.Yes.plot(kind="barh", figsize=(15, 20), color=color)
Effect of humidity and pressure on rain:
data.columns
plt.figure(figsize=(8,6))
sns.scatterplot(data=data,x='Pressure9am',y='Pressure3pm',hue='RainTomorrow');
plt.savefig("04.png")
plt.show()
plt.figure(figsize=(8,6))
sns.scatterplot(data=data,x='Humidity9am',
y='Humidity3pm',hue='RainTomorrow');
plt.savefig("05.png")
plt.show()
Effect of temperature on rain:
plt.figure(figsize=(8,6))
sns.scatterplot(x='MaxTemp', y='MinTemp',
data=data, hue='RainTomorrow');
plt.savefig("06.png")
plt.show()
V. Data Preprocessing
# Percentage of missing values in each column
data.isnull().sum()/data.shape[0]*100
Output:
Location 0.000000
MinTemp 1.020899
MaxTemp 0.866905
Rainfall 2.241853
Evaporation 43.166506
Sunshine 48.009762
WindGustDir 7.098859
WindGustSpeed 7.055548
WindDir9am 7.263853
WindDir3pm 2.906641
WindSpeed9am 1.214767
WindSpeed3pm 2.105046
Humidity9am 1.824557
Humidity3pm 3.098446
Pressure9am 10.356799
Pressure3pm 10.331363
Cloud9am 38.421559
Cloud3pm 40.807095
Temp9am 1.214767
Temp3pm 2.481094
RainToday 2.241853
RainTomorrow 2.245978
year 0.000000
Month 0.000000
day 0.000000
dtype: float64
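# Fill missing values with values drawn at random from the same column
lst = ['Evaporation', 'Sunshine', 'Cloud9am', 'Cloud3pm']
for col in lst:
    fill_list = data[col].dropna()
    # index=data.index keeps the random values aligned with the rows of data
    random_fill = pd.Series(np.random.choice(fill_list, size=len(data.index)), index=data.index)
    data[col] = data[col].fillna(random_fill)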
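The random fill can also be written as a small helper that draws exactly one value per missing entry and takes a seed, so the result is repeatable. This is a minimal sketch; the function name `fill_with_random_samples` and the toy column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def fill_with_random_samples(series: pd.Series, seed: int = 0) -> pd.Series:
    """Fill NaNs with values drawn at random from the observed values."""
    rng = np.random.default_rng(seed)        # seeded so the result is repeatable
    observed = series.dropna().to_numpy()    # pool of values to draw from
    missing = series.isna()                  # boolean mask of the gaps
    filled = series.copy()
    # Draw exactly one replacement per missing entry.
    filled.loc[missing] = rng.choice(observed, size=int(missing.sum()))
    return filled

# Hypothetical column with gaps.
col = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
out = fill_with_random_samples(col)
print(out)
```

Sampling from the observed values keeps the column's distribution roughly intact, which is why it suits columns with a large share of missing data.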
s = (data.dtypes == "object")
object_cols = list(s[s].index)
object_cols
['Location',
'WindGustDir',
'WindDir9am',
'WindDir3pm',
'RainToday',
'RainTomorrow']
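# Assign the result back to the column; calling fillna(inplace=True) on a
# column selection is deprecated in recent pandas versions
# data[i].mode()[0] returns the most frequent value (the mode)
for i in object_cols:
    data[i] = data[i].fillna(data[i].mode()[0])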
t = (data.dtypes == "float64")
num_cols = list(t[t].index)
num_cols
['MinTemp',
'MaxTemp',
'Rainfall',
'Evaporation',
'Sunshine',
'WindGustSpeed',
'WindSpeed9am',
'WindSpeed3pm',
'Humidity9am',
'Humidity3pm',
'Pressure9am',
'Pressure3pm',
'Cloud9am',
'Cloud3pm',
'Temp9am',
'Temp3pm']
# .median() returns the median of the column
for i in num_cols:
    data[i] = data[i].fillna(data[i].median())
data.isnull().sum()
Location 0
MinTemp 0
MaxTemp 0
Rainfall 0
Evaporation 0
Sunshine 0
WindGustDir 0
WindGustSpeed 0
WindDir9am 0
WindDir3pm 0
WindSpeed9am 0
WindSpeed3pm 0
Humidity9am 0
Humidity3pm 0
Pressure9am 0
Pressure3pm 0
Cloud9am 0
Cloud3pm 0
Temp9am 0
Temp3pm 0
RainToday 0
RainTomorrow 0
year 0
Month 0
day 0
dtype: int64
VI. Building the Dataset
label_encoder = LabelEncoder()
for i in object_cols:
    data[i] = label_encoder.fit_transform(data[i])  # refit the encoder for each column
X = data.drop(['RainTomorrow', 'day'], axis=1).values
y = data['RainTomorrow'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=101)
scaler = MinMaxScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
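With its default range, `MinMaxScaler` maps each feature to (x - min) / (max - min), using the minimum and maximum learned in `fit`. Because `fit` only sees the training split, test values outside the training range land outside [0, 1]. Here is a minimal NumPy sketch of that arithmetic on made-up numbers (the arrays are assumptions, not weather data):

```python
import numpy as np

# Hypothetical single-feature splits (not taken from the weather data).
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

# "fit": learn the per-column min and max from the training split only.
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# "transform": apply the same statistics to both splits.
train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)

print(train_scaled.ravel())  # values 0.0, 0.5, 1.0
print(test_scaled.ravel())   # values 0.25, 1.5 (40 lies above the training max)
```

Fitting on the training split alone keeps information about the test set out of the preprocessing step.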
VII. Predicting Whether It Will Rain
model = Sequential()
model.add(Dense(units=24, activation='tanh'))
model.add(Dense(units=18, activation='tanh'))
model.add(Dense(units=23, activation='tanh'))
model.add(Dropout(0.5))
model.add(Dense(units=12, activation='tanh'))
model.add(Dropout(0.2))
model.add(Dense(units=1, activation='sigmoid'))  # sigmoid output for binary classification
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)
model.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=["accuracy"])
early_stop = EarlyStopping(monitor='val_loss',
                           mode='min',
                           min_delta=0.001,
                           verbose=1,
                           patience=25,
                           restore_best_weights=True)
VIII. Training the Model
model.fit(x=X_train,
          y=y_train,
          validation_data=(X_test, y_test),  # the test split doubles as the validation set
          verbose=1,
          callbacks=[early_stop],
          epochs=10,
          batch_size=32)
Epoch 1/10
3410/3410 [==============================] - 8s 2ms/step - loss: 0.4558 - accuracy: 0.8031 - val_loss: 0.3886 - val_accuracy: 0.8328
Epoch 2/10
3410/3410 [==============================] - 7s 2ms/step - loss: 0.3971 - accuracy: 0.8324 - val_loss: 0.3785 - val_accuracy: 0.8374
Epoch 3/10
3410/3410 [==============================] - 16s 5ms/step - loss: 0.3896 - accuracy: 0.8355 - val_loss: 0.3757 - val_accuracy: 0.8382
Epoch 4/10
3410/3410 [==============================] - 15s 5ms/step - loss: 0.3859 - accuracy: 0.8371 - val_loss: 0.3732 - val_accuracy: 0.8389
Epoch 5/10
3410/3410 [==============================] - 15s 5ms/step - loss: 0.3837 - accuracy: 0.8376 - val_loss: 0.3720 - val_accuracy: 0.8389
Epoch 6/10
3410/3410 [==============================] - 15s 4ms/step - loss: 0.3816 - accuracy: 0.8381 - val_loss: 0.3712 - val_accuracy: 0.8394
Epoch 7/10
3410/3410 [==============================] - 15s 5ms/step - loss: 0.3798 - accuracy: 0.8391 - val_loss: 0.3723 - val_accuracy: 0.8379
Epoch 8/10
3410/3410 [==============================] - 15s 4ms/step - loss: 0.3791 - accuracy: 0.8398 - val_loss: 0.3701 - val_accuracy: 0.8392
Epoch 9/10
3410/3410 [==============================] - 15s 5ms/step - loss: 0.3782 - accuracy: 0.8391 - val_loss: 0.3706 - val_accuracy: 0.8401
Epoch 10/10
3410/3410 [==============================] - 15s 4ms/step - loss: 0.3778 - accuracy: 0.8389 - val_loss: 0.3693 - val_accuracy: 0.8397
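To see how `patience` and `min_delta` interact with this run, here is a simplified pure-Python model of the early-stopping rule, applied to the `val_loss` values from the log above. It is a sketch of the idea rather than Keras's actual implementation, and the function name `early_stop_epoch` is hypothetical.

```python
def early_stop_epoch(val_losses, patience, min_delta):
    """Return the 1-based epoch at which training would stop, or None.

    Simplified rule: an epoch counts as an improvement only if it beats
    the best loss so far by more than min_delta.
    """
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# val_loss values copied from the training log above.
val_losses = [0.3886, 0.3785, 0.3757, 0.3732, 0.3720,
              0.3712, 0.3723, 0.3701, 0.3706, 0.3693]

print(early_stop_epoch(val_losses, patience=25, min_delta=0.001))  # None
print(early_stop_epoch(val_losses, patience=2, min_delta=0.001))   # 7
```

With `patience=25` and only 10 epochs, the callback can never fire in this run. Under this simplified rule, a patience of 2 would have stopped training at epoch 7.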
IX. Visualizing the Results
acc = model.history.history['accuracy']
val_acc = model.history.history['val_accuracy']
loss = model.history.history['loss']
val_loss = model.history.history['val_loss']
epochs_range = range(len(acc))  # one point per completed epoch
plt.figure(figsize=(14, 4))
plt.subplot(1, 2, 1)
plt.plot(epochs_range, acc, label='Training Accuracy', color="#c94733")
plt.plot(epochs_range, val_acc, label='Validation Accuracy', color="#3fab47")
plt.legend(loc='lower right')
plt.title('Training and Validation Accuracy')
plt.grid(False)
plt.subplot(1, 2, 2)
plt.plot(epochs_range, loss, label='Training Loss', color="#c94733")
plt.plot(epochs_range, val_loss, label='Validation Loss', color="#3fab47")
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
plt.grid(False)
plt.show()
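An accuracy figure is easier to judge next to a majority-class baseline. The sketch below derives one from the RainTomorrow x RainToday crosstab counts printed earlier. Treat it as approximate: the crosstab leaves out rows with missing values and was computed before imputation and before the train/test split.

```python
# Counts copied from the RainTomorrow x RainToday crosstab printed earlier.
no_tomorrow = 92728 + 16858    # rows where RainTomorrow == "No"
yes_tomorrow = 16604 + 14597   # rows where RainTomorrow == "Yes"

total = no_tomorrow + yes_tomorrow
# Accuracy of always predicting the more frequent class.
baseline_accuracy = max(no_tomorrow, yes_tomorrow) / total

print(f"majority-class baseline: {baseline_accuracy:.3f}")  # about 0.778
```

Always predicting "No" already scores about 0.78, so the final validation accuracy of 0.8397 is a gain of roughly six percentage points over that baseline.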
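X. Summary
This week I worked on weather prediction, which centered on exploratory data analysis (EDA). EDA helps with the following:
- It reveals variable types, distributions and trends, missing values, and outliers.
- Handling missing values: (i) drop columns with too many missing values, typically those missing more than 50%; (ii) impute the rest. For discrete features, NaN is usually treated as its own category; for continuous features, common choices are the mean, the median, 0, or a machine-learning model. The right method depends on the business context.
- Handling outliers (mainly for continuous features), for example with winsorization (the Winsorizer method).
- Merging categories (mainly for discrete features). If a value has too few samples, merge it with other values, because small samples make the data less stable, carry no statistical significance, and can lead to wrong conclusions. Since display space is limited, usually only the values with the fewest or the most samples are shown.
- Dropping columns that take a single value.
- Dropping columns where the share of the most frequent category exceeds a threshold.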
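The cleanup steps in the list above can be sketched in pandas as follows. This is an illustration under assumptions, not the pipeline used in this post: the function name `basic_cleanup`, the thresholds, the 1st/99th percentile clipping used as winsorization, and the toy frame are all made up.

```python
import numpy as np
import pandas as pd

def basic_cleanup(df, missing_thresh=0.5, rare_thresh=0.05, dominant_thresh=0.95):
    """Apply the cleanup steps listed above; all thresholds are illustrative."""
    out = df.copy()

    # Drop columns whose share of missing values exceeds the threshold.
    out = out.loc[:, out.isna().mean() <= missing_thresh]

    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            # Continuous: fill with the median, then winsorize by clipping
            # to the 1st and 99th percentiles.
            out[col] = out[col].fillna(out[col].median())
            low, high = out[col].quantile([0.01, 0.99])
            out[col] = out[col].clip(low, high)
        else:
            # Discrete: treat NaN as its own category, then merge rare values.
            out[col] = out[col].fillna("Missing")
            share = out[col].value_counts(normalize=True)
            rare = share[share < rare_thresh].index
            out[col] = out[col].where(~out[col].isin(rare), "Other")

    # Drop columns dominated by one value (this also covers single-valued columns).
    top_share = out.apply(lambda s: s.value_counts(normalize=True).iloc[0])
    return out.loc[:, top_share <= dominant_thresh]

# Hypothetical toy frame exercising each step.
toy = pd.DataFrame({
    "temp": [10.0, 12.0, np.nan, 11.0, 50.0, 9.0, 13.0, 12.0, 10.0, 11.0],
    "mostly_missing": [np.nan] * 8 + [1.0, 2.0],
    "constant": ["a"] * 10,
    "wind": ["N", "N", "S", "S", "N", "S", None, "N", "S", "N"],
})
cleaned = basic_cleanup(toy, rare_thresh=0.15)
print(list(cleaned.columns))  # ['temp', 'wind']
```

On the toy frame, `mostly_missing` is dropped for being 80% empty, `constant` for having a single value, the extreme `temp` reading is clipped, and the lone missing `wind` entry ends up in the merged "Other" category.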