Weekly Report, 2024/10/27

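Contents

Abstract
Deep learning
- Predicting influent and effluent water quality
  - Using the water treatment dataset from the UCI Machine Learning Repository
  - Code description
  - Code listing
  - Experimental results
- Intelligent comparison example
  - Sample data
  - Comparison steps
  - Python code example
  - Interpreting the results
  - Application scenarios
Summary
- Suggestions for improvement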

Abstract

This week I visited the wastewater treatment plant in Nanning Lingli Industrial Park and studied its treatment process in detail. I then built a deep learning model on the UCI water treatment dataset to predict the influent and effluent water quality of the treatment units. Preprocessing covers data cleaning, PCA dimensionality reduction, and sliding-window sequence construction (used here as data augmentation), with the aim of improving the model's computational efficiency and accuracy. The model, CLATT, combines a convolutional neural network (CNN), a long short-term memory network (LSTM), and an attention mechanism: stacked convolutions, LSTM layers, multi-head attention, and a residual block extract features, and a fully connected layer outputs the predicted water quality values. Training uses mean squared error as the loss function, with a learning rate scheduler and early stopping. Evaluation on the test set indicated good predictive performance. In addition, a comparison analysis based on PCA and random forest was used to identify the key feature differences between treatment units, which can support adjustments to inefficient units and improve overall treatment efficiency.

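Deep learning

Predicting influent and effluent water quality

Using the water treatment dataset from the UCI Machine Learning Repository

The code below handles data processing, model construction, training, and evaluation on the water treatment dataset.
The dataset is as follows:

• Source: the water treatment dataset from the UCI Machine Learning Repository.
• Structure: 39 columns (a date plus 38 numeric measurements) recorded during the treatment process, such as pH, biological and chemical oxygen demand (DBO, DQO), and suspended solids (SS).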

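Code description

Importing libraries
• pandas and numpy handle data processing.
• torch and torch.nn build the deep learning model.
• StandardScaler standardizes the data, train_test_split splits the dataset, and PCA reduces dimensionality.
• matplotlib is used for visualization.
Setting the random seed
To keep experiments reproducible, random.seed, np.random.seed, torch.manual_seed, and torch.cuda.manual_seed_all fix the random seeds.
Data loading and preprocessing
• Loading: the water treatment dataset is downloaded from the UCI website. It has 39 columns: a date and 38 process variables.
• Cleaning: "?" is replaced with NaN, the data is converted to float, and missing values are filled with the column mean.
• Dimensionality reduction: PCA reduces the 38 numeric features to 15 dimensions to cut the model's computational cost.
• Feature selection: in_features is the PCA output (computed from all 38 columns, including the effluent ones), and out_features holds four effluent indicators ('PH-S', 'DBO-S', 'DQO-S', 'SS-S').
• Standardization: StandardScaler is applied to both the input and the output features.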
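One point about the ordering above may be worth checking: PCA is applied before standardization, and PCA is sensitive to the scale of each column. The sketch below uses made-up data (a flow-like column with a large scale and two pH-like columns; none of these values come from the UCI dataset) and a hypothetical helper, pca_explained_variance, to show how a single large-scale column can dominate the first principal component unless the columns are standardized first.

```python
import numpy as np

def pca_explained_variance(X):
    """Explained-variance ratio of each principal component of X (via SVD)."""
    Xc = X - X.mean(axis=0)                  # center each column
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values
    var = s ** 2
    return var / var.sum()

rng = np.random.default_rng(0)
# Hypothetical columns: flow on a scale of tens of thousands, pH near 7-8
flow = rng.normal(40000.0, 6000.0, size=200)
ph_in = rng.normal(7.8, 0.2, size=200)
ph_out = rng.normal(7.7, 0.2, size=200)
X = np.column_stack([flow, ph_in, ph_out])

raw_ratio = pca_explained_variance(X)

# Standardize each column to zero mean and unit variance, then repeat
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_ratio = pca_explained_variance(Xz)

print("raw:   ", np.round(raw_ratio, 4))
print("scaled:", np.round(scaled_ratio, 4))
```

On the raw data the first component carries almost all the variance because it is essentially the flow column. After standardization the variance is spread across components. Whether this matters for the real dataset depends on its column scales, so treat this as a check to run rather than a conclusion.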
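Data augmentation
• A sliding window turns the time series into sequences. The create_sequences function builds windows of time_steps=6 consecutive rows, each paired with the target at the following time step, so the model can pick up temporal dependencies.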
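To make the windowing concrete, here is the same create_sequences function applied to a small toy array (the toy data is made up for illustration). It shows the shapes that come out and which row each target is taken from.

```python
import numpy as np

def create_sequences(X, y, time_steps=6, step=1):
    """Pair each window X[i:i+time_steps] with the target at row i+time_steps."""
    Xs, ys = [], []
    for i in range(0, len(X) - time_steps, step):
        Xs.append(X[i:i + time_steps])
        ys.append(y[i + time_steps])
    return np.array(Xs), np.array(ys)

# Toy data: 10 time steps, 2 input features, 1 target equal to the row index
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.arange(10, dtype=float).reshape(10, 1)

X_seq, y_seq = create_sequences(X_toy, y_toy, time_steps=6)

print(X_seq.shape, y_seq.shape)  # (4, 6, 2) (4, 1)
print(y_seq.ravel())             # targets come from rows 6, 7, 8 and 9
```

With 10 rows and a window of 6, there are 10 - 6 = 4 windows, and consecutive windows share 5 of their 6 rows.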
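Dataset splitting and conversion
• The generated sequences are split into training and test sets (80/20, randomly shuffled by train_test_split) and converted to PyTorch tensors for training and evaluation.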
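Because neighbouring windows overlap, a random split places windows that share most of their rows on both sides of the train/test boundary. One alternative, shown here only as a sketch and not what the listing in this report does, is a chronological split that keeps the last windows for testing. The helper name chronological_split and the toy arrays are assumptions for illustration.

```python
import numpy as np

def chronological_split(X_seq, y_seq, test_size=0.2):
    """Hypothetical helper: keep the last `test_size` fraction of windows as the test set."""
    n_test = int(round(len(X_seq) * test_size))
    split = len(X_seq) - n_test
    return X_seq[:split], X_seq[split:], y_seq[:split], y_seq[split:]

# Toy windows: 10 sequences of 6 time steps with 2 features each
X_seq = np.arange(10 * 6 * 2, dtype=float).reshape(10, 6, 2)
y_seq = np.arange(10, dtype=float).reshape(10, 1)

X_train, X_test, y_train, y_test = chronological_split(X_seq, y_seq, test_size=0.2)

print(X_train.shape, X_test.shape)  # (8, 6, 2) (2, 6, 2)
print(y_test.ravel())               # the two most recent targets
```

Windows right at the boundary still share some input rows, so a gap of time_steps windows between the two sets would be needed to remove the overlap entirely.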
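Model construction
A neural network named CLATT is defined, combining convolutional layers, an LSTM, an attention mechanism, and a residual connection. Its structure is:
• Convolutional layers: two Conv1d layers (64 channels, kernel size 3) extract local patterns from the input features.
• LSTM layer: a three-layer bidirectional LSTM captures sequence features.
• Multi-head attention: MultiHeadAttention holds several independent AttentionLayer heads; each computes softmax attention weights over the sequence axis, and the head outputs are concatenated. Because the convolution output is flattened into a single step before the LSTM, the sequence reaching the attention layer has length 1, so these weights all equal 1 in the code as written.
• Residual block: ResidualBlock adds a skip connection after the attention mechanism, which helps training stability.
• Fully connected layer: maps the final features to the 4 target variables.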
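The attention step is easiest to reason about by looking at the softmax axis. The NumPy sketch below (random scores, made-up shapes) compares a softmax over a sequence axis of length 1, which is what the flatten and unsqueeze(1) steps in CLATT produce, with a softmax over a sequence axis of length 6, which is what the layer would see if the time axis were kept.

```python
import numpy as np

def softmax(scores, axis):
    """Numerically stable softmax along `axis`."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
batch, feat = 3, 8

# Sequence length 1: shape (batch, 1, feat)
scores_len1 = rng.normal(size=(batch, 1, feat))
w_len1 = softmax(scores_len1, axis=1)

# Sequence length 6: shape (batch, 6, feat)
scores_len6 = rng.normal(size=(batch, 6, feat))
w_len6 = softmax(scores_len6, axis=1)

print(np.allclose(w_len1, 1.0))              # True: every weight is exactly 1
print(np.allclose(w_len6.sum(axis=1), 1.0))  # True: weights sum to 1 over time
print(np.allclose(w_len6, 1.0))              # False: weights differ across time steps
```

With a single time step the weighted sum returns its input unchanged, so each head passes the LSTM output straight through. Feeding the LSTM a (batch, time_steps, channels) tensor instead of a flattened one would let the attention weights vary over time; that would be a design change, not a description of the current model.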
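Model training
Mean squared error (MSE) is the loss function. The train_model function implements training, including:
• Gradient updates: the Adam optimizer (learning rate 0.001, weight decay 1e-5) updates the model parameters.
• Learning rate scheduler: StepLR halves the learning rate every 50 epochs to help convergence.
• Early stopping: training stops when the loss has not improved for 10 consecutive epochs (patience=10). The monitored value is the training loss of the last mini-batch, so this saves wasted epochs rather than directly guarding against overfitting.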
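Early stopping is usually driven by a loss computed on held-out validation data. The sketch below shows that bookkeeping in plain Python, together with the step-decay rule that StepLR applies. The EarlyStopping class, the step_lr function, and the validation losses are all made up for illustration; they are not part of the listing in this report.

```python
class EarlyStopping:
    """Hypothetical helper: stop when the monitored loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.counter = 0

    def step(self, loss):
        """Record one epoch's loss; return True when training should stop."""
        if loss < self.best_loss - self.min_delta:
            self.best_loss = loss
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience


def step_lr(base_lr, epoch, step_size=50, gamma=0.5):
    """Learning rate after `epoch` completed epochs under step decay."""
    return base_lr * gamma ** (epoch // step_size)


# Made-up validation losses: improvement stalls after the third epoch
val_losses = [0.90, 0.70, 0.65, 0.66, 0.67, 0.68]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_loss)  # 6 0.65
print(step_lr(0.001, 0), step_lr(0.001, 50), step_lr(0.001, 120))
```

In a real training loop the value passed to step would be the mean loss over a validation set evaluated with model.eval() and torch.no_grad() after each epoch.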
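Model evaluation
The evaluate_model function evaluates the model on the test set and prints the test loss.
Visualizing predictions
The final part plots predicted against actual values as a line chart (for PH-S, first 100 test samples) to show the prediction quality.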

Code listing

import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import random
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Set random seeds
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# 1. Download and load the data
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/water-treatment/water-treatment.data'
data = pd.read_csv(url, header=None, sep=',')

# Name the dataset's 39 columns (the date plus 38 measurements)
data.columns = ['Date', 'Q-E', 'ZN-E', 'PH-E', 'DBO-E', 'DQO-E', 'SS-E', 'SSV-E', 'SED-E', 'COND-E',
                'PH-P', 'DBO-P', 'SS-P', 'SSV-P', 'SED-P', 'COND-P',
                'PH-D', 'DBO-D', 'DQO-D', 'SS-D', 'SSV-D', 'SED-D', 'COND-D',
                'PH-S', 'DBO-S', 'DQO-S', 'SS-S', 'SSV-S', 'SED-S', 'COND-S',
                'RD-DBO-P', 'RD-SS-P', 'RD-SED-P', 'RD-DBO-S', 'RD-DQO-S',
                'RD-DBO-G', 'RD-DQO-G', 'RD-SS-G', 'RD-SED-G']

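# Drop the date column
data = data.drop('Date', axis=1)

# Replace "?" with NaN
data = data.replace('?', np.nan)

# Convert the data to float
data = data.astype(float)

# Fill missing values with the column mean
data = data.fillna(data.mean())

# Feature engineering: reduce dimensionality with PCA (fit on all 38 numeric columns)
pca = PCA(n_components=15)
data_pca = pca.fit_transform(data)

# Select the input features and the effluent targets
in_features = data_pca
out_features = data[['PH-S', 'DBO-S', 'DQO-S', 'SS-S']].values

# Standardize the data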
scaler_in = StandardScaler()
scaler_out = StandardScaler()

X = scaler_in.fit_transform(in_features)
y = scaler_out.fit_transform(out_features)

# Data augmentation: sliding window
def create_sequences(X, y, time_steps=6, step=1):
    Xs, ys = [], []
    for i in range(0, len(X) - time_steps, step):
        Xs.append(X[i:i + time_steps])
        ys.append(y[i + time_steps])
    return np.array(Xs), np.array(ys)

time_steps = 6
X_seq, y_seq = create_sequences(X, y, time_steps, step=1)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_seq, y_seq, test_size=0.2, random_state=seed)

# Convert to PyTorch tensors
X_train = torch.tensor(X_train, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_test = torch.tensor(y_test, dtype=torch.float32)

# 2. Model construction
class CLATT(nn.Module):
    def __init__(self, time_steps, in_features, hidden_dim=256, out_features=4, num_heads=4, num_layers=3):
        super(CLATT, self).__init__()
        self.conv1 = nn.Conv1d(in_features, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(64, 64, kernel_size=3, padding=1)
        self.batch_norm = nn.BatchNorm1d(64)
        self.flatten = nn.Flatten()
        self.lstm = nn.LSTM(64 * time_steps, hidden_dim, num_layers=num_layers, batch_first=True, bidirectional=True)
        self.layer_norm1 = nn.LayerNorm(hidden_dim * 2)
        self.multi_head_attention = MultiHeadAttention(hidden_dim * 2, num_heads)
        self.layer_norm2 = nn.LayerNorm(hidden_dim * 2 * num_heads)
        self.residual_block = ResidualBlock(hidden_dim * 2 * num_heads, hidden_dim * 2)
        self.dropout = nn.Dropout(p=0.5)
        self.fc = nn.Linear(hidden_dim * 2 * num_heads, out_features)

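    def forward(self, x):
        # x: (batch, time_steps, in_features)
        x = x.transpose(1, 2)              # (batch, in_features, time_steps)
        x = torch.relu(self.conv1(x))      # (batch, 64, time_steps)
        x = torch.relu(self.conv2(x))
        x = self.batch_norm(x)
        x = self.flatten(x)                # (batch, 64 * time_steps)
        x = x.unsqueeze(1)                 # (batch, 1, 64 * time_steps): a length-1 sequence
        lstm_out, _ = self.lstm(x)         # (batch, 1, hidden_dim * 2)
        lstm_out = self.layer_norm1(lstm_out)
        attention_out = self.multi_head_attention(lstm_out)  # (batch, hidden_dim * 2 * num_heads)
        attention_out = self.layer_norm2(attention_out)
        attention_out = self.residual_block(attention_out)
        attention_out = self.dropout(attention_out)
        out = self.fc(attention_out)       # (batch, out_features)
        return out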

class MultiHeadAttention(nn.Module):
    def __init__(self, input_dim, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.attention_heads = nn.ModuleList([AttentionLayer(input_dim) for _ in range(num_heads)])

    def forward(self, inputs):
        attention_outputs = [head(inputs) for head in self.attention_heads]
        concat_attention = torch.cat(attention_outputs, dim=1)
        return concat_attention

class AttentionLayer(nn.Module):
    def __init__(self, input_dim):
        super(AttentionLayer, self).__init__()
        self.attn_weights = nn.Linear(input_dim, input_dim)

    def forward(self, inputs):
        # inputs: (batch, seq_len, input_dim)
        scores = torch.tanh(self.attn_weights(inputs))
        attn_weights = torch.softmax(scores, dim=1)  # normalize over the sequence axis
        context_vector = torch.sum(attn_weights * inputs, dim=1)  # (batch, input_dim)
        return context_vector

class ResidualBlock(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(ResidualBlock, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        residual = x
        out = torch.relu(self.fc1(x))
        out = self.fc2(out)
        out += residual
        return torch.relu(out)

# Create the model
model = CLATT(time_steps, in_features.shape[1], hidden_dim=256, out_features=4, num_heads=4, num_layers=3)
print(model)

# 3. Model training, with a learning rate scheduler and weight decay
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-5)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)

def train_model(model, X_train, y_train, n_epochs=200, batch_size=64, patience=10):
    model.train()
    best_loss = float('inf')
    patience_counter = 0

    for epoch in range(n_epochs):
        for i in range(0, len(X_train), batch_size):
            X_batch = X_train[i:i + batch_size]
            y_batch = y_train[i:i + batch_size]
            y_pred = model(X_batch)
            loss = criterion(y_pred, y_batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        scheduler.step()  # update the learning rate
        # Note: `loss` here is the training loss of the last mini-batch in this epoch
        if loss.item() < best_loss:
            best_loss = loss.item()
            patience_counter = 0
        else:
            patience_counter += 1

        if patience_counter >= patience:
            print(f"Early stopping at epoch {epoch + 1}")
            break

        if (epoch + 1) % 10 == 0:
            print(f'Epoch {epoch + 1}/{n_epochs}, Loss: {loss.item():.4f}')
            
train_model(model, X_train, y_train)

# 4. Model evaluation
def evaluate_model(model, X_test, y_test):
    model.eval()
    with torch.no_grad():
        y_pred = model(X_test)
        loss = criterion(y_pred, y_test)
        print(f'Test Loss: {loss.item():.4f}')
    return y_pred


y_pred = evaluate_model(model, X_test, y_test)

# Undo the standardization
y_test_inv = scaler_out.inverse_transform(y_test.numpy())
y_pred_inv = scaler_out.inverse_transform(y_pred.numpy())

# Plot the results
plt.figure(figsize=(10, 6))
plt.plot(y_test_inv[:100, 0], label='Actual (PH-S)')
plt.plot(y_pred_inv[:100, 0], label='Predicted (PH-S)')
plt.legend()
plt.show()

Experimental results

[Figure: experimental results; image not preserved]

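Intelligent comparison example

To make the comparison of treatment units more concrete, here is a complete example. Suppose we have data for four treatment units, with these indicators: chemical oxygen demand (COD) removal efficiency, total nitrogen (TN) removal efficiency, total phosphorus (TP) removal efficiency, flow rate, and pH. We reduce dimensionality with principal component analysis (PCA) and use a random forest model to compare the units.

Sample data

Suppose we have the following data for the treatment units (efficiencies in percent, flow rate in m³/h, pH dimensionless):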
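(Values as used in the Python code example in this section.)

Unit   COD efficiency (%)   TN efficiency (%)   TP efficiency (%)   Flow rate (m³/h)   pH
A      85                   70                  65                  120                7.0
B      80                   65                  75                  150                6.8
C      90                   80                  70                  100                7.2
D      78                   75                  68                  110                7.1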

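Comparison steps

Data preprocessing
Standardization: to remove differences in units and scale between indicators, the data is standardized first, so that each indicator has zero mean and unit variance (this is what StandardScaler does; scaling to a fixed range such as 0 to 1 would be min-max scaling instead).
Principal component analysis (PCA)
Dimensionality reduction: PCA reduces the five features to two principal components, lowering data complexity and highlighting the main factors.
Interpretability: PCA reports how much of the variance each principal component explains. Suppose the two components explain 70% and 20% of the variance, 90% in total; that would mean most of the information is retained.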
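The 70% and 20% figures above are hypothetical. The explained-variance ratios for the four example units can be computed directly, as in this NumPy sketch, which standardizes the data and obtains the principal components from a singular value decomposition. The sample values are the ones from this example; the variable names are my own.

```python
import numpy as np

# Four units: COD, TN, TP efficiency (%), flow rate (m³/h), pH
X = np.array([
    [85, 70, 65, 120, 7.0],   # unit A
    [80, 65, 75, 150, 6.8],   # unit B
    [90, 80, 70, 100, 7.2],   # unit C
    [78, 75, 68, 110, 7.1],   # unit D
])

# Standardize: zero mean and unit variance per column (population std, like StandardScaler)
Xz = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD of the centered, standardized matrix
U, s, Vt = np.linalg.svd(Xz, full_matrices=False)
explained_ratio = s ** 2 / np.sum(s ** 2)

# Coordinates of each unit on the first two principal components
scores = U[:, :2] * s[:2]

print(np.round(explained_ratio, 3))
print(np.round(scores, 3))
```

With only four samples there are at most three components with non-zero variance, so the last ratio is numerically zero. The signs of the scores may differ from those returned by scikit-learn's PCA, which applies its own sign convention.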
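Comparison with random forest
Feature importance analysis: the importance of each feature in the random forest model indicates which factors drive efficiency differences between units. Suppose the result shows that COD and TN removal efficiency matter most.
Comparison analysis: from the random forest classification we can obtain a performance score or class for each unit, for example "high-efficiency" and "low-efficiency" units.
Interpretation and application
Interpretation: suppose the PCA and random forest analysis shows that unit C has the highest treatment efficiency, while units B and D are relatively low. That conclusion can guide further action, such as optimization recommendations for units B and D.

Optimization plan: based on the comparison, improvement measures can be proposed for units with low COD and TN removal efficiency (such as units B and D), for example increasing aeration or adjusting chemical dosing.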

Python code example

The following Python code implements the steps above.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Prepare the data
data = {
    'Unit': ['A', 'B', 'C', 'D'],
    'COD Efficiency': [85, 80, 90, 78],
    'TN Efficiency': [70, 65, 80, 75],
    'TP Efficiency': [65, 75, 70, 68],
    'Flow Rate': [120, 150, 100, 110],
    'pH': [7.0, 6.8, 7.2, 7.1]
}

df = pd.DataFrame(data)
features = df.drop(columns=['Unit'])

# Standardize the data
scaler = StandardScaler()
scaled_features = scaler.fit_transform(features)

# PCA dimensionality reduction
pca = PCA(n_components=2)
principal_components = pca.fit_transform(scaled_features)
pca_df = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
pca_df['Unit'] = df['Unit']

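# Assumed labels: unit C is treated as the high-efficiency unit
labels = np.array([1 if unit in ['C'] else 0 for unit in df['Unit']])

# Random forest comparison model (random_state fixed so the importances are reproducible)
rf = RandomForestClassifier(random_state=42)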
rf.fit(scaled_features, labels)
importances = rf.feature_importances_

# Print the feature importances
for feature, importance in zip(features.columns, importances):
    print(f"Feature: {feature}, Importance: {importance:.4f}")

# Print the data after PCA
print("\nPCA Result:\n", pca_df)

Interpreting the results

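Feature importance: the code prints the importance of each indicator. For example, COD and TN removal efficiency may come out as the most influential. With only four samples, these importances are unstable and should be read as illustrative.
PCA result: the two-dimensional PCA output can be plotted to show how similar or different the units are.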

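Application scenarios

Based on the comparison, the relative efficiency of units can be assessed and optimization plans generated. For example, improvement plans can target units B and D, whose COD and TN efficiency is lower. This kind of analysis provides data support for later intelligent optimization and management.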

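Summary

The water quality prediction and unit comparison work above lays a foundation for optimizing and intelligently managing wastewater treatment, but several aspects can still be summarized and improved to raise the model's performance and practical value.
Thorough data processing: data cleaning, standardization, and dimensionality reduction prepare the model inputs, and the sliding window method builds the time series features.
An architecture that combines several techniques: the model uses convolutional layers, LSTM layers, and multi-head attention, with the aim of capturing the temporal patterns of water quality changes and improving prediction.
Combined comparison analysis: PCA together with random forest identifies the key indicators, providing a basis for optimizing unit performance, locating weak points in the process, and planning targeted improvements.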

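Suggestions for improvement

Dataset expansion and refinement: the UCI water treatment dataset used so far may not cover the full complexity of real wastewater treatment. More field data (for example water quality data from different seasons and pollution sources) could be added to training to widen the model's applicability.

1. Hyperparameter optimization and automated tuning:
Use automated tuning methods (such as Bayesian optimization or grid search) to search for a better hyperparameter combination and improve model performance.
Try other adaptive optimizers (such as AdaBelief or Ranger) in place of Adam, which may improve convergence speed and stability.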

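2. Model structure improvements:
Add a graph neural network (GNN) to model the relationships between water quality indicators, which may help capture how pollutants spread and diffuse, particularly for watershed or multi-water-body analysis.
Replace the LSTM with a Transformer architecture for the time series features, which may improve modeling of long sequences.
Add real-time data processing: connect live data to the model so the system can respond quickly when a treatment unit behaves abnormally. Stream processing tools (such as Apache Kafka or Spark Streaming) are suggested for dynamic monitoring.

3. Improve model interpretability: to support decisions in practice, interpretability methods such as SHAP values (SHapley Additive exPlanations) could be introduced, so each prediction can be traced back to the contribution of key indicators, making the model more transparent.

4. Strengthen performance evaluation and validation:
Add evaluation metrics such as R² and mean absolute error (MAE) for a fuller picture of model performance.
Use several validation schemes, such as cross-validation and leave-one-out validation, to check generalization and avoid overfitting to a particular dataset.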
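As a starting point for the extra metrics in item 4, MAE and R² can be computed per target column with a few lines of NumPy. The function names and the small arrays below are made up for illustration; in practice y_true and y_pred would be the inverse-transformed test targets and predictions.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error, one value per target column."""
    return np.mean(np.abs(y_true - y_pred), axis=0)

def r2_score(y_true, y_pred):
    """Coefficient of determination, one value per target column."""
    ss_res = np.sum((y_true - y_pred) ** 2, axis=0)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2, axis=0)
    return 1.0 - ss_res / ss_tot

# Made-up values for two targets (columns) over four samples (rows)
y_true = np.array([[7.0, 20.0], [7.2, 22.0], [7.4, 24.0], [7.6, 26.0]])
y_pred = np.array([[7.1, 21.0], [7.2, 22.0], [7.3, 23.0], [7.6, 26.0]])

mae_values = mae(y_true, y_pred)
r2_values = r2_score(y_true, y_pred)

print("MAE:", mae_values)
print("R2: ", r2_values)
```

Reporting the metrics per column keeps targets with different units, such as pH and suspended solids, from being averaged together.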

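These improvements should further strengthen the model's prediction accuracy, stability, and applicability, so the system can better handle the complexity of real deployments and give stronger technical support to the intelligent management of wastewater treatment.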