LSTM Variants

GRU

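Introduction to GRU

Gated recurrent neural networks (GRNNs) were proposed to better capture dependencies between time steps that are far apart in a sequence. They use learnable gates to control how information flows. The gated recurrent unit (GRU) is one widely used GRNN. It simplifies the LSTM considerably while achieving comparable performance on many tasks.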

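How GRU Works

Two major changes in GRU

The three gates of the LSTM (input gate, forget gate, output gate) are reduced to two: the update gate and the reset gate.
The (candidate) cell state and the hidden state (output) are merged, leaving only the candidate hidden state and the hidden state of the current time step.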

Model structure

Simplified diagram

Internal structure diagram

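$x_{t}$: the input at the current time step
$h_{t-1}$: the hidden state of the previous time step. The hidden state acts as the network's memory and carries information about the data seen at earlier steps
$h_{t}$: the hidden state passed on to the next time step
$\tilde{h}_{t}$: the candidate hidden state
$r_{t}$: the reset gate
$z_{t}$: the update gate
$\sigma$: the sigmoid function, which maps values into the range (0, 1)
$\tanh$: the tanh function, which maps values into the range (-1, 1)
$W_{z}$, $W_{r}$, and $W$ (named `W_h` in the code): model parameters (weight matrices) that are learned from the training data
Through its gating mechanism, the GRU captures the temporal dynamics of sequence data effectively. Because its structure is simpler than the LSTM's, it usually has fewer parameters and is cheaper to compute.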
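The parameter saving can be checked directly. The sketch below assumes PyTorch's `nn.LSTM` and `nn.GRU` as reference implementations (the helper `count_parameters` and the sizes 10 and 5 are illustrative choices): the LSTM stores four weight blocks per layer and the GRU three, so the GRU ends up with three quarters of the parameters.

```python
import torch.nn as nn


def count_parameters(module):
    """Total number of trainable scalars in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


input_size, hidden_size = 10, 5

lstm = nn.LSTM(input_size, hidden_size)  # 4 blocks: input, forget, cell, output
gru = nn.GRU(input_size, hidden_size)    # 3 blocks: reset, update, candidate

# Per block: input weights + hidden weights + two bias vectors
per_block = hidden_size * input_size + hidden_size * hidden_size + 2 * hidden_size  # 85

lstm_params = count_parameters(lstm)  # 4 * 85 = 340
gru_params = count_parameters(gru)    # 3 * 85 = 255
print("LSTM parameters:", lstm_params)
print("GRU parameters:", gru_params)
```

Note that PyTorch keeps two bias vectors per block (`bias_ih` and `bias_hh`), so the absolute counts are slightly higher than in a formulation with a single bias; the 4:3 ratio is unchanged.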

Reset gate

The reset gate determines how much past information is ignored when the current candidate hidden state is computed.
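In a common formulation, which is also the one used by the NumPy code in the implementation section, the reset gate can be written as follows. The bias $b_{r}$ and the concatenation notation $[h_{t-1}, x_{t}]$ are assumptions taken from that code.

$$
r_{t} = \sigma\left(W_{r} \cdot [h_{t-1}, x_{t}] + b_{r}\right)
$$

Every entry of $r_{t}$ lies in (0, 1). An entry near 0 suppresses the matching entry of $h_{t-1}$ when the candidate hidden state is formed; an entry near 1 lets it through.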

Update gate

The update gate determines how much past information is retained. It is computed from the previous hidden state ($h_{t-1}$) and the current input ($x_{t}$).
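Under the same assumptions as the NumPy code in the implementation section (a bias $b_{z}$ and the concatenated input $[h_{t-1}, x_{t}]$), the update gate can be written as:

$$
z_{t} = \sigma\left(W_{z} \cdot [h_{t-1}, x_{t}] + b_{z}\right)
$$

It has the same form as the reset gate but its own parameters, so the two gates can learn different behavior from the same inputs.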

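Candidate hidden state

The candidate hidden state is the proposed update for the current time step. It combines the current input with the past hidden state, and the reset gate lets the model discard or keep the previous hidden state.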
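Following the NumPy code in the implementation section (the bias $b_{h}$ is an assumption taken from that code, and $\odot$ denotes element-wise multiplication), the candidate hidden state can be written as:

$$
\tilde{h}_{t} = \tanh\left(W \cdot [r_{t} \odot h_{t-1}, x_{t}] + b_{h}\right)
$$

When $r_{t}$ is close to 0, the term $r_{t} \odot h_{t-1}$ vanishes and the candidate depends almost only on the current input $x_{t}$.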

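Final hidden state

The final hidden state blends the past hidden state with the current candidate hidden state. The update gate $z_{t}$ controls the ratio between past and current information.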
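In the convention used by the NumPy code in the implementation section, this blend is a per-element interpolation:

$$
h_{t} = (1 - z_{t}) \odot h_{t-1} + z_{t} \odot \tilde{h}_{t}
$$

With this convention, $z_{t}$ near 0 keeps the past state and $z_{t}$ near 1 replaces it with the candidate. Some references, including the PyTorch `nn.GRU` documentation, swap the roles of $z_{t}$ and $1 - z_{t}$; the two forms are equivalent up to relabeling the gate.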

Code implementation

From-scratch implementation (NumPy)

import numpy as np

class GRU:
    def __init__(self, input_size, hidden_size):
        # Example: gru_model = GRU(10, 5)
        self.input_size = input_size    # 10
        self.hidden_size = hidden_size  # 5
        # Each weight matrix maps the concatenation [h_prev, x_t]
        # (hidden_size + input_size = 15 values) to hidden_size values.
        self.W_z = np.random.randn(hidden_size, input_size + hidden_size)  # (5, 15)
        self.b_z = np.zeros((hidden_size,))                                # (5,)
        self.W_r = np.random.randn(hidden_size, input_size + hidden_size)  # (5, 15)
        self.b_r = np.zeros((hidden_size,))                                # (5,)
        self.W_h = np.random.randn(hidden_size, input_size + hidden_size)  # (5, 15)
        self.b_h = np.zeros((hidden_size,))                                # (5,)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def tanh(self, x):
        return np.tanh(x)

    def forward(self, x, h_prev=None):
        # x: (input_size,) = (10,); h_prev: (hidden_size,) = (5,)
        if h_prev is None:
            # Start from a zero hidden state at the first time step
            h_prev = np.zeros((self.hidden_size,))
        concat_input = np.concatenate((h_prev, x), axis=0)  # (15,)
        # Update gate and reset gate
        z_t = self.sigmoid(np.dot(self.W_z, concat_input) + self.b_z)  # (5,)
        r_t = self.sigmoid(np.dot(self.W_r, concat_input) + self.b_r)  # (5,)
        # Candidate hidden state: the reset gate scales h_prev element-wise
        concat_reset_input = np.concatenate((r_t * h_prev, x), axis=0)  # (15,)
        h_candidate = self.tanh(np.dot(self.W_h, concat_reset_input) + self.b_h)  # (5,)
        # Final hidden state: interpolate between h_prev and the candidate
        h_t = (1 - z_t) * h_prev + z_t * h_candidate  # (5,)
        return h_t

# Test the from-scratch GRU
input_size = 10
hidden_size = 5
seq_len = 8
gru_model = GRU(input_size, hidden_size)
x = np.random.randn(seq_len, input_size)  # (8, 10)
all_out = []
h_t = None  # forward() falls back to a zero state on the first step
for t in range(seq_len):
    x_t = x[t, :]  # (input_size,) = (10,)
    # Feed the previous hidden state back in so information carries across steps
    h_t = gru_model.forward(x_t, h_t)
    all_out.append(h_t)
print("Output shape:", h_t.shape)              # (5,)
print("All output:", np.array(all_out).shape)  # (8, 5)

PyTorch

nn.GRUCell

import torch
import torch.nn as nn

class GRU(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(GRU, self).__init__()
        self.hidden_size = hidden_size  # 5
        # GRUCell(input_size=10, hidden_size=5) processes a single time step
        self.gru_cell = nn.GRUCell(input_size, hidden_size)

    def forward(self, x, h_prev):
        # Returns h_t, the hidden state of the current time step t
        h_t = self.gru_cell(x, h_prev)
        return h_t

# Test the GRUCell-based GRU
input_size = 10
hidden_size = 5
timesteps = 8
gru_model = GRU(input_size, hidden_size)
x = torch.randn(timesteps, input_size)  # (8, 10)
h_prev = torch.zeros(hidden_size)       # (5,)
for t in range(timesteps):
    # Unbatched call: x[t] is (10,), h_prev is (5,)
    # (unbatched inputs to GRUCell need a reasonably recent PyTorch release)
    h_t = gru_model(x[t], h_prev)
    h_prev = h_t  # carry the hidden state to the next time step
    print("Output shape:", h_t.shape)  # torch.Size([5])
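In practice `nn.GRUCell` is usually called on a whole batch at once. The following is a minimal sketch of that pattern, assuming a time-major input of shape (timesteps, batch_size, input_size); the batch size of 4 and the variable names are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, timesteps, batch_size = 10, 5, 8, 4

cell = nn.GRUCell(input_size, hidden_size)
x = torch.randn(timesteps, batch_size, input_size)  # (8, 4, 10)
h = torch.zeros(batch_size, hidden_size)            # (4, 5), initial hidden state

step_outputs = []
for t in range(timesteps):
    # x[t] is (4, 10); the cell returns the new hidden state (4, 5)
    h = cell(x[t], h)
    step_outputs.append(h)

# Stack the per-step hidden states along a new time dimension
outputs = torch.stack(step_outputs, dim=0)  # (8, 4, 5)
print("All hidden states:", outputs.shape)
print("Final hidden state:", h.shape)
```

The stacked tensor plays the role of the `output` returned by `nn.GRU`, and the last value of `h` plays the role of `h_n`.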

nn.GRU

import torch
import torch.nn as nn

class GRUModel(nn.Module):
    def __init__(self, input_size, hidden_size):
        super(GRUModel, self).__init__()
        self.hidden_size = hidden_size  # 5
        self.gru = nn.GRU(input_size, hidden_size, batch_first=True)

    def forward(self, x):
        # x: (batch_size, timesteps, input_size) = (1, 8, 10)
        output, h_n = self.gru(x)
        return output, h_n

# Test the nn.GRU-based model
input_size = 10
hidden_size = 5
timesteps = 8   # seq_len
batch_size = 1
gru_model = GRUModel(input_size, hidden_size)
x = torch.randn(batch_size, timesteps, input_size)  # (1, 8, 10)
output, h_n = gru_model(x)
# output: (batch_size, seq_len, hidden_size)
print("Output shape:", output.shape)  # torch.Size([1, 8, 5])
# h_n: (num_layers, batch_size, hidden_size)
print("Final hidden state shape:", h_n.shape)  # torch.Size([1, 1, 5])

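(1) output

Shape: (batch_size, sequence_length, hidden_size) for a unidirectional GRU.
This is one recurrent layer inside a complete model, so its output dimension is the hidden size rather than the number of classes.
The tensor contains the GRU output at every time step, that is, the hidden state at each time step t.
With batch_first=True (as set in the code), the first dimension of output is the batch size batch_size, the second is the sequence length sequence_length, and the third is the hidden size hidden_size.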
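To make the statement "output holds the hidden state of every time step" concrete, the sketch below unrolls an `nn.GRUCell` by hand and compares the result with `nn.GRU`. It assumes a single-layer, unidirectional GRU, whose parameters `weight_ih_l0`, `weight_hh_l0`, `bias_ih_l0`, and `bias_hh_l0` are copied into the cell; the sizes are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, seq_len, batch_size = 10, 5, 8, 2

gru = nn.GRU(input_size, hidden_size, batch_first=True)
cell = nn.GRUCell(input_size, hidden_size)

# Copy the layer-0 weights of the GRU into the cell so both compute the same function
with torch.no_grad():
    cell.weight_ih.copy_(gru.weight_ih_l0)
    cell.weight_hh.copy_(gru.weight_hh_l0)
    cell.bias_ih.copy_(gru.bias_ih_l0)
    cell.bias_hh.copy_(gru.bias_hh_l0)

x = torch.randn(batch_size, seq_len, input_size)  # (2, 8, 10)

with torch.no_grad():
    output, h_n = gru(x)  # output: (2, 8, 5)

    # Manual unrolling: one GRUCell call per time step
    h = torch.zeros(batch_size, hidden_size)
    steps = []
    for t in range(seq_len):
        h = cell(x[:, t, :], h)
        steps.append(h)
    manual_output = torch.stack(steps, dim=1)  # (2, 8, 5)

matches = torch.allclose(output, manual_output, atol=1e-5)
print("nn.GRU output matches manual unrolling:", matches)
```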

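(2) h_n

Shape: (num_layers * num_directions, batch_size, hidden_size). h_n keeps this layout even when batch_first=True.
This is the hidden state of the GRU at the last time step. For each sample it keeps only the hidden state reached at the end of the sequence.
For a unidirectional GRU, num_directions=1; for a bidirectional GRU (bidirectional=True), num_directions=2.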
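The relationship between output and h_n, and the effect of num_layers and num_directions on the shapes, can be checked with a short sketch. The sizes and the two-layer bidirectional configuration below are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, input_size, hidden_size = 3, 8, 10, 5
x = torch.randn(batch_size, seq_len, input_size)

# Single-layer, unidirectional GRU
gru = nn.GRU(input_size, hidden_size, batch_first=True)
output, h_n = gru(x)
print(output.shape)  # torch.Size([3, 8, 5])
print(h_n.shape)     # torch.Size([1, 3, 5])

# The last time step of output is the final hidden state of the (only) layer
last_step_matches = torch.allclose(output[:, -1, :], h_n[-1], atol=1e-6)
print("output[:, -1, :] equals h_n[-1]:", last_step_matches)

# Two-layer, bidirectional GRU: num_layers * num_directions = 4
bi_gru = nn.GRU(input_size, hidden_size, num_layers=2,
                bidirectional=True, batch_first=True)
bi_output, bi_h_n = bi_gru(x)
print(bi_output.shape)  # torch.Size([3, 8, 10]), last dim is 2 * hidden_size
print(bi_h_n.shape)     # torch.Size([4, 3, 5])
```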

BiLSTM

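Overview

The bidirectional long short-term memory network (BiLSTM) extends the LSTM to address the limitation that a standard LSTM only sees past information. At every time step, a BiLSTM takes both past and future context into account, so it captures bidirectional dependencies in sequence data more effectively.

The key idea of the BiLSTM is to use two independent LSTM layers: one processes the input sequence in forward order, the other in reverse order. The output at each time step therefore carries information from both before and after that step, which helps the model understand the semantics and context of the sequence.

Forward pass: the input sequence is fed to the first LSTM layer in time order. The output at each time step is computed and kept.
Backward pass: the input sequence is fed to the second LSTM layer in reverse time order (the last element first). As in the forward pass, the output at each time step is computed and kept.
Merging outputs: at each time step, the outputs of the two LSTM layers are combined (for example by concatenation or summation) to produce the final output.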
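The three steps above can be observed in PyTorch's `nn.LSTM` with `bidirectional=True`, which merges the two directions by concatenation. This is a minimal sketch with illustrative sizes: the first `hidden_size` features of each output vector come from the forward direction and the remaining ones from the backward direction.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, batch_size, input_size, hidden_size = 6, 2, 4, 3

lstm = nn.LSTM(input_size, hidden_size, bidirectional=True)
x = torch.randn(seq_len, batch_size, input_size)  # time-major input

output, (h_n, c_n) = lstm(x)
print(output.shape)  # torch.Size([6, 2, 6]): forward and backward concatenated
print(h_n.shape)     # torch.Size([2, 2, 3]): one final state per direction

forward_out = output[:, :, :hidden_size]   # forward direction at every step
backward_out = output[:, :, hidden_size:]  # backward direction at every step

# The forward direction finishes at the last time step,
# the backward direction finishes at the first time step.
forward_matches = torch.allclose(forward_out[-1], h_n[0], atol=1e-6)
backward_matches = torch.allclose(backward_out[0], h_n[1], atol=1e-6)
print(forward_matches, backward_matches)
```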

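Applications of the BiLSTM model

Named entity recognition

Tag sets

BMES tag set

There is more than one tag set for word segmentation. In Chinese word segmentation, for example, a character can be at the beginning of a word (Begin), at its end (End), in the middle (Middle), or form a word on its own (Single). These four cases cover every segmentation outcome, which gives the BMES tag set. Tag sets of this kind are also very common in named entity recognition.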
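As a small illustration of the BMES scheme, the sketch below converts an already segmented sentence into one tag per character. The helper name `bmes_tags` and the example sentence are hypothetical and chosen only for demonstration.

```python
def bmes_tags(words):
    """Return one BMES tag per character for a list of segmented words."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")  # single-character word
        elif len(word) > 1:
            # first character B, last character E, everything in between M
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags


words = ["我", "喜欢", "自然语言"]  # hypothetical segmentation
tags = bmes_tags(words)
for char, tag in zip("".join(words), tags):
    print(char, tag)
```

Here 我 is tagged S, 喜欢 is tagged B E, and 自然语言 is tagged B M M E. A sequence labeling model is trained to predict exactly such a tag sequence from the raw characters.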

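Part-of-speech tagging

In sequence labeling, the word sequence is x and the part-of-speech sequence is y. Deciding the part of speech of the current word requires taking the parts of speech of the surrounding words into account. The best-known tag sets for Chinese are the 863 tag set and the Peking University (PKU) tag set.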
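For sequence labeling, a BiLSTM is typically followed by a layer that scores every tag at every position, so that each element of x receives its own prediction in y. The following is a minimal sketch of such a tagger; the class name `BiLSTMTagger`, the vocabulary size, and the number of tags are hypothetical, and the model is untrained.

```python
import torch
import torch.nn as nn


class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_tags):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            bidirectional=True, batch_first=True)
        # Both directions are concatenated, hence 2 * hidden_dim inputs
        self.classifier = nn.Linear(hidden_dim * 2, num_tags)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        lstm_out, _ = self.lstm(embedded)     # (batch, seq_len, 2 * hidden_dim)
        return self.classifier(lstm_out)      # (batch, seq_len, num_tags)


torch.manual_seed(0)
tagger = BiLSTMTagger(vocab_size=100, embed_dim=16, hidden_dim=8, num_tags=4)
token_ids = torch.randint(0, 100, (2, 7))  # 2 sentences of 7 tokens each
scores = tagger(token_ids)
predicted_tags = scores.argmax(dim=-1)     # one tag index per token
print(scores.shape)          # torch.Size([2, 7, 4])
print(predicted_tags.shape)  # torch.Size([2, 7])
```

Unlike a sentence classifier, the linear layer is applied to every time step rather than only to the last one.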

Code implementation

From-scratch implementation (NumPy)

import numpy as np

class BiLSTM:
    # Example: bilstm = BiLSTM(2, 3, 4)
    def __init__(self, input_size, hidden_size, output_size):
        self.input_size = input_size    # 2
        self.hidden_size = hidden_size  # 3
        self.output_size = output_size  # 4
        # Forward-direction LSTM (the LSTM class is defined below)
        self.forward_lstm = LSTM(input_size, hidden_size, output_size)
        # Backward-direction LSTM with its own, independent parameters
        self.backward_lstm = LSTM(input_size, hidden_size, output_size)

    def forward(self, inputs):
        # Forward direction: process the sequence in time order
        forward_outputs, _, _ = self.forward_lstm.forward(inputs)
        # Backward direction: np.flip(inputs, axis=0) reverses the first
        # dimension of inputs, i.e. the time axis
        backward_outputs, _, _ = self.backward_lstm.forward(np.flip(inputs, axis=0))
        # Flip the backward outputs back so that index t refers to the same
        # time step in both directions, then concatenate step by step:
        # forward  = [f1, f2, f3]
        # backward = [b1, b2, b3] (after realignment)
        # zip() yields [(f1, b1), (f2, b2), (f3, b3)]
        combined_outputs = [np.concatenate((f, b), axis=0) for f, b in zip(forward_outputs, np.flip(backward_outputs, axis=0))]
        return combined_outputs


class LSTM:
    def __init__(self, input_size, hidden_size, output_size):
        """
        :param input_size: size of each input (word) vector
        :param hidden_size: size of the hidden layer
        :param output_size: number of output classes
        """
        self.input_size = input_size    # 2
        self.hidden_size = hidden_size  # 3
        self.output_size = output_size  # 4
        # Initialize weights and biases.
        # Each gate maps [x_t, h_t] (2 + 3 = 5 values) to hidden_size values.
        self.w_f = np.random.rand(hidden_size, input_size + hidden_size)  # (3, 5)
        self.b_f = np.random.rand(hidden_size)                            # (3,)
        self.w_i = np.random.rand(hidden_size, input_size + hidden_size)  # (3, 5)
        self.b_i = np.random.rand(hidden_size)                            # (3,)
        self.w_c = np.random.rand(hidden_size, input_size + hidden_size)  # (3, 5)
        self.b_c = np.random.rand(hidden_size)                            # (3,)
        self.w_o = np.random.rand(hidden_size, input_size + hidden_size)  # (3, 5)
        self.b_o = np.random.rand(hidden_size)                            # (3,)
        # Output layer
        self.w_y = np.random.rand(output_size, hidden_size)  # (4, 3)
        self.b_y = np.random.rand(output_size)               # (4,)

    def tanh(self, x):
        return np.tanh(x)

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def forward(self, x):
        # x: (seq_len, input_size) = (5, 2)
        h_t = np.zeros((self.hidden_size,))  # initial hidden state, (3,)
        c_t = np.zeros((self.hidden_size,))  # initial cell state, (3,)
        h_states = []  # hidden state of every time step
        c_states = []  # cell state of every time step
        for t in range(x.shape[0]):  # 5 time steps
            x_t = x[t]  # input of the current time step, (2,)
            # Concatenate x_t and h_t into one 1-D vector
            concat_input = np.concatenate([x_t, h_t])  # (5,)
            # Forget gate: (3, 5) @ (5,) + (3,) => (3,)
            f_t = self.sigmoid(np.dot(self.w_f, concat_input) + self.b_f)
            # Input gate: (3, 5) @ (5,) + (3,) => (3,)
            i_t = self.sigmoid(np.dot(self.w_i, concat_input) + self.b_i)
            # Candidate cell state: (3, 5) @ (5,) + (3,) => (3,)
            c_hat_t = self.tanh(np.dot(self.w_c, concat_input) + self.b_c)
            # Update the cell state: (3,) * (3,) + (3,) * (3,) => (3,)
            c_t = f_t * c_t + i_t * c_hat_t
            # Output gate: (3, 5) @ (5,) + (3,) => (3,)
            o_t = self.sigmoid(np.dot(self.w_o, concat_input) + self.b_o)
            # Update the hidden state: (3,) * (3,) => (3,)
            h_t = o_t * self.tanh(c_t)
            # Store the states of this time step; after the loop: 5 entries of (3,)
            h_states.append(h_t)
            c_states.append(c_t)

        # Output layer: class scores from the hidden state of the last time step
        y_t = np.dot(self.w_y, h_t) + self.b_y  # (4, 3) @ (3,) + (4,) => (4,)
        # Numerically stable softmax over the class scores
        exp_y = np.exp(y_t - np.max(y_t))
        output = exp_y / np.sum(exp_y)  # (4,)
        # Return the states as NumPy arrays of shape (5, 3)
        return np.array(h_states), np.array(c_states), output


# Test case
input_size = 2
hidden_size = 3
output_size = 4
bilstm = BiLSTM(input_size, hidden_size, output_size)
# Input sequence: 5 time steps, 2 features each
inputs = np.random.rand(5, 2)
# Forward pass
outputs = bilstm.forward(inputs)
# Outputs after one forward pass: (5, 6)
print("Outputs after one forward pass:", np.array(outputs).shape)

PyTorch

import torch
import torch.nn as nn

# Define the BiLSTM class
class BiLSTM(nn.Module):
    # Example: bilstm = BiLSTM(10, 6, 5)
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(BiLSTM, self).__init__()
        # Bidirectional LSTM layer: input size input_dim, hidden size hidden_dim
        self.lstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim, bidirectional=True)
        # Linear layer: the two directions are concatenated, so its input
        # size is 2 * hidden_dim; its output size is output_dim
        self.linear = nn.Linear(hidden_dim * 2, output_dim)

    def forward(self, input_seq):
        # input_seq: (seq_len, batch_size, input_dim), e.g. (3, 1, 10)
        # lstm_out holds the hidden states of both directions at every time step
        lstm_out, _ = self.lstm(input_seq)  # torch.Size([3, 1, 12])
        # lstm_out[-1] is the output at the last time step: torch.Size([1, 12])
        # Pass it through the linear layer to get the final output
        final_output = self.linear(lstm_out[-1])
        return final_output

# Test case
input_dim = 10   # size of each input vector
hidden_dim = 6   # size of the hidden layer
output_dim = 5   # size of the output vector
seq_len = 3      # length of the input sequence
# Instantiate the BiLSTM
bilstm = BiLSTM(input_dim, hidden_dim, output_dim)
# Test input of shape (seq_len, batch_size, input_dim), with batch_size = 1
test_input = torch.randn(seq_len, 1, input_dim)  # (3, 1, 10)
# Run the BiLSTM
test_output = bilstm(test_input)
print(test_output)  # shape: torch.Size([1, 5])
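The class above feeds `lstm_out[-1]` to the linear layer. In that slice, the backward direction has only processed the final element of the sequence. A common alternative for sequence classification, sketched here as an assumption rather than as the method of this document, is to concatenate the final hidden state of each direction from `h_n`. The class name `BiLSTMClassifier` is hypothetical.

```python
import torch
import torch.nn as nn


class BiLSTMClassifier(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim,
                            bidirectional=True)
        self.linear = nn.Linear(hidden_dim * 2, output_dim)

    def forward(self, input_seq):
        # input_seq: (seq_len, batch_size, input_dim)
        _, (h_n, _) = self.lstm(input_seq)  # h_n: (2, batch_size, hidden_dim)
        # h_n[0]: forward direction after reading the whole sequence
        # h_n[1]: backward direction after reading the whole sequence
        summary = torch.cat((h_n[0], h_n[1]), dim=1)  # (batch_size, 2 * hidden_dim)
        return self.linear(summary)                   # (batch_size, output_dim)


torch.manual_seed(0)
model = BiLSTMClassifier(input_dim=10, hidden_dim=6, output_dim=5)
test_input = torch.randn(3, 1, 10)  # (seq_len, batch_size, input_dim)
logits = model(test_input)
print(logits.shape)  # torch.Size([1, 5])
```

Both variants produce an output of the same shape; they differ only in which hidden states summarize the sequence.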