First, thanks to the author of https://www.jianshu.com/p/9dc9f41f0b29, whose article gave me my first real understanding of LSTM.
I also want to recommend Hung-yi Lee's (李宏毅) LSTM lecture, which is exceptionally easy to follow: https://www.youtube.com/watch?v=xCGidAeyS4M
Understanding RNN
To understand LSTM you first need to understand RNN. An RNN (Recurrent Neural Network) is a class of neural networks for processing sequential data. The classic application of RNNs is NLP: natural language is a sequence in which the context carries dependencies between words. RNNs are networks that contain loops, which allows information to persist from one step to the next.
Sometimes, though, the information relevant to the current prediction sits far back in the context. As the gap between the relevant information and the position where it is needed grows, a plain RNN loses the ability to learn to connect the two; LSTMs handle exactly this situation well.
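As a concrete (standard) reference point, a vanilla RNN keeps a hidden state h_t that is recomputed at every time step from the current input x_t and the previous hidden state h_{t-1}:

h_t = tanh(W_h · h_{t-1} + W_x · x_t + b)

The same weights are reused at every step, and h_t is what carries information forward along the sequence.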
Understanding LSTM
Long Short-Term Memory networks, usually just called LSTMs, are a special kind of RNN capable of learning long-term dependencies.
All RNNs have the form of a chain of repeating neural-network modules. In a standard RNN, this repeating module has a very simple structure, such as a single tanh layer.
LSTMs also have this chain-like structure, but the repeating module is different: instead of a single neural-network layer there are four, interacting in a very particular way.
Gate structure
LSTMs have the ability to remove or add information to the cell state, regulated by structures called gates. A gate is a way to optionally let information through: it consists of a sigmoid neural-network layer and an element-wise multiplication.
The sigmoid layer outputs numbers between 0 and 1, describing how much of each component should be let through. A value of 0 means "let nothing through", while a value of 1 means "let everything through"!
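To make this concrete, here is a tiny illustrative example (not from the original article) of what a gate does numerically:

import torch

information = torch.tensor([1.0, -2.0, 3.0])           # values flowing through the cell
gate = torch.sigmoid(torch.tensor([-5.0, 0.0, 5.0]))   # in a real LSTM this comes from a learned layer
print(gate)                  # tensor([0.0067, 0.5000, 0.9933])
print(gate * information)    # element-wise product: ~0 blocks a value, ~1 lets it through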
Step 1: decide what information to throw away
The first step in our LSTM is to decide what information to throw away from the cell state. This decision is made by a sigmoid layer called the forget gate. It looks at h_{t-1} (the previous hidden state) and x_t (the current input) and outputs a number between 0 and 1 for each number in the cell state C_{t-1}: 1 means "keep this completely", while 0 means "get rid of this completely".
For example, in NLP, when we see a new subject we want to forget the old subject.
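In the standard LSTM formulation (σ denotes the sigmoid function), the forget gate is:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

where [h_{t-1}, x_t] is the concatenation of the previous hidden state and the current input.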
Step 2: decide what new information to store
The next step is to decide what new information to store in the cell state. This has two parts. First, a sigmoid layer called the input gate decides which values we will update. Then, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state. In the next step we combine these two to produce an update to the state; in the language example, this is where we add the new subject to replace the old one we are forgetting.
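In standard notation, these two parts are:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)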
Step 3: update the cell state
Now we update the old cell state C_{t-1} into the new cell state C_t. We multiply the old state by f_t, forgetting the things we decided to forget, and then add i_t * C̃_t: the new candidate values, scaled by how much we decided to update each state component.
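Written out, the cell-state update is:

C_t = f_t * C_{t-1} + i_t * C̃_t

where * is element-wise multiplication.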
Step 4: decide what to output
Finally, we run a sigmoid layer that decides which parts of the cell state to output. We then put the cell state through tanh (squashing the values to between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
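The corresponding equations are:

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)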
A brief explanation of LSTM parameters and how they are used
LSTM parameters
Looking up nn.LSTM in the official PyTorch documentation, you can see that the model takes 7 constructor parameters:
- input_size – The number of expected features in the input x
- hidden_size – The number of features in the hidden state h
- num_layers – Number of recurrent layers. E.g., setting num_layers=2 would mean stacking two LSTMs together to form a stacked LSTM, with the second LSTM taking in outputs of the first LSTM and computing the final results. Default: 1
- bias – If False, then the layer does not use bias weights b_ih and b_hh. Default: True
- batch_first – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False
- dropout – If non-zero, introduces a Dropout layer on the outputs of each LSTM layer except the last layer, with dropout probability equal to dropout. Default: 0
- bidirectional – If True, becomes a bidirectional LSTM. Default: False
num_layers is the number of stacked LSTM layers; the default is 1. If it is set to 2, the second LSTM consumes the results of the first: the first layer takes the input sequence [ X0 X1 X2 ... Xt ] and computes [ h0 h1 h2 ... ht ], and the second layer then uses that [ h0 h1 h2 ... ht ] as its own input sequence and computes the final [ h0 h1 h2 ... ht ].
bias is the bias (offset) term. Without a bias, the transformation is centred on 0, i.e. it starts from 0.
batch_first: whether the first dimension of the input and output tensors is batch_size; default False. By default the input is (seq_len, batch, input_size); with batch_first=True it becomes (batch, seq, feature).
bidirectional: whether the RNN is bidirectional; default False. If True, num_directions = 2, otherwise num_directions = 1.
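To see how num_layers, batch_first and bidirectional affect the shapes, here is a small sanity check (the sizes are arbitrary example values):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2,
               batch_first=True, bidirectional=True)
x = torch.randn(4, 7, 10)    # (batch, seq_len, input_size) because batch_first=True
output, (h_n, c_n) = lstm(x)
print(output.shape)          # torch.Size([4, 7, 40]) -> (batch, seq_len, num_directions * hidden_size)
print(h_n.shape)             # torch.Size([4, 4, 20]) -> (num_layers * num_directions, batch, hidden_size)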
LSTM inputs
Inputs: input(seq_len, batch, input_size), (h_0, c_0)
h_0 of shape (num_layers * num_directions, batch, hidden_size)
c_0 of shape (num_layers * num_directions, batch, hidden_size)
If (h_0, c_0) is not provided, both h_0 and c_0 default to zero.
LSTM outputs
Outputs: output(seq_len, batch, num_directions * hidden_size), (h_n, c_n)
h_n of shape (num_layers * num_directions, batch, hidden_size)
c_n of shape (num_layers * num_directions, batch, hidden_size)
Example
>>> rnn = nn.LSTM(10, 20, 2)
>>> input = torch.randn(5, 3, 10)
>>> h0 = torch.randn(2, 3, 20)
>>> c0 = torch.randn(2, 3, 20)
>>> output, (hn, cn) = rnn(input, (h0, c0))
Below I walk through the parameters and their use with LSTM code I used in a Kaggle competition.
Declaring the LSTM NeuralNet
class NeuralNet(nn.Module):
    def __init__(self, embedding_matrix, num_aux_targets):
        super(NeuralNet, self).__init__()
        embed_size = embedding_matrix.shape[1]

        self.embedding = nn.Embedding(max_features, embed_size)
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype=torch.float32))
        self.embedding.weight.requires_grad = False
        self.embedding_dropout = SpatialDropout(0.3)

        self.lstm1 = nn.LSTM(embed_size, LSTM_UNITS, bidirectional=True, batch_first=True)
        self.lstm2 = nn.LSTM(LSTM_UNITS * 2, LSTM_UNITS, bidirectional=True, batch_first=True)

        self.linear1 = nn.Linear(DENSE_HIDDEN_UNITS, DENSE_HIDDEN_UNITS)
        self.linear2 = nn.Linear(DENSE_HIDDEN_UNITS, DENSE_HIDDEN_UNITS)

        self.linear_out = nn.Linear(DENSE_HIDDEN_UNITS, 1)
        self.linear_aux_out = nn.Linear(DENSE_HIDDEN_UNITS, num_aux_targets)

    def forward(self, x, lengths=None):
        h_embedding = self.embedding(x.long())
        h_embedding = self.embedding_dropout(h_embedding)

        h_lstm1, _ = self.lstm1(h_embedding)
        h_lstm2, _ = self.lstm2(h_lstm1)

        # global average pooling
        avg_pool = torch.mean(h_lstm2, 1)
        # global max pooling
        max_pool, _ = torch.max(h_lstm2, 1)

        h_conc = torch.cat((max_pool, avg_pool), 1)
        h_conc_linear1 = F.relu(self.linear1(h_conc))
        h_conc_linear2 = F.relu(self.linear2(h_conc))

        hidden = h_conc + h_conc_linear1 + h_conc_linear2

        result = self.linear_out(hidden)
        aux_result = self.linear_aux_out(hidden)
        out = torch.cat([result, aux_result], 1)

        return out
self.lstm1 = nn.LSTM(embed_size, LSTM_UNITS, bidirectional=True, batch_first=True)
# input_size = embed_size, hidden_size = LSTM_UNITS
self.lstm2 = nn.LSTM(LSTM_UNITS * 2, LSTM_UNITS, bidirectional=True, batch_first=True)
# input_size = LSTM_UNITS * 2, hidden_size = LSTM_UNITS
h_lstm1, _ = self.lstm1(h_embedding)
# the last dimension of h_lstm1 is LSTM_UNITS * 2 (bidirectional doubles it)
h_lstm2, _ = self.lstm2(h_lstm1)
# the last dimension of h_lstm2 is also LSTM_UNITS * 2
So this example clearly stacks two bidirectional LSTM layers.
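Assuming, for illustration, that the input x is a batch of token-id sequences of shape (batch, seq_len), the tensor shapes flow roughly like this:

# x:           (batch, seq_len)                   token ids
# h_embedding: (batch, seq_len, embed_size)       after the embedding layer
# h_lstm1:     (batch, seq_len, LSTM_UNITS * 2)   bidirectional doubles the last dimension
# h_lstm2:     (batch, seq_len, LSTM_UNITS * 2)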
h_conc = torch.cat((max_pool, avg_pool), 1)
Concatenates the given sequence of seq tensors in the given dimension. All tensors must either have the same shape (except in the concatenating dimension) or be empty.
h_conc_linear1 = F.relu(self.linear1(h_conc))
self.linear1 = nn.Linear(DENSE_HIDDEN_UNITS, DENSE_HIDDEN_UNITS): both its input and output have the dense hidden size.
In the end, result has 1 output column and aux_result has num_aux_targets columns; concatenated, they make up the output_dim = 7 columns used during training below.
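A quick sketch of the pooling and concatenation step (LSTM_UNITS = 128 is an assumed example value; the concatenation implies DENSE_HIDDEN_UNITS must equal LSTM_UNITS * 4):

import torch

LSTM_UNITS = 128                                 # assumed example value
h_lstm2 = torch.randn(4, 7, LSTM_UNITS * 2)      # (batch, seq_len, LSTM_UNITS * 2)
avg_pool = torch.mean(h_lstm2, 1)                # (4, 256), mean over the seq_len dimension
max_pool, _ = torch.max(h_lstm2, 1)              # (4, 256), max over the seq_len dimension
h_conc = torch.cat((max_pool, avg_pool), 1)      # (4, 512) = (batch, LSTM_UNITS * 4)
print(h_conc.shape)                              # torch.Size([4, 512])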
Training the LSTM model
def train_model(learn, test, output_dim, lr=0.001,
                batch_size=512, n_epochs=5,
                enable_checkpoint_ensemble=True):
    all_test_preds = []
    checkpoint_weights = [1, 2, 4, 8, 6]
    test_loader = torch.utils.data.DataLoader(test, batch_size=batch_size, shuffle=False)

    n = len(learn.data.train_dl)
    phases = [(TrainingPhase(n).schedule_hp('lr', lr * (0.6**(i)))) for i in range(n_epochs)]
    sched = GeneralScheduler(learn, phases)
    learn.callbacks.append(sched)

    for epoch in range(n_epochs):
        learn.fit(1)  # train for one epoch

        test_preds = np.zeros((len(test), output_dim))
        for i, x_batch in enumerate(test_loader):
            X = x_batch[0].cuda()
            # sigmoid here is a NumPy-based helper defined elsewhere in the kernel
            y_pred = sigmoid(learn.model(X).detach().cpu().numpy())
            test_preds[i * batch_size:(i + 1) * batch_size, :] = y_pred

        all_test_preds.append(test_preds)

    if enable_checkpoint_ensemble:
        test_preds = np.average(all_test_preds, weights=checkpoint_weights, axis=0)
    else:
        test_preds = all_test_preds[-1]

    return test_preds
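The checkpoint_weights list implements a simple checkpoint ensemble: the predictions saved after each of the 5 epochs are combined with a weighted average, with later (usually better) epochs weighted more heavily. A tiny made-up illustration of that last step:

import numpy as np

# made-up predictions from 5 checkpoints for 2 test samples
all_test_preds = [np.array([[0.1], [0.9]]),
                  np.array([[0.2], [0.8]]),
                  np.array([[0.3], [0.7]]),
                  np.array([[0.4], [0.6]]),
                  np.array([[0.5], [0.5]])]
ensembled = np.average(all_test_preds, weights=[1, 2, 4, 8, 6], axis=0)
print(ensembled)   # each entry is the weighted mean over the 5 checkpoints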
Main program
model = NeuralNet(embedding_matrix, y_aux_train.shape[-1])
# model is the stacked bidirectional LSTM defined above
learn = Learner(databunch, model, loss_func=custom_loss)
# instantiate a Learner; Learner is a fastai class
test_preds = train_model(learn, test_dataset, output_dim=7)
all_test_preds.append(test_preds)
Class Learner
The Learner class lives in fastai's basic_train module. It trains model on data by using an optimizer to minimize loss_func, so you get a well-trained model without writing the training loop yourself. The tracked metrics are reported after every epoch, and callbacks can be attached to hook into training.
Learner(data:DataBunch, model:Module, opt_func:Callable='Adam', loss_func:Callable=None,
        metrics:Collection[Callable]=None, true_wd:bool=True, bn_wd:bool=True, wd:Floats=0.01,
        train_bn:bool=True, path:str=None, model_dir:PathOrStr='models',
        callback_fns:Collection[Callable]=None, callbacks:Collection[Callback]=<factory>,
        layer_groups:ModuleList=None, add_time:bool=True, silent:bool=None)
fit(epochs:int, lr:Union[float, Collection[float], slice]=slice(None, 0.003, None),
    wd:Floats=None, callbacks:Collection[Callback]=None)
learn.recorder.plot()  # plot how the loss changes with the learning rate
predict(item:ItemBase, return_x:bool=False, batch_first:bool=True, with_dropout:bool=False, **kwargs)
class GeneralScheduler
GeneralScheduler(learn:Learner, phases:Collection[TrainingPhase], start_epoch:int=None) :: LearnerCallback
class TrainingPhase
TrainingPhase(length:int)
schedule_hp
schedule_hp(name, vals, anneal=None)
Adds a schedule for name between vals using anneal.
The phase will make the hyper-parameter vary from the first value in vals to the second, following anneal. If an annealing function is specified but vals is a float, it will decay to 0. If no annealing function is specified, the default is a linear annealing for a tuple, a constant parameter if it's a float.
A few basic hyper-parameters already have reserved names, such as the lr used in the example above; you can also schedule any other hyper-parameter your optimizer exposes, for instance eps when using Adam:
- 'lr' for learning rate
- 'mom' for momentum (or beta1 in Adam)
- 'beta' for the beta2 in Adam or the alpha in RMSprop
- 'wd' for weight decay
Code example
phases = [(TrainingPhase(n * (cycle_len * cycle_mult**i))
           .schedule_hp('lr', lr, anneal=annealing_cos)
           .schedule_hp('mom', mom)) for i in range(n_cycles)]
sched = GeneralScheduler(learn, phases)
learn.callbacks.append(sched)