First, thanks to the author of https://www.jianshu.com/p/9dc9f41f0b29, whose article gave me my first real understanding of LSTM.
I also want to recommend Hung-yi Lee's (李宏毅) LSTM lecture, which is exceptionally easy to follow: https://www.youtube.com/watch?v=xCGidAeyS4M
Understanding RNN
To understand LSTM you first need to understand RNN. An RNN (Recurrent Neural Network) is a class of neural networks for processing sequential data. The classic application of RNNs is NLP: natural language is a sequence in which the context carries dependencies between words. RNNs are networks that contain loops, which allows information to persist from one step to the next.
Sometimes, though, the information relevant to the current prediction sits far back in the context. As the gap between the relevant information and the position where it is needed grows, a plain RNN loses the ability to learn to connect the two; LSTMs handle exactly this situation well.
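As a concrete (standard) reference point, a vanilla RNN keeps a hidden state h_t that is recomputed at every time step from the current input x_t and the previous hidden state h_{t-1}:

h_t = tanh(W_h · h_{t-1} + W_x · x_t + b)

The same weights are reused at every step, and h_t is what carries information forward along the sequence.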
Understanding LSTM
Long Short-Term Memory networks, usually just called LSTMs, are a special kind of RNN capable of learning long-term dependencies.
All RNNs have the form of a chain of repeating neural-network modules. In a standard RNN, this repeating module has a very simple structure, such as a single tanh layer.
LSTMs also have this chain-like structure, but the repeating module is different: instead of a single neural-network layer there are four, interacting in a very particular way.
Gate structure
LSTMs have the ability to remove or add information to the cell state, regulated by structures called gates. A gate is a way to optionally let information through: it consists of a sigmoid neural-network layer and an element-wise multiplication.
The sigmoid layer outputs numbers between 0 and 1, describing how much of each component should be let through. A value of 0 means "let nothing through", while a value of 1 means "let everything through"!
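To make this concrete, here is a tiny illustrative example (not from the original article) of what a gate does numerically:

import torch

information = torch.tensor([1.0, -2.0, 3.0])           # values flowing through the cell
gate = torch.sigmoid(torch.tensor([-5.0, 0.0, 5.0]))   # in a real LSTM this comes from a learned layer
print(gate)                  # tensor([0.0067, 0.5000, 0.9933])
print(gate * information)    # element-wise product: ~0 blocks a value, ~1 lets it through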
Step 1: decide what information to throw away
The first step in our LSTM is to decide what information to throw away from the cell state. This decision is made by a sigmoid layer called the forget gate. It looks at h_{t-1} (the previous hidden state) and x_t (the current input) and outputs a number between 0 and 1 for each number in the cell state C_{t-1}: 1 means "keep this completely", while 0 means "get rid of this completely".
For example, in NLP, when we see a new subject we want to forget the old subject.
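In the standard LSTM formulation (σ denotes the sigmoid function), the forget gate is:

f_t = σ(W_f · [h_{t-1}, x_t] + b_f)

where [h_{t-1}, x_t] is the concatenation of the previous hidden state and the current input.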
Step 2: decide what new information to store
The next step is to decide what new information to store in the cell state. This has two parts. First, a sigmoid layer called the input gate decides which values we will update. Then, a tanh layer creates a vector of new candidate values, C̃_t, that could be added to the state. In the next step we combine these two to produce an update to the state; in the language example, this is where we add the new subject to replace the old one we are forgetting.
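In standard notation, these two parts are:

i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C · [h_{t-1}, x_t] + b_C)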
Step 3: update the cell state
Now we update the old cell state C_{t-1} into the new cell state C_t. We multiply the old state by f_t, forgetting the things we decided to forget, and then add i_t * C̃_t: the new candidate values, scaled by how much we decided to update each state component.
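Written out, the cell-state update is:

C_t = f_t * C_{t-1} + i_t * C̃_t

where * is element-wise multiplication.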
Step 4: decide what to output
Finally, we run a sigmoid layer that decides which parts of the cell state to output. We then put the cell state through tanh (squashing the values to between -1 and 1) and multiply it by the output of the sigmoid gate, so that we only output the parts we decided to.
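The corresponding equations are:

o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t * tanh(C_t)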
A brief explanation of LSTM parameters and how they are used
LSTM parameters
Looking up nn.LSTM in the official PyTorch documentation, you can see that the model takes 7 constructor parameters:
- input_size – The number of expected features in the input x
- hidden_size – The number of features in the hidden state h
- num_layers – Number of recurrent layers. E.g., setting num_layers=2 would mean stacking two LSTMs together to form a stacked LSTM, with the second LSTM taking in outputs of the first LSTM and computing the final results. Default: 1
- bias – If False, then the layer does not use bias weights b_ih and b_hh. Default: True
- batch_first – If True, then the input and output tensors are provided as (batch, seq, feature). Default: False
- dropout – If non-zero, introduces a Dropout layer on the outputs of each LSTM layer except the last layer, with dropout probability equal to dropout. Default: 0
- bidirectional – If True, becomes a bidirectional LSTM. Default: False
num_layers is the number of stacked LSTM layers; the default is 1. If it is set to 2, the second LSTM consumes the results of the first: the first layer takes the input sequence [ X0 X1 X2 ... Xt ] and computes [ h0 h1 h2 ... ht ], and the second layer then uses that [ h0 h1 h2 ... ht ] as its own input sequence and computes the final [ h0 h1 h2 ... ht ].
bias is the bias (offset) term. Without a bias, the transformation is centred on 0, i.e. it starts from 0.
batch_first: whether the first dimension of the input and output tensors is batch_size; default False. By default the input is (seq_len, batch, input_size); with batch_first=True it becomes (batch, seq, feature).
bidirectional: whether the RNN is bidirectional; default False. If True, num_directions = 2, otherwise num_directions = 1.
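To see how num_layers, batch_first and bidirectional affect the shapes, here is a small sanity check (the sizes are arbitrary example values):

import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=10, hidden_size=20, num_layers=2,
               batch_first=True, bidirectional=True)
x = torch.randn(4, 7, 10)    # (batch, seq_len, input_size) because batch_first=True
output, (h_n, c_n) = lstm(x)
print(output.shape)          # torch.Size([4, 7, 40]) -> (batch, seq_len, num_directions * hidden_size)
print(h_n.shape)             # torch.Size([4, 4, 20]) -> (num_layers * num_directions, batch, hidden_size)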
LSTM inputs
Inputs: input(seq_len, batch, input_size), (h_0, c_0)
h_0 of shape (num_layers * num_directions, batch, hidden_size)
c_0 of shape (num_layers * num_directions, batch, hidden_size)
If (h_0, c_0) is not provided, both h_0 and c_0 default to zero.
LSTM outputs
Outputs: output(seq_len, batch, num_directions * hidden_size), (h_n, c_n)
h_n of shape (num_layers * num_directions, batch, hidden_size)
c_n of shape (num_layers * num_directions, batch, hidden_size)
Example
>>> rnn = nn.LSTM(10, 20, 2)
>>> input = torch.randn(5, 3, 10)
>>> h0 = torch.randn(2, 3, 20)
>>> c0 = torch.randn(2, 3, 20)
>>> output, (hn, cn) = rnn(input, (h0, c0))
Below I walk through the parameters and their use with LSTM code I used in a Kaggle competition.
Declaring the LSTM NeuralNet
class NeuralNet(nn.Module):
    def __init__(self, embedding_matrix, num_aux_targets):
        super(NeuralNet, self).__init__()
        embed_size = embedding_matrix.shape[1]

        self.embedding = nn.Embedding(max_features, embed_size)
        self.embedding.weight = nn.Parameter(torch.tensor(embedding_matrix, dtype=torch.float32))
        self.embedding.weight.requires_grad = False
        self.embedding_dropout = SpatialDropout(0.3)

        self.lstm1 = nn.LSTM(embed_size, LSTM_UNITS, bidirectional=True, batch_first=True)
        self.lstm2 = nn.LSTM(LSTM_UNITS * 2, LSTM_UNITS, bidirectional=True, batch_first=True)

        self.linear1 = nn.Linear(DENSE_HIDDEN_UNITS, DENSE_HIDDEN_UNITS)
        self.linear2 = nn.Linear(DENSE_HIDDEN_UNITS, DENSE_HIDDEN_UNITS)

        self.linear_out = nn.Linear(DENSE_HIDDEN_UNITS, 1)
        self.linear_aux_out = nn.Linear(DENSE_HIDDEN_UNITS, num_aux_targets)

    def forward(self, x, lengths=None):
        h_embedding = self.embedding(x.long())
        h_embedding = self.embedding_dropout(h_embedding)

        h_lstm1, _ = self.lstm1(h_embedding)
        h_lstm2, _ = self.lstm2(h_lstm1)

        # global average pooling
        avg_pool = torch.mean(h_lstm2, 1)
        # global max pooling
        max_pool, _ = torch.max(h_lstm2, 1)

        h_conc = torch.cat((max_pool, avg_pool), 1)
        h_conc_linear1 = F.relu(self.linear1(h_conc))
        h_conc_linear2 = F.relu(self.linear2(h_conc))

        hidden = h_conc + h_conc_linear1 + h_conc_linear2

        result = self.linear_out(hidden)
        aux_result = self.linear_aux_out(hidden)
        out = torch.cat([result, aux_result], 1)

        return out
self.lstm1 = nn.LSTM(embed_size, LSTM_UNITS, bidirectional=True, batch_first=True)
# input_size = embed_size, hidden_size = LSTM_UNITS
self.lstm2 = nn.LSTM(LSTM_UNITS * 2, LSTM_UNITS, bidirectional=True, batch_first=True)
# input_size = LSTM_UNITS * 2, hidden_size = LSTM_UNITS
h_lstm1, _ = self.lstm1(h_embedding)
# the last dimension of h_lstm1 is LSTM_UNITS * 2 (bidirectional doubles it)
h_lstm2, _ = self.lstm2(h_lstm1)
# the last dimension of h_lstm2 is also LSTM_UNITS * 2
So this example clearly stacks two bidirectional LSTM layers.
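Assuming, for illustration, that the input x is a batch of token-id sequences of shape (batch, seq_len), the tensor shapes flow roughly like this:

# x:           (batch, seq_len)                   token ids
# h_embedding: (batch, seq_len, embed_size)       after the embedding layer
# h_lstm1:     (batch, seq_len, LSTM_UNITS * 2)   bidirectional doubles the last dimension
# h_lstm2:     (batch, seq_len, LSTM_UNITS * 2)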
h_conc = torch.cat((max_pool, avg_pool), 1)
Concatenates the given sequence of seq tensors in the given dimension. All tensors must either have the same shape (except in the concatenating dimension) or be empty.
h_conc_linear1 = F.relu(self.linear1(h_conc))
self.linear1 = nn.Linear(DENSE_HIDDEN_UNITS, DENSE_HIDDEN_UNITS): both its input and output have the dense hidden size.
In the end, result has 1 output column and aux_result has num_aux_targets columns; concatenated, they make up the output_dim = 7 columns used during training below.
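A quick sketch of the pooling and concatenation step (LSTM_UNITS = 128 is an assumed example value; the concatenation implies DENSE_HIDDEN_UNITS must equal LSTM_UNITS * 4):

import torch

LSTM_UNITS = 128                                 # assumed example value
h_lstm2 = torch.randn(4, 7, LSTM_UNITS * 2)      # (batch, seq_len, LSTM_UNITS * 2)
avg_pool = torch.mean(h_lstm2, 1)                # (4, 256), mean over the seq_len dimension
max_pool, _ = torch.max(h_lstm2, 1)              # (4, 256), max over the seq_len dimension
h_conc = torch.cat((max_pool, avg_pool), 1)      # (4, 512) = (batch, LSTM_UNITS * 4)
print(h_conc.shape)                              # torch.Size([4, 512])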
Training the LSTM model
def train_model(learn, test, output_dim, lr=0.001,
                batch_size=512, n_epochs=5,
                enable_checkpoint_ensemble=True):
    all_test_preds = []
    checkpoint_weights = [1, 2, 4, 8, 6]
    test_loader = torch.utils.data.DataLoader(test, batch_size=batch_size, shuffle=False)

    n = len(learn.data.train_dl)
    phases = [(TrainingPhase(n).schedule_hp('lr', lr * (0.6**(i)))) for i in range(n_epochs)]
    sched = GeneralScheduler(learn, phases)
    learn.callbacks.append(sched)

    for epoch in range(n_epochs):
        learn.fit(1)  # train for one epoch

        test_preds = np.zeros((len(test), output_dim))
        for i, x_batch in enumerate(test_loader):
            X = x_batch[0].cuda()
            # sigmoid here is a NumPy-based helper defined elsewhere in the kernel
            y_pred = sigmoid(learn.model(X).detach().cpu().numpy())
            test_preds[i * batch_size:(i + 1) * batch_size, :] = y_pred

        all_test_preds.append(test_preds)

    if enable_checkpoint_ensemble:
        test_preds = np.average(all_test_preds, weights=checkpoint_weights, axis=0)
    else:
        test_preds = all_test_preds[-1]

    return test_preds
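The checkpoint_weights list implements a simple checkpoint ensemble: the predictions saved after each of the 5 epochs are combined with a weighted average, with later (usually better) epochs weighted more heavily. A tiny made-up illustration of that last step:

import numpy as np

# made-up predictions from 5 checkpoints for 2 test samples
all_test_preds = [np.array([[0.1], [0.9]]),
                  np.array([[0.2], [0.8]]),
                  np.array([[0.3], [0.7]]),
                  np.array([[0.4], [0.6]]),
                  np.array([[0.5], [0.5]])]
ensembled = np.average(all_test_preds, weights=[1, 2, 4, 8, 6], axis=0)
print(ensembled)   # each entry is the weighted mean over the 5 checkpoints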
Main program
model = NeuralNet(embedding_matrix, y_aux_train.shape[-1])
# model is the stacked bidirectional LSTM defined above
learn = Learner(databunch, model, loss_func=custom_loss)
# instantiate a Learner; Learner is a fastai class
test_preds = train_model(learn, test_dataset, output_dim=7)
all_test_preds.append(test_preds)
Class Learner
The Learner class lives in fastai's basic_train module. It trains model on data by using an optimizer to minimize loss_func, so you get a well-trained model without writing the training loop yourself. The tracked metrics are reported after every epoch, and callbacks can be attached to hook into training.
Learner(data:DataBunch, model:Module, opt_func:Callable='Adam', loss_func:Callable=None,
        metrics:Collection[Callable]=None, true_wd:bool=True, bn_wd:bool=True, wd:Floats=0.01,
        train_bn:bool=True, path:str=None, model_dir:PathOrStr='models',
        callback_fns:Collection[Callable]=None, callbacks:Collection[Callback]=<factory>,
        layer_groups:ModuleList=None, add_time:bool=True, silent:bool=None)
fit(epochs:int, lr:Union[float, Collection[float], slice]=slice(None, 0.003, None),
    wd:Floats=None, callbacks:Collection[Callback]=None)
learn.recorder.plot()  # plot how the loss changes with the learning rate
predict(item:ItemBase, return_x:bool=False, batch_first:bool=True, with_dropout:bool=False, **kwargs)
class GeneralScheduler
GeneralScheduler(learn:Learner, phases:Collection[TrainingPhase], start_epoch:int=None) :: LearnerCallback
class TrainingPhase
TrainingPhase(length:int)
schedule_hp
schedule_hp(name, vals, anneal=None)
Adds a schedule for name between vals using anneal.
The phase will make the hyper-parameter vary from the first value in vals to the second, following anneal. If an annealing function is specified but vals is a float, it will decay to 0. If no annealing function is specified, the default is a linear annealing for a tuple, a constant parameter if it's a float.
A few basic hyper-parameters already have reserved names, such as the lr used in the example above; you can also schedule any other hyper-parameter your optimizer exposes, for instance eps when using Adam:
- 'lr' for learning rate
- 'mom' for momentum (or beta1 in Adam)
- 'beta' for the beta2 in Adam or the alpha in RMSprop
- 'wd' for weight decay
Code example
phases = [(TrainingPhase(n * (cycle_len * cycle_mult**i))
           .schedule_hp('lr', lr, anneal=annealing_cos)
           .schedule_hp('mom', mom)) for i in range(n_cycles)]
sched = GeneralScheduler(learn, phases)
learn.callbacks.append(sched)