NLP-实体&关系联合抽取-2021：GPLinker

2023-11-19

基础思路

关系抽取乍看之下是三元组 ( s , p , o ) (s,p,o) (s,p,o)（即subject, predicate, object)的抽取，但落到具体实现上，它实际是“五元组” ( s h , s t , p , o h , o t ) (s_h,s_t,p,o_h,o_t) (sh,st,p,oh,ot) 的抽取，其中 s h , s t s_h,s_t sh,st 分别是 s s s 的首、尾位置，而 o h , o t o_h,o_t oh,ot 则分别是 o o o 的首、尾位置。

从概率图的角度来看，我们可以这样构建模型：

设计一个五元组的打分函数 S ( s h , s t , p , o h , o t ) S(s_h,s_t,p,o_h,o_t) S(sh,st,p,oh,ot)；
训练时让标注的五元组 S ( s h , s t , p , o h , o t ) > 0 S(s_h,s_t,p,o_h,o_t) > 0 S(sh,st,p,oh,ot)>0，其余五元组则 S ( s h , s t , p , o h , o t ) < 0 S(s_h,s_t,p,o_h,o_t) < 0 S(sh,st,p,oh,ot)<0；
预测时枚举所有可能的五元组，输出 S ( s h , s t , p , o h , o t ) > 0 S(s_h,s_t,p,o_h,o_t) > 0 S(sh,st,p,oh,ot)>0 的部分。

然而，直接枚举所有的五元组数目太多，假设句子长度为 l l l， p p p的总数为 n n n，即便加上 s h ≤ s t s_h≤s_t sh≤st 和 o h ≤ o t o_h≤o_t oh≤ot 的约束，所有五元组的数目也有

n × l ( l + 1 ) 2 × l ( l + 1 ) 2 = 1 4 n l 2 ( l + 1 ) 2 \begin{aligned} n\times \frac{l(l+1)}{2}\times \frac{l(l+1)}{2}=\frac{1}{4}nl^2(l+1)^2 \end{aligned} n×2l(l+1)×2l(l+1)=41nl2(l+1)2

这是长度的四次方级别的计算量，实际情况下难以实现，所以必须做一些简化。

简化分解

以我们目前的算力来看，一般最多也就能接受长度平方级别的计算量，所以我们每次顶多能识别“一对”首或尾，为此，我们可以用以下的分解：
S ( s h , s t , p , o h , o t ) = S ( s h , s t ) + S ( o h , o t ) + S ( s h , o h ∣ p ) + S ( s t , o t ∣ p ) \begin{aligned} S(s_h,s_t,p,o_h,o_t) = S(s_h,s_t) + S(o_h,o_t) + S(s_h,o_h| p) + S(s_t, o_t| p) \end{aligned} S(sh,st,p,oh,ot)=S(sh,st)+S(oh,ot)+S(sh,oh∣p)+S(st,ot∣p)

要注意的是，该等式属于模型假设，是基于我们对任务的理解以及算力的限制所设计出来的，而不是理论推导出来的。其中，每一项都具直观的意义，比如：

S ( s h , s t ) S(s_h,s_t) S(sh,st)、 S ( o h , o t ) S(o_h,o_t) S(oh,ot) 分别是subject、object的首尾打分，通过 S ( s h , o h ∣ p ) S(s_h,o_h|p) S(sh,oh∣p) 和 S ( o h , o t ) > 0 S(o_h,o_t) > 0 S(oh,ot)>0 来析出所有的subject和object。
至于后两项，则是predicate的匹配：
- S ( s h , o h ∣ p ) S(s_h,o_h|p) S(sh,oh∣p) 这一项代表以subject和object的首特征作为它们自身的表征来进行一次匹配，如果我们能确保subject内和object内是没有嵌套实体的，那么理论上 S ( s h , o h ∣ p ) > 0 S(s_h,o_h|p) > 0 S(sh,oh∣p)>0 就足够析出所有的predicate了，但考虑到存在嵌套实体的可能，所以我们还要对实体的尾再进行一次匹配，即 S ( s t , o t ∣ p ) S(s_t, o_t|p) S(st,ot∣p) 这一项。

此时，训练和预测过程变为：

训练时让标注的五元组 S ( s h , s t ) > 0 S(s_h,s_t) > 0 S(sh,st)>0、 S ( o h , o t ) > 0 S(o_h,o_t) > 0 S(oh,ot)>0、 S ( s h , o h ∣ p ) > 0 S(s_h,o_h| p) > 0 S(sh,oh∣p)>0、 S ( s t , o t ∣ p ) > 0 S(s_t, o_t| p) > 0 S(st,ot∣p)>0，其余五元组则 S ( s h , s t ) < 0 S(s_h,s_t) < 0 S(sh,st)<0、 S ( o h , o t ) < 0 S(o_h,o_t) < 0 S(oh,ot)<0、 S ( s h , o h ∣ p ) < 0 S(s_h,o_h| p) < 0 S(sh,oh∣p)<0、 S ( s t , o t ∣ p ) < 0 S(s_t, o_t| p) < 0 S(st,ot∣p)<0；
预测时枚举所有可能的五元组，逐次输出 S ( s h , s t ) > 0 S(s_h,s_t) > 0 S(sh,st)>0、 S ( o h , o t ) > 0 S(o_h,o_t) > 0 S(oh,ot)>0、 S ( s h , o h ∣ p ) > 0 S(s_h,o_h| p) > 0 S(sh,oh∣p)>0、 S ( s t , o t ∣ p ) > 0 S(s_t, o_t| p) > 0 S(st,ot∣p)>0 的部分，然后取它们的交集作为最终的输出（即同时满足4个条件）。

在实现上：

由于 S ( s h , s t ) S(s_h,s_t) S(sh,st)、 S ( o h , o t ) S(o_h,o_t) S(oh,ot) 是用来识别subject、object对应的实体的，它相当于有两种实体类型的NER任务，所以我们可以用一个GlobalPointer来完成；
至于 S ( s h , o h ∣ p ) S(s_h,o_h| p) S(sh,oh∣p)，它是用来识别predicate为 p p p 的 ( s h , o h ) (s_h,o_h) (sh,oh) 对，跟NER不同的是，它这里不需要 s h ≤ o h s_h≤o_h sh≤oh 的约束，这里我们同样用GlobalPointer来完成，但为了识别出 s h > o h s_h>o_h sh>oh 的部分，要去掉GlobalPointer默认的下三角mask；最后 S ( s t , o t ∣ p ) S(s_t,o_t|p) S(st,ot∣p) 跟 S ( s h , o h ∣ p ) S(s_h,o_h|p) S(sh,oh∣p) 同理，不再赘述。

这里再回顾一遍：我们知道，作为NER模块，GlobalPointer可以统一识别嵌套和非嵌套的实体，而这是它基于token-pair的识别来做到的。所以，我们应该进一步将GlobalPointer理解为一个token-pair的识别模型，而不是局限在NER范围内理解它。认识到这一点之后，我们就能明白上述 S ( s h , s t ) S(s_h,s_t) S(sh,st)、 S ( o h , o t ) S(o_h,o_t) S(oh,ot)、 S ( s h , o h ∣ p ) S(s_h,o_h| p) S(sh,oh∣p)、 S ( s t , o t ∣ p ) S(s_t, o_t|p) S(st,ot∣p) 其实都可以用GlobalPointer来实现了，而要不要加下三角mask，则自行根据具体任务背景设置就好。

损失函数

现在我们已经把打分函数都设计好了，那么为了训练模型，就差损失函数了。这里继续使用GlobalPointer默认使用的、在《将“softmax+交叉熵”推广到多标签分类问题》中提出的多标签交叉熵，它的一般形式为：

log ⁡ ( 1 + ∑ i ∈ P e − S i ) + log ⁡ ( 1 + ∑ i ∈ N e S i ) \begin{aligned} \log \left(1 + \sum\limits_{i\in \mathcal{P}} e^{-S_i}\right) + \log \left(1 + \sum\limits_{i\in \mathcal{N}} e^{S_i}\right) \end{aligned} log(1+i∈P∑e−Si)+log(1+i∈N∑eSi)

实验结果

为了方便称呼，我们暂且将上述模型称为GPLinker（GlobalPointer-based Linking），一个基于bert4keras的参考实现如下：

脚本链接：task_relation_extraction_gplinker.py

在LIC2019上的实验结果如下（CasRel的代码为task_relation_extraction.py）：

模型 F1 CasRel 0.8220 GPLinker (Standard) 0.8272 GPLinker (Efficient) 0.8268 \begin{array}{c|c} \hline \text{模型} & \text{F1} \\ \hline \text{CasRel} & 0.8220 \\ \text{GPLinker (Standard)} & 0.8272\\ \text{GPLinker (Efficient)} & 0.8268\\ \hline \end{array} 模型CasRelGPLinker (Standard)GPLinker (Efficient)F10.82200.82720.8268

预训练模型是BERT base，Standard和Efficient的区别是分别使用了标准版GlobalPointer和Efficient GlobalPointer。该实验结果说明了两件事情，一是GPLinker确实比CasRel更加有效，二是Efficient GlobalPointer的设计确实能在更少参数的情况下媲美标准版GlobalPointer的效果。要知道在LIC2019这个任务下，如果使用标准版GlobalPointer，那么GPLinker的参数量接近1千万，而用Efficient GlobalPointer的话只有30万左右。

此外，在3090上，相比于“multi hot”版的多标签交叉熵，使用稀疏版多标签交叉熵的模型在训练速度上能提高1.5倍而不会损失精度，跟CasRel相比，使用了稀疏版多标签交叉熵的GPLinker在训练速度上只慢15%，但是解码速度快将近一倍，算得上又快又好了。