Attention
Attention starts from two distinct parties: a target-side (query) state and the source-side states it attends to.
Paper: https://arxiv.org/pdf/1703.03906.pdf
seq2seq repository: https://github.com/google/seq2seq
Computation methods:
Additive attention, e.g. Bahdanau attention:
$$\boldsymbol{v}_a^{\top} \tanh \left(\boldsymbol{W}_{1} \boldsymbol{h}_t+\boldsymbol{W}_{2} \overline{\boldsymbol{h}}_s\right)$$
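A minimal NumPy sketch of the additive score above. The hidden sizes and the variable names (h_t, h_s, W1, W2, v_a) are illustrative assumptions, not values from the paper; the point is only the v_a^T tanh(W1 h_t + W2 h_s) computation for one decoder state and one source state.

```python
import numpy as np

def additive_score(h_t, h_s, W1, W2, v_a):
    """Additive (Bahdanau-style) score: v_a^T tanh(W1 h_t + W2 h_s)."""
    return v_a @ np.tanh(W1 @ h_t + W2 @ h_s)

# Toy example with assumed sizes: hidden size 4, attention size 3.
rng = np.random.default_rng(0)
h_t = rng.standard_normal(4)          # decoder (target) state at step t
h_s = rng.standard_normal(4)          # one encoder (source) state
W1 = rng.standard_normal((3, 4))
W2 = rng.standard_normal((3, 4))
v_a = rng.standard_normal(3)
print(additive_score(h_t, h_s, W1, W2, v_a))  # a single scalar score
```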
Multiplicative attention, e.g. Luong attention:
$$\operatorname{score}\left(\boldsymbol{h}_{t}, \overline{\boldsymbol{h}}_{s}\right)=\begin{cases} \boldsymbol{h}_{t}^{\top} \overline{\boldsymbol{h}}_{s} & \text{dot} \\ \boldsymbol{h}_{t}^{\top} \boldsymbol{W}_{a} \overline{\boldsymbol{h}}_{s} & \text{general} \\ \boldsymbol{v}_{a}^{\top} \tanh \left(\boldsymbol{W}_{a}\left[\boldsymbol{h}_{t} ; \overline{\boldsymbol{h}}_{s}\right]\right) & \text{concat} \end{cases}$$
Source paper: https://arxiv.org/pdf/1508.04025.pdf
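A hedged NumPy sketch of the three Luong score variants (dot, general, concat) from the formula above; the toy dimensions and the parameter names W_a and v_a are assumptions chosen only to make the example runnable.

```python
import numpy as np

def luong_score(h_t, h_s, variant="dot", W_a=None, v_a=None):
    """score(h_t, h_s) in the dot / general / concat forms above."""
    if variant == "dot":
        return h_t @ h_s                                   # h_t^T h_s
    if variant == "general":
        return h_t @ W_a @ h_s                             # h_t^T W_a h_s
    if variant == "concat":
        return v_a @ np.tanh(W_a @ np.concatenate([h_t, h_s]))
    raise ValueError(variant)

rng = np.random.default_rng(0)
h_t, h_s = rng.standard_normal(4), rng.standard_normal(4)
print(luong_score(h_t, h_s, "dot"))
print(luong_score(h_t, h_s, "general", W_a=rng.standard_normal((4, 4))))
print(luong_score(h_t, h_s, "concat",
                  W_a=rng.standard_normal((3, 8)), v_a=rng.standard_normal(3)))
```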
From Attention to Self-Attention
Self-Attention
The paper “Attention is All You Need” introduced Multi-Head Self-Attention, whose core operation is Scaled Dot-Product Attention:
$$\operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
Source paper: https://arxiv.org/pdf/1706.03762.pdf
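A minimal NumPy sketch of scaled dot-product attention as written above; the matrix shapes (2 queries, 5 keys/values, d_k = 8, d_v = 16) are arbitrary assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n_q, n_k)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # (n_q, d_v)

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 8))    # 2 queries, d_k = 8
K = rng.standard_normal((5, 8))    # 5 keys
V = rng.standard_normal((5, 16))   # 5 values, d_v = 16
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 16)
```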
Scaled
The purpose of the scaling is to moderate the dot product so that it does not become too large (if it does, the softmax output is essentially 0 or 1, no longer "soft").
Reference: https://kexue.fm/archives/4765
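A small NumPy demonstration of that saturation argument; d_k = 512 and the random vectors are assumptions chosen only to show that unscaled dot products (whose variance grows with d_k) push the softmax toward a near one-hot distribution, while dividing by sqrt(d_k) keeps it smooth.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal(d_k)           # one query
K = rng.standard_normal((4, d_k))      # four keys

raw = K @ q                            # unscaled dot products, std ~ sqrt(d_k)
print(softmax(raw))                    # typically close to one-hot: not "soft"
print(softmax(raw / np.sqrt(d_k)))     # scaled: a much smoother distribution
```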
Multi-Head
Multi-Head can be understood as several attention modules in parallel; the hope is that different heads "attend" to different things, much like the multiple kernels of a CNN.
Multi-head attention allows the model to jointly attend to information from different representation
subspaces at different positions.
$$\begin{aligned} \operatorname{MultiHead}(Q, K, V) &=\operatorname{Concat}\left(\operatorname{head}_1, \ldots, \operatorname{head}_h\right) W^O \\ \text{where } \operatorname{head}_i &=\operatorname{Attention}\left(Q W_i^Q, K W_i^K, V W_i^V\right) \end{aligned}$$
Source paper: https://arxiv.org/pdf/1706.03762.pdf
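A hedged NumPy sketch of the multi-head formula above, reusing the scaled dot-product attention from earlier. The sizes (d_model = 16, h = 4 heads, per-head dimension 4) and the per-head projection lists WQ, WK, WV plus the output projection WO are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multi_head_attention(Q, K, V, WQ, WK, WV, WO):
    """MultiHead(Q,K,V) = Concat(head_1, ..., head_h) W^O,
    where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)."""
    heads = [attention(Q @ WQ_i, K @ WK_i, V @ WV_i)
             for WQ_i, WK_i, WV_i in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO

# Assumed toy sizes: d_model = 16, h = 4 heads, d_k = d_v = 4 per head.
rng = np.random.default_rng(0)
d_model, h, d_k = 16, 4, 4
Q = rng.standard_normal((3, d_model))   # 3 query positions
K = rng.standard_normal((6, d_model))   # 6 key/value positions
V = rng.standard_normal((6, d_model))
WQ = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
WK = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
WV = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
WO = rng.standard_normal((h * d_k, d_model))
print(multi_head_attention(Q, K, V, WQ, WK, WV, WO).shape)  # (3, 16)
```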