Attention
Attention starts from two distinct parties: a target-side (query) state and the source-side states it attends to.
Paper: https://arxiv.org/pdf/1703.03906.pdf
seq2seq repository: https://github.com/google/seq2seq
Computation methods:
Additive attention, e.g. Bahdanau attention:
$$\boldsymbol{v}_a^{\top} \tanh \left(\boldsymbol{W}_{1} \boldsymbol{h}_t+\boldsymbol{W}_{2} \overline{\boldsymbol{h}}_s\right)$$
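A minimal NumPy sketch of the additive score above. The hidden sizes and the variable names (h_t, h_s, W1, W2, v_a) are illustrative assumptions, not values from the paper; the point is only the v_a^T tanh(W1 h_t + W2 h_s) computation for one decoder state and one source state.

```python
import numpy as np

def additive_score(h_t, h_s, W1, W2, v_a):
    """Additive (Bahdanau-style) score: v_a^T tanh(W1 h_t + W2 h_s)."""
    return v_a @ np.tanh(W1 @ h_t + W2 @ h_s)

# Toy example with assumed sizes: hidden size 4, attention size 3.
rng = np.random.default_rng(0)
h_t = rng.standard_normal(4)          # decoder (target) state at step t
h_s = rng.standard_normal(4)          # one encoder (source) state
W1 = rng.standard_normal((3, 4))
W2 = rng.standard_normal((3, 4))
v_a = rng.standard_normal(3)
print(additive_score(h_t, h_s, W1, W2, v_a))  # a single scalar score
```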
Multiplicative attention, e.g. Luong attention:
$$\operatorname{score}\left(\boldsymbol{h}_{t}, \overline{\boldsymbol{h}}_{s}\right)=\begin{cases} \boldsymbol{h}_{t}^{\top} \overline{\boldsymbol{h}}_{s} & \text{dot} \\ \boldsymbol{h}_{t}^{\top} \boldsymbol{W}_{a} \overline{\boldsymbol{h}}_{s} & \text{general} \\ \boldsymbol{v}_{a}^{\top} \tanh \left(\boldsymbol{W}_{a}\left[\boldsymbol{h}_{t} ; \overline{\boldsymbol{h}}_{s}\right]\right) & \text{concat} \end{cases}$$
Source paper: https://arxiv.org/pdf/1508.04025.pdf
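A hedged NumPy sketch of the three Luong score variants (dot, general, concat) from the formula above; the toy dimensions and the parameter names W_a and v_a are assumptions chosen only to make the example runnable.

```python
import numpy as np

def luong_score(h_t, h_s, variant="dot", W_a=None, v_a=None):
    """score(h_t, h_s) in the dot / general / concat forms above."""
    if variant == "dot":
        return h_t @ h_s                                   # h_t^T h_s
    if variant == "general":
        return h_t @ W_a @ h_s                             # h_t^T W_a h_s
    if variant == "concat":
        return v_a @ np.tanh(W_a @ np.concatenate([h_t, h_s]))
    raise ValueError(variant)

rng = np.random.default_rng(0)
h_t, h_s = rng.standard_normal(4), rng.standard_normal(4)
print(luong_score(h_t, h_s, "dot"))
print(luong_score(h_t, h_s, "general", W_a=rng.standard_normal((4, 4))))
print(luong_score(h_t, h_s, "concat",
                  W_a=rng.standard_normal((3, 8)), v_a=rng.standard_normal(3)))
```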
From Attention to Self-Attention
Self-Attention
The paper “Attention is All You Need” introduced Multi-Head Self-Attention, whose core operation is Scaled Dot-Product Attention:
$$\operatorname{Attention}(Q, K, V)=\operatorname{softmax}\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V$$
Source paper: https://arxiv.org/pdf/1706.03762.pdf
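A minimal NumPy sketch of scaled dot-product attention as written above; the matrix shapes (2 queries, 5 keys/values, d_k = 8, d_v = 16) are arbitrary assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n_q, n_k)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # (n_q, d_v)

rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 8))    # 2 queries, d_k = 8
K = rng.standard_normal((5, 8))    # 5 keys
V = rng.standard_normal((5, 16))   # 5 values, d_v = 16
print(scaled_dot_product_attention(Q, K, V).shape)  # (2, 16)
```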
Scaled
The purpose of the scaling is to moderate the dot product so that it does not become too large (if it does, the softmax output is essentially 0 or 1, no longer "soft").
Reference: https://kexue.fm/archives/4765
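A small NumPy demonstration of that saturation argument; d_k = 512 and the random vectors are assumptions chosen only to show that unscaled dot products (whose variance grows with d_k) push the softmax toward a near one-hot distribution, while dividing by sqrt(d_k) keeps it smooth.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_k = 512
q = rng.standard_normal(d_k)           # one query
K = rng.standard_normal((4, d_k))      # four keys

raw = K @ q                            # unscaled dot products, std ~ sqrt(d_k)
print(softmax(raw))                    # typically close to one-hot: not "soft"
print(softmax(raw / np.sqrt(d_k)))     # scaled: a much smoother distribution
```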
Multi-Head
Multi-Head can be understood as several attention modules in parallel; the hope is that different heads "attend" to different things, much like the multiple kernels of a CNN.
Multi-head attention allows the model to jointly attend to information from different representation
subspaces at different positions.
$$\begin{aligned} \operatorname{MultiHead}(Q, K, V) &=\operatorname{Concat}\left(\operatorname{head}_1, \ldots, \operatorname{head}_h\right) W^O \\ \text{where } \operatorname{head}_i &=\operatorname{Attention}\left(Q W_i^Q, K W_i^K, V W_i^V\right) \end{aligned}$$
Source paper: https://arxiv.org/pdf/1706.03762.pdf
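A hedged NumPy sketch of the multi-head formula above, reusing the scaled dot-product attention from earlier. The sizes (d_model = 16, h = 4 heads, per-head dimension 4) and the per-head projection lists WQ, WK, WV plus the output projection WO are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single head."""
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k), axis=-1) @ V

def multi_head_attention(Q, K, V, WQ, WK, WV, WO):
    """MultiHead(Q,K,V) = Concat(head_1, ..., head_h) W^O,
    where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)."""
    heads = [attention(Q @ WQ_i, K @ WK_i, V @ WV_i)
             for WQ_i, WK_i, WV_i in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO

# Assumed toy sizes: d_model = 16, h = 4 heads, d_k = d_v = 4 per head.
rng = np.random.default_rng(0)
d_model, h, d_k = 16, 4, 4
Q = rng.standard_normal((3, d_model))   # 3 query positions
K = rng.standard_normal((6, d_model))   # 6 key/value positions
V = rng.standard_normal((6, d_model))
WQ = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
WK = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
WV = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
WO = rng.standard_normal((h * d_k, d_model))
print(multi_head_attention(Q, K, V, WQ, WK, WV, WO).shape)  # (3, 16)
```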