Self-Attention with Relative Position Representations
https://arxiv.org/pdf/1803.02155v1
Abstract
Relying entirely on an attention mechanism, the Transformer introduced by Vaswani et al. (2017) achieves state-of-the-art results for machine translation. In contrast to recurrent and convolutional neural networks, it does not explicitly model relative or absolute position information in its structure. Instead, it requires adding representations of absolute positions to its inputs. In this work we present an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances between sequence elements. On the WMT 2014 English-to-German and English-to-French translation tasks, this approach yields improvements of 1.3 BLEU and 0.3 BLEU over absolute position representations, respectively. Notably, we observe that combining relative and absolute position representations yields no further improvement in translation quality. We describe an efficient implementation of our method and cast it as an instance of relation-aware self-attention mechanisms that can generalize to arbitrary graph-labeled inputs.
1 Introduction
Recent approaches to sequence to sequence learning typically leverage recurrence (Sutskever et al., 2014), convolution (Gehring et al., 2017; Kalchbrenner et al., 2016), attention (Vaswani et al., 2017), or a combination of recurrence and attention (Bahdanau et al., 2014; Cho et al., 2014; Luong et al., 2015; Wu et al., 2016) as basic building blocks. These approaches incorporate information about the sequential position of elements differently.
Recurrent neural networks (RNNs) typically compute a hidden state h_t, as a function of their input at time t and a previous hidden state h_{t−1}, capturing relative and absolute positions along the time dimension directly through their sequential structure. Non-recurrent models do not necessarily consider input elements sequentially and may hence require explicitly encoding position information to be able to use sequence order. One common approach is to use position encodings which are combined with input elements to expose position information to the model. These position encodings can be a deterministic function of position (Sukhbaatar et al., 2015; Vaswani et al., 2017) or learned representations. Convolutional neural networks inherently capture relative positions within the kernel size of each convolution. They have been shown to still benefit from position encodings (Gehring et al., 2017), however.
For the Transformer, which employs neither convolution nor recurrence, incorporating explicit representations of position information is an especially important consideration since the model is otherwise entirely invariant to sequence ordering.
In this work we present an efficient way of incorporating relative positions in the self-attention mechanism of the Transformer. Even when entirely replacing its absolute position encodings, we demonstrate significant improvements in translation quality on two machine translation tasks.
Our approach can be cast as a special case of generalizing the self-attention mechanism of the Transformer to considering arbitrary relations between any two elements of the input, a direction we plan to explore in future work on modeling labeled, directed graphs.
2 Background
2.1 Transformer
The Transformer (Vaswani et al., 2017) employs an encoder-decoder structure, consisting of stacked encoder and decoder layers. Encoder layers consist of two sublayers: self-attention followed by a position-wise feed-forward layer. Decoder layers consist of three sublayers: self-attention followed by encoder-decoder attention, followed by a position-wise feed-forward layer. It uses residual connections around each of the sublayers, followed by layer normalization (Ba et al., 2016). The decoder uses masking in its self-attention to prevent a given output position from incorporating information about future output positions during training.
Note: study the layer normalization paper (Ba et al., 2016).
This structure lets the Transformer process sequence data effectively and capture long-range dependencies within a sequence through the attention mechanism. The encoder-decoder architecture is particularly well suited to tasks such as machine translation, where an input sequence must be transformed into a corresponding output sequence. Self-attention allows the model to dynamically assign different attention weights across different parts of the sequence, while position encodings ensure that the model can make use of word order. In this way, the Transformer has become a powerful model for a wide range of sequence-to-sequence tasks.
Position encodings based on sinusoids of varying frequency are added to encoder and decoder input elements prior to the first layer. In contrast to learned, absolute position representations, the authors hypothesized that sinusoidal position encodings would help the model to generalize to sequence lengths unseen during training by allowing it to learn to attend also by relative position. This property is shared by our relative position representations which, in contrast to absolute position representations, are invariant to the total sequence length.
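As a concrete reference, here is a minimal NumPy sketch of these sinusoidal encodings following Vaswani et al. (2017); the function and argument names are illustrative, not taken from the authors' code.

```python
import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i / d_model)), PE(pos, 2i + 1) = cos(...).

    Assumes d_model is even. Returns an array of shape (max_len, d_model)
    that is added to the input embeddings before the first layer.
    """
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe
```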
Residual connections help propagate position information to higher layers.
2.2 Self-Attention
Self-attention sublayers employ h attention heads. To form the sublayer output, results from each head are concatenated and a parameterized linear transformation is applied.
Each attention head operates on an input sequence, x = (x_1, …, x_n), of n elements where x_i ∈ ℝ^{d_x}, and computes a new sequence z = (z_1, …, z_n) of the same length where z_i ∈ ℝ^{d_z}.
Each output element, z_i, is computed as a weighted sum of linearly transformed input elements:

z_i = ∑_{j=1}^{n} α_{ij} (x_j W^V)    (1)
Each weight coefficient, α_{ij}, is computed using a softmax function:

α_{ij} = exp(e_{ij}) / ∑_{k=1}^{n} exp(e_{ik})
And e_{ij} is computed using a compatibility function that compares two input elements:

e_{ij} = (x_i W^Q)(x_j W^K)^T / √d_z    (2)
Dot product was chosen for the compatibility function, which enables efficient computation. Linear transformations of the inputs add sufficient expressive power.
W^Q, W^K, W^V ∈ ℝ^{d_x × d_z} are parameter matrices. These parameter matrices are unique per layer and attention head.
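To make eqs. (1) and (2) concrete, here is a minimal NumPy sketch of a single attention head; the function and variable names are illustrative and not the authors' implementation.

```python
import numpy as np

def attention_head(x, W_Q, W_K, W_V):
    """Single self-attention head.

    x: (n, d_x) input sequence; W_Q, W_K, W_V: (d_x, d_z) projections.
    Returns z of shape (n, d_z).
    """
    d_z = W_Q.shape[1]
    q, k, v = x @ W_Q, x @ W_K, x @ W_V           # (n, d_z) each
    e = q @ k.T / np.sqrt(d_z)                    # eq. (2): scaled dot-product compatibility
    alpha = np.exp(e - e.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)    # softmax over positions j
    return alpha @ v                              # eq. (1): weighted sum of transformed inputs
```

A full sublayer runs h such heads in parallel, concatenates their outputs, and applies a parameterized linear transformation.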
3 Proposed Architecture
3.1 Relation-aware Self-Attention
We propose an extension to self-attention to consider the pairwise relationships between input elements. In this sense, we model the input as a labeled, directed, fully-connected graph. The edge between input elements x_i and x_j is represented by a_{ij} ∈ ℝ^{d_a}.
We modify eq. (1) to propagate edge information to the output, using addition, which avoids significantly increasing computation:

z_i = ∑_{j=1}^{n} α_{ij} (x_j W^V + a_{ij}^V)    (3)
We also, importantly, modify eq. (2) to consider edges when determining compatibility:

e_{ij} = (x_i W^Q)(x_j W^K + a_{ij}^K)^T / √d_z    (4)
We continue to use dot product as the main operation in the compatibility function, which also avoids significantly increasing computation.
We learn two unique sets of relationships, a^V and a^K, suitable for use in eq. (3) and eq. (4), respectively, without requiring additional linear transformations. These representations can be shared across heads. We use d_a = d_z.
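A minimal sketch of the relation-aware head of eqs. (3) and (4), assuming the edge representations are given as (n, n, d_a) arrays a_K and a_V with d_a = d_z; this is the direct formulation, not the memory-efficient one of section 3.3.

```python
import numpy as np

def relation_aware_attention_head(x, W_Q, W_K, W_V, a_K, a_V):
    """x: (n, d_x); W_Q, W_K, W_V: (d_x, d_z); a_K, a_V: (n, n, d_z) edge vectors."""
    d_z = W_Q.shape[1]
    q, k, v = x @ W_Q, x @ W_K, x @ W_V
    # eq. (4): e_ij = q_i . (k_j + a_K[i, j]) / sqrt(d_z)
    e = (q @ k.T + np.einsum('id,ijd->ij', q, a_K)) / np.sqrt(d_z)
    alpha = np.exp(e - e.max(axis=-1, keepdims=True))
    alpha /= alpha.sum(axis=-1, keepdims=True)     # softmax over positions j
    # eq. (3): z_i = sum_j alpha_ij * (v_j + a_V[i, j])
    return alpha @ v + np.einsum('ij,ijd->id', alpha, a_V)
```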
3.2 Relative Position Representations
For linear sequences, edges can capture information about the relative position differences between input elements. The maximum relative position we consider is clipped to a maximum absolute value of k. We hypothesized that precise relative position information is not useful beyond a certain distance. Clipping the maximum distance also enables the model to generalize to sequence lengths not seen during training. Therefore, we consider 2k + 1 unique edge labels.
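A sketch of how the clipped relative-position edges can be built from two learned tables of 2k + 1 vectors each; the table names w_K and w_V are our own.

```python
import numpy as np

def relative_position_edges(n, k, w_K, w_V):
    """w_K, w_V: (2k + 1, d_a) learned relative-position embeddings.

    Returns a_K, a_V of shape (n, n, d_a), where
    a_K[i, j] = w_K[clip(j - i, -k, k) + k].
    """
    rel = np.arange(n)[None, :] - np.arange(n)[:, None]   # relative offsets j - i
    idx = np.clip(rel, -k, k) + k                         # clip and shift into 0 .. 2k
    return w_K[idx], w_V[idx]
```

These arrays can be shared across attention heads and across sequences in a batch, which is what the space analysis in section 3.3 relies on.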
3.3 Efficient Implementation
There are practical space complexity concerns when considering edges between input elements, as noted by Veličković et al. (2017), which considers unlabeled graph inputs to an attention model.
In a fully connected graph, learning a distinct relation representation for every pair of elements makes the storage cost grow quadratically with the sequence length, which becomes impractical for long sequences; an efficient way to represent and share these relations is therefore needed to keep the model scalable.
For a sequence of length n and h attention heads, we reduce the space complexity of storing relative position representations from O(h n² d_a) to O(n² d_a) by sharing them across heads. Additionally, relative position representations can be shared across sequences. Therefore, the overall self-attention space complexity increases from O(b h n d_z) to O(b h n d_z + n² d_a). Given d_a = d_z, the size of the relative increase depends on n/(bh).
The Transformer computes self-attention efficiently for all sequences, heads, and positions in a batch using parallel matrix multiplication operations (Vaswani et al., 2017). Without relative position representations, each e_{ij} can be computed using bh parallel multiplications of n × d_z and d_z × n matrices. Each matrix multiplication computes e_{ij} for all sequence positions, for a particular head and sequence. For any sequence and head, this requires sharing the same representation for each position across all compatibility function applications (dot products) with other positions.
When we consider relative positions the representations differ with different pairs of positions. This prevents us from computing all e_{ij} for all pairs of positions in a single matrix multiplication. We also want to avoid broadcasting relative position representations. However, both issues can be resolved by splitting the computation of eq. (4) into two terms:

e_{ij} = ((x_i W^Q)(x_j W^K)^T + (x_i W^Q)(a_{ij}^K)^T) / √d_z    (5)
The first term is identical to eq. (2), and can be computed as described above. For the second term involving relative position representations, tensor reshaping can be used to compute n parallel multiplications of bh × d_z and d_z × n matrices. Each matrix multiplication computes contributions to e_{ij} for all heads and batches, corresponding to a particular sequence position. Further reshaping allows adding the two terms. The same approach can be used to efficiently compute eq. (3).
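The reshaping described above can also be written compactly with einsum; a sketch assuming queries and keys of shape (b, h, n, d_z) and a shared relative-key tensor of shape (n, n, d_z).

```python
import numpy as np

def relative_attention_logits(q, k, rel_k):
    """q, k: (b, h, n, d_z); rel_k: (n, n, d_z), shared across heads and batch.

    Returns e of shape (b, h, n, n), computed as the two terms of eq. (5).
    """
    d_z = q.shape[-1]
    term1 = np.einsum('bhid,bhjd->bhij', q, k)      # (x_i W^Q)(x_j W^K)^T, as in eq. (2)
    term2 = np.einsum('bhid,ijd->bhij', q, rel_k)   # (x_i W^Q)(a_ij^K)^T
    return (term1 + term2) / np.sqrt(d_z)
```

An analogous einsum over the attention weights and the relative value tensor handles the second term of eq. (3).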
For our machine translation experiments, the result was a modest 7% decrease in steps per second, but we were able to maintain the same model and batch sizes on P100 GPUs as Vaswani et al. (2017).
4 Experiments
4.1 Experimental Setup
We use the tensor2tensor library for training and evaluating our model.
We evaluated our model on the WMT 2014 machine translation task, using the WMT 2014 English-German dataset consisting of approximately 4.5M sentence pairs and the 2014 WMT English-French dataset consisting of approximately 36M sentence pairs.
For all experiments, we split tokens into a 32,768 word-piece vocabulary (Wu et al., 2016). We batched sentence pairs by approximate length, and limited input and output tokens per batch to 4096 per GPU. Each resulting training batch contained approximately 25,000 source and 25,000 target tokens.
We used the Adam optimizer (Kingma and Ba, 2014) with β_1 = 0.9, β_2 = 0.98, and ε = 10⁻⁹. We used the same warmup and decay strategy for learning rate as Vaswani et al. (2017), with 4,000 warmup steps. During training, we employed label smoothing of value ε_ls = 0.1 (Szegedy et al., 2016). For evaluation, we used beam search with a beam size of 4 and length penalty α = 0.6 (Wu et al., 2016).
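For reference, the warmup-and-decay schedule of Vaswani et al. (2017) referred to above can be sketched as follows; taking d_model = 512 here is our assumption, based on the base configuration described below.

```python
def transformer_learning_rate(step, d_model=512, warmup_steps=4000):
    """Linear warmup followed by inverse square-root decay (Vaswani et al., 2017)."""
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```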
For our base model, we used 6 encoder and decoder layers, d_x = 512, d_z = 64, 8 attention heads, 1024 feed-forward inner-layer dimensions, and P_dropout = 0.1. When using relative position encodings, we used clipping distance k = 16, and used unique edge representations per layer and head. We trained for 100,000 steps on 8 K40 GPUs, and did not use checkpoint averaging.
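The base-model settings above, gathered into a plain summary for reference (keys are descriptive names chosen here, not actual tensor2tensor hyperparameter flags):

```python
base_config = {
    "encoder_layers": 6,
    "decoder_layers": 6,
    "d_x": 512,              # input / model dimension
    "d_z": 64,               # per-head dimension
    "num_heads": 8,
    "ffn_inner_dim": 1024,
    "dropout": 0.1,
    "relative_clip_k": 16,   # clipping distance for relative positions
    "train_steps": 100_000,
}
```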
For our big model, we used 6 encoder and decoder layers, d_x = 1024, d_z = 64, 16 attention heads, 4096 feed-forward inner-layer dimensions, and P_dropout = 0.3 for EN-DE and P_dropout = 0.1 for EN-FR. When using relative position encodings, we used k = 8, and used unique edge representations per layer. We trained for 300,000 steps on 8 P100 GPUs, and averaged the last 20 checkpoints, saved at 10 minute intervals.
4.2 Machine Translation
We compared our model using only relative position representations to the baseline Transformer (Vaswani et al., 2017) with sinusoidal position encodings. We generated baseline results to isolate the impact of relative position representations from any other changes to the underlying library and experimental configuration.
For English-to-German our approach improved performance over our baseline by 0.3 and 1.3 BLEU for the base and big configurations, respectively. For English-to-French it improved by 0.5 and 0.3 BLEU for the base and big configurations, respectively. In our experiments we did not observe any benefit from including sinusoidal position encodings in addition to relative position representations. The results are shown in Table 1.
Table 1: Experimental results for the WMT 2014 English-to-German (EN-DE) and English-to-French (EN-FR) translation tasks, using the newstest2014 test set.
4.3 Model Variations
We performed several experiments modifying various aspects of our model. All of our experiments in this section use the base model configuration and calculate BLEU scores on the WMT English-to-German task using the development set, newstest2013.
We evaluated the effect of varying the clipping distance, k, of the maximum absolute relative position difference. For k ≥ 2, there does not appear to be much variation in BLEU scores. The results are shown in Table 2.
Table 2: Experimental results for varying the clipping distance k.
We also evaluated the impact of ablating each of the two relative position representations defined in section 3.1, a^V in eq. (3) and a^K in eq. (4). Including relative position representations solely when determining compatibility between elements may be sufficient, but further work is needed to determine whether this is true for other tasks. The results are shown in Table 3.
Table 3: Experimental results for ablating the relative position representations a^V and a^K.
5 Conclusions
In this paper we presented an extension to self-attention that can be used to incorporate relative position information for sequences, which improves performance for machine translation.
For future work, we plan to extend this mechanism to consider arbitrary directed, labeled graph inputs to the Transformer. We are also interested in nonlinear compatibility functions to combine input representations and edge representations. For both of these extensions, a key consideration will be determining efficient implementations.