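Semantic Segmentation Paper Reading: SETR, Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

Abstract

1 Introduction

2 Vision Transformer (ViT)

2.1 Image Preprocessing: Patch Splitting and Dimension Reduction

2.2 Patch Embedding

2.3 Position Encoding

2.4 Forward Pass of the Transformer Encoder

3 SETR

3.1 Image Sequentialization

3.2 Transformer

3.3 Decoder

Summary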

Abstract

This week's paper is "Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers". Typical FCN-based encoder-decoder architectures for semantic segmentation extract local and global features at the cost of spatial resolution, which is lost through repeated downsampling. Because each layer of a fixed network has a limited receptive field, capturing semantic context over a larger range in theory requires a larger receptive field, that is, a deeper network. The paper instead treats semantic segmentation as a sequence-to-sequence prediction task and proposes SETR (SEgmentation TRansformer), which uses a pure Transformer (no convolution and no resolution reduction) to encode the image as a sequence of patches. Since global context is modeled in every Transformer layer, this encoder can be combined with a simple decoder to form a strong segmentation model. SETR achieves good results on several datasets.

Paper link 🔗: Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

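1 Introduction

Most existing semantic segmentation models are based on FCN, including the FCN and SegNet models covered in previous weeks. A standard FCN segmentation model has an encoder-decoder architecture: the encoder learns feature representations, and the decoder performs pixel-level classification on the features the encoder produces. Of the two, feature representation learning (the encoder) is considered the most important component. Like most CNNs for image understanding, the encoder is a stack of convolutional layers. To keep computation manageable, the resolution of the feature maps is reduced gradually, so the encoder learns more abstract, semantic visual concepts as its receptive field grows. This design is popular for two reasons: translation equivariance and locality. The former respects the nature of the imaging process and supports generalization to unseen image data. The latter controls model complexity by sharing parameters across space.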

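However, it also brings a fundamental limitation: learning long-range dependencies, which is critical for semantic segmentation in unconstrained scene images, remains challenging because the receptive field is still limited. Several methods were later introduced to overcome this limitation. For example, the DeepLab series manipulates the convolution operation directly, using large kernel sizes, atrous (dilated) convolutions, and image/feature pyramids.
This paper rethinks the design of semantic segmentation models and proposes replacing the encoder built from stacked convolutional layers, which gradually reduces spatial resolution, with a pure Transformer. The resulting segmentation model is called SEgmentation TRansformer (SETR). The Transformer-only encoder treats the input image as a sequence of image patches represented by learned patch embeddings, and transforms the sequence with global self-attention to learn discriminative feature representations.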

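The pure Transformer design is inspired by its great success in natural language processing, and the pure vision Transformer (ViT) has also proven effective for image classification. This is direct evidence that the traditional CNN design can be challenged, and that image features do not have to be learned progressively from local to global context by reducing spatial resolution.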

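The contributions of the paper are as follows:

It reformulates image semantic segmentation from a sequence-to-sequence learning perspective, offering an alternative to the dominant encoder-decoder FCN design;
As an instantiation, it uses the Transformer framework to build a fully attentive feature representation encoder by sequentializing images;
To study self-attentive feature representations extensively, it further introduces three decoder designs of different complexity: naive upsampling (Naive), progressive upsampling (PUP), and multi-level feature aggregation (MLA).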

2 Vision Transformer(ViT)

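The success of Transformers and self-attention has inspired work on semantic segmentation. ViT brought a pure Transformer architecture to image classification: the image is split into patches and embedded, processed by a Transformer, and classified by an MLP, achieving excellent results on ImageNet.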

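The ViT network structure is shown in the figure above. The pipeline is:

Split the input image into patches of size $16\times 16\times 3$;
Encode each patch with the Embedding layer, which gives each patch a Token vector of length $768\times1$. In code this is implemented as a convolutional layer with kernel size $16\times16$, stride $16$, and $768$ kernels. For a $224\times224\times3$ input image, the convolution produces a $14\times14\times768$ feature map; flattening its $H$ and $W$ dimensions yields a $196\times768$ matrix, that is, $196$ Tokens;
Add a Class Token that carries the classification result, so the Transformer input becomes a $197\times768$ matrix;
Apply the position encoding;
Feed the $197\times768$ matrix with position encoding into $L$ stacked Transformer Encoder layers;
Take the output at the Class Token, pass it to the MLP Head, and obtain the final classification result.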
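
The shape bookkeeping in the steps above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `vit_token_shapes` and its parameters are assumptions rather than part of any ViT implementation, and it assumes a square input whose side is divisible by the patch size.

```python
def vit_token_shapes(image_size, patch_size, embed_dim):
    """Return (grid, num_patches, seq_len, input_shape) for a square image.

    Hypothetical helper for illustration; it only does shape arithmetic.
    """
    assert image_size % patch_size == 0, "image side must be divisible by the patch size"
    grid = image_size // patch_size   # patches per side after the strided convolution
    num_patches = grid * grid         # Tokens produced by the patch embedding
    seq_len = num_patches + 1         # plus one Class Token
    return grid, num_patches, seq_len, (seq_len, embed_dim)


# A 224x224 image with 16x16 patches and a 768-dimensional embedding.
grid, num_patches, seq_len, input_shape = vit_token_shapes(224, 16, 768)
print(grid, num_patches, seq_len, input_shape)
```

With these settings the helper reproduces the numbers in the list: a 14×14 grid, 196 patch Tokens, and a 197×768 Transformer input.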

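2.1 Image Preprocessing: Patch Splitting and Dimension Reduction

First, the image $x\in \mathbb{R}^{H\times W\times C}$ is turned into a 2D sequence of patches $x_p\in \mathbb{R}^{N\times (P^2\cdot C)}$. It can be viewed as a sequence of flattened 2D patches: there are $N=HW/P^2$ flattened patches in total, and each has dimension $(P^2\cdot C)$, where $P$ is the patch size and $C$ is the number of channels.

The Transformer expects a 2D matrix $(N,D)$ as input, where $N$ is the sequence length and $D$ is the dimension of each vector in the sequence (for example $256$, or $768$ as in the example above). The 3D image of shape $H\times W\times C$ therefore has to be converted into a 2D input of shape $(N,D)$. The first step is $H\times W\times C\rightarrow N\times (P^2\cdot C)$, where $N=HW/P^2$. The code is: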

x = rearrange(img, 'b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1=p, p2=p)

The result has dimensions $x_p\in \mathbb{R}^{N\times (P^2\cdot C)}$. To turn it into the $(N,D)$ input, one more step is needed, called Patch Embedding.
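
To make the index order of the rearrange pattern concrete, here is a minimal pure-Python sketch that performs the same regrouping on nested lists for a single image. The function name `patchify` and the `[c][y][x]` list layout are assumptions made for illustration; real code would use the einops call shown above on a tensor.

```python
def patchify(img, p):
    """Split one image into flattened patches.

    img is a nested list indexed as img[c][y][x]. The result mirrors the
    pattern 'c (h p1) (w p2) -> (h w) (p1 p2 c)': patches are ordered row by
    row, and inside a patch the channel index varies fastest.
    """
    channels = len(img)
    height, width = len(img[0]), len(img[0][0])
    assert height % p == 0 and width % p == 0, "image must be divisible by the patch size"
    patches = []
    for py in range(height // p):              # h: patch row
        for px in range(width // p):           # w: patch column
            flat = []
            for dy in range(p):                # p1: row inside the patch
                for dx in range(p):            # p2: column inside the patch
                    for c in range(channels):  # c: channel
                        flat.append(img[c][py * p + dy][px * p + dx])
            patches.append(flat)
    return patches


# A 1-channel 4x4 toy image split into 2x2 patches: N = 4, P^2 * C = 4.
toy = [[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]]
for patch in patchify(toy, 2):
    print(patch)
```

Each printed row is one flattened patch, so the output has $N=HW/P^2$ rows of length $P^2\cdot C$.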

2.2 Patch Embedding

Patch Embedding applies a linear transformation (a fully connected layer $E$) to every vector. The input dimension is $(P^2\cdot C)$ and the compressed output dimension is $D$:

$z_0=[x_{class};x_p^1E;x_p^2E;...;x_p^NE]+E_{pos}$

# Project patch_dim (3072 here) to dim, assumed to be 1024
self.patch_to_embedding = nn.Linear(patch_dim, dim)
x = self.patch_to_embedding(x)

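Suppose the image is split into 9 patches. The Transformer input nevertheless has 10 vectors, because a classification vector $x_{class}$ is added by hand. It is a learnable embedding that enters the Transformer Encoder together with the 9 patch vectors, giving 1+9 encoded vectors, and the output at index 0 is then used for classification. In other words, ViT uses only the Transformer Encoder, not the Decoder. The classification vector plays a role loosely similar to a Query in a decoder, with the other 9 encoded vectors acting as the corresponding Keys and Values.

The code is: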

# dim=1024
self.cls_token = nn.Parameter(torch.randn(1, 1, dim))

# Forward pass
# x has shape (b, 64, 1024); expand cls_token to (b, 1, 1024)
cls_tokens = repeat(self.cls_token, '() n d -> b n d', b=b)
# Concatenate with the patch embeddings
# The extra token makes the shape (b, 65, 1024)
x = torch.cat((cls_tokens, x), dim=1)

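2.3 Position Encoding

Following the Transformer convention, ViT also uses position encoding. $E_{pos}$ is introduced to add position information to the sequence; in the code it corresponds to pos_embedding, a trainable variable: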

$z_0=[x_{class};x_p^1E;x_p^2E;...;x_p^NE]+E_{pos}$

# num_patches=64, dim=1024; +1 accounts for the extra cls token
self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))

2.4 Forward Pass of the Transformer Encoder

$z_0=[x_{class};x_p^1E;x_p^2E;...;x_p^NE]+E_{pos}$ ， $E\in \mathbb{R}^{(P^2\cdot C)\times D},E_{pos}\in \mathbb{R}^{(N+1)\times D}$

$z'_l=MSA(LN(z_{l-1}))+z_{l-1},\quad l=1...L$

$z_l=MLP(LN(z'_l))+z'_l,\quad l=1...L$

$y=LN(z^0_L)$

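Where:

Equation 1 is the Patch Embedding and position encoding described above;
Equation 2 is the Multi-head Self-attention step of the Transformer Encoder with LayerNorm and a residual connection, repeated $L$ times;
Equation 3 is the Feed Forward Network step of the Transformer Encoder with LayerNorm and a residual connection, repeated $L$ times;
Equation 4 applies LayerNorm to the Class Token output of the last layer to obtain the image representation $y$.

The Transformer is used without any modification.

Finally there is an MLP Classification Head. The dimension changes of the variables are annotated in the figure below: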

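import torch
from torch import nn
from einops import repeat
from einops.layers.torch import Rearrange

# Helper assumed here (not shown in the original): turn an int into an (h, w) pair.
def pair(t):
    return t if isinstance(t, tuple) else (t, t)

# Note: `Transformer` is assumed to be defined elsewhere and to return a list with
# the token sequences produced at the layers listed in out_indices.
class ViT(nn.Module):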
    def __init__(self, *, image_size, patch_size, dim, depth, heads, mlp_dim, channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0., out_indices = (9, 14, 19, 23)):
        super().__init__()
        image_height, image_width = pair(image_size)
        patch_height, patch_width = pair(patch_size)
 
        assert image_height % patch_height == 0 and image_width % patch_width == 0, 'Image dimensions must be divisible by the patch size.'
 
        num_patches = (image_height // patch_height) * (image_width // patch_width)
        patch_dim = channels * patch_height * patch_width
 
        self.to_patch_embedding = nn.Sequential(
            Rearrange('b c (h p1) (w p2) -> b (h w) (p1 p2 c)', p1 = patch_height, p2 = patch_width),
            nn.Linear(patch_dim, dim),
        )
 
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.dropout = nn.Dropout(emb_dropout)
 
        self.transformer = Transformer(dim, depth, heads, dim_head, mlp_dim, dropout, out_indices=out_indices)
 
        self.out = Rearrange("b (h w) c->b c h w", h=image_height//patch_height, w=image_width//patch_width)
 
    def forward(self, img):
        x = self.to_patch_embedding(img)
        b, n, _ = x.shape
        cls_tokens = repeat(self.cls_token, '1 1 d -> b 1 d', b = b)
        x = torch.cat((cls_tokens, x), dim=1)
        x += self.pos_embedding[:, :(n + 1)]
        x = self.dropout(x)
 
        out = self.transformer(x)
 
        for index, transformer_out in enumerate(out):
            # Drop the cls token and reshape the output to [b, c, h, w]
            out[index] = self.out(transformer_out[:,1:,:])
 
        return out

3 SETR

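ViT inspired work on semantic segmentation. A CNN has to stack convolutions to enlarge its receptive field, whereas the self-attention in a Transformer gives every layer a global receptive field and lets it model global dependencies directly. In addition, CNNs typically downsample the original image by a factor of 8 or even 32, which loses some information, while a Transformer can extract features without downsampling and so preserves more of the image.
SETR therefore adopts ViT as the encoder of the encoder-decoder structure for semantic segmentation and uses it to extract image features. In essence, SETR is a ViT+Decoder structure: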

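The procedure is:

Split the image into a grid of fixed-size patches to form a patch sequence;
Apply a linear embedding layer to the flattened pixel vector of each patch to obtain a sequence of feature embeddings as the Transformer input;
After the Transformer encoder has learned the features, use a decoder to restore the original image resolution;
No layer of the Transformer encoder reduces spatial resolution; each layer models global context instead, which offers a new perspective on semantic segmentation.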

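3.1 Image Sequentialization

First, the input image must be converted into a format the Transformer supports. SETR follows ViT here: the input image is split into patches, and each 2D image patch is flattened into a 1D vector, so that the patches together form the sequence fed into the network. In general, a Transformer takes a 1D sequence of feature embeddings $Z\in \mathbb{R}^{L\times C}$ as input, where $L$ is the sequence length and $C$ is the hidden channel size. The input image $x\in \mathbb{R}^{H\times W\times 3}$ therefore has to be converted into $Z$.

With patches of size $16\times 16$, a $256\times 256$ image is split into $\frac{256}{16}\times \frac{256}{16}=256$ patches ( $L=256$ ). To encode the spatial information of each patch, a specific embedding $p_i$ is learned for every location $i$ and added to the linear projection $e_i$ of the patch, forming the final input sequence $E=\begin{Bmatrix} e_1+p_1,...,e_L+p_L \end{Bmatrix}$ . In this way, spatial information is retained even though the Transformer itself is order-agnostic, because each token is tied to its original position.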
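
The following toy sketch illustrates the two ideas in this subsection: the sequence length for a given image size, and the element-wise sum $e_i+p_i$. The function names and the hand-picked numbers are assumptions for illustration; in the real model both $e_i$ and $p_i$ are learned.

```python
def sequence_length(height, width, patch_size=16):
    """Number of patches L when the image is cut into patch_size x patch_size tiles."""
    assert height % patch_size == 0 and width % patch_size == 0
    return (height // patch_size) * (width // patch_size)


def add_position_embeddings(patch_embeddings, position_embeddings):
    """Compute E = {e_1 + p_1, ..., e_L + p_L} element-wise."""
    assert len(patch_embeddings) == len(position_embeddings)
    return [
        [e + p for e, p in zip(e_i, p_i)]
        for e_i, p_i in zip(patch_embeddings, position_embeddings)
    ]


print(sequence_length(256, 256))

# Three identical patch embeddings (C = 2) become distinguishable
# once a different position embedding is added to each of them.
e = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]
p = [[0.0, 0.0], [0.25, 0.0], [0.5, 0.0]]
print(add_position_embeddings(e, p))
```

The second print shows why the sum matters: without $p_i$, the three tokens would be indistinguishable to an order-agnostic model.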

3.2 Transformer

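As shown in the figure below, the sequence $E$ from the previous step is fed into 24 Transformer layers connected in series, so the receptive field of every layer is the whole image. Each Transformer layer consists of multi-head attention, LN layers, and an MLP.

The Transformer Encoder consists of $L_e$ Transformer layers. The input to layer $l$ is $Z^{l -1} \in \mathbb R^{L \times C}$ .

The input to self-attention is the triplet $(query, key, value)$ computed from $Z^{l -1}$ :

$query=Z^{l-1}W_Q,\: key=Z^{l-1}W_K,\: value=Z^{l-1}W_V$

where $W_Q$ , $W_K$ , $W_V$ are learnable weight matrices with $W_Q/W_K/W_V \in \mathbb R^{C \times d}$ , and $d$ is the dimension of $(query, key, value)$ .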

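Self-attention can then be written as follows, with $Q$ , $K$ , $V$ denoting $query$ , $key$ , $value$ :

$SA(Z^{l-1})=Z^{l-1}+softmax(\frac{QK^T}{\sqrt{d}})V$

Multi-head self-attention concatenates $m$ independent $SA$ operations and projects the result:

$MSA(Z^{l - 1}) = [SA_1(Z^{l - 1}); SA_2(Z^{l - 1}); \cdots; SA_m(Z^{l - 1})]W_O$

where $W_O \in \mathbb R^{md \times C}$ . Finally, the output of $MSA$ passes through an MLP block with a residual connection to give the output of the layer:

$Z^l = MSA(Z^{l-1}) + MLP(MSA(Z^{l-1})) \in \mathbb R^{L \times C}$

This yields the outputs of all Transformer Encoder layers $\begin{Bmatrix} Z^1,Z^2,Z^3,...,Z^{L_e} \end{Bmatrix}$ .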
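
As a rough illustration of the attention term, the sketch below computes $softmax(\frac{QK^T}{\sqrt{d}})V$ for a single head using plain Python lists. It is a simplified sketch under stated assumptions: the residual term, the multi-head concatenation, and the MLP are left out, the helper names are hypothetical, and the identity weight matrices are toy values rather than learned parameters.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def self_attention(z, w_q, w_k, w_v):
    """Single-head attention over a sequence z of shape (L, C).

    Returns the attended values (L, d) and the attention weights (L, L).
    """
    q, k, v = matmul(z, w_q), matmul(z, w_k), matmul(z, w_v)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]    # transpose of K
    scores = matmul(q, k_t)                 # (L, L): every token scores every token
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


identity = [[1.0, 0.0], [0.0, 1.0]]
z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # L = 3 tokens, C = d = 2
out, weights = self_attention(z, identity, identity, identity)
print(len(weights), len(weights[0]))        # attention matrix is L x L
```

Every row of the attention matrix spans all $L$ tokens, which is the sense in which each layer has a global receptive field.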

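3.3 Decoder

The goal of the Decoder is to produce the segmentation result on the original 2D image $(H \times W)$ . To do so, the Encoder output $Z$ is reshaped from a 2D matrix of shape $\frac{HW}{256} \times C$ into a 3D feature map of shape $\frac{H}{16} \times \frac{W}{16} \times C$ .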
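
The reshape can be sketched on plain lists as below. This is an illustration under assumptions: the function name is hypothetical, tokens are assumed to be stored in row-major patch order, and the result is kept channels-last, whereas the PyTorch code in this post rearranges to [b, c, h, w].

```python
def sequence_to_feature_map(tokens, height, width, patch_size=16):
    """Regroup a length-(HW/256) token list into an (H/16) x (W/16) grid.

    tokens[i] is the C-dimensional vector of patch i, in row-major order.
    """
    grid_h, grid_w = height // patch_size, width // patch_size
    assert len(tokens) == grid_h * grid_w, "token count must match the patch grid"
    return [tokens[row * grid_w:(row + 1) * grid_w] for row in range(grid_h)]


# A 32x48 image gives a 2x3 patch grid; each token here has C = 1 for readability.
tokens = [[i] for i in range(6)]
for row in sequence_to_feature_map(tokens, 32, 48):
    print(row)
```

Token $i$ ends up at grid position (i // grid_w, i % grid_w), so neighbouring patches in the image are neighbours again in the feature map.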

class PUPHead(nn.Module):
    def __init__(self, num_classes):
        super(PUPHead, self).__init__()
        
        self.UP_stage_1 = nn.Sequential(
            nn.Conv2d(1024, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
        )        
        self.UP_stage_2 = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
        )        
        self.UP_stage_3= nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
        )        
        self.UP_stage_4= nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1),
            nn.BatchNorm2d(256),
            nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=True)
        )
    
        self.cls_seg = nn.Conv2d(256, num_classes, 3, padding=1)
 
    def forward(self, x):
        x = self.UP_stage_1(x)
        x = self.UP_stage_2(x)
        x = self.UP_stage_3(x)
        x = self.UP_stage_4(x)
        x = self.cls_seg(x)
        return x

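The paper designs three upsampling schemes for the decoder:

(a) Naive upsampling: a 2-layer network, "1×1 conv + sync BN + ReLU + 1×1 conv", followed by direct bilinear upsampling to the original image resolution;
(b) Progressive UPsampling (PUP): upsample progressively. To avoid the noise and jagged edges that one-step upsampling can introduce, it alternates convolution and 2× upsampling (conv -> 2× upsample -> conv -> 2× upsample ...), similar to U-Net;
(c) Multi-Level feature Aggregation (MLA): take outputs from intermediate Transformer layers and aggregate them. To strengthen interaction between features of different layers, a top-down layer-by-layer fusion strategy is used, with a 3×3 convolution after each fusion. The top-level feature map and the three fused outputs are then concatenated along the channel dimension and upsampled by 4× with bilinear interpolation. The final output has dimensions $H\times W\times C$ , and a further layer is needed to map it to the number of classes.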
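
For the PUP scheme, the spatial sizes after each stage are easy to trace by hand. The sketch below assumes a patch size of 16 and four 2× stages, matching the PUPHead code above; the function name and default arguments are hypothetical.

```python
def progressive_upsample_sizes(height, width, patch_size=16, num_stages=4, scale=2):
    """List the feature-map size before and after every upsampling stage."""
    size = (height // patch_size, width // patch_size)   # encoder output grid
    sizes = [size]
    for _ in range(num_stages):
        size = (size[0] * scale, size[1] * scale)        # one conv + 2x upsample stage
        sizes.append(size)
    return sizes


print(progressive_upsample_sizes(256, 256))
```

Four doublings give a total factor of 16, which exactly undoes the 16× reduction introduced by the patch grid.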

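The code below combines ViT with several upsampling heads. ViT returns feature maps from several layers (out_indices), all at the same spatial resolution; each one is decoded by its own upsampling head, so the model outputs one segmentation result per selected layer: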

class SETR(nn.Module):
    def __init__(self, num_classes, image_size, patch_size, dim, depth, heads, mlp_dim, channels = 3, dim_head = 64, dropout = 0., emb_dropout = 0., out_indices = (9, 14, 19, 23)):
        super(SETR, self).__init__()
        self.out_indices = out_indices
        self.num_classes = num_classes
        self.VIT = ViT( image_size=image_size, patch_size=patch_size, dim=dim, depth=depth, heads=heads, mlp_dim=mlp_dim, 
                        channels = channels, dim_head = dim_head, dropout = dropout, emb_dropout = emb_dropout, out_indices = out_indices)
 
        
        self.Head = nn.ModuleDict()
 
        for index, indices in enumerate(self.out_indices):
            self.Head["Head"+str(indices)] = PUPHead(num_classes)
        
    def forward(self, x):
        VIT_OUT = self.VIT(x)
 
        out = []
        for index, indices in enumerate(self.out_indices):
            # The last entry is the output of the final layer
            out.append(self.Head["Head"+str(indices)](VIT_OUT[index]))
        return out