[Multi-Modal] MDETR: Paper and Code Study Notes

Code: https://github.com/ashkamath/mdetr

Paper: https://arxiv.org/abs/2104.12763

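Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest (the regions enclosed by bounding boxes) from the image. However, this crucial module is typically used as a black box and is trained independently of the downstream task, on a fixed vocabulary of objects and attributes. This makes it hard for such systems to capture the long tail of visual concepts expressed in free-form text.

MDETR, proposed in this paper, is an end-to-end modulated detector that detects objects in an image conditioned on a raw text query such as a caption or a question. It uses a Transformer-based architecture that reasons jointly over text and image by fusing the two modalities at an early stage of the model. The authors pre-train the network on 1.3M text-image pairs mined from pre-existing multi-modal datasets that have explicit alignment between phrases in the text and objects in the image. The model is then fine-tuned on several downstream tasks, such as phrase grounding, referring expression comprehension (REC), and referring expression segmentation (RES), achieving state-of-the-art results on popular benchmarks.

Main contributions of the paper:

1. An end-to-end text-modulated detection system derived from DETR.

2. The modulated detection approach applies seamlessly to tasks such as phrase grounding and referring expression comprehension, and sets a new state of the art on them.

3. The pre-trained MDETR transfers well to downstream tasks, achieving competitive performance on visual question answering, referring expression segmentation, and few-shot long-tailed object detection.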

MDETR

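As shown in the figure, the architecture of MDETR follows DETR: the image is encoded by a convolutional backbone and flattened. To preserve spatial information, a 2D positional encoding is added to the flattened vector. The text is encoded by a pre-trained transformer language model (RoBERTa), which produces a sequence of hidden vectors of the same length as the input.

A modality-dependent linear projection is then applied to the image and text features to project them into a shared embedding space. The feature vectors are concatenated along the sequence dimension to form a single sequence of image and text features. This sequence is fed to a joint transformer encoder called the cross encoder. Following DETR, a transformer decoder is applied to the object queries while cross-attending to the final hidden state of the cross encoder. The decoder output is used to predict the actual bounding boxes.

That covers the model structure, which is basically the same as DETR. If you are not familiar with DETR, see the following posts:

DETR Code Study Notes (1) (CSDN blog)

DETR Code Study Notes (2) (CSDN blog)

DETR Code Study Notes (3) (CSDN blog)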

Network structure

The transformer code is as follows:

class Transformer(nn.Module):
    def __init__(
        self,
        d_model=512,
        nhead=8,
        num_encoder_layers=6,
        num_decoder_layers=6,
        dim_feedforward=2048,
        dropout=0.1,
        activation="relu",
        normalize_before=False,
        return_intermediate_dec=False,
        pass_pos_and_query=True,
        text_encoder_type="roberta-base",
        freeze_text_encoder=False,
        contrastive_loss=False,
    ):
        super().__init__()

        self.pass_pos_and_query = pass_pos_and_query
        encoder_layer = TransformerEncoderLayer(d_model, nhead, dim_feedforward, dropout, activation, normalize_before)
        encoder_norm = nn.LayerNorm(d_model) if normalize_before else None
        self.encoder = TransformerEncoder(encoder_layer, num_encoder_layers, encoder_norm)

        decoder_layer = TransformerDecoderLayer(d_model, nhead, dim_feedforward, dropout, activation, normalize_before)
        decoder_norm = nn.LayerNorm(d_model)
        self.decoder = TransformerDecoder(
            decoder_layer, num_decoder_layers, decoder_norm, return_intermediate=return_intermediate_dec
        )

        self.CLS = nn.Embedding(1, d_model) if contrastive_loss else None

        self._reset_parameters()

        # force_download=True and local_files_only=True contradict each other, so both are dropped here
        self.tokenizer = RobertaTokenizerFast.from_pretrained(text_encoder_type)
        self.text_encoder = RobertaModel.from_pretrained(text_encoder_type)

        if freeze_text_encoder:
            for p in self.text_encoder.parameters():
                p.requires_grad_(False)

        self.expander_dropout = 0.1
        config = self.text_encoder.config
        self.resizer = FeatureResizer(
            input_feat_size=config.hidden_size,
            output_feat_size=d_model,
            dropout=self.expander_dropout,
        )

        self.d_model = d_model
        self.nhead = nhead

    def _reset_parameters(self):
        for p in self.parameters():
            if p.dim() > 1:
                nn.init.xavier_uniform_(p)

    def forward(
        self,
        src=None,
        mask=None,
        query_embed=None,
        pos_embed=None,
        text=None,
        encode_and_save=True,
        text_memory=None,
        img_memory=None,
        text_attention_mask=None,
    ):
        if encode_and_save:
            # flatten NxCxHxW to HWxNxC
            bs, c, h, w = src.shape
            src = src.flatten(2).permute(2, 0, 1)  # [2, 256, 31, 22]->[682, 2, 256]
            device = src.device
            pos_embed = pos_embed.flatten(2).permute(2, 0, 1)  # [2, 256, 31, 22]->[682, 2, 256]
            query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)  # [100, 256]->[100, 2, 256]
            mask = mask.flatten(1)  # [2, 31, 22]->[2, 682]

            if self.CLS is not None:
                # We add a CLS token to the image, to be used for contrastive loss

                CLS = self.CLS.weight.view(1, 1, -1).repeat(1, bs, 1)
                # Add the CLS token to the incoming features
                src = torch.cat((CLS, src))

                # Adding zeros as the first token in the sequence to be compatible with the CLS token
                pos_embed = torch.cat((torch.zeros(1, bs, self.d_model, device=device), pos_embed))

                # Adding one mask item to the beginning of the mask to be compatible with CLS token
                cls_pad = torch.zeros(bs, 1).bool().to(device)
                mask = torch.cat((cls_pad, mask), dim=1)

            if self.pass_pos_and_query:
                tgt = torch.zeros_like(query_embed)  # all-zero tensor of shape [100, 2, 256]
            else:
                src, tgt, query_embed, pos_embed = src + 0.1 * pos_embed, query_embed, None, None

            device = src.device
            if isinstance(text[0], str):
                # Encode the text
                tokenized = self.tokenizer.batch_encode_plus(text, padding="longest", return_tensors="pt").to(device)
                encoded_text = self.text_encoder(**tokenized)  # encode the text

                # The encoded length is set by the longest sentence in the batch; shorter sentences are padded.
                # In this example the longest sequence has 24 tokens and the shortest 14, so the length is 24.
                # Transpose memory because pytorch's attention expects sequence first
                text_memory = encoded_text.last_hidden_state.transpose(0, 1)  # [2, 24, 768]->[24, 2, 768]
                # Invert attention mask that we get from huggingface because its the opposite in pytorch transformer
                text_attention_mask = tokenized.attention_mask.ne(1).bool()  # [2, 24]; False for real tokens, True for padding

                # Resize the encoder hidden states to be of the same d_model as the decoder
                text_memory_resized = self.resizer(text_memory)  # mainly Linear(in_features=768, out_features=256, bias=True): [24, 2, 768]->[24, 2, 256], matching the image feature dimension
            else:
                # The text is already encoded, use as is.
                text_attention_mask, text_memory_resized, tokenized = text

            # Concat on the sequence dimension
            src = torch.cat([src, text_memory_resized], dim=0)  # cat([682, 2, 256] [24, 2, 256])->[706, 2, 256]
            # For mask, sequence dimension is second
            mask = torch.cat([mask, text_attention_mask], dim=1)  # cat([2, 682] [2, 24])->[2, 706]
            # Pad the pos_embed with 0 so that the addition will be a no-op for the text tokens
            pos_embed = torch.cat([pos_embed, torch.zeros_like(text_memory_resized)], dim=0)  # cat([682, 2, 256], zeros [24, 2, 256])->[706, 2, 256]
            # src [706, 2, 256]  src_key_padding_mask [2, 706] pos [706, 2, 256]
            img_memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)  # [706, 2, 256]

            text_memory = img_memory[-len(text_memory_resized) :]  # slice the text part out of img_memory: [24, 2, 256]

            assert img_memory.shape[1] == text_memory.shape[1] == tgt.shape[1]
            memory_cache = {
                "text_memory_resized": text_memory_resized,  # text after the Linear resize, before the encoder [24, 2, 256]
                "text_memory": text_memory,  # text after the encoder [24, 2, 256]
                "img_memory": img_memory,  # encoder output for the concatenated image and text features [706, 2, 256]
                "text_pooled_op": encoded_text.pooler_output if self.CLS is not None else None,  # None in this example
                "img_pooled_op": img_memory[0] if self.CLS is not None else None,  # Return the CLS token; None in this example
                "mask": mask,  # joint image and text mask [2, 706]; False at real image/token positions, True at padding
                "text_attention_mask": text_attention_mask,  # text mask [2, 24]; False for real tokens, True for padding
                "pos_embed": pos_embed,  # positional encoding computed from the image mask, concatenated with torch.zeros_like(text_memory_resized) -> [706, 2, 256]
                "query_embed": query_embed,  # nn.Embedding(100, 256) -> [100, 2, 256]
                "tokenized": tokenized,  # data={input_ids: [2, 24] (token ids), attention_mask: [2, 24]}
            }
            return memory_cache

        else:
            if self.pass_pos_and_query:
                tgt = torch.zeros_like(query_embed)  # all-zero tensor of shape [100, 2, 256]
            else:
                src, tgt, query_embed, pos_embed = src + 0.1 * pos_embed, query_embed, None, None

            assert img_memory.shape[1] == text_memory.shape[1] == tgt.shape[1]

            hs = self.decoder(
                tgt,  # all-zero tensor of shape [100, 2, 256]
                img_memory,  # encoder output for the concatenated image and text features [706, 2, 256]
                text_memory,  # text after the Linear resize, before the encoder [24, 2, 256]
                memory_key_padding_mask=mask,  # joint image and text mask [2, 706]; False at real image/token positions, True at padding
                text_memory_key_padding_mask=text_attention_mask,  # text mask [2, 24]; False for real tokens, True for padding
                pos=pos_embed,  # positional encoding computed from the image mask, concatenated with torch.zeros_like(text_memory_resized) [706, 2, 256]
                query_pos=query_embed,  # nn.Embedding(100, 256) -> [100, 2, 256]
            )  # [6, 100, 2, 256]
            return hs.transpose(1, 2)

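The text encoder relies on a tokenizer that converts the input text into ids according to a vocabulary.

RobertaTokenizerFast.from_pretrained(text_encoder_type) loads the vocabulary of the pre-trained model, and the encoder itself is loaded with:

self.text_encoder = RobertaModel.from_pretrained(text_encoder_type)

These calls download the model directly from Hugging Face, so you may need a proxy if the site is not reachable from your network.

For example, the text "我爱你" ("I love you") is converted to [2, 10, 3], where "我" corresponds to 2 in the vocabulary and "爱" to 10. The converted ids can then be used as model input. It follows that a different vocabulary produces different ids for the same sentence, which is why a trained NLP model almost always ships with its own tokenizer.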
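The vocabulary lookup described above can be sketched in a few lines. This is a toy illustration only: `TOY_VOCAB`, `OTHER_VOCAB`, and `encode` are hypothetical names, the ids are made up to match the example, and a real RoBERTa tokenizer works on byte-level BPE subwords rather than single characters.

```python
# Toy vocabularies: the ids are illustrative, not taken from any real tokenizer.
TOY_VOCAB = {"<unk>": 0, "我": 2, "你": 3, "爱": 10}
OTHER_VOCAB = {"<unk>": 0, "我": 7, "你": 5, "爱": 1}


def encode(text, vocab):
    """Map each character to its id; unknown characters fall back to <unk>."""
    return [vocab.get(ch, vocab["<unk>"]) for ch in text]


ids = encode("我爱你", TOY_VOCAB)
other_ids = encode("我爱你", OTHER_VOCAB)  # same sentence, different vocabulary
print(ids, other_ids)
```

The two id lists differ even though the sentence is the same, which is the reason a model and its tokenizer have to be used together.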

Using the example from this post:

Expanding text:

['four children are dancing around a pole in a city street .',

'A woman is wearing a black jacket while riding in the back of a motorcycle with a woman in red pants .']

Expanding tokenized:

tensor([[ 0, 10231, 408, 32, 7950, 198, 10, 9438, 11, 10, 343, 2014, 479, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[ 0, 250, 693, 16, 2498, 10, 909, 8443, 150, 5793, 11, 5, 124, 9, 10, 10218, 19, 10, 693, 11, 1275, 9304, 479, 2]], device='cuda:0')

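Here input_ids holds the vocabulary index of each token in the sentence. Why does the longest sentence, 22 words including the final '.', encode to 24 ids? The tokenizer adds a special token at the start and at the end of every sentence: <s> (id 0) and </s> (id 2) for RoBERTa, the counterparts of BERT's [CLS] and [SEP]. This adds 2 ids.

Sentences in a batch must have the same length before they can be fed to the model, so shorter sentences are padded. The tokenizer handles this as well: the shorter sentence above is padded with 1, the id of RoBERTa's <pad> token.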
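As a rough sketch of what padding to the longest sequence does, the snippet below pads the two id sequences from the example by hand and then inverts the mask the way the Transformer code does with `attention_mask.ne(1)`. The helpers `pad_batch` and `to_key_padding_mask` are hypothetical names used for illustration, not part of the tokenizer API.

```python
PAD_ID = 1  # id used for padding in the tokenized output shown above


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest length in the batch."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    # Hugging Face convention: 1 = real token, 0 = padding
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return input_ids, attention_mask


def to_key_padding_mask(attention_mask):
    """PyTorch convention: True = padding (ignored), False = real token."""
    return [[m != 1 for m in row] for row in attention_mask]


short_ids = [0, 10231, 408, 32, 7950, 198, 10, 9438, 11, 10, 343, 2014, 479, 2]
long_ids = [0, 250, 693, 16, 2498, 10, 909, 8443, 150, 5793, 11, 5, 124, 9, 10,
            10218, 19, 10, 693, 11, 1275, 9304, 479, 2]

input_ids, attention_mask = pad_batch([short_ids, long_ids])
key_padding_mask = to_key_padding_mask(attention_mask)
print(len(input_ids[0]), sum(attention_mask[0]))
```

The first sequence has 14 real tokens and 10 padding positions, so its key padding mask is False for the first 14 entries and True for the rest.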

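Commonly used parameters are listed below; specific models may use different settings:

padding: pads sequences to a given length. True or 'longest' pads to the longest sequence in the batch; 'max_length' pads to the given max_length, or to the maximum length the model accepts if max_length is not given.
truncation: truncates sequences. True or 'longest_first' truncates to max_length if it is given, otherwise to the maximum length the model accepts; it applies to both single sequences and pairs. 'only_first' truncates only the first sequence of a pair; 'only_second' truncates only the second.
max_length: controls the length used by padding and truncation.
return_tensors: the type of the returned data; 'tf', 'pt', or 'np' return tf.constant, torch.Tensor, or np.ndarray respectively.
return_token_type_ids: whether to return token_type_ids (which sentence each token belongs to); the default depends on the tokenizer.
return_attention_mask: whether to return the attention_mask (which tokens take part in attention); the default depends on the tokenizer.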

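text_encoder then encodes the tokenized ids into features; each token is represented by a 768-dimensional vector.

Because the image features fed to the transformer are 256-dimensional in this example, the 768-dimensional text hidden states must be projected into the same 256-dimensional feature space.

The backbone here is resnet101 and the input batch size is 2; the largest H in the batch is 981 and the largest W is 704. After downsampling, resnet101 outputs a [2, 2048, 31, 22] feature map and a [2, 31, 22] mask that tells real image content apart from padding. The feature map is reduced to [2, 256, 31, 22], and a positional encoding of shape [2, 256, 31, 22] is computed from the mask. Before entering the encoder, the [2, 256, 31, 22] feature map is flattened to [682, 2, 256].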
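To see where 31 and 22 come from, the spatial size can be traced through the five stride-2 stages of a ResNet. This sketch assumes the usual torchvision-style layer settings (7x7 stem conv with padding 3, 3x3 max pool with padding 1, and one 3x3 stride-2 conv with padding 1 in each of layer2 to layer4); `conv_out` and `resnet_feature_size` are hypothetical helper names.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a conv/pool layer along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def resnet_feature_size(size):
    """Spatial size after the five stride-2 stages of a ResNet (total stride 32)."""
    size = conv_out(size, kernel=7, stride=2, padding=3)  # stem conv
    size = conv_out(size, kernel=3, stride=2, padding=1)  # max pool
    for _ in range(3):  # layer2, layer3, layer4 each downsample once
        size = conv_out(size, kernel=3, stride=2, padding=1)
    return size


h = resnet_feature_size(981)
w = resnet_feature_size(704)
image_tokens = h * w            # length of the flattened image sequence
total_tokens = image_tokens + 24  # plus the 24 text tokens of this batch
print(h, w, image_tokens, total_tokens)
```

Under these assumptions the 981x704 padded batch gives a 31x22 feature map, 682 image tokens, and a joint sequence of 706 tokens once the text is appended.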

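Before entering the encoder, the text and image features are concatenated into a single sequence of image and text features.

The sequence then passes through the 6 layers of the cross encoder.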

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1, activation="relu", normalize_before=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        # Implementation of Feedforward model
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)

        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        self.dropout2 = nn.Dropout(dropout)

        self.activation = _get_activation_fn(activation)
        self.normalize_before = normalize_before

    def with_pos_embed(self, tensor, pos: Optional[Tensor]):
        return tensor if pos is None else tensor + pos

    def forward_post(
        self,
        src,
        src_mask: Optional[Tensor] = None,
        src_key_padding_mask: Optional[Tensor] = None,
        pos: Optional[Tensor] = None,
    ):  # src [706, 2, 256]  src_key_padding_mask [2, 706] pos [706, 2, 256]
        q = k = self.with_pos_embed(src, pos)  # [706, 2, 256]
        src2 = self.self_attn(q, k, value=src, attn_mask=src_mask, key_padding_mask=src_key_padding_mask)[0] # [706, 2, 256]
        src = src + self.dropout1(src2)
        src = self.norm1(src)
        src2 = self.linear2(self.dropout(self.activation(self.linear1(src)))) # linear1:Linear(in_features=256, out_features=2048, bias=True) linear2:Linear(in_features=2048, out_features=256, bias=True)
        src = src + self.dropout2(src2)
        src = self.norm2(src)
        return src
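The role of src_key_padding_mask in the self-attention call can be illustrated with a minimal single-head attention in plain Python. This is only a sketch of the masking idea, not the nn.MultiheadAttention implementation: there are no learned projections or multiple heads, and it assumes at least one key is not masked. The function name `attention` is hypothetical.

```python
import math


def attention(queries, keys, values, key_padding_mask):
    """Single-head scaled dot-product attention for one sample.

    key_padding_mask[j] is True when key j is padding and must be ignored.
    """
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            -math.inf if masked else sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            for k, masked in zip(keys, key_padding_mask)
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # masked keys get exp(-inf) == 0.0
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
        )
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
values = [[1.0, 0.0], [0.0, 1.0], [100.0, 100.0]]
# The third key is padding: its large value must not leak into the output.
out = attention([[1.0, 1.0]], keys, values, key_padding_mask=[False, False, True])
print(out)
```

The two real keys score equally, so the output is their average and the padded position contributes nothing.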

The final hidden state output by the cross encoder contains both image and text features:

img_memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)  # [706, 2, 256]

text_memory = img_memory[-len(text_memory_resized):]  # slice the text part out of img_memory: [24, 2, 256]

assert img_memory.shape[1] == text_memory.shape[1] == tgt.shape[1]
memory_cache = {
    "text_memory_resized": text_memory_resized,  # text after the Linear resize, before the encoder [24, 2, 256]
    "text_memory": text_memory,  # text after the encoder [24, 2, 256]
    "img_memory": img_memory,  # encoder output for the concatenated image and text features [706, 2, 256]
    "text_pooled_op": encoded_text.pooler_output if self.CLS is not None else None,  # None in this example
    "img_pooled_op": img_memory[0] if self.CLS is not None else None,  # Return the CLS token; None in this example
    "mask": mask,  # joint image and text mask [2, 706]; False at real image/token positions, True at padding
    "text_attention_mask": text_attention_mask,  # text mask [2, 24]; False for real tokens, True for padding
    "pos_embed": pos_embed,  # positional encoding computed from the image mask, concatenated with torch.zeros_like(text_memory_resized) -> [706, 2, 256]
    "query_embed": query_embed,  # nn.Embedding(100, 256) -> [100, 2, 256]
    "tokenized": tokenized,  # data={input_ids: [2, 24] (token ids), attention_mask: [2, 24]}
}
return memory_cache

The final hidden state of the cross encoder is fed to the decoder as memory, and a [100, 2, 256] all-zero tensor is initialized as the decoder input (tgt) for the object queries:

if self.pass_pos_and_query:
    tgt = torch.zeros_like(query_embed)  # all-zero tensor of shape [100, 2, 256]
else:
    src, tgt, query_embed, pos_embed = src + 0.1 * pos_embed, query_embed, None, None

assert img_memory.shape[1] == text_memory.shape[1] == tgt.shape[1]

hs = self.decoder(
    tgt,  # all-zero tensor of shape [100, 2, 256]
    img_memory,  # encoder output for the concatenated image and text features [706, 2, 256]
    text_memory,  # text after the Linear resize, before the encoder [24, 2, 256]
    memory_key_padding_mask=mask,  # joint image and text mask [2, 706]; False at real image/token positions, True at padding
    text_memory_key_padding_mask=text_attention_mask,  # text mask [2, 24]; False for real tokens, True for padding
    pos=pos_embed,  # positional encoding computed from the image mask, concatenated with torch.zeros_like(text_memory_resized) [706, 2, 256]
    query_pos=query_embed,  # nn.Embedding(100, 256) -> [100, 2, 256]
)  # [6, 100, 2, 256]
return hs.transpose(1, 2)

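Here query_pos is generated by nn.Embedding(100, 256) and repeated along the batch dimension before it enters the decoder, where it serves as the positional encoding of the object queries.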

class TransformerDecoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1, activation="relu", normalize_before=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.cross_attn_image = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        # self.cross_attn_text = nn.MultiheadAttention(d_model, nhead, dropout=dropout)

        # Implementation of Feedforward model
        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout = nn.Dropout(dropout)
        self.linear2 = nn.Linear(dim_feedforward, d_model)

        self.norm1 = nn.LayerNorm(d_model)
        # self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.norm4 = nn.LayerNorm(d_model)
        self.dropout1 = nn.Dropout(dropout)
        # self.dropout2 = nn.Dropout(dropout)
        self.dropout3 = nn.Dropout(dropout)
        self.dropout4 = nn.Dropout(dropout)

        self.activation = _get_activation_fn(activation)
        self.normalize_before = normalize_before

    def with_pos_embed(self, tensor, pos: Optional[Tensor]):
        return tensor if pos is None else tensor + pos

    # For now, trying one version where its self attn -> cross attn text -> cross attn image -> FFN
    def forward_post(
        self,
        tgt,
        memory,
        text_memory,
        tgt_mask: Optional[Tensor] = None,
        memory_mask: Optional[Tensor] = None,
        text_memory_key_padding_mask: Optional[Tensor] = None,
        tgt_key_padding_mask: Optional[Tensor] = None,
        memory_key_padding_mask: Optional[Tensor] = None,
        pos: Optional[Tensor] = None,
        query_pos: Optional[Tensor] = None,
    ):
        q = k = self.with_pos_embed(tgt, query_pos)  # tgt starts as an all-zero tensor of shape [100, 2, 256]

        # Self attention
        tgt2 = self.self_attn(q, k, value=tgt, attn_mask=tgt_mask, key_padding_mask=tgt_key_padding_mask)[0]
        tgt = tgt + self.dropout1(tgt2)
        tgt = self.norm1(tgt)

        # Cross attention to text
        # tgt2 = self.cross_attn_text(
        #     query=self.with_pos_embed(tgt, query_pos),
        #     key=text_memory,
        #     value=text_memory,
        #     attn_mask=None,
        #     key_padding_mask=text_memory_key_padding_mask,
        # )[0]
        # tgt = tgt + self.dropout2(tgt2)
        # tgt = self.norm2(tgt)

        # Cross attention to image
        tgt2 = self.cross_attn_image(
            query=self.with_pos_embed(tgt, query_pos),  # [100, 2, 256]
            key=self.with_pos_embed(memory, pos), # [706, 2, 256]
            value=memory, # [706, 2, 256]
            attn_mask=memory_mask,
            key_padding_mask=memory_key_padding_mask, # [2, 706]
        )[0]
        tgt = tgt + self.dropout3(tgt2)  # [100, 2, 256]
        tgt = self.norm3(tgt)

        # FFN
        tgt2 = self.linear2(self.dropout(self.activation(self.linear1(tgt))))
        tgt = tgt + self.dropout4(tgt2)
        tgt = self.norm4(tgt)
        return tgt

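The network structure is the same as DETR, except that text features are added to the input and learned jointly with the image features.

Before the loss is computed, the decoder output goes through several projections that make the soft token prediction loss, the contrastive alignment loss, and the box losses (L1 and GIoU) easier to compute.

Some of the dimension changes and projection functions are annotated in the figure above.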

LOSS

Hungarian matching

class HungarianMatcher(nn.Module):
    """This class computes an assignment between the targets and the predictions of the network

    For efficiency reasons, the targets don't include the no_object. Because of this, in general,
    there are more predictions than targets. In this case, we do a 1-to-1 matching of the best predictions,
    while the others are un-matched (and thus treated as non-objects).
    """

    def __init__(self, cost_class: float = 1, cost_bbox: float = 1, cost_giou: float = 1):
        """Creates the matcher

        Params:
            cost_class: This is the relative weight of the classification error in the matching cost
            cost_bbox: This is the relative weight of the L1 error of the bounding box coordinates in the matching cost
            cost_giou: This is the relative weight of the giou loss of the bounding box in the matching cost
        """
        super().__init__()
        self.cost_class = cost_class
        self.cost_bbox = cost_bbox
        self.cost_giou = cost_giou
        self.norm = nn.Softmax(-1)
        assert cost_class != 0 or cost_bbox != 0 or cost_giou != 0, "all costs cant be 0"

    @torch.no_grad()
    def forward(self, outputs, targets, positive_map):
        """Performs the matching

        Params:
            outputs: This is a dict that contains at least these entries:
                 "pred_logits": Tensor of dim [batch_size, num_queries, num_classes] with the classification logits
                 "pred_boxes": Tensor of dim [batch_size, num_queries, 4] with the predicted box coordinates

            targets: This is a list of targets (len(targets) = batch_size), where each target is a dict containing:
                 "labels": Tensor of dim [num_target_boxes] (where num_target_boxes is the number of ground-truth
                           objects in the target) containing the class labels
                 "boxes": Tensor of dim [num_target_boxes, 4] containing the target box coordinates

        Returns:
            A list of size batch_size, containing tuples of (index_i, index_j) where:
                - index_i is the indices of the selected predictions (in order)
                - index_j is the indices of the corresponding selected targets (in order)
            For each batch element, it holds:
                len(index_i) = len(index_j) = min(num_queries, num_target_boxes)
        """
        bs, num_queries = outputs["pred_logits"].shape[:2]  # [2, 100]

        # We flatten to compute the cost matrices in a batch
        out_prob = self.norm(outputs["pred_logits"].flatten(0, 1))  # [batch_size * num_queries, num_classes] [200, 256]
        out_bbox = outputs["pred_boxes"].flatten(0, 1)  # [batch_size * num_queries, 4] [200, 4]

        # Also concat the target labels and boxes
        tgt_bbox = torch.cat([v["boxes"] for v in targets])  # [9, 4]
        assert len(tgt_bbox) == len(positive_map)

        # Compute the soft-cross entropy between the predicted token alignment and the GT one for each box
        cost_class = -(out_prob.unsqueeze(1) * positive_map.unsqueeze(0)).sum(-1)  # [200, 9]

        # Compute the L1 cost between boxes
        cost_bbox = torch.cdist(out_bbox, tgt_bbox, p=1)  # [200, 9]
        assert cost_class.shape == cost_bbox.shape

        # Compute the giou cost betwen boxes
        cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))

        # Final cost matrix
        C = self.cost_bbox * cost_bbox + self.cost_class * cost_class + self.cost_giou * cost_giou # [200, 9]
        C = C.view(bs, num_queries, -1).cpu()  # [200, 9]->[2, 100, 9]

        sizes = [len(v["boxes"]) for v in targets]  # [6, 3]: 6 targets in the first image of the batch, 3 in the second
        # Hungarian algorithm: find the optimal assignment for each image. The result is a list with one entry per
        # image in the batch; each entry holds the row (query) and column (target) indices of the optimal solution.
        indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]  # C.split->[2, 100, 6] [2, 100, 3]
        return [(torch.as_tensor(i, dtype=torch.int64), torch.as_tensor(j, dtype=torch.int64)) for i, j in indices]
        # e.g. [(array([11, 32, 76, 81, 83, 88], dtype=int64), array([3, 4, 1, 0, 2, 5])), (array([22, 41, 65], dtype=int64), array([1, 2, 0]))]
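To make the assignment step concrete, here is a tiny brute-force version of the matching for a single image. It is a sketch for intuition only: it tries every assignment instead of running the Hungarian algorithm used by linear_sum_assignment, so it is only usable on very small cost matrices. The function name `match` and the cost values are made up for illustration.

```python
from itertools import permutations


def match(cost):
    """Return (query_indices, target_indices) minimising the total cost.

    cost[q][t] is the cost of assigning query q to target t; there must be
    at least as many queries as targets. Brute force, for tiny inputs only.
    """
    num_queries, num_targets = len(cost), len(cost[0])
    best = min(
        permutations(range(num_queries), num_targets),
        key=lambda qs: sum(cost[q][t] for t, q in enumerate(qs)),
    )
    pairs = sorted(zip(best, range(num_targets)))  # sort by query index
    return [q for q, _ in pairs], [t for _, t in pairs]


# 4 queries (rows) and 2 ground-truth boxes (columns)
cost = [
    [0.9, 0.2],
    [0.1, 0.8],
    [0.5, 0.5],
    [0.7, 0.3],
]
query_idx, target_idx = match(cost)
print(query_idx, target_idx)
```

Query 0 is matched to target 1 and query 1 to target 0; queries 2 and 3 stay unmatched and are treated as "no object" in the losses.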

In the matcher,

out_prob = self.norm(outputs["pred_logits"].flatten(0, 1)) # [batch_size * num_queries, num_classes] [200, 256]

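the output is not a class but a distribution over L=256 token positions.

The ground-truth distribution is produced by the create_positive_map function: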

def create_positive_map(tokenized, tokens_positive):
    """construct a map such that positive_map[i,j] = True iff box i is associated to token j"""
    positive_map = torch.zeros((len(tokens_positive), 256), dtype=torch.float)
    for j, tok_list in enumerate(tokens_positive):
        for (beg, end) in tok_list:
            beg_pos = tokenized.char_to_token(beg)  # index of the token covering the first character of the span
            end_pos = tokenized.char_to_token(end - 1)
            if beg_pos is None:
                try:
                    beg_pos = tokenized.char_to_token(beg + 1)
                    if beg_pos is None:
                        beg_pos = tokenized.char_to_token(beg + 2)
                except:
                    beg_pos = None
            if end_pos is None:
                try:
                    end_pos = tokenized.char_to_token(end - 2)
                    if end_pos is None:
                        end_pos = tokenized.char_to_token(end - 3)
                except:
                    end_pos = None
            if beg_pos is None or end_pos is None:
                continue

            assert beg_pos is not None and end_pos is not None
            positive_map[j, beg_pos : end_pos + 1].fill_(1)
    return positive_map / (positive_map.sum(-1)[:, None] + 1e-6)
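The mapping from character spans to a uniform distribution over token positions can be sketched without a real tokenizer. The token offsets below are a hypothetical whitespace tokenisation of "four children are dancing", with position 0 reserved for the start token; `char_to_token` here is a stand-in written for this sketch, not the Hugging Face method, and the sketch normalises exactly instead of adding 1e-6 to the denominator.

```python
def char_to_token(offsets, char_index):
    """Return the index of the token covering char_index, or None (e.g. for spaces)."""
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return token_index
    return None


def positive_row(offsets, spans, max_tokens):
    """Uniform distribution over the tokens covered by the character spans."""
    row = [0.0] * max_tokens
    for beg, end in spans:
        beg_pos = char_to_token(offsets, beg)
        end_pos = char_to_token(offsets, end - 1)
        if beg_pos is None or end_pos is None:
            continue
        for j in range(beg_pos, end_pos + 1):
            row[j] = 1.0
    total = sum(row)
    return [v / total for v in row] if total > 0 else row


# (start, end) character offsets per token; token 0 is the start token with an empty span
offsets = [(0, 0), (0, 4), (5, 13), (14, 17), (18, 25)]
# The span [0, 13) covers the characters of "four children"
row = positive_row(offsets, spans=[(0, 13)], max_tokens=8)
print(row)
```

The box described by "four children" gets probability 0.5 on each of token positions 1 and 2 and zero elsewhere.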

Soft token prediction

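Unlike standard object detection, MDETR is not interested in predicting a categorical class for each detected object. Instead, the model predicts the span of tokens in the text that each box refers to. Concretely, the maximum number of tokens for any given sentence is set to L=256. For each predicted box that the bipartite matching assigns to a GT box, the model is trained to predict a uniform distribution over all token positions that correspond to the object. The main difference from DETR is that no label is predicted for an object; the model predicts a uniform distribution over the positions in the text that correspond to it (soft token prediction).

In Figure 6, the cat box is trained to predict a uniform distribution over the first two words, and any query that is not matched to a target is trained to predict the "no object" label ∅. Several words in the text can correspond to the same object in the image, and conversely several objects can correspond to the same text. For example, "a couple" referred to by two boxes in the image can be referred to individually later in the same caption. With the loss designed this way, MDETR can learn co-referenced objects from the same referring expression.

Code implementation: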

def loss_labels(self, outputs, targets, positive_map, indices, num_boxes):
    """Classification loss (NLL)
    targets dicts must contain the key "labels" containing a tensor of dim [nb_target_boxes]
    """

    logits = outputs["pred_logits"].log_softmax(-1)  # BS x (num_queries) x (num_tokens) [2, 100, 256]

    src_idx = self._get_src_permutation_idx(indices)
    tgt_idx = []
    offset = 0
    for i, (_, tgt) in enumerate(indices):
        tgt_idx.append(tgt + offset)
        offset += len(targets[i]["boxes"])
    tgt_idx = torch.cat(tgt_idx)  # tensor([3, 4, 1, 0, 2, 5, 7, 8, 6])

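    tgt_pos = positive_map[tgt_idx]  # positive_map comes from create_positive_map; reorder its rows by tgt_idx
    target_sim = torch.zeros_like(logits)  # all-zero tensor of shape [2, 100, 256]
    target_sim[:, :, -1] = 1  # default target: all mass on the last position, meaning "no object"
    target_sim[src_idx] = tgt_pos  # overwrite matched queries with the GT token distribution

    loss_ce = -(logits * target_sim).sum(-1)  # [2, 100, 256]->[2, 100]

    eos_coef = torch.full(loss_ce.shape, self.eos_coef, device=target_sim.device)  # per-query weight: eos_coef for unmatched queries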
    eos_coef[src_idx] = 1

    loss_ce = loss_ce * eos_coef
    loss_ce = loss_ce.sum() / num_boxes

    losses = {"loss_ce": loss_ce}

    return losses

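The predicted log-probabilities over the L=256 positions, of shape [2, 100, 256], are compared with the GT distribution produced by create_positive_map to compute the loss. For queries with no matched object, the GT distribution puts all its mass (a value of 1) on the last, 256th position.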
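The same computation can be written out for a handful of queries in plain Python. This is a minimal sketch under stated assumptions: `soft_token_loss` is a hypothetical name, the default eos_coef of 0.1 is an assumed value, and the normalisation by the number of matched boxes stands in for num_boxes.

```python
import math


def log_softmax(logits):
    top = max(logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - log_z for x in logits]


def soft_token_loss(pred_logits, matched, eos_coef=0.1):
    """pred_logits: one row of logits per query.
    matched: dict mapping query index -> target distribution over positions.
    Unmatched queries are trained to put all mass on the last position ("no object").
    """
    num_positions = len(pred_logits[0])
    no_object = [0.0] * (num_positions - 1) + [1.0]
    total = 0.0
    for q, logits in enumerate(pred_logits):
        target = matched.get(q, no_object)
        weight = 1.0 if q in matched else eos_coef  # down-weight "no object" queries
        log_probs = log_softmax(logits)
        total += weight * -sum(t * lp for t, lp in zip(target, log_probs))
    return total / max(len(matched), 1)


# 2 queries, 4 positions; query 0 is matched to a box spanning positions 0 and 1
pred_logits = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss = soft_token_loss(pred_logits, matched={0: [0.5, 0.5, 0.0, 0.0]})
print(loss)
```

With uniform predictions every position has log-probability -log 4, so the matched query contributes log 4 and the unmatched one contributes 0.1 * log 4.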

Contrastive alignment

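The paper notes that soft token prediction uses positional information to align objects and text, whereas contrastive alignment aligns them at the feature level, pulling the features of matching image regions and text close together in the embedding space. This constraint is stronger than soft token prediction because it supervises the feature representations directly instead of relying on positions alone. The loss function is InfoNCE.

Given an object $o_{i}$ with a set of aligned tokens $T^{+}_{i}$, and a token $t_{i}$ with a set of aligned objects $O^{+}_{i}$, the contrastive loss over all objects can be written as:

$$l_{o} = \sum_{i=0}^{N-1} \frac{1}{|T^{+}_{i}|} \sum_{j \in T^{+}_{i}} -\log\left(\frac{\exp(o_{i}^{\top} t_{j} / \tau)}{\sum_{k=0}^{L-1} \exp(o_{i}^{\top} t_{k} / \tau)}\right)$$

Correspondingly, the contrastive loss over all tokens can be written as:

$$l_{t} = \sum_{i=0}^{L-1} \frac{1}{|O^{+}_{i}|} \sum_{j \in O^{+}_{i}} -\log\left(\frac{\exp(t_{i}^{\top} o_{j} / \tau)}{\sum_{k=0}^{N-1} \exp(t_{i}^{\top} o_{k} / \tau)}\right)$$

Here $N$ is the number of objects (queries), $L$ is the number of tokens, and $\tau$ is the temperature (self.temperature in the code).

Code implementation: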

def loss_contrastive_align(self, outputs, targets, positive_map, indices, num_boxes):
    bs = outputs["proj_queries"].shape[0]
    tokenized = outputs["tokenized"]  # data={input_ids: [2, 24] (token ids), attention_mask: [2, 24]}

    normalized_text_emb = outputs["proj_tokens"]  # BS x (num_tokens) x hdim  # [2, 24, 64]
    normalized_img_emb = outputs["proj_queries"]  # BS x (num_queries) x hdim # [2, 100, 64]

    logits = (
            torch.matmul(normalized_img_emb, normalized_text_emb.transpose(-1, -2)) / self.temperature
    )  # BS x (num_queries) x (num_tokens) # [2, 100, 24]

    # construct a map such that positive_map[k, i,j] = True iff query i is associated to token j in batch item k
    # For efficency, the construction happens on CPU, then the whole matrix is transferred to GPU in one go.
    positive_map = torch.zeros(logits.shape, dtype=torch.bool)
    for i, ((idx_src, idx_tgt), tgt) in enumerate(zip(indices, targets)):
        if "tokens_positive" in tgt:  # reorder tokens_positive by the matched target (column) indices
            cur_tokens = [tgt["tokens_positive"][j] for j in
                          idx_tgt]  # [[[0, 13]], [[43, 56]], [[0, 13]], [[0, 13]], [[33, 39]], [[0, 13]]]->[[[0, 13]], [[33, 39]], [[43, 56]], [[0, 13]], [[0, 13]], [[0, 13]]]
        else:
            cur_tokens = [tgt["tokens"][j] for j in idx_tgt]

        for j, tok_list in enumerate(cur_tokens):
            for (beg, end) in tok_list:
                beg_pos = tokenized.char_to_token(i, beg)
                end_pos = tokenized.char_to_token(i, end - 1)
                # The fallbacks pass the batch index i as well; without it char_to_token looks up sample 0
                if beg_pos is None:
                    try:
                        beg_pos = tokenized.char_to_token(i, beg + 1)
                        if beg_pos is None:
                            beg_pos = tokenized.char_to_token(i, beg + 2)
                    except:
                        beg_pos = None
                if end_pos is None:
                    try:
                        end_pos = tokenized.char_to_token(i, end - 2)
                        if end_pos is None:
                            end_pos = tokenized.char_to_token(i, end - 3)
                    except:
                        end_pos = None
                if beg_pos is None or end_pos is None:
                    continue

                assert beg_pos is not None and end_pos is not None
                positive_map[i, idx_src[j], beg_pos: end_pos + 1].fill_(True)

    positive_map = positive_map.to(logits.device)
    positive_logits = -logits.masked_fill(~positive_map, 0)
    negative_logits = logits  # .masked_fill(positive_map, -1000000)

    boxes_with_pos = positive_map.any(2)  # True for queries with at least one positive token [2, 100]
    pos_term = positive_logits.sum(2)  # sum over tokens [2, 100]
    neg_term = negative_logits.logsumexp(2)  # [2, 100]

    nb_pos = positive_map.sum(2) + 1e-6  # [2, 100]

    box_to_token_loss = ((pos_term / nb_pos + neg_term)).masked_fill(~boxes_with_pos, 0).sum()

    tokens_with_pos = positive_map.any(1)  # True for tokens with at least one positive query [2, 24]
    pos_term = positive_logits.sum(1)  # sum over queries [2, 24]
    neg_term = negative_logits.logsumexp(1)  # [2, 24]

    nb_pos = positive_map.sum(1) + 1e-6  # [2, 24]

    tokens_to_boxes_loss = ((pos_term / nb_pos + neg_term)).masked_fill(~tokens_with_pos, 0).sum()
    tot_loss = (box_to_token_loss + tokens_to_boxes_loss) / 2

    return {"loss_contrastive_align": tot_loss / num_boxes}
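The structure of loss_contrastive_align (mean of the negated positive logits plus a logsumexp over all logits, in both directions) can be checked on a toy example in plain Python. This is a sketch for intuition: the function names are hypothetical, the logits are assumed to be already divided by the temperature, and the final division by num_boxes is left out.

```python
import math


def logsumexp(xs):
    top = max(xs)
    return top + math.log(sum(math.exp(x - top) for x in xs))


def one_direction_loss(logits, positive):
    """logits[i][j]: similarity of row item i and column item j.
    positive[i]: set of column indices aligned with row i (may be empty).
    """
    loss = 0.0
    for row, pos in zip(logits, positive):
        if not pos:
            continue  # rows without positives contribute nothing
        pos_term = -sum(row[j] for j in pos) / len(pos)
        loss += pos_term + logsumexp(row)
    return loss


def contrastive_align_loss(logits, positive):
    """Average of the box-to-token and token-to-box terms."""
    num_tokens = len(logits[0])
    transposed = [list(col) for col in zip(*logits)]
    token_positive = [
        {i for i, pos in enumerate(positive) if j in pos} for j in range(num_tokens)
    ]
    box_to_token = one_direction_loss(logits, positive)
    token_to_box = one_direction_loss(transposed, token_positive)
    return (box_to_token + token_to_box) / 2


# 2 queries x 3 tokens; query 0 is aligned with tokens 0 and 1, query 1 is unmatched
positive = [{0, 1}, set()]
loss_uniform = contrastive_align_loss([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], positive)
loss_aligned = contrastive_align_loss([[5.0, 5.0, 0.0], [0.0, 0.0, 0.0]], positive)
print(loss_uniform, loss_aligned)
```

Raising the similarity of the aligned query-token pairs lowers the loss, which is the behaviour the contrastive objective is meant to encourage.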

iou loss

def loss_boxes(self, outputs, targets, positive_map, indices, num_boxes):
    """Compute the losses related to the bounding boxes, the L1 regression loss and the GIoU loss
    targets dicts must contain the key "boxes" containing a tensor of dim [nb_target_boxes, 4]
    The target boxes are expected in format (center_x, center_y, w, h), normalized by the image size.
    """
    assert "pred_boxes" in outputs
    idx = self._get_src_permutation_idx(indices)
    src_boxes = outputs["pred_boxes"][idx]  # pick the predicted boxes selected by the Hungarian matching
    target_boxes = torch.cat([t["boxes"][i] for t, (_, i) in zip(targets, indices)],
                             dim=0)  # reorder each image's targets['boxes'] by the matched target (column) indices

    loss_bbox = F.l1_loss(src_boxes, target_boxes, reduction="none")

    losses = {}
    losses["loss_bbox"] = loss_bbox.sum() / num_boxes

    loss_giou = 1 - torch.diag(
        box_ops.generalized_box_iou(box_ops.box_cxcywh_to_xyxy(src_boxes), box_ops.box_cxcywh_to_xyxy(target_boxes))
    )
    losses["loss_giou"] = loss_giou.sum() / num_boxes
    return losses
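For a single matched pair, the two box losses reduce to the following plain-Python sketch. It mirrors the L1 and 1 - GIoU terms above for one box only, without the sum over the batch or the division by num_boxes; the function names are hypothetical and the boxes are assumed to have positive area.

```python
def cxcywh_to_xyxy(box):
    """(center_x, center_y, w, h) -> (x0, y0, x1, y1)"""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def giou(a, b):
    """Generalized IoU of two (x0, y0, x1, y1) boxes with positive area."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Area of the smallest box enclosing both a and b
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def box_losses(pred, target):
    """Return (L1 loss, GIoU loss) for one predicted and one target box in cxcywh format."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return l1, 1.0 - giou(cxcywh_to_xyxy(pred), cxcywh_to_xyxy(target))


same = box_losses((0.5, 0.5, 0.2, 0.2), (0.5, 0.5, 0.2, 0.2))
apart = box_losses((0.25, 0.25, 0.5, 0.5), (0.75, 0.75, 0.5, 0.5))
print(same, apart)
```

Identical boxes give zero for both losses. The two corner-touching boxes do not overlap, so their GIoU is negative and the GIoU loss rises above 1, which still gives a useful gradient where plain IoU would be flat at zero.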

Finally, the same losses are applied as auxiliary supervision to the intermediate outputs of every decoder layer.