ARTrack Reading Notes

Contents

Environment Setup and Scripts

Forward Pass

Network Structure


Environment Setup and Scripts

Following the official instructions did not complete successfully, so the pip entries in the yaml file were installed manually.

conda create -n artrack python=3.9
# activate the environment, cd into the project root,
# then install the pip dependencies from the yaml manually:
pip install astor==0.8.1 configparser==5.2.0 \
    data==0.4 docker-pycreds==0.4.0 easydict==1.9 einops==0.4.1 formulaic==0.5.2 funcsigs==1.0.2 future==0.18.2 \
    gitdb==4.0.9 gitpython==3.1.27 interface-meta==1.3.0 iopath==0.1.9 jpeg4py==0.1.4 jsonpatch==1.32 jsonpointer==2.3 latex==0.7.0 \
    libarchive-c==2.9 linearmodels==4.29 lmdb==1.3.0 loguru==0.6.0 mat73==0.59 memory-profiler==0.60.0 msgpack==1.0.2 ninja==1.11.1 \
    opencv-python==4.5.5.64 pathtools==0.1.2 promise==2.3 property-cached==1.6.4 protobuf==3.20.0 pycocotools==2.0.4 pyhdfe==0.1.2 \
    ruamel-yaml-conda==0.15.100 sentry-sdk==1.5.8 setproctitle==1.2.2 setuptools-scm==7.1.0 shapely==1.8.1.post1 shortuuid==1.0.8 \
    shutilwhich==1.1.0 smmap==5.0.0 tables==3.6.1 tempdir==0.7.1 tensorboardx==2.5.1 thop==0.1.0.post2207010342 tikzplotlib==0.10.1 \
    timm==0.5.4 tomli==2.0.1 torch==1.11.0 torchfile==0.1.0 visdom==0.1.8.9 wandb==0.12.11 webcolors==1.12 yaspin==2.1.0

Generate the default local config file (the default paths inside it need to be rewritten):

python tracking/create_default_local_file.py --workspace_dir . --data_dir ./data --save_dir ./output

Download the trained model from the official repo, create the path below, and place the model there:

ARTrack-main/output/checkpoints/train/artrack_seq/artrack_seq_256_full/ARTrackSeq_ep0060.pth.tar

Create a directory for the encoder's pretrained model and put the pretrained weights there; update the yaml file accordingly, and the source script artrack_seq.py also needs changes:

mkdir pretrained_model
# file name: mae_pretrain_vit_base.pth

# in artrack_seq_256_full.yaml, rewrite with an absolute path
PRETRAIN_PTH: "/root/data/zjx/Code-subject/ARTrack/ARTrack-main/pretrained_models"

# Also, in artrack_seq.py around line 100, change
load_from = cfg.MODEL.PRETRAIN_PTH
# to
load_from = cfg.MODEL.PRETRAIN_PTH +'/' + cfg.MODEL.PRETRAIN_FILE
# and in artrack_seq.py around line 103, change
missing_keys, unexpected_keys = model.load_state_dict(checkpoint["net"], strict=False)
# to
missing_keys, unexpected_keys = model.load_state_dict(checkpoint["model"], strict=False)

The repo does not ship a run-video script, so a custom one is needed:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import sys
import argparse
import multiprocessing

import torch

# make the project root importable before pulling in lib modules
prj_path = os.path.join(os.path.dirname(__file__), '..')
if prj_path not in sys.path:
    sys.path.append(prj_path)

from lib.test.evaluation.tracker import Tracker

torch.set_num_threads(1)

parser = argparse.ArgumentParser(description='Run tracker on sequence or dataset.')
parser.add_argument('tracker_name', type=str, help='Name of tracking method.')
parser.add_argument('tracker_param', type=str, help='Name of config file.')
parser.add_argument('--runid', type=int, default=None, help='The run id.')
parser.add_argument('--video_path', type=str, default=None, help='Path to the input video file.')
parser.add_argument('--sequence', type=str, default=None, help='Sequence number or name.')
parser.add_argument('--debug', type=int, default=0, help='Debug level.')
parser.add_argument('--threads', type=int, default=0, help='Number of threads.')
parser.add_argument('--num_gpus', type=int, default=8)

args = parser.parse_args()


def main():
    print('[INFO] Loading the model')

    # load config
    trackers = Tracker(args.tracker_name, args.tracker_param, None, args.runid)

    try:
        # bind this worker to a GPU when launched from a multiprocessing pool
        worker_name = multiprocessing.current_process().name
        worker_id = int(worker_name[worker_name.find('-') + 1:]) - 1
        gpu_id = worker_id % args.num_gpus
        torch.cuda.set_device(gpu_id)
    except Exception:
        pass

    trackers.run_video(args.video_path, None, None, None, False)

if __name__ == '__main__':
    main()

Run:

python tracking/run_video.py artrack_seq artrack_seq_256_full --video_path /root/data/zjx/Code-subject/OSTrack-main/experiments/video/soccer1.avi 

Forward Pass

Cropping the template region works the same way as in OSTrack. At initialization, a buffer self.store_result is created for the bbox coordinates of the N frames to keep; it is filled with copies of the init bbox, and N is set to 7 here:

        # keep the bboxes of the last N (= save_all) frames, all initialized to init_bbox
        for i in range(self.save_all - 1):
            self.store_result.append(info['init_bbox'].copy())

Cropping the search region is also the same as in OSTrack. The coordinates of the earlier frames are then transformed into coordinates relative to the previous frame's predicted bbox. Since the current search region is cropped centered on the previous frame's prediction, the center of the search region is exactly the center of that bbox. The previous predictions, however, live in the original image scale, while the search region lives in the crop-size scale; computing each earlier frame's coordinates relative to the previous frame's prediction (in the original image scale) and multiplying by the resize factor converts these relative coordinates into the crop-size scale. After the transform, the previous frame's predicted bbox effectively sits at the center of the search region, i.e. at (crop_size/2, crop_size/2).

The transformed coordinates are divided by the crop size for normalization; values may fall below 0 or above 1, because the transform can move boxes out of bounds. The boxes are then converted from xywh to xyxy, and only values within the (-0.5, 1.5) interval are kept. Finally the coordinates are quantized: 0.5 is added first to avoid negative values, mapping each bbox coordinate into [0, 2*(bins-1)]. The resulting coordinate input, carrying the spatio-temporal context, is:

seqs_out = seqs_out.unsqueeze(0)  # (1, 28): 7 frames x 4 quantized coords
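
A minimal sketch of this preprocessing, assuming prev_boxes holds the 7 saved [x, y, w, h] boxes (top-left plus size, original image scale) and prev_center is the center of the previous frame's prediction; the function name itself is hypothetical:

import torch

def build_coord_sequence(prev_boxes, prev_center, resize_factor, crop_size, bins):
    seq = []
    for x, y, w, h in prev_boxes:
        # coordinates relative to the previous prediction, rescaled to the crop,
        # shifted so the reference center lands at (crop_size/2, crop_size/2)
        cx = (x + w / 2 - prev_center[0]) * resize_factor + crop_size / 2
        cy = (y + h / 2 - prev_center[1]) * resize_factor + crop_size / 2
        bw, bh = w * resize_factor, h * resize_factor
        # xywh -> xyxy, normalize by crop size, clamp to (-0.5, 1.5)
        box = torch.tensor([cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2]) / crop_size
        box = box.clamp(-0.5, 1.5)
        # quantize: +0.5 avoids negatives, giving tokens in [0, 2*(bins-1)]
        seq.append(((box + 0.5) * (bins - 1)).round().long())
    return torch.cat(seq).unsqueeze(0)  # (1, 28) for 7 frames x 4 coords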

The template and the search region are fed into the ViT backbone for feature extraction, with an overall 16x downsampling (16x16 patch embedding). The extracted patch sequence, the positional encodings, and the transformed bbox information of the previous frames are then passed into the following Transformer.

First comes an encoder. The FeatureFusionEncoder class performs some preprocessing, and its main building block is the FeatureFusion module. The encoder's main flow is sketched below; it finally returns feature patches with the same shapes as z and x.
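
A simplified sketch of one FeatureFusion layer, using the module names from the structure printout; the untied positional encodings, relative-position biases, and drop-path are omitted here, so the call signatures are assumptions:

def feature_fusion_forward(layer, z, x):
    # self-attention within the template (z) and search (x) token sets
    z = z + layer.z_self_attn(layer.z_norm1(z))
    x = x + layer.x_self_attn(layer.x_norm1(x))
    # bidirectional cross-attention between the two branches
    z = z + layer.z_x_cross_attention(layer.z_norm2_1(z), layer.x_norm2_2(x))
    x = x + layer.x_z_cross_attention(layer.x_norm2_1(x), layer.z_norm2_2(z))
    # per-branch feed-forward networks
    z = z + layer.z_mlp(layer.z_norm3(z))
    x = x + layer.x_mlp(layer.x_norm3(x))
    return z, x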

Next, the bbox coordinate sequence of the previous frames and a start token are concatenated to form the decoder's input sequence. Since only the four bbox coordinates need to be predicted, no extra end token is required; the output sequence length is simply 4.

1. The input sequence is passed through the vocabulary embedding; the embedding dimension matches that of the feature patches obtained by downsampling the cropped image (768).
2. The initial input tgt, the template features, the search features, the positional encodings of the z patches and the x patches, the identity truncated-Gaussian distribution, the truncated-Gaussian distribution, the query embedding, and the input-sequence mask are fed into the decoder (a sketch follows this list).
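
A compressed sketch of these two steps; bins, the start-token index, and the tensor names are assumptions tied to the printed Embedding(802, 768, padding_idx=800):

import torch

bins, embed_dim = 400, 768                 # assumed config values for the 256_full model
start_token = bins * 2 + 1                 # assumed index of the start flag
word_emb = torch.nn.Embedding(bins * 2 + 2, embed_dim, padding_idx=bins * 2, max_norm=1)

seqs_out = torch.randint(0, 2 * (bins - 1), (1, 28))  # stand-in for the real sequence
seqs_in = torch.cat([torch.full((1, 1), start_token), seqs_out], dim=1)  # (1, 29)
tgt = word_emb(seqs_in)                    # (1, 29, 768) decoder input embeddings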

The decoder mainly consists of TargetQueryDecoderLayer layers, 6 in total; the forward pass of one layer is sketched below.
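
A simplified sketch of one TargetQueryDecoderLayer, again using the printed module names; the causal mask and the query positional-embedding handling are assumptions:

def target_query_decoder_layer(layer, tgt, memory, query_pos, tgt_mask):
    # causal self-attention over the target query sequence
    q = k = layer.norm_1(tgt) + query_pos
    tgt = tgt + layer.self_attn1(q, k, layer.norm_1(tgt), attn_mask=tgt_mask)[0]
    # cross-attention from the queries to the fused z/x memory tokens
    tgt = tgt + layer.multihead_attn(layer.norm_2_query(tgt) + query_pos,
                                     layer.norm_2_memory(memory),
                                     layer.norm_2_memory(memory))[0]
    # feed-forward
    tgt = tgt + layer.mlpz(layer.norm_3(tgt))
    return tgt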

The final output is a token sequence with the same shape as tgt, namely (1, length, 768), where length is the current length of tgt and grows as the sequence is predicted. Next (a sketch of the loop follows this list):

1. Take the last token embedding of the query output and matrix-multiply it with the word-embedding weight matrix, obtaining a score for each quantized position.
2. Apply softmax to get a probability distribution over the quantized coordinates.
3. Use argmax sampling, i.e. pick the position with the highest probability.
4. Append the predicted quantized coordinate to tgt and repeat the loop.
5. This finally yields the quantized coordinates of the predicted bbox.
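
A minimal sketch of this argmax decoding loop; decoder here stands for the layer stack above, and all names are assumptions:

import torch

def autoregressive_decode(decoder, word_emb, tgt, memory, steps=4):
    out_tokens = []
    for _ in range(steps):
        hidden = decoder(tgt, memory)                  # (1, L, 768)
        logits = hidden[:, -1] @ word_emb.weight.T     # score every vocabulary entry
        token = logits.softmax(dim=-1).argmax(dim=-1)  # greedy / argmax sampling
        out_tokens.append(token)
        # grow the input sequence with the newly predicted token
        tgt = torch.cat([tgt, word_emb(token).unsqueeze(1)], dim=1)
    return torch.stack(out_tokens, dim=1)              # (1, 4) quantized bbox coords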

After obtaining the network's prediction:

1. Dequantize the bbox coordinates.
2. Convert xyxy to xywh (center point plus width and height).
3. Map the scale back to the original image and convert to xywh with the top-left corner plus width and height.
4. Smooth/clip the result, removing the part of the bbox that goes beyond the image.
5. For the saved coordinate buffer, pop the oldest entry and append the current prediction after the most recent one (the previous frame's), a FIFO-queue-style update (see the sketch below).
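
A hedged sketch of this postprocessing; the helper name, the clipping scheme, and prev_center are assumptions:

def postprocess(tokens, bins, crop_size, resize_factor, prev_center, img_w, img_h, store):
    # 1. dequantize back to (-0.5, 1.5) normalized xyxy, then to crop-scale pixels
    box = tokens.float() / (bins - 1) - 0.5
    x1, y1, x2, y2 = (box * crop_size).squeeze(0).tolist()
    # 2. xyxy -> center xywh (still in crop scale)
    cx, cy, w, h = (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1
    # 3. map back to image scale (the crop center is the previous prediction's center)
    cx = prev_center[0] + (cx - crop_size / 2) / resize_factor
    cy = prev_center[1] + (cy - crop_size / 2) / resize_factor
    w, h = w / resize_factor, h / resize_factor
    bbox = [cx - w / 2, cy - h / 2, w, h]              # top-left xywh, image scale
    # 4. clip so the box does not leave the image (simplified)
    bbox[0] = min(max(bbox[0], 0.0), img_w - 1.0)
    bbox[1] = min(max(bbox[1], 0.0), img_h - 1.0)
    # 5. FIFO update: drop the oldest saved box, append the newest prediction
    store.pop(0)
    store.append(bbox)
    return bbox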

 

Network Structure

ARTrackSeq(
  (backbone): VisionTransformer(
    (patch_embed): PatchEmbed(
      (proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
      (norm): Identity()
    )
    (pos_drop): Dropout(p=0.0, inplace=False)
    (blocks): Sequential(
      (0): Block(
        (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (qkv): Linear(in_features=768, out_features=2304, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=768, out_features=768, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
        )
        (drop_path): Identity()
        (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (act): GELU()
          (drop1): Dropout(p=0.0, inplace=False)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (drop2): Dropout(p=0.0, inplace=False)
        )
      )
      (1): Block(
        (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (qkv): Linear(in_features=768, out_features=2304, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=768, out_features=768, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
        )
        (drop_path): DropPath()
        (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (act): GELU()
          (drop1): Dropout(p=0.0, inplace=False)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (drop2): Dropout(p=0.0, inplace=False)
        )
      )
      (2): Block(
        (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (qkv): Linear(in_features=768, out_features=2304, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=768, out_features=768, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
        )
        (drop_path): DropPath()
        (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (act): GELU()
          (drop1): Dropout(p=0.0, inplace=False)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (drop2): Dropout(p=0.0, inplace=False)
        )
      )
      (3): Block(
        (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (qkv): Linear(in_features=768, out_features=2304, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=768, out_features=768, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
        )
        (drop_path): DropPath()
        (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (act): GELU()
          (drop1): Dropout(p=0.0, inplace=False)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (drop2): Dropout(p=0.0, inplace=False)
        )
      )
      (4): Block(
        (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (qkv): Linear(in_features=768, out_features=2304, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=768, out_features=768, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
        )
        (drop_path): DropPath()
        (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (act): GELU()
          (drop1): Dropout(p=0.0, inplace=False)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (drop2): Dropout(p=0.0, inplace=False)
        )
      )
      (5): Block(
        (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (qkv): Linear(in_features=768, out_features=2304, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=768, out_features=768, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
        )
        (drop_path): DropPath()
        (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (act): GELU()
          (drop1): Dropout(p=0.0, inplace=False)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (drop2): Dropout(p=0.0, inplace=False)
        )
      )
      (6): Block(
        (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (qkv): Linear(in_features=768, out_features=2304, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=768, out_features=768, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
        )
        (drop_path): DropPath()
        (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (act): GELU()
          (drop1): Dropout(p=0.0, inplace=False)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (drop2): Dropout(p=0.0, inplace=False)
        )
      )
      (7): Block(
        (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (qkv): Linear(in_features=768, out_features=2304, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=768, out_features=768, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
        )
        (drop_path): DropPath()
        (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (act): GELU()
          (drop1): Dropout(p=0.0, inplace=False)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (drop2): Dropout(p=0.0, inplace=False)
        )
      )
      (8): Block(
        (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (qkv): Linear(in_features=768, out_features=2304, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=768, out_features=768, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
        )
        (drop_path): DropPath()
        (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (act): GELU()
          (drop1): Dropout(p=0.0, inplace=False)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (drop2): Dropout(p=0.0, inplace=False)
        )
      )
      (9): Block(
        (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (qkv): Linear(in_features=768, out_features=2304, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=768, out_features=768, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
        )
        (drop_path): DropPath()
        (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (act): GELU()
          (drop1): Dropout(p=0.0, inplace=False)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (drop2): Dropout(p=0.0, inplace=False)
        )
      )
      (10): Block(
        (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (qkv): Linear(in_features=768, out_features=2304, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=768, out_features=768, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
        )
        (drop_path): DropPath()
        (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (act): GELU()
          (drop1): Dropout(p=0.0, inplace=False)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (drop2): Dropout(p=0.0, inplace=False)
        )
      )
      (11): Block(
        (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (qkv): Linear(in_features=768, out_features=2304, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=768, out_features=768, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
        )
        (drop_path): DropPath()
        (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (act): GELU()
          (drop1): Dropout(p=0.0, inplace=False)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (drop2): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
  )
  (pix_head): Pix2Track(
    (word_embeddings): Embedding(802, 768, padding_idx=800, max_norm=1)
    (position_embeddings): Embedding(5, 768)
    (prev_position_embeddings): Embedding(28, 768)
    (encoder): FeatureFusionEncoder(
      (layers): ModuleList(
        (0): FeatureFusion(
          (z_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (x_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (z_self_attn): SelfAttention(
            (qkv): Linear(in_features=768, out_features=2304, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=768, out_features=768, bias=True)
            (proj_drop): Dropout(p=0.1, inplace=False)
          )
          (x_self_attn): SelfAttention(
            (qkv): Linear(in_features=768, out_features=2304, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=768, out_features=768, bias=True)
            (proj_drop): Dropout(p=0.1, inplace=False)
          )
          (z_norm2_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (z_norm2_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (x_norm2_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (x_norm2_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (z_x_cross_attention): CrossAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (kv): Linear(in_features=768, out_features=1536, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=768, out_features=768, bias=True)
            (proj_drop): Dropout(p=0.1, inplace=False)
          )
          (x_z_cross_attention): CrossAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (kv): Linear(in_features=768, out_features=1536, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=768, out_features=768, bias=True)
            (proj_drop): Dropout(p=0.1, inplace=False)
          )
          (z_norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (x_norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (z_mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU()
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.1, inplace=False)
          )
          (x_mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU()
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.1, inplace=False)
          )
          (drop_path): Identity()
        )
        (1): FeatureFusion(
          (z_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (x_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (z_self_attn): SelfAttention(
            (qkv): Linear(in_features=768, out_features=2304, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=768, out_features=768, bias=True)
            (proj_drop): Dropout(p=0.1, inplace=False)
          )
          (x_self_attn): SelfAttention(
            (qkv): Linear(in_features=768, out_features=2304, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=768, out_features=768, bias=True)
            (proj_drop): Dropout(p=0.1, inplace=False)
          )
          (z_norm2_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (z_norm2_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (x_norm2_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (x_norm2_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (z_x_cross_attention): CrossAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (kv): Linear(in_features=768, out_features=1536, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=768, out_features=768, bias=True)
            (proj_drop): Dropout(p=0.1, inplace=False)
          )
          (x_z_cross_attention): CrossAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (kv): Linear(in_features=768, out_features=1536, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=768, out_features=768, bias=True)
            (proj_drop): Dropout(p=0.1, inplace=False)
          )
          (z_norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (x_norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (z_mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU()
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.1, inplace=False)
          )
          (x_mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU()
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.1, inplace=False)
          )
          (drop_path): Identity()
        )
        (2): FeatureFusion(
          (z_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (x_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (z_self_attn): SelfAttention(
            (qkv): Linear(in_features=768, out_features=2304, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=768, out_features=768, bias=True)
            (proj_drop): Dropout(p=0.1, inplace=False)
          )
          (x_self_attn): SelfAttention(
            (qkv): Linear(in_features=768, out_features=2304, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=768, out_features=768, bias=True)
            (proj_drop): Dropout(p=0.1, inplace=False)
          )
          (z_norm2_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (z_norm2_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (x_norm2_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (x_norm2_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (z_x_cross_attention): CrossAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (kv): Linear(in_features=768, out_features=1536, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=768, out_features=768, bias=True)
            (proj_drop): Dropout(p=0.1, inplace=False)
          )
          (x_z_cross_attention): CrossAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (kv): Linear(in_features=768, out_features=1536, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=768, out_features=768, bias=True)
            (proj_drop): Dropout(p=0.1, inplace=False)
          )
          (z_norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (x_norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (z_mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU()
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.1, inplace=False)
          )
          (x_mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU()
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.1, inplace=False)
          )
          (drop_path): Identity()
        )
      )
      (z_pos_enc): Untied2DPositionalEncoder(
        (pos): Learned2DPositionalEncoder()
        (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (pos_q_linear): Linear(in_features=768, out_features=768, bias=True)
        (pos_k_linear): Linear(in_features=768, out_features=768, bias=True)
      )
      (x_pos_enc): Untied2DPositionalEncoder(
        (pos): Learned2DPositionalEncoder()
        (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (pos_q_linear): Linear(in_features=768, out_features=768, bias=True)
        (pos_k_linear): Linear(in_features=768, out_features=768, bias=True)
      )
      (z_rel_pos_bias_table): RelativePosition2DEncoder()
      (x_rel_pos_bias_table): RelativePosition2DEncoder()
      (z_x_rel_pos_bias_table): RelativePosition2DEncoder()
      (x_z_rel_pos_bias_table): RelativePosition2DEncoder()
    )
    (decoder): TargetQueryDecoderBlock(
      (layers): ModuleList(
        (0): TargetQueryDecoderLayer(
          (norm_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (self_attn1): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (norm_2_query): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (norm_2_memory): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (multihead_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (norm_3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlpz): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU()
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.1, inplace=False)
          )
          (drop_path): Identity()
        )
        (1): TargetQueryDecoderLayer(
          (norm_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (self_attn1): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (norm_2_query): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (norm_2_memory): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (multihead_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (norm_3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlpz): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU()
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.1, inplace=False)
          )
          (drop_path): Identity()
        )
        (2): TargetQueryDecoderLayer(
          (norm_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (self_attn1): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (norm_2_query): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (norm_2_memory): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (multihead_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (norm_3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlpz): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU()
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.1, inplace=False)
          )
          (drop_path): Identity()
        )
        (3): TargetQueryDecoderLayer(
          (norm_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (self_attn1): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (norm_2_query): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (norm_2_memory): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (multihead_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (norm_3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlpz): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU()
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.1, inplace=False)
          )
          (drop_path): Identity()
        )
        (4): TargetQueryDecoderLayer(
          (norm_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (self_attn1): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (norm_2_query): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (norm_2_memory): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (multihead_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (norm_3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlpz): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU()
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.1, inplace=False)
          )
          (drop_path): Identity()
        )
        (5): TargetQueryDecoderLayer(
          (norm_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (self_attn1): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (norm_2_query): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (norm_2_memory): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (multihead_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (norm_3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlpz): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU()
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.1, inplace=False)
          )
          (drop_path): Identity()
        )
      )
      (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    )
  )
)
