Preface
RT-DETR is Baidu's new-generation detector in the DETR family, i.e. Transformer-based object detection frameworks. The official code is built on the Paddle platform, and a PyTorch version also exists, but for researchers used to the PyTorch platform and the YOLO workflow, debugging and modifying RT-DETR can be painful. Fortunately, the YOLO community is strong enough to solve this kind of problem: an RT-DETR implementation is already included in YOLOv8 (the ultralytics package).
Contents
- Preface
- 1. Installing YOLOv8
- 2. Testing RT-DETR
- 2.1 First, test YOLOv8
- 2.2 Using RT-DETR and its detection results
- 3. Using RT-DETR
- 3.1 The original RT-DETR structure
- 3.2 Task transfer with RT-DETR: the NEU steel surface defect dataset
- 3.3 Task transfer with RT-DETR: PCB surface defect detection on the PKU-PCB dataset
- 3.3.1 Dataset preparation
- 3.3.2 Model configuration
- 4. How to improve RT-DETR: mastering this excellent detector
- 4.1 Building RepHGStem
- Bug 1 fix
- Solution
- Bug 2
- 4.2 Start training
- Comparison: the improved Rep model vs. the original model
- Summary
1. Installing YOLOv8
Refer to the YOLOv10 installation process; the steps are the same.
Create a new conda environment named yolov8 with Python 3.8 installed in it. This environment is isolated and does not affect other environments on your system:
conda create -n yolov8 python=3.8
Then activate the yolov8 conda environment:
conda activate yolov8
Install the ultralytics Python package with pip:
pip install ultralytics
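To confirm the installation, you can run the built-in environment check (an optional sanity step; it only prints version and hardware information):
import ultralytics
# Prints the ultralytics version, Python/torch versions and CUDA availability
ultralytics.checks()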
2. Testing RT-DETR
2.1 First, test YOLOv8
Let us first run YOLOv8 on the following image.
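The exact command is not shown in the log below, so here is a minimal sketch of the kind of call assumed to have produced it; the weight file yolov8l.pt and the image path are assumptions (the path reuses the one from the RT-DETR example later):
from ultralytics import YOLO
# Load a COCO-pretrained YOLOv8 model (any yolov8*.pt checkpoint works)
model = YOLO("yolov8l.pt")
# Run prediction on a single image and save the plotted result
results = model.predict(source="figures/16pic_6868839_b.jpg", save=True)
The console output then looks like this: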
0: 640x576 22 cars, 5 trucks, 89.4ms
Speed: 4.7ms preprocess, 89.4ms inference, 497.4ms postprocess per image at shape (1, 3, 640, 576)
Results saved to /home/wqt/Projects/ultralytics/runs/detect/predict
4.7 + 89.4 + 497.4 = 591.5 ms
The result is shown below: the detections are very fine-grained, and many distant cars are picked up, so YOLOv8 clearly still has an advantage on small objects. However, the total runtime of 4.7 + 89.4 + 497.4 ≈ 591.5 ms is quite slow; in particular the postprocessing step, i.e. NMS, takes far too long.
2.2 Using RT-DETR and its detection results
Using RT-DETR:
from ultralytics import RTDETR
from PIL import Image
import cv2
# Load a COCO-pretrained RT-DETR-l model
model = RTDETR('rtdetr-l.pt')
# Display model information (optional)
model.info()
# from PIL
im1 = Image.open("figures/16pic_6868839_b.jpg")
results = model.predict(source=im1, save=True) # save plotted images
Detection results on the same image:
rt-detr-l summary: 673 layers, 32970476 parameters, 0 gradients, 108.3 GFLOPs
0: 640x640 1 0, 21 2s, 3 7s, 20.6ms
Speed: 2.6ms preprocess, 20.6ms inference, 1.5ms postprocess per image at shape (1, 3, 640, 640)
Results saved to /home/wqt/Projects/ultralytics/runs/detect/predict2
2.6+20.6+1.5=24.7ms
This shows RT-DETR's strengths clearly: there is essentially no postprocessing cost and large objects are recognized well, but small objects are still a weak point.
3. Using RT-DETR
3.1 The original RT-DETR structure
Its structure is defined in the config file /home/wqt/Projects/ultralytics/ultralytics/cfg/models/rt-detr/rtdetr-resnet50.yaml:
# Parameters
nc: 80 # number of classes
scales: # model compound scaling constants, i.e. 'model=yolov8n-cls.yaml' will call yolov8-cls.yaml with scale 'n'
  # [depth, width, max_channels]
  l: [1.00, 1.00, 1024]

backbone:
  # [from, repeats, module, args]
  - [-1, 1, ResNetLayer, [3, 64, 1, True, 1]] # 0
  - [-1, 1, ResNetLayer, [64, 64, 1, False, 3]] # 1
  - [-1, 1, ResNetLayer, [256, 128, 2, False, 4]] # 2
  - [-1, 1, ResNetLayer, [512, 256, 2, False, 6]] # 3
  - [-1, 1, ResNetLayer, [1024, 512, 2, False, 3]] # 4

head:
  - [-1, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 5
  - [-1, 1, AIFI, [1024, 8]]
  - [-1, 1, Conv, [256, 1, 1]] # 7
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [3, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 9
  - [[-2, -1], 1, Concat, [1]]
  - [-1, 3, RepC3, [256]] # 11
  - [-1, 1, Conv, [256, 1, 1]] # 12
  - [-1, 1, nn.Upsample, [None, 2, "nearest"]]
  - [2, 1, Conv, [256, 1, 1, None, 1, 1, False]] # 14
  - [[-2, -1], 1, Concat, [1]] # cat backbone P4
  - [-1, 3, RepC3, [256]] # X3 (16), fpn_blocks.1
  - [-1, 1, Conv, [256, 3, 2]] # 17, downsample_convs.0
  - [[-1, 12], 1, Concat, [1]] # cat Y4
  - [-1, 3, RepC3, [256]] # F4 (19), pan_blocks.0
  - [-1, 1, Conv, [256, 3, 2]] # 20, downsample_convs.1
  - [[-1, 7], 1, Concat, [1]] # cat Y5
  - [-1, 3, RepC3, [256]] # F5 (22), pan_blocks.1
  - [[16, 19, 22], 1, RTDETRDecoder, [nc]] # Detect(P3, P4, P5)
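To inspect this structure interactively, you can build the model directly from the YAML and print its summary (a small sketch; building from the YAML gives randomly initialized weights, and it assumes the bundled config name resolves by itself, as with the other model YAMLs, otherwise pass the full path above):
from ultralytics import RTDETR
# Build the ResNet-50 variant from its config file and print the layer summary
model = RTDETR("rtdetr-resnet50.yaml")
model.info()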
3.2 Task transfer with RT-DETR: the NEU steel surface defect dataset
from ultralytics import RTDETR
model = RTDETR(model='rtdetr-l.yaml')
model.train(data='NEU.yaml', epochs=120, batch=16, imgsz=640)
You need to prepare the NEU dataset in advance and convert it to the YOLO format; many already-converted versions of this dataset can be downloaded online.
3.3 Task transfer with RT-DETR: PCB surface defect detection on the PKU-PCB dataset
3.3.1 Dataset preparation
Download the related files: PCB dataset
Note: the xml files and the images in the original dataset must be moved up one directory level, otherwise the scripts below cannot find them.
(1) Splitting the dataset
This script splits the dataset into training and test lists. With the default settings (trainval_percent = 0.9, train_percent = 0.8) the images are randomly split into roughly 72% train, 18% val and 10% test and written to the train, val and test text files. After running it, four files appear under ImageSets/Main, containing the image names of each split.
# -*- coding: utf-8 -*-
import os
import random
import argparse

parser = argparse.ArgumentParser()
# Path to the xml annotations; adjust for your data (xml files usually live in Annotations)
parser.add_argument('--xml_path', default='Annotations', type=str, help='input xml label path')
# Output path for the split lists, usually ImageSets/Main under your dataset root
parser.add_argument('--txt_path', default='ImageSets/Main', type=str, help='output txt label path')
opt = parser.parse_args()

trainval_percent = 0.9
train_percent = 0.8
xmlfilepath = opt.xml_path
txtsavepath = opt.txt_path
total_xml = os.listdir(xmlfilepath)
if not os.path.exists(txtsavepath):
    os.makedirs(txtsavepath)

num = len(total_xml)
list_index = range(num)
tv = int(num * trainval_percent)
tr = int(tv * train_percent)
trainval = random.sample(list_index, tv)
train = random.sample(trainval, tr)

file_trainval = open(txtsavepath + '/trainval.txt', 'w')
file_test = open(txtsavepath + '/test.txt', 'w')
file_train = open(txtsavepath + '/train.txt', 'w')
file_val = open(txtsavepath + '/val.txt', 'w')

for i in list_index:
    name = total_xml[i][:-4] + '\n'
    if i in trainval:
        file_trainval.write(name)
        if i in train:
            file_train.write(name)
        else:
            file_val.write(name)
    else:
        file_test.write(name)

file_trainval.close()
file_train.close()
file_val.close()
file_test.close()
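As an optional sanity check (a small sketch, assuming the script above was run from the dataset root), you can count how many image names ended up in each split file:
# Count the entries written to each split list under ImageSets/Main
for split in ("trainval", "train", "val", "test"):
    with open("ImageSets/Main/%s.txt" % split) as f:
        print(split, len(f.read().split()))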
(2) Converting the annotations to YOLO txt format
This script reads the annotation information from the labeled xml files and writes it to txt files; after running it, the labels folder contains a YOLO-format annotation file for every image.
# -*- coding: utf-8 -*-
import xml.etree.ElementTree as ET
import os
from os import getcwd

sets = ['train', 'val']
classes = ["missing_hole", "mouse_bite", "open_circuit", "short", "spur", "spurious_copper"]  # change to your own classes
abs_path = os.getcwd()
print(abs_path)

def convert(size, box):
    """Convert a VOC box (xmin, xmax, ymin, ymax) to normalized YOLO (x, y, w, h)."""
    dw = 1. / (size[0])
    dh = 1. / (size[1])
    x = (box[0] + box[1]) / 2.0 - 1
    y = (box[2] + box[3]) / 2.0 - 1
    w = box[1] - box[0]
    h = box[3] - box[2]
    x = x * dw
    w = w * dw
    y = y * dh
    h = h * dh
    return x, y, w, h

def convert_annotation(image_id):
    in_file = open('Annotations/%s.xml' % (image_id), encoding='UTF-8')
    out_file = open('labels/%s.txt' % (image_id), 'w')
    tree = ET.parse(in_file)
    root = tree.getroot()
    size = root.find('size')
    w = int(size.find('width').text)
    h = int(size.find('height').text)
    for obj in root.iter('object'):
        difficult = obj.find('difficult').text
        # difficult = obj.find('Difficult').text
        cls = obj.find('name').text
        if cls not in classes or int(difficult) == 1:
            continue
        cls_id = classes.index(cls)
        xmlbox = obj.find('bndbox')
        b = (float(xmlbox.find('xmin').text), float(xmlbox.find('xmax').text),
             float(xmlbox.find('ymin').text), float(xmlbox.find('ymax').text))
        b1, b2, b3, b4 = b
        # Clip boxes that exceed the image boundary
        if b2 > w:
            b2 = w
        if b4 > h:
            b4 = h
        b = (b1, b2, b3, b4)
        bb = convert((w, h), b)
        out_file.write(str(cls_id) + " " + " ".join([str(a) for a in bb]) + '\n')

wd = getcwd()
for image_set in sets:
    if not os.path.exists('labels/'):
        os.makedirs('labels/')
    image_ids = open('ImageSets/Main/%s.txt' % (image_set)).read().strip().split()
    list_file = open('%s.txt' % (image_set), 'w')
    for image_id in image_ids:
        list_file.write(abs_path + '/images/%s.jpg\n' % (image_id))
        convert_annotation(image_id)
    list_file.close()
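Another optional check (sketch, run from the dataset root): verify that every image listed in train.txt has a matching YOLO label file, so that training does not silently skip samples:
import os
# Check that each image path in train.txt has a corresponding labels/<id>.txt file
with open("train.txt") as f:
    image_paths = f.read().split()
missing = [p for p in image_paths
           if not os.path.exists("labels/%s.txt" % os.path.splitext(os.path.basename(p))[0])]
print("images:", len(image_paths), "missing labels:", len(missing))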
3.3.2 Model configuration
Use a pretrained model rather than training from scratch; it converges much faster. Training RT-DETR from scratch is hard to get to converge.
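A minimal sketch of fine-tuning from the COCO-pretrained checkpoint (the dataset config name PCB.yaml is a placeholder; point it at the train.txt/val.txt lists and the six class names produced above):
from ultralytics import RTDETR
# Start from the COCO-pretrained RT-DETR-l weights instead of a random init
model = RTDETR("rtdetr-l.pt")
# Fine-tune on the PCB defect data; PCB.yaml is a placeholder dataset config
model.train(data="PCB.yaml", epochs=120, batch=16, imgsz=640)
The (abridged) training log then looks like this: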
Epoch GPU_mem giou_loss cls_loss l1_loss Instances Size
0%| | 0/32 [00:00<?, ?it/s]/home/wqt/anaconda3/envs/yolov8/lib/python3.8/site-packages/torch/autograd/graph.py:744: UserWarning: grid_sampler_2d_backward_cuda does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True, warn_only=True)'. You can file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation. (Triggered internally at ../aten/src/ATen/Context.cpp:83.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
120/120 13.1G 0.3344 0.3707 0.02792 7 640: 100%|██████████| 32/32 [00:08<00:00, 3.95it/s]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|██████████| 4/4 [00:00<00:00, 7.21it/s]
all 125 533 0.959 0.948 0.966 0.499
120 epochs completed in 0.321 hours.
Optimizer stripped from /home/wqt/Projects/ultralytics/runs/detect/train6/weights/last.pt, 66.2MB
Optimizer stripped from /home/wqt/Projects/ultralytics/runs/detect/train6/weights/best.pt, 66.2MB
Validating /home/wqt/Projects/ultralytics/runs/detect/train6/weights/best.pt...
Ultralytics YOLOv8.2.31 🚀 Python-3.8.19 torch-2.3.1+cu121 CUDA:0 (NVIDIA GeForce RTX 4090, 24209MiB)
rt-detr-l summary: 498 layers, 31996070 parameters, 0 gradients, 103.5 GFLOPs
Class Images Instances Box(P R mAP50 mAP50-95): 100%|██████████| 4/4 [00:00<00:00, 5.40it/s]
all 125 533 0.956 0.966 0.971 0.508
missing_hole 18 77 0.989 1 0.995 0.589
mouse_bite 19 84 0.952 0.949 0.967 0.492
open_circuit 22 92 0.956 0.944 0.96 0.508
short 21 86 0.953 0.965 0.959 0.493
spur 22 93 0.927 0.961 0.96 0.471
spurious_copper 23 101 0.961 0.975 0.987 0.495
Speed: 0.1ms preprocess, 3.0ms inference, 0.0ms loss, 0.1ms postprocess per image
Results saved to /home/wqt/Projects/ultralytics/runs/detect/train6
With this training we obtain leading performance on this dataset, reaching a precision of 95.6%.
Qualitative results:
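These qualitative results can be reproduced with a short prediction script (sketch; the weights path comes from the training log above, the test image path is a placeholder):
from ultralytics import RTDETR
# Load the best checkpoint from the PCB training run
model = RTDETR("/home/wqt/Projects/ultralytics/runs/detect/train6/weights/best.pt")
# Predict on a test image and save the plotted result
results = model.predict(source="path/to/pcb_test_image.jpg", save=True)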
4. How to improve RT-DETR: mastering this excellent detector
Here we try to improve RT-DETR's stem module. The original stem is HGStem; we replace its plain Conv layers with RepConv and call the improved module RepHGStem.
4.1 Building RepHGStem
class RepHGStem(nn.Module):
    """
    StemBlock of PPHGNetV2 with 5 convolutions and one maxpool2d, with the plain Conv
    layers of stem1/stem3 replaced by RepConv.
    https://github.com/PaddlePaddle/PaddleDetection/blob/develop/ppdet/modeling/backbones/hgnet_v2.py
    """

    def __init__(self, c1, cm, c2):
        """Initialize the stem with input channels c1, hidden channels cm and output channels c2."""
        super().__init__()
        self.stem1 = RepConv(c1, cm, 3, 2, act=nn.ReLU())
        self.stem2a = Conv(cm, cm // 2, 2, 1, 0, act=nn.ReLU())
        self.stem2b = Conv(cm // 2, cm, 2, 1, 0, act=nn.ReLU())
        self.stem3 = RepConv(cm * 2, cm, 3, 2, act=nn.ReLU())
        self.stem4 = Conv(cm, c2, 1, 1, act=nn.ReLU())
        self.pool = nn.MaxPool2d(kernel_size=2, stride=1, padding=0, ceil_mode=True)

    def forward(self, x):
        """Forward pass of a PPHGNetV2 backbone stem layer."""
        x = self.stem1(x)
        x = F.pad(x, [0, 1, 0, 1])
        x2 = self.stem2a(x)
        x2 = F.pad(x2, [0, 1, 0, 1])
        x2 = self.stem2b(x2)
        x1 = self.pool(x)
        x = torch.cat([x1, x2], dim=1)
        x = self.stem3(x)
        x = self.stem4(x)
        return x
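The class relies on torch, nn (torch.nn), F (torch.nn.functional), Conv and RepConv being in scope; inside ultralytics/nn/modules/block.py they already are. For a quick standalone shape check, with the class pasted into the same script, a sketch; with the arguments [3, 32, 48] used later in the model printout, the stem should downsample a 640x640 input by 4x:
import torch
import torch.nn as nn
import torch.nn.functional as F
from ultralytics.nn.modules.conv import Conv, RepConv
# (RepHGStem defined as above)
x = torch.randn(1, 3, 640, 640)
stem = RepHGStem(3, 32, 48)
print(stem(x).shape)  # expected: torch.Size([1, 48, 160, 160])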
To add this module while still loading the previously trained pretrained weights, we need to modify the source: open ultralytics/engine/model.py, find the train method, locate the line shown in the figure below, and replace it.
This lets us load the pretrained weights and add the new module at the same time; this is also referred to as "freeze training".
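As an alternative that, in recent ultralytics versions, does not require editing engine/model.py, the modified architecture can be built from its YAML and the matching pretrained weights transferred with .load() (a sketch; rtdetr-l-rep.yaml is the custom config assumed here, matching the model summary printed in 4.2, and PCB.yaml is a placeholder dataset config):
from ultralytics import RTDETR
# Build the modified architecture (with RepHGStem) from its config file
model = RTDETR("rtdetr-l-rep.yaml")
# Transfer all matching weights from the original checkpoint; layers whose
# shapes changed (e.g. the new stem) keep their random initialization
model.load("rtdetr-l.pt")
# Fine-tune on the PCB data
model.train(data="PCB.yaml", epochs=230, batch=16, imgsz=640)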
Bug 1 fix:
File "/home/wqt/Projects/ultralytics/ultralytics/nn/tasks.py", line 855, in parse_model
m = getattr(torch.nn, m[3:]) if "nn." in m else globals()[m] # get module
KeyError: 'RepHGStem'
Solution: the KeyError means parse_model cannot find RepHGStem in its namespace. Register RepHGStem wherever HGStem appears in ultralytics/nn/tasks.py, i.e. add it to the import list at the top of the file and to the module lists inside parse_model that already contain HGStem.
Bug 2
File "/home/wqt/Projects/ultralytics/ultralytics/nn/tasks.py", line 10, in <module>
from ultralytics.nn.modules import (
ImportError: cannot import name 'RepHGStem' from 'ultralytics.nn.modules' (/home/wqt/Projects/ultralytics/ultralytics/nn/modules/__init__.py)
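The ImportError means RepHGStem is not exported by the modules package yet. The fix in our setup is to define the class in ultralytics/nn/modules/block.py (as in 4.1), add "RepHGStem" to the __all__ tuple there, and re-export it from ultralytics/nn/modules/__init__.py, for example:
# ultralytics/nn/modules/__init__.py: add RepHGStem to the existing block import
from .block import HGStem, RepHGStem  # keep the other names already imported here
# and add "RepHGStem" to this file's __all__ tuple as well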
4.2 Start training
Overriding model.yaml nc=80 with nc=6
WARNING ⚠️ no model scale passed. Assuming scale='l'.
from n params module arguments
0 -1 1 27520 ultralytics.nn.modules.block.RepHGStem [3, 32, 48]
1 -1 6 155072 ultralytics.nn.modules.block.HGBlock [48, 48, 128, 3, 6]
2 -1 1 1408 ultralytics.nn.modules.conv.DWConv [128, 128, 3, 2, 1, False]
3 -1 6 839296 ultralytics.nn.modules.block.HGBlock [128, 96, 512, 3, 6]
4 -1 1 5632 ultralytics.nn.modules.conv.DWConv [512, 512, 3, 2, 1, False]
5 -1 6 1695360 ultralytics.nn.modules.block.HGBlock [512, 192, 1024, 5, 6, True, False]
6 -1 6 2055808 ultralytics.nn.modules.block.HGBlock [1024, 192, 1024, 5, 6, True, True]
7 -1 6 2055808 ultralytics.nn.modules.block.HGBlock [1024, 192, 1024, 5, 6, True, True]
8 -1 1 11264 ultralytics.nn.modules.conv.DWConv [1024, 1024, 3, 2, 1, False]
9 -1 6 6708480 ultralytics.nn.modules.block.HGBlock [1024, 384, 2048, 5, 6, True, False]
10 -1 1 524800 ultralytics.nn.modules.conv.Conv [2048, 256, 1, 1, None, 1, 1, False]
11 -1 1 789760 ultralytics.nn.modules.transformer.AIFI [256, 1024, 8]
12 -1 1 66048 ultralytics.nn.modules.conv.Conv [256, 256, 1, 1]
13 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
14 7 1 262656 ultralytics.nn.modules.conv.Conv [1024, 256, 1, 1, None, 1, 1, False]
15 [-2, -1] 1 0 ultralytics.nn.modules.conv.Concat [1]
16 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
17 -1 1 66048 ultralytics.nn.modules.conv.Conv [256, 256, 1, 1]
18 -1 1 0 torch.nn.modules.upsampling.Upsample [None, 2, 'nearest']
19 3 1 131584 ultralytics.nn.modules.conv.Conv [512, 256, 1, 1, None, 1, 1, False]
20 [-2, -1] 1 0 ultralytics.nn.modules.conv.Concat [1]
21 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
22 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
23 [-1, 17] 1 0 ultralytics.nn.modules.conv.Concat [1]
24 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
25 -1 1 590336 ultralytics.nn.modules.conv.Conv [256, 256, 3, 2]
26 [-1, 12] 1 0 ultralytics.nn.modules.conv.Concat [1]
27 -1 3 2232320 ultralytics.nn.modules.block.RepC3 [512, 256, 3]
28 [21, 24, 27] 1 7314182 ultralytics.nn.modules.head.RTDETRDecoder [6, [256, 256, 256]]
rtdetr-l-rep summary: 685 layers, 32820678 parameters, 32820678 gradients, 108.2 GFLOPs
Transferred 914/953 items from pretrained weights
AMP: running Automatic Mixed Precision (AMP) checks with YOLOv8n...
AMP: checks passed ✅
train: Scanning /home/wqt/Projects/data/PCB_DATASET2YOLO/labels... 498 images, 0 backgrounds, 0 corrupt: 100%|██████████| 498/498 [00:00<00:00, 674.01it/s]
train: New cache created: /home/wqt/Projects/data/PCB_DATASET2YOLO/labels.cache
val: Scanning /home/wqt/Projects/data/PCB_DATASET2YOLO/labels... 125 images, 0 backgrounds, 0 corrupt: 100%|██████████| 125/125 [00:00<00:00, 1984.77it/s]
val: New cache created: /home/wqt/Projects/data/PCB_DATASET2YOLO/labels.cache
Plotting labels to /home/wqt/Projects/ultralytics/runs/detect/train7/labels.jpg...
optimizer: 'optimizer=auto' found, ignoring 'lr0=0.01' and 'momentum=0.937' and determining best 'optimizer', 'lr0' and 'momentum' automatically...
optimizer: AdamW(lr=0.001, momentum=0.9) with parameter groups 145 weight(decay=0.0), 208 weight(decay=0.0005), 228 bias(decay=0.0)
Image sizes 640 train, 640 val
Using 8 dataloader workers
Logging results to /home/wqt/Projects/ultralytics/runs/detect/train7
Starting training for 230 epochs...
The printout shows that RepHGStem has been added successfully and that the pretrained weights were loaded, i.e. Transferred 914/953 items from pretrained weights.
Comparison: the improved Rep model vs. the original model
Test results of the improved model:
0%| | 0/32 [00:00<?, ?it/s]/home/wqt/anaconda3/envs/yolov8/lib/python3.8/site-packages/torch/autograd/graph.py:744: UserWarning: grid_sampler_2d_backward_cuda does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True, warn_only=True)'. You can file an issue at https://github.com/pytorch/pytorch/issues to help us prioritize adding deterministic support for this operation. (Triggered internally at ../aten/src/ATen/Context.cpp:83.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
230/230 13.8G 0.3001 0.3589 0.02359 10 640: 100%|██████████| 32/32 [00:08<00:00, 3.85it/s]
Class Images Instances Box(P R mAP50 mAP50-95): 100%|██████████| 4/4 [00:00<00:00, 7.08it/s]
all 125 533 0.967 0.956 0.971 0.492
230 epochs completed in 0.630 hours.
Optimizer stripped from /home/wqt/Projects/ultralytics/runs/detect/train7/weights/last.pt, 66.2MB
Optimizer stripped from /home/wqt/Projects/ultralytics/runs/detect/train7/weights/best.pt, 66.2MB
Validating /home/wqt/Projects/ultralytics/runs/detect/train7/weights/best.pt...
Ultralytics YOLOv8.2.31 🚀 Python-3.8.19 torch-2.3.1+cu121 CUDA:0 (NVIDIA GeForce RTX 4090, 24209MiB)
rtdetr-l-rep summary: 498 layers, 31996070 parameters, 0 gradients, 103.5 GFLOPs
Class Images Instances Box(P R mAP50 mAP50-95): 100%|██████████| 4/4 [00:00<00:00, 5.41it/s]
all 125 533 0.972 0.947 0.97 0.503
missing_hole 18 77 0.999 1 0.995 0.57
mouse_bite 19 84 0.937 0.929 0.96 0.5
open_circuit 22 92 0.96 0.935 0.955 0.537
short 21 86 0.988 0.956 0.969 0.478
spur 22 93 0.976 0.889 0.952 0.438
spurious_copper 23 101 0.97 0.976 0.988 0.493
Speed: 0.1ms preprocess, 3.0ms inference, 0.0ms loss, 0.1ms postprocess per image
Results saved to /home/wqt/Projects/ultralytics/runs/detect/train7
Compared with the original model's precision of 95.6% on PKU-PCB (Section 3.3.2), the Rep model reaches 97.2%, a clear gain in precision, while mAP50 stays essentially the same (0.971 vs. 0.970).
Summary
RT-DETR is the new flagship of the DETR family of detection models. Using it, adapting it to our own tasks, and improving its performance is exactly what we are after.