YOLOv9改进策略：block优化 | ECVBlock即插即用的多尺度融合模块，助力小目标涨点

💡💡💡本文改进内容：ECVBlock即插即用的多尺度融合模块，助力检测任务有效涨点！

yolov9-c-EVCBlock summary: 1011 layers, 68102630 parameters, 68102598 gradients, 252.4 GFLOPs

改进结构图如下：

YOLOv9魔术师专栏

☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️ ☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️

包含注意力机制魔改、卷积魔改、检测头创新、损失&IOU优化、block优化&多层特征融合、轻量级网络设计、24年最新顶会改进思路、原创自研paper级创新等

☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️

✨✨✨ 新开专栏暂定免费限时开放，后续每月调价一次✨✨✨

🚀🚀🚀 本项目持续更新 | 更新完结保底≥50+ ，冲刺100+🚀🚀🚀

🍉🍉🍉 联系WX: AI_CV_0624 欢迎交流！🍉🍉🍉

YOLOv9魔改：注意力机制、检测头、blcok魔改、自研原创等

YOLOv9魔术师

💡💡💡全网独家首发创新（原创），适合paper ！！！

💡💡💡 2024年计算机视觉顶会创新点适用于Yolov5、Yolov7、Yolov8等各个Yolo系列，专栏文章提供每一步步骤和源码，轻松带你上手魔改网络！！！

💡💡💡重点：通过本专栏的阅读，后续你也可以设计魔改网络，在网络不同位置（Backbone、head、detect、loss等）进行魔改，实现创新！！！

1.YOLOv9原理介绍

论文： 2402.13616.pdf (arxiv.org)

代码：GitHub - WongKinYiu/yolov9: Implementation of paper - YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information摘要： 如今的深度学习方法重点关注如何设计最合适的目标函数，从而使得模型的预测结果能够最接近真实情况。同时，必须设计一个适当的架构，可以帮助获取足够的信息进行预测。然而，现有方法忽略了一个事实，即当输入数据经过逐层特征提取和空间变换时，大量信息将会丢失。因此，YOLOv9 深入研究了数据通过深度网络传输时数据丢失的重要问题，即信息瓶颈和可逆函数。作者提出了可编程梯度信息（programmable gradient information，PGI）的概念，来应对深度网络实现多个目标所需要的各种变化。PGI 可以为目标任务计算目标函数提供完整的输入信息，从而获得可靠的梯度信息来更新网络权值。此外，研究者基于梯度路径规划设计了一种新的轻量级网络架构，即通用高效层聚合网络（Generalized Efficient Layer Aggregation Network，GELAN）。该架构证实了 PGI 可以在轻量级模型上取得优异的结果。研究者在基于 MS COCO 数据集的目标检测任务上验证所提出的 GELAN 和 PGI。结果表明，与其他 SOTA 方法相比，GELAN 仅使用传统卷积算子即可实现更好的参数利用率。对于 PGI 而言，它的适用性很强，可用于从轻型到大型的各种模型。我们可以用它来获取完整的信息，从而使从头开始训练的模型能够比使用大型数据集预训练的 SOTA 模型获得更好的结果。对比结果如图1所示。

YOLOv9框架图

1.1 YOLOv9框架介绍

YOLOv9各个模型介绍

2.Centralized Feature Pyramid for Object Detection

论文地址： https://arxiv.org/abs/2210.02093

CFPNet即插即用，助力检测涨点，YOLOX/YOLOv5/YOLOV7均有效

摘要
视觉特征金字塔在广泛的应用中显示出其有效性和效率的优越性。

然而，现有的方法过分地集中于层间特征交互，而忽略了层内特征规则，这是经验证明是有益的。尽管一些方法试图在注意力机制或视觉变换器的帮助下学习紧凑的层内特征表示，但它们忽略了对密集预测任务非常重要的被忽略的角点区域。为了解决这一问题，本文提出了一种基于全局显式集中式特征规则的集中式特征金字塔（CFP）对象检测方法。具体而言，我们首先提出了一种空间显式视觉中心方案，其中使用轻量级MLP来捕捉全局长距离依赖关系，并使用并行可学习视觉中心机制来捕捉输入图像的局部角区域。在此基础上，我们以自上而下的方式对常用特征金字塔提出了一种全局集中的规则，其中使用从最深层内特征获得的显式视觉中心信息来调整正面浅层特征。与现有的特征金字塔相比，CFP不仅能够捕获全局长距离依赖关系，而且能够有效地获得全面但有区别的特征表示。在具有挑战性的MS-COCO上的实验结果验证了我们提出的CFP能够在最先进的YOLOv5和YOLOX目标检测基线上实现一致的性能增益。该代码发布于：CFPNet。

2.2 Centralized Feature Pyramid (CFP)

如图2所示，CFP主要由以下部分组成：输入图像、用于提取视觉特征金字塔的CNN主干、提出的显式视觉中心（EVC）、提出的全局集中规则（GCR）以及用于目标检测的去解耦head网络（由分类损失、回归损失和分割损失组成）。在图2中，EVC和GCR在提取的特征金字塔上实现。

2.3 Explicit Visual Center (EVC)

提出的EVC主要由两个并行连接的块组成，其中使用轻量级MLP来捕获顶级特征的全局长期依赖性（即全局信息）。

2.4 yolov9加入ECVBlock

如何将ECVBlock应用到yolov8是本文的关键，重点是增强用于这些检测器的特征金字塔的表示。

1）将ECVBlock添加到backbone或者是head在不同数据集的性能会不一致，比如本文添加到backbone，在NEU-DET钢材表面缺陷和道路缺陷如任务中取得的涨点也是不一样的；

2）比如在backbone添加的位置不同对最终的性能也是完全不一样的，这点也佐证了深度学习具有玄学，体现了调参的必要性，在不断的调参中自然会取得一定经验值；

3.ECVBlock加入到YOLOv9

3.1新建py文件，路径为models/block/ECVBlock.py

######################  EVC  ####  AI&CV   start ###############################

# by AI&CV EVCBlock

# ecvblcok


import torch
from torch import nn
from torch.nn import functional as F
from torch.autograd import Function, Variable
from torch.nn import Module, parameter

import warnings

try:
    from queue import Queue
except ImportError:
    from Queue import Queue

from torch.nn.modules.batchnorm import _BatchNorm
from functools import partial

from timm.models.layers import DropPath, trunc_normal_
from timm.models.registry import register_model
from timm.models.layers.helpers import to_2tuple
from timm.data import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD


# LVC
class Encoding(nn.Module):
    def __init__(self, in_channels, num_codes):
        super(Encoding, self).__init__()
        # init codewords and smoothing factor
        self.in_channels, self.num_codes = in_channels, num_codes
        num_codes = 64
        std = 1. / ((num_codes * in_channels) ** 0.5)
        # [num_codes, channels]
        self.codewords = nn.Parameter(
            torch.empty(num_codes, in_channels, dtype=torch.float).uniform_(-std, std), requires_grad=True)
        # [num_codes]
        self.scale = nn.Parameter(torch.empty(num_codes, dtype=torch.float).uniform_(-1, 0), requires_grad=True)

    @staticmethod
    def scaled_l2(x, codewords, scale):
        num_codes, in_channels = codewords.size()
        b = x.size(0)
        expanded_x = x.unsqueeze(2).expand((b, x.size(1), num_codes, in_channels))

        reshaped_codewords = codewords.view((1, 1, num_codes, in_channels))

        reshaped_scale = scale.view((1, 1, num_codes))  # N, num_codes

        scaled_l2_norm = reshaped_scale * (expanded_x - reshaped_codewords).pow(2).sum(dim=3)
        return scaled_l2_norm

    @staticmethod
    def aggregate(assignment_weights, x, codewords):
        num_codes, in_channels = codewords.size()

        reshaped_codewords = codewords.view((1, 1, num_codes, in_channels))
        b = x.size(0)

        expanded_x = x.unsqueeze(2).expand((b, x.size(1), num_codes, in_channels))

        assignment_weights = assignment_weights.unsqueeze(3)  # b, N, num_codes,

        encoded_feat = (assignment_weights * (expanded_x - reshaped_codewords)).sum(1)
        return encoded_feat

    def forward(self, x):
        assert x.dim() == 4 and x.size(1) == self.in_channels
        b, in_channels, w, h = x.size()

        # [batch_size, height x width, channels]
        x = x.view(b, self.in_channels, -1).transpose(1, 2).contiguous()

        # assignment_weights: [batch_size, channels, num_codes]
        assignment_weights = F.softmax(self.scaled_l2(x, self.codewords, self.scale), dim=2)

        # aggregate
        encoded_feat = self.aggregate(assignment_weights, x, self.codewords)
        return encoded_feat


#  1*1 3*3 1*1
class ConvBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, res_conv=False, act_layer=nn.ReLU, groups=1,
                 norm_layer=partial(nn.BatchNorm2d, eps=1e-6), drop_block=None, drop_path=None):
        super(ConvBlock, self).__init__()
        self.in_channels = in_channels
        expansion = 4
        c = out_channels // expansion

        self.conv1 = nn.Conv2d(in_channels, c, kernel_size=1, stride=1, padding=0, bias=False)  # [64, 256, 1, 1]
        self.bn1 = norm_layer(c)
        self.act1 = act_layer(inplace=True)

        self.conv2 = nn.Conv2d(c, c, kernel_size=3, stride=stride, groups=groups, padding=1, bias=False)
        self.bn2 = norm_layer(c)
        self.act2 = act_layer(inplace=True)

        self.conv3 = nn.Conv2d(c, out_channels, kernel_size=1, stride=1, padding=0, bias=False)
        self.bn3 = norm_layer(out_channels)
        self.act3 = act_layer(inplace=True)

        if res_conv:
            self.residual_conv = nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, padding=0, bias=False)
            self.residual_bn = norm_layer(out_channels)

        self.res_conv = res_conv
        self.drop_block = drop_block
        self.drop_path = drop_path

    def zero_init_last_bn(self):
        nn.init.zeros_(self.bn3.weight)

    def forward(self, x, return_x_2=True):
        residual = x

        x = self.conv1(x)
        x = self.bn1(x)
        if self.drop_block is not None:
            x = self.drop_block(x)
        x = self.act1(x)

        x = self.conv2(x)  # if x_t_r is None else self.conv2(x + x_t_r)
        x = self.bn2(x)
        if self.drop_block is not None:
            x = self.drop_block(x)
        x2 = self.act2(x)

        x = self.conv3(x2)
        x = self.bn3(x)
        if self.drop_block is not None:
            x = self.drop_block(x)

        if self.drop_path is not None:
            x = self.drop_path(x)

        if self.res_conv:
            residual = self.residual_conv(residual)
            residual = self.residual_bn(residual)

        x += residual
        x = self.act3(x)

        if return_x_2:
            return x, x2
        else:
            return x


class Mean(nn.Module):
    def __init__(self, dim, keep_dim=False):
        super(Mean, self).__init__()
        self.dim = dim
        self.keep_dim = keep_dim

    def forward(self, input):
        return input.mean(self.dim, self.keep_dim)


class Mlp(nn.Module):
    """
    Implementation of MLP with 1*1 convolutions. Input: tensor with shape [B, C, H, W]
    """

    def __init__(self, in_features, hidden_features=None,
                 out_features=None, act_layer=nn.GELU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Conv2d(in_features, hidden_features, 1)
        self.act = act_layer()
        self.fc2 = nn.Conv2d(hidden_features, out_features, 1)
        self.drop = nn.Dropout(drop)
        self.apply(self._init_weights)

    def _init_weights(self, m):
        if isinstance(m, nn.Conv2d):
            trunc_normal_(m.weight, std=.02)
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)

    def forward(self, x):
        x = self.fc1(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.drop(x)
        return x


class LayerNormChannel(nn.Module):
    """
    LayerNorm only for Channel Dimension.
    Input: tensor in shape [B, C, H, W]
    """

    def __init__(self, num_channels, eps=1e-05):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(num_channels))
        self.bias = nn.Parameter(torch.zeros(num_channels))
        self.eps = eps

    def forward(self, x):
        u = x.mean(1, keepdim=True)
        s = (x - u).pow(2).mean(1, keepdim=True)
        x = (x - u) / torch.sqrt(s + self.eps)
        x = self.weight.unsqueeze(-1).unsqueeze(-1) * x \
            + self.bias.unsqueeze(-1).unsqueeze(-1)
        return x


class GroupNorm(nn.GroupNorm):
    """
    Group Normalization with 1 group.
    Input: tensor in shape [B, C, H, W]
    """

    def __init__(self, num_channels, **kwargs):
        super().__init__(1, num_channels, **kwargs)


class SiLU(nn.Module):
    """export-friendly version of nn.SiLU()"""

    @staticmethod
    def forward(x):
        return x * torch.sigmoid(x)


def get_activation(name="silu", inplace=True):
    if name == "silu":
        module = nn.SiLU(inplace=inplace)
    elif name == "relu":
        module = nn.ReLU(inplace=inplace)
    elif name == "lrelu":
        module = nn.LeakyReLU(0.1, inplace=inplace)
    else:
        raise AttributeError("Unsupported act type: {}".format(name))
    return module


class BaseConv(nn.Module):
    """A Conv2d -> Batchnorm -> silu/leaky relu block"""  # CBL

    def __init__(
            self, in_channels, out_channels, ksize, stride, groups=1, bias=False, act="silu"
    ):
        super().__init__()
        # same padding
        pad = (ksize - 1) // 2
        self.conv = nn.Conv2d(
            in_channels,
            out_channels,
            kernel_size=ksize,
            stride=stride,
            padding=pad,
            groups=groups,
            bias=bias,
        )
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = get_activation(act, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def fuseforward(self, x):
        return self.act(self.conv(x))


class DWConv(nn.Module):
    """Depthwise Conv + Conv"""

    def __init__(self, in_channels, out_channels, ksize, stride=1, act="silu"):
        super().__init__()
        self.dconv = BaseConv(
            in_channels,
            in_channels,
            ksize=ksize,
            stride=stride,
            groups=in_channels,
            act=act,
        )
        self.pconv = BaseConv(
            in_channels, out_channels, ksize=1, stride=1, groups=1, act=act
        )

    def forward(self, x):
        x = self.dconv(x)
        return self.pconv(x)


class LVCBlock(nn.Module):
    def __init__(self, c1, c2, num_codes, channel_ratio=0.25, base_channel=64):
        super(LVCBlock, self).__init__()
        self.c2 = c2
        self.num_codes = num_codes
        num_codes = 64

        self.conv_1 = ConvBlock(c1, c1, res_conv=True, stride=1)

        self.LVC = nn.Sequential(
            nn.Conv2d(c1, c1, 1, bias=False),
            nn.BatchNorm2d(c1),
            nn.ReLU(inplace=True),
            Encoding(c1, num_codes=num_codes),
            nn.BatchNorm1d(num_codes),
            nn.ReLU(inplace=True),
            Mean(dim=1))
        self.fc = nn.Sequential(nn.Linear(c1, c1), nn.Sigmoid())

    def forward(self, x):
        x = self.conv_1(x, return_x_2=False)
        en = self.LVC(x)
        gam = self.fc(en)
        b, in_channels, _, _ = x.size()
        y = gam.view(b, in_channels, 1, 1)
        x = F.relu_(x + x * y)
        return x


# LightMLPBlock
class LightMLPBlock(nn.Module):
    def __init__(self, c1, c2, ksize=1, stride=1, act="silu",
                 mlp_ratio=4., drop=0., act_layer=nn.GELU,
                 use_layer_scale=True, layer_scale_init_value=1e-5, drop_path=0.,
                 norm_layer=GroupNorm):  # act_layer=nn.GELU,
        super().__init__()
        self.dw = DWConv(c1, c2, ksize=1, stride=1, act="silu")
        self.linear = nn.Linear(c2, c2)  # learnable position embedding
        self.c2 = c2

        self.norm1 = norm_layer(c1)
        self.norm2 = norm_layer(c1)

        mlp_hidden_dim = int(c1 * mlp_ratio)
        self.mlp = Mlp(in_features=c1, hidden_features=mlp_hidden_dim, act_layer=nn.GELU,
                       drop=drop)

        self.drop_path = DropPath(drop_path) if drop_path > 0. \
            else nn.Identity()

        self.use_layer_scale = use_layer_scale
        if use_layer_scale:
            self.layer_scale_1 = nn.Parameter(
                layer_scale_init_value * torch.ones((c2)), requires_grad=True)
            self.layer_scale_2 = nn.Parameter(
                layer_scale_init_value * torch.ones((c2)), requires_grad=True)

    def forward(self, x):
        if self.use_layer_scale:
            x = x + self.drop_path(self.layer_scale_1.unsqueeze(-1).unsqueeze(-1) * self.dw(self.norm1(x)))
            x = x + self.drop_path(self.layer_scale_2.unsqueeze(-1).unsqueeze(-1) * self.mlp(self.norm2(x)))
        else:
            x = x + self.drop_path(self.dw(self.norm1(x)))
            x = x + self.drop_path(self.mlp(self.norm2(x)))
        return x


# EVCBlock
class EVCBlock(nn.Module):
    def __init__(self, c1, c2, channel_ratio=4, base_channel=16):
        super().__init__()
        expansion = 2
        ch = c2 * expansion
        # Stem stage: get the feature maps by conv block (copied form resnet.py) 进入conformer框架之前的处理
        self.conv1 = nn.Conv2d(c1, c1, kernel_size=7, stride=1, padding=3, bias=False)  # 1 / 2 [112, 112]
        self.bn1 = nn.BatchNorm2d(c1)
        self.act1 = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)  # 1 / 4 [56, 56]

        # LVC
        self.lvc = LVCBlock(c1, c2, num_codes=64)  # c1值暂时未定
        # LightMLPBlock
        self.l_MLP = LightMLPBlock(c1, c2, ksize=1, stride=1, act="silu", act_layer=nn.GELU, mlp_ratio=4., drop=0.,
                                   use_layer_scale=True, layer_scale_init_value=1e-5, drop_path=0.,
                                   norm_layer=GroupNorm)
        self.cnv1 = nn.Conv2d(ch, c2, kernel_size=1, stride=1, padding=0)

    def forward(self, x):
        x1 = self.maxpool(self.act1(self.bn1(self.conv1(x))))
        # LVCBlock
        x_lvc = self.lvc(x1)
        # LightMLPBlock
        x_lmlp = self.l_MLP(x1)
        # concat
        x = torch.cat((x_lvc, x_lmlp), dim=1)
        x = self.cnv1(x)
        return x

######################  EVC  ####  AI&CV   end ###############################

3.2修改yolo.py

1)首先进行引用

from models.block.ECVBlock import EVCBlock

2）修改def parse_model(d, ch): # model_dict, input_channels(3)

在源码基础上加入EVCBlock

        n = n_ = max(round(n * gd), 1) if n > 1 else n  # depth gain
        if m in {
            Conv, AConv, ConvTranspose, 
            Bottleneck, SPP, SPPF, DWConv, BottleneckCSP, nn.ConvTranspose2d, DWConvTranspose2d, SPPCSPC, ADown,
            RepNCSPELAN4, SPPELAN,EVCBlock}:
            c1, c2 = ch[f], args[0]
            if c2 != no:  # if not output
                c2 = make_divisible(c2 * gw, 8)

            args = [c1, c2, *args[1:]]
            if m in {BottleneckCSP, SPPCSPC}:
                args.insert(2, n)  # number of repeats
                n = 1
        elif m is nn.BatchNorm2d:
            args = [ch[f]]