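Judging from the source code, PaddleOCR supports four versions: PP-OCR, PP-OCRv2, PP-OCRv3, and PP-OCRv4. This article takes the backbone network of the PaddleOCR v3 model as its subject and explores the internal structure of the network.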
Table of contents
- Starting point
- Conv-BN layer
- SE layer
- Residual unit
- Backbone
- Code experiment
- Summary
Starting point
Lines 21-36 of the official configuration file describe the model architecture and read as follows:
Architecture:
  model_type: det
  algorithm: DB
  Transform:
  Backbone:
    name: MobileNetV3
    scale: 0.5
    model_name: large
    disable_se: True
  Neck:
    name: RSEFPN
    out_channels: 96
    shortcut: True
  Head:
    name: DBHead
    k: 50
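This configuration describes the network architecture used for training. Reading it in order: the model type is text detection (det), the detection algorithm is DB, the backbone is MobileNetV3, the neck is RSEFPN, and the head is DBHead. This article focuses on the backbone, whose parameter settings are: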
- scale=0.5
- model_name='large'
- disable_se=True
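The MobileNetV3 code lives in det_mobilenet_v3.py; see the official source for the full file. To compare against the configuration above, only its constructor is excerpted here: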
def __init__(self, in_channels=3, model_name="large", scale=0.5, disable_se=False, **kwargs):
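The configuration mostly keeps the defaults; only disable_se changes from its default False to True. The parameters mean the following:
- in_channels (int): number of channels of the input tensor; defaults to 3.
- model_name (str): model variant, either large or small. The large model contains 15 residual units and the small model contains 11.
- scale (float): channel scaling factor; 0.35, 0.5, 0.75, 1.0, and 1.25 are supported. A larger value gives the intermediate layers more channels and the model more parameters. With an input tensor of BCHW=5,3,64,192 and scale set to 0.5, 1, and 1.25, the parameter size of the model is 1.57 MB, 5.66 MB, and 8.77 MB respectively. This is the "Params size (MB)" figure reported by paddle.summary; for scale=0.5 it corresponds to 410,848 parameters, and it does not depend on the height or width of the input.
- disable_se (bool): whether to disable the SE module inside the residual units; defaults to False.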
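To make the effect of scale concrete, the sketch below scales a few channel counts and rounds them to a multiple of 8, the convention commonly used in MobileNet implementations. The helper name make_divisible and the base channel counts (16, 24, 40, 112, 160, 960, the usual MobileNetV3-large values) are assumptions for illustration, not an excerpt from det_mobilenet_v3.py.

```python
def make_divisible(v, divisor=8, min_value=None):
    """Round v to a multiple of divisor without dropping more than 10%."""
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    if new_v < 0.9 * v:  # rounding down lost too much, step up by one divisor
        new_v += divisor
    return new_v


# Assumed base channel counts of MobileNetV3-large (illustrative only).
BASE_CHANNELS = [16, 24, 40, 112, 160, 960]


def scaled_channels(scale):
    """Channel counts after applying the scale factor and rounding."""
    return [make_divisible(scale * c) for c in BASE_CHANNELS]


for scale in (0.35, 0.5, 0.75, 1.0, 1.25):
    print(scale, scaled_channels(scale))
```

Under these assumptions, scale=0.5 yields 8, 16, 24, 56, 80, and 480, which are the channel counts that appear in the summary output in the code experiment section.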
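Conv-BN layer
The ConvBNLayer class is defined on lines 159-201 of det_mobilenet_v3.py. It consists mainly of a 2D convolution layer (Convolution2D Layer) and a batch normalization layer (Batch Normalization Layer); this article calls it the Conv-BN layer. Reading the code gives the internal structure shown in the figure below:
In the figure above, the input tensor has shape c_in,h,w (the batch size of the actual tensor is omitted for brevity), i.e. input channels, height, and width. A Conv2D operation produces an output tensor of shape c_out,h',w', i.e. output channels, height, and width; the height and width calculation is described in the official conv2d documentation. Batch normalization follows and leaves the shape unchanged. Finally, if the if_act parameter is True, an activation function is applied. Two are supported, relu and hardswish, both described in the official relu and hard_swish documentation.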
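The shape arithmetic and the two activations are simple enough to check by hand. The sketch below assumes dilation 1 and uses the standard formula h' = (h + 2p - k) // s + 1; the padding p=1 for a 3x3, stride-2 layer is an assumption. The functions conv_out_size, relu, and hardswish are hypothetical plain-Python helpers, not PaddlePaddle APIs.

```python
def conv_out_size(size, k, s, p):
    """Output height or width of a convolution with dilation 1."""
    return (size + 2 * p - k) // s + 1


def relu(x):
    """max(0, x)."""
    return max(0.0, x)


def hardswish(x):
    """x * relu6(x + 3) / 6, a piecewise-linear approximation of swish."""
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0


# A 3x3 convolution with stride 2 and assumed padding 1 on a 64x320 input.
h_out = conv_out_size(64, k=3, s=2, p=1)
w_out = conv_out_size(320, k=3, s=2, p=1)
print(h_out, w_out)
```

With these settings the height and width come out as 32 and 160, half of the input in each dimension.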
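SE layer
The SEModule class is defined on lines 261-289 of det_mobilenet_v3.py. It consists mainly of an adaptive average pooling layer (Adaptive Average Pool Layer) and two convolution layers. Its English name is Squeeze and Excitation Network; this article calls SEModule the SE layer. Reading the code gives the internal structure shown in the figure below:
In the figure above, the input tensor has shape c_in,h,w. After the average pooling operation the shape becomes c_in,1,1. Next, a squeeze convolution reduces the channel count to c_in//r, where r is configurable and defaults to 4. A relu activation follows, then an excitation convolution restores the channel count to c_in. A HardSigmoid operation comes next (see the official hard_sigmoid documentation). The result is multiplied with the input tensor, so the final output has the same shape as the input.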
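The squeeze-and-excitation steps can be sketched in NumPy for a single sample. This is not the SEModule code: a 1x1 convolution on a c,1,1 tensor is written as a matrix-vector product, the name se_forward is hypothetical, and the hard sigmoid slope 0.2 and offset 0.5 are assumed values.

```python
import numpy as np


def hard_sigmoid(x, slope=0.2, offset=0.5):
    """Piecewise-linear sigmoid; slope and offset are assumed values."""
    return np.clip(slope * x + offset, 0.0, 1.0)


def se_forward(x, w1, b1, w2, b2):
    """Squeeze-and-excitation on one sample x of shape (c_in, h, w).

    w1: (c_in // r, c_in) squeeze weights, w2: (c_in, c_in // r) excitation weights.
    """
    pooled = x.mean(axis=(1, 2))                # (c_in,): global average pool
    squeezed = np.maximum(w1 @ pooled + b1, 0)  # (c_in // r,): squeeze conv + relu
    gate = hard_sigmoid(w2 @ squeezed + b2)     # (c_in,): excitation conv + hard sigmoid
    return x * gate[:, None, None]              # rescale each channel of the input


c_in, r = 8, 4
rng = np.random.default_rng(0)
x = rng.standard_normal((c_in, 4, 6))
w1 = rng.standard_normal((c_in // r, c_in))
w2 = rng.standard_normal((c_in, c_in // r))
y = se_forward(x, w1, np.zeros(c_in // r), w2, np.zeros(c_in))
print(y.shape)
```

The output keeps the input shape; each channel is only multiplied by a gate value between 0 and 1.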
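Residual unit
The ResidualUnit class is defined on lines 204-258 of det_mobilenet_v3.py. It consists mainly of three Conv-BN layers (ConvBNLayer) and one SE layer (SEModule); this article calls it the residual unit. Reading the code gives the internal structure shown in the figure below:
In the figure above, the input tensor has shape c_in,h,w, where c_in is the number of input channels. After the first Conv-BN layer the shape is c_mid,h,w, where c_mid is the configured number of intermediate channels. Because this first layer uses a 1x1 kernel (k), stride (s) 1, and padding (p) 0, the height and width are unchanged. The second Conv-BN layer takes its kernel size, stride, and padding from the parameters, so its output shape is c_mid,h',w'. If use_se is True, the tensor then passes through the SE layer, whose output has the same shape as its input. The third Conv-BN layer produces a tensor of shape c_out,h',w', where c_out is the configured number of output channels. If the stride s_in is 1 and c_out equals c_in, then if_shortcut is True; in that case the output of the third Conv-BN layer has the same shape as the input tensor, and the unit ends with an element-wise addition of the two. This is where the name residual becomes clear: the stacked layers process the input and their result is added back onto it, so what the layers contribute is the residual, the difference between output and input.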
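The shape changes described above can be traced with a few lines of plain Python. The function residual_unit_shapes is a hypothetical helper; it assumes dilation 1 and padding (k - 1) // 2 in the middle layer, and the kernel size 3 in the two calls is an assumption as well.

```python
def conv_out_size(size, k, s, p):
    """Output height or width of a convolution with dilation 1."""
    return (size + 2 * p - k) // s + 1


def residual_unit_shapes(c_in, h, w, c_mid, c_out, k, s):
    """Return the (channels, height, width) after each step, plus if_shortcut."""
    p = (k - 1) // 2  # assumed padding, keeps h and w when s == 1
    h_out = conv_out_size(h, k, s, p)
    w_out = conv_out_size(w, k, s, p)
    shapes = [
        (c_in, h, w),            # input
        (c_mid, h, w),           # after the first (1x1) Conv-BN layer
        (c_mid, h_out, w_out),   # after the second Conv-BN layer and optional SE layer
        (c_out, h_out, w_out),   # after the third (1x1) Conv-BN layer
    ]
    if_shortcut = s == 1 and c_in == c_out
    return shapes, if_shortcut


# Stride 1 with equal channels: the shortcut addition is applied.
print(residual_unit_shapes(8, 32, 160, c_mid=8, c_out=8, k=3, s=1))
# Stride 2 with different channels: no shortcut, height and width are halved.
print(residual_unit_shapes(8, 32, 160, c_mid=32, c_out=16, k=3, s=2))
```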
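Backbone
The MobileNetV3 class is defined on lines 37-156 of det_mobilenet_v3.py; this article treats MobileNetV3 as the backbone. With the previous sections in hand (a Conv-BN layer is convolution plus normalization, an SE layer is average pooling plus convolutions, and a residual unit is Conv-BN layers plus an SE layer), the backbone code is easy to follow. To keep things concrete and avoid getting lost in parameters, assume an input tensor of shape 5,3,64,320: batch size 5, 3 channels, image height 64, and width 320. Under that assumption, and with the settings from the configuration file in the first section, the backbone structure is summarized in the figure below:
In the figure above, only three parameters are listed for each ConvBNLayer, k, s, and p, standing for kernel size, stride, and padding. For each ResidualUnit, four parameters are listed, mid, k, s, and r, standing for intermediate channels, kernel size, stride, and whether the residual addition is applied (if_shortcut). The four numbers before and after each layer give the BCHW shape of the tensor, and blue text marks a shape that changes across the layer. The figure is explained in five parts below: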
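- First Conv-BN layer
  A Conv-BN layer with kernel size 3 and stride 2 converts the 3 input channels to 8 output channels and halves both width and height.
- First stage, stage0
  Three residual units convert 8 input channels to 16 output channels and halve both width and height.
- Second stage, stage1
  Three residual units convert 16 input channels to 24 output channels and halve both width and height.
- Third stage, stage2
  Six residual units convert 24 input channels to 56 output channels and halve both width and height.
- Fourth stage, stage3
  Three residual units and one Conv-BN layer convert 56 input channels to 480 output channels and halve both width and height.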
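The five parts above can be replayed as a shape trace. The sketch below is plain Python, not the MobileNetV3 class: the output channels per residual unit are taken from the summary output in the code experiment section, while the kernel sizes and the padding rule (k - 1) // 2 are assumptions.

```python
# (c_out, k, s) for every residual unit, grouped by stage.
STAGES = [
    [(8, 3, 1), (16, 3, 2), (16, 3, 1)],                   # stage0
    [(24, 5, 2), (24, 5, 1), (24, 5, 1)],                  # stage1
    [(40, 3, 2), (40, 3, 1), (40, 3, 1),
     (40, 3, 1), (56, 3, 1), (56, 3, 1)],                  # stage2
    [(80, 5, 2), (80, 5, 1), (80, 5, 1)],                  # stage3
]


def conv_out_size(size, k, s):
    """Output size with assumed padding (k - 1) // 2 and dilation 1."""
    return (size + 2 * ((k - 1) // 2) - k) // s + 1


def trace_backbone(h, w):
    """Return (channels, height, width) after the first layer and after each stage."""
    # First Conv-BN layer: kernel 3, stride 2, 8 output channels.
    c, h, w = 8, conv_out_size(h, 3, 2), conv_out_size(w, 3, 2)
    shapes = [(c, h, w)]
    for units in STAGES:
        for c_out, k, s in units:
            c, h, w = c_out, conv_out_size(h, k, s), conv_out_size(w, k, s)
        shapes.append((c, h, w))
    # stage3 ends with a 1x1 Conv-BN layer that expands 80 channels to 480.
    shapes[-1] = (480, h, w)
    return shapes


for shape in trace_backbone(64, 320):
    print(shape)
```

For a 64x320 input the trace ends at 480 channels with height 2 and width 10, so the backbone reduces each spatial dimension by a factor of 32 overall.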
Code experiment
Calling the paddle.summary function with an input shape of (5, 3, 64, 320) gives the following output:
---------------------------------------------------------------------------
Layer (type) Input Shape Output Shape Param #
===========================================================================
Conv2D-1 [[5, 3, 64, 320]] [5, 8, 32, 160] 216
BatchNorm-1 [[5, 8, 32, 160]] [5, 8, 32, 160] 32
ConvBNLayer-1 [[5, 3, 64, 320]] [5, 8, 32, 160] 0 <- first Conv-BN layer
Conv2D-2 [[5, 8, 32, 160]] [5, 8, 32, 160] 64 <- stage0 begins
BatchNorm-2 [[5, 8, 32, 160]] [5, 8, 32, 160] 32
ConvBNLayer-2 [[5, 8, 32, 160]] [5, 8, 32, 160] 0
Conv2D-3 [[5, 8, 32, 160]] [5, 8, 32, 160] 72
BatchNorm-3 [[5, 8, 32, 160]] [5, 8, 32, 160] 32
ConvBNLayer-3 [[5, 8, 32, 160]] [5, 8, 32, 160] 0
Conv2D-4 [[5, 8, 32, 160]] [5, 8, 32, 160] 64
BatchNorm-4 [[5, 8, 32, 160]] [5, 8, 32, 160] 32
ConvBNLayer-4 [[5, 8, 32, 160]] [5, 8, 32, 160] 0
ResidualUnit-1 [[5, 8, 32, 160]] [5, 8, 32, 160] 0
Conv2D-5 [[5, 8, 32, 160]] [5, 32, 32, 160] 256
BatchNorm-5 [[5, 32, 32, 160]] [5, 32, 32, 160] 128
ConvBNLayer-5 [[5, 8, 32, 160]] [5, 32, 32, 160] 0
Conv2D-6 [[5, 32, 32, 160]] [5, 32, 16, 80] 288
BatchNorm-6 [[5, 32, 16, 80]] [5, 32, 16, 80] 128
ConvBNLayer-6 [[5, 32, 32, 160]] [5, 32, 16, 80] 0
Conv2D-7 [[5, 32, 16, 80]] [5, 16, 16, 80] 512
BatchNorm-7 [[5, 16, 16, 80]] [5, 16, 16, 80] 64
ConvBNLayer-7 [[5, 32, 16, 80]] [5, 16, 16, 80] 0
ResidualUnit-2 [[5, 8, 32, 160]] [5, 16, 16, 80] 0
Conv2D-8 [[5, 16, 16, 80]] [5, 40, 16, 80] 640
BatchNorm-8 [[5, 40, 16, 80]] [5, 40, 16, 80] 160
ConvBNLayer-8 [[5, 16, 16, 80]] [5, 40, 16, 80] 0
Conv2D-9 [[5, 40, 16, 80]] [5, 40, 16, 80] 360
BatchNorm-9 [[5, 40, 16, 80]] [5, 40, 16, 80] 160
ConvBNLayer-9 [[5, 40, 16, 80]] [5, 40, 16, 80] 0
Conv2D-10 [[5, 40, 16, 80]] [5, 16, 16, 80] 640
BatchNorm-10 [[5, 16, 16, 80]] [5, 16, 16, 80] 64
ConvBNLayer-10 [[5, 40, 16, 80]] [5, 16, 16, 80] 0
ResidualUnit-3 [[5, 16, 16, 80]] [5, 16, 16, 80] 0 <- stage0 ends
Conv2D-11 [[5, 16, 16, 80]] [5, 40, 16, 80] 640 <- stage1 begins
BatchNorm-11 [[5, 40, 16, 80]] [5, 40, 16, 80] 160
ConvBNLayer-11 [[5, 16, 16, 80]] [5, 40, 16, 80] 0
Conv2D-12 [[5, 40, 16, 80]] [5, 40, 8, 40] 1,000
BatchNorm-12 [[5, 40, 8, 40]] [5, 40, 8, 40] 160
ConvBNLayer-12 [[5, 40, 16, 80]] [5, 40, 8, 40] 0
Conv2D-13 [[5, 40, 8, 40]] [5, 24, 8, 40] 960
BatchNorm-13 [[5, 24, 8, 40]] [5, 24, 8, 40] 96
ConvBNLayer-13 [[5, 40, 8, 40]] [5, 24, 8, 40] 0
ResidualUnit-4 [[5, 16, 16, 80]] [5, 24, 8, 40] 0
Conv2D-14 [[5, 24, 8, 40]] [5, 64, 8, 40] 1,536
BatchNorm-14 [[5, 64, 8, 40]] [5, 64, 8, 40] 256
ConvBNLayer-14 [[5, 24, 8, 40]] [5, 64, 8, 40] 0
Conv2D-15 [[5, 64, 8, 40]] [5, 64, 8, 40] 1,600
BatchNorm-15 [[5, 64, 8, 40]] [5, 64, 8, 40] 256
ConvBNLayer-15 [[5, 64, 8, 40]] [5, 64, 8, 40] 0
Conv2D-16 [[5, 64, 8, 40]] [5, 24, 8, 40] 1,536
BatchNorm-16 [[5, 24, 8, 40]] [5, 24, 8, 40] 96
ConvBNLayer-16 [[5, 64, 8, 40]] [5, 24, 8, 40] 0
ResidualUnit-5 [[5, 24, 8, 40]] [5, 24, 8, 40] 0
Conv2D-17 [[5, 24, 8, 40]] [5, 64, 8, 40] 1,536
BatchNorm-17 [[5, 64, 8, 40]] [5, 64, 8, 40] 256
ConvBNLayer-17 [[5, 24, 8, 40]] [5, 64, 8, 40] 0
Conv2D-18 [[5, 64, 8, 40]] [5, 64, 8, 40] 1,600
BatchNorm-18 [[5, 64, 8, 40]] [5, 64, 8, 40] 256
ConvBNLayer-18 [[5, 64, 8, 40]] [5, 64, 8, 40] 0
Conv2D-19 [[5, 64, 8, 40]] [5, 24, 8, 40] 1,536
BatchNorm-19 [[5, 24, 8, 40]] [5, 24, 8, 40] 96
ConvBNLayer-19 [[5, 64, 8, 40]] [5, 24, 8, 40] 0
ResidualUnit-6 [[5, 24, 8, 40]] [5, 24, 8, 40] 0 <- stage1 ends
Conv2D-20 [[5, 24, 8, 40]] [5, 120, 8, 40] 2,880 <- stage2 begins
BatchNorm-20 [[5, 120, 8, 40]] [5, 120, 8, 40] 480
ConvBNLayer-20 [[5, 24, 8, 40]] [5, 120, 8, 40] 0
Conv2D-21 [[5, 120, 8, 40]] [5, 120, 4, 20] 1,080
BatchNorm-21 [[5, 120, 4, 20]] [5, 120, 4, 20] 480
ConvBNLayer-21 [[5, 120, 8, 40]] [5, 120, 4, 20] 0
Conv2D-22 [[5, 120, 4, 20]] [5, 40, 4, 20] 4,800
BatchNorm-22 [[5, 40, 4, 20]] [5, 40, 4, 20] 160
ConvBNLayer-22 [[5, 120, 4, 20]] [5, 40, 4, 20] 0
ResidualUnit-7 [[5, 24, 8, 40]] [5, 40, 4, 20] 0
Conv2D-23 [[5, 40, 4, 20]] [5, 104, 4, 20] 4,160
BatchNorm-23 [[5, 104, 4, 20]] [5, 104, 4, 20] 416
ConvBNLayer-23 [[5, 40, 4, 20]] [5, 104, 4, 20] 0
Conv2D-24 [[5, 104, 4, 20]] [5, 104, 4, 20] 936
BatchNorm-24 [[5, 104, 4, 20]] [5, 104, 4, 20] 416
ConvBNLayer-24 [[5, 104, 4, 20]] [5, 104, 4, 20] 0
Conv2D-25 [[5, 104, 4, 20]] [5, 40, 4, 20] 4,160
BatchNorm-25 [[5, 40, 4, 20]] [5, 40, 4, 20] 160
ConvBNLayer-25 [[5, 104, 4, 20]] [5, 40, 4, 20] 0
ResidualUnit-8 [[5, 40, 4, 20]] [5, 40, 4, 20] 0
Conv2D-26 [[5, 40, 4, 20]] [5, 96, 4, 20] 3,840
BatchNorm-26 [[5, 96, 4, 20]] [5, 96, 4, 20] 384
ConvBNLayer-26 [[5, 40, 4, 20]] [5, 96, 4, 20] 0
Conv2D-27 [[5, 96, 4, 20]] [5, 96, 4, 20] 864
BatchNorm-27 [[5, 96, 4, 20]] [5, 96, 4, 20] 384
ConvBNLayer-27 [[5, 96, 4, 20]] [5, 96, 4, 20] 0
Conv2D-28 [[5, 96, 4, 20]] [5, 40, 4, 20] 3,840
BatchNorm-28 [[5, 40, 4, 20]] [5, 40, 4, 20] 160
ConvBNLayer-28 [[5, 96, 4, 20]] [5, 40, 4, 20] 0
ResidualUnit-9 [[5, 40, 4, 20]] [5, 40, 4, 20] 0
Conv2D-29 [[5, 40, 4, 20]] [5, 96, 4, 20] 3,840
BatchNorm-29 [[5, 96, 4, 20]] [5, 96, 4, 20] 384
ConvBNLayer-29 [[5, 40, 4, 20]] [5, 96, 4, 20] 0
Conv2D-30 [[5, 96, 4, 20]] [5, 96, 4, 20] 864
BatchNorm-30 [[5, 96, 4, 20]] [5, 96, 4, 20] 384
ConvBNLayer-30 [[5, 96, 4, 20]] [5, 96, 4, 20] 0
Conv2D-31 [[5, 96, 4, 20]] [5, 40, 4, 20] 3,840
BatchNorm-31 [[5, 40, 4, 20]] [5, 40, 4, 20] 160
ConvBNLayer-31 [[5, 96, 4, 20]] [5, 40, 4, 20] 0
ResidualUnit-10 [[5, 40, 4, 20]] [5, 40, 4, 20] 0
Conv2D-32 [[5, 40, 4, 20]] [5, 240, 4, 20] 9,600
BatchNorm-32 [[5, 240, 4, 20]] [5, 240, 4, 20] 960
ConvBNLayer-32 [[5, 40, 4, 20]] [5, 240, 4, 20] 0
Conv2D-33 [[5, 240, 4, 20]] [5, 240, 4, 20] 2,160
BatchNorm-33 [[5, 240, 4, 20]] [5, 240, 4, 20] 960
ConvBNLayer-33 [[5, 240, 4, 20]] [5, 240, 4, 20] 0
Conv2D-34 [[5, 240, 4, 20]] [5, 56, 4, 20] 13,440
BatchNorm-34 [[5, 56, 4, 20]] [5, 56, 4, 20] 224
ConvBNLayer-34 [[5, 240, 4, 20]] [5, 56, 4, 20] 0
ResidualUnit-11 [[5, 40, 4, 20]] [5, 56, 4, 20] 0
Conv2D-35 [[5, 56, 4, 20]] [5, 336, 4, 20] 18,816
BatchNorm-35 [[5, 336, 4, 20]] [5, 336, 4, 20] 1,344
ConvBNLayer-35 [[5, 56, 4, 20]] [5, 336, 4, 20] 0
Conv2D-36 [[5, 336, 4, 20]] [5, 336, 4, 20] 3,024
BatchNorm-36 [[5, 336, 4, 20]] [5, 336, 4, 20] 1,344
ConvBNLayer-36 [[5, 336, 4, 20]] [5, 336, 4, 20] 0
Conv2D-37 [[5, 336, 4, 20]] [5, 56, 4, 20] 18,816
BatchNorm-37 [[5, 56, 4, 20]] [5, 56, 4, 20] 224
ConvBNLayer-37 [[5, 336, 4, 20]] [5, 56, 4, 20] 0
ResidualUnit-12 [[5, 56, 4, 20]] [5, 56, 4, 20] 0 <- stage2 ends
Conv2D-38 [[5, 56, 4, 20]] [5, 336, 4, 20] 18,816 <- stage3 begins
BatchNorm-38 [[5, 336, 4, 20]] [5, 336, 4, 20] 1,344
ConvBNLayer-38 [[5, 56, 4, 20]] [5, 336, 4, 20] 0
Conv2D-39 [[5, 336, 4, 20]] [5, 336, 2, 10] 8,400
BatchNorm-39 [[5, 336, 2, 10]] [5, 336, 2, 10] 1,344
ConvBNLayer-39 [[5, 336, 4, 20]] [5, 336, 2, 10] 0
Conv2D-40 [[5, 336, 2, 10]] [5, 80, 2, 10] 26,880
BatchNorm-40 [[5, 80, 2, 10]] [5, 80, 2, 10] 320
ConvBNLayer-40 [[5, 336, 2, 10]] [5, 80, 2, 10] 0
ResidualUnit-13 [[5, 56, 4, 20]] [5, 80, 2, 10] 0
Conv2D-41 [[5, 80, 2, 10]] [5, 480, 2, 10] 38,400
BatchNorm-41 [[5, 480, 2, 10]] [5, 480, 2, 10] 1,920
ConvBNLayer-41 [[5, 80, 2, 10]] [5, 480, 2, 10] 0
Conv2D-42 [[5, 480, 2, 10]] [5, 480, 2, 10] 12,000
BatchNorm-42 [[5, 480, 2, 10]] [5, 480, 2, 10] 1,920
ConvBNLayer-42 [[5, 480, 2, 10]] [5, 480, 2, 10] 0
Conv2D-43 [[5, 480, 2, 10]] [5, 80, 2, 10] 38,400
BatchNorm-43 [[5, 80, 2, 10]] [5, 80, 2, 10] 320
ConvBNLayer-43 [[5, 480, 2, 10]] [5, 80, 2, 10] 0
ResidualUnit-14 [[5, 80, 2, 10]] [5, 80, 2, 10] 0
Conv2D-44 [[5, 80, 2, 10]] [5, 480, 2, 10] 38,400
BatchNorm-44 [[5, 480, 2, 10]] [5, 480, 2, 10] 1,920
ConvBNLayer-44 [[5, 80, 2, 10]] [5, 480, 2, 10] 0
Conv2D-45 [[5, 480, 2, 10]] [5, 480, 2, 10] 12,000
BatchNorm-45 [[5, 480, 2, 10]] [5, 480, 2, 10] 1,920
ConvBNLayer-45 [[5, 480, 2, 10]] [5, 480, 2, 10] 0
Conv2D-46 [[5, 480, 2, 10]] [5, 80, 2, 10] 38,400
BatchNorm-46 [[5, 80, 2, 10]] [5, 80, 2, 10] 320
ConvBNLayer-46 [[5, 480, 2, 10]] [5, 80, 2, 10] 0
ResidualUnit-15 [[5, 80, 2, 10]] [5, 80, 2, 10] 0
Conv2D-47 [[5, 80, 2, 10]] [5, 480, 2, 10] 38,400
BatchNorm-47 [[5, 480, 2, 10]] [5, 480, 2, 10] 1,920
ConvBNLayer-47 [[5, 80, 2, 10]] [5, 480, 2, 10] 0 <- stage3 ends
===========================================================================
Total params: 410,848
Trainable params: 398,480
Non-trainable params: 12,368
---------------------------------------------------------------------------
Input size (MB): 1.17
Forward/backward pass size (MB): 116.78
Params size (MB): 1.57
Estimated Total Size (MB): 119.52
---------------------------------------------------------------------------
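Each entry can be read against the backbone structure figure; the stage boundaries are marked in the output. Because the yml file sets disable_se to True, which disables the SE module, the SE layer is not actually used in the backbone model. Changing disable_se to False makes SEModule-1 through SEModule-8 appear in the output; readers who are interested can modify the code and try it.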
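Several numbers in the summary can be reproduced with simple arithmetic. The sketch below assumes bias-free convolutions, four values per BatchNorm channel, float32 storage, and MB meaning 2**20 bytes; these assumptions are chosen because they reproduce the printed figures, and the helper names are hypothetical.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Number of weights in a bias-free convolution."""
    return (c_in // groups) * c_out * k * k


def bn_params(c):
    """Scale, shift, running mean, and running variance: four values per channel."""
    return 4 * c


def size_mb(count, bytes_per_value=4):
    """Size in MB (2**20 bytes) of `count` float32 values."""
    return count * bytes_per_value / 2**20


# Conv2D-1 (3 -> 8 channels, 3x3), BatchNorm-1, and Conv2D-3, whose 72
# parameters match a depthwise 3x3 convolution over 8 channels.
print(conv_params(3, 8, 3), bn_params(8), conv_params(8, 8, 3, groups=8))
# Footer: size of all 410,848 parameters and of the (5, 3, 64, 320) input.
print(round(size_mb(410_848), 2), round(size_mb(5 * 3 * 64 * 320), 2))
```

Under these assumptions the results match the Conv2D-1, BatchNorm-1, and Conv2D-3 rows (216, 32, 72) and the footer values 1.57 and 1.17.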
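Summary
This article first explained three building blocks: the Conv-BN layer (ConvBNLayer), the SE layer (SEModule), and the residual unit (ResidualUnit). It then analyzed the internal structure of MobileNetV3, and finally used Python code to show the summary output of the PP-OCRv3 text detection backbone. The test code is available on gitee.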