Linux 36.3 + JetPack v6.0@jetson-inference: Semantic Segmentation
- 1. Background
- 2. segNet
- 2.1 Command-Line Options
- 2.2 Downloading the Models
- 2.2.1 Cityscapes
- 2.2.2 DeepScene
- 2.2.3 MHP
- 2.2.4 VOC
- 2.2.5 SUN
- 2.3 Usage Examples
- 2.3.1 Single Image
- 2.3.2 Multiple Images
- 2.3.3 Video
- 3. Code
- 3.1 Python
- 3.2 C++
- 4. References
1. Background
Semantic segmentation is based on image recognition, except that the classification happens at the pixel level rather than over the entire image. It is particularly useful for environmental perception: it produces a dense, per-pixel classification of the many potential objects in a scene, including both the foreground and the background.
2. segNet
segNet accepts a 2D image as input and outputs a second image with a per-pixel classification mask overlaid on it; each pixel of the mask corresponds to the class of the object at that location (a minimal Python usage sketch follows the table below).
The available pre-trained segmentation models use an FCN-ResNet18 network that runs in real time on Jetson. They cover a variety of environments and subjects, including urban streets, off-road trails, and indoor office and home spaces.
The table below lists the available pre-trained semantic segmentation models and the corresponding --network argument that segnet uses to load them. They are based on the 21-class FCN-ResNet18 network, trained with PyTorch on various datasets and resolutions, and exported to ONNX format so they can be loaded with TensorRT.
Dataset | Resolution | CLI Argument | Accuracy | Jetson Nano | Jetson Xavier |
---|---|---|---|---|---|
Cityscapes | 512x256 | fcn-resnet18-cityscapes-512x256 | 83.3% | 48 FPS | 480 FPS |
Cityscapes | 1024x512 | fcn-resnet18-cityscapes-1024x512 | 87.3% | 12 FPS | 175 FPS |
Cityscapes | 2048x1024 | fcn-resnet18-cityscapes-2048x1024 | 89.6% | 3 FPS | 47 FPS |
DeepScene | 576x320 | fcn-resnet18-deepscene-576x320 | 96.4% | 26 FPS | 360 FPS |
DeepScene | 864x480 | fcn-resnet18-deepscene-864x480 | 96.9% | 14 FPS | 190 FPS |
Multi-Human | 512x320 | fcn-resnet18-mhp-512x320 | 86.5% | 34 FPS | 370 FPS |
Multi-Human | 640x360 | fcn-resnet18-mhp-640x360 | 87.1% | 23 FPS | 325 FPS |
Pascal VOC | 320x320 | fcn-resnet18-voc-320x320 | 85.9% | 45 FPS | 508 FPS |
Pascal VOC | 512x320 | fcn-resnet18-voc-512x320 | 88.5% | 34 FPS | 375 FPS |
SUN RGB-D | 512x400 | fcn-resnet18-sun-512x400 | 64.3% | 28 FPS | 340 FPS |
SUN RGB-D | 640x512 | fcn-resnet18-sun-640x512 | 65.1% | 17 FPS | 224 FPS |
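As a concrete illustration of segNet's input/output behavior and of the --network strings in the table, here is a minimal single-image sketch using the Python bindings. This is only a sketch, assuming the jetson-inference Python bindings are installed and that it is run from build/aarch64/bin; the image path and network name are placeholders taken from the examples later in this article.

```python
from jetson_inference import segNet
from jetson_utils import loadImage, saveImage, cudaAllocMapped

# load a pre-trained model by its CLI argument name from the table above
net = segNet("fcn-resnet18-cityscapes-512x256")

# load the input image and allocate an output buffer of the same size/format
img = loadImage("images/city_0.jpg")
overlay = cudaAllocMapped(width=img.width, height=img.height, format=img.format)

# classify every pixel, then blend the class colors over the original image
net.Process(img)
net.Overlay(overlay, filter_mode="linear")

saveImage("images/test/city_overlay.jpg", overlay)
```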
2.1 Command-Line Options
$ segnet --help
usage: segnet [--help] [--network NETWORK] ...
input_URI [output_URI]
Segment and classify a video/image stream using a semantic segmentation DNN.
See below for additional arguments that may not be shown above.
positional arguments:
input_URI resource URI of input stream (see videoSource below)
output_URI resource URI of output stream (see videoOutput below)
segNet arguments:
--network=NETWORK pre-trained model to load, one of the following:
* fcn-resnet18-cityscapes-512x256
* fcn-resnet18-cityscapes-1024x512
* fcn-resnet18-cityscapes-2048x1024
* fcn-resnet18-deepscene-576x320
* fcn-resnet18-deepscene-864x480
* fcn-resnet18-mhp-512x320
* fcn-resnet18-mhp-640x360
* fcn-resnet18-voc-320x320 (default)
* fcn-resnet18-voc-512x320
* fcn-resnet18-sun-512x400
* fcn-resnet18-sun-640x512
--model=MODEL path to custom model to load (caffemodel, uff, or onnx)
--prototxt=PROTOTXT path to custom prototxt to load (for .caffemodel only)
--labels=LABELS path to text file containing the labels for each class
--colors=COLORS path to text file containing the colors for each class
--input-blob=INPUT name of the input layer (default: 'input_0')
--output-blob=OUTPUT name of the output layer (default: 'output_0')
--alpha=ALPHA overlay alpha blending value, range 0-255 (default: 150)
--visualize=VISUAL visualization flags (e.g. --visualize=overlay,mask)
valid combinations are: 'overlay', 'mask'
--profile enable layer profiling in TensorRT
videoSource arguments:
input resource URI of the input stream, for example:
* /dev/video0 (V4L2 camera #0)
* csi://0 (MIPI CSI camera #0)
* rtp://@:1234 (RTP stream)
* rtsp://user:pass@ip:1234 (RTSP stream)
* webrtc://@:1234/my_stream (WebRTC stream)
* file://my_image.jpg (image file)
* file://my_video.mp4 (video file)
* file://my_directory/ (directory of images)
--input-width=WIDTH explicitly request a width of the stream (optional)
--input-height=HEIGHT explicitly request a height of the stream (optional)
--input-rate=RATE explicitly request a framerate of the stream (optional)
--input-save=FILE path to video file for saving the input stream to disk
--input-codec=CODEC RTP requires the codec to be set, one of these:
* h264, h265
* vp8, vp9
* mpeg2, mpeg4
* mjpeg
--input-decoder=TYPE the decoder engine to use, one of these:
* cpu
* omx (aarch64/JetPack4 only)
* v4l2 (aarch64/JetPack5 only)
--input-flip=FLIP flip method to apply to input:
* none (default)
* counterclockwise
* rotate-180
* clockwise
* horizontal
* vertical
* upper-right-diagonal
* upper-left-diagonal
--input-loop=LOOP for file-based inputs, the number of loops to run:
* -1 = loop forever
* 0 = don't loop (default)
* >0 = set number of loops
videoOutput arguments:
output resource URI of the output stream, for example:
* file://my_image.jpg (image file)
* file://my_video.mp4 (video file)
* file://my_directory/ (directory of images)
* rtp://<remote-ip>:1234 (RTP stream)
* rtsp://@:8554/my_stream (RTSP stream)
* webrtc://@:1234/my_stream (WebRTC stream)
* display://0 (OpenGL window)
--output-codec=CODEC desired codec for compressed output streams:
* h264 (default), h265
* vp8, vp9
* mpeg2, mpeg4
* mjpeg
--output-encoder=TYPE the encoder engine to use, one of these:
* cpu
* omx (aarch64/JetPack4 only)
* v4l2 (aarch64/JetPack5 only)
--output-save=FILE path to a video file for saving the compressed stream
to disk, in addition to the primary output above
--bitrate=BITRATE desired target VBR bitrate for compressed streams,
in bits per second. The default is 4000000 (4 Mbps)
--headless don't create a default OpenGL GUI window
logging arguments:
--log-file=FILE output destination file (default is stdout)
--log-level=LEVEL message output threshold, one of the following:
* silent
* error
* warning
* success
* info
* verbose (default)
* debug
--verbose enable verbose logging (same as --log-level=verbose)
--debug enable debug logging (same as --log-level=debug)
Note: for basic image and video operations, see 《Linux 36.3 + JetPack v6.0@jetson-inference之视频操作》.
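Note also that the videoSource/videoOutput and segNet flags above apply equally to the Python sample, because segnet.py simply forwards sys.argv to the underlying objects (see section 3.1). A minimal sketch of that pattern, with placeholder stream URIs:

```python
import sys
from jetson_inference import segNet
from jetson_utils import videoSource, videoOutput

# flags such as --alpha, --visualize, --input-flip or --bitrate do not need to be
# parsed here explicitly; they are forwarded through sys.argv to the C++ layer
net    = segNet("fcn-resnet18-voc-320x320", sys.argv)
camera = videoSource("csi://0", argv=sys.argv)       # placeholder input URI
screen = videoOutput("display://0", argv=sys.argv)   # placeholder output URI
```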
2.2 Downloading the Models
There are two ways to obtain the models:
- they are downloaded automatically during initialization when the network object (e.g. segNet) is created, or
- the model files are manually placed under the data/networks/ directory.
In mainland China, the Great Firewall turns this automatic download into an obstacle for fledgling users like us. If you have the means, you can configure a proxy as described in 《apt-get通过代理更新系统》.
Fortunately, NVIDIA kindly provides a workaround: all of the models are pre-staged at a location that is reachable from mainland China: Github - model-mirror-190618.
--network=NETWORK pre-trained model to load, one of the following:
* fcn-resnet18-cityscapes-512x256
* fcn-resnet18-cityscapes-1024x512
* fcn-resnet18-cityscapes-2048x1024
* fcn-resnet18-deepscene-576x320
* fcn-resnet18-deepscene-864x480
* fcn-resnet18-mhp-512x320
* fcn-resnet18-mhp-640x360
* fcn-resnet18-voc-320x320 (default)
* fcn-resnet18-voc-512x320
* fcn-resnet18-sun-512x400
* fcn-resnet18-sun-640x512
--model=MODEL path to custom model to load (caffemodel, uff, or onnx)
Based on the model information above, the command supports the eleven pre-trained FCN-ResNet18 models listed there (with fcn-resnet18-voc-320x320 as the default), as well as custom models supplied as caffemodel, uff, or onnx files.
As an example, let's download the default fcn-resnet18-voc-320x320 model from the mirror:
$ mkdir model-mirror-190618
$ cd model-mirror-190618
$ wget https://github.com/dusty-nv/jetson-inference/releases/download/model-mirror-190618/FCN-ResNet18-Pascal-VOC-320x320.tar.gz
$ tar -zxvf FCN-ResNet18-Pascal-VOC-320x320.tar.gz -C ../data/networks
$ cd ..
Note: when extracting, make sure the files end up under the FCN-ResNet18-Pascal-VOC-320x320 directory inside data/networks/.
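To check that the manually placed model is picked up, a quick sketch (run from build/aarch64/bin with the Python bindings installed; GetNumClasses()/GetClassDesc() are the accessors this sketch assumes for listing the classes):

```python
from jetson_inference import segNet

# if the archive was extracted into data/networks/FCN-ResNet18-Pascal-VOC-320x320/,
# the constructor finds the files locally instead of attempting a download
net = segNet("fcn-resnet18-voc-320x320")

print("loaded model with", net.GetNumClasses(), "classes:")
for c in range(net.GetNumClasses()):
    print("  class", c, "->", net.GetClassDesc(c))
```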
2.2.1 Cityscapes
The Cityscapes dataset focuses on the semantic understanding of urban street scenes.
2.2.2 DeepScene
DeepScene provides unimodal AdapNet++ and multimodal SSMA models trained on various datasets.
2.2.3 MHP
Multi-Human Parsing (MHP) is a project from the Learning and Vision (LV) group at the National University of Singapore (NUS) that aims to advance fine-grained visual understanding of humans in crowd scenes. Multi-human parsing differs significantly from traditional, well-defined object recognition tasks such as object detection (which only provides coarse predictions of object locations), instance segmentation (which only predicts instance-level masks without details about body parts or fashion categories), and human parsing (whose category-level pixel predictions do not distinguish between different identities). In the real world, scenes with multiple interacting people are more realistic and more common.
2.2.4 VOC
The PASCAL VOC project:
- provides standardized image datasets for object class recognition
- provides a common set of tools for accessing the datasets and annotations
- enables the evaluation and comparison of different methods
- ran challenges evaluating performance on object class recognition (2005-2012, now concluded)
2.2.5 SUN
SUN RGB-D was captured by four different sensors and contains 10,000 RGB-D images, at a scale similar to PASCAL VOC. The entire dataset is densely annotated, including 146,617 2D polygons and 58,657 3D bounding boxes with accurate object orientations, as well as the 3D room layout and scene category for each image. This dataset makes it possible to train data-hungry algorithms for scene-understanding tasks, to evaluate them with direct and meaningful 3D metrics, to avoid overfitting to small test sets, and to study cross-sensor bias.
2.3 Usage Examples
$ cd build/aarch64/bin/
2.3.1 Single Image
# C++
$ ./segnet --network=<model> input.jpg output.jpg # overlay segmentation on original
$ ./segnet --network=<model> --alpha=200 input.jpg output.jpg # make the overlay less opaque
$ ./segnet --network=<model> --visualize=mask input.jpg output.jpg # output the solid segmentation mask
# Python
$ ./segnet.py --network=<model> input.jpg output.jpg # overlay segmentation on original
$ ./segnet.py --network=<model> --alpha=200 input.jpg output.jpg # make the overlay less opaque
$ ./segnet.py --network=<model> --visualize=mask input.jpg output.jpg # output the segmentation mask
For example:
# C++
$ ./segnet --network=fcn-resnet18-cityscapes images/city_0.jpg images/test/output.jpg
# Python
$ ./segnet.py --network=fcn-resnet18-cityscapes images/city_0.jpg images/test/output.jpg
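The same visualizations are also available directly from the Python API. Below is a hedged sketch that reproduces the overlay, the solid mask, and a side-by-side composite of the two (mirroring the cudaOverlay compositing that segnet.py performs, see section 3.1); the image paths are placeholders:

```python
from jetson_inference import segNet
from jetson_utils import (loadImage, saveImage, cudaAllocMapped,
                          cudaOverlay, cudaDeviceSynchronize)

net = segNet("fcn-resnet18-cityscapes-512x256")
net.SetOverlayAlpha(200)                         # same effect as --alpha=200

img = loadImage("images/city_0.jpg")

# overlay at full resolution, mask at half resolution, composited side by side
overlay   = cudaAllocMapped(width=img.width, height=img.height, format=img.format)
mask      = cudaAllocMapped(width=img.width // 2, height=img.height // 2, format=img.format)
composite = cudaAllocMapped(width=overlay.width + mask.width, height=overlay.height,
                            format=img.format)

net.Process(img)
net.Overlay(overlay, filter_mode="linear")       # like --visualize=overlay
net.Mask(mask, filter_mode="linear")             # like --visualize=mask (colorized)

# paste the overlay and the mask next to each other in the composite buffer
cudaOverlay(overlay, composite, 0, 0)
cudaOverlay(mask, composite, overlay.width, 0)
cudaDeviceSynchronize()

saveImage("images/test/city_composite.jpg", composite)
```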
2.3.2 Multiple Images
# C++
$ ./segnet --network=fcn-resnet18-sun "images/room_*.jpg" images/test/room_output_%i.jpg
# Python
$ ./segnet.py --network=fcn-resnet18-sun "images/room_*.jpg" images/test/room_output_%i.jpg
2.3.3 Video
# Download test video
wget https://nvidia.box.com/shared/static/veuuimq6pwvd62p9fresqhrrmfqz0e2f.mp4 -O pedestrians.mp4
# C++
$ ./segnet --network=fcn-resnet18-cityscapes ../../../pedestrians.mp4 images/test/pedestrians_ssd_segnet_cpp.mp4
# Python
$ ./segnet.py --network=fcn-resnet18-cityscapes ../../../pedestrians.mp4 images/test/pedestrians_ssd_segnet_python.mp4
[Video: segmentation result on pedestrians.mp4]
Note: judging from the results on video shot from a distance, the segmentation is not ideal, so its practical value ultimately depends on the application scenario.
3. Code
3.1 Python
- Import Statements:
├── import sys
├── import argparse
├── from jetson_inference import segNet
├── from jetson_utils import videoSource, videoOutput, cudaOverlay, cudaDeviceSynchronize, Log
└── from segnet_utils import *
- Parse Command Line:
├── parser = argparse.ArgumentParser(...)
├── parser.add_argument("input", ...)
├── parser.add_argument("output", ...)
├── parser.add_argument("--network", ...)
├── parser.add_argument("--filter-mode", ...)
├── parser.add_argument("--visualize", ...)
├── parser.add_argument("--ignore-class", ...)
├── parser.add_argument("--alpha", ...)
├── parser.add_argument("--stats", ...)
└── args = parser.parse_known_args()[0]
- Load Segmentation Network:
├── net = segNet(args.network, sys.argv)
- Set Alpha Blending Value:
└── net.SetOverlayAlpha(args.alpha)
- Create Video Output:
└── output = videoOutput(args.output, argv=sys.argv)
- Create Buffer Manager:
└── buffers = segmentationBuffers(net, args)
- Create Video Source:
└── input = videoSource(args.input, argv=sys.argv)
- Process Frames Loop:
└── while True:
├── img_input = input.Capture()
├── if img_input is None:
│ └── continue
├── buffers.Alloc(img_input.shape, img_input.format)
├── net.Process(img_input, ignore_class=args.ignore_class)
├── if buffers.overlay:
│ └── net.Overlay(buffers.overlay, filter_mode=args.filter_mode)
├── if buffers.mask:
│ └── net.Mask(buffers.mask, filter_mode=args.filter_mode)
├── if buffers.composite:
│ ├── cudaOverlay(buffers.overlay, buffers.composite, 0, 0)
│ └── cudaOverlay(buffers.mask, buffers.composite, buffers.overlay.width, 0)
├── output.Render(buffers.output)
├── output.SetStatus("{:s} | Network {:.0f} FPS".format(args.network, net.GetNetworkFPS()))
├── cudaDeviceSynchronize()
├── net.PrintProfilerTimes()
├── if args.stats:
│ └── buffers.ComputeStats()
└── if not input.IsStreaming() or not output.IsStreaming():
└── break
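For reference, here is a condensed, runnable variant of the flow outlined above, limited to the overlay visualization so the segmentationBuffers helper is not needed (a sketch only; stream URIs are passed on the command line, as in segnet.py):

```python
import sys
from jetson_inference import segNet
from jetson_utils import videoSource, videoOutput, cudaAllocMapped

net = segNet("fcn-resnet18-voc-320x320")    # or any --network name from the table
net.SetOverlayAlpha(150)

input  = videoSource(sys.argv[1], argv=sys.argv)   # e.g. file://pedestrians.mp4
output = videoOutput(sys.argv[2], argv=sys.argv)   # e.g. images/test/out.mp4
overlay = None

while True:
    img = input.Capture()
    if img is None:                          # capture timeout, try again
        continue

    if overlay is None:                      # allocate once the frame size is known
        overlay = cudaAllocMapped(width=img.width, height=img.height, format=img.format)

    net.Process(img)
    net.Overlay(overlay, filter_mode="linear")

    output.Render(overlay)
    output.SetStatus("segNet | Network {:.0f} FPS".format(net.GetNetworkFPS()))

    if not input.IsStreaming() or not output.IsStreaming():
        break
```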
3.2 C++
#include statements
├── "videoSource.h"
├── "videoOutput.h"
├── "cudaOverlay.h"
├── "cudaMappedMemory.h"
├── "segNet.h"
└── <signal.h>
Global variables
└── bool signal_recieved = false;
Function definitions
├── void sig_handler(int signo)
│ └── if (signo == SIGINT)
│ └── LogVerbose("received SIGINT\n");
│ └── signal_recieved = true;
└── int usage()
├── printf("usage: segnet [--help] [--network NETWORK] ...\n");
├── printf(" input_URI [output_URI]\n\n");
├── printf("Segment and classify a video/image stream using a semantic segmentation DNN.\n");
├── printf("See below for additional arguments that may not be shown above.\n\n");
├── printf("positional arguments:\n");
├── printf(" input_URI resource URI of input stream (see videoSource below)\n");
├── printf(" output_URI resource URI of output stream (see videoOutput below)\n\n");
├── printf("%s\n", segNet::Usage());
├── printf("%s\n", videoSource::Usage());
├── printf("%s\n", videoOutput::Usage());
└── printf("%s\n", Log::Usage());
segmentation buffers
├── pixelType* imgMask = NULL;
├── pixelType* imgOverlay = NULL;
├── pixelType* imgComposite = NULL;
├── pixelType* imgOutput = NULL;
├── int2 maskSize;
├── int2 overlaySize;
├── int2 compositeSize;
└── int2 outputSize;
allocate buffers
└── bool allocBuffers(int width, int height, uint32_t flags)
├── if (imgOverlay != NULL && width == overlaySize.x && height == overlaySize.y)
├── CUDA_FREE_HOST(imgMask);
├── CUDA_FREE_HOST(imgOverlay);
├── CUDA_FREE_HOST(imgComposite);
├── overlaySize = make_int2(width, height);
├── if (!cudaAllocMapped(&imgOverlay, overlaySize))
├── imgOutput = imgOverlay;
├── outputSize = overlaySize;
├── maskSize = (flags & segNet::VISUALIZE_OVERLAY) ? make_int2(width/2, height/2) : overlaySize;
├── if (!cudaAllocMapped(&imgMask, maskSize))
├── imgOutput = imgMask;
├── outputSize = maskSize;
├── compositeSize = make_int2(overlaySize.x + maskSize.x, overlaySize.y);
├── if (!cudaAllocMapped(&imgComposite, compositeSize))
└── imgOutput = imgComposite;
└── outputSize = compositeSize;
main function
├── parse command line
├── attach signal handler
├── create input stream
├── create output stream
├── create segmentation network
├── set alpha blending value
├── get overlay/mask filtering mode
├── get visualization flags
├── get object class to ignore
└── processing loop
├── capture next image
│ └── if (!input->Capture(&imgInput, &status))
├── allocate buffers for this size frame
│ └── if (!allocBuffers(input->GetWidth(), input->GetHeight(), visualizationFlags))
├── process the segmentation network
│ └── if (!net->Process(imgInput, input->GetWidth(), input->GetHeight(), ignoreClass))
├── generate overlay
│ └── if (visualizationFlags & segNet::VISUALIZE_OVERLAY)
├── generate mask
│ └── if (visualizationFlags & segNet::VISUALIZE_MASK)
├── generate composite
│ └── if ((visualizationFlags & segNet::VISUALIZE_OVERLAY) && (visualizationFlags & segNet::VISUALIZE_MASK))
├── render outputs
│ └── if (output != NULL)
└── wait for the GPU to finish
4. References
[1] jetson-inference - Semantic Segmentation with SegNet