小车AI视觉识别--9.目标检测

一、目标检测概述

本节主要解决的问题是如何使用OpenCV中的dnn模块，用来导入一个实现训练好的目标检测网络。但是对opencv的版本是有要求的。目前用深度学习进行目标检测，主要有三种方法：

Faster R-CNNs
You Only Look Once(YOLO)
Single Shot Detectors(SSDs)

Faster R-CNNs是最常听说的基于深度学习的神经网络了。然而，这种方法在技术上是很难懂的（尤其是对于深度学习新手），也难以实现，训练起来也是很困难。此外，即使是使用了“Faster”的方法实现R-CNNs（这里R表示候选区域Region Proposal），算法依然是比较慢的，大约是7FPS。如果我们追求速度，我们可以转向YOLO，因为它非常的快，在TianXGPU上可以达到40-90 FPS，最快的版本可能达到155 FPS。但YOLO的问题在于它的精度还有待提高。SSDs最初是由谷歌开发的，可以说是以上两者之间的平衡。相对于Faster R-CNNs，它的算法更加直接。相对于YOLO，又更加准确。

二、模型结构

MobileNet的主要工作是用depthwise sparable convolutions（深度级可分离卷积）替代过去的standard convolutions（标准卷积）来解决卷积网络的计算效率和参数量的问题。MobileNets模型基于是depthwise sparable convolutions（深度级可分离卷积），它可以将标准卷积分解成一个深度卷积和一个点卷积（1 × 1卷积核）。深度卷积将每个卷积核应用到每一个通道，而1 × 1卷积用来组合通道卷积的输出。

在MobileNet的基本组件中会加入Batch Normalization（BN），即在每次SGD（随机梯度下降）时，标准化处理，使得结果（输出信号各个维度）的均值为0，方差为1。一般在神经网络训练时遇到收敛速度很慢，或梯度爆炸等无法训练的状况时可以尝试BN来解决。另外，在一般使用情况下也可以加入BN来加快训练速度，提高模型精度。

除此之外，模型还使用ReLU激活函数，所以depthwise separable convolution的基本结构如下图所示：

网络解析（二）：MobileNets详解

而MobileNets网络是由很多上图所示的depthwise separable convolution组合而成的。其具体的网络结构如下图所示：

网络解析（二）：MobileNets详解

三、实验源码

#!/usr/bin/env python3
# -*-coding: utf-8 -*-
"""
    @Project: python-learning-notes
    @File   : openpose_for_image_test.py
    @Author : panjq
    @E-mail : pan_jinquan@163.com
    @Date   : 2019-07-29 21:50:17
"""
import time
import cv2 as cv
import numpy as np
######################### Detection ##########################
# load the COCO class names
with open('object_detection_coco.txt', 'r') as f: class_names = f.read().split('\n')
# get a different color array for each of the classes
COLORS = np.random.uniform(0, 255, size=(len(class_names), 3))
# load the DNN modelimage
model = cv.dnn.readNet(model='frozen_inference_graph.pb', config='ssd_mobilenet_v2_coco.txt', framework='TensorFlow')

######################### openpose ##########################
BODY_PARTS = {"Nose": 0, "Neck": 1, "RShoulder": 2, "RElbow": 3, "RWrist": 4,
          "LShoulder": 5, "LElbow": 6, "LWrist": 7, "RHip": 8, "RKnee": 9,
          "RAnkle": 10, "LHip": 11, "LKnee": 12, "LAnkle": 13, "REye": 14,
          "LEye": 15, "REar": 16, "LEar": 17, "Background": 18}
POSE_PAIRS = [["Neck", "RShoulder"], ["Neck", "LShoulder"], ["RShoulder", "RElbow"],
          ["RElbow", "RWrist"], ["LShoulder", "LElbow"], ["LElbow", "LWrist"],
          ["Neck", "RHip"], ["RHip", "RKnee"], ["RKnee", "RAnkle"], ["Neck", "LHip"],
          ["LHip", "LKnee"], ["LKnee", "LAnkle"], ["Neck", "Nose"], ["Nose", "REye"],
          ["REye", "REar"], ["Nose", "LEye"], ["LEye", "LEar"]]
net = cv.dnn.readNetFromTensorflow("graph_opt.pb")

def Target_Detection(image):
    image_height, image_width, _ = image.shape
    # create blob from image
    blob = cv.dnn.blobFromImage(image=image, size=(300, 300), mean=(104, 117, 123), swapRB=True)
    model.setInput(blob)
    output = model.forward()
    # loop over each of the detections
    for detection in output[0, 0, :, :]:
        # extract the confidence of the detection
        confidence = detection[2]
        # draw bounding boxes only if the detection confidence is above...
        # ... a certain threshold, else skip
        if confidence > .4:
            # get the class id
            class_id = detection[1]
            # map the class id to the class
            class_name = class_names[int(class_id) - 1]
            color = COLORS[int(class_id)]
            # get the bounding box coordinates
            box_x = detection[3] * image_width
            box_y = detection[4] * image_height
            # get the bounding box width and height
            box_width = detection[5] * image_width
            box_height = detection[6] * image_height
            # draw a rectangle around each detected object
            cv.rectangle(image, (int(box_x), int(box_y)), (int(box_width), int(box_height)), color, thickness=2)
            # put the class name text on the detected object
            cv.putText(image, class_name, (int(box_x), int(box_y - 5)), cv.FONT_HERSHEY_SIMPLEX, 1, color, 2)
    return image


def openpose(frame):
    frameHeight, frameWidth = frame.shape[:2]
    net.setInput(cv.dnn.blobFromImage(frame, 1.0, (368, 368), (127.5, 127.5, 127.5), swapRB=True, crop=False))
    out = net.forward()
    out = out[:, :19, :, :]  # MobileNet output [1, 57, -1, -1], we only need the first 19 elements
    assert (len(BODY_PARTS) == out.shape[1])
    points = []
    for i in range(len(BODY_PARTS)):
        # Slice heatmap of corresponging body's part.
        heatMap = out[0, i, :, :]
        # Originally, we try to find all the local maximums. To simplify a sample
        # we just find a global one. However only a single pose at the same time
        # could be detected this way.
        _, conf, _, point = cv.minMaxLoc(heatMap)
        x = (frameWidth * point[0]) / out.shape[3]
        y = (frameHeight * point[1]) / out.shape[2]
        # Add a point if it's confidence is higher than threshold.
        points.append((int(x), int(y)) if conf > 0.2 else None)
    for pair in POSE_PAIRS:
        partFrom = pair[0]
        partTo = pair[1]
        assert (partFrom in BODY_PARTS)
        assert (partTo in BODY_PARTS)
        idFrom = BODY_PARTS[partFrom]
        idTo = BODY_PARTS[partTo]
        if points[idFrom] and points[idTo]:
            cv.line(frame, points[idFrom], points[idTo], (0, 255, 0), 3)
            cv.ellipse(frame, points[idFrom], (3, 3), 0, 0, 360, (0, 0, 255), cv.FILLED)
            cv.ellipse(frame, points[idTo], (3, 3), 0, 0, 360, (0, 0, 255), cv.FILLED)
    return frame


if __name__ == '__main__':
    capture = cv.VideoCapture(0)
    cv_edition = cv.__version__
    if cv_edition[0] == '3': capture.set(cv.CAP_PROP_FOURCC, cv.VideoWriter_fourcc(*'XVID'))
    else: capture.set(cv.CAP_PROP_FOURCC, cv.VideoWriter.fourcc('M', 'J', 'P', 'G'))
    capture.set(cv.CAP_PROP_FRAME_WIDTH, 640)
    capture.set(cv.CAP_PROP_FRAME_HEIGHT, 480)
    state=True
    while capture.isOpened():
        start = time.time()
        ret, frame = capture.read()
        action = cv.waitKey(10) & 0xFF
        if state==True: 
            frame = Target_Detection(frame)
            cv.putText(frame, "Detection", (240, 30), cv.FONT_HERSHEY_SIMPLEX, 0.9, (0, 0, 255), 1)
        else: 
            frame = openpose(frame)
            cv.putText(frame, "Openpose", (240, 30), cv.FONT_HERSHEY_SIMPLEX, 0.9, (0, 0, 255), 1)
        if action == ord('q') or action == ord('Q'): break
        if action == ord('f') or action == ord('F'): state = not state
        end = time.time()
        fps = 1 / (end - start)
        text = "FPS : " + str(int(fps))
        cv.putText(frame, text, (20, 30), cv.FONT_HERSHEY_SIMPLEX, 0.9, (100, 200, 200), 1)
        cv.imshow('frame', frame)
        
    capture.release()
    cv.destroyAllWindows()

四、源码解析

可识别的物体列表

[person, bicycle, car, motorcycle, airplane, bus, train, truck, boat, traffic light, fire hydrant, street sign, stop sign, parking meter, bench, bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe, hat, backpack, umbrella, shoe, eye glasses, handbag, tie, suitcase, frisbee, skis, snowboard, sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket, bottle, plate, wine glass, cup, fork, knife, spoon, bowl, banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, cake, chair, couch, potted plant, bed, mirror, dining table, window, desk, toilet, door, tv, laptop, mouse, remote, keyboard, cell phone, microwave, oven, toaster, sink, refrigerator, blender, book, clock, vase, scissors, teddy bear, hair drier, toothbrush]

加载类别【object_detection_coco.txt】，导入模型【frozen_inference_graph.pb】，指定深度学习框架【TensorFlow】。

# 加载COCO类名称
with open('object_detection_coco.txt', 'r') as f: class_names = f.read().split('\n')
# 对于不同目标显示不同颜色
COLORS = np.random.uniform(0, 255, size=(len(class_names), 3))
# 加载DNN图像模型
model = cv.dnn.readNet(model='frozen_inference_graph.pb', config='ssd_mobilenet_v2_coco.txt', framework='TensorFlow')

导入图片，提取了高度和宽度，计算了300x300的像素blob，把这个blob传入神经网络.

def Target_Detection(image):
    image_height, image_width, _ = image.shape
    # 从图像中创建blob
    blob = cv.dnn.blobFromImage(image=image, size=(300, 300), mean=(104, 117, 123), swapRB=True)
    model.setInput(blob)
    output = model.forward()
    # 遍历每个检测
    for detection in output[0, 0, :, :]:
        # 提取检测的置信度
        confidence = detection[2]
        # 仅在检测置信度高于某个阈值时，绘制边界框，否则跳过
        if confidence > .4:
            # 获取类的ID
            class_id = detection[1] 
            # 将类的id 映射到类
            class_name = class_names[int(class_id) - 1]
            color = COLORS[int(class_id)]
            # 获取边界框坐标
            box_x = detection[3] * image_width
            box_y = detection[4] * image_height
            # 获取边界框的宽度和高度
            box_width = detection[5] * image_width
            box_height = detection[6] * image_height
            # 在每个检测到的对象周围绘制一个矩形
            cv.rectangle(image, (int(box_x), int(box_y)), (int(box_width), int(box_height)), color, thickness=2)
            # 将类名文本写在检测到的对象上
            cv.putText(image, class_name, (int(box_x), int(box_y - 5)), cv.FONT_HERSHEY_SIMPLEX, 1, color, 2)
    return image