Step-by-Step Keras Implementation of YOLO v3 (Part 2)

1. Data Preparation
2. Reading Annotations from XML or JSON Files
3. Computing Anchor Box Sizes with K-Means
- Reading all annotation box sizes
- K-Means clustering
4. Code Download

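In the previous article we finished defining the YOLO v3 network, which gives us the forward pass. At this point, however, the network's parameters are still random, so its predictions are meaningless. The next step is to train it from scratch so that the network adjusts its parameters to achieve the desired results.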

First we define some constants. Later functions use them, and you will need to change them when training on your own dataset.

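# Imports used throughout this article
import os
import os.path as osp
import json
import random
from random import shuffle
import xml.etree.ElementTree as et

import numpy as np
import cv2 as cv
import matplotlib.pyplot as plt

# Model configuration
LONG_SIDE = 416        # length of the long side after resizing the input image
STRIDES = (8, 16, 32)  # downsampling factor of each feature map
CLUSTER_K = 9          # number of anchor box cluster centers
NEG_THRES = 0.4        # negative-sample threshold; adjust as you see fit
POS_THRES = 0.7        # positive-sample threshold, set to obtain more positive samples (explained later)

# Class list; any order works, but keep it fixed once training starts, since class ids are list indices
CATEGORIES = ("aeroplane", "bicycle", "bird", "boat", "bottle",
              "bus", "car", "cat", "chair", "cow",
              "diningtable", "dog", "horse", "motorbike", "person",
              "pottedplant", "sheep", "sofa", "train", "tvmonitor")

DATA_PATH = "data_set" # a relative path; an absolute path works just as well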
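A small sketch of how these constants fit together: with a 416-pixel input, each stride yields one output grid. It also assumes the usual YOLO v3 convention of splitting the 9 clustered anchors evenly across the three scales; this article has not stated that split yet, so treat it as an assumption.

```python
# Constants copied from the configuration above
LONG_SIDE = 416
STRIDES = (8, 16, 32)
CLUSTER_K = 9

# Side length of each output feature map: input size divided by its stride
grid_sizes = [LONG_SIDE // s for s in STRIDES]

# Assumption: anchors are split evenly across the three scales, as in standard YOLO v3
anchors_per_scale = CLUSTER_K // len(STRIDES)

print(grid_sizes)         # [52, 26, 13]
print(anchors_per_scale)  # 3
```

If you change LONG_SIDE, keeping it a multiple of 32 (the largest stride) keeps every one of these divisions exact.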

1. Data Preparation

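Usually the goal is to train on your own dataset, which means labeling it first. That is not the focus of this article; see Step-by-Step Keras Implementation of Faster R-CNN (Part 1), which explains how to annotate. Once labeling is done, put the images and label files in the same directory to make processing easier, say data_set.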

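If you use an already-annotated dataset, you don't need to label anything, but you do need to understand what the annotation files contain. VOC2007 annotations, for example, are XML. The format itself doesn't matter as long as you can read it; what we need from each annotation file are the box coordinates and classes of the objects. Step-by-Step Keras Implementation of Faster R-CNN (Part 2) also covers this, so I won't repeat it here. Again, put the images and their annotation files in one folder, say data_set, as shown below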

[Figure: contents of the data_set folder]
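Since the images and annotation files share one folder, the next step is to list the path of each image and its annotation file, then split them into training and validation (and optionally test) sets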

# Get image and annotation file paths
# data_set_path: directory containing the dataset
# split_rate: fractions of the files used for training, validation, and testing
#             None: no split, the full list is returned
#             a single float such as 0.8: 80% training, 20% validation, empty test set
#             a tuple or list with one element: same as the single-float case
#             a tuple or list with more elements: the first two are the training and validation
#             fractions, everything after them goes to the test set
# shuffle_enable: whether to shuffle the order
# Returns lists of (image path, annotation path, annotation type) for the training, validation, and test sets
def get_data_set(data_set_path, split_rate = (0.7, 0.2, 0.1), shuffle_enable = True):
    data_set = []
    files = os.listdir(data_set_path)
    
    for f in files:
        ext = osp.splitext(f)[1]
        if ext.lower() in (".jpg", ".png", ".bmp"):
            img_path = osp.join(data_set_path, f)
            base_path = osp.splitext(img_path)[0] # path without the extension
            
            ann_type = "" # annotation file type
            ann_path = base_path + ".json"
            
            if osp.exists(ann_path):
                ann_type = "json"
            else:
                ann_path = base_path + ".xml"
                if osp.exists(ann_path):
                    ann_type = "xml"
                
            if "" == ann_type: # skip images that have no annotation file
                continue
                
            data_set.append((img_path, ann_path, ann_type))
        
    if shuffle_enable:
        shuffle(data_set)
        
    if split_rate is None:
        return data_set

    total_num = len(data_set)

    if isinstance(split_rate, float) or 1 == len(split_rate):
        if isinstance(split_rate, float):
            split_rate = [split_rate]
        train_pos = int(total_num * split_rate[0])
        train_set = data_set[: train_pos]
        valid_set = data_set[train_pos: ]

        # Return an empty test set so callers can always unpack three lists
        return train_set, valid_set, []

    elif isinstance(split_rate, (tuple, list)):
        list_len = len(split_rate)
        assert(list_len > 1)

        train_pos = int(total_num * split_rate[0])
        valid_pos = int(total_num * (split_rate[0] + split_rate[1]))

        train_set = data_set[0: train_pos]
        valid_set = data_set[train_pos: valid_pos]
        test_set = data_set[valid_pos: ]

        return train_set, valid_set, test_set

The function above distinguishes between annotation file types: VOC2007 uses XML, while annotations made with Labelme are JSON. Let's test it

# Get the file lists
train_set, valid_set, test_set = get_data_set(DATA_PATH, split_rate = (0.8, 0.1, 0.1))

print("Total number:", len(train_set) + len(valid_set) + len(test_set),
      " Train number:", len(train_set),
      " Valid number:", len(valid_set),
      " Test number:", len(test_set))

# Print the first element
print("First element:", train_set[0])

The output is:

Total number: 5010  Train number: 4008  Valid number: 501  Test number: 501
First element: ('data_set\\003885.jpg', 'data_set\\003885.xml', 'xml')
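These counts follow directly from the index arithmetic in get_data_set. The sketch below reproduces only that arithmetic on a plain list of 5010 items (the total reported above); split_positions is a hypothetical helper name, not part of the original code.

```python
def split_positions(total_num, split_rate):
    # Same cut points as get_data_set; int() truncates toward zero
    train_pos = int(total_num * split_rate[0])
    valid_pos = int(total_num * (split_rate[0] + split_rate[1]))
    return train_pos, valid_pos

items = list(range(5010))
train_pos, valid_pos = split_positions(len(items), (0.8, 0.1, 0.1))

# Three consecutive slices: every item lands in exactly one set
train_set = items[:train_pos]
valid_set = items[train_pos:valid_pos]
test_set = items[valid_pos:]
print(len(train_set), len(valid_set), len(test_set))  # 4008 501 501
```

Because the three slices are consecutive, no file is lost or duplicated; whatever int() truncates simply ends up in the last slice.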

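Since YOLO's performance has already been validated, we don't really need a test set here; that leaves more images for training, which should give a slightly better model. The split can then be done like this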

# Get the file lists: 95% for training, 5% for validation
train_set, valid_set, test_set = get_data_set(DATA_PATH, split_rate = 0.95)

This keeps only a token amount of data for validation.

2. Reading Annotations from XML or JSON Files

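As mentioned earlier, what we need are the coordinates and classes of the annotated boxes, so we only read that information from the annotation files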
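To make the two layouts concrete, the sketch below parses one annotation in each format: a minimal Labelme-style JSON string (using the shapes, label, points, imageHeight and imageWidth keys that get_ground_truth reads) and a minimal VOC-style XML string. Both strings are made up for illustration; real files contain more fields.

```python
import json
import xml.etree.ElementTree as et

# Made-up Labelme-style annotation: a rectangle stored as two corner points,
# here dragged from bottom-right to top-left
labelme_text = '''{
  "imageHeight": 334, "imageWidth": 500,
  "shapes": [{"label": "aeroplane", "points": [[91.0, 113.0], [28.0, 44.0]],
              "shape_type": "rectangle"}]
}'''

# Made-up VOC-style annotation describing the same box
voc_text = '''<annotation>
  <size><width>500</width><height>334</height><depth>3</depth></size>
  <object><name>aeroplane</name>
    <bndbox><xmin>28</xmin><ymin>44</ymin><xmax>91</xmax><ymax>113</ymax></bndbox>
  </object>
</annotation>'''

# JSON: the two points may come in either order, so sort each axis
shape = json.loads(labelme_text)["shapes"][0]
(xa, ya), (xb, yb) = shape["points"]
json_box = [round(min(xa, xb)), round(min(ya, yb)), round(max(xa, xb)), round(max(ya, yb))]

# XML: read the four bndbox fields directly
obj = et.fromstring(voc_text).find("object")
bb = obj.find("bndbox")
xml_box = [int(bb.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")]

print(shape["label"], json_box)        # aeroplane [28, 44, 91, 113]
print(obj.find("name").text, xml_box)  # aeroplane [28, 44, 91, 113]
```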

# Read ground truth from an xml or json file
# label_path: path of the annotation file (the second item of an element returned by get_data_set)
# file_type: annotation file type, "json" or "xml"
# categories: class list
# Returns a list of [box coordinates, class id]
def get_ground_truth(label_path, file_type, categories):
    ground_truth = []
    with open(label_path, 'r', encoding = "utf-8") as f:
        if "json" == file_type:
            js_dict = json.load(f)
            shapes = js_dict["shapes"] # all annotated shapes

            for shape in shapes:
                if shape["label"] in categories:
                    pts = shape["points"]
                    x1 = round(pts[0][0])
                    x2 = round(pts[1][0])
                    y1 = round(pts[0][1])
                    y2 = round(pts[1][1])

                    # Some annotators drag from the bottom-right corner to the top-left, so swap if needed
                    if x1 > x2:
                        x1, x2 = x2, x1
                    if y1 > y2:
                        y1, y2 = y2, y1
                        
                    bnd_box = [x1, y1, x2, y2]
                    cls_id = categories.index(shape["label"])

                    # Keep bnd_box and cls_id together; they may be useful later
                    ground_truth.append([bnd_box, cls_id])
        elif "xml" == file_type:
            tree = et.parse(f)
            root = tree.getroot()
            for obj in root.iter("object"):
                cls_name = obj.find("name").text
                if cls_name not in categories: # skip classes that are not in the list
                    continue
                cls_id = categories.index(cls_name) # class id

                bnd_box = obj.find("bndbox")
                bnd_box = [int(bnd_box.find("xmin").text),
                           int(bnd_box.find("ymin").text),
                           int(bnd_box.find("xmax").text),
                           int(bnd_box.find("ymax").text)]

                # Keep bnd_box and cls_id together; they may be useful later
                ground_truth.append([bnd_box, cls_id])
            
    return ground_truth

The returned data contains, for each box, the top-left and bottom-right corners [x1, y1, x2, y2] together with the class index. Next, let's test get_ground_truth

# Test get_ground_truth
test_idx = random.randrange(len(train_set)) # index of the test image (randrange excludes the upper bound)
label_data = train_set[test_idx] # train_set was defined above
gts = get_ground_truth(label_data[1], label_data[2], CATEGORIES)

image = cv.imread(label_data[0])
img_copy = image.copy()
print(img_copy.shape)

for gt in gts:
    print(gt, "class:", CATEGORIES[gt[1]])
    cv.rectangle(img_copy, (gt[0][0], gt[0][1]), (gt[0][2], gt[0][3]),
                 (0, random.randint(128, 255), 0), 2)
    
plt.figure("label_box", figsize = (6, 3))
plt.imshow(img_copy[..., : : -1]) # OpenCV loads BGR; reverse the channels to RGB for matplotlib
plt.show()

(334, 500, 3)
[[28, 44, 91, 113], 1] class: aeroplane
[[47, 151, 111, 212], 1] class: aeroplane
[[65, 239, 127, 299], 1] class: aeroplane
[[189, 143, 255, 205], 1] class: aeroplane
[[164, 29, 228, 96], 1] class: aeroplane
[[397, 15, 462, 83], 1] class: aeroplane

[Figure: show_ground_truth]

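This looks right, but the network input is $416 \times 416$, so the image has to be resized and the annotation boxes scaled to match. My approach preserves the aspect ratio: resize the image so its long side becomes $416$, then pad the short side equally on both ends, so objects are not distorted by the resize. The revised get_ground_truth below additionally scales and offsets the coordinates, and returns the image scale factor and padding size for later functions to use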

# Read ground truth from an xml or json file and map it onto the letterboxed input image
# label_path: path of the annotation file (the second item of an element returned by get_data_set)
# file_type: annotation file type, "json" or "xml"
# categories: class list
# Returns the scale factor, the padding size, and a list of [box coordinates, class id]
def get_ground_truth(label_path, file_type, categories):
    ground_truth = []
    scale = 1.0  # scale factor
    pad_size = 0 # padding added at each end of the short side
    with open(label_path, 'r', encoding = "utf-8") as f:
        if "json" == file_type:
            js_dict = json.load(f)
            shapes = js_dict["shapes"] # all annotated shapes
            
            # New: read the image size to compute the scale factor and padding
            image_rows = js_dict["imageHeight"]
            image_cols = js_dict["imageWidth"]
            
            if image_rows < image_cols: # wide image: fit the width, pad top and bottom
                scale = LONG_SIDE / image_cols
                pad_size = (LONG_SIDE - image_rows * scale) / 2
            else:                       # tall or square image: fit the height, pad left and right
                scale = LONG_SIDE / image_rows
                pad_size = (LONG_SIDE - image_cols * scale) / 2
                
            for shape in shapes:
                if shape["label"] in categories:
                    pts = shape["points"]
                    x1 = round(pts[0][0])
                    x2 = round(pts[1][0])
                    y1 = round(pts[0][1])
                    y2 = round(pts[1][1])

                    # Some annotators drag from the bottom-right corner to the top-left, so swap if needed
                    if x1 > x2:
                        x1, x2 = x2, x1
                    if y1 > y2:
                        y1, y2 = y2, y1
                    
                    bnd_box = [x1, y1, x2, y2]
                    
                    # Scale the box, then shift it by the padding along the short side
                    if image_rows < image_cols:
                        bnd_box[0] = round(bnd_box[0] * scale)
                        bnd_box[2] = round(bnd_box[2] * scale)
                        bnd_box[1] = round(bnd_box[1] * scale + pad_size)
                        bnd_box[3] = round(bnd_box[3] * scale + pad_size)
                    else:
                        bnd_box[0] = round(bnd_box[0] * scale + pad_size)
                        bnd_box[2] = round(bnd_box[2] * scale + pad_size)
                        bnd_box[1] = round(bnd_box[1] * scale)
                        bnd_box[3] = round(bnd_box[3] * scale)
                        
                    cls_id = categories.index(shape["label"])

                    # Keep bnd_box and cls_id together; they may be useful later
                    ground_truth.append([bnd_box, cls_id])
        elif "xml" == file_type:
            tree = et.parse(f)
            root = tree.getroot()
            
            # New: read the image size to compute the scale factor and padding
            image_shape = root.find("size")
            image_rows = int(image_shape.find("height").text)
            image_cols = int(image_shape.find("width").text)
            
            if image_rows < image_cols: # wide image: fit the width, pad top and bottom
                scale = LONG_SIDE / image_cols
                pad_size = (LONG_SIDE - image_rows * scale) / 2
            else:                       # tall or square image: fit the height, pad left and right
                scale = LONG_SIDE / image_rows
                pad_size = (LONG_SIDE - image_cols * scale) / 2
            
            for obj in root.iter("object"):
                cls_name = obj.find("name").text
                if cls_name not in categories: # skip classes that are not in the list
                    continue
                cls_id = categories.index(cls_name) # class id

                bnd_box = obj.find("bndbox")
                bnd_box = [int(bnd_box.find("xmin").text),
                           int(bnd_box.find("ymin").text),
                           int(bnd_box.find("xmax").text),
                           int(bnd_box.find("ymax").text)]

                # Scale the box, then shift it by the padding along the short side
                if image_rows < image_cols:
                    bnd_box[0] = round(bnd_box[0] * scale)
                    bnd_box[2] = round(bnd_box[2] * scale)
                    bnd_box[1] = round(bnd_box[1] * scale + pad_size)
                    bnd_box[3] = round(bnd_box[3] * scale + pad_size)
                else:
                    bnd_box[0] = round(bnd_box[0] * scale + pad_size)
                    bnd_box[2] = round(bnd_box[2] * scale + pad_size)
                    bnd_box[1] = round(bnd_box[1] * scale)
                    bnd_box[3] = round(bnd_box[3] * scale)
                        
                # Keep bnd_box and cls_id together; they may be useful later
                ground_truth.append([bnd_box, cls_id])
            
    return scale, pad_size, ground_truth

The test code also applies the same resizing and padding to the image

# Test get_ground_truth
test_idx = random.randrange(len(train_set)) # index of the test image
label_data = train_set[test_idx] # train_set was defined above
# The function now also returns the scale factor and padding size
scale, pad_size, gts = get_ground_truth(label_data[1], label_data[2], CATEGORIES)

image = cv.imread(label_data[0])
img_copy = cv.resize(image, (round(image.shape[1] * scale), round(image.shape[0] * scale)),
                     interpolation = cv.INTER_LINEAR)

# Pad the short side; the second border absorbs any rounding remainder so the result is exactly LONG_SIDE
pad_front = round(pad_size)
if img_copy.shape[0] < img_copy.shape[1]:
    pad_back = LONG_SIDE - img_copy.shape[0] - pad_front
    img_copy = cv.copyMakeBorder(img_copy, pad_front, pad_back, 0, 0, cv.BORDER_CONSTANT, value = (0, 0, 0))
else:
    pad_back = LONG_SIDE - img_copy.shape[1] - pad_front
    img_copy = cv.copyMakeBorder(img_copy, 0, 0, pad_front, pad_back, cv.BORDER_CONSTANT, value = (0, 0, 0))

print(img_copy.shape)

for gt in gts:
    print(gt, "class:", CATEGORIES[gt[1]])
    cv.rectangle(img_copy, (gt[0][0], gt[0][1]), (gt[0][2], gt[0][3]),
                 (0, random.randint(128, 255), 0), 2)
    
plt.figure("label_box", figsize = (6, 3))
plt.imshow(img_copy[..., : : -1]) # BGR to RGB for matplotlib
plt.show()

The result is shown below; the image is now $416 \times 416$

(416, 416, 3)
[[23, 106, 76, 163], 0] class: aeroplane
[[39, 195, 92, 245], 0] class: aeroplane
[[54, 268, 106, 318], 0] class: aeroplane
[[157, 188, 212, 240], 0] class: aeroplane
[[136, 93, 190, 149], 0] class: aeroplane
[[330, 82, 384, 138], 0] class: aeroplane

[Figure: resized_ground_truth]
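As a hand check of the scaling and padding: for the $500 \times 334$ image printed earlier, the box [28, 44, 91, 113] should land at [23, 106, 76, 163] after letterboxing, which matches the first line of the resized output. The sketch below redoes that arithmetic with a hypothetical helper letterbox_box (same formulas as get_ground_truth), and adds an equally hypothetical inverse, unletterbox_box. The inverse illustrates one reason to return scale and pad_size: boxes predicted on the $416 \times 416$ image can be mapped back to the original image.

```python
LONG_SIDE = 416

def letterbox_box(box, image_rows, image_cols, long_side = LONG_SIDE):
    # Hypothetical helper: same scale/pad formulas as get_ground_truth
    if image_rows < image_cols:  # wide image: pad top and bottom
        scale = long_side / image_cols
        pad_size = (long_side - image_rows * scale) / 2
        x_off, y_off = 0, pad_size
    else:                        # tall or square image: pad left and right
        scale = long_side / image_rows
        pad_size = (long_side - image_cols * scale) / 2
        x_off, y_off = pad_size, 0
    x1, y1, x2, y2 = box
    new_box = [round(x1 * scale + x_off), round(y1 * scale + y_off),
               round(x2 * scale + x_off), round(y2 * scale + y_off)]
    return scale, pad_size, new_box

def unletterbox_box(box, scale, x_off, y_off):
    # Hypothetical inverse: remove the padding offset, then undo the scaling
    x1, y1, x2, y2 = box
    return [(x1 - x_off) / scale, (y1 - y_off) / scale,
            (x2 - x_off) / scale, (y2 - y_off) / scale]

scale, pad_size, new_box = letterbox_box([28, 44, 91, 113], image_rows = 334, image_cols = 500)
print(round(scale, 3), round(pad_size, 3), new_box)  # 0.832 69.056 [23, 106, 76, 163]

# Mapping back recovers the original box to within 0.5 / scale pixels (the rounding error)
back = unletterbox_box(new_box, scale, 0, pad_size)
print([round(v, 1) for v in back])
```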

3. Computing Anchor Box Sizes with K-Means

Now that we can read every object box from the label files, we can use those coordinates to compute the $k$ anchor box sizes we want. Here $k = 9$, so there are $9$ clusters

Reading all annotation box sizes

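Clustering uses the distance $1 - \mathrm{IoU}$, but the annotation boxes sit at arbitrary positions in their images. Anchor boxes only describe size and shape, so position should not affect the distance; to get a common reference, we move the top-left corner of every box to the same point. The simplest choice is $(0, 0)$, so each box reduces to its $(w, h)$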
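Written out: with a box $b = (w_b, h_b)$ and a cluster center $c = (w_c, h_c)$ both anchored at $(0, 0)$, the intersection is the smaller width times the smaller height, so the clustering distance is

$$
\mathrm{IoU}(b, c) = \frac{\min(w_b, w_c)\,\min(h_b, h_c)}{w_b h_b + w_c h_c - \min(w_b, w_c)\,\min(h_b, h_c)}, \qquad d(b, c) = 1 - \mathrm{IoU}(b, c)
$$

For example, a $10 \times 20$ box and a $20 \times 10$ center overlap in a $10 \times 10$ square, so $\mathrm{IoU} = 100 / (200 + 200 - 100) = 1/3$ and $d = 2/3$. Because IoU is a ratio, this distance does not let large boxes dominate the way a Euclidean distance on raw $(w, h)$ values would.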

# Read all annotation boxes
all_boxes = []

for s in (train_set, valid_set, test_set):
    for each in s:
        _, __, gts = get_ground_truth(each[1], each[2], CATEGORIES)
        for box, _ in gts:
            # Keep only (w, h), as if every top-left corner were at (0, 0)
            all_boxes.append((box[2] - box[0], box[3] - box[1]))

print("box_num:", len(all_boxes))
print(all_boxes[: 4])

box_num: 15658
[(254, 228), (306, 383), (56, 45), (59, 110)]

K-Means clustering

Since distances are computed from $\mathrm{IoU}$, we first define a function that computes it

# Compute IoU for clustering
# box: a single ground-truth box (w, h)
# clusters: cluster centers, an array of (w, h) with shape (k, 2)
# Returns the IoU between the box and every cluster center
def cluster_iou(box, clusters):
    # Intersection: both boxes share the top-left corner (0, 0)
    x = np.minimum(box[0], clusters[:, 0])
    y = np.minimum(box[1], clusters[:, 1])
    intersection = x * y
    
    # Union
    area_box = box[0] * box[1]
    area_cluster = clusters[:, 0] * clusters[:, 1]
    union = area_box + area_cluster - intersection
    
    return intersection / union
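cluster_iou handles one box at a time, so kmeans_anchor calls it once per box in a Python loop. As an optional alternative (not the author's code), NumPy broadcasting can compute the whole $n \times k$ IoU matrix in one call; the sketch below checks it against the per-box version. batch_cluster_iou is a hypothetical name.

```python
import numpy as np

def cluster_iou(box, clusters):
    # Per-box version, equivalent to the function above
    inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
    union = box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter
    return inter / union

def batch_cluster_iou(boxes, clusters):
    # boxes: (n, 2), clusters: (k, 2) -> IoU matrix of shape (n, k)
    w = np.minimum(boxes[:, None, 0], clusters[None, :, 0])
    h = np.minimum(boxes[:, None, 1], clusters[None, :, 1])
    inter = w * h
    area_boxes = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_clusters = (clusters[:, 0] * clusters[:, 1])[None, :]
    return inter / (area_boxes + area_clusters - inter)

boxes = np.array([[10, 20], [20, 10], [5, 5], [30, 40]])
clusters = np.array([[10, 20], [25, 35]])

iou = batch_cluster_iou(boxes, clusters)
loop_iou = np.stack([cluster_iou(b, clusters) for b in boxes])
print(iou.shape)                   # (4, 2)
print(np.allclose(iou, loop_iou))  # True
```

With this helper, the distance loop in kmeans_anchor could become `distances = 1 - batch_cluster_iou(boxes, clusters)`, at the cost of holding the full $n \times k$ matrix in memory, which is small for $k = 9$.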

Now we can define a function that clusters the anchor boxes

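# Determine anchor boxes with k-means clustering and the 1 - IoU distance
# boxes: annotation boxes (w, h), a NumPy array of shape (n, 2)
# k: number of clusters
# Returns the cluster centers
def kmeans_anchor(boxes, k):
    n = boxes.shape[0]
    distances = np.empty((n, k))
    last_clusters = np.full((n,), -1) # -1 so the first assignment is never mistaken for convergence
    
    # Randomly pick k boxes as the initial cluster centers
    # (clusters inherits the integer dtype of boxes, so medians below are truncated to integers)
    np.random.seed(0)
    clusters = boxes[np.random.choice(n, k, replace = False)]

    while True:
        for i, box in enumerate(boxes):
            distances[i] = 1 - cluster_iou(box, clusters)
        
        nearest_clusters = np.argmin(distances, axis = 1)
        
        # Stop once no box changes its cluster
        if (last_clusters == nearest_clusters).all():
            break
        
        # Update each cluster center with the median of its members
        for cluster in range(k):
            members = boxes[nearest_clusters == cluster]
            if len(members) > 0: # keep the old center if a cluster ends up empty
                clusters[cluster] = np.median(members, axis = 0)
        
        last_clusters = nearest_clusters

    return clusters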

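In the function above, the cluster centers are updated with the median (np.median). The mean (np.mean) would also work, but it is more sensitive to outlying boxes. Next, call the function to get the $k$ anchor box sizes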

# Cluster k anchor box sizes
cluster_anchors = kmeans_anchor(np.array(all_boxes), CLUSTER_K)
# Sort the anchors by area, smallest first
areas = cluster_anchors[:, 0] * cluster_anchors[:, 1]
sorted_indices = np.argsort(areas)
cluster_anchors = cluster_anchors[sorted_indices]
print(cluster_anchors)

[[ 16  22]
 [ 26  58]
 [ 48  34]
 [ 53  87]
 [118  85]
 [ 85 160]
 [245 134]
 [156 228]
 [310 263]]

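These $9$ sizes are the clustered anchor sizes in the resized $416 \times 416$ image. Your results may differ from mine because k-means depends on its initial centers (the dataset split is shuffled, so the boxes picked as initial centers change from run to run even with a fixed seed), but they should be close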
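A common way to judge a set of anchors (used in the YOLO v2 paper) is the average, over all annotation boxes, of the IoU with the best-matching anchor: the higher it is, the better the anchors cover the data. The sketch below computes that with the same origin-aligned IoU, using the anchor sizes printed above and a few made-up boxes. It also shows one assumed way to hand the area-sorted anchors to the three strides, smallest anchors to stride 8 as in standard YOLO v3; this article has not yet said how its anchors are assigned, so treat that grouping as an assumption.

```python
import numpy as np

def avg_best_iou(boxes, anchors):
    # IoU of every (w, h) box against every (w, h) anchor, both aligned at (0, 0)
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    # Best anchor per box, then average over all boxes
    return float(np.mean(np.max(inter / union, axis = 1)))

# Anchor sizes printed above, already sorted by area
anchors = np.array([[16, 22], [26, 58], [48, 34], [53, 87], [118, 85],
                    [85, 160], [245, 134], [156, 228], [310, 263]])

# Made-up boxes: the first three equal an anchor exactly, the last one does not
boxes = np.array([[16, 22], [118, 85], [310, 263], [100, 100]])
print(round(avg_best_iou(boxes, anchors), 3))  # 0.934

# Assumed grouping: 3 anchors per stride, smallest areas on the finest grid
STRIDES = (8, 16, 32)
for stride, group in zip(STRIDES, anchors.reshape(len(STRIDES), -1, 2)):
    print(stride, group.tolist())
```

In practice you would pass np.array(all_boxes) and cluster_anchors instead of the made-up arrays.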

4. Code Download

The example code can be downloaded as a Jupyter Notebook

Previous: Step-by-Step Keras Implementation of YOLO v3 (Part 1)
Next: Step-by-Step Keras Implementation of YOLO v3 (Part 3)