MapTR: Structured Modeling and Learning for Online Vectorized HD Map Construction (2208)

ABSTRACT

High-definition (HD) maps provide abundant and precise environmental information about the driving scene and serve as a fundamental and indispensable component for planning in autonomous driving systems. We present MapTR, a structured end-to-end Transformer for efficient online vectorized HD map construction. We propose a unified permutation-equivalent modeling approach, i.e., modeling each map element as a point set with a group of equivalent permutations, which accurately describes the shape of the map element and stabilizes the learning process. We design a hierarchical query embedding scheme to flexibly encode structured map information and perform hierarchical bipartite matching for map element learning. MapTR achieves the best performance and efficiency with only camera input among existing vectorized map construction approaches on the nuScenes dataset. In particular, MapTR-nano runs at real-time inference speed (25.1 FPS) on an RTX 3090, 8× faster than the existing state-of-the-art camera-based method while achieving 5.0 higher mAP. Even compared with the existing state-of-the-art multi-modality method, MapTR-nano achieves 0.7 higher mAP, and MapTR-tiny achieves 13.5 higher mAP and 3× faster inference speed. Abundant qualitative results show that MapTR maintains stable and robust map construction quality in complex and varied driving scenes. MapTR is of great application value in autonomous driving. Code and more demos are available at https://github.com/hustvl/MapTR.

1 INTRODUCTION

A high-definition (HD) map is a high-precision map specifically designed for autonomous driving, composed of instance-level vectorized representations of map elements (pedestrian crossings, lane dividers, road boundaries, etc.). An HD map contains rich semantic information about road topology and traffic rules, which is vital for the navigation of self-driving vehicles.
Conventionally, HD maps are constructed offline with SLAM-based methods (Zhang & Singh, 2014; Shan & Englot, 2018; Shan et al., 2020), which incurs a complicated pipeline and high maintenance cost. Recently, online HD map construction has attracted ever-increasing interest: it constructs the map around the ego-vehicle at runtime with vehicle-mounted sensors, removing the need for offline human effort.
Early works (Chen et al., 2022a; Liu et al., 2021a; Can et al., 2021) leverage line-shape priors to perceive open-shape lanes based on the front-view image. They are restricted to single-view perception and cannot cope with other map elements of arbitrary shape. With the development of bird's eye view (BEV) representation learning, recent works (Chen et al., 2022b; Zhou & Krähenbühl, 2022; Hu et al., 2021; Li et al., 2022c) predict a rasterized map by performing BEV semantic segmentation. However, the rasterized map lacks vectorized instance-level information, such as the lane structure, which is important for downstream tasks (e.g., motion prediction and planning). To construct a vectorized HD map, HDMapNet (Li et al., 2022a) groups pixel-wise segmentation results, which requires complicated and time-consuming post-processing. VectorMapNet (Liu et al., 2022a) represents each map element as a point sequence. It adopts a cascaded coarse-to-fine framework and uses an auto-regressive decoder to predict points sequentially, leading to long inference time.
Current online vectorized HD map construction methods are limited by their efficiency and are not applicable in real-time scenarios. Recently, DETR (Carion et al., 2020) employed a simple and efficient encoder-decoder Transformer architecture and realized end-to-end object detection.
It is natural to ask: can we design a DETR-like paradigm for efficient end-to-end vectorized HD map construction? We show that the answer is affirmative with our proposed Map TRansformer (MapTR).
Unlike object detection, in which objects can easily be abstracted geometrically as bounding boxes, vectorized map elements have more dynamic shapes. To accurately describe map elements, we propose a novel unified modeling method. We model each map element as a point set with a group of equivalent permutations. The point set determines the position of the map element. The permutation group includes all the possible organization sequences of the point set that correspond to the same geometrical shape, avoiding the ambiguity of shape.
Based on the permutation-equivalent modeling, we design a structured framework that takes images from vehicle-mounted cameras as input and outputs a vectorized HD map. We streamline online vectorized HD map construction as a parallel regression problem. Hierarchical query embeddings are proposed to flexibly encode instance-level and point-level information. All instances and all points of each instance are predicted simultaneously with a unified Transformer structure. The training pipeline is formulated as a hierarchical set prediction task, where we perform hierarchical bipartite matching to assign instances and points in turn. We supervise the geometrical shape at both the point and edge levels with the proposed point2point loss and edge direction loss.
With all the proposed designs, we present MapTR, an efficient end-to-end online vectorized HD map construction method with unified modeling and architecture. MapTR achieves the best performance and efficiency among existing vectorized map construction approaches on the nuScenes (Caesar et al., 2020) dataset. In particular, MapTR-nano runs at real-time inference speed (25.1 FPS) on an RTX 3090, 8× faster than the existing state-of-the-art camera-based method while achieving 5.0 higher mAP. Even compared with the existing state-of-the-art multi-modality method, MapTR-nano achieves 0.7 higher mAP and 8× faster inference speed, and MapTR-tiny achieves 13.5 higher mAP and 3× faster inference speed. As the visualization shows (Fig. 1), MapTR maintains stable and robust map construction quality in complex and varied driving scenes.
Our contributions can be summarized as follows:
• We propose a unified permutation-equivalent modeling approach for map elements, i.e., modeling each map element as a point set with a group of equivalent permutations, which accurately describes the shape of the map element and stabilizes the learning process.
• Based on the novel modeling, we present MapTR, a structured end-to-end framework for efficient online vectorized HD map construction. We design a hierarchical query embedding scheme to flexibly encode instance-level and point-level information, perform hierarchical bipartite matching for map element learning, and supervise the geometrical shape at both the point and edge levels with the proposed point2point loss and edge direction loss.
• MapTR is the first real-time and SOTA vectorized HD map construction approach with stable and robust performance in complex and varied driving scenes.

2 RELATED WORK

HD Map Construction. Recently, with the development of 2D-to-BEV methods (Ma et al., 2022), HD map construction has been formulated as a segmentation problem based on surround-view image data captured by vehicle-mounted cameras. Chen et al. (2022b); Zhou & Krähenbühl (2022); Hu et al. (2021); Li et al. (2022c); Philion & Fidler (2020); Liu et al. (2022b) generate a rasterized map by performing BEV semantic segmentation. To build a vectorized HD map, HDMapNet (Li et al., 2022a) groups pixel-wise semantic segmentation results with heuristic and time-consuming post-processing to generate instances. VectorMapNet (Liu et al., 2022a) serves as the first end-to-end framework; it adopts a two-stage coarse-to-fine framework and uses an auto-regressive decoder to predict points sequentially, leading to long inference time and ambiguity about permutation. Unlike VectorMapNet, MapTR introduces a novel and unified modeling for map elements, resolving the ambiguity and stabilizing the learning process. MapTR also builds a structured and parallel one-stage framework with much higher efficiency.
Lane Detection. Lane detection can be viewed as a sub-task of HD map construction that focuses on detecting lane elements in road scenes. Since most lane detection datasets only provide single-view annotations and focus on open-shape elements, related methods are restricted to a single view. LaneATT (Tabelini et al., 2021) uses an anchor-based deep lane detection model to achieve a good trade-off between accuracy and efficiency. LSTR (Liu et al., 2021a) adopts the Transformer architecture to directly output the parameters of a lane shape model. GANet (Wang et al., 2022) formulates lane detection as a keypoint estimation and association problem and takes a bottom-up design. Feng et al. (2022) propose a parametric Bezier curve-based method for lane detection. Instead of detecting lanes in 2D image coordinates, Garnett et al. (2019) propose 3D-LaneNet, which performs 3D lane detection in BEV. STSU (Can et al., 2021) represents lanes as a directed graph in BEV coordinates and adopts a curve-based Bezier method to predict lanes from a monocular camera image. Persformer (Chen et al., 2022a) provides a better BEV feature representation and optimizes the anchor design to unify 2D and 3D lane detection. Instead of only detecting lanes in a limited single view, MapTR can perceive various kinds of map elements over a 360° horizontal FOV with a unified modeling and learning framework.
Contour-based Instance Segmentation. Another line of work related to MapTR is contour-based 2D instance segmentation (Zhu et al., 2022; Xie et al., 2020; Xu et al., 2019; Liu et al., 2021c). These methods reformulate 2D instance segmentation as an object contour prediction task and estimate the image coordinates of the contour vertices. CurveGCN (Ling et al., 2019) uses Graph Convolution Networks to predict polygonal boundaries. Lazarow et al. (2022); Liang et al. (2020); Li et al. (2021); Peng et al. (2020) rely on intermediate representations and adopt a two-stage paradigm, i.e., the first stage performs segmentation / detection to generate vertices and the second stage converts vertices to polygons. These works model contours of 2D instance masks as polygons. Their modeling methods cannot cope with line-shape map elements and are not applicable to map construction. In contrast, MapTR is tailored for HD map construction and models various kinds of map elements in a unified manner. In addition, MapTR does not rely on intermediate representations and has an efficient and compact pipeline.

3 MAPTR

3.1 PERMUTATION-EQUIVALENT MODELING

MapTR aims at modeling and learning the HD map in a unified manner. An HD map is a collection of vectorized static map elements, including pedestrian crossings, lane dividers, road boundaries, etc. For structured modeling, MapTR geometrically abstracts map elements as closed shapes (like pedestrian crossings) and open shapes (like lane dividers). By sampling points sequentially along the shape boundary, a closed-shape element is discretized into a polygon, while an open-shape element is discretized into a polyline.
Preliminarily, both polygon and polyline can be represented as an ordered point set $V^F = [v_0, v_1, \ldots, v_{N_v-1}]$ (see Fig. 3 (Vanilla)), where $N_v$ denotes the number of points. However, the permutation of the point set is not explicitly defined and not unique. There exist many equivalent permutations for polygon and polyline. For example, as illustrated in Fig. 2 (a), for the lane divider (polyline) between two opposite lanes, defining its direction is difficult. Both endpoints of the lane divider can be regarded as the start point, and the point set can be organized in two directions. In Fig. 2 (b), for the pedestrian crossing (polygon), the point set can be organized in two opposite directions (counter-clockwise and clockwise), and circularly shifting the permutation of the point set has no influence on the geometrical shape of the polygon. Imposing a fixed permutation on the point set as supervision is therefore not rational: the imposed permutation contradicts the other equivalent permutations and hampers the learning process.
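Figure 2. Typical cases illustrating the ambiguity of map elements with respect to start point and direction.
(a) Polyline: for the lane divider between two opposite lanes, defining its direction is difficult. Both endpoints of the lane divider can be regarded as the start point, and the point set can be organized in two directions.
(b) Polygon: for the pedestrian crossing, each point of the polygon can be regarded as the start point, and the polygon can be connected in two opposite directions (counter-clockwise and clockwise).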

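Figure 3. Illustration of permutation-equivalent modeling in MapTR. Map elements are geometrically abstracted and discretized into polylines and polygons. MapTR models each map element with $(V, \Gamma)$ (a point set $V$ and a group of equivalent permutations $\Gamma$), avoiding the ambiguity and stabilizing the learning process.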
To bridge this gap, MapTR models each map element with $\mathcal{V} = (V, \Gamma)$. $V = \{v_j\}_{j=0}^{N_v-1}$ denotes the point set of the map element ($N_v$ is the number of points). $\Gamma = \{\gamma^k\}$ denotes a group of equivalent permutations of the point set $V$, covering all the possible organization sequences.
Specifically, for a polyline element (see Fig. 3 (left)), $\Gamma$ includes 2 kinds of equivalent permutations:
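
The equation at this point is an image in the source and was not preserved. A reconstruction consistent with the description (the forward order and its reverse), offered as an assumption rather than a copy of the original, writes $\gamma^k(j)$ for the point index that permutation $\gamma^k$ places at position $j$:

$$
\Gamma_{\text{polyline}} = \{\gamma^0, \gamma^1\}, \qquad
\gamma^0(j) = j \bmod N_v, \qquad
\gamma^1(j) = (N_v - 1) - (j \bmod N_v).
$$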
For a polygon element (see Fig. 3 (right)), $\Gamma$ includes $2 \times N_v$ kinds of equivalent permutations:
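
The polygon equation is likewise missing from the source. One form that yields exactly $2 \times N_v$ permutations, assuming each of the $N_v$ possible start points is combined with the two traversal directions, is:

$$
\Gamma_{\text{polygon}} = \{\gamma^0, \ldots, \gamma^{2N_v-1}\}, \qquad
\gamma^{2s}(j) = (j + s) \bmod N_v, \qquad
\gamma^{2s+1}(j) = (N_v - 1) - \big((j + s) \bmod N_v\big), \qquad s = 0, \ldots, N_v - 1.
$$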

By introducing the concept of equivalent permutations, MapTR models map elements in a unified manner and addresses the ambiguity issue. MapTR further introduces hierarchical bipartite matching (see Sec. 3.2 and Sec. 3.3) for map element learning, and designs a structured encoder-decoder Transformer architecture to efficiently predict map elements (see Sec. 3.4).
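
As a minimal sketch of the modeling above, the following enumerates the equivalent permutations of a point set: two for a polyline (forward and reverse) and $2 \times N_v$ for a polygon (every start point, in both directions). The function name `equivalent_permutations` and the index convention (`perm[j]` is the point placed at position `j`) are assumptions made for illustration, not taken from the MapTR code.

```python
def equivalent_permutations(num_points, closed):
    """Return the index permutations that describe the same shape.

    Each permutation is a list `perm` where `perm[j]` is the index of the
    point placed at position j.
    """
    perms = []
    # A polyline keeps its endpoints; a polygon may start at any vertex.
    shifts = range(num_points) if closed else range(1)
    for shift in shifts:
        forward = [(j + shift) % num_points for j in range(num_points)]
        backward = [(num_points - 1) - idx for idx in forward]
        perms.append(forward)
        perms.append(backward)
    return perms


polyline_group = equivalent_permutations(4, closed=False)
polygon_group = equivalent_permutations(4, closed=True)
print(len(polyline_group), len(polygon_group))  # 2 8
```

Every permutation in a group traces the same geometry, so any of them is an equally valid supervision target.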

3.2 HIERARCHICAL MATCHING

MapTR infers a fixed-size set of $N$ map elements in parallel in a single pass, following the end-to-end paradigm of DETR (Carion et al., 2020; Fang et al., 2021). $N$ is set to be larger than the typical number of map elements in a scene. Let us denote the set of $N$ predicted map elements by $\hat{Y} = \{\hat{y}_i\}_{i=0}^{N-1}$. The set of ground-truth (GT) map elements is padded with $\varnothing$ (no object) to form a set of size $N$, denoted by $Y = \{y_i\}_{i=0}^{N-1}$. $y_i = (c_i, V_i, \Gamma_i)$, where $c_i$, $V_i$ and $\Gamma_i$ are respectively the target class label, point set and permutation group of GT map element $y_i$. $\hat{y}_i = (\hat{p}_i, \hat{V}_i)$, where $\hat{p}_i$ and $\hat{V}_i$ are respectively the predicted classification score and predicted point set. To achieve structured map element modeling and learning, MapTR introduces hierarchical bipartite matching, i.e., performing instance-level matching and point-level matching in order.
Instance-level Matching. First, we need to find an optimal instance-level label assignment $\hat{\pi}$ between the predicted map elements $\{\hat{y}_i\}$ and the GT map elements $\{y_i\}$. $\hat{\pi}$ is a permutation of $N$ elements ($\hat{\pi} \in \Pi_N$) with the lowest instance-level matching cost:
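
The equation is an image in the source and was not preserved. A form consistent with the sentence above (the permutation that minimizes the summed pair-wise cost), given here as a reconstruction, is:

$$
\hat{\pi} = \underset{\pi \in \Pi_N}{\arg\min} \sum_{i=0}^{N-1} L_{\text{ins\_match}}(\hat{y}_{\pi(i)}, y_i).
$$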
$L_{\text{ins\_match}}(\hat{y}_{\pi(i)}, y_i)$ is a pair-wise matching cost between prediction $\hat{y}_{\pi(i)}$ and GT $y_i$, which considers both the class label of the map element and the position of the point set:
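
The equation is missing from the source. Assuming the cost is the plain sum of a class term and a position term, as the wording suggests, it would read:

$$
L_{\text{ins\_match}}(\hat{y}_{\pi(i)}, y_i) = L_{\text{Focal}}(\hat{p}_{\pi(i)}, c_i) + L_{\text{position}}(\hat{V}_{\pi(i)}, V_i).
$$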

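$L_{\text{Focal}}(\hat{p}_{\pi(i)}, c_i)$ is the class matching cost term, defined as the Focal Loss (Lin et al., 2017) between the predicted classification score $\hat{p}_{\pi(i)}$ and the target class label $c_i$. $L_{\text{position}}(\hat{V}_{\pi(i)}, V_i)$ is the position matching cost term, which reflects the position correlation between the predicted point set $\hat{V}_{\pi(i)}$ and the GT point set $V_i$ (refer to Sec. B for more details). Following DETR, the Hungarian algorithm is used to find the optimal instance-level assignment $\hat{\pi}$.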
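
To make the assignment step concrete, here is a small sketch that finds the minimum-cost one-to-one assignment for a toy cost matrix. It uses exhaustive search over permutations as a stand-in for the Hungarian algorithm (both return a minimum-cost assignment, but exhaustive search is only feasible for tiny $N$); the cost values and the name `best_assignment` are made up for illustration.

```python
from itertools import permutations


def best_assignment(cost):
    """Find the one-to-one assignment with the lowest total cost.

    cost[p][g] is the matching cost between prediction p and GT slot g.
    Returns (pi, total), where pi[g] is the prediction assigned to GT slot g.
    """
    n = len(cost)
    best, best_total = None, float("inf")
    for pi in permutations(range(n)):
        total = sum(cost[pi[g]][g] for g in range(n))
        if total < best_total:
            best, best_total = pi, total
    return list(best), best_total


# Toy costs: 3 predictions, 2 real GT elements plus one padded "no object" slot.
cost = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.5],
    [0.7, 0.6, 0.5],
]
assignment, total = best_assignment(cost)
print(assignment)  # [1, 0, 2]
```

In practice a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would replace the exhaustive loop.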
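Point-level Matching. After instance-level matching, each predicted map element $\hat{y}_{\hat{\pi}(i)}$ is assigned to a GT map element $y_i$. Then, for each predicted instance assigned a positive label ($c_i \neq \varnothing$), we perform point-level matching to find the optimal point-to-point assignment $\hat{\gamma} \in \Gamma$ between the predicted point set $\hat{V}_{\hat{\pi}(i)}$ and the GT point set $V_i$. $\hat{\gamma}$ is selected from the predefined permutation group $\Gamma$ and has the lowest point-level matching cost: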
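
The equation is an image in the source and was not preserved. A reconstruction consistent with the definitions in this section (a sum of Manhattan distances over the $N_v$ points) is:

$$
\hat{\gamma} = \underset{\gamma \in \Gamma}{\arg\min} \sum_{j=0}^{N_v-1} D_{\text{Manhattan}}(\hat{v}_j, v_{\gamma(j)}).
$$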

$D_{\text{Manhattan}}(\hat{v}_j, v_{\gamma(j)})$ is the Manhattan distance between the $j$-th point of the predicted point set $\hat{V}$ and the $\gamma(j)$-th point of the GT point set $V$.
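
Point-level matching can be sketched as a search over the permutation group for the ordering with the smallest summed Manhattan distance. The example below is illustrative only: the helper names are hypothetical, and the prediction is a polyline traced in the opposite direction to the GT, so the reversed permutation should win.

```python
def manhattan(p, q):
    """Manhattan (L1) distance between two 2D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def point_level_match(pred_pts, gt_pts, group):
    """Pick the permutation in `group` with the lowest point-level cost."""
    best_perm, best_cost = None, float("inf")
    for perm in group:
        cost = sum(
            manhattan(pred_pts[j], gt_pts[perm[j]]) for j in range(len(pred_pts))
        )
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost


# A polyline's group holds the forward order and its reverse.
polyline_group = [[0, 1, 2], [2, 1, 0]]
gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(2.0, 0.5), (1.0, 0.5), (0.0, 0.5)]  # same line, opposite direction
perm, cost = point_level_match(pred, gt, polyline_group)
print(perm, cost)  # [2, 1, 0] 1.5
```

With a single fixed ordering as the target, the same prediction would cost 5.5 instead of 1.5, which is the kind of contradictory supervision the permutation group is meant to remove.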

3.3 TRAINING LOSS

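MapTR is trained based on the optimal instance-level and point-level assignments ($\hat{\pi}$ and $\{\hat{\gamma}_i\}$). The loss function is composed of three parts, the classification loss, the point2point loss and the edge direction loss: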
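
The equation itself is not preserved in the source. Given the three named terms and the three weights $\lambda$, $\alpha$ and $\beta$, a consistent reconstruction (the subscripts $\text{cls}$, $\text{p2p}$ and $\text{dir}$ are labels assumed here) is:

$$
L = \lambda L_{\text{cls}} + \alpha L_{\text{p2p}} + \beta L_{\text{dir}},
$$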

where $\lambda$, $\alpha$ and $\beta$ are the weights for balancing the different loss terms.
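Classification Loss. With the instance-level optimal matching result $\hat{\pi}$, each predicted map element is assigned a class label. The classification loss is a Focal Loss term formulated as: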
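
The formula is missing from the source. Assuming the Focal Loss is summed over all $N$ matched pairs, including those matched to the padded "no object" slots, it would read:

$$
L_{\text{cls}} = \sum_{i=0}^{N-1} L_{\text{Focal}}(\hat{p}_{\hat{\pi}(i)}, c_i).
$$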

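Point2point Loss. The point2point loss supervises the position of each predicted point. For each GT instance with index $i$, according to the point-level optimal matching result $\hat{\gamma}_i$, each predicted point $\hat{v}_{\hat{\pi}(i),j}$ is assigned a GT point $v_{i,\hat{\gamma}_i(j)}$. The point2point loss is defined as the Manhattan distance computed between each assigned point pair: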
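
The formula is an image in the source and was not preserved. A reconstruction consistent with the definitions above, assuming only instances matched to a real GT element contribute, is:

$$
L_{\text{p2p}} = \sum_{i=0}^{N-1} \mathbb{1}_{\{c_i \neq \varnothing\}} \sum_{j=0}^{N_v-1} D_{\text{Manhattan}}(\hat{v}_{\hat{\pi}(i),j}, v_{i,\hat{\gamma}_i(j)}).
$$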

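Edge Direction Loss. The point2point loss only supervises the node points of polylines and polygons and does not consider the edges (the connecting lines between adjacent points). For accurately representing map elements, the direction of the edges is important. Thus, we further design the edge direction loss to supervise the geometrical shape at the higher edge level. Specifically, we consider the cosine similarity of the paired predicted edge $\hat{e}_{\hat{\pi}(i),j}$ and GT edge $e_{i,\hat{\gamma}_i(j)}$: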
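
The formula is missing from the source. Assuming the loss is the negated sum of cosine similarities over matched edges, it would read:

$$
L_{\text{dir}} = -\sum_{i=0}^{N-1} \mathbb{1}_{\{c_i \neq \varnothing\}} \sum_{j} \text{cos\_similarity}(\hat{e}_{\hat{\pi}(i),j}, e_{i,\hat{\gamma}_i(j)}).
$$

The sketch below computes the point2point and edge direction terms for a single matched instance. It is an illustration under stated assumptions: edges are taken between consecutive points after the GT points are reordered by the matched permutation, the wrap-around edge is added only for closed shapes, and all function names are hypothetical.

```python
import math


def edges(points, closed=False):
    """Vectors between consecutive points; closed shapes add the wrap-around edge."""
    n = len(points)
    count = n if closed else n - 1
    return [
        (points[(j + 1) % n][0] - points[j][0], points[(j + 1) % n][1] - points[j][1])
        for j in range(count)
    ]


def cosine_similarity(a, b, eps=1e-8):
    """Cosine of the angle between two 2D vectors (eps avoids division by zero)."""
    dot = a[0] * b[0] + a[1] * b[1]
    return dot / (math.hypot(*a) * math.hypot(*b) + eps)


def instance_losses(pred_pts, gt_pts, perm, closed=False):
    """Return (point2point, edge_direction) terms for one matched instance."""
    matched_gt = [gt_pts[perm[j]] for j in range(len(pred_pts))]
    p2p = sum(
        abs(p[0] - g[0]) + abs(p[1] - g[1]) for p, g in zip(pred_pts, matched_gt)
    )
    direction = -sum(
        cosine_similarity(pe, ge)
        for pe, ge in zip(edges(pred_pts, closed), edges(matched_gt, closed))
    )
    return p2p, direction


gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(2.0, 0.5), (1.0, 0.5), (0.0, 0.5)]
p2p, direction = instance_losses(pred, gt, perm=[2, 1, 0])
print(p2p, round(direction, 4))  # 1.5 -2.0
```

Here the predicted edges are parallel to the matched GT edges, so the direction term reaches its minimum of minus the number of edges even though every point is still offset by 0.5.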

3.4 ARCHITECTURE

MapTR adopts an encoder-decoder paradigm. The overall architecture is depicted in Fig. 4.

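Figure 4. The overall architecture of MapTR. MapTR adopts an encoder-decoder paradigm. The map encoder transforms sensor input into a unified BEV representation. The map decoder adopts a hierarchical query embedding scheme to explicitly encode map elements and performs hierarchical matching based on the permutation-equivalent modeling. MapTR is fully end-to-end. The pipeline is highly structured, compact and efficient.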
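Input Modality. MapTR takes surround-view images from vehicle-mounted cameras as input. MapTR is also compatible with other vehicle-mounted sensors (e.g., LiDAR and RADAR), and extending MapTR to multi-modality data is straightforward. Thanks to the rational permutation-equivalent modeling, MapTR significantly outperforms other methods with multi-modality input even when using only camera input.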
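Map Encoder. The map encoder of MapTR extracts features from the images of multiple vehicle-mounted cameras and transforms them into a unified feature representation, i.e., the BEV representation. Given multi-view images $\mathbb{I} = \{I_1, \ldots, I_K\}$, we use a conventional backbone to generate multi-view feature maps $\mathbb{F} = \{F_1, \ldots, F_K\}$. The 2D image features $\mathbb{F}$ are then transformed into BEV features in $\mathbb{R}^{H \times W \times C}$. By default, we adopt GKT (Chen et al., 2022b) as the basic 2D-to-BEV transformation module, considering its easy-to-deploy property and high efficiency. MapTR is compatible with other transformation methods and maintains stable performance with them, e.g., CVT (Zhou & Krähenbühl, 2022), LSS (Philion & Fidler, 2020; Liu et al., 2022c; Li et al., 2022b; Huang et al., 2021), Deformable Attention (Li et al., 2022c; Zhu et al., 2021) and IPM (Mallot et al., 1991). Ablation studies are presented in Tab. 4.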

Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer (2206)

Table 4. Ablations on the 2D-to-BEV transformation methods. MapTR is compatible with various 2D-to-BEV methods and achieves stable performance.

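Map Decoder. We propose a hierarchical query embedding scheme to explicitly encode each map element. Specifically, we define a set of instance-level queries $\{q_i^{\text{ins}}\}_{i=0}^{N-1}$ and a set of point-level queries $\{q_j^{\text{pt}}\}_{j=0}^{N_v-1}$ shared by all instances. Each map element (with index $i$) corresponds to a set of hierarchical queries $\{q_{ij}^{\text{hie}}\}_{j=0}^{N_v-1}$. The hierarchical query of the $j$-th point of the $i$-th map element is formulated as: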
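
The formula is an image in the source and was not preserved. Assuming the hierarchical query is the element-wise sum of the two embeddings, it would read:

$$
q_{ij}^{\text{hie}} = q_i^{\text{ins}} + q_j^{\text{pt}}.
$$

Under that assumption, the following sketch builds the full $N \times N_v$ grid of hierarchical queries from $N$ instance embeddings and $N_v$ shared point embeddings. Plain Python lists stand in for learned embedding tensors, and the function name is hypothetical.

```python
def hierarchical_queries(instance_queries, point_queries):
    """Combine N instance and N_v point embeddings into N x N_v queries by addition."""
    return [
        [[a + b for a, b in zip(q_ins, q_pt)] for q_pt in point_queries]
        for q_ins in instance_queries
    ]


instance_queries = [[1.0, 0.0], [0.0, 1.0]]                 # N = 2, dim C = 2
point_queries = [[0.25, 0.25], [0.5, 0.5], [0.75, 0.75]]    # N_v = 3, shared
q_hie = hierarchical_queries(instance_queries, point_queries)
print(len(q_hie), len(q_hie[0]), q_hie[1][2])  # 2 3 [0.75, 1.75]
```

Only $N + N_v$ embeddings are learned, yet every point of every instance receives a distinct query.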

How should the instance-level queries be understood, and how is an instance represented?


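The map decoder contains several cascaded decoder layers that update the hierarchical queries iteratively. In each decoder layer, we adopt MHSA (multi-head self-attention) to let the hierarchical queries exchange information with each other (both inter-instance and intra-instance). We then adopt Deformable Attention (Zhu et al., 2021) to let the hierarchical queries interact with the BEV features, inspired by BEVFormer (Li et al., 2022c). Each query $q_{ij}^{\text{hie}}$ predicts the 2D normalized BEV coordinate $(x_{ij}, y_{ij})$ of the reference point $p_{ij}$. We then sample BEV features around the reference points and update the queries.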
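Map elements usually have irregular shapes and require long-range context. Each map element corresponds to a set of reference points $\{p_{ij}\}_{j=0}^{N_v-1}$ with a flexible and dynamic distribution. The reference points $\{p_{ij}\}_{j=0}^{N_v-1}$ can adapt to the arbitrary shape of the map element and capture informative context for map element learning.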
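The prediction head of MapTR is simple, consisting of a classification branch and a point regression branch. The classification branch predicts the instance class score. The point regression branch predicts the positions of the point set $\hat{V}$. For each map element, it outputs a $2N_v$-dimensional vector, which represents the normalized BEV coordinates of the $N_v$ points.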

4 EXPERIMENTS

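Dataset and Metric. We evaluate MapTR on the popular nuScenes (Caesar et al., 2020) dataset, which contains 1000 scenes of roughly 20 s duration each. Key samples are annotated at 2 Hz. Each sample has RGB images from 6 cameras and covers the 360° horizontal FOV of the ego-vehicle. Following previous methods (Li et al., 2022a; Liu et al., 2022a), three kinds of map elements are chosen for fair evaluation: pedestrian crossing, lane divider, and road boundary. The perception ranges are $[-15.0\,\text{m}, 15.0\,\text{m}]$ for the X-axis and $[-30.0\,\text{m}, 30.0\,\text{m}]$ for the Y-axis. We adopt average precision (AP) to evaluate the map construction quality. The Chamfer distance $D_{\text{Chamfer}}$ is used to determine whether a prediction and a GT are matched. We calculate $\text{AP}_\tau$ under several $D_{\text{Chamfer}}$ thresholds ($\tau \in T$, $T = \{0.5, 1.0, 1.5\}$) and then average across all thresholds as the final AP metric: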
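
The formula is not preserved in the source; averaging over the thresholds as described gives:

$$
\text{AP} = \frac{1}{|T|} \sum_{\tau \in T} \text{AP}_\tau.
$$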

Chamfer distance
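
For orientation, here is a minimal sketch of a symmetric Chamfer distance between two point sets and of the threshold test used to decide whether a prediction matches a GT element. The exact convention (summing the two one-way means, using Euclidean rather than squared distances) is an assumption; benchmark implementations may differ.

```python
import math


def chamfer_distance(a, b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance a->b plus b->a."""

    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)


def is_match(pred, gt, threshold):
    """A prediction counts as matched when its Chamfer distance is below the threshold."""
    return chamfer_distance(pred, gt) < threshold


gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(2.0, 0.5), (1.0, 0.5), (0.0, 0.5)]
print(chamfer_distance(pred, gt))  # 1.0
print([is_match(pred, gt, t) for t in (0.5, 1.0, 1.5)])  # [False, False, True]
```

The Chamfer distance ignores point order, so a prediction traced in the opposite direction to the GT is scored the same as one traced in the same direction.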


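Implementation Details. MapTR is trained with 8 NVIDIA GeForce RTX 3090 GPUs. We adopt the AdamW (Loshchilov & Hutter, 2019) optimizer and a cosine annealing schedule. For MapTR-tiny, we adopt ResNet50 (He et al., 2016) as the backbone. We train MapTR-tiny with a total batch size of 32 (each sample containing 6 view images). All ablation studies are based on MapTR-tiny trained for 24 epochs. MapTR-nano is designed for real-time applications and adopts ResNet18 as the backbone. More details are provided in Appendix A.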

4.1 COMPARISONS WITH STATE-OF-THE-ART METHODS

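In Tab. 1, we compare MapTR with state-of-the-art methods. MapTR-nano runs at real-time inference speed (25.1 FPS) on an RTX 3090, 8× faster than the existing state-of-the-art camera-based method (VectorMapNet-C) while achieving 5.0 higher mAP. Even compared with the existing state-of-the-art multi-modality method, MapTR-nano achieves 0.7 higher mAP and 8× faster inference speed, and MapTR-tiny achieves 13.5 higher mAP and 3× faster inference speed. MapTR also converges quickly, demonstrating advanced performance with a 24-epoch schedule.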
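Table 1. Comparisons with state-of-the-art methods (Liu et al., 2022a; Li et al., 2022a) on the nuScenes val set. "C" and "L" denote camera and LiDAR respectively. "Effi-B0" and "PointPillars" correspond to Tan & Le (2019) and Lang et al. (2019) respectively. The APs of other methods are taken from the VectorMapNet paper. The FPS of VectorMapNet-C is provided by its authors and measured on an RTX 3090. Other FPSs are measured on the same machine with an RTX 3090. "-" means that the corresponding result is not available. Even with only camera input, MapTR-tiny significantly outperforms its multi-modality counterparts (+13.5 mAP). MapTR-nano achieves SOTA camera-based performance and runs at 25.1 FPS, realizing real-time vectorized map construction for the first time.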

4.2 ABLATION STUDY

To validate the effectiveness of the different designs, we conduct ablation experiments on the nuScenes val set. More ablation studies are in Appendix B.

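Effectiveness of Permutation-equivalent Modeling. In Tab. 2, we provide ablation experiments to validate the effectiveness of the proposed permutation-equivalent modeling. Compared with the vanilla modeling method, which imposes a unique permutation on the point set, permutation-equivalent modeling resolves the ambiguity of map elements and brings an improvement of 5.9 mAP. For pedestrian crossing, the improvement even reaches 11.9 AP, demonstrating its advantage in modeling polygon elements. We also visualize the learning process in Fig. 5 to show the stability of the proposed modeling.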

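Effectiveness of Edge Direction Loss. Ablations on the weight of the edge direction loss are presented in Tab. 3. $\beta = 0$ means that we do not use the edge direction loss. $\beta = 5 \times 10^{-3}$ corresponds to appropriate supervision and is adopted as the default setting.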
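2D-to-BEV Transformation. In Tab. 4, we ablate the 2D-to-BEV transformation methods (e.g., IPM (Mallot et al., 1991), LSS (Liu et al., 2022c; Philion & Fidler, 2020), Deformable Attention (Li et al., 2022c) and GKT (Chen et al., 2022b)). We use an optimized implementation of LSS (Liu et al., 2022c). For a fair comparison with IPM and LSS, GKT and Deformable Attention both adopt a one-layer configuration. The experiments show that MapTR is compatible with various 2D-to-BEV methods and achieves stable performance. Considering its easy-to-deploy property and high efficiency, we adopt GKT as the default configuration of MapTR.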

4.3 QUALITATIVE VISUALIZATION

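We show the predicted vectorized HD map results for complex and varied driving scenes in Fig. 1. MapTR maintains stable and impressive results. More qualitative results are provided in Appendix C. We also provide videos (in the supplementary materials) to show the robustness.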

5 CONCLUSION

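MapTR is a structured end-to-end framework for efficient online vectorized HD map construction. It adopts a simple encoder-decoder Transformer architecture and hierarchical bipartite matching to perform map element learning based on the proposed permutation-equivalent modeling. Extensive experiments show that the proposed method can precisely perceive map elements of arbitrary shape on the challenging nuScenes dataset. We hope MapTR can serve as a basic module of self-driving systems and boost the development of downstream tasks (e.g., motion prediction and planning).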

Appendix

A IMPLEMENTATION DETAILS

This section provides more implementation details of the method and experiments.

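Data Augmentation. The resolution of the source images is 1600 × 900. For MapTR-nano, we resize the source images by a ratio of 0.2. For MapTR-tiny, we resize the source images by a ratio of 0.5. Color jitter is used by default.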

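Model Setting. For all experiments, $\lambda$ is set to 2, $\alpha$ is set to 5, and $\beta$ is set to $5 \times 10^{-3}$ during training. For MapTR-tiny, we set the number of instance-level queries and point-level queries to 50 and 20 respectively. We set the size of each BEV grid to 0.3 m and stack 6 Transformer decoder layers. We train MapTR-tiny with a total batch size of 32 (each sample containing 6 view images), a learning rate of $6 \times 10^{-4}$, and a learning rate multiplier of 0.1 for the backbone. All ablation studies are based on MapTR-tiny trained for 24 epochs. For MapTR-nano, we set the number of instance-level queries and point-level queries to 100 and 20 respectively. We set the size of each BEV grid to 0.75 m and stack 2 Transformer decoder layers. We train MapTR-nano for 110 epochs with a total batch size of 192, a learning rate of $4 \times 10^{-3}$, and a learning rate multiplier of 0.1 for the backbone. We use GKT (Chen et al., 2022b) as the default 2D-to-BEV module of MapTR.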
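Dataset Preprocessing. We process the map annotations following Liu et al. (2022a) and Li et al. (2022a). Map elements located within the perception range of the ego-vehicle are extracted as ground-truth map elements. By default, the perception ranges are $[-15.0\,\text{m}, 15.0\,\text{m}]$ for the X-axis and $[-30.0\,\text{m}, 30.0\,\text{m}]$ for the Y-axis.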
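
The settings in this appendix can be collected into a small configuration sketch. The values are the ones stated in the text; the dictionary layout and key names are hypothetical and do not mirror the MapTR repository's config files. The BEV grid shape is derived here by arithmetic (perception range divided by cell size) rather than quoted from the source.

```python
# Perception range around the ego-vehicle, in metres (from the text).
PERCEPTION_RANGE_M = {"x": (-15.0, 15.0), "y": (-30.0, 30.0)}

# Loss weights lambda, alpha, beta (from the text); key names are hypothetical.
LOSS_WEIGHTS = {"cls": 2.0, "p2p": 5.0, "dir": 5e-3}

# Key names are hypothetical; values are the ones stated in the text.
CONFIGS = {
    "MapTR-tiny": {
        "backbone": "ResNet50",
        "image_scale": 0.5,
        "num_instance_queries": 50,
        "num_point_queries": 20,
        "bev_cell_m": 0.3,
        "decoder_layers": 6,
        "batch_size": 32,
        "lr": 6e-4,
    },
    "MapTR-nano": {
        "backbone": "ResNet18",
        "image_scale": 0.2,
        "num_instance_queries": 100,
        "num_point_queries": 20,
        "bev_cell_m": 0.75,
        "decoder_layers": 2,
        "batch_size": 192,
        "lr": 4e-3,
        "epochs": 110,
    },
}


def bev_grid_shape(cell_m):
    """Number of BEV cells along X and Y implied by the range and cell size."""
    return tuple(
        round((hi - lo) / cell_m)
        for lo, hi in (PERCEPTION_RANGE_M["x"], PERCEPTION_RANGE_M["y"])
    )


def num_hierarchical_queries(cfg):
    """Total queries: one per (instance, point) pair."""
    return cfg["num_instance_queries"] * cfg["num_point_queries"]


for name, cfg in CONFIGS.items():
    print(name, bev_grid_shape(cfg["bev_cell_m"]), num_hierarchical_queries(cfg))
# MapTR-tiny (100, 200) 1000
# MapTR-nano (40, 80) 2000
```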

B ABLATION STUDY

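Point Number. Ablations on the number of points used to model each map element are presented in Tab. 5. Too few points cannot describe the complex geometrical shape of a map element, while too many points hurt efficiency. We adopt 20 points as the default setting of MapTR.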

Element Number. Ablations on the number of map elements are presented in Tab. 6. We adopt 50 as the default number of map elements for MapTR-tiny.

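Decoder Layer Number. Ablations on the number of map decoder layers are presented in Tab. 7. The map construction performance improves with more layers but saturates when the number of layers reaches 6.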

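Position Matching Cost. As mentioned in Sec. 3.2, we adopt the position matching cost term $L_{\text{position}}(\hat{V}_{\hat{\pi}(i)}, V_i)$ in instance-level matching to reflect the position correlation between the predicted point set $\hat{V}_{\hat{\pi}(i)}$ and the GT point set $V_i$. In Tab. 8, we compare two cost designs, i.e., the Chamfer distance cost and the point2point cost. The point2point cost is similar to the point-level matching cost: we find the best point-to-point assignment and sum the Manhattan distances of all point pairs as the position matching cost of the two point sets. The experiments show that the point2point cost is better than the Chamfer distance cost.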

How are the best matching points found when computing this cost?


Swin Transformer Backbones. Ablations on Swin Transformer backbones (Liu et al., 2021b) are presented in Tab. 9.

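Modality. Multi-sensor perception is crucial for the safety of autonomous vehicles, and MapTR is compatible with other vehicle-mounted sensors such as LiDAR. As shown in Tab. 10, with only a 24-epoch schedule, multi-modality MapTR significantly outperforms the previous state-of-the-art result by 17.3 mAP while being 2× faster.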

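Robustness to Camera Deviation. In real applications, the camera intrinsics are usually accurate and change little, but the camera extrinsics may be inaccurate due to shifts in camera position, calibration error, etc. To validate the robustness, we traverse the validation set and randomly generate noise for each sample. We add translation noise and rotation noise of different degrees, respectively. Note that we add noise to all cameras and all coordinates, and the noise follows a normal distribution. Some samples receive very large deviations, which affect performance considerably. As shown in Tab. 11 and Tab. 12, when the standard deviation of $\Delta x, \Delta y, \Delta z$ is 0.1 m or the standard deviation of $\theta_x, \theta_y, \theta_z$ is 0.02 rad, MapTR still maintains comparable performance.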
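
The perturbation protocol can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: each camera's extrinsics are represented as a translation vector plus three rotation-angle offsets, the camera names and nominal values are made up, and independent zero-mean Gaussian noise is added to every coordinate of every camera.

```python
import random


def perturb_extrinsics(translation, rotation_rad, sigma_t=0.0, sigma_r=0.0, rng=random):
    """Add zero-mean Gaussian noise to every translation and rotation coordinate."""
    noisy_t = [t + rng.gauss(0.0, sigma_t) for t in translation]
    noisy_r = [r + rng.gauss(0.0, sigma_r) for r in rotation_rad]
    return noisy_t, noisy_r


rng = random.Random(0)  # fixed seed so the perturbation is reproducible

# Hypothetical nominal extrinsics for 6 cameras: (translation in m, rotation in rad).
cameras = {f"cam_{k}": ([0.0, 0.0, 1.5], [0.0, 0.0, 0.0]) for k in range(6)}

# Translation-only noise (std 0.1 m) and rotation-only noise (std 0.02 rad),
# applied as two separate experiments, as in the text.
noisy_translation = {
    name: perturb_extrinsics(t, r, sigma_t=0.1, rng=rng)
    for name, (t, r) in cameras.items()
}
noisy_rotation = {
    name: perturb_extrinsics(t, r, sigma_r=0.02, rng=rng)
    for name, (t, r) in cameras.items()
}
print(len(noisy_translation), len(noisy_rotation))  # 6 6
```

Because the noise is unbounded, an occasional sample draws a deviation several standard deviations large, which is consistent with the remark that some samples are affected heavily.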

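Detailed Running Time. To provide deeper insight into the efficiency of MapTR, we present in Tab. 13 the detailed running time of each component of MapTR-tiny with only multi-camera input.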