[CVPR 2022] Cross-View Transformers for Real-Time Map-View Semantic Segmentation

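Paper link: Cross-View Transformers for Real-Time Map-View Semantic Segmentation

Code: cross_view_transformers/cross_view_transformer at master · bradyz/cross_view_transformers · GitHub

The English here is typed entirely by hand! It is a summarizing and paraphrasing of the original paper. Unavoidable spelling and grammar mistakes may appear; if you spot any, corrections in the comments are welcome! This post leans toward personal notes, so read with caution.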

1. Takeaways

2. Close reading of the paper, section by section

2.1. Abstract

2.2. Introduction

2.3. Related Works

2.4. Cross-view transformers

2.4.1. Cross-view attention

2.4.2. A cross-view transformer architecture

2.5. Implementation Details

2.6. Results

2.6.1. Comparison to prior work

2.6.2. Ablations of cross-view attention

2.6.3. Camera-aware positional embeddings

2.6.4. Accuracy vs distance

2.6.5. Robustness to sensor dropout

2.6.6. Qualitative Results

2.6.7. Geometric reasoning in cross-view attention

2.7. Conclusion

3. Reference

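1. Takeaways

（1）People do need to challenge themselves, but they cannot be challenging themselves all the time

2. Close reading of the paper, section by section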

2.1. Abstract

①They perform map-view semantic segmentation with a camera-aware cross-view attention mechanism

2.2. Introduction

①Depth projection and measurement are a bottleneck

②Their cross-view Transformer never explicitly reasons about geometry but "learns to map between views through a geometry-aware positional embedding"

③Schematic of the cross-view Transformer:

2.3. Related Works

（1）Monocular 3D object detection

①Lists monocular detection methods, which convert 3D objects to 2D and predict their depth

monocular (adj.): one-eyed, single-camera; (n.): a monocular, i.e. a single-tube telescope

（2）Depth estimation

①Depth estimation methods rely on cameras and accurate calibration

（3）Semantic mapping in the map-view

①This line of work places inputs and outputs in two different coordinate frames: inputs live in calibrated camera views, while outputs are rasterized onto a map

②The authors reckon that implicit geometric reasoning performs as well as explicit geometric models

rasterized (v.): converted onto a regular grid of cells

2.4. Cross-view transformers

①Input: $n$ monocular views $(I_k,K_k,R_k,t_k)_{k=1}^n$, where $I_k\in\mathbb{R}^{H\times W\times3}$ denotes the input image, $K_k\in\mathbb{R}^{3\times3}$ denotes the camera intrinsics, $R_k\in\mathbb{R}^{3\times3}$ is the extrinsic rotation, and $t_k\in\mathbb{R}^3$ denotes the translation relative to the center of the ego-vehicle

②They aim to predict a binary semantic segmentation mask $y\in\left\{0,1\right\}^{h\times w\times C}$

③Pipeline of the cross-view transformer:

where the positional embedding shares the same encoder with the image

2.4.1. Cross-view attention

①For any world coordinate $x^{(W)}\in\mathbb{R}^3$, the perspective transformation maps it to the corresponding image coordinate $x^{(I)}\in\mathbb{R}^3$ by:

$x^{(I)}\simeq K_kR_k(x^{(W)}-t_k)$

where $\simeq$ denotes equality up to a scale factor, and $x^{(I)}=\left(\cdot,\cdot,1\right)$ is written in homogeneous coordinates
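The projection above can be sketched in a few lines of plain Python. This is only an illustration of the formula, not the authors' code: the helper names `matvec` and `project` and the camera values (focal length, principal point, pose) are assumptions made up for the example.

```python
def matvec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]


def project(K, R, t, x_world):
    """Map a world point to homogeneous image coordinates (u, v, 1)."""
    # Express the point relative to the camera: R (x_world - t).
    shifted = [x_world[i] - t[i] for i in range(3)]
    x_cam = matvec(R, shifted)
    # Apply the intrinsics, then divide by the last entry to fix the scale.
    u, v, w = matvec(K, x_cam)
    if w <= 0:
        raise ValueError("point is not in front of the camera")
    return [u / w, v / w, 1.0]


# Hypothetical camera: focal length 100 px, principal point (50, 40),
# axes aligned with the world frame, located at (0, 0, -2).
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 40.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, -2.0]

pixel = project(K, R, t, [1.0, 0.5, 2.0])
print(pixel)  # [75.0, 52.5, 1.0]
```

The division by `w` is what "equality up to a scale factor" hides: any positive multiple of the camera-frame point lands on the same pixel, which is why depth cannot be read off a single image coordinate.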

②Recast the geometric relationship between world and image coordinates as a cosine similarity:

$\begin{aligned} sim_k(x^{(I)},x^{(W)})=\frac{\left(R_k^{-1}K_k^{-1}x^{(I)}\right)\cdot\left(x^{(W)}-t_k\right)}{\left\|R_k^{-1}K_k^{-1}x^{(I)}\right\|\left\|x^{(W)}-t_k\right\|} \end{aligned}$
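A small worked example of this similarity, as a sketch under stated assumptions: $K_k$ is taken to be a zero-skew pinhole matrix (so $K_k^{-1}$ has a closed form) and $R_k$ a proper rotation (so $R_k^{-1}=R_k^T$). The function names and the camera values are hypothetical.

```python
import math


def unproject(K, R, pixel):
    """Return d = R^-1 K^-1 x_img, assuming zero-skew K and a rotation R."""
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    u, v, _ = pixel
    ray_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]  # K^-1 x_img
    # For a rotation matrix the inverse is the transpose: (R^T v)_c = sum_r R[r][c] v[r].
    return [sum(R[r][c] * ray_cam[r] for r in range(3)) for c in range(3)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sim(K, R, t, pixel, x_world):
    """Cosine similarity between the pixel's viewing ray and x_world - t."""
    direction = unproject(K, R, pixel)
    offset = [x_world[i] - t[i] for i in range(3)]
    return cosine(direction, offset)


# Hypothetical camera: focal length 100 px, principal point (50, 40),
# axes aligned with the world frame, located at (0, 0, -2).
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 40.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, -2.0]
x_world = [1.0, 0.5, 2.0]

# The pixel that x_world projects to, and the image center for comparison.
sim_match = sim(K, R, t, [75.0, 52.5, 1.0], x_world)
sim_center = sim(K, R, t, [50.0, 40.0, 1.0], x_world)
print(sim_match, sim_center)
```

The similarity is (numerically) 1 exactly when the world point lies on the pixel's viewing ray and drops below 1 for every other pixel, so it encodes the projection without ever needing the depth of the point.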

（1）Camera-aware positional encoding

①Each unprojected image coordinate $d_{k,i}=R_k^{-1}K_k^{-1}x_i^{(I)}$ is a direction vector from the origin $t_k$ of camera $k$ to the image plane at depth $1$

②An MLP encodes each direction vector into a positional embedding $\delta_{k,i}\in\mathbb{R}^{D}$, with $D=128$ in the paper's experiments

③Learned positional encoding: $c^{(0)}\in\mathbb{R}^{w\times h\times D}$

④The transformer refines this embedding over successive iterations, producing new embeddings $c^{(1)},c^{(2)},\ldots$

（2）Map-view latent embedding

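①Combining the camera-aware positional embedding $\delta_{k,i}$ with the image feature $\phi_{k,i}$ on the key side, and the map-view embedding $c_j^{(n)}$ with the camera-location embedding $\tau_k$ on the query side: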

$sim(\delta_{k,i},\phi_{k,i},c_j^{(n)},\tau_k)=\frac{(\delta_{k,i}+\phi_{k,i})\cdot\left(c_j^{(n)}-\tau_k\right)}{\|\delta_{k,i}+\phi_{k,i}\|\|c_j^{(n)}-\tau_k\|}$
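To make the roles of the four terms concrete, here is a toy sketch in plain Python. It evaluates the similarity for three image positions and turns the scores into attention weights with a softmax. The 2-dimensional embeddings, the zero image features, and the unscaled softmax are assumptions chosen for readability; they are not values or details taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cross_view_sim(delta, phi, c, tau):
    """Similarity between key (delta + phi) and query (c - tau)."""
    key = [d + p for d, p in zip(delta, phi)]
    query = [cj - tk for cj, tk in zip(c, tau)]
    return cosine(key, query)


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy values with D = 2 (real embeddings are much wider).
deltas = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # positional embeddings of 3 pixels
phis = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]     # image features, zeroed for clarity
c_j = [2.0, 1.0]                                # one map-view query embedding
tau = [1.0, 1.0]                                # camera-location embedding

scores = [cross_view_sim(d, p, c_j, tau) for d, p in zip(deltas, phis)]
weights = softmax(scores)
print(scores)
print(weights)
```

With the image features zeroed, the map-view query attends most strongly to the pixel whose positional embedding points the same way as $c_j-\tau_k$. Non-zero $\phi_{k,i}$ would shift the keys, letting appearance as well as geometry decide where each map cell looks.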