Gaussian Grouping: Segment and Edit Anything in 3D Scenes
Table of Contents
- Gaussian Grouping: Segment and Edit Anything in 3D Scenes
- 1. What
- 2. Why
- 3. How
- 3.1 Anything Mask Input and Consistency
- 3.2 3D Gaussian Rendering and Grouping
- 3.3 Downstream: Local Gaussian Editing
1. What
What does this paper do? (From the abstract and conclusion, summarized in one sentence.)
The first 3D Gaussian-based approach to jointly reconstruct and segment anything in the open-world 3D scene.
Each Gaussian is augmented with a compact Identity Encoding, supervised by SAM's 2D masks together with an introduced 3D spatial-consistency regularization; the resulting grouping can further be used for editing.
Explanation of Open-world
An open-world scenario refers to an uncertain, dynamic and complex environment that contains a variety of objects, scenes and tasks.
Alternatively, “open-world scene understanding” refers to a model's ability to generalize to scenes or environments it has not been explicitly trained on. In this context, “open-world” implies that the model must adapt to and understand a wide range of scenes, including ones very different from its training data.
2. Why
Under what conditions or needs was this research proposed (Intro)? What core problems or deficiencies does it solve, what have others done, and what are the innovation points? (From the Introduction and related work.)
This may cover Background, Question, Others, Innovation:
- Existing methods [8, 37] rely on manually-labeled datasets or require accurately scanned 3D point clouds [33, 42] as input.
- Existing NeRF-based methods [14, 17, 25, 39] are computation-hungry and hard to adjust for downstream tasks, because the learned neural networks, such as MLPs, cannot easily decompose each part or module of the 3D scene.
- As for Radiance-based Open World Scene Understanding: Unlike our approach, most of these methods are designed for in-domain scene modeling and cannot generalize to open-world scenarios.
3. How
Following this pipeline, we introduce each stage in detail.
3.1 Anything Mask Input and Consistency
As shown in Figure 2(a), the inputs are a set of multi-view captures, the 2D segmentations automatically generated by SAM, and the corresponding cameras calibrated via SfM.
As shown in Figure 2(b), to assign each 2D mask a unique ID across the 3D scene, a well-trained zero-shot tracker [7] is used to propagate and associate masks across views; different colors in the figure represent different segmentation labels.
3.2 3D Gaussian Rendering and Grouping
As shown in Figure 2(c), this stage brings together all of the paper's core concepts.
Identity Encoding
A new parameter, the Identity Encoding, is added to each Gaussian's original attributes $S_{\Theta_{i}}=\{\mathbf{p}_{i},\mathbf{s}_{i},\mathbf{q}_{i},\alpha_{i},\mathbf{c}_{i}\}$. It is a compact vector of length 16; like the Spherical Harmonic (SH) coefficients that represent color, it is differentiable and learnable.
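As a minimal sketch (not the authors' implementation), the Identity Encoding can be stored as one more learnable tensor next to the usual per-Gaussian attributes; all shapes and variable names below are illustrative assumptions:

```python
import torch

# Hypothetical setup: N Gaussians, each gaining a 16-dim Identity Encoding.
num_gaussians = 10_000
identity_dim = 16  # compact vector length stated in the paper

positions = torch.nn.Parameter(torch.randn(num_gaussians, 3))  # p_i
opacities = torch.nn.Parameter(torch.zeros(num_gaussians, 1))  # alpha_i
# The new attribute: one Identity Encoding per Gaussian, optimized by
# gradient descent just like the SH color coefficients.
identity = torch.nn.Parameter(torch.zeros(num_gaussians, identity_dim))
```

Because it is a `Parameter`, the encoding receives gradients from the rendering losses described next.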
Grouping via Rendering
In the process of rendering labels, analogous to $\alpha$-blending:

$$E_{\text{id}}=\sum_{i\in\mathcal{N}}e_{i}\alpha_{i}'\prod_{j=1}^{i-1}(1-\alpha_{j}'),$$

but the notation differs: $e_{i}$ is the length-16 Identity Encoding of each Gaussian, and $\alpha_{i}'$ is a new weight computed by multiplying the opacity $\alpha_{i}$ with the 2D covariance term $\Sigma^{2\mathrm{D}}$, where $\Sigma^{2\mathrm{D}}=JW\Sigma^{3\mathrm{D}}W^{T}J^{T}$ according to [61].
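The blending sum above can be vectorized for one pixel's depth-sorted Gaussians; this is an assumed sketch, not the official CUDA rasterizer:

```python
import torch

def render_identity(e: torch.Tensor, alpha_prime: torch.Tensor) -> torch.Tensor:
    """Sketch of identity blending for a single pixel.

    e:           (N, 16) Identity Encodings, sorted front-to-back
    alpha_prime: (N,)    weights alpha_i' (opacity times 2D Gaussian falloff)
    Returns E_id of shape (16,).
    """
    # Exclusive cumulative product gives prod_{j<i} (1 - alpha'_j)
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha_prime[:-1]]), dim=0)
    weights = alpha_prime * transmittance            # (N,)
    return (weights.unsqueeze(-1) * e).sum(dim=0)    # (16,)
```

With weights that sum to 1 (e.g. a fully opaque last Gaussian), the result is a convex combination of the encodings along the ray.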
Grouping Loss
2D Identity Loss: Given the rendered 2D features $E_{\text{id}}$ as input, a linear layer $f$ first restores the feature dimension to $K+1$, then $\mathrm{softmax}(f(E_{\text{id}}))$ is taken for identity classification, trained with a cross-entropy loss.
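A hedged sketch of this classification loss; $K$, the image size, and all tensors below are made-up placeholders rather than values from the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

K = 50                     # assumed number of mask identities in the scene
identity_dim = 16
classifier = nn.Linear(identity_dim, K + 1)  # the linear layer f

H, W = 32, 32
E_id = torch.randn(H, W, identity_dim)       # rendered identity feature map
labels = torch.randint(0, K + 1, (H, W))     # per-pixel mask IDs from SAM + tracker

logits = classifier(E_id)                    # (H, W, K+1)
# cross_entropy applies log-softmax internally, matching softmax(f(E_id))
loss_2d = Fn.cross_entropy(logits.reshape(-1, K + 1), labels.reshape(-1))
```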
3D Regularization Loss:
3D Regularization Loss leverages 3D spatial consistency, enforcing the Identity Encodings of the $k$-nearest 3D Gaussians to be close in feature distance:

$$\mathcal{L}_{\mathrm{3d}}=\frac{1}{m}\sum_{j=1}^{m}D_{\mathrm{kl}}(P\|Q)=\frac{1}{mk}\sum_{j=1}^{m}\sum_{i=1}^{k}F(e_{j})\log\left(\frac{F(e_{j})}{F(e_{i}^{\prime})}\right)$$

where $P$ contains the sampled Identity Encoding $e$ of a 3D Gaussian, while the set $Q=\{e_{1}^{\prime},e_{2}^{\prime},\dots,e_{k}^{\prime}\}$ consists of its $k$ nearest neighbors in 3D space.
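A sketch of this loss under assumed shapes (`identity` of shape (N, 16), `positions` of shape (N, 3)); brute-force kNN is used for clarity where a real implementation would use a spatial index, and $F(\cdot)$ is taken here to be a softmax over the encoding dimension:

```python
import torch
import torch.nn.functional as Fn

def reg3d_loss(identity: torch.Tensor, positions: torch.Tensor,
               k: int = 3, m: int = 8) -> torch.Tensor:
    """Illustrative L_3d: pull each sampled Gaussian's softmaxed Identity
    Encoding toward those of its k nearest spatial neighbors via KL."""
    n = positions.shape[0]
    idx = torch.randperm(n)[:min(m, n)]                    # sample m Gaussians
    dists = torch.cdist(positions[idx], positions)         # (m, N)
    knn = dists.topk(k + 1, largest=False).indices[:, 1:]  # drop self, keep k
    p = Fn.softmax(identity[idx], dim=-1)                  # F(e_j): (m, 16)
    q = Fn.softmax(identity[knn], dim=-1)                  # F(e'_i): (m, k, 16)
    # KL(P || Q) per neighbor pair, mean over m*k pairs -> the 1/(mk) factor
    kl = (p.unsqueeze(1) * (p.unsqueeze(1).log() - q.log())).sum(dim=-1)
    return kl.mean()
```

When all encodings are identical the loss is zero, which matches the intent: neighboring Gaussians belonging to the same object should agree on their identity.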
3.3 Downstream: Local Gaussian Editing
Inpainting receives the most attention: first, the relevant 3D Gaussians are deleted, and then a small number of new Gaussians are added, supervised during rendering by the 2D inpainting results from LaMa [41].