Gaussian Grouping: Segment and Edit Anything in 3D Scenes
Table of Contents
- Gaussian Grouping: Segment and Edit Anything in 3D Scenes
- 1. What
- 2. Why
- 3. How
- 3.1 Anything Mask Input and Consistency
- 3.2 3D Gaussian Rendering and Grouping
- 3.3 Downstream: Local Gaussian Editing
1. What
What does this paper do? (From the abstract and conclusion, summarized in one sentence.)
The first 3D Gaussian-based approach to jointly reconstruct and segment anything in the open-world 3D scene.
Each Gaussian is augmented with a compact Identity Encoding, supervised by SAM's 2D masks together with an introduced 3D spatial-consistency regularization; the resulting grouping can further be used for editing.
Explanation of Open-world
An open-world scenario refers to an uncertain, dynamic and complex environment that contains a variety of objects, scenes and tasks.
Alternatively, “open-world scene understanding” refers to a model's ability to generalize to scenes or environments it has not been explicitly trained on. In this context, “open-world” implies that the model must adapt to and understand a wide range of scenes, including ones very different from its training data.
2. Why
Under what conditions or needs was this research proposed (Intro)? What core problems or deficiencies does it solve, what have others done, and what are the innovation points? (From the Introduction and related work.)
This may cover Background, Question, Others, Innovation:
- Existing methods [8, 37] rely on manually-labeled datasets or require accurately scanned 3D point clouds [33, 42] as input.
- Existing NeRF-based methods [14, 17, 25, 39] are computation-hungry and hard to adjust for downstream tasks, because the learned neural networks, such as MLPs, cannot easily decompose each part or module of the 3D scene.
- As for Radiance-based Open World Scene Understanding: Unlike our approach, most of these methods are designed for in-domain scene modeling and cannot generalize to open-world scenarios.
3. How
Following this pipeline, we introduce each stage in detail.
3.1 Anything Mask Input and Consistency
As shown in Figure 2(a), the inputs are a set of multi-view captures, the 2D segmentations automatically generated by SAM, and the corresponding cameras calibrated via SfM.
As shown in Figure 2(b), to assign each 2D mask a unique ID across the 3D scene, a well-trained zero-shot tracker [7] is used to propagate and associate masks across views; different colors in the figure represent different segmentation labels.
3.2 3D Gaussian Rendering and Grouping
As shown in Figure 2(c), this stage brings together all of the paper's core concepts.
Identity Encoding
A new parameter, the Identity Encoding, is added to each Gaussian's original attributes $S_{\Theta_{i}}=\{\mathbf{p}_{i},\mathbf{s}_{i},\mathbf{q}_{i},\alpha_{i},\mathbf{c}_{i}\}$. It is a compact vector of length 16; like the Spherical Harmonic (SH) coefficients that represent color, it is differentiable and learnable.
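As a minimal sketch (not the authors' implementation), the Identity Encoding can be stored as one more learnable tensor next to the usual per-Gaussian attributes; all shapes and variable names below are illustrative assumptions:

```python
import torch

# Hypothetical setup: N Gaussians, each gaining a 16-dim Identity Encoding.
num_gaussians = 10_000
identity_dim = 16  # compact vector length stated in the paper

positions = torch.nn.Parameter(torch.randn(num_gaussians, 3))  # p_i
opacities = torch.nn.Parameter(torch.zeros(num_gaussians, 1))  # alpha_i
# The new attribute: one Identity Encoding per Gaussian, optimized by
# gradient descent just like the SH color coefficients.
identity = torch.nn.Parameter(torch.zeros(num_gaussians, identity_dim))
```

Because it is a `Parameter`, the encoding receives gradients from the rendering losses described next.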
Grouping via Rendering
In the process of rendering labels, analogous to $\alpha$-blending:

$$E_{\text{id}}=\sum_{i\in\mathcal{N}}e_{i}\alpha_{i}'\prod_{j=1}^{i-1}(1-\alpha_{j}'),$$

but the notation differs: $e_{i}$ is the length-16 Identity Encoding of each Gaussian, and $\alpha_{i}'$ is a new weight computed by multiplying the opacity $\alpha_{i}$ with the 2D covariance term $\Sigma^{2\mathrm{D}}$, where $\Sigma^{2\mathrm{D}}=JW\Sigma^{3\mathrm{D}}W^{T}J^{T}$ according to [61].
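The blending sum above can be vectorized for one pixel's depth-sorted Gaussians; this is an assumed sketch, not the official CUDA rasterizer:

```python
import torch

def render_identity(e: torch.Tensor, alpha_prime: torch.Tensor) -> torch.Tensor:
    """Sketch of identity blending for a single pixel.

    e:           (N, 16) Identity Encodings, sorted front-to-back
    alpha_prime: (N,)    weights alpha_i' (opacity times 2D Gaussian falloff)
    Returns E_id of shape (16,).
    """
    # Exclusive cumulative product gives prod_{j<i} (1 - alpha'_j)
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha_prime[:-1]]), dim=0)
    weights = alpha_prime * transmittance            # (N,)
    return (weights.unsqueeze(-1) * e).sum(dim=0)    # (16,)
```

With weights that sum to 1 (e.g. a fully opaque last Gaussian), the result is a convex combination of the encodings along the ray.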
Grouping Loss
2D Identity Loss: Given the rendered 2D features $E_{\text{id}}$ as input, a linear layer $f$ first restores the feature dimension to $K+1$, then $\mathrm{softmax}(f(E_{\text{id}}))$ is taken for identity classification, trained with a cross-entropy loss.
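A hedged sketch of this classification loss; $K$, the image size, and all tensors below are made-up placeholders rather than values from the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as Fn

K = 50                     # assumed number of mask identities in the scene
identity_dim = 16
classifier = nn.Linear(identity_dim, K + 1)  # the linear layer f

H, W = 32, 32
E_id = torch.randn(H, W, identity_dim)       # rendered identity feature map
labels = torch.randint(0, K + 1, (H, W))     # per-pixel mask IDs from SAM + tracker

logits = classifier(E_id)                    # (H, W, K+1)
# cross_entropy applies log-softmax internally, matching softmax(f(E_id))
loss_2d = Fn.cross_entropy(logits.reshape(-1, K + 1), labels.reshape(-1))
```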
3D Regularization Loss:
3D Regularization Loss leverages 3D spatial consistency, enforcing the Identity Encodings of the $k$-nearest 3D Gaussians to be close in feature distance:

$$\mathcal{L}_{\mathrm{3d}}=\frac{1}{m}\sum_{j=1}^{m}D_{\mathrm{kl}}(P\|Q)=\frac{1}{mk}\sum_{j=1}^{m}\sum_{i=1}^{k}F(e_{j})\log\left(\frac{F(e_{j})}{F(e_{i}^{\prime})}\right)$$

where $P$ contains the sampled Identity Encoding $e$ of a 3D Gaussian, while the set $Q=\{e_{1}^{\prime},e_{2}^{\prime},\dots,e_{k}^{\prime}\}$ consists of its $k$ nearest neighbors in 3D space.
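A sketch of this loss under assumed shapes (`identity` of shape (N, 16), `positions` of shape (N, 3)); brute-force kNN is used for clarity where a real implementation would use a spatial index, and $F(\cdot)$ is taken here to be a softmax over the encoding dimension:

```python
import torch
import torch.nn.functional as Fn

def reg3d_loss(identity: torch.Tensor, positions: torch.Tensor,
               k: int = 3, m: int = 8) -> torch.Tensor:
    """Illustrative L_3d: pull each sampled Gaussian's softmaxed Identity
    Encoding toward those of its k nearest spatial neighbors via KL."""
    n = positions.shape[0]
    idx = torch.randperm(n)[:min(m, n)]                    # sample m Gaussians
    dists = torch.cdist(positions[idx], positions)         # (m, N)
    knn = dists.topk(k + 1, largest=False).indices[:, 1:]  # drop self, keep k
    p = Fn.softmax(identity[idx], dim=-1)                  # F(e_j): (m, 16)
    q = Fn.softmax(identity[knn], dim=-1)                  # F(e'_i): (m, k, 16)
    # KL(P || Q) per neighbor pair, mean over m*k pairs -> the 1/(mk) factor
    kl = (p.unsqueeze(1) * (p.unsqueeze(1).log() - q.log())).sum(dim=-1)
    return kl.mean()
```

When all encodings are identical the loss is zero, which matches the intent: neighboring Gaussians belonging to the same object should agree on their identity.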
3.3 Downstream: Local Gaussian Editing
Inpainting receives the most attention: first, the relevant 3D Gaussians are deleted, and then a small number of new Gaussians are added, supervised during rendering by the 2D inpainting results from LaMa [41].