Instruct-NeRF2NeRF: Editing NeRF 3D Scenes with User Instructions

Haque A, Tancik M, Efros A A, et al. Instruct-nerf2nerf: Editing 3d scenes with instructions[J]. arXiv preprint arXiv:2303.12789, 2023.

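Instruct-NeRF2NeRF is an ICCV 2023 Oral paper that lifts instruction-based image editing from 2D images to 3D scenes.

The task is to edit a 3D scene represented as a NeRF according to a user instruction. Instruct-NeRF2NeRF uses a pretrained InstructPix2Pix to edit the NeRF's training data (the multi-view images) and then continues training the NeRF on the edited views, so that the edit is carried into the 3D scene. To keep the edited scene consistent across views, training follows the Iterative DU scheme.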

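I. Research Idea

Instruct-NeRF2NeRF aims to edit a NeRF scene according to a human instruction, so the only inputs it needs are the editing instruction and the NeRF scene. As noted in DreamFusion, a 3D scene is in essence one scene observed from many viewpoints ¹. Instruct-NeRF2NeRF therefore uses InstructPix2Pix to edit the NeRF's multi-view training images, and the edited images are then used to optimize the NeRF's 3D representation. To make this possible, the scene's training data (images, camera poses, and so on) is kept alongside its NeRF representation.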

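II. The Instruct-NeRF2NeRF Model

Instruct-NeRF2NeRF fine-tunes a NeRF scene using the edits produced by InstructPix2Pix:

Input: a NeRF scene together with its training data, plus an editing instruction;
Output: the edited NeRF scene.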

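III. Training Method

Editing the training images of each viewpoint directly makes the 3D scene inconsistent (inconsistent edits across viewpoints), because the edits applied to different views are independent of one another.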

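Instruct-NeRF2NeRF is therefore trained with Iterative Dataset Update (Iterative DU), which alternates between editing the NeRF training images and updating the NeRF scene.

This is also why the method does not edit all training images up front and then retrain the NeRF from scratch: the original training data guarantees a consistent 3D scene, whereas the multi-view images edited by InstructPix2Pix most likely do not describe a single consistent scene.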

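1. Editing the NeRF Training Images

To edit a training image, the original image $c_I$ at viewpoint $v$, the editing instruction $c_T$, and the noisy input $z_t$ (a noised version of the current render at noise level $t$) are fed to InstructPix2Pix. Let $I_{i}^{v}$ denote the image at viewpoint $v$ in round $i$, with $I_{0}^{v}=c_I$. The image is then updated round by round: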
$I_{i+1}^{v} \leftarrow U_{\theta}(I_{i}^{v},t;I_{0}^{v},c_T)$
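
For intuition about how $c_I$ and $c_T$ steer each update: InstructPix2Pix combines three noise predictions using two classifier-free guidance scales, one for the image condition and one for the text condition. The sketch below writes out that combination in plain Python. It is an illustration only, not the repository's code: the function name `guided_noise` and the toy vectors are hypothetical, and the default scales 1.5 and 7.5 are simply the `--pipeline.image-guidance-scale` and `--pipeline.guidance-scale` values used in the commands of the reproduction section.

```python
from typing import List, Sequence


def guided_noise(
    eps_uncond: Sequence[float],
    eps_image: Sequence[float],
    eps_full: Sequence[float],
    s_image: float = 1.5,
    s_text: float = 7.5,
) -> List[float]:
    """Combine three denoiser outputs with image and text guidance scales.

    eps_uncond: prediction with neither condition
    eps_image:  prediction conditioned on the original image c_I only
    eps_full:   prediction conditioned on both c_I and the instruction c_T
    """
    return [
        u
        + s_image * (i - u)   # pull toward staying close to the original image
        + s_text * (f - i)    # pull toward following the instruction
        for u, i, f in zip(eps_uncond, eps_image, eps_full)
    ]


# Toy stand-ins for the three network outputs (hypothetical values).
e_uncond = [0.0, 0.0]
e_image = [1.0, 0.0]
e_full = [1.0, 1.0]
print(guided_noise(e_uncond, e_image, e_full))  # [1.5, 7.5]
```

With both scales set to 1 the expression collapses to the fully conditioned prediction; raising the text scale pushes the edit harder toward the instruction, while raising the image scale keeps the result closer to the original view.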

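2. Updating the NeRF Training Set

The core of Instruct-NeRF2NeRF is the alternation between editing the NeRF training images and updating the NeRF scene, called Iterative DU. Before training, an order is fixed over the multi-view images of the training set. Each round then updates $d$ images and afterwards performs $n$ NeRF updates on randomly sampled rays:

Image update: the next views in the order are edited, and each edited view replaces the corresponding image in the training set;
NeRF training: rays are sampled from the whole training set, which at that point is a mix of old and newly edited images.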
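
The alternation described above can be sketched as a short loop. This is a toy illustration under stated assumptions, not the authors' implementation: `iterative_dataset_update`, `ToyScene`, and `toy_edit` are hypothetical names, each image is reduced to a single number, and the "NeRF" is one scalar that moves toward sampled targets. Shuffling the view order once before training is likewise an assumption about how the order is chosen.

```python
import random
from typing import Callable, List


def iterative_dataset_update(
    originals: List[float],
    render: Callable[[int], float],
    edit: Callable[[float, float], float],
    train_step: Callable[[List[float]], None],
    rounds: int,
    d: int,
    n: int,
    rng: random.Random,
) -> List[float]:
    """Alternate d image updates with n scene updates for a number of rounds."""
    dataset = list(originals)          # working training set, starts unedited
    order = list(range(len(dataset)))
    rng.shuffle(order)                 # view order is fixed once, before training
    cursor = 0
    for _ in range(rounds):
        for _ in range(d):             # image updates
            v = order[cursor % len(order)]
            cursor += 1
            # Edit the current render of view v, conditioned on its original image.
            dataset[v] = edit(render(v), originals[v])
        for _ in range(n):             # scene updates on the mixed old/new data
            train_step(dataset)
    return dataset


class ToyScene:
    """Hypothetical stand-in for a NeRF: the whole scene is one scalar."""

    def __init__(self, value: float, rng: random.Random) -> None:
        self.value = value
        self.rng = rng

    def render(self, view: int) -> float:
        return self.value              # every view sees the same scene state

    def train_step(self, dataset: List[float], lr: float = 0.1) -> None:
        target = self.rng.choice(dataset)           # sample supervision at random
        self.value += lr * (target - self.value)    # move toward the sampled target


rng = random.Random(0)
scene = ToyScene(value=0.0, rng=rng)
originals = [0.0] * 8                  # eight unedited views


def toy_edit(current: float, original: float) -> float:
    # Stand-in for InstructPix2Pix that ignores its inputs: each call lands near
    # 1.0 but not exactly, mimicking edits that disagree slightly between views.
    return 1.0 + 0.2 * (rng.random() - 0.5)


edited = iterative_dataset_update(
    originals, scene.render, toy_edit, scene.train_step,
    rounds=40, d=1, n=10, rng=rng,
)
print(round(scene.value, 2), [round(x, 2) for x in edited])
```

In this toy setting the scene starts at the unedited value, is trained on a mix of old and edited targets during the first rounds, and ends up inside the narrow band spanned by the slightly disagreeing edits, which mirrors the averaging behaviour the section attributes to Iterative DU.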

Early in training this procedure can still produce an inconsistent 3D scene, but as the iterations proceed it converges to a consistent one.

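IV. Experimental Results

The 3D scenes are represented with the Nerfstudio framework ², and every edit requires its own training run on the scene.

[Figure: visualization of the training process]

[Figure: comparison of different methods]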

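V. Summary

Instruct-NeRF2NeRF edits the NeRF's training data with a pretrained InstructPix2Pix and then continues training the NeRF on the edited views in an Iterative DU fashion. This achieves 3D scene editing while preserving the coherence and realism of the scene. ³

Instruct-NeRF2NeRF in fact relies on a trick to handle 3D consistency. Since all of the NeRF's training data is kept, why not edit all of it and then train the NeRF? Because the original training data guarantees a consistent 3D scene, whereas the multi-view images edited by InstructPix2Pix most likely do not form one. Training therefore updates the dataset iteratively, so that the NeRF gradually converges to a consistent 3D scene.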

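Instruct-NeRF2NeRF also has some limitations:

Instruct-NeRF2NeRF can only edit one view at a time, so artifacts may appear;
InstructPix2Pix sometimes produces a poor edit, in which case the Instruct-NeRF2NeRF edit goes wrong as well;
Even when the InstructPix2Pix edit succeeds, the Instruct-NeRF2NeRF edit may still be unsatisfactory.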

VI. Reproduction

Instruct-NeRF2NeRF is built on Nerfstudio:

Platform: AutoDL
GPU: RTX 4090 24GB
Image: PyTorch 2.0.0, Python 3.8 (ubuntu20.04), CUDA 11.8
Source code: https://github.com/ayaanzhaque/instruct-nerf2nerf

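Experiment log:

First follow the tutorial to create the nerfstudio environment and install the dependencies; it is enough to go as far as conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit;
Then clone the Instruct-NeRF2NeRF repository and update the components and packages;
At this point, running ns-train -h to check the installation raises a TypeError.

Nerfstudio first has to be installed inside the instruct-nerf2nerf folder ⁴; after that the check succeeds: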

(nerfstudio) root@autodl-container-9050458ceb-3f1684be:~/instruct-nerf2nerf/nerfstudio# ns-train -h
usage: ns-train [-h]
                {depth-nerfacto,dnerf,gaussian-splatting,generfacto,in2n,in2n-small,in2n-tiny,instant-ngp,instant-ngp-bounded,mipnerf,nerfacto,nerfact
o-big,nerfacto-huge,neus,neus-facto,phototourism,semantic-nerfw,tensorf,vanilla-nerf,kplanes,kplanes-dynamic,lerf,lerf-big,lerf-lite,nerfplayer-nerfac
to,nerfplayer-ngp,tetra-nerf,tetra-nerf-original,volinga}

Train a radiance field with nerfstudio. For real captures, we recommend using the nerfacto model.

Nerfstudio allows for customizing your training and eval configs from the CLI in a powerful way, but there are some things to understand.

The most demonstrative and helpful example of the CLI structure is the difference in output between the following commands:

    ns-train -h
    ns-train nerfacto -h nerfstudio-data
    ns-train nerfacto nerfstudio-data -h

In each of these examples, the -h applies to the previous subcommand (ns-train, nerfacto, and nerfstudio-data).

In the first example, we get the help menu for the ns-train script. In the second example, we get the help menu for the nerfacto model. In the third 
example, we get the help menu for the nerfstudio-data dataparser.

With our scripts, your arguments will apply to the preceding subcommand in your command, and thus where you put your arguments matters! Any optional 
arguments you discover from running

    ns-train nerfacto -h nerfstudio-data

need to come directly after the nerfacto subcommand, since these optional arguments only belong to the nerfacto subcommand:

    ns-train nerfacto {nerfacto optional args} nerfstudio-data

╭─ arguments ─────────────────────────────────────────────────────────────╮ ╭─ subcommands ──────────────────────────────────────────────────────────╮
│ -h, --help        show this help message and exit                       │ │ {depth-nerfacto,dnerf,gaussian-splatting,generfacto,in2n,in2n-small,i… │
╰─────────────────────────────────────────────────────────────────────────╯ │     depth-nerfacto                                                     │
                                                                            │                   Nerfacto with depth supervision.                     │
                                                                            │     dnerf         Dynamic-NeRF model. (slow)                           │
                                                                            │     gaussian-splatting                                                 │
                                                                            │                   Gaussian Splatting model                             │
                                                                            │     generfacto    Generative Text to NeRF model                        │
                                                                            │     in2n          Instruct-NeRF2NeRF primary method: uses LPIPS, IP2P  │
                                                                            │                   at full precision                                    │
                                                                            │     in2n-small    Instruct-NeRF2NeRF small method, uses LPIPs, IP2P at │
                                                                            │                   half precision                                       │
                                                                            │     in2n-tiny     Instruct-NeRF2NeRF tiny method, does not use LPIPs,  │
                                                                            │                   IP2P at half precision                               │
                                                                            │     instant-ngp   Implementation of Instant-NGP. Recommended real-time │
                                                                            │                   model for unbounded scenes.                          │
                                                                            │     instant-ngp-bounded                                                │
                                                                            │                   Implementation of Instant-NGP. Recommended for       │
                                                                            │                   bounded real and synthetic scenes                    │
                                                                            │     mipnerf       High quality model for bounded scenes. (slow)        │
                                                                            │     nerfacto      Recommended real-time model tuned for real captures. │
                                                                            │                   This model will be continually updated.              │
                                                                            │     nerfacto-big                                                       │
                                                                            │     nerfacto-huge                                                      │
                                                                            │     neus          Implementation of NeuS. (slow)                       │
                                                                            │     neus-facto    Implementation of NeuS-Facto. (slow)                 │
                                                                            │     phototourism  Uses the Phototourism data.                          │
                                                                            │     semantic-nerfw                                                     │
                                                                            │                   Predicts semantic segmentations and filters out      │
                                                                            │                   transient objects.                                   │
                                                                            │     tensorf       tensorf                                              │
                                                                            │     vanilla-nerf  Original NeRF model. (slow)                          │
                                                                            │     kplanes       [External] K-Planes model tuned to static blender    │
                                                                            │                   scenes                                               │
                                                                            │     kplanes-dynamic                                                    │
                                                                            │                   [External] K-Planes model tuned to dynamic DNeRF     │
                                                                            │                   scenes                                               │
                                                                            │     lerf          [External] LERF with OpenCLIP ViT-B/16, used in      │
                                                                            │                   paper                                                │
                                                                            │     lerf-big      [External] LERF with OpenCLIP ViT-L/14               │
                                                                            │     lerf-lite     [External] LERF with smaller network and less LERF   │
                                                                            │                   samples                                              │
                                                                            │     nerfplayer-nerfacto                                                │
                                                                            │                   [External] NeRFPlayer with nerfacto backbone         │
                                                                            │     nerfplayer-ngp                                                     │
                                                                            │                   [External] NeRFPlayer with instang-ngp-bounded       │
                                                                            │                   backbone                                             │
                                                                            │     tetra-nerf    [External] Tetra-NeRF. Different sampler - faster    │
                                                                            │                   and better                                           │
                                                                            │     tetra-nerf-original                                                │
                                                                            │                   [External] Tetra-NeRF. Official implementation from  │
                                                                            │                   the paper                                            │
                                                                            │     volinga       [External] Real-time rendering model from Volinga.   │
                                                                            │                   Directly exportable to NVOL format at                │
                                                                            │                   https://volinga.ai/                                  │
                                                                            ╰────────────────────────────────────────────────────────────────────────╯

Once Nerfstudio is installed, training can start. Train on the bear dataset with: ns-train nerfacto --data data/bear

During training, open https://viewer.nerf.studio/versions/23-05-15-1/?websocket_url=ws://localhost:7007 to monitor progress in real time ⁵. Note that when training on a server, the viewer port has to be forwarded before the training process can be watched ⁶ ⁷.
Once the NeRF scene has been trained, it can be edited: ns-train in2n --data data/bear --load-dir outputs/bear/nerfacto/2023-12-17_230904/nerfstudio_models --pipeline.prompt "Turn the bear into a polar bear" --pipeline.guidance-scale 7.5 --pipeline.image-guidance-scale 1.5. GPU memory is limited, however, and loading the full model runs out of memory ⁸ ⁹ ¹⁰.

The authors anticipated this and provide the models in2n-small and in2n-tiny, which use less memory at the cost of quality: ns-train in2n-small --data data/bear --load-dir outputs/bear/nerfacto/2023-12-17_230904/nerfstudio_models --pipeline.prompt "Turn the bear into a polar bear" --pipeline.guidance-scale 7.5 --pipeline.image-guidance-scale 1.5

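Results:

Training the original NeRF scene takes about 1 h for 30k iterations.
To inspect the result visually after training, load the viewer with ns-viewer --load-config outputs/bear/nerfacto/2023-12-17_230904/config.yml ¹¹. In the viewer, choose final-path under LOAD PATH and click RENDER to copy the command: ns-render camera-path --load-config outputs/bear/nerfacto/2023-12-17_230904/config.yml --camera-path-filename data/bear/camera_paths/2023-12-17_230904.json --output-path renders/bear/2023-12-17_230904.mp4. The original scene was trained with the full NeRF model, whose parameter count is too large for the available GPU memory, so the video could not be rendered and only a screenshot was taken for reference.
Editing the 3D scene with the in2n-small model has fully converged by 4k iterations, so there is no need to train further (a full edit would run to 60k steps, which is unnecessary); this takes about 2 h.
The edited scene can again be visualized with ns-viewer and rendered to video with ns-render. Because of the GPU memory limit the video could not be rendered here either, so only a screenshot was taken for reference.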

¹ MAV3D: Generating 3D dynamic scenes from text descriptions
² Tancik M, Weber E, Ng E, et al. Nerfstudio: A modular framework for neural radiance field development[C]//ACM SIGGRAPH 2023 Conference Proceedings. 2023: 1-12.
³ 3D face swapping with one line of text! UC Berkeley proposes "Chat-NeRF": blockbuster-level rendering from a single sentence
⁴ Fresh install error #72
⁵ nerfstudio-project | nerfstudio # 2-training-your-first-model
⁶ nerfstudio | Using the viewer # training-on-a-remote-machine
⁷ AutoDL Help Docs | VSCode Remote Development
⁸ RuntimeError: CUDA out of memory. Tried to allocate 12.50 MiB (GPU 0; 10.92 GiB total capacity; 8.57 MiB already allocated; 9.28 GiB free; 4.68 MiB cached) #16417
⁹ How to avoid "CUDA out of memory" in PyTorch
¹⁰ How to avoid "CUDA out of memory" in PyTorch
¹¹ nerfstudio-project | nerfstudio # Visualize existing run