Paper1 CLIP2Scene: Towards Label-Efficient 3D Scene Understanding by CLIP
摘要原文: Contrastive Language-Image Pre-training (CLIP) achieves promising results in 2D zero-shot and few-shot learning. Despite the impressive performance in 2D, applying CLIP to help the learning in 3D scene understanding has yet to be explored. In this paper, we make the first attempt to investigate how CLIP knowledge benefits 3D scene understanding. We propose CLIP2Scene, a simple yet effective framework that transfers CLIP knowledge from 2D image-text pre-trained models to a 3D point cloud network. We show that the pre-trained 3D network yields impressive performance on various downstream tasks, i.e., annotation-free and fine-tuning with labelled data for semantic segmentation. Specifically, built upon CLIP, we design a Semantic-driven Cross-modal Contrastive Learning framework that pre-trains a 3D network via semantic and spatial-temporal consistency regularization. For the former, we first leverage CLIP’s text semantics to select the positive and negative point samples and then employ the contrastive loss to train the 3D network. In terms of the latter, we force the consistency between the temporally coherent point cloud features and their corresponding image features. We conduct experiments on SemanticKITTI, nuScenes, and ScanNet. For the first time, our pre-trained network achieves annotation-free 3D semantic segmentation with 20.8% and 25.08% mIoU on nuScenes and ScanNet, respectively. When fine-tuned with 1% or 100% labelled data, our method significantly outperforms other self-supervised methods, with improvements of 8% and 1% mIoU, respectively. Furthermore, we demonstrate the generalizability for handling cross-domain datasets. Code is publicly available.
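To make the semantic-consistency idea above concrete, below is a minimal sketch (not the authors' released code) of a contrastive objective in which points receive pseudo-labels from CLIP text prompts via their paired image pixels, and each point feature is pulled toward the text embedding of its pseudo-class and pushed away from the others; tensor shapes and the temperature are illustrative assumptions.
```python
import torch
import torch.nn.functional as F

def semantic_contrastive_loss(point_feats, text_embeds, pseudo_labels, tau=0.07):
    """point_feats: [N, D] 3D-network features projected into CLIP space.
    text_embeds:  [C, D] CLIP text embeddings, one per class prompt.
    pseudo_labels:[N]    class indices assigned to points via paired image pixels."""
    point_feats = F.normalize(point_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = point_feats @ text_embeds.t() / tau   # [N, C] scaled cosine similarities
    # positives: the text embedding of the point's pseudo-class; negatives: all other classes
    return F.cross_entropy(logits, pseudo_labels)

# toy usage with random tensors
loss = semantic_contrastive_loss(torch.randn(1024, 512), torch.randn(16, 512),
                                 torch.randint(0, 16, (1024,)))
```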
Paper2 Single View Scene Scale Estimation Using Scale Field
摘要原文: In this paper, we propose a single image scale estimation method based on a novel scale field representation. A scale field defines the local pixel-to-metric conversion ratio along the gravity direction on all the ground pixels. This representation resolves the ambiguity in camera parameters, allowing us to use a simple yet effective way to collect scale annotations on arbitrary images from human annotators. By training our model on calibrated panoramic image data and the in-the-wild human annotated data, our single image scene scale estimation network generates robust scale fields on a variety of images, which can be utilized in various 3D understanding and scale-aware image editing applications.
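As a rough illustration of how such a scale field could be consumed downstream (the field itself would come from the proposed network; the array name, image size, and the constant value below are assumptions), the local ratio simply converts a pixel extent along gravity into metres:
```python
import numpy as np

def metric_height(scale_field, ground_uv, pixel_height):
    """scale_field[v, u]: metres-per-pixel along gravity at ground pixel (u, v).
    pixel_height: extent of the object in pixels, measured up from that ground pixel."""
    u, v = ground_uv
    return pixel_height * scale_field[v, u]

scale_field = np.full((480, 640), 0.004)                # hypothetical field: 4 mm per pixel
print(metric_height(scale_field, (320, 400), 430.0))    # -> about 1.72 metres
```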
Paper3 BUOL: A Bottom-Up Framework With Occupancy-Aware Lifting for Panoptic 3D Scene Reconstruction From a Single Image
摘要原文: Understanding and modeling the 3D scene from a single image is a practical problem. A recent advance proposes a panoptic 3D scene reconstruction task that performs both 3D reconstruction and 3D panoptic segmentation from a single image. Although having made substantial progress, recent works only focus on top-down approaches that fill 2D instances into 3D voxels according to estimated depth, which hinders their performance by two ambiguities. (1) instance-channel ambiguity: The variable ids of instances in each scene lead to ambiguity during filling voxel channels with 2D information, confusing the following 3D refinement. (2) voxel-reconstruction ambiguity: 2D-to-3D lifting with estimated single view depth only propagates 2D information onto the surface of 3D regions, leading to ambiguity during the reconstruction of regions behind the frontal view surface. In this paper, we propose BUOL, a Bottom-Up framework with Occupancy-aware Lifting to address the two issues for panoptic 3D scene reconstruction from a single image. For instance-channel ambiguity, a bottom-up framework lifts 2D information to 3D voxels based on deterministic semantic assignments rather than arbitrary instance id assignments. The 3D voxels are then refined and grouped into 3D instances according to the predicted 2D instance centers. For voxel-reconstruction ambiguity, the estimated multi-plane occupancy is leveraged together with depth to fill the whole regions of things and stuff. Our method shows a tremendous performance advantage over state-of-the-art methods on the synthetic dataset 3D-Front and the real-world dataset Matterport3D. Code and models will be released.
Paper4 Learning To Generate Language-Supervised and Open-Vocabulary Scene Graph Using Pre-Trained Visual-Semantic Space
摘要原文: Scene graph generation (SGG) aims to abstract an image into a graph structure, by representing objects as graph nodes and their relations as labeled edges. However, two knotty obstacles limit the practicability of current SGG methods in real-world scenarios: 1) training SGG models requires time-consuming ground-truth annotations, and 2) the closed-set object categories make the SGG models limited in their ability to recognize novel objects outside of training corpora. To address these issues, we novelly exploit a powerful pre-trained visual-semantic space (VSS) to trigger language-supervised and open-vocabulary SGG in a simple yet effective manner. Specifically, cheap scene graph supervision data can be easily obtained by parsing image language descriptions into semantic graphs. Next, the noun phrases on such semantic graphs are directly grounded over image regions through region-word alignment in the pre-trained VSS. In this way, we enable open-vocabulary object detection by performing object category name grounding with a text prompt in this VSS. On the basis of visually-grounded objects, the relation representations are naturally built for relation recognition, pursuing open-vocabulary SGG. We validate our proposed approach with extensive experiments on the Visual Genome benchmark across various SGG scenarios (i.e., supervised / language-supervised, closed-set / open-vocabulary). Consistent superior performances are achieved compared with existing methods, demonstrating the potential of exploiting pre-trained VSS for SGG in more practical scenarios.
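The grounding step can be pictured as a plain similarity match in the pre-trained visual-semantic space; the sketch below (an assumption about shapes and naming, not the paper's implementation) grounds each parsed noun phrase to its highest-scoring region:
```python
import torch
import torch.nn.functional as F

def ground_phrases(region_feats, phrase_feats):
    """region_feats: [R, D] region embeddings from the visual-semantic space.
    phrase_feats:   [P, D] embeddings of noun phrases parsed from the caption.
    Returns, for each phrase, the index of its best-matching region and the score."""
    sim = F.normalize(phrase_feats, dim=-1) @ F.normalize(region_feats, dim=-1).t()  # [P, R]
    scores, region_idx = sim.max(dim=-1)
    return region_idx, scores

# toy usage: 36 candidate regions, 5 noun phrases
idx, scores = ground_phrases(torch.randn(36, 512), torch.randn(5, 512))
```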
Paper5 Patch-Based 3D Natural Scene Generation From a Single Example
摘要原文: We target a 3D generative model for general natural scenes that are typically unique and intricate. Lacking the necessary volumes of training data, along with the difficulties of having ad hoc designs in the presence of varying scene characteristics, renders existing setups intractable. Inspired by classical patch-based image models, we advocate for synthesizing 3D scenes at the patch level, given a single example. At the core of this work lie important algorithmic designs w.r.t. the scene representation and generative patch nearest-neighbor module that address unique challenges arising from lifting the classical 2D patch-based framework to 3D generation. These design choices, on a collective level, contribute to a robust, effective, and efficient model that can generate high-quality general natural scenes with both realistic geometric structure and visual appearance, in large quantities and varieties, as demonstrated upon a variety of exemplar scenes. Data and code can be found at http://wyysf-98.github.io/Sin3DGen.
Paper6 Chat2Map: Efficient Scene Mapping From Multi-Ego Conversations
摘要原文: Can conversational videos captured from multiple egocentric viewpoints reveal the map of a scene in a cost-efficient way? We seek to answer this question by proposing a new problem: efficiently building the map of a previously unseen 3D environment by exploiting shared information in the egocentric audio-visual observations of participants in a natural conversation. Our hypothesis is that as multiple people (“egos”) move in a scene and talk among themselves, they receive rich audio-visual cues that can help uncover the unseen areas of the scene. Given the high cost of continuously processing egocentric visual streams, we further explore how to actively coordinate the sampling of visual information, so as to minimize redundancy and reduce power use. To that end, we present an audio-visual deep reinforcement learning approach that works with our shared scene mapper to selectively turn on the camera to efficiently chart out the space. We evaluate the approach using a state-of-the-art audio-visual simulator for 3D scenes as well as real-world video. Our model outperforms previous state-of-the-art mapping methods, and achieves an excellent cost-accuracy tradeoff. Project: https://vision.cs.utexas.edu/projects/chat2map.
Paper7 Putting People in Their Place: Affordance-Aware Human Insertion Into Scenes
摘要原文: We study the problem of inferring scene affordances by presenting a method for realistically inserting people into scenes. Given a scene image with a marked region and an image of a person, we insert the person into the scene while respecting the scene affordances. Our model can infer the set of realistic poses given the scene context, re-pose the reference person, and harmonize the composition. We set up the task in a self-supervised fashion by learning to re-pose humans in video clips. We train a large-scale diffusion model on a dataset of 2.4M video clips that produces diverse plausible poses while respecting the scene context. Given the learned human-scene composition, our model can also hallucinate realistic people and scenes when prompted without conditioning and also enables interactive editing. We conduct quantitative evaluation and show that our method synthesizes more realistic human appearance and more natural human-scene interactions when compared to prior work.
Paper8 Semantic Scene Completion With Cleaner Self
摘要原文: Semantic Scene Completion (SSC) transforms an image of single-view depth and/or RGB 2D pixels into 3D voxels, each of whose semantic labels are predicted. SSC is a well-known ill-posed problem as the prediction model has to “imagine” what is behind the visible surface, which is usually represented by Truncated Signed Distance Function (TSDF). Due to the sensory imperfection of the depth camera, most existing methods based on the noisy TSDF estimated from depth values suffer from 1) incomplete volumetric predictions and 2) confused semantic labels. To this end, we use the ground-truth 3D voxels to generate a perfect visible surface, called TSDF-CAD, and then train a “cleaner” SSC model. As the model is noise-free, it is expected to focus more on the “imagination” of unseen voxels. Then, we propose to distill the intermediate “cleaner” knowledge into another model with noisy TSDF input. In particular, we use the 3D occupancy feature and the semantic relations of the “cleaner self” to supervise the counterparts of the “noisy self” to respectively address the above two incorrect predictions. Experimental results validate that the proposed method improves the noisy counterparts with 3.1% IoU and 2.2% mIoU for measuring scene completion and SSC, and also achieves new state-of-the-art accuracy on the popular NYU dataset. The code is available at https://github.com/fereenwong/CleanerS.
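A hedged sketch of the two distillation signals described above: the noisy-TSDF student is supervised to match the clean-TSDF teacher's 3D occupancy features and its pairwise semantic relations. The exact feature choices, the affinity form, and the loss weights are assumptions made for illustration.
```python
import torch
import torch.nn.functional as F

def relation_matrix(feats, tau=1.0):
    """feats: [N, D] per-voxel semantic features -> row-normalised affinity matrix [N, N]."""
    feats = F.normalize(feats, dim=-1)
    return F.softmax(feats @ feats.t() / tau, dim=-1)

def cleaner_self_distill(stu_occ, tea_occ, stu_sem, tea_sem, w_occ=1.0, w_rel=1.0):
    # (i) mimic the teacher's 3D occupancy features
    loss_occ = F.mse_loss(stu_occ, tea_occ.detach())
    # (ii) mimic the teacher's pairwise semantic relations
    loss_rel = F.kl_div(relation_matrix(stu_sem).log(),
                        relation_matrix(tea_sem.detach()), reduction='batchmean')
    return w_occ * loss_occ + w_rel * loss_rel

loss = cleaner_self_distill(torch.randn(256, 64), torch.randn(256, 64),
                            torch.randn(256, 32), torch.randn(256, 32))
```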
Paper9 Neural Scene Chronology
摘要原文: In this work, we aim to reconstruct a time-varying 3D model, capable of rendering photo-realistic renderings with independent control of viewpoint, illumination, and time, from Internet photos of large-scale landmarks. The core challenges are twofold. First, different types of temporal changes, such as illumination and changes to the underlying scene itself (such as replacing one graffiti artwork with another) are entangled together in the imagery. Second, scene-level temporal changes are often discrete and sporadic over time, rather than continuous. To tackle these problems, we propose a new scene representation equipped with a novel temporal step function encoding method that can model discrete scene-level content changes as piece-wise constant functions over time. Specifically, we represent the scene as a space-time radiance field with a per-image illumination embedding, where temporally-varying scene changes are encoded using a set of learned step functions. To facilitate our task of chronology reconstruction from Internet imagery, we also collect a new dataset of four scenes that exhibit various changes over time. We demonstrate that our method exhibits state-of-the-art view synthesis results on this dataset, while achieving independent control of viewpoint, time, and illumination. Code and data are available at https://zju3dv.github.io/NeuSC/.
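One plausible reading of the learned step-function encoding is a bank of steep sigmoids at learnable transition times, which makes the time code nearly piece-wise constant; the module below is an illustrative sketch (dimensions, steepness, and initialisation are assumptions), not the paper's exact encoder.
```python
import torch
import torch.nn as nn

class StepTimeEncoding(nn.Module):
    def __init__(self, num_steps=64, beta=100.0):
        super().__init__()
        self.shifts = nn.Parameter(torch.rand(num_steps))  # learned transition times in [0, 1]
        self.beta = beta                                    # steepness -> near-step behaviour

    def forward(self, t):
        """t: [N, 1] normalised timestamps -> [N, num_steps] nearly piece-wise-constant code."""
        return torch.sigmoid(self.beta * (t - self.shifts))

enc = StepTimeEncoding()
codes = enc(torch.rand(8, 1))   # fed alongside position into the radiance MLP
```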
Paper10 SceneTrilogy: On Human Scene-Sketch and Its Complementarity With Photo and Text
摘要原文: In this paper, we extend scene understanding to include that of human sketch. The result is a complete trilogy of scene representation from three diverse and complementary modalities – sketch, photo, and text. Instead of learning a rigid three-way embedding and be done with it, we focus on learning a flexible joint embedding that fully supports the “optionality” that this complementarity brings. Our embedding supports optionality on two axes: (i) optionality across modalities – use any combination of modalities as query for downstream tasks like retrieval, (ii) optionality across tasks – simultaneously utilising the embedding for either discriminative (e.g., retrieval) or generative tasks (e.g., captioning). This provides flexibility to end-users by exploiting the best of each modality, therefore serving the very purpose behind our proposal of a trilogy in the first place. First, a combination of information-bottleneck and conditional invertible neural networks disentangle the modality-specific component from modality-agnostic in sketch, photo, and text. Second, the modality-agnostic instances from sketch, photo, and text are synergised using a modified cross-attention. Once learned, we show our embedding can accommodate a multi-facet of scene-related tasks, including those enabled for the first time by the inclusion of sketch, all without any task-specific modifications. Project Page: http://www.pinakinathc.me/scenetrilogy
Paper11 PLA: Language-Driven Open-Vocabulary 3D Scene Understanding
摘要原文: Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space. The recent breakthrough of 2D open-vocabulary perception is largely driven by Internet-scale paired image-text data with rich vocabulary concepts. However, this success cannot be directly transferred to 3D scenarios due to the inaccessibility of large-scale 3D-text pairs. To this end, we propose to distill knowledge encoded in pre-trained vision-language (VL) foundation models through captioning multi-view images from 3D, which allows explicitly associating 3D and semantic-rich captions. Further, to foster coarse-to-fine visual-semantic representation learning from captions, we design hierarchical 3D-caption pairs, leveraging geometric constraints between 3D scenes and multi-view images. Finally, by employing contrastive learning, the model learns language-aware embeddings that connect 3D and text for open-vocabulary tasks. Our method not only remarkably outperforms baseline methods by 25.8% ~ 44.7% hIoU and 14.5% ~ 50.4% hAP_50 in open-vocabulary semantic and instance segmentation, but also shows robust transferability on challenging zero-shot domain transfer tasks. See the project website at https://dingry.github.io/projects/PLA.
Paper12 Learning Human Mesh Recovery in 3D Scenes
摘要原文: We present a novel method for recovering the absolute pose and shape of a human in a pre-scanned scene given a single image. Unlike previous methods that perform scene-aware mesh optimization, we propose to first estimate absolute position and dense scene contacts with a sparse 3D CNN, and later enhance a pretrained human mesh recovery network by cross-attention with the derived 3D scene cues. Joint learning on images and scene geometry enables our method to reduce the ambiguity caused by depth and occlusion, resulting in more reasonable global postures and contacts. Encoding scene-aware cues in the network also allows the proposed method to be optimization-free, and opens up the opportunity for real-time applications. The experiments show that the proposed network is capable of recovering accurate and physically-plausible meshes by a single forward pass and outperforms state-of-the-art methods in terms of both accuracy and speed. Code is available on our project page: https://zju3dv.github.io/sahmr/.
Paper13 Incremental 3D Semantic Scene Graph Prediction From RGB Sequences
摘要原文: 3D semantic scene graphs are a powerful holistic representation as they describe the individual objects and depict the relation between them. They are compact high-level graphs that enable many tasks requiring scene reasoning. In real-world settings, existing 3D estimation methods produce robust predictions that mostly rely on dense inputs. In this work, we propose a real-time framework that incrementally builds a consistent 3D semantic scene graph of a scene given an RGB image sequence. Our method consists of a novel incremental entity estimation pipeline and a scene graph prediction network. The proposed pipeline simultaneously reconstructs a sparse point map and fuses entity estimation from the input images. The proposed network estimates 3D semantic scene graphs with iterative message passing using multi-view and geometric features extracted from the scene entities. Extensive experiments on the 3RScan dataset show the effectiveness of the proposed method in this challenging task, outperforming state-of-the-art approaches.
Paper14 HexPlane: A Fast Representation for Dynamic Scenes
摘要原文: Modeling and re-rendering dynamic 3D scenes is a challenging task in 3D vision. Prior approaches build on NeRF and rely on implicit representations. This is slow since it requires many MLP evaluations, constraining real-world applications. We show that dynamic 3D scenes can be explicitly represented by six planes of learned features, leading to an elegant solution we call HexPlane. A HexPlane computes features for points in spacetime by fusing vectors extracted from each plane, which is highly efficient. Pairing a HexPlane with a tiny MLP to regress output colors and training via volume rendering gives impressive results for novel view synthesis on dynamic scenes, matching the image quality of prior work but reducing training time by more than 100x. Extensive ablations confirm our HexPlane design and show that it is robust to different feature fusion mechanisms, coordinate systems, and decoding mechanisms. HexPlane is a simple and effective solution for representing 4D volumes, and we hope it can broadly contribute to modeling spacetime for dynamic 3D scenes.
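The core query can be sketched compactly: bilinearly sample each of the six planes at the point's projected 2D coordinates and fuse complementary spatial/spatio-temporal pairs before a small MLP. The shared square plane size, channel count, and product-based fusion below are assumptions used to illustrate one of the fusion variants the abstract alludes to.
```python
import torch
import torch.nn.functional as F

C, S = 16, 64                        # feature channels and a shared (assumed) plane resolution
names = ('xy', 'xz', 'yz', 'xt', 'yt', 'zt')
planes = {n: torch.randn(1, C, S, S) for n in names}   # six feature planes (learnable in practice)

def sample(plane, uv):
    """plane: [1, C, H, W]; uv: [P, 2] in [-1, 1] -> [P, C] bilinear samples."""
    grid = uv.view(1, -1, 1, 2)
    return F.grid_sample(plane, grid, align_corners=True)[0, :, :, 0].t()

def hexplane_features(xyzt):
    """xyzt: [P, 4] normalised space-time coordinates in [-1, 1] -> [P, 3*C] fused features."""
    x, y, z, t = xyzt.unbind(-1)
    pair = lambda n, a, b: sample(planes[n], torch.stack([a, b], dim=-1))
    return torch.cat([pair('xy', x, y) * pair('zt', z, t),   # each spatial plane is paired
                      pair('xz', x, z) * pair('yt', y, t),   # with its complementary
                      pair('yz', y, z) * pair('xt', x, t)],  # space-time plane
                     dim=-1)                                  # then fed to a tiny colour/density MLP

feats = hexplane_features(torch.rand(1024, 4) * 2 - 1)
```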
Paper15 Indiscernible Object Counting in Underwater Scenes
摘要原文: Recently, indiscernible scene understanding has attracted a lot of attention in the vision community. We further advance the frontier of this field by systematically studying a new challenge named indiscernible object counting (IOC), the goal of which is to count objects that are blended with respect to their surroundings. Due to a lack of appropriate IOC datasets, we present a large-scale dataset IOCfish5K which contains a total of 5,637 high-resolution images and 659,024 annotated center points. Our dataset consists of a large number of indiscernible objects (mainly fish) in underwater scenes, making the annotation process all the more challenging. IOCfish5K is superior to existing datasets with indiscernible scenes because of its larger scale, higher image resolutions, more annotations, and denser scenes. All these aspects make it the most challenging dataset for IOC so far, supporting progress in this area. For benchmarking purposes, we select 14 mainstream methods for object counting and carefully evaluate them on IOCfish5K. Furthermore, we propose IOCFormer, a new strong baseline that combines density and regression branches in a unified framework and can effectively tackle object counting under concealed scenes. Experiments show that IOCFormer achieves state-of-the-art scores on IOCfish5K.
Paper16 NeuralField-LDM: Scene Generation With Hierarchical Latent Diffusion Models
摘要原文: Automatically generating high-quality real world 3D scenes is of enormous interest for applications such as virtual reality and robotics simulation. Towards this goal, we introduce NeuralField-LDM, a generative model capable of synthesizing complex 3D environments. We leverage Latent Diffusion Models that have been successfully utilized for efficient high-quality 2D content creation. We first train a scene auto-encoder to express a set of image and pose pairs as a neural field, represented as density and feature voxel grids that can be projected to produce novel views of the scene. To further compress this representation, we train a latent-autoencoder that maps the voxel grids to a set of latent representations. A hierarchical diffusion model is then fit to the latents to complete the scene generation pipeline. We achieve a substantial improvement over existing state-of-the-art scene generation models. Additionally, we show how NeuralField-LDM can be used for a variety of 3D content creation applications, including conditional scene generation, scene inpainting and scene style manipulation.
Paper17 Fast Monocular Scene Reconstruction With Global-Sparse Local-Dense Grids
摘要原文: Indoor scene reconstruction from monocular images has long been sought after by augmented reality and robotics developers. Recent advances in neural field representations and monocular priors have led to remarkable results in scene-level surface reconstructions. The reliance on Multilayer Perceptrons (MLP), however, significantly limits speed in training and rendering. In this work, we propose to directly use signed distance function (SDF) in sparse voxel block grids for fast and accurate scene reconstruction without MLPs. Our globally sparse and locally dense data structure exploits surfaces’ spatial sparsity, enables cache-friendly queries, and allows direct extensions to multi-modal data such as color and semantic labels. To apply this representation to monocular scene reconstruction, we develop a scale calibration algorithm for fast geometric initialization from monocular depth priors. We apply differentiable volume rendering from this initialization to refine details with fast convergence. We also introduce efficient high-dimensional Continuous Random Fields (CRFs) to further exploit the semantic-geometry consistency between scene objects. Experiments show that our approach is 10x faster in training and 100x faster in rendering while achieving comparable accuracy to state-of-the-art neural implicit methods.
Paper18 Depth Estimation From Indoor Panoramas With Neural Scene Representation
摘要原文: Depth estimation from indoor panoramas is challenging due to the equirectangular distortions of panoramas and inaccurate matching. In this paper, we propose a practical framework to improve the accuracy and efficiency of depth estimation from multi-view indoor panoramic images with the Neural Radiance Field technology. Specifically, we develop two networks to implicitly learn the Signed Distance Function for depth measurements and the radiance field from panoramas. We also introduce a novel spherical position embedding scheme to achieve high accuracy. For better convergence, we propose an initialization method for the network weights based on the Manhattan World Assumption. Furthermore, we devise a geometric consistency loss, leveraging the surface normal, to further refine the depth estimation. The experimental results demonstrate that our proposed method outperforms state-of-the-art works by a large margin in both quantitative and qualitative evaluations. Our source code is available at https://github.com/WJ-Chang-42/IndoorPanoDepth.
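One plausible form of a spherical position embedding (the exact scheme is not spelled out in the abstract, so the coordinate conversion and frequency choices below are assumptions) encodes a sample point through sinusoids of its spherical coordinates:
```python
import torch

def spherical_embedding(xyz, num_freqs=6):
    """xyz: [N, 3] sample points -> [N, 3 * 2 * num_freqs] sinusoidal embedding of (theta, phi, r)."""
    x, y, z = xyz.unbind(-1)
    r = xyz.norm(dim=-1).clamp(min=1e-8)
    theta = torch.acos((z / r).clamp(-1.0, 1.0))   # polar angle
    phi = torch.atan2(y, x)                        # azimuth
    coords = torch.stack([theta, phi, r], dim=-1)  # [N, 3]
    freqs = 2.0 ** torch.arange(num_freqs)         # geometric frequency ladder
    angles = coords[..., None] * freqs             # [N, 3, num_freqs]
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)

emb = spherical_embedding(torch.randn(4096, 3))
```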
Paper19 SGLoc: Scene Geometry Encoding for Outdoor LiDAR Localization
摘要原文: LiDAR-based absolute pose regression estimates the global pose through a deep network in an end-to-end manner, achieving impressive results in learning-based localization. However, the accuracy of existing methods still has room to improve due to the difficulty of effectively encoding the scene geometry and the unsatisfactory quality of the data. In this work, we propose a novel LiDAR localization framework, SGLoc, which decouples the pose estimation into point cloud correspondence regression and pose estimation via this correspondence. This decoupling effectively encodes the scene geometry because the decoupled correspondence regression step greatly preserves the scene geometry, leading to significant performance improvement. Apart from this decoupling, we also design a tri-scale spatial feature aggregation module and inter-geometric consistency constraint loss to effectively capture scene geometry. Moreover, we empirically find that the ground truth might be noisy due to GPS/INS measuring errors, greatly reducing the pose estimation performance. Thus, we propose a pose quality evaluation and enhancement method to measure and correct the ground truth pose. Extensive experiments on the Oxford Radar RobotCar and NCLT datasets demonstrate the effectiveness of SGLoc, which outperforms state-of-the-art regression-based localization methods by 68.5% and 67.6% on position accuracy, respectively.
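Once per-point correspondences to the global frame are regressed, the pose itself can be recovered in closed form; the sketch below uses the classical Kabsch/SVD solution as a stand-in for that second stage (the paper's actual solver and any robust estimation are not specified here, so this is an assumption):
```python
import numpy as np

def pose_from_correspondences(src, dst):
    """src: [N, 3] LiDAR points in the sensor frame; dst: [N, 3] regressed global-frame points.
    Returns (R, t) such that dst ~= src @ R.T + t."""
    src_c, dst_c = src - src.mean(0), dst - dst.mean(0)
    U, _, Vt = np.linalg.svd(src_c.T @ dst_c)
    d = np.sign(np.linalg.det(Vt.T @ U.T))          # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(0) - R @ src.mean(0)
    return R, t

# toy check: recover a known rotation about z plus a translation
theta = 0.5
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([1.0, 2.0, 3.0])
src = np.random.default_rng(0).normal(size=(100, 3))
R, t = pose_from_correspondences(src, src @ R_true.T + t_true)   # R ~ R_true, t ~ t_true
```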
Paper20 Turning a CLIP Model Into a Scene Text Detector
摘要原文: The recent large-scale Contrastive Language-Image Pretraining (CLIP) model has shown great potential in various downstream tasks via leveraging the pretrained vision and language knowledge. Scene text, which contains rich textual and visual information, has an inherent connection with a model like CLIP. Recently, pretraining approaches based on vision language models have made effective progress in the field of text detection. In contrast to these works, this paper proposes a new method, termed TCM, focusing on Turning the CLIP Model directly for text detection without a pretraining process. We demonstrate the advantages of the proposed TCM as follows: (1) The underlying principle of our framework can be applied to improve existing scene text detectors. (2) It facilitates the few-shot training capability of existing methods, e.g., by using 10% of labeled data, we significantly improve the performance of the baseline method with an average of 22% in terms of the F-measure on 4 benchmarks. (3) By turning the CLIP model into existing scene text detection methods, we further achieve promising domain adaptation ability. The code will be publicly released at https://github.com/wenwenyu/TCM.
Paper21 VisFusion: Visibility-Aware Online 3D Scene Reconstruction From Videos
摘要原文: We propose VisFusion, a visibility-aware online 3D scene reconstruction approach from posed monocular videos. In particular, we aim to reconstruct the scene from volumetric features. Unlike previous reconstruction methods which aggregate features for each voxel from input views without considering its visibility, we aim to improve the feature fusion by explicitly inferring its visibility from a similarity matrix, computed from its projected features in each image pair. Following previous works, our model is a coarse-to-fine pipeline including a volume sparsification process. Different from their works which sparsify voxels globally with a fixed occupancy threshold, we perform the sparsification on a local feature volume along each visual ray to preserve at least one voxel per ray for more fine details. The sparse local volume is then fused with a global one for online reconstruction. We further propose to predict TSDF in a coarse-to-fine manner by learning its residuals across scales leading to better TSDF predictions. Experimental results on benchmarks show that our method can achieve superior performance with more scene details. Code is available at: https://github.com/huiyu-gao/VisFusion
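The ray-wise sparsification can be pictured as thresholding occupancy scores sampled along each visual ray while always retaining the best-scoring voxel on that ray; the shapes and threshold in this sketch are assumptions:
```python
import torch

def sparsify_along_rays(occupancy, threshold=0.5):
    """occupancy: [num_rays, samples_per_ray] predicted scores -> boolean keep mask of same shape."""
    keep = occupancy > threshold                         # keep confidently occupied voxels
    best = occupancy.argmax(dim=1)                       # best voxel index per ray
    keep[torch.arange(occupancy.shape[0]), best] = True  # guarantee at least one voxel per ray
    return keep

mask = sparsify_along_rays(torch.rand(2048, 48))
```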
Paper22 FREDOM: Fairness Domain Adaptation Approach to Semantic Scene Understanding
摘要原文: Although Domain Adaptation in Semantic Scene Segmentation has shown impressive improvement in recent years, the fairness concerns in the domain adaptation have yet to be well defined and addressed. In addition, fairness is one of the most critical aspects when deploying the segmentation models into human-related real-world applications, e.g., autonomous driving, as any unfair predictions could influence human safety. In this paper, we propose a novel Fairness Domain Adaptation (FREDOM) approach to semantic scene segmentation. In particular, from the proposed formulated fairness objective, a new adaptation framework will be introduced based on the fair treatment of class distributions. Moreover, to generally model the context of structural dependency, a new conditional structural constraint is introduced to impose the consistency of predicted segmentation. Thanks to the proposed Conditional Structure Network, the self-attention mechanism has sufficiently modeled the structural information of segmentation. Through the ablation studies, the proposed method has shown the performance improvement of the segmentation models and promoted fairness in the model predictions. The experimental results on the two standard benchmarks, i.e., SYNTHIA -> Cityscapes and GTA5 -> Cityscapes, have shown that our method achieved State-of-the-Art (SOTA) performance.
Paper23 Devil’s on the Edges: Selective Quad Attention for Scene Graph Generation
摘要原文: Scene graph generation aims to construct a semantic graph structure from an image such that its nodes and edges respectively represent objects and their relationships. One of the major challenges for the task lies in the presence of distracting objects and relationships in images; contextual reasoning is strongly distracted by irrelevant objects or backgrounds and, more importantly, a vast number of irrelevant candidate relations. To tackle the issue, we propose the Selective Quad Attention Network (SQUAT) that learns to select relevant object pairs and disambiguate them via diverse contextual interactions. SQUAT consists of two main components: edge selection and quad attention. The edge selection module selects relevant object pairs, i.e., edges in the scene graph, which helps contextual reasoning, and the quad attention module then updates the edge features using both edge-to-node and edge-to-edge cross-attentions to capture contextual information between objects and object pairs. Experiments demonstrate the strong performance and robustness of SQUAT, achieving the state of the art on the Visual Genome and Open Images v6 benchmarks.
Paper24 Probing Neural Representations of Scene Perception in a Hippocampally Dependent Task Using Artificial Neural Networks
摘要原文: Deep artificial neural networks (DNNs) trained through backpropagation provide effective models of the mammalian visual system, accurately capturing the hierarchy of neural responses through primary visual cortex to inferior temporal cortex (IT). However, the ability of these networks to explain representations in higher cortical areas is relatively lacking and considerably less well researched. For example, DNNs have been less successful as a model of the egocentric to allocentric transformation embodied by circuits in retrosplenial and posterior parietal cortex. We describe a novel scene perception benchmark inspired by a hippocampal dependent task, designed to probe the ability of DNNs to transform scenes viewed from different egocentric perspectives. Using a network architecture inspired by the connectivity between temporal lobe structures and the hippocampus, we demonstrate that DNNs trained using a triplet loss can learn this task. Moreover, by enforcing a factorized latent space, we can split information propagation into “what” and “where” pathways, which we use to reconstruct the input. This allows us to beat the state-of-the-art for unsupervised object segmentation on the CATER and MOVi-A,B,C benchmarks.
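The training signal referred to above is the standard triplet objective; here is a minimal sketch (with a toy stand-in encoder and an assumed margin) in which two egocentric views of the same scene form the anchor/positive pair and a different scene serves as the negative:
```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))   # toy stand-in encoder
criterion = nn.TripletMarginLoss(margin=1.0)

view_a = torch.randn(16, 3, 64, 64)     # scene X, egocentric viewpoint 1 (anchor)
view_b = torch.randn(16, 3, 64, 64)     # scene X, egocentric viewpoint 2 (positive)
other  = torch.randn(16, 3, 64, 64)     # scene Y (negative)
loss = criterion(encoder(view_a), encoder(view_b), encoder(other))
```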
Paper25 Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment
摘要原文: How does audio describe the world around us? In this paper, we propose a method for generating an image of a scene from sound. Our method addresses the challenges of dealing with the large gaps that often exist between sight and sound. We design a model that works by scheduling the learning procedure of each model component to associate audio-visual modalities despite their information gaps. The key idea is to enrich the audio features with visual information by learning to align audio to visual latent space. We translate the input audio to visual features, then use a pre-trained generator to produce an image. To further improve the quality of our generated images, we use sound source localization to select the audio-visual pairs that have strong cross-modal correlations. We obtain substantially better results on the VEGAS and VGGSound datasets than prior approaches. We also show that we can control our model’s predictions by applying simple manipulations to the input waveform, or to the latent space.
Paper26 VoxFormer: Sparse Voxel Transformer for Camera-Based 3D Semantic Scene Completion
摘要原文: Humans can easily imagine the complete 3D geometry of occluded objects and scenes. This appealing ability is vital for recognition and understanding. To enable such capability in AI systems, we propose VoxFormer, a Transformer-based semantic scene completion framework that can output complete 3D volumetric semantics from only 2D images. Our framework adopts a two-stage design where we start from a sparse set of visible and occupied voxel queries from depth estimation, followed by a densification stage that generates dense 3D voxels from the sparse ones. A key idea of this design is that the visual features on 2D images correspond only to the visible scene structures rather than the occluded or empty spaces. Therefore, starting with the featurization and prediction of the visible structures is more reliable. Once we obtain the set of sparse queries, we apply a masked autoencoder design to propagate the information to all the voxels by self-attention. Experiments on SemanticKITTI show that VoxFormer outperforms the state of the art with a relative improvement of 20.0% in geometry and 18.1% in semantics and reduces GPU memory during training to less than 16GB. Our code is available on https://github.com/NVlabs/VoxFormer.
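The first-stage query proposal can be sketched as back-projecting the estimated depth map and keeping the unique occupied voxels; the intrinsics, voxel size, and random depth below are placeholder values rather than the paper's configuration:
```python
import numpy as np

def visible_voxel_queries(depth, K, voxel_size=0.2):
    """depth: [H, W] estimated metric depth; K: [3, 3] camera intrinsics.
    Returns [M, 3] unique voxel indices covering the visible, occupied surface."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - K[0, 2]) * z / K[0, 0]
    y = (v[valid] - K[1, 2]) * z / K[1, 1]
    points = np.stack([x, y, z], axis=-1)                  # camera-frame 3D points
    voxels = np.floor(points / voxel_size).astype(np.int64)
    return np.unique(voxels, axis=0)                       # seeds for the sparse voxel queries

K = np.array([[525.0, 0.0, 320.0], [0.0, 525.0, 240.0], [0.0, 0.0, 1.0]])
queries = visible_voxel_queries(np.random.uniform(0.5, 8.0, (480, 640)), K)
```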
Paper27 Panoptic Video Scene Graph Generation
摘要原文: Towards building comprehensive real-world visual perception systems, we propose and study a new problem called panoptic video scene graph generation (PVSG). PVSG is related to the existing video scene graph generation (VidSGG) problem, which focuses on temporal interactions between humans and objects localized with bounding boxes in videos. However, the limitation of bounding boxes in detecting non-rigid objects and backgrounds often causes VidSGG systems to miss key details that are crucial for comprehensive video understanding. In contrast, PVSG requires nodes in scene graphs to be grounded by more precise, pixel-level segmentation masks, which facilitate holistic scene understanding. To advance research in this new area, we contribute a high-quality PVSG dataset, which consists of 400 videos (289 third-person + 111 egocentric videos) with a total of 150K frames labeled with panoptic segmentation masks as well as fine, temporal scene graphs. We also provide a variety of baseline methods and share useful design practices for future work.
Paper28 Where We Are and What We’re Looking At: Query Based Worldwide Image Geo-Localization Using Hierarchies and Scenes
摘要原文: Determining the exact latitude and longitude that a photo was taken is a useful and widely applicable task, yet it remains exceptionally difficult despite the accelerated progress of other computer vision tasks. Most previous approaches have opted to learn single representations of query images, which are then classified at different levels of geographic granularity. These approaches fail to exploit the different visual cues that give context to different hierarchies, such as the country, state, and city level. To this end, we introduce an end-to-end transformer-based architecture that exploits the relationship between different geographic levels (which we refer to as hierarchies) and the corresponding visual scene information in an image through hierarchical cross-attention. We achieve this by learning a query for each geographic hierarchy and scene type. Furthermore, we learn a separate representation for different environmental scenes, as different scenes in the same location are often defined by completely different visual features. We achieve state of the art accuracy on 4 standard geo-localization datasets: Im2GPS, Im2GPS3k, YFCC4k, and YFCC26k, as well as qualitatively demonstrate how our method learns different representations for different visual hierarchies and scenes, which has not been demonstrated in the previous methods. The above testing datasets mostly consist of iconic landmarks or images taken from social media, which makes the dataset a simple memory task, or makes it biased towards certain places. To address this issue, we introduce a much harder testing dataset, Google-World-Streets-15k, comprised of images taken from Google Streetview covering the whole planet and present state of the art results. Our code can be found at https://github.com/AHKerrigan/GeoGuessNet.
Paper29 Seeing With Sound: Long-range Acoustic Beamforming for Multimodal Scene Understanding
摘要原文: Existing autonomous vehicles primarily use sensors that rely on electromagnetic waves which are undisturbed in good environmental conditions but can suffer in adverse scenarios, such as low light or for objects with low reflectance. Moreover, only objects in direct line-of-sight are typically detected by these existing methods. Acoustic pressure waves emanating from road users do not share these limitations. However, such signals are typically ignored in automotive perception because they suffer from low spatial resolution and lack directional information. In this work, we introduce long-range acoustic beamforming of pressure waves from noise directly produced by automotive vehicles in-the-wild as a complementary sensing modality to traditional optical sensor approaches for detection of objects in dynamic traffic environments. To this end, we introduce the first multimodal long-range acoustic beamforming dataset. We propose a neural aperture expansion method for beamforming and we validate its utility for multimodal automotive object detection. We validate the benefit of adding sound detections to existing RGB cameras in challenging automotive scenarios, where camera-only approaches fail or do not deliver the ultra-fast rates of pressure sensors.
Paper30 OpenScene: 3D Scene Understanding With Open Vocabularies
摘要原文: Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision. We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space. This zero-shot approach enables task-agnostic training and open-vocabulary queries. For example, to perform SOTA zero-shot 3D semantic segmentation it first infers CLIP features for every 3D point and later classifies them based on similarities to embeddings of arbitrary class labels. More interestingly, it enables a suite of open-vocabulary scene understanding applications that have never been done before. For example, it allows a user to enter an arbitrary text query and then see a heat map indicating which parts of a scene match. Our approach is effective at identifying objects, materials, affordances, activities, and room types in complex 3D scenes, all using a single model trained without any labeled 3D data.
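Because the point features live in CLIP space, both fixed label sets and free-form queries reduce to cosine similarity at inference time; the sketch below assumes the per-point features and text embeddings are produced elsewhere (e.g., by a CLIP text encoder):
```python
import torch
import torch.nn.functional as F

def open_vocab_segment(point_feats, label_embeds):
    """point_feats: [N, D] per-point CLIP-space features; label_embeds: [C, D] text embeddings.
    Returns per-point class indices for an arbitrary label set."""
    sims = F.normalize(point_feats, dim=-1) @ F.normalize(label_embeds, dim=-1).t()
    return sims.argmax(dim=-1)

def query_heatmap(point_feats, query_embed):
    """Similarity of every point to a single free-form text query -> [N] heat values."""
    return F.normalize(point_feats, dim=-1) @ F.normalize(query_embed, dim=-1)

labels = open_vocab_segment(torch.randn(100_000, 768), torch.randn(20, 768))
heat = query_heatmap(torch.randn(100_000, 768), torch.randn(768))
```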
Paper31 Movies2Scenes: Using Movie Metadata To Learn Scene Representation
摘要原文: Understanding scenes in movies is crucial for a variety of applications such as video moderation, search, and recommendation. However, labeling individual scenes is a time-consuming process. In contrast, movie level metadata (e.g., genre, synopsis, etc.) regularly gets produced as part of the film production process, and is therefore significantly more commonly available. In this work, we propose a novel contrastive learning approach that uses movie metadata to learn a general-purpose scene representation. Specifically, we use movie metadata to define a measure of movie similarity, and use it during contrastive learning to limit our search for positive scene-pairs to only the movies that are considered similar to each other. Our learned scene representation consistently outperforms existing state-of-the-art methods on a diverse set of tasks evaluated using multiple benchmark datasets. Notably, our learned representation offers an average improvement of 7.9% on the seven classification tasks and 9.7% improvement on the two regression tasks in LVU dataset. Furthermore, using a newly collected movie dataset, we present comparative results of our scene representation on a set of video moderation tasks to demonstrate its generalizability on previously less explored tasks.
Paper32 Collaborative Noisy Label Cleaner: Learning Scene-Aware Trailers for Multi-Modal Highlight Detection in Movies
摘要原文: Movie highlights stand out of the screenplay for efficient browsing and play a crucial role on social media platforms. Based on existing efforts, this work has two observations: (1) For different annotators, labeling highlight has uncertainty, which leads to inaccurate and time-consuming annotations. (2) Besides previous supervised or unsupervised settings, some existing video corpora can be useful, e.g., trailers, but they are often noisy and incomplete to cover the full highlights. In this work, we study a more practical and promising setting, i.e., reformulating highlight detection as “learning with noisy labels”. This setting does not require time-consuming manual annotations and can fully utilize existing abundant video corpora. First, based on movie trailers, we leverage scene segmentation to obtain complete shots, which are regarded as noisy labels. Then, we propose a Collaborative noisy Label Cleaner (CLC) framework to learn from noisy highlight moments. CLC consists of two modules: augmented cross-propagation (ACP) and multi-modality cleaning (MMC). The former aims to exploit the closely related audio-visual signals and fuse them to learn unified multi-modal representations. The latter aims to achieve cleaner highlight labels by observing the changes in losses among different modalities. To verify the effectiveness of CLC, we further collect a large-scale highlight dataset named MovieLights. Comprehensive experiments on MovieLights and YouTube Highlights datasets demonstrate the effectiveness of our approach. Code has been made available at: https://github.com/TencentYoutuResearch/HighlightDetection-CLC
Paper33 Long Range Pooling for 3D Large-Scale Scene Understanding
摘要原文: Inspired by the success of recent vision transformers and large kernel design in convolutional neural networks (CNNs), in this paper, we analyze and explore essential reasons for their success. We claim two factors that are critical for 3D large-scale scene understanding: a larger receptive field and operations with greater non-linearity. The former is responsible for providing long range contexts and the latter can enhance the capacity of the network. To achieve the above properties, we propose a simple yet effective long range pooling (LRP) module using dilation max pooling, which provides a network with a large adaptive receptive field. LRP has few parameters, and can be readily added to current CNNs. Also, based on LRP, we present an entire network architecture, LRPNet, for 3D understanding. Ablation studies are presented to support our claims, and show that the LRP module achieves better results than large kernel convolution yet with reduced computation, due to its non-linearity. We also demonstrate the superiority of LRPNet on various benchmarks: LRPNet performs the best on ScanNet and surpasses other CNN-based methods on S3DIS and Matterport3D. Code will be available at https://github.com/li-xl/LRPNet.
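On a dense voxel grid, the long range pooling idea can be sketched as a 3x3x3 max pool with a large dilation, which enlarges the receptive field at essentially zero parameter cost; the dilation set, the max-fusion of branches, and the dense (rather than sparse) grid below are assumptions for illustration:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dilated_max_pool3d(x, dilation):
    """x: [N, C, D, H, W]; size-preserving 3x3x3 max pool with the given dilation."""
    x = F.pad(x, [dilation] * 6, value=float('-inf'))   # pad so output keeps the input size
    return F.max_pool3d(x, kernel_size=3, stride=1, dilation=dilation)

class LongRangePooling(nn.Module):
    def __init__(self, dilations=(1, 4, 16)):
        super().__init__()
        self.dilations = dilations

    def forward(self, x):
        pooled = torch.stack([dilated_max_pool3d(x, d) for d in self.dilations])
        return x + pooled.max(dim=0).values      # residual plus long-range pooled context

out = LongRangePooling()(torch.randn(1, 8, 32, 32, 32))
```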
Paper34 Towards Unified Scene Text Spotting Based on Sequence Generation
摘要原文: Sequence generation models have recently made significant progress in unifying various vision tasks. Although some auto-regressive models have demonstrated promising results in end-to-end text spotting, they use specific detection formats while ignoring various text shapes and are limited in the maximum number of text instances that can be detected. To overcome these limitations, we propose a UNIfied scene Text Spotter, called UNITS. Our model unifies various detection formats, including quadrilaterals and polygons, allowing it to detect text in arbitrary shapes. Additionally, we apply starting-point prompting to enable the model to extract texts from an arbitrary starting point, thereby extracting more texts beyond the number of instances it was trained on. Experimental results demonstrate that our method achieves competitive performance compared to state-of-the-art methods. Further analysis shows that UNITS can extract a larger number of texts than it was trained on. We provide the code for our method at https://github.com/clovaai/units.
摘要原文: The quality of scene graphs generated by the state-of-the-art (SOTA) models is compromised due to the long-tail nature of the relationships and their parent object pairs. Training of the scene graphs is dominated by the majority relationships of the majority pairs and, therefore, the object-conditional distributions of relationship in the minority pairs are not preserved after the training is converged. Consequently, the biased model performs well on more frequent relationships in the marginal distribution of relationships such as ‘on’ and ‘wearing’, and performs poorly on the less frequent relationships such as ‘eating’ or ‘hanging from’. In this work, we propose virtual evidence incorporated within-triplet Bayesian Network (BN) to preserve the object-conditional distribution of the relationship label and to eradicate the bias created by the marginal probability of the relationships. The insufficient number of relationships in the minority classes poses a significant problem in learning the within-triplet Bayesian network. We address this insufficiency by embedding-based augmentation of triplets where we borrow samples of the minority triplet classes from its neighboring triplets in the semantic space. We perform experiments on two different datasets and achieve a significant improvement in the mean recall of the relationships. We also achieve a better balance between recall and mean recall performance compared to the SOTA de-biasing techniques of scene graph models.
Paper36 MIME: Human-Aware 3D Scene Generation
摘要原文: Generating realistic 3D worlds occupied by moving humans has many applications in games, architecture, and synthetic data creation. But generating such scenes is expensive and labor intensive. Recent work generates human poses and motions given a 3D scene. Here, we take the opposite approach and generate 3D indoor scenes given 3D human motion. Such motions can come from archival motion capture or from IMU sensors worn on the body, effectively turning human movement into a “scanner” of the 3D world. Intuitively, human movement indicates the free-space in a room and human contact indicates surfaces or objects that support activities such as sitting, lying or touching. We propose MIME (Mining Interaction and Movement to infer 3D Environments), which is a generative model of indoor scenes that produces furniture layouts that are consistent with the human movement. MIME uses an auto-regressive transformer architecture that takes the already generated objects in the scene as well as the human motion as input, and outputs the next plausible object. To train MIME, we build a dataset by populating the 3D FRONT scene dataset with 3D humans. Our experiments show that MIME produces more diverse and plausible 3D scenes than a recent generative scene method that does not know about human movement. Code and data will be available for research.
Paper37 pCON: Polarimetric Coordinate Networks for Neural Scene Representations
摘要原文: Neural scene representations have achieved great success in parameterizing and reconstructing images, but current state of the art models are not optimized with the preservation of physical quantities in mind. While current architectures can reconstruct color images correctly, they create artifacts when trying to fit maps of polar quantities. We propose polarimetric coordinate networks (pCON), a new model architecture for neural scene representations aimed at preserving polarimetric information while accurately parameterizing the scene. Our model removes artifacts created by current coordinate network architectures when reconstructing three polarimetric quantities of interest.
Paper38 Unbiased Scene Graph Generation in Videos
摘要原文: The task of dynamic scene graph generation (SGG) from videos is complicated and challenging due to the inherent dynamics of a scene, temporal fluctuation of model predictions, and the long-tailed distribution of the visual relationships in addition to the already existing challenges in image-based SGG. Existing methods for dynamic SGG have primarily focused on capturing spatio-temporal context using complex architectures without addressing the challenges mentioned above, especially the long-tailed distribution of relationships. This often leads to the generation of biased scene graphs. To address these challenges, we introduce a new framework called TEMPURA: TEmporal consistency and Memory Prototype guided UnceRtainty Attenuation for unbiased dynamic SGG. TEMPURA employs object-level temporal consistencies via transformer-based sequence modeling, learns to synthesize unbiased relationship representations using memory-guided training, and attenuates the predictive uncertainty of visual relations using a Gaussian Mixture Model (GMM). Extensive experiments demonstrate that our method achieves significant (up to 10% in some cases) performance gain over existing methods, highlighting its superiority in generating more unbiased scene graphs. Code: https://github.com/sayaknag/unbiasedSGG.git
Paper39 Diffusion-Based Generation, Optimization, and Planning in 3D Scenes
摘要原文: We introduce SceneDiffuser, a conditional generative model for 3D scene understanding. SceneDiffuser provides a unified model for solving scene-conditioned generation, optimization, and planning. In contrast to prior works, SceneDiffuser is intrinsically scene-aware, physics-based, and goal-oriented. With an iterative sampling strategy, SceneDiffuser jointly formulates the scene-aware generation, physics-based optimization, and goal-oriented planning via a diffusion-based denoising process in a fully differentiable fashion. Such a design alleviates the discrepancies among different modules and the posterior collapse of previous scene-conditioned generative models. We evaluate SceneDiffuser with various 3D scene understanding tasks, including human pose and motion generation, dexterous grasp generation, path planning for 3D navigation, and motion planning for robot arms. The results show significant improvements compared with previous models, demonstrating the tremendous potential of SceneDiffuser for the broad community of 3D scene understanding.
Paper40 Text2Scene: Text-Driven Indoor Scene Stylization With Part-Aware Details
摘要原文: We propose Text2Scene, a method to automatically create realistic textures for virtual scenes composed of multiple objects. Guided by a reference image and text descriptions, our pipeline adds detailed texture on labeled 3D geometries in the room such that the generated colors respect the hierarchical structure or semantic parts that are often composed of similar materials. Instead of applying flat stylization on the entire scene at a single step, we obtain weak semantic cues from geometric segmentation, which are further clarified by assigning initial colors to segmented parts. Then we add texture details for individual objects such that their projections on image space exhibit feature embedding aligned with the embedding of the input. The decomposition makes the entire pipeline tractable to a moderate amount of computation resources and memory. As our framework utilizes the existing resources of image and text embedding, it does not require dedicated datasets with high-quality textures designed by skillful artists. To the best of our knowledge, it is the first practical and scalable approach that can create detailed and realistic textures of the desired style that maintain structural context for scenes with multiple objects.
Paper41 Prototype-Based Embedding Network for Scene Graph Generation
摘要原文: Current Scene Graph Generation (SGG) methods explore contextual information to predict relationships among entity pairs. However, due to the diverse visual appearance of numerous possible subject-object combinations, there is a large intra-class variation within each predicate category, e.g., “man-eating-pizza, giraffe-eating-leaf”, and the severe inter-class similarity between different classes, e.g., “man-holding-plate, man-eating-pizza”, in the model’s latent space. The above challenges prevent current SGG methods from acquiring robust features for reliable relation prediction. In this paper, we claim that the predicate’s category-inherent semantics can serve as class-wise prototypes in the semantic space for relieving the above challenges caused by the diverse visual appearances. To this end, we propose the Prototype-based Embedding Network (PE-Net), which models entities/predicates with prototype-aligned compact and distinctive representations and establishes matching between entity pairs and predicates in a common embedding space for relation recognition. Moreover, Prototype-guided Learning (PL) is introduced to help PE-Net efficiently learn such entity-predicate matching, and Prototype Regularization (PR) is devised to relieve the ambiguous entity-predicate matching caused by the predicate’s semantic overlap. Extensive experiments demonstrate that our method gains superior relation recognition capability on SGG, achieving new state-of-the-art performances on both Visual Genome and Open Images datasets.
Paper42 I2-SDF: Intrinsic Indoor Scene Reconstruction and Editing via Raytracing in Neural SDFs
摘要原文: In this work, we present I^2-SDF, a new method for intrinsic indoor scene reconstruction and editing using differentiable Monte Carlo raytracing on neural signed distance fields (SDFs). Our holistic neural SDF-based framework jointly recovers the underlying shapes, incident radiance and materials from multi-view images. We introduce a novel bubble loss for fine-grained small objects and error-guided adaptive sampling scheme to largely improve the reconstruction quality on large-scale indoor scenes. Further, we propose to decompose the neural radiance field into spatially-varying material of the scene as a neural field through surface-based, differentiable Monte Carlo raytracing and emitter semantic segmentations, which enables physically based and photorealistic scene relighting and editing applications. Through a number of qualitative and quantitative experiments, we demonstrate the superior quality of our method on indoor scene reconstruction, novel view synthesis, and scene editing compared to state-of-the-art baselines. Our project page is at https://jingsenzhu.github.io/i2-sdf.
Paper43 Multi-View Inverse Rendering for Large-Scale Real-World Indoor Scenes
摘要原文: We present an efficient multi-view inverse rendering method for large-scale real-world indoor scenes that reconstructs global illumination and physically-reasonable SVBRDFs. Unlike previous representations, where the global illumination of large scenes is simplified as multiple environment maps, we propose a compact representation called Texture-based Lighting (TBL). It consists of 3D mesh and HDR textures, and efficiently models direct and infinite-bounce indirect lighting of the entire large scene. Based on TBL, we further propose a hybrid lighting representation with precomputed irradiance, which significantly improves the efficiency and alleviates the rendering noise in the material optimization. To physically disentangle the ambiguity between materials, we propose a three-stage material optimization strategy based on the priors of semantic segmentation and room segmentation. Extensive experiments show that the proposed method outperforms the state-of-the-art quantitatively and qualitatively, and enables physically-reasonable mixed-reality applications such as material editing, editable novel view synthesis and relighting. The project page is at https://lzleejean.github.io/TexIR.
Paper44 Hierarchical Semantic Contrast for Scene-Aware Video Anomaly Detection
摘要原文: Increasing scene-awareness is a key challenge in video anomaly detection (VAD). In this work, we propose a hierarchical semantic contrast (HSC) method to learn a scene-aware VAD model from normal videos. We first incorporate foreground object and background scene features with high-level semantics by taking advantage of pre-trained video parsing models. Then, building upon the autoencoder-based reconstruction framework, we introduce both scene-level and object-level contrastive learning to encourage the encoded latent features to be compact within the same semantic classes while being separable across different classes. This hierarchical semantic contrast strategy helps to deal with the diversity of normal patterns and also increases their discrimination ability. Moreover, to handle rare normal activities, we design a skeleton-based motion augmentation to increase samples and further refine the model. Extensive experiments on three public datasets and scene-dependent mixture datasets validate the effectiveness of our proposed method.
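As a minimal illustration of the class-wise contrast described above (latents of the same semantic class pulled together, different classes pushed apart), the sketch below uses a generic supervised-contrastive loss; it is not the HSC implementation, and the feature dimensionality, temperature, and the source of class labels are assumptions.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive(feats, labels, temperature=0.1):
    feats = F.normalize(feats, dim=-1)
    n = feats.shape[0]
    self_mask = torch.eye(n, dtype=torch.bool)
    sim = (feats @ feats.t() / temperature).masked_fill(self_mask, -1e9)   # drop self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = ((labels[:, None] == labels[None, :]) & ~self_mask).float()      # same-class pairs are positives
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

latents = torch.randn(16, 128)              # encoded scene- or object-level features
classes = torch.randint(0, 4, (16,))        # semantic class ids, e.g. from a parsing model
loss = supervised_contrastive(latents, classes)
```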
Paper45 Fast Contextual Scene Graph Generation With Unbiased Context Augmentation
摘要原文: Scene graph generation (SGG) methods have historically suffered from long-tail bias and slow inference speed. In this paper, we notice that humans can analyze relationships between objects relying solely on context descriptions, and this abstract cognitive process may be guided by experience. For example, given descriptions of a cup and a table with their spatial locations, humans can speculate possible relationships < cup, on, table > or < table, near, cup >. Even without visual appearance information, some impossible predicates like flying in and looking at can be empirically excluded. Accordingly, we propose a contextual scene graph generation (C-SGG) method that does not use visual information and introduce a context augmentation method. We posit that slight perturbations in the position and size of objects do not essentially affect the relationship between objects. Therefore, at the context level, we can produce diverse context descriptions by applying context augmentation to the original dataset. These diverse context descriptions can be used for unbiased training of C-SGG to alleviate long-tail bias. In addition, we also introduce a context-guided visual scene graph generation (CV-SGG) method, which leverages the C-SGG experience to guide vision to focus on possible predicates. Through extensive experiments on the publicly available dataset, C-SGG alleviates long-tail bias and omits the huge computation of visual feature extraction to realize real-time SGG. CV-SGG achieves a great trade-off between common predicates and tail predicates.
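A tiny sketch of the context-augmentation idea follows: lightly jittering object positions and sizes to generate diverse context descriptions for unbiased training. The box format and jitter magnitudes are assumptions, not the paper's settings.

```python
import random

def augment_context(objects, pos_jitter=0.05, size_jitter=0.05):
    """objects: list of dicts like {'label': 'cup', 'box': (x, y, w, h)} in normalized coordinates."""
    augmented = []
    for obj in objects:
        x, y, w, h = obj['box']
        x += random.uniform(-pos_jitter, pos_jitter)       # small positional perturbation
        y += random.uniform(-pos_jitter, pos_jitter)
        w *= 1 + random.uniform(-size_jitter, size_jitter)  # small size perturbation
        h *= 1 + random.uniform(-size_jitter, size_jitter)
        augmented.append({'label': obj['label'], 'box': (x, y, w, h)})
    return augmented

scene = [{'label': 'cup', 'box': (0.4, 0.3, 0.1, 0.1)},
         {'label': 'table', 'box': (0.2, 0.4, 0.6, 0.4)}]
print(augment_context(scene))   # a perturbed context description; relationships are assumed unchanged
```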
Paper46 SUDS: Scalable Urban Dynamic Scenes
摘要原文: We extend neural radiance fields (NeRFs) to dynamic large-scale urban scenes. Prior work tends to reconstruct single video clips of short durations (up to 10 seconds). This is because such methods (a) tend to scale linearly with the number of moving objects and input videos, since a separate model is built for each, and (b) tend to require supervision via 3D bounding boxes and panoptic labels, obtained manually or via category-specific models. As a step towards truly open-world reconstructions of dynamic cities, we introduce two key innovations: (a) we factorize the scene into three separate hash table data structures to efficiently encode static, dynamic, and far-field radiance fields, and (b) we make use of unlabeled target signals consisting of RGB images, sparse LiDAR, off-the-shelf self-supervised 2D descriptors, and most importantly, 2D optical flow. Operationalizing such inputs via photometric, geometric, and feature-metric reconstruction losses enables SUDS to decompose dynamic scenes into the static background, individual objects, and their motions. When combined with our multi-branch table representation, such reconstructions can be scaled to tens of thousands of objects across 1.2 million frames from 1700 videos spanning geospatial footprints of hundreds of kilometers, making this, to our knowledge, the largest dynamic NeRF built to date. We present qualitative initial results on a variety of tasks enabled by our representations, including novel-view synthesis of dynamic urban scenes, unsupervised 3D instance segmentation, and unsupervised 3D cuboid detection. To compare to prior work, we also evaluate on KITTI and Virtual KITTI 2, surpassing state-of-the-art methods that rely on ground truth 3D bounding box annotations while being 10x quicker to train.
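A heavily simplified sketch of the three-branch factorization (static, dynamic, far-field) is shown below, with tiny MLPs standing in for the multiresolution hash tables used in the paper; the query signatures are illustrative assumptions, and how the branch outputs are composited into a rendered color is left to the (omitted) renderer.

```python
import torch
import torch.nn as nn

def branch(in_dim):
    # tiny MLP standing in for a multiresolution hash table + decoder
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, 4))  # (r, g, b, density)

static_branch   = branch(3)  # indexed by 3D position x
dynamic_branch  = branch(4)  # indexed by position and time (x, t)
farfield_branch = branch(3)  # indexed by view direction d (environment-like background)

def query(x, t, d):
    """Query all three branches for a batch of samples."""
    return {
        "static":   static_branch(x),
        "dynamic":  dynamic_branch(torch.cat([x, t], dim=-1)),
        "farfield": farfield_branch(d),
    }

out = query(torch.randn(8, 3), torch.rand(8, 1), torch.randn(8, 3))
```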
Paper47 Dionysus: Recovering Scene Structures by Dividing Into Semantic Pieces
摘要原文: Most existing 3D reconstruction methods result in either detail loss or unsatisfying efficiency. However, effectiveness and efficiency are equally crucial in real-world applications, e.g., autonomous driving and augmented reality. We argue that this dilemma comes from resources wasted on valueless depth samples. This paper tackles the problem by proposing a novel learning-based 3D reconstruction framework named Dionysus. Our main contribution is to identify the most promising depth candidates from estimated semantic maps. This strategy simultaneously enables high effectiveness and efficiency by attending to the most reliable nominators. Specifically, we distinguish unreliable depth candidates by checking cross-view semantic consistency and allow adaptive sampling by redistributing depth nominators among pixels. Experiments on the most popular datasets confirm our proposed framework’s effectiveness.
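The cross-view semantic consistency check can be sketched as follows: a depth hypothesis for a reference pixel is kept only if its reprojection into another view lands on the same semantic label. The camera conventions and the nearest-pixel lookup below are assumptions, not the Dionysus implementation.

```python
import numpy as np

def semantically_consistent(pixel, depth, label, K, R, tvec, sem_other):
    """pixel: (u, v) in the reference view; depth: hypothesis along that ray;
    R, tvec: reference-to-other transform; sem_other: HxW semantic label map of the other view."""
    ray = np.linalg.inv(K) @ np.array([pixel[0], pixel[1], 1.0])
    X_ref = ray * depth                      # back-project the hypothesis into reference camera space
    X_other = R @ X_ref + tvec               # transform into the other camera
    uvw = K @ X_other
    if uvw[2] <= 0:
        return False                         # point falls behind the other camera
    u, v = int(round(uvw[0] / uvw[2])), int(round(uvw[1] / uvw[2]))
    h, w = sem_other.shape
    return 0 <= v < h and 0 <= u < w and sem_other[v, u] == label

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
sem = np.zeros((480, 640), dtype=int)
print(semantically_consistent((320, 240), 2.0, 0, K, np.eye(3), np.zeros(3), sem))  # identity pose -> True
```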
Paper48 Advancing Visual Grounding With Scene Knowledge: Benchmark and Method
摘要原文: Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Ideally, it can serve as a testbed for vision-and-language models to evaluate their understanding of images and texts and their reasoning abilities over the joint space. However, most existing VG datasets are constructed using simple description texts, which do not require sufficient reasoning over the images and texts. This has been demonstrated in a recent study, where a simple LSTM-based text encoder without pretraining can achieve state-of-the-art performance on mainstream VG datasets. Therefore, in this paper, we propose a novel benchmark of Scene Knowledge-guided Visual Grounding (SK-VG), where the image content and referring expressions are not sufficient to ground the target objects, forcing the models to reason over long-form scene knowledge. To perform this task, we propose two approaches to accept the triple-type input: the former embeds knowledge into the image features before the image-query interaction, while the latter leverages linguistic structure to assist in computing the image-text matching. We conduct extensive experiments to analyze the above methods and show that the proposed approaches achieve promising results but still leave room for improvement in both performance and interpretability.
Paper49 Learning To Fuse Monocular and Multi-View Cues for Multi-Frame Depth Estimation in Dynamic Scenes
摘要原文: Multi-frame depth estimation generally achieves high accuracy by relying on multi-view geometric consistency. When applied in dynamic scenes, e.g., autonomous driving, this consistency is usually violated in the dynamic areas, leading to corrupted estimations. Many multi-frame methods handle dynamic areas by identifying them with explicit masks and compensating the multi-view cues with monocular cues represented as local monocular depth or features. The improvements are limited due to the uncontrolled quality of the masks and the underutilized benefits of fusing the two types of cues. In this paper, we propose a novel method that learns to fuse the multi-view and monocular cues encoded as volumes without needing heuristically crafted masks. As unveiled in our analyses, the multi-view cues capture more accurate geometric information in static areas, while the monocular cues capture more useful contexts in dynamic areas. To let the geometric perception learned from multi-view cues in static areas propagate to the monocular representation in dynamic areas, and to let monocular cues enhance the representation of the multi-view cost volume, we propose a cross-cue fusion (CCF) module, which includes cross-cue attention (CCA) to encode the spatially non-local relative intra-relations from each source and thereby enhance the representation of the other. Experiments on real-world datasets demonstrate the significant effectiveness and generalization ability of the proposed method.
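A minimal sketch of cross-attending the two cue representations so that each is enhanced by the other is given below; the use of nn.MultiheadAttention, the token shapes, and the residual fusion are assumptions rather than the actual CCF/CCA design.

```python
import torch
import torch.nn as nn

class CrossCueAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.mono_from_mv = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mv_from_mono = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mono, mv):
        """mono, mv: (B, N, C) token sequences flattened from the monocular and multi-view cue volumes."""
        mono_enh, _ = self.mono_from_mv(mono, mv, mv)     # monocular queries attend to multi-view cues
        mv_enh, _ = self.mv_from_mono(mv, mono, mono)     # multi-view queries attend to monocular cues
        return mono + mono_enh, mv + mv_enh               # residual fusion of each enhanced representation

ccf = CrossCueAttention()
mono = torch.randn(2, 1024, 64)   # e.g. a 32x32 volume slice flattened into tokens
mv = torch.randn(2, 1024, 64)
mono_out, mv_out = ccf(mono, mv)
```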