At NeurIPS 2024 (the Conference on Neural Information Processing Systems), multimodal learning and large language model research maintained strong momentum. From an analysis of the 3,000+ accepted papers, we find:
- More than 200 papers are strongly related to multimodal learning, roughly 8% of the total
- Research on large language models and multimodal large models accounts for over 15%
- The acceptance figures suggest the field is still growing rapidly
Current multimodal learning research concentrates on the following key areas:
- Visual cognition
  - Visual recognition and understanding
  - Visual alignment techniques
  - Visual perception enhancement
- Model optimization
  - Preference alignment and hallucination mitigation
  - Efficient model architectures
  - Model compression and optimization
- Intelligent interaction
  - Agent applications
  - Reinforcement learning methods
  - Human-AI collaboration optimization
Below is a collection of papers strongly related to multimodal learning and multimodal large models, each with its title, abstract, a brief summary, and direction tags the blogger assigned based on the abstract. The current hot directions for MLLM research include visual recognition, visual understanding, visual alignment, visual perception, preference alignment (hallucination mitigation), efficient models, agents, and reinforcement learning.
#NeurIPS2024 #MultimodalLearning #LargeLanguageModels #ArtificialIntelligence #MachineLearning #VisualRecognition #MLLM #DeepLearning #AIResearchTrends #ComputerVision
This post is a systematic summary of the multimodal-learning papers at NeurIPS 2024. If this line of research interests you, follow for future updates.
The blogger's new blog address: BbiHH’s blog | bbihh.top. It will keep tracking trends in multimodal papers at top AI conferences; you are welcome to follow.
(A partial selection of 26 papers; more will be added in later updates. Reading time: ~30 min.)
·Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding
We present a novel framework for OCR-free document understanding based on pretrained Multimodal Large Language Models (MLLMs). Our approach employs multiscale visual features to effectively handle various font sizes within document images. To address the increasing costs of considering the multi-scale visual inputs for the pretrained MLLMs, we propose a hierarchical visual feature aggregation module designed to reduce the number of input tokens to LLMs. Our approach leverages feature pyramid hierarchy with cross-attentive pooling, effectively handling the trade-off between information loss and efficiency without being affected by varying document image sizes. Additionally, we introduce a novel instruction tuning task that aims to enhance model readability by incorporating text positional information within images, which is robust to text truncation issue. Through comprehensive experiments, we demonstrate the efficacy of our framework in achieving outstanding document understanding performance on various tasks.
Document understanding, OCR-free, multi-scale features, hierarchical aggregation
In brief: an OCR-free document-understanding framework built on pretrained MLLMs. Multi-scale visual features handle the varied font sizes in document images, and a hierarchical aggregation module with cross-attentive pooling over the feature pyramid limits the number of visual tokens passed to the LLM, balancing information loss against efficiency independently of document image size. An additional instruction-tuning task injects text-position information to make the model robust to text truncation, and experiments show strong document-understanding performance across tasks.
(Document understanding, efficient perception)
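To make the cross-attentive pooling idea concrete, here is a minimal PyTorch sketch (my own illustration, not the authors' code; the hidden size, the number of learnable query tokens, and the three pyramid levels are assumptions): a small set of learnable queries cross-attends to the concatenated multi-scale visual tokens, so the LLM always receives a fixed-length visual sequence regardless of document image size.

```python
import torch
import torch.nn as nn

class CrossAttentivePooling(nn.Module):
    """Sketch: compress multi-scale visual tokens into a fixed number of
    learnable query tokens via cross-attention (hyperparameters illustrative)."""
    def __init__(self, dim=1024, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, multi_scale_feats):
        # multi_scale_feats: list of (B, N_i, dim) token maps from a feature pyramid
        kv = torch.cat(multi_scale_feats, dim=1)             # (B, sum N_i, dim)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        pooled, _ = self.attn(q, kv, kv)                      # (B, num_queries, dim)
        return self.norm(pooled)                              # fixed-length tokens for the LLM

if __name__ == "__main__":
    feats = [torch.randn(2, n, 1024) for n in (576, 144, 36)]  # three pyramid levels
    print(CrossAttentivePooling()(feats).shape)                # torch.Size([2, 64, 1024])
```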
·M³GPT: An Advanced Multimodal, Multitask Framework for Motion Comprehension and Generation
This paper presents M³GPT, an advanced Multimodal, Multitask framework for Motion comprehension and generation. M³GPT operates on three fundamental principles. The first focuses on creating a unified representation space for various motion-relevant modalities. We employ discrete vector quantization for multimodal control and generation signals, such as text, music and motion/dance, enabling seamless integration into a large language model (LLM) with a single vocabulary. The second involves modeling motion generation directly in the raw motion space. This strategy circumvents the information loss associated with discrete tokenizers, resulting in more detailed and comprehensive motion generation. Third, M³GPT learns to model the connections and synergies among various motion-relevant tasks. Text, the most familiar and well-understood modality for LLMs, is utilized as a bridge to establish connections between different motion tasks, facilitating mutual reinforcement. To our knowledge, M³GPT is the first model capable of comprehending and generating motions based on multiple signals. Extensive experiments highlight M³GPT's superior performance across various motion-relevant tasks and its powerful zero-shot generalization capabilities for extremely challenging tasks.
Motion generation, unified multimodal representation, multi-task learning, zero-shot generalization
In brief: M³GPT unifies motion-related modalities by quantizing control and generation signals such as text, music, and motion/dance into discrete tokens that share a single vocabulary with the LLM; it models motion generation directly in the raw motion space, avoiding the information loss of a discrete tokenizer; and it uses text as a bridge between different motion tasks so that they reinforce one another. The authors report strong results across motion-related tasks and notable zero-shot generalization on very challenging ones.
(Motion comprehension and generation)
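A minimal sketch of the discrete vector-quantization step that lets motion share one vocabulary with the LLM (codebook size, feature dimension, and vocabulary offset are assumptions, and the real M³GPT tokenizer is learned; this only shows the nearest-codebook lookup):

```python
import torch
import torch.nn as nn

class MotionVQ(nn.Module):
    """Sketch: map continuous motion frames to nearest-codebook indices that
    can be appended to an LLM vocabulary (sizes are illustrative)."""
    def __init__(self, codebook_size=512, dim=256, llm_vocab_size=32000):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        self.offset = llm_vocab_size  # motion tokens live after the text vocabulary

    @torch.no_grad()
    def encode(self, motion):                                  # motion: (B, T, dim)
        dists = ((motion.unsqueeze(-2) - self.codebook.weight) ** 2).sum(-1)
        ids = dists.argmin(dim=-1)                             # nearest code per frame
        return ids + self.offset                               # shifted into the shared vocab

if __name__ == "__main__":
    token_ids = MotionVQ().encode(torch.randn(1, 8, 256))
    print(token_ids.shape, bool((token_ids >= 32000).all()))   # torch.Size([1, 8]) True
```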
·Training-Free Visual Prompt Learning for Multimodal Large Language Models
In this work, we propose a training-free method to inject visual referring into Multimodal Large Language Models (MLLMs) through learnable visual token optimization. We observe the relationship between text prompt tokens and visual tokens in MLLMs, where attention layers model the connection between them. Our approach involves adjusting visual tokens from the MLP output during inference, controlling which text prompt tokens attend to which visual tokens. We optimize a learnable visual token based on an energy function, enhancing the strength of referential regions in the attention map. This enables detailed region description and reasoning without the need for substantial training costs or model retraining. Our method offers a promising direction for integrating referential abilities into MLLMs. Our method supports referring with box, mask, scribble and point. The results demonstrate that our method exhibits controllability and interpretability.
Visual referring, training-free optimization, multiple prompt formats
In brief: a training-free way to give an MLLM visual referring ability. At inference time the method adjusts the visual tokens produced by the MLP projector by optimizing a learnable visual token against an energy function, strengthening the referred region in the text-to-vision attention maps. This allows detailed region description and reasoning with box, mask, scribble, or point prompts, without any retraining, and the results are reported to be controllable and interpretable.
(Prompt learning, post-training)
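The inference-time optimization can be pictured with a toy PyTorch sketch (a heavily simplified assumption of mine, not the paper's implementation: a single attention map stands in for the MLLM, and the energy is just negative attention mass inside the referred region):

```python
import torch

def optimize_visual_tokens(visual_tokens, text_queries, region_mask,
                           steps=100, lr=0.1):
    """Sketch: optimize an additive offset on the visual tokens so that
    text-to-vision attention concentrates on the referred region."""
    delta = torch.zeros_like(visual_tokens, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        attn = torch.softmax(text_queries @ (visual_tokens + delta).T, dim=-1)
        energy = -(attn * region_mask).sum()   # reward attention inside the region
        opt.zero_grad()
        energy.backward()
        opt.step()
    return (visual_tokens + delta).detach()

if __name__ == "__main__":
    vis = torch.randn(64, 32)                  # 64 visual tokens, toy dim 32
    txt = torch.randn(4, 32)                   # 4 text prompt tokens
    mask = torch.zeros(64); mask[:8] = 1.0     # "refer" to the first 8 patches
    new_vis = optimize_visual_tokens(vis, txt, mask)
    attn = torch.softmax(txt @ new_vis.T, dim=-1)
    print(attn[:, :8].sum(-1))                 # attention mass shifts into the region
```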
·Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
This paper introduces Chain-of-Sight, a vision-language bridge module that accelerates the pre-training of Multimodal Large Language Models (MLLMs). Our approach employs a sequence of visual resamplers that capture visual details at various spatial scales. This architecture not only leverages global and local visual contexts effectively, but also facilitates the flexible extension of visual tokens through a compound token scaling strategy, allowing up to a 16x increase in the token count post pre-training. Consequently, Chain-of-Sight requires significantly fewer visual tokens in the pre-training phase compared to the fine-tuning phase. This intentional reduction of visual tokens during pre-training notably accelerates the pre-training process, cutting down the wall-clock training time by ~73%. Empirical results on a series of vision-language benchmarks reveal that the pre-training acceleration through Chain-of-Sight is achieved without sacrificing performance, matching or surpassing the standard pipeline of utilizing all visual tokens throughout the entire training process. Further scaling up the number of visual tokens for pre-training leads to stronger performances, competitive to existing approaches in a series of benchmarks. Codes and models will be released.
Pre-training acceleration, multi-scale vision, token scaling
In brief: Chain-of-Sight is a vision-language bridge made of visual resamplers that capture detail at multiple spatial scales. A compound token-scaling strategy lets the visual token count grow by up to 16x after pre-training, so pre-training itself uses far fewer visual tokens than fine-tuning and wall-clock pre-training time drops by roughly 73%. On a range of vision-language benchmarks the accelerated pre-training matches or surpasses the standard pipeline that keeps all visual tokens throughout training, and scaling the tokens further yields even stronger results.
(Pre-training)
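One way to picture compound token scaling (my own illustration with assumed grid sizes and simple average pooling standing in for the learned resamplers): pooling the same feature map at small grids during pre-training and at larger grids afterwards changes the token count, here by exactly 16x, without touching any weights.

```python
import torch
import torch.nn.functional as F

def chain_of_sight_tokens(feature_map, grid_sizes):
    """Sketch: pool one visual feature map at several spatial scales and
    concatenate the resulting tokens (grid sizes are illustrative)."""
    tokens = []
    for g in grid_sizes:
        pooled = F.adaptive_avg_pool2d(feature_map, g)       # (B, C, g, g)
        tokens.append(pooled.flatten(2).transpose(1, 2))     # (B, g*g, C)
    return torch.cat(tokens, dim=1)

if __name__ == "__main__":
    feat = torch.randn(1, 1024, 24, 24)                      # ViT patch grid as a 2D map
    pre  = chain_of_sight_tokens(feat, (2, 4))               # 4 + 16   = 20 tokens
    post = chain_of_sight_tokens(feat, (8, 16))              # 64 + 256 = 320 tokens (16x)
    print(pre.shape[1], post.shape[1])
```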
·Visual Perception by Large Language Model’s Weights
Existing Multimodal Large Language Models (MLLMs) follow the paradigm that perceives visual information by aligning visual features with the input space of Large Language Models (LLMs), and concatenating visual tokens with text tokens to form a unified sequence input for LLMs. These methods demonstrate promising results on various vision-language tasks but are limited by the high computational effort due to the extended input sequence resulting from the involvement of visual tokens. In this paper, instead of input space alignment, we propose a novel parameter space alignment paradigm that represents visual information as model weights. For each input image, we use a vision encoder to extract visual features, convert features into perceptual weights, and merge the perceptual weights with LLM’s weights. In this way, the input of LLM does not require visual tokens, which reduces the length of the input sequence and greatly improves efficiency. Following this paradigm, we propose VLoRA with the perceptual weights generator. The perceptual weights generator is designed to convert visual features to low-rank perceptual weights since the visual information is redundant. Due to the low-rank property, our generated perceptual weights exhibit a form similar to LoRA. The experimental results show that our VLoRA achieves comparable performance on various benchmarks for MLLMs, while significantly reducing the computational costs for both training and inference.
Visual perception, parameter-space alignment, low-rank weights, computational efficiency
In brief: instead of the usual input-space alignment that concatenates visual tokens with text tokens, VLoRA aligns visual information to the LLM's parameter space. A perceptual weights generator converts each image's visual features into low-rank weight updates, similar in form to LoRA, which are merged into the LLM's weights; the input sequence then contains no visual tokens at all, shortening it and cutting both training and inference cost while keeping benchmark performance comparable to standard MLLMs.
(Efficient visual perception)
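The parameter-space alignment can be sketched as follows (illustrative dimensions, a mean-pooling step, and a single target linear layer are my assumptions; the real generator is more elaborate and would keep the update factored): visual features become two low-rank factors whose product is added to an LLM weight matrix, so no visual tokens enter the input sequence.

```python
import torch
import torch.nn as nn

class PerceptualWeightGenerator(nn.Module):
    """Sketch: turn pooled visual features into a low-rank weight delta that is
    merged into an LLM linear layer (sizes illustrative; a practical version
    would keep the two factors separate instead of forming the full matrix)."""
    def __init__(self, vis_dim=1024, llm_dim=4096, rank=8):
        super().__init__()
        self.rank, self.llm_dim = rank, llm_dim
        self.to_a = nn.Linear(vis_dim, rank * llm_dim)
        self.to_b = nn.Linear(vis_dim, llm_dim * rank)

    def forward(self, vis_feats):                 # (B, N, vis_dim) image features
        pooled = vis_feats.mean(dim=1)            # (B, vis_dim)
        A = self.to_a(pooled).view(-1, self.rank, self.llm_dim)   # (B, r, d)
        B = self.to_b(pooled).view(-1, self.llm_dim, self.rank)   # (B, d, r)
        return B @ A                              # (B, d, d) low-rank weight delta

if __name__ == "__main__":
    delta = PerceptualWeightGenerator()(torch.randn(2, 576, 1024))
    base = nn.Linear(4096, 4096, bias=False)
    out = torch.randn(2, 1, 4096) @ (base.weight + delta).transpose(1, 2)
    print(delta.shape, out.shape)                 # (2, 4096, 4096) (2, 1, 4096)
```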
·Dense Connector for MLLMs
Do we fully leverage the potential of visual encoder in Multimodal Large Language Models (MLLMs)? The recent outstanding performance of MLLMs in multimodal understanding has garnered broad attention from both academia and industry. In the current MLLM rat race, the focus seems to be predominantly on the linguistic side. We witness the rise of larger and higher-quality instruction datasets, as well as the involvement of larger-sized LLMs. Yet, scant attention has been directed towards the visual signals utilized by MLLMs, often assumed to be the final high-level features extracted by a frozen visual encoder. In this paper, we introduce the Dense Connector - a simple, effective, and plug-and-play vision-language connector that significantly enhances existing MLLMs by leveraging multi-layer visual features, with minimal additional computational overhead. Furthermore, our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well. Experimental results across various vision encoders, image resolutions, training dataset scales, varying sizes of LLMs (2.7B→70B), and diverse architectures of MLLMs (e.g., LLaVA and Mini-Gemini) validate the versatility and scalability of our approach, achieving state-of-the-art performance across 19 image and video benchmarks. We hope that this work will provide valuable experience and serve as a basic module for future MLLM development. Code is available at https://anonymous.4open.science/r/DCNIPS.
Vision connector, multi-layer visual features, plug-and-play, cross-modal versatility
In brief: the Dense Connector is a simple, plug-and-play vision-language connector that feeds the LLM multi-layer features from the frozen visual encoder rather than only its final high-level output, with minimal extra compute. Trained on images alone, it also shows strong zero-shot video understanding. Experiments across vision encoders, image resolutions, training-data scales, LLM sizes (2.7B→70B), and MLLM architectures (e.g., LLaVA and Mini-Gemini) report state-of-the-art results on 19 image and video benchmarks.
(Efficient visual perception)
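One way to picture the connector (a sketch of a channel-wise concatenation variant; the chosen encoder layers, dimensions, and two-layer MLP projector are assumptions on my part): features from several layers of the frozen vision encoder are concatenated per patch and projected into the LLM embedding space.

```python
import torch
import torch.nn as nn

class DenseConnector(nn.Module):
    """Sketch: concatenate selected vision-encoder layers channel-wise and
    project them for the LLM (layer indices and dims are illustrative)."""
    def __init__(self, vis_dim=1024, llm_dim=4096, num_layers=3):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vis_dim * num_layers, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, hidden_states, layer_ids=(8, 16, 24)):
        # hidden_states: list of per-layer ViT outputs, each (B, N, vis_dim)
        feats = torch.cat([hidden_states[i] for i in layer_ids], dim=-1)
        return self.proj(feats)                   # (B, N, llm_dim) visual tokens

if __name__ == "__main__":
    states = [torch.randn(1, 576, 1024) for _ in range(25)]   # dummy per-layer outputs
    print(DenseConnector()(states).shape)                     # torch.Size([1, 576, 4096])
```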