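VLMs / Agent / CogAgent: Translation and Commentary on "CogAgent: A Visual Language Model for GUI Agents"

Overview: This paper introduces CogAgent, a visual language model (VLM) specialized in understanding and navigating graphical user interfaces (GUIs). Through purpose-built datasets and a model architecture designed for high-resolution input, it addresses the weaknesses of LLMs in GUI understanding and navigation and offers a new approach to building more capable AI agents.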

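>> Background and pain points: Current large language models (LLMs) handle text tasks well but struggle to understand and interact with GUIs, which limits how much they can automate. GUI interaction usually lacks standard APIs, and important information such as icons, images, diagrams, and spatial relations is hard to express directly in words. Even in text-rendered GUIs such as web pages, the functionality of elements like canvas and iframe cannot be recovered by parsing HTML.

>> Solution: The paper proposes CogAgent, an 18-billion-parameter visual language model built specifically for GUI understanding and navigation. By combining a low-resolution and a high-resolution image encoder, it accepts 1120×1120 input and can recognize tiny page elements and text.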

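>> Core ideas and steps:

● Data construction: Because GUI images are distributed differently from natural images, the authors build a large-scale annotated GUI and OCR dataset for continual pre-training. It includes synthetic text renderings, OCR results on natural images, and academic documents, and covers three areas: text recognition, visual grounding, and GUI image understanding. In particular, they build the CCS400K dataset, which contains 400,000 web page screenshots with their corresponding DOM elements and rendered boxes, to strengthen the model's understanding of GUI elements.

● High-resolution cross-attention module: To process high-resolution images without excessive compute, the authors design a high-resolution cross-attention module. It uses a lightweight high-resolution image encoder and fuses the high-resolution image features into every layer of the VLM decoder through cross-attention, improving high-resolution understanding while staying efficient. This avoids the bottleneck of feeding high-resolution image tokens directly into the language model, where self-attention cost grows quadratically with sequence length.

● Pre-training and fine-tuning: CogAgent is first pre-trained on the constructed datasets, then fine-tuned in a multi-task fashion on several VQA datasets and the GUI navigation datasets Mind2Web and AITW, which improves performance across tasks and aligns the model with free-form human instructions.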
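
To make the high-resolution cross-attention module described above more concrete, here is a minimal pure-Python sketch of single-head scaled dot-product cross-attention. It is not the CogAgent implementation: the toy sizes, the `random_vectors` helper, the absence of learned projection matrices, and the residual add at the end are all assumptions made for illustration. The point it shows is that the decoder states act as queries over the high-resolution image features, so the decoder sequence length stays the same no matter how many image patches there are.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: decoder hidden states, one vector per sequence position.
    keys/values: high-resolution image features, one vector per patch.
    Returns one output vector per query, so the decoder sequence length
    does not grow with the number of image patches.
    """
    dim = len(queries[0])
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(out_dim)]
        outputs.append(mixed)
    return outputs


def random_vectors(count, dim, rng):
    """Hypothetical helper: random stand-ins for real hidden states or features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


if __name__ == "__main__":
    rng = random.Random(0)
    decoder_states = random_vectors(count=5, dim=8, rng=rng)     # toy decoder sequence
    highres_features = random_vectors(count=64, dim=8, rng=rng)  # toy 8x8 patch grid
    attended = cross_attention(decoder_states, highres_features, highres_features)
    # Residual add back into the decoder states (an assumption of this sketch).
    fused = [[h + a for h, a in zip(state, out)] for state, out in zip(decoder_states, attended)]
    print(len(fused), len(fused[0]))  # prints "5 8": same shape as the decoder states
```

A real model would apply learned query, key, and value projections and use multiple heads; those are omitted here to keep the mechanism visible.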

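>> Advantages:

● State-of-the-art results on multiple VQA benchmarks: These include VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE, demonstrating strong visual understanding, especially on text-rich VQA tasks.

● Outperforms LLM-based methods on GUI navigation: On Mind2Web and AITW, CogAgent takes only screenshots as input yet surpasses LLM-based methods that consume extracted HTML text, which shows the advantage of VLMs for GUI navigation.

● Efficient high-resolution processing: The high-resolution cross-attention module significantly reduces the compute cost of handling high-resolution images.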
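
The efficiency point above can be illustrated with a rough count of attention-score entries per layer and head. This is only a sketch under stated assumptions: the 224×224 low-resolution view, the 128-token prompt, and both function names are hypothetical choices for illustration, and the count ignores feed-forward layers, the image encoders, and the hidden sizes of each branch. It therefore does not reproduce any FLOPs figure from the paper; it only shows why keeping high-resolution tokens out of the decoder sequence helps.

```python
def joint_self_attention_pairs(highres_tokens: int, text_tokens: int) -> int:
    """Score entries when every high-res image token joins the decoder sequence."""
    seq = highres_tokens + text_tokens
    return seq * seq


def two_branch_pairs(lowres_tokens: int, highres_tokens: int, text_tokens: int) -> int:
    """Score entries when the decoder sequence holds only low-res and text tokens
    and reaches the high-res features through cross-attention."""
    seq = lowres_tokens + text_tokens
    self_attention = seq * seq               # quadratic, but in the short sequence
    cross_attention = seq * highres_tokens   # linear in the number of high-res tokens
    return self_attention + cross_attention


if __name__ == "__main__":
    HIGHRES_TOKENS = 6400  # 1120x1120 image with patch size 14
    LOWRES_TOKENS = 256    # assumption: 224x224 low-res view with patch size 14
    TEXT_TOKENS = 128      # hypothetical prompt length

    joint = joint_self_attention_pairs(HIGHRES_TOKENS, TEXT_TOKENS)
    split = two_branch_pairs(LOWRES_TOKENS, HIGHRES_TOKENS, TEXT_TOKENS)
    print(f"joint self-attention: {joint:,} score entries")
    print(f"two-branch design:    {split:,} score entries")
    print(f"ratio: {joint / split:.1f}x")
```

Under these assumptions the cross-attention term grows linearly with the number of high-resolution tokens, while the quadratic term is confined to the much shorter low-resolution and text sequence.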

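>> Conclusions and takeaways:

● CogAgent is a strong VLM that can effectively understand and navigate GUIs.

● VLMs have a clear advantage for building GUI agents and can outperform LLM approaches that rely on text alone.

● CogAgent's high-resolution cross-attention module is computationally efficient for high-resolution images.

● Building domain-specific pre-training data is essential for training GUI agents.

● Despite these results, CogAgent still has shortcomings, such as imprecise output coordinates and the inability to process multiple images, that call for further research.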

Translation and Commentary on "CogAgent: A Visual Language Model for GUI Agents"

Link

Paper: https://arxiv.org/abs/2312.08914

Date

December 14, 2023

Latest version: December 27, 2024

Authors

Tsinghua University and the Zhipu AI team

Abstract

People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails, but struggle to understand and interact with GUIs, thus limiting their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120, enabling it to recognize tiny page elements and text. As a generalist visual language model, CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, infoVQA, DocVQA, MM-Vet, and POPE. CogAgent, using only screenshots as input, outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks -- Mind2Web and AITW, advancing the state of the art. The model and codes are available at https://github.com/THUDM/CogVLM, with a new version of CogAgent-9B-20241220 available at https://github.com/THUDM/CogAgent.

Figure 1: Samples of visual agents generated by CogAgent. More samples are demonstrated in the Appendix.

1. Introduction

Autonomous agents in the digital world are ideal assistants that many modern people dream of. Picture this scenario: You type in a task description, then relax and enjoy a cup of coffee while watching tasks like booking tickets online, conducting web searches, managing files, and creating PowerPoint presentations get completed automatically.

Recently, the emergence of agents based on large language models (LLMs) is bringing us closer to this dream. For example, AutoGPT [33], a 150,000-star open-source project, leverages ChatGPT [29] to integrate language understanding with pre-defined actions like Google searches and local file operations. Researchers are also starting to develop agent-oriented LLMs [42, 7]. However, the potential of purely language-based agents is quite limited in real-world scenarios, as most applications interact with humans through Graphical User Interfaces (GUIs), which are characterized by the following perspectives:

>> Standard APIs for interaction are often lacking.

>> Important information, including icons, images, diagrams, and spatial relations, is difficult to convey directly in words.

>> Even in text-rendered GUIs like web pages, elements like canvas and iframe cannot be parsed to grasp their functionality via HTML.

Agents based on visual language models (VLMs) have the potential to overcome these limitations. Instead of relying exclusively on textual inputs such as HTML [28] or OCR results [31], VLM-based agents directly perceive visual GUI signals. Since GUIs are designed for human users, VLM-based agents can perform as effectively as humans, as long as the VLMs match human-level vision understanding. In addition, VLMs are also capable of skills such as extremely fast reading and programming that are usually beyond the reach of most human users, extending the potential of VLM-based agents. A few prior studies utilized visual features merely as auxiliaries in specific scenarios, e.g., WebShop [39], which employs visual features primarily for object recognition purposes. With the rapid development of VLMs, can we naturally achieve universality on GUIs by relying solely on visual inputs?

In this work, we present CogAgent, a visual language foundation model specializing in GUI understanding and planning while maintaining a strong ability for general cross-modality tasks. By building upon CogVLM [38]—a recent open-source VLM, CogAgent tackles the following challenges for building GUI agents:

>> Training Data. Most current VLMs are pre-trained on datasets like LAION [32], consisting of natural images on the Web. However, we notice that the GUI images share a different distribution from natural images. We thus construct a large-scale annotated dataset about GUIs and OCR for continual pre-training.

>> High-Resolution vs. Compute. In GUIs, tiny icons and text are ubiquitous, and it is hard to recognize them in commonly-used 224×224 resolution. However, increasing the resolution of input images results in significantly long sequence length in language models. For example, a 1120×1120 image corresponds to a sequence of 6400 tokens if the patch size is 14, demanding excessive training and inference compute. To address this, we design a cross-attention branch that allows for a trade-off between the resolution and the hidden size within a proper computation budget. Specifically, we propose to combine the original large ViT [12] (4.4B parameters) used in CogVLM [38] and a new small high-resolution cross-module (with image encoder of 0.30B parameters) to jointly model visual features.
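
The token arithmetic in the second challenge can be checked with a few lines of Python. This is a small sketch, not code from the paper: the function names are hypothetical, and the 490×490 row is included only because that resolution is mentioned elsewhere in the text. It counts patch tokens for a square image and the number of entries in a single self-attention score matrix, which grows with the square of the token count.

```python
def num_patch_tokens(resolution: int, patch_size: int = 14) -> int:
    """Number of patch tokens for a square image split into square patches."""
    if resolution % patch_size != 0:
        raise ValueError("resolution must be divisible by patch_size")
    per_side = resolution // patch_size
    return per_side * per_side


def self_attention_pairs(num_tokens: int) -> int:
    """Entries in one self-attention score matrix (quadratic in sequence length)."""
    return num_tokens * num_tokens


if __name__ == "__main__":
    for res in (224, 490, 1120):
        n = num_patch_tokens(res)
        print(f"{res}x{res}: {n} tokens, {self_attention_pairs(n):,} attention pairs")
```

With patch size 14, the three resolutions give 256, 1225, and 6400 tokens. Going from 224×224 to 1120×1120 multiplies the token count by 25 and the number of attention pairs by 625, which is the cost the cross-attention branch is meant to avoid.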

Our experiments show that:

>> CogAgent tops popular GUI understanding and decision-making benchmarks, including AITW [31] and Mind2Web [10]. To the best of our knowledge, this is the first time that a generalist VLM can outperform LLM-based methods with extracted structured text.

>> Though CogAgent focuses on GUIs, it achieves state-of-the-art generalist performance on nine visual question-answering benchmarks including VQAv2 [1], OK-VQA [23], TextVQA [34], ST-VQA [4], ChartQA [24], infoVQA [26], DocVQA [25], MM-Vet [41], and POPE [19].

>> The separated design of high- and low-resolution branches in CogAgent significantly lowers the compute cost for consuming high-resolution images, e.g., the number of floating-point operations (FLOPs) for CogAgent-18B with 1120×1120 inputs is less than half that of CogVLM-17B with its default 490×490 inputs.

CogAgent is open-sourced at https://github.com/THUDM/CogVLM, with a new version of CogAgent-9B-20241220 available at https://github.com/THUDM/CogAgent. It represents an effort to promote the future research and application of AI agents, facilitated by advanced VLMs.

Conclusion

We introduce CogAgent, a VLM-based GUI agent with enhanced pre-train data construction and efficient architecture for high-resolution input. CogAgent achieves state-of-the-art performance on a wide range of VQA and GUI benchmarks, and will be open-sourced. CogAgent is an initial exploration of VLM-based GUI agent, and still has some shortcomings, e.g. imprecise output coordinates and incapability of processing multiple images, necessitating further research.
