LLMs & RAG: Translation and Commentary on "EdgeRAG: Online-Indexed RAG for Edge Devices"
Overview: This paper tackles the challenge of deploying Retrieval-Augmented Generation (RAG) systems on resource-constrained edge devices and proposes an efficient method called EdgeRAG. By carefully combining pre-computation, online generation, and caching, EdgeRAG resolves the memory and latency problems of running RAG on edge hardware, offering a practical path to more capable LLM applications on constrained devices.
>> Background and pain points: Deploying a RAG system on edge devices faces two main challenges:
● Memory limits: edge devices have little memory and cannot hold a large vector database, which causes memory thrashing and degraded performance; even a two-level Inverted File (IVF) index can run into this problem.
● Compute limits: edge devices have limited processing power, so generating embedding vectors online adds high latency; for large clusters the delay becomes significant and produces a long latency tail.
>> Proposed solution: EdgeRAG is a memory-efficient RAG system for edge devices that addresses these problems with two strategies:
● Selective index storage: prune the second-level embeddings of the IVF index, store embeddings only for clusters that are expensive to compute, and generate the remaining embeddings online at retrieval time.
● Adaptive cost-aware caching: cache online-generated embeddings, prioritizing those that are expensive to generate, to avoid repeated computation and improve efficiency.
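A minimal sketch of the cost-aware caching idea above. The class name, fields, and the evict-the-cheapest rule are illustrative assumptions, not the paper's implementation; the paper only states that expensive-to-generate embeddings are prioritized.

```python
class CostAwareCache:
    """Toy cost-aware embedding cache: when full, evict the entry that is
    cheapest to regenerate (an assumed policy, not the paper's code)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # cluster_id -> (generation_cost, embeddings)

    def get(self, cluster_id):
        entry = self.entries.get(cluster_id)
        return entry[1] if entry else None

    def put(self, cluster_id, gen_cost, embeddings):
        if len(self.entries) >= self.capacity:
            # Evict the cluster whose embeddings are cheapest to regenerate,
            # since recomputing it later costs the least.
            victim = min(self.entries, key=lambda c: self.entries[c][0])
            del self.entries[victim]
        self.entries[cluster_id] = (gen_cost, embeddings)


cache = CostAwareCache(capacity=2)
cache.put("c1", gen_cost=5.0, embeddings=[0.1])
cache.put("c2", gen_cost=1.0, embeddings=[0.2])
cache.put("c3", gen_cost=3.0, embeddings=[0.3])  # evicts c2, the cheapest
```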
>> Core workflow: EdgeRAG proceeds in two phases:
● Indexing: split the text into chunks, generate embedding vectors, cluster them with k-means, and store the first-level index (cluster centroids) and the second-level index (chunk references plus each cluster's embedding-generation latency). If a cluster's embedding-generation latency exceeds a preset threshold (the SLO), its embeddings are pre-computed and stored; otherwise only the chunk references are kept and the embeddings are generated online at retrieval time.
● Retrieval: on receiving a query, find the most similar centroid, then check whether that cluster's embeddings were pre-computed and stored. If stored, load them directly; otherwise check the cache and load on a hit; on a miss, generate the embeddings online and update the cache. Finally, retrieve the relevant chunks and pass them to the LLM to generate the answer.
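The two-phase lookup above can be sketched in Python. All names here are illustrative: `embed_fn` stands in for the on-device embedding model, and a cluster whose second-level embeddings were pruned carries `embeddings = None`.

```python
import numpy as np

def retrieve(query_emb, centroids, clusters, cache, embed_fn, top_k=3):
    """Sketch of EdgeRAG-style two-level retrieval (illustrative names)."""
    # Level 1: find the nearest centroid; only this level must stay resident.
    cid = int(np.linalg.norm(centroids - query_emb, axis=1).argmin())
    cluster = clusters[cid]
    if cluster.get("embeddings") is not None:   # pre-computed tail cluster
        embs = cluster["embeddings"]
    elif cid in cache:                          # reuse a cached generation
        embs = cache[cid]
    else:                                       # generate online, then cache
        embs = np.stack([embed_fn(chunk) for chunk in cluster["chunks"]])
        cache[cid] = embs
    # Level 2: rank chunks inside the selected cluster only.
    order = np.linalg.norm(embs - query_emb, axis=1).argsort()[:top_k]
    return [cluster["chunks"][i] for i in order]
```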
>> Advantages: EdgeRAG's main strengths are:
● Memory efficiency: pruning the second-level embeddings shrinks the memory footprint, letting large datasets run on edge devices.
● Low latency: pre-computing embeddings for large clusters and caching generated embeddings reduces retrieval latency, especially tail latency.
● Preserved generation quality: despite these optimizations, EdgeRAG matches the generation quality of the baseline system.
>> Conclusions and takeaways:
● Experiments on an Nvidia Jetson Orin Nano show that EdgeRAG improves retrieval latency by 1.8× on average, and by 3.82× for large datasets, while maintaining generation quality similar to the baseline.
● EdgeRAG fits several datasets that exceed the edge device's memory capacity into memory, avoiding memory thrashing.
● The paper discusses integrating EdgeRAG with other RAG systems and the potential of exploiting hardware accelerators.
● The authors argue that EdgeRAG provides an efficient way to deploy RAG systems on edge devices, especially in scenarios with tight memory and compute budgets.
Contents
"EdgeRAG: Online-Indexed RAG for Edge Devices": Translation and Commentary
Abstract
1、Introduction
Figure 1: RAG Pipelines
Figure 2: Retrieval process of the Inverted File Index
Conclusion
"EdgeRAG: Online-Indexed RAG for Edge Devices": Translation and Commentary
Paper | https://arxiv.org/abs/2412.21023 |
Date | December 31, 2024 |
Authors | University of Virginia, University of Waterloo, Google |
Abstract
Deploying Retrieval Augmented Generation (RAG) on resource-constrained edge devices is challenging due to limited memory and processing power. In this work, we propose EdgeRAG, which addresses the memory constraint by pruning embeddings within clusters and generating embeddings on demand during retrieval. To avoid the latency of generating embeddings for large tail clusters, EdgeRAG pre-computes and stores embeddings for these clusters, while adaptively caching the remaining embeddings to minimize redundant computation and further optimize latency. Results on the BEIR suite show that EdgeRAG offers significant latency reduction over the baseline IVF index with similar generation quality, while allowing all of our evaluated datasets to fit into memory.
1. Introduction
Large Language Models (LLMs) enable new applications such as smart assistants (Google, 2024; Microsoft). Processing for these powerful LLMs is usually offloaded to the datacenter because of the enormous resources they require. However, the latest mobile platforms allow smaller LLMs to run locally. These lightweight models cannot directly compete with state-of-the-art hundred-billion-parameter LLMs; a promising way to extend them to users' custom data and applications is to build a compound system. By integrating LLMs with Retrieval Augmented Generation (RAG), these smaller models can leverage local personal data to generate high-quality responses.

Even though RAG removes the need for a heavyweight LLM during generation, retrieval still carries a high overhead. The core of a RAG system is a vector database that supports vector similarity search. Unlike LLMs, the overhead of RAG comes mainly from its memory footprint. A Flat index, for example, stores and sequentially searches every vector representation of the data chunks to identify the closest match to the query. In the fever (Thorne et al., 2018) dataset, a vector database holding 5.23 million records has an index size of 18.5 GB, whereas mobile devices typically have 4-12 GB of main memory (Wiens, 2024; Sam, 2024). Even the entire memory of a mobile platform is therefore insufficient to run a large vector database, while storing the database on disk introduces substantial access latency that hurts performance.

Our research focuses on the challenges of implementing Retrieval Augmented Generation (RAG) on edge systems, chiefly the overhead of vector similarity search. We find that naively keeping the entire index in main memory does not fit mobile platforms: a Flat index that sequentially searches all embeddings is not only expensive in terms of computation but also thrashes memory, leading to poor performance.
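The memory gap can be sanity-checked with back-of-envelope arithmetic from the figures above. Only the record count, index size, and 8 GB device memory come from the text; the per-vector size and implied float32 dimension count are derived, not stated in the paper.

```python
records = 5.23e6       # vectors in the fever index (from the text)
index_bytes = 18.5e9   # Flat index size in bytes (from the text)
device_mem = 8e9       # e.g., an 8 GB Jetson Orin Nano

bytes_per_vec = index_bytes / records
print(f"~{bytes_per_vec:.0f} bytes per vector "
      f"(≈{bytes_per_vec / 4:.0f} float32 dimensions, a derived figure)")
print(f"index is {index_bytes / device_mem:.1f}x the device's total memory")
```

So the index alone is more than twice the device's entire memory, before counting the OS, the LLM, and the application itself.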
A two-level Inverted File (IVF) index, on the other hand, clusters the embeddings of data chunks around centroids. Retrieval first searches for the closest centroid and then performs a second search within that cluster, thus avoiding an expensive sequential search over all embeddings. However, keeping all embeddings in memory still leads to excessive memory thrashing and increased latency, so keeping the first-level centroids in memory and generating the second level online is a promising direction. By further profiling RAG on mobile platforms with widely used RAG benchmarks (Thakur et al., 2021), we find that both the data access pattern and the access latency are highly skewed. First, most embedding vectors are never searched during retrieval. Second, the cost of generating an embedding vector is not the same for all clusters and has an extreme tail distribution. This skewness leaves room for further optimization.
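The two-level IVF structure described here can be sketched with a toy k-means build. This is a numpy-only illustration (production systems would use a library such as FAISS): only `centroids` needs to stay resident, while the per-cluster member lists are the second level that EdgeRAG prunes and regenerates.

```python
import numpy as np

def build_ivf(embs, n_clusters, iters=10, seed=0):
    """Toy IVF build: k-means centroids (level 1) plus, per cluster,
    the indices of member vectors (level 2 'inverted lists')."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from distinct random data points.
    centroids = embs[rng.choice(len(embs), n_clusters, replace=False)]
    for _ in range(iters):
        # Assign each embedding to its nearest centroid.
        d = np.linalg.norm(embs[:, None, :] - centroids[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Move each centroid to the mean of its members.
        for c in range(n_clusters):
            members = embs[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # Final assignment defines the inverted lists.
    d = np.linalg.norm(embs[:, None, :] - centroids[None, :, :], axis=2)
    assign = d.argmin(axis=1)
    lists = {c: np.where(assign == c)[0] for c in range(n_clusters)}
    return centroids, lists
```

Keeping only the centroids in memory costs roughly `n_clusters / len(embs)` of the flat index's footprint, which is why pruning the second level is attractive on an 8 GB device.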
In this work, we develop EdgeRAG, a mobile RAG system that enables RAG-based LLMs on mobile platforms by fitting the vector database into limited mobile memory while ensuring that response times meet the service-level objectives (SLOs) of mobile AI assistant applications. Based on the observations above, our key ideas are the following. First, we prune the embeddings of data chunks within centroid clusters, which are used only for the second-level search, to save memory capacity; EdgeRAG then generates these embeddings online during retrieval. However, given the limited compute of edge systems, online generation can suffer long latency on large tail clusters. Our second idea is therefore to pre-compute and store the embeddings of large tail clusters, avoiding the long tail latency of regenerating the data within them. Finally, EdgeRAG adaptively caches the remaining embeddings, based on spare memory capacity and SLO requirements, to minimize redundant computation and improve overall latency.

We evaluate EdgeRAG on a mobile platform based on the Nvidia Jetson Orin Nano with 8 GB of shared main memory, similar to mobile edge platforms with neural processing capabilities (Sam, 2024). We use 6 workloads from the BEIR benchmark suite (Thakur et al., 2021) and tune the retrieval hyperparameters to normalize recall against the Flat index baseline. We also evaluate generation quality using the GPT-4o (OpenAI) LLM as an LLM evaluator (Saad-Falcon et al., 2023), and use time-to-first-token (TTFT) latency as the main metric. The results show that EdgeRAG offers 1.8× faster TTFT than the baseline IVF index on average, and 3.82× for larger datasets.
At the same time, EdgeRAG maintains similar generation quality, with recall and generation scores within 5 percent of the Flat index baseline, while allowing all of our evaluated datasets to fit into memory and avoiding memory thrashing.
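Recall here measures how much of the exact Flat-index result the approximate index recovers. A minimal sketch of the metric (the ids and scores are illustrative, not the paper's evaluation code):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids found among the top-k retrieved ids."""
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

# Compare an approximate (IVF-style) retriever against the exact Flat baseline:
flat = ["d1", "d2", "d3", "d4"]      # ground-truth top-4 from the Flat index
ivf = ["d1", "d3", "d9", "d2"]       # top-4 from the approximate index
print(recall_at_k(ivf, flat, k=4))   # 3 of the 4 baseline results recovered
```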
In summary, the contributions of this work are the following:
• We identify two key challenges of implementing RAG on edge devices. First, limited memory capacity cannot hold a large embedding vector database, leading to memory thrashing and poor performance, even though a two-level IVF index searches only a small subset of the embeddings. Second, the limited computing power of edge devices slows down online embedding generation, especially on the few large tail clusters and on repeatedly used clusters.
• To enable scalable and memory-efficient RAG on edge systems, we propose EdgeRAG, which improves on the IVF index by pruning second-level embeddings to reduce the memory footprint and generating them online at retrieval time. EdgeRAG mitigates the long tail latency of generating embeddings for heavy clusters by pre-computing and storing those tails. To further optimize latency, EdgeRAG selectively caches generated embeddings, reducing redundant computation while minimizing memory overhead.
• We implement EdgeRAG on the Jetson Orin edge platform and evaluate it on datasets from the BEIR benchmark. The results show that EdgeRAG improves the retrieval latency of large datasets, whose embeddings exceed memory capacity, by 131%, with only a slight reduction in retrieval and generation quality.
Figure 1: RAG Pipelines
Figure 2: Retrieval process of the Inverted File Index
Conclusion
In this work, we propose EdgeRAG, a novel RAG system designed to address the memory limitations of edge platforms. EdgeRAG optimizes the two-level IVF index by pruning unnecessary second-level embeddings, selectively storing or regenerating them during execution, and caching generated embeddings to minimize redundant computation. This approach enables efficient RAG applications on datasets that exceed available memory, while preserving low retrieval latency and without compromising generation quality. Our evaluation shows that EdgeRAG improves retrieval latency by 1.22× on average, and by a substantial 3.69× for large datasets that cannot fit in memory.