RAG in Practice with LLMs (Part 32) | Evaluating RAG with RAGAs and LlamaIndex

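Previous articles covered the basic RAG pipeline and a range of optimizations (query rewriting, semantic chunking, reranking, and so on). But if an existing RAG system turns out not to be effective enough, how should its effectiveness be evaluated?

This article introduces the RAG evaluation framework RAGAs [1] and walks through a complete RAG evaluation using RAGAs together with LlamaIndex.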

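1. RAG Evaluation Metrics

Put simply, the RAG process involves three main parts: the input query, the retrieved context, and the response generated by the LLM. These three elements form the most important triad in RAG, and they depend on one another.

The effectiveness of RAG can therefore be evaluated by measuring the relevance between the elements of this triad, as shown in Figure 1.

The paper "RAGAS: Automated Evaluation of Retrieval Augmented Generation" [1] proposes three RAG evaluation metrics: (1) faithfulness, (2) answer relevance, and (3) context relevance. None of these metrics requires a human-annotated dataset or reference answers.

In addition, the RAGAs website [2] introduces two more metrics: context precision and context recall.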

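1.1 Faithfulness

Faithfulness means that the answer is grounded in the given context. This matters both for avoiding hallucinations and for ensuring that the retrieved context can serve as justification for the generated answer. A low score indicates that the LLM's response does not adhere to the retrieved knowledge, which makes hallucinated answers more likely.

To evaluate faithfulness, we first use an LLM to extract a set of statements S(a(q)) from the answer, using the following prompt: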

Given a question and answer, create one or more statements from each sentence in the given answer.
question: [question]
answer: [answer]

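After generating S(a(q)), the LLM determines whether each statement s_i can be inferred from the context c(q). This verification step uses the following prompt: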

Consider the given context and following statements, then determine whether they are supported by the information present in the context. Provide a brief explanation for each statement before arriving at the verdict (Yes/No). Provide a final verdict for each statement in order at the end in the given format. Do not deviate from the specified format.
statement: [statement 1]
...
statement: [statement n]

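The final faithfulness score F is computed as F = |V| / |S|, where |V| is the number of statements the LLM judged to be supported by the context and |S| is the total number of statements.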
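The scoring step can be sketched in a few lines of Python. This is a minimal illustration, not the RAGAs implementation: the function name `faithfulness_score` and the list of boolean verdicts are assumptions made for the example, standing in for the Yes/No verdicts returned by the verification prompt.

```python
from typing import Sequence


def faithfulness_score(verdicts: Sequence[bool]) -> float:
    """Return F = |V| / |S| for a list of per-statement verdicts.

    Each entry is True if the LLM judged the statement to be supported
    by the context (verdict "Yes") and False otherwise (verdict "No").
    """
    if not verdicts:
        raise ValueError("at least one statement is required")
    supported = sum(1 for v in verdicts if v)  # |V|
    return supported / len(verdicts)           # |V| / |S|


# Hypothetical verdicts for an answer split into four statements,
# three of which are supported by the retrieved context.
print(faithfulness_score([True, True, False, True]))
```

With three of four statements supported, the sketch yields 0.75. An answer whose statements are all unsupported would score 0.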

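1.2 Answer Relevance

Answer relevance measures how relevant the generated answer is to the query. A higher score indicates better relevance.

To estimate answer relevance, we prompt the LLM to generate n potential questions q_i based on the given answer a(q), as follows: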

Generate a question for the given answer.
answer: [answer]

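We then use a text embedding model to obtain embeddings for all of the questions. For each q_i we compute its similarity sim(q, q_i) to the original question q, where the similarity can be the cosine similarity between the embeddings. The answer relevance score AR for question q is the average of these similarities, as defined in the RAGAS paper [1]:

$$AR = \frac{1}{n} \sum_{i=1}^{n} \mathrm{sim}(q, q_i)$$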
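The averaging step can be sketched as follows, assuming AR is the mean cosine similarity between the original question and the n generated questions, as in the RAGAS paper. The helper names and the two-dimensional toy embeddings are assumptions for illustration only; in practice the vectors would come from a text embedding model.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def answer_relevance(
    question_emb: Sequence[float],
    generated_embs: Sequence[Sequence[float]],
) -> float:
    """Mean of sim(q, q_i) over the n questions generated from the answer."""
    if not generated_embs:
        raise ValueError("at least one generated question is required")
    sims = [cosine_similarity(question_emb, e) for e in generated_embs]
    return sum(sims) / len(sims)


# Toy embeddings (hypothetical): one generated question matches the
# original question exactly, the other is orthogonal to it.
q = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
print(answer_relevance(q, generated))
```

In this toy case the two similarities are 1 and 0, so the score is 0.5.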

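1.3 Context Relevance

Context relevance measures retrieval quality, mainly assessing how well the retrieved context supports the query. A low score indicates that a large amount of irrelevant content was retrieved, which may affect the final answer generated by the LLM.

To estimate context relevance, an LLM is used to extract from the context c(q) a set of key sentences S_ext that are essential for answering the question. The prompt is as follows: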

Please extract relevant sentences from the provided context that can potentially help answer the following question. If no relevant sentences are found, or if you believe the question cannot be answered from the given context, return the phrase "Insufficient Information". While extracting candidate sentences you’re not allowed to make any changes to sentences from given context.

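In RAGAs, relevance is computed at the sentence level, as the fraction of sentences in the retrieved context that were extracted as relevant:

$$CR = \frac{\text{number of extracted sentences}}{\text{total number of sentences in } c(q)}$$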
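A minimal sketch of the sentence-level ratio, assuming the score is the number of extracted sentences divided by the total number of sentences in the retrieved context. The function name `context_relevance` is hypothetical, and the context is assumed to be already split into sentences. Because the prompt forbids the LLM from altering sentences, the sketch only counts extracted sentences that appear verbatim in the context.

```python
from typing import Sequence


def context_relevance(
    extracted_sentences: Sequence[str],
    context_sentences: Sequence[str],
) -> float:
    """Fraction of context sentences that were extracted as relevant."""
    if not context_sentences:
        raise ValueError("the context must contain at least one sentence")
    context_set = set(context_sentences)
    # Keep only sentences copied verbatim from the context.
    kept = [s for s in extracted_sentences if s in context_set]
    return len(kept) / len(context_sentences)


# Hypothetical context of four sentences, one of which helps
# answer the question.
context = [
    "TinyLlama is a 1.1B language model.",
    "The weather was sunny that day.",
    "The authors thank their colleagues.",
    "The paper has several appendices.",
]
extracted = ["TinyLlama is a 1.1B language model."]
print(context_relevance(extracted, context))
```

Here one of four sentences is relevant, giving 0.25. If the LLM returns "Insufficient Information", the extracted list is empty and the score is 0.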

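1.4 Context Recall

This metric measures how well the retrieved context aligns with the annotated answer. It is computed from the ground truth and the retrieved context, and a higher value indicates better performance.

This evaluation method requires annotated data.

The formula, as given in the RAGAs documentation [2], is:

$$\text{context recall} = \frac{|\text{ground-truth sentences that can be attributed to the context}|}{|\text{sentences in the ground truth}|}$$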
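As a rough sketch, assuming context recall is the share of ground-truth sentences that can be attributed to the retrieved context: the attribution judgment itself would come from an LLM, so the example below takes it as given. The function name `context_recall_score` and the dictionary of per-sentence flags are hypothetical.

```python
from typing import Mapping


def context_recall_score(attributed: Mapping[str, bool]) -> float:
    """Share of ground-truth sentences attributable to the context.

    Keys are ground-truth sentences; values say whether the sentence
    can be attributed to the retrieved context.
    """
    if not attributed:
        raise ValueError("the ground truth must contain at least one sentence")
    hits = sum(1 for ok in attributed.values() if ok)
    return hits / len(attributed)


# Hypothetical attribution flags for a three-sentence ground truth.
flags = {
    "TinyLlama is a compact 1.1B language model.": True,
    "It builds on the architecture and tokenizer of Llama 2.": True,
    "It outperforms models of comparable size.": False,
}
print(context_recall_score(flags))
```

Two of the three ground-truth sentences are covered by the context, so the score is two thirds.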

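1.5 Context Precision

This metric is somewhat more complex. It measures whether the relevant contexts, that is, those containing ground-truth facts, are ranked near the top. A higher score indicates higher precision.

The formula, as given in the RAGAs documentation [2], is:

$$\text{Context Precision@}K = \frac{\sum_{k=1}^{K} \left(\text{Precision@}k \times v_k\right)}{\text{total number of relevant items in the top } K \text{ results}}$$

$$\text{Precision@}k = \frac{\text{true positives@}k}{\text{true positives@}k + \text{false positives@}k}$$

Here K is the number of retrieved chunks and v_k ∈ {0, 1} indicates whether the chunk at rank k is relevant.

The strength of context precision is that it is sensitive to ranking. Its weakness is that if only a few relevant chunks are recalled but all of them are ranked highly, the score will still be high. It should therefore be considered together with the other metrics when judging overall performance.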
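Both the ranking sensitivity and the weakness are easy to see in a small sketch. It assumes the Context Precision@K definition from the RAGAs documentation: the sum of Precision@k over the ranks k that hold a relevant chunk, divided by the number of relevant chunks in the top K. The function name and the 0/1 relevance lists are hypothetical.

```python
from typing import Sequence


def context_precision_at_k(relevance: Sequence[int]) -> float:
    """Rank-aware precision over the top-K retrieved chunks.

    relevance[k - 1] is v_k: 1 if the chunk at rank k is relevant,
    0 otherwise.
    """
    if any(v not in (0, 1) for v in relevance):
        raise ValueError("relevance indicators must be 0 or 1")
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0
    hits = 0
    score = 0.0
    for k, v in enumerate(relevance, start=1):
        if v:
            hits += 1
            score += hits / k  # Precision@k, counted only where v_k = 1
    return score / total_relevant


# The same single relevant chunk, ranked first versus ranked last.
print(context_precision_at_k([1, 0, 0, 0]))
print(context_precision_at_k([0, 0, 0, 1]))
```

The first call gives 1.0 and the second 0.25, which shows the ranking sensitivity. The first case also shows the weakness: a single relevant chunk at rank 1 earns a perfect score no matter how many other relevant chunks the retriever failed to return, which is why context recall is worth checking alongside it.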

2. RAG Evaluation with RAGAs and LlamaIndex

The main workflow is shown in Figure 6.

2.1 Environment Setup

Install ragas with pip and check the installed version.

(py) Florian:~ Florian$ pip list | grep ragas
ragas                        0.0.22

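Note that if you install the latest version (v0.1.0rc1) with pip install git+https://github.com/explodinggradients/ragas.git, that version does not support LlamaIndex.

Next, import the relevant libraries and set the environment and global variables.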

import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"
dir_path = "YOUR_DIR_PATH"

from llama_index import VectorStoreIndex, SimpleDirectoryReader
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
    context_precision
)
from ragas.llama_index import evaluate

The directory contains the PDF of the paper "TinyLlama: An Open-Source Small Language Model" [3].

(py) Florian:~ Florian$ ls /Users/Florian/Downloads/pdf_test/
tinyllama.pdf

2.2 Building a Simple RAG Query Engine with LlamaIndex

documents = SimpleDirectoryReader(dir_path).load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

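By default LlamaIndex uses OpenAI models. The LLM and the embedding model can be easily configured through ServiceContext.

Building the Evaluation Dataset

Because some metrics require a manually annotated dataset, here are some example questions and their corresponding answers: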

eval_questions = [
    "Can you provide a concise description of the TinyLlama model?",
    "I would like to know the speed optimizations that TinyLlama has made.",
    "Why TinyLlama uses Grouped-query Attention?",
    "Is the TinyLlama model open source?",
    "Tell me about starcoderdata dataset",
]

eval_answers = [
    "TinyLlama is a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes.",
    "During training, our codebase has integrated FSDP to leverage multi-GPU and multi-node setups efficiently. Another critical improvement is the integration of Flash Attention, an optimized attention mechanism. We have replaced the fused SwiGLU module from the xFormers (Lefaudeux et al., 2022) repository with the original SwiGLU module, further enhancing the efficiency of our codebase. With these features, we can reduce the memory footprint, enabling the 1.1B model to fit within 40GB of GPU RAM.",
    "To reduce memory bandwidth overhead and speed up inference, we use grouped-query attention in our model. We have 32 heads for query attention and use 4 groups of key-value heads. With this technique, the model can share key and value representations across multiple heads without sacrificing much performance",
    "Yes, TinyLlama is open-source",
    "This dataset was collected to train StarCoder (Li et al., 2023), a powerful opensource large code language model. It comprises approximately 250 billion tokens across 86 programming languages. In addition to code, it also includes GitHub issues and text-code pairs that involve natural languages.",
]

eval_answers = [[a] for a in eval_answers]
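The last line of the dataset code wraps each reference answer in its own list, so that every question is paired with a list of ground-truth answers. A small helper can make that pairing explicit and catch a length mismatch before any API calls are made. The function name `to_ground_truths` is hypothetical and not part of RAGAs or LlamaIndex.

```python
from typing import List


def to_ground_truths(questions: List[str], answers: List[str]) -> List[List[str]]:
    """Pair each question with a one-element list of reference answers."""
    if len(questions) != len(answers):
        raise ValueError(
            f"{len(questions)} questions but {len(answers)} answers"
        )
    # Same transformation as: [[a] for a in eval_answers]
    return [[a] for a in answers]


# Shortened, hypothetical question/answer pairs.
questions = ["Is the TinyLlama model open source?"]
answers = ["Yes, TinyLlama is open-source"]
print(to_ground_truths(questions, answers))
```

Calling the helper with lists of different lengths raises a ValueError instead of silently misaligning questions and answers.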

Metric Selection and RAGAs Evaluation

metrics = [
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_precision,
    context_recall,
]

result = evaluate(query_engine, metrics, eval_questions, eval_answers)
result.to_pandas().to_csv('YOUR_CSV_PATH', sep=',')
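The evaluation code writes its results to a CSV file. Assuming that file has one row per question and one numeric column per metric (an assumption about the output layout that is worth checking against your own file), per-metric averages can be computed with the standard library alone. The column names and scores below are made up for illustration and are not results from this article.

```python
import csv
import io
from statistics import mean
from typing import Dict, Sequence


def mean_scores(csv_text: str, metric_columns: Sequence[str]) -> Dict[str, float]:
    """Average each metric column over all rows of a results CSV."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        raise ValueError("the CSV contains no data rows")
    return {m: mean(float(r[m]) for r in rows) for m in metric_columns}


# Hypothetical CSV content with made-up scores.
sample_csv = """question,faithfulness,answer_relevancy
q1,1.0,0.5
q2,0.5,1.0
"""
print(mean_scores(sample_csv, ["faithfulness", "answer_relevancy"]))
```

For a real run, the text would come from reading the file written to YOUR_CSV_PATH rather than from an inline string.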

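Note that RAGAs uses OpenAI models by default.

If you want to use a different LLM (such as Gemini) for evaluation with LlamaIndex in RAGAs, I could not find any workable way to do so in version 0.0.22, even after debugging the RAGAs source code.

2.3 Complete Code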

import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"
dir_path = "YOUR_DIR_PATH"

from llama_index import VectorStoreIndex, SimpleDirectoryReader
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
    context_precision
)
from ragas.llama_index import evaluate

# Load the documents and build a simple query engine on top of a vector index.
documents = SimpleDirectoryReader(dir_path).load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# Evaluation questions.
eval_questions = [
    "Can you provide a concise description of the TinyLlama model?",
    "I would like to know the speed optimizations that TinyLlama has made.",
    "Why TinyLlama uses Grouped-query Attention?",
    "Is the TinyLlama model open source?",
    "Tell me about starcoderdata dataset",
]

# Manually annotated reference answers, in the same order as the questions.
eval_answers = [
    "TinyLlama is a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes.",
    "During training, our codebase has integrated FSDP to leverage multi-GPU and multi-node setups efficiently. Another critical improvement is the integration of Flash Attention, an optimized attention mechanism. We have replaced the fused SwiGLU module from the xFormers (Lefaudeux et al., 2022) repository with the original SwiGLU module, further enhancing the efficiency of our codebase. With these features, we can reduce the memory footprint, enabling the 1.1B model to fit within 40GB of GPU RAM.",
    "To reduce memory bandwidth overhead and speed up inference, we use grouped-query attention in our model. We have 32 heads for query attention and use 4 groups of key-value heads. With this technique, the model can share key and value representations across multiple heads without sacrificing much performance",
    "Yes, TinyLlama is open-source",
    "This dataset was collected to train StarCoder (Li et al., 2023), a powerful opensource large code language model. It comprises approximately 250 billion tokens across 86 programming languages. In addition to code, it also includes GitHub issues and text-code pairs that involve natural languages.",
]

# Wrap each reference answer in its own list.
eval_answers = [[a] for a in eval_answers]

# Metrics to compute.
metrics = [
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_precision,
    context_recall,
]

# Run the evaluation and write the results to a CSV file.
result = evaluate(query_engine, metrics, eval_questions, eval_answers)
result.to_pandas().to_csv('YOUR_CSV_PATH', sep=',')