In previous articles, we covered the basic RAG pipeline and various optimization techniques (query rewriting, semantic chunking strategies, re-ranking, and so on). But if your existing RAG system turns out to be ineffective, how do you actually evaluate it?
In this article, we introduce the RAG evaluation framework RAGAs [1] and use RAGAs + LlamaIndex to walk through a complete RAG evaluation.
1. RAG Evaluation Metrics
Simply put, a RAG process consists of three main parts: the input query, the retrieved context, and the response generated by the LLM. These three elements form the most important triplet in the RAG process and are interdependent.
The effectiveness of RAG can therefore be evaluated by measuring the relevance between the elements of this triplet, as shown in Figure 1.
The paper RAGAS: Automated Evaluation of Retrieval Augmented Generation [1] proposes three RAG evaluation metrics: 1) Faithfulness, 2) Answer Relevance, and 3) Context Relevance. None of these requires a human-annotated dataset or reference answers.
In addition, the RAGAs documentation [2] introduces two further metrics: Context Precision and Context Recall.
1.1 Faithfulness
Faithfulness measures whether the answer is grounded in the given context. This matters for avoiding hallucinations and for ensuring that the retrieved context can actually serve as the basis for the generated answer. A low score indicates that the LLM's response does not follow the retrieved knowledge, making a hallucinated answer more likely.
To evaluate faithfulness, an LLM is first used to extract a set of statements S(a(q)) with the following prompt:
Given a question and answer, create one or more statements from each sentence in the given answer.
question: [question]
answer: [answer]
After S(a(q)) has been generated, the LLM determines whether each statement si can be inferred from c(q). This verification step uses the following prompt:
Consider the given context and following statements, then determine whether they are supported by the information present in the context. Provide a brief explanation for each statement before arriving at the verdict (Yes/No). Provide a final verdict for each statement in order at the end in the given format. Do not deviate from the specified format.
statement: [statement 1]
...
statement: [statement n]
The final faithfulness score F is computed as F = |V| / |S|, where |V| is the number of statements the LLM judged to be supported by the context and |S| is the total number of statements.
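As a minimal sketch of this two-step procedure (the llm callable below is a hypothetical stand-in for whatever chat-completion function you use; it is not part of the RAGAs API, and the prompts are abbreviated versions of the ones above):
from typing import Callable, List

def faithfulness_score(question: str, answer: str, context: str,
                       llm: Callable[[str], str]) -> float:
    # Step 1: ask the LLM to break the answer into atomic statements (one per line).
    raw = llm(
        "Given a question and answer, create one or more statements from each "
        f"sentence in the given answer.\nquestion: {question}\nanswer: {answer}"
    )
    statements: List[str] = [s.strip() for s in raw.splitlines() if s.strip()]
    if not statements:
        return 0.0
    # Step 2: ask for a Yes/No verdict on each statement given the retrieved context.
    verdicts = [
        llm(
            "Consider the given context and the following statement, then determine "
            "whether it is supported by the information present in the context. "
            f"Answer Yes or No.\ncontext: {context}\nstatement: {s}"
        ).strip().lower().startswith("yes")
        for s in statements
    ]
    # F = |V| / |S|: the fraction of statements supported by the context.
    return sum(verdicts) / len(statements)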
1.2 Answer Relevance
Answer relevance measures how relevant the generated answer is to the query; a higher score indicates better relevance.
To estimate answer relevance, the LLM is prompted to generate n potential questions qi based on the given answer a(q), as follows:
Generate a question for the given answer.
answer: [answer]
A text embedding model is then used to obtain embeddings for all of these questions. For each qi, we compute its similarity sim(q, qi) to the original question q, for example as the cosine similarity between the embeddings, and from these similarities compute the answer relevance score AR for question q.
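Based on the definition in the RAGAS paper [1] (the original image with the formula is not reproduced here), AR is the average similarity over the n generated questions:

$$AR = \frac{1}{n}\sum_{i=1}^{n}\mathrm{sim}(q, q_i)$$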
1.3 Context Relevance
Context relevance measures retrieval quality: it evaluates the extent to which the retrieved context supports the query. A low score indicates that a large amount of irrelevant content was retrieved, which can hurt the final answer generated by the LLM.
To estimate context relevance, an LLM is used to extract a set of key sentences (Sext) from the context c(q), namely the sentences that are critical for answering the question. The prompt is as follows:
Please extract relevant sentences from the provided context that can potentially help answer the following question.
If no relevant sentences are found, or if you believe the question cannot be answered from the given context,
return the phrase "Insufficient Information".
While extracting candidate sentences you’re not allowed to make any changes to sentences from given context.
In RAGAs, relevance is then computed at the sentence level using the following formula.
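A reconstruction based on the paper's definition [1] (the original formula image is not reproduced here): the score is the proportion of extracted sentences among all sentences in the retrieved context:

$$CR = \frac{|S_{ext}|}{\text{total number of sentences in } c(q)}$$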
1.4 Context Recall
This metric measures how consistent the retrieved context is with the annotated (ground-truth) answer; higher values indicate better performance.
Note that this evaluation requires annotated ground-truth data.
The score is computed from the ground truth and the retrieved context as follows.
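A reconstruction of the formula, following the RAGAs documentation [2]:

$$\text{Context Recall} = \frac{|\text{GT sentences that can be attributed to the retrieved context}|}{|\text{sentences in the GT answer}|}$$

where GT denotes the annotated ground-truth answer.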
1.5 Context Precision
This metric is relatively involved. It measures whether all of the relevant chunks, i.e. those containing the ground-truth facts, are ranked at the top of the retrieved context. Higher scores indicate higher precision.
The metric is computed as follows.
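A reconstruction based on the RAGAs documentation [2] (the exact form in version 0.0.22 may differ slightly):

$$\text{Context Precision@K} = \frac{\sum_{k=1}^{K}\left(\text{Precision@}k \times v_k\right)}{\text{number of relevant items in the top } K},\qquad \text{Precision@}k = \frac{\text{true positives@}k}{k}$$

where v_k ∈ {0, 1} indicates whether the chunk at rank k is relevant to the ground truth.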
The advantage of context precision is that it is rank-aware. Its drawback is that if only a few relevant chunks are recalled but they all rank at the top, the score will still be high; it therefore needs to be considered together with the other metrics to judge overall effectiveness.
2. RAG Evaluation with RAGAs + LlamaIndex
The main workflow is shown in Figure 6.
2.1 Environment Setup
Install ragas with pip and check the current version:
(py) Florian:~ Florian$ pip list | grep ragas
ragas 0.0.22
Note that if you install the latest version (v0.1.0rc1) with pip install git+https://github.com/explodinggradients/ragas.git, that version does not support LlamaIndex.
Then import the relevant libraries and set up the environment and global variables:
import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"
dir_path = "YOUR_DIR_PATH"
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_relevancy,
context_recall,
context_precision
)
from ragas.llama_index import evaluate
The directory contains the PDF of the paper TinyLlama: An Open-Source Small Language Model [3]:
(py) Florian:~ Florian$ ls /Users/Florian/Downloads/pdf_test/
tinyllama.pdf
2.2 Building a Simple RAG Query Engine with LlamaIndex
documents = SimpleDirectoryReader(dir_path).load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
By default LlamaIndex uses OpenAI models; the LLM and embedding model can easily be configured through a ServiceContext.
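For example, a minimal sketch under the legacy llama_index API used in this article (the model name is only illustrative):
from llama_index import ServiceContext
from llama_index.llms import OpenAI

service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0),
    # the embed_model argument can likewise be swapped for a different embedding model
)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()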
Building the evaluation dataset
Since some of the metrics require a manually annotated dataset, here are a few example questions and their corresponding ground-truth answers:
eval_questions = [
"Can you provide a concise description of the TinyLlama model?",
"I would like to know the speed optimizations that TinyLlama has made.",
"Why TinyLlama uses Grouped-query Attention?",
"Is the TinyLlama model open source?",
"Tell me about starcoderdata dataset",
]
eval_answers = [
"TinyLlama is a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes.",
"During training, our codebase has integrated FSDP to leverage multi-GPU and multi-node setups efficiently. Another critical improvement is the integration of Flash Attention, an optimized attention mechanism. We have replaced the fused SwiGLU module from the xFormers (Lefaudeux et al., 2022) repository with the original SwiGLU module, further enhancing the efficiency of our codebase. With these features, we can reduce the memory footprint, enabling the 1.1B model to fit within 40GB of GPU RAM.",
"To reduce memory bandwidth overhead and speed up inference, we use grouped-query attention in our model. We have 32 heads for query attention and use 4 groups of key-value heads. With this technique, the model can share key and value representations across multiple heads without sacrificing much performance",
"Yes, TinyLlama is open-source",
"This dataset was collected to train StarCoder (Li et al., 2023), a powerful opensource large code language model. It comprises approximately 250 billion tokens across 86 programming languages. In addition to code, it also includes GitHub issues and text-code pairs that involve natural languages.",
]
eval_answers = [[a] for a in eval_answers]
Metric selection and RAGAs evaluation
metrics = [
faithfulness,
answer_relevancy,
context_relevancy,
context_precision,
context_recall,
]
result = evaluate(query_engine, metrics, eval_questions, eval_answers)
result.to_pandas().to_csv('YOUR_CSV_PATH', sep=',')
Note that RAGAs uses OpenAI models by default.
If you want RAGAs to use another LLM (such as Gemini) for evaluation with LlamaIndex, I could not find a workable way to do so in version 0.0.22, even after debugging the RAGAs source code.
2.3 Final Code
import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"
dir_path = "YOUR_DIR_PATH"
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_relevancy,
context_recall,
context_precision
)
from ragas.llama_index import evaluate
documents = SimpleDirectoryReader(dir_path).load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
eval_questions = [
"Can you provide a concise description of the TinyLlama model?",
"I would like to know the speed optimizations that TinyLlama has made.",
"Why TinyLlama uses Grouped-query Attention?",
"Is the TinyLlama model open source?",
"Tell me about starcoderdata dataset",
]
eval_answers = [
"TinyLlama is a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes.",
"During training, our codebase has integrated FSDP to leverage multi-GPU and multi-node setups efficiently. Another critical improvement is the integration of Flash Attention, an optimized attention mechanism. We have replaced the fused SwiGLU module from the xFormers (Lefaudeux et al., 2022) repository with the original SwiGLU module, further enhancing the efficiency of our codebase. With these features, we can reduce the memory footprint, enabling the 1.1B model to fit within 40GB of GPU RAM.",
"To reduce memory bandwidth overhead and speed up inference, we use grouped-query attention in our model. We have 32 heads for query attention and use 4 groups of key-value heads. With this technique, the model can share key and value representations across multiple heads without sacrificing much performance",
"Yes, TinyLlama is open-source",
"This dataset was collected to train StarCoder (Li et al., 2023), a powerful opensource large code language model. It comprises approximately 250 billion tokens across 86 programming languages. In addition to code, it also includes GitHub issues and text-code pairs that involve natural languages.",
]
eval_answers = [[a] for a in eval_answers]
metrics = [
faithfulness,
answer_relevancy,
context_relevancy,
context_precision,
context_recall,
]
result = evaluate(query_engine, metrics, eval_questions, eval_answers)
result.to_pandas().to_csv('YOUR_CSV_PATH', sep=',')
Note that when the program is run in a terminal, the pandas DataFrame may not be fully displayed. To inspect it, you can export it to a CSV file, as shown in Figure 7.
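Alternatively, you can widen pandas' display options so the DataFrame prints in full in the terminal (these are standard pandas settings, shown here as a small sketch):
import pandas as pd

pd.set_option("display.max_colwidth", None)  # do not truncate long text cells
pd.set_option("display.width", None)         # use the full terminal width
print(result.to_pandas())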
From Figure 7 it is clear that the fourth question, "Tell me about starcoderdata dataset", scores 0 on every metric, because the LLM was unable to provide an answer. The second and third questions have a context precision of 0, indicating that the relevant chunks in the retrieved context were not ranked at the top. The context recall of the second question is 0, meaning the retrieved context does not match the annotated answer.
Now look at questions 0 through 3. Their answer relevancy scores are high, indicating a strong correlation between the answers and the questions. The faithfulness scores are also not low, which suggests the answers were mainly derived or summarized from the context rather than hallucinated by the LLM.
We also found that, despite the low context relevancy scores, gpt-3.5-turbo-16k (the default model in RAGAs) was still able to infer answers from the retrieved context.
Based on these results, it is clear that this basic RAG system still has considerable room for improvement.
3. Conclusion
Overall, RAGAs offers a comprehensive set of metrics for evaluating RAG and is convenient to call.
After digging through its internal source code, it is also apparent that RAGAs is still at an early stage of development. We remain optimistic about its future updates and improvements.
References:
[1] https://arxiv.org/pdf/2309.15217.pdf
[2] https://docs.ragas.io/en/latest/concepts/metrics/index.html
[3] https://arxiv.org/pdf/2401.02385.pdf