LLM RAG in Practice (32) | Evaluating RAG with RAGAs and LlamaIndex






       The paper "RAGAS: Automated Evaluation of Retrieval Augmented Generation" [1] proposes three RAG evaluation metrics: 1) Faithfulness, 2) Answer Relevance, and 3) Context Relevance. None of these metrics requires a human-annotated dataset or reference answers.

       In addition, the RAGAs website [2] introduces two more metrics: Context Precision and Context Recall.

1.1 Faithfulness



Given a question and answer, create one or more statements from each sentence in the given answer.
question: [question]
answer: [answer]


Consider the given context and following statements, then determine whether they are supported by the information present in the context. Provide a brief explanation for each statement before arriving at the verdict (Yes/No). Provide a final verdict for each statement in order at the end in the given format. Do not deviate from the specified format.
statement: [statement 1]
...
statement: [statement n]
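To make the two prompts above concrete: in the RAGAS paper [1], the first prompt breaks the answer into statements, the second asks the LLM for a Yes/No verdict on each statement, and the faithfulness score is the fraction of statements supported by the context, F = |V| / |S|. The following is a minimal sketch of that final scoring step only. The helper names `parse_verdicts` and `faithfulness_score` are hypothetical, the `verdict: Yes/No` line format is an assumption rather than the format ragas actually parses, and the sample output is hand-written for illustration.

```python
import re


def parse_verdicts(llm_output: str) -> list[bool]:
    """Extract Yes/No verdicts from lines such as 'verdict: Yes' (assumed format)."""
    pattern = re.compile(r"verdict\s*:\s*(yes|no)\b", re.IGNORECASE)
    return [m.group(1).lower() == "yes" for m in pattern.finditer(llm_output)]


def faithfulness_score(verdicts: list[bool]) -> float:
    """F = |V| / |S|: supported statements divided by all statements."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Hand-written, illustrative LLM output (not a real model response).
sample_output = """
statement 1: TinyLlama has 1.1B parameters.
explanation: The context states this directly.
verdict: Yes
statement 2: TinyLlama was trained on 10 trillion tokens.
explanation: The context says around 1 trillion tokens.
verdict: No
"""

verdicts = parse_verdicts(sample_output)
score = faithfulness_score(verdicts)
print(verdicts, score)
```

One supported statement out of two gives a faithfulness of 0.5 for this toy input.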


1.2 Answer Relevance



Generate a question for the given answer.
answer: [answer]
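In the RAGAS paper [1], this prompt is used to generate several candidate questions from the answer; each is embedded, and answer relevance is the mean cosine similarity between the original question's embedding and the embeddings of the generated questions. The sketch below shows only that averaging step. The function names are hypothetical, and the two-dimensional vectors are toy stand-ins for the output of a real embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def answer_relevance(question_emb: list[float],
                     generated_embs: list[list[float]]) -> float:
    """Mean similarity between the original question and the generated questions."""
    if not generated_embs:
        return 0.0
    total = sum(cosine_similarity(question_emb, g) for g in generated_embs)
    return total / len(generated_embs)


# Toy embeddings: one generated question matches the original, one is orthogonal.
question_emb = [1.0, 0.0]
generated_embs = [[1.0, 0.0], [0.0, 1.0]]
score = answer_relevance(question_emb, generated_embs)
print(score)
```

An answer that is incomplete or padded with unrelated material tends to produce generated questions that drift away from the original, which lowers the average.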


1.3 Context Relevance



Please extract relevant sentences from the provided context that can potentially help answer the following question. If no relevant sentences are found, or if you believe the question cannot be answered from the given context, return the phrase "Insufficient Information". While extracting candidate sentences you’re not allowed to make any changes to sentences from given context.
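The RAGAS paper [1] turns the output of this prompt into a score by dividing the number of extracted sentences by the total number of sentences in the retrieved context. A minimal sketch of that ratio follows. The function names are hypothetical, and the regex-based sentence splitter is a naive assumption that would need replacing for real documents.

```python
import re

INSUFFICIENT = "Insufficient Information"


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def context_relevance(context: str, extracted: str) -> float:
    """Extracted sentences divided by all sentences in the context."""
    total = split_sentences(context)
    if not total or extracted.strip() == INSUFFICIENT:
        return 0.0
    # Count only sentences copied verbatim from the context, as the prompt requires.
    relevant = [s for s in split_sentences(extracted) if s in total]
    return len(relevant) / len(total)


# Illustrative context: two useful sentences and one irrelevant one.
context = (
    "TinyLlama is a 1.1B language model. "
    "It was pretrained on around 1 trillion tokens. "
    "The weather was nice that day."
)
extracted = (
    "TinyLlama is a 1.1B language model. "
    "It was pretrained on around 1 trillion tokens."
)
score = context_relevance(context, extracted)
print(score)
```

The metric penalizes retrieved context that carries redundant sentences: two relevant sentences out of three give a score of about 0.67 here.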


1.4 Context Recall




1.5 Context Precision






2.1 Environment Setup


(py) Florian:~ Florian$ pip list | grep ragas
ragas                        0.0.22
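The code in this article was run against ragas 0.0.22, and its import paths may not exist in other releases. As a hedged convenience, the sketch below checks the installed version before running anything else. `installed_version` and `matches_pinned` are hypothetical helper names; `importlib.metadata` is part of the Python standard library.

```python
from importlib import metadata

PINNED_RAGAS = "0.0.22"  # version shown by `pip list` in this article


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pinned(installed, pinned: str = PINNED_RAGAS) -> bool:
    """True only when the installed version equals the version used here."""
    return installed == pinned


ragas_version = installed_version("ragas")
if not matches_pinned(ragas_version):
    print(f"ragas {ragas_version!r} differs from {PINNED_RAGAS}; "
          "the imports in this article may need adjusting")
```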



import os

# ragas and LlamaIndex call the OpenAI API by default
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"
# Directory that holds the documents to index
dir_path = "YOUR_DIR_PATH"

from llama_index import VectorStoreIndex, SimpleDirectoryReader
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
    context_precision,
)
from ragas.llama_index import evaluate

        The directory contains the PDF of the paper "TinyLlama: An Open-Source Small Language Model" [3].

(py) Florian:~ Florian$ ls /Users/Florian/Downloads/pdf_test/
tinyllama.pdf

2.2 Building a Simple RAG Query Engine with LlamaIndex

# Load every file in dir_path, build a vector index, and expose it as a query engine
documents = SimpleDirectoryReader(dir_path).load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()




# Questions to ask the query engine
eval_questions = [
    "Can you provide a concise description of the TinyLlama model?",
    "I would like to know the speed optimizations that TinyLlama has made.",
    "Why TinyLlama uses Grouped-query Attention?",
    "Is the TinyLlama model open source?",
    "Tell me about starcoderdata dataset",
]

# Reference (ground-truth) answers, one per question, in the same order
eval_answers = [
    "TinyLlama is a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes.",
    "During training, our codebase has integrated FSDP to leverage multi-GPU and multi-node setups efficiently. Another critical improvement is the integration of Flash Attention, an optimized attention mechanism. We have replaced the fused SwiGLU module from the xFormers (Lefaudeux et al., 2022) repository with the original SwiGLU module, further enhancing the efficiency of our codebase. With these features, we can reduce the memory footprint, enabling the 1.1B model to fit within 40GB of GPU RAM.",
    "To reduce memory bandwidth overhead and speed up inference, we use grouped-query attention in our model. We have 32 heads for query attention and use 4 groups of key-value heads. With this technique, the model can share key and value representations across multiple heads without sacrificing much performance",
    "Yes, TinyLlama is open-source",
    "This dataset was collected to train StarCoder (Li et al., 2023), a powerful opensource large code language model. It comprises approximately 250 billion tokens across 86 programming languages. In addition to code, it also includes GitHub issues and text-code pairs that involve natural languages.",
]

# Wrap each answer in a list: evaluate() takes a list of ground truths per question
eval_answers = [[a] for a in eval_answers]


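# Metrics to compute for every question
metrics = [
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_precision,
    context_recall,
]

# Run the query engine on each question, score the results, and save them as CSV
result = evaluate(query_engine, metrics, eval_questions, eval_answers)
result.to_pandas().to_csv('YOUR_CSV_PATH', sep=',')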
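Once the CSV has been written, a short script can flag the questions that scored 0 on any metric. The sketch below uses only the standard library. It assumes the CSV has a `question` column and one column per metric named after the metric objects above, which should be checked against the actual file. The sample rows are placeholders, not real evaluation results.

```python
import csv
import io

# Assumed column names; verify them against the header of the generated CSV.
METRIC_COLUMNS = [
    "faithfulness",
    "answer_relevancy",
    "context_relevancy",
    "context_precision",
    "context_recall",
]


def find_zero_scores(csv_file) -> dict:
    """Map each question to the metrics on which it scored exactly 0."""
    flagged = {}
    for row in csv.DictReader(csv_file):
        zeros = [
            m for m in METRIC_COLUMNS
            if row.get(m) not in (None, "") and float(row[m]) == 0.0
        ]
        if zeros:
            flagged[row["question"]] = zeros
    return flagged


# Placeholder data standing in for the real CSV file.
sample_csv = io.StringIO(
    "question,faithfulness,answer_relevancy,context_relevancy,"
    "context_precision,context_recall\n"
    "placeholder question A,1.0,0.9,0.2,1.0,1.0\n"
    "placeholder question B,1.0,0.8,0.1,0.0,0.0\n"
)
flagged = find_zero_scores(sample_csv)
print(flagged)
```

For the real file, open it with `open('YOUR_CSV_PATH', newline='')` and pass the file object to `find_zero_scores`.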



2.3 Final Code

import os

os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"
dir_path = "YOUR_DIR_PATH"

from llama_index import VectorStoreIndex, SimpleDirectoryReader
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_recall,
    context_precision,
)
from ragas.llama_index import evaluate

# Build the RAG query engine over the documents in dir_path
documents = SimpleDirectoryReader(dir_path).load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# Questions to ask the query engine
eval_questions = [
    "Can you provide a concise description of the TinyLlama model?",
    "I would like to know the speed optimizations that TinyLlama has made.",
    "Why TinyLlama uses Grouped-query Attention?",
    "Is the TinyLlama model open source?",
    "Tell me about starcoderdata dataset",
]

# Reference (ground-truth) answers, one per question, in the same order
eval_answers = [
    "TinyLlama is a compact 1.1B language model pretrained on around 1 trillion tokens for approximately 3 epochs. Building on the architecture and tokenizer of Llama 2, TinyLlama leverages various advances contributed by the open-source community (e.g., FlashAttention), achieving better computational efficiency. Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes.",
    "During training, our codebase has integrated FSDP to leverage multi-GPU and multi-node setups efficiently. Another critical improvement is the integration of Flash Attention, an optimized attention mechanism. We have replaced the fused SwiGLU module from the xFormers (Lefaudeux et al., 2022) repository with the original SwiGLU module, further enhancing the efficiency of our codebase. With these features, we can reduce the memory footprint, enabling the 1.1B model to fit within 40GB of GPU RAM.",
    "To reduce memory bandwidth overhead and speed up inference, we use grouped-query attention in our model. We have 32 heads for query attention and use 4 groups of key-value heads. With this technique, the model can share key and value representations across multiple heads without sacrificing much performance",
    "Yes, TinyLlama is open-source",
    "This dataset was collected to train StarCoder (Li et al., 2023), a powerful opensource large code language model. It comprises approximately 250 billion tokens across 86 programming languages. In addition to code, it also includes GitHub issues and text-code pairs that involve natural languages.",
]

# Wrap each answer in a list: evaluate() takes a list of ground truths per question
eval_answers = [[a] for a in eval_answers]

metrics = [
    faithfulness,
    answer_relevancy,
    context_relevancy,
    context_precision,
    context_recall,
]

# Run the evaluation and save the per-question scores as CSV
result = evaluate(query_engine, metrics, eval_questions, eval_answers)
result.to_pandas().to_csv('YOUR_CSV_PATH', sep=',')


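       As Figure 7 clearly shows, the question "Tell me about starcoderdata dataset" scores 0 on every metric because the LLM was unable to provide an answer. The context precision of the second and third questions is 0, which indicates that the relevant contexts were not ranked at the top of the retrieved contexts. The context recall of the second question is 0, which indicates that the retrieved contexts do not match the annotated answer.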
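To illustrate why ranking matters for these two scores, the sketch below follows the definitions given in the RAGAs documentation [2] as best understood here: context precision@K averages precision@k over the ranks k at which a relevant chunk appears, and context recall is the fraction of ground-truth sentences that can be attributed to the retrieved context. The function names are hypothetical, and the relevance and attribution flags are assumed to have been produced already (in ragas, by an LLM judge); ragas 0.0.22 may compute the scores differently in detail.

```python
def context_precision_at_k(relevance: list[bool]) -> float:
    """relevance[i] is True if the chunk retrieved at rank i + 1 is relevant."""
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank  # precision@rank, counted only at relevant ranks
    return weighted / hits if hits else 0.0


def context_recall(attributed: list[bool]) -> float:
    """attributed[i] is True if ground-truth sentence i is supported by the context."""
    return sum(attributed) / len(attributed) if attributed else 0.0


# Same two chunks, different order: the score drops when the relevant one is ranked last.
top_ranked = context_precision_at_k([True, False])
bottom_ranked = context_precision_at_k([False, True])
none_relevant = context_precision_at_k([False, False])
recall = context_recall([True, False, False, True])
print(top_ranked, bottom_ranked, none_relevant, recall)
```

Under this formula the score falls as relevant chunks move down the ranking and reaches 0 only when no retrieved chunk is judged relevant.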








[1] https://arxiv.org/pdf/2309.15217.pdf

[2] https://docs.ragas.io/en/latest/concepts/metrics/index.html

[3] https://arxiv.org/pdf/2401.02385.pdf





