Retrieval Augmented Generation (RAG) using LlamaIndex
4 April 2024, by Clint Greene.
Prerequisites
To run this blog, you will need:
- Linux: see the supported Linux distributions
- ROCm: see the installation instructions
- An AMD GPU: see the list of compatible GPUs
Introduction
Large language models (LLMs) such as ChatGPT are powerful tools capable of performing many complex writing tasks. However, they have limitations, chiefly:

- Lack of access to up-to-date information: an LLM's training data is static, so it cannot access the latest news or information.
- Limited suitability for domain-specific tasks: LLMs are not trained on domain-specific data, so they may give irrelevant or inaccurate answers in specialized scenarios.
To address these limitations, two main approaches can introduce current and domain-specific data:

- Fine-tuning: feed the LLM up-to-date, domain-specific prompt-completion text pairs. However, this approach is expensive, especially when the fine-tuning data changes often and requires frequent updates.
- Context prompting: insert the latest data into the prompt as additional context the LLM can use to generate its answer. This approach also has limits, because not all current, domain-specific documents fit into the prompt's context window.
To overcome these obstacles, you can use Retrieval Augmented Generation (RAG). RAG is a method that improves the accuracy and reliability of an LLM by supplying it with up-to-date, relevant information. It works by automatically splitting external documents into chunks of a specified size, retrieving the chunks most relevant to a query, and augmenting the input prompt with those chunks as context for answering the user's query. This approach enables domain-specific applications without fine-tuning or manually inserting information into context prompts.
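Conceptually, the whole RAG loop fits in a few lines of code. Below is a minimal, framework-free sketch of the idea; the `llm` and `embed` arguments are hypothetical stand-ins for a real language model and embedding model, and production systems use a vector database instead of this brute-force scan:

```python
def cosine(a, b):
    # Cosine similarity between two vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm

def rag_answer(llm, embed, documents, query, chunk_size=256, top_k=8):
    # 1. Split external documents into fixed-size chunks
    #    (character-based here for simplicity; real splitters use tokens)
    chunks = [doc[i:i + chunk_size]
              for doc in documents
              for i in range(0, len(doc), chunk_size)]
    # 2. Retrieve the chunks most relevant to the query
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), query_vec), reverse=True)
    context = "\n".join(ranked[:top_k])
    # 3. Augment the input prompt with the retrieved chunks as context
    prompt = f"Answer based on the context below.\n\nContext: {context}\n\nQuestion: {query}\n\nAnswer: "
    return llm.complete(prompt)
```

LlamaIndex packages each of these steps, as the rest of this post shows.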
A popular framework in the AI community for RAG is LlamaIndex. It is a framework for building LLM applications that focuses on ingesting, structuring, and accessing private or domain-specific data. Its tools help integrate custom, out-of-distribution data into LLMs.
Getting started
To get started, first install `transformers`, `accelerate`, and `llama-index` for RAG:
```python
!pip install llama-index llama-index-llms-huggingface llama-index-embeddings-huggingface llama-index-readers-web transformers accelerate -q
```
Then, import the LlamaIndex libraries:
```python
from llama_index.core import ServiceContext
from llama_index.core import VectorStoreIndex
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core.prompts.base import PromptTemplate
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.readers.web import BeautifulSoupWebReader
```
Many open-source LLMs require a preamble before every prompt, or a specific structure for the prompt, which you can encode before generation using `system_prompt` or `messages_to_prompt`. In addition, queries may need an extra wrapper, which you can specify with `query_wrapper_prompt`. You can usually find this information on the Hugging Face model card for the model you are using. In this case, you will use `zephyr-7b-alpha` for RAG, so you can get the expected prompt format from its model card.
```python
def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == 'system':
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == 'user':
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == 'assistant':
            prompt += f"<|assistant|>\n{message.content}</s>\n"

    # Ensure the prompt starts with a system block; insert a blank one if needed
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt

    # Append the final assistant marker so the model knows to respond
    prompt = prompt + "<|assistant|>\n"

    return prompt
```
LlamaIndex supports using Hugging Face LLMs directly by passing the model name to the `HuggingFaceLLM` class. You can specify model parameters, such as the device to use and the degree of quantization, through `model_kwargs`. Parameters that control the LLM's generation strategy, such as `top_k`, `top_p`, and `temperature`, go in `generate_kwargs`. Parameters that control output length, such as `max_new_tokens`, can be specified directly on the class. For more details about these parameters and how they affect generation, see Hugging Face's text generation documentation.
```python
llm = HuggingFaceLLM(
    model_name="HuggingFaceH4/zephyr-7b-alpha",
    tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
    query_wrapper_prompt=PromptTemplate("<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"),
    context_window=3900,
    max_new_tokens=256,
    model_kwargs={"use_safetensors": False},
    # tokenizer_kwargs={},
    generate_kwargs={"do_sample": True, "temperature": 0.7, "top_k": 50, "top_p": 0.95},
    messages_to_prompt=messages_to_prompt,
    device_map="cuda",
)
```
Naive prompting
To demonstrate the shortcomings mentioned earlier, prompt your LLM and ask how Paul Graham recommends working hard.
```python
question = "How does Paul Graham recommend to work hard? Can you list it as steps"
response = llm.complete(question)
print(response)
```
```
Paul Graham's advice on working hard, as outlined in his essay "How to Be a Maker," can be summarized into several steps:
1. Work on something you care about. This will give you the motivation and energy to work harder and longer than you would if you were working on something that didn't matter to you.
2. Set specific goals. Instead of just working on your project, set specific milestones and deadlines for yourself. This will give you a clear sense of direction and help you to focus your efforts.
3. Eliminate distractions. Turn off your phone, close your email, and find a quiet place to work. Eliminating distractions will help you to stay focused and make progress.
4. Work in short, intense bursts. Rather than working for long periods of time, break your work into short, intense bursts. This will help you to maintain your focus and avoid burnout.
5. Take breaks. Taking breaks is important for maintaining your focus and avoiding burnout. Use your breaks to clear your mind, recharge your batteries, and come back to your work with fresh energy.
6. Work on your weaknesses.
```
At first glance, the generated response looks accurate and reasonable. The LLM knows we are talking about Paul Graham and working hard, and the recommended steps seem sensible. However, these are not Paul Graham's recommendations on how to work hard. When an LLM hits a gap in its knowledge, it can "hallucinate", making false but plausible-sounding statements.
Prompt engineering
A simple way to overcome factual hallucinations is to modify the prompt to include external context. Let's copy the text from Paul Graham's essay "How to Work Hard", using the BeautifulSoup web reader to fetch it automatically:
```python
url = "https://paulgraham.com/hwh.html"
documents = BeautifulSoupWebReader().load_data([url])
```
Now, modify the original question to include this updated information when you ask it:
```python
context = documents[0].text

prompt = f"""Answer the question based on the context below. If the
question cannot be answered using the information provided answer
with "I don't know".

Context: {context}

Question: {question}

Answer: """
```
Now feed this prompt to your LLM and note the response:
```python
response = llm.complete(prompt)
print(response)
```
```
1. Learn the shape of real work: Understand the difference between fake work and real work, and be able to distinguish between them.
2. Find the limit of working hard: Learn how many hours a day to spend on work, and avoid pushing yourself to work too much.
3. Work toward the center: Aim for the most ambitious problems, even if they are harder.
4. Figure out what to work on: Determine which type of work you are suited for, based on your interests and natural abilities.
5. Continuously assess and adjust: Regularly evaluate both how hard you're working and how well you're doing, and be willing to switch fields if necessary.
6. Be honest with yourself: Consistently be clear-sighted and honest in your evaluations of your abilities, progress, and interests.
7. Accept failure: Be open to the possibility of failure and learn from it.
8. Stay interested: Find work that you find interesting and enjoyable, rather than simply for financial gain or external validation.
9. Balance work and rest: Give yourself time to get going, but also recognize when it's time to take a
```
By prompting the LLM with the essay as context, you constrain it to generate from the information in the prompt, which produces an accurate response. Now, try generating a response with the RAG approach and compare it with the context-prompting method.
Building a Retrieval Augmented Generation (RAG) application
To build a RAG application, first call `ServiceContext`. This establishes which language and embedding models to use, and sets the key parameters for parsing documents (such as `chunk_size` and `chunk_overlap`).
```python
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model="local:BAAI/bge-base-en-v1.5",
    chunk_size=256,
    chunk_overlap=32,
)
```
When performing RAG, documents are split into smaller chunks. The `chunk_size` parameter specifies the length of each chunk in tokens, and `chunk_overlap` specifies how many tokens each chunk shares with its neighbors.
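If you want to see what this chunking step produces, you can run a node parser directly. The following is a minimal sketch, assuming the `SentenceSplitter` class from `llama_index.core.node_parser`, configured with the same values passed to `ServiceContext` above:

```python
from llama_index.core.node_parser import SentenceSplitter

# Split the essay into 256-token chunks with a 32-token overlap,
# mirroring the chunk_size/chunk_overlap used by the service context
splitter = SentenceSplitter(chunk_size=256, chunk_overlap=32)
nodes = splitter.get_nodes_from_documents(documents)

print(len(nodes))           # number of chunks produced
print(nodes[0].text[:200])  # start of the first chunk
```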
Set the `llm` parameter with the `llm` variable you used in the earlier experiments. For the embedding model, use `bge-base` (which has been shown to perform well on retrieval tasks) to embed the document chunks.
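To get a feel for what the embedding model does to a piece of text, you can call it directly. A minimal sketch, using the `HuggingFaceEmbedding` class imported earlier, which should correspond to the `local:BAAI/bge-base-en-v1.5` setting above:

```python
# Embed a sample string with the same model the index will use
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
embedding = embed_model.get_text_embedding("How to work hard")

print(len(embedding))  # dimensionality of the vector (768 for bge-base)
print(embedding[:5])   # first few components
```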
Next, build a vector index with `VectorStoreIndex`, which passes your documents to the embedding model for chunking and embedding. Then call `as_query_engine` to prepare the index for querying, specifying `similarity_top_k` so that the eight document chunks most similar to the input query are returned.
```python
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine(similarity_top_k=8)
```
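Before generating an answer, you can inspect which chunks retrieval would hand to the LLM. A sketch using the index's `as_retriever` method; the `.score` and `.text` attributes on the returned results are assumptions based on LlamaIndex's `NodeWithScore` objects:

```python
# Fetch the top-scoring chunks for the question without running the LLM
retriever = index.as_retriever(similarity_top_k=8)
retrieved_nodes = retriever.retrieve(question)

for node in retrieved_nodes:
    # Each result carries a similarity score and the chunk text
    print(node.score, node.text[:80])
```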
Your RAG application is now ready to be queried. Query it with the original question:
```python
response = query_engine.query(question)
print(response)
```
```
Paul Graham recommends the following steps to work hard:
1. Constantly aim toward the center of the problem you're working on, which is usually the most ambitious and difficult part.
2. Measure both how hard you're working and how well you're doing. Don't solely rely on pushing yourself to work, as there may be times where you need to focus on easier, peripheral work.
3. When working on a specific problem, aim as close to the center as you can without stalling.
4. When working on a larger scale, make big, lifetime-scale adjustments about which type of work to do.
5. Give yourself time to get going on a new problem, but don't give up too soon if results aren't immediate.
6. Learn to distinguish between good and bad results, and adjust accordingly.
7. Find an easy way to do something hard.
8. Be consistently honest and clear-sighted, and your network will automatically assume an optimal shape.
9. Determination, interest, and natural ability are the three ingredients in great work.
10. Go on vacation occasionally, but learn something new while there.
```
This response is quite similar to the context-prompted example. That is not surprising, because it uses the same contextual information to generate the response. Prompt engineering requires you to specify the context manually, whereas RAG can be thought of as advanced, automated prompt engineering that uses a document database to retrieve the optimal context to guide generation.