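RAG: Basic Workflow and Optimization Techniques with LangChain

LLMs have two main problems: hallucination and a lack of domain knowledge. The domain-knowledge gap exists because retraining an LLM is slow, so its knowledge lags behind, and because the training data rarely covers a specific domain in much detail.

RAG mainly addresses the lack of domain knowledge. The underlying idea is to treat the LLM as a reasoning engine rather than an information store: an external vector database supplies the most relevant knowledge, and the LLM reasons over that knowledge to produce a useful answer.

This article covers the mainstream optimization techniques at each stage of the RAG pipeline, for example:

How should document indexes be set up to improve retrieval efficiency?
How should the user's question be parsed to improve answer quality?
How can user intent be recognized?
How can the quality of the recalled context be improved?
How can token consumption be reduced?

The key techniques come with LangChain-based sample code and traces of the model runs.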

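RAG basic workflow

Input: the user asks a question.
Retrieval: run a similarity search against the vector database with the user's question to find the content most relevant to answering it.
Augmentation: build a prompt from the retrieved results. It usually includes an instruction such as "answer using only the sources below", which restricts the LLM to the provided sources, reduces hallucination, and keeps the answer focused.
Generation: pass the augmented prompt to the LLM and return its output to the user.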
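The retrieve / augment steps above can be sketched in a few lines of plain Python. This is a toy illustration only: the bag-of-words `embed` function stands in for a real embedding model, the in-memory list stands in for a vector database, and all function names are hypothetical rather than LangChain APIs.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Retrieval: return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Augmentation: restrict the LLM to the retrieved sources."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer the question using only the sources below.\n"
        f"Sources:\n{joined}\n\nQuestion: {question}\n"
    )


chunks = [
    "RAG retrieves relevant chunks from a vector database",
    "Bananas are rich in potassium",
    "The augmented prompt is passed to the LLM for generation",
]
question = "What does RAG retrieve from the vector database"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)  # this string would be sent to the LLM
```

The generation step is omitted on purpose: it is a single call that sends `prompt` to whichever LLM is in use.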

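Key steps

Parser

This stage processes the source knowledge base and covers the various loaders.

TextLoader
PDFLoader
DirectoryLoader: loads files of multiple formats from a folder
Web Loader
- Github loader
- WebLoader: CheerioWebBaseLoader (uses Cheerio to extract and process HTML content, similar to BeautifulSoup in Python. Both only handle static HTML and cannot execute the JavaScript in a page, which is sufficient for most scenarios)
- Search API: in langchain.js the common choices are SearchApiLoader and SerpAPILoader. Both provide access to web search, and their free plans include 100 searches per month. Besides Google, they also support other common search engines such as Baidu and Bing.
  - SerpAPILoader: returns not only the Google search results but also crawls a summary and details for each result into pageContent. Combined with the corresponding LangChain integration, it gives out-of-the-box access to Google search and content crawling, which in effect gives a chatbot access to the internet.

There are other parsers as well, such as a PPT parser: github.com/infiniflow/…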

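Document object

The Document object can be understood as LangChain's unified abstraction over all types of data:

ts
interface Document {
  // pageContent: the text content of the document
  pageContent: string;
  // metadata: metadata for the text, such as the source document's title or page number; a Retriever can later filter on it
  metadata: Record<string, any>;
}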

Indexing (embedding)

During embedding, the model only looks at the document's pageContent.

ts
Document {  
    pageContent: "鲁镇的酒店的格局，是和别处不同的：都是当街一个曲尺形的大柜台，柜里面预备着热水，可以随时温酒。做工的人，傍午傍晚散了工，每每花四文铜钱，买一碗酒，——这是二十多年前的事，现在每碗要涨到十文，——靠柜外",  
    metadata: { source: "data/kong.txt", loc: { lines: { from: 1, to: 1 } } }
}

The resulting embedding looks like this:

text
[     0.017519549,    0.000543212,   0.015167197,  -0.021431018, -0.0067185625,     -0.01009323,   -0.022402046,  -0.005822754,  -0.007446834,   -0.03019763,     -0.00932051,     0.02169087, -0.0130063165,  0.0033592812,  -0.013293522,     0.018422196, ...]

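RecursiveCharacterTextSplitter:

Two parameters have the greatest effect on split quality:

chunkSize defines the size of each chunk, which determines how much context the LLM gets from a single chunk. Choose it according to the type of content in the data source. If it is too large, one chunk may mix several pieces of information and distract the LLM; and since the chunks are fed to the LLM as conversation context, larger chunks mean more tokens and higher cost. If it is too small, a chunk may not hold a complete piece of information, which hurts output quality.
chunkOverlap defines how much adjacent chunks overlap. Natural language is continuous, so some overlap keeps text from being cut at awkward places and preserves some surrounding context. A larger chunkOverlap makes awkward cuts less likely but may cause the same information to be extracted repeatedly; a smaller chunkOverlap reduces repetition but makes awkward cuts more likely.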
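To make the two parameters concrete, here is a minimal fixed-size character splitter. It is a simplified sketch and not the RecursiveCharacterTextSplitter algorithm (which also tries a list of separators); the function name `split_text` is hypothetical.

```python
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    where consecutive chunks share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "abcdefghij"
with_overlap = split_text(text, chunk_size=4, chunk_overlap=2)
without_overlap = split_text(text, chunk_size=4, chunk_overlap=0)
```

With an overlap of 2, every boundary is covered by two chunks, at the cost of four chunks instead of three for the same text.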

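Chunk optimization

Because an LLM's input length is usually limited, the data generally has to be split into smaller pieces, so chunk size is a key consideration. It depends on the embedding model used and on how many tokens the downstream model can take; 512 is a common choice, and there is research on how to choose the chunk size. In LlamaIndex, the NodeParser class supports this well and offers advanced options such as custom text splitters, metadata, and node/chunk relationships. Besides splitting by size, you can also split by section, by semantics (letting an LLM decide where to split), or by delimiters. For chunkers used in production, see the ones provided by LangChain.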

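Data cleaning

The data we index determines the quality of the RAG answers, so preprocessing the data before indexing is critical for data quality. Some tips for data cleaning:

Remove special characters, odd encodings, and unnecessary HTML tags to reduce text noise (for example with regex);
Find outlier documents unrelated to the main topics and remove them (for example with topic extraction, dimensionality reduction, and data visualization);
Remove redundant documents using a similarity measure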
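The first and third tips can be sketched with the standard library alone. This is a rough illustration under simple assumptions: word-level Jaccard similarity is used as the similarity measure, the 0.8 threshold is arbitrary, and the helper names are hypothetical.

```python
import html
import re


def clean_text(raw: str) -> str:
    """Strip HTML tags, decode entities, drop control characters, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)                        # remove HTML tags
    text = html.unescape(text)                                 # decode entities such as &amp;
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)   # remove control characters
    return re.sub(r"\s+", " ", text).strip()                   # collapse whitespace


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def drop_near_duplicates(docs: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a document only if it is not too similar to one already kept."""
    kept: list[str] = []
    for doc in docs:
        if all(jaccard(doc, other) < threshold for other in kept):
            kept.append(doc)
    return kept


raw_docs = [
    "<p>RAG combines retrieval &amp; generation.</p>",
    "<div>RAG  combines retrieval &amp; generation.</div>",
    "<p>Chunk size affects retrieval quality.</p>",
]
cleaned = [clean_text(d) for d in raw_docs]
unique_docs = drop_near_duplicates(cleaned)
```

For a large corpus the pairwise comparison becomes expensive; embedding-based similarity or hashing approaches would be the usual replacement.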

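Multi-representation Indexing

Multi-representation indexing: generate a summary of each source document, embed the summary, and store it in a summary database. At query time, first find the most relevant summary in the summary database, then trace back to the original document.

It is most useful in long-context settings.

py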
from langchain_community.document_loaders import WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
docs = loader.load()

loader = WebBaseLoader("https://lilianweng.github.io/posts/2024-02-05-human-data-quality/")
docs.extend(loader.load())

import uuid
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

chain = (
    {"doc": lambda x: x.page_content}
    | ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
    | ChatOpenAI(model="gpt-3.5-turbo",max_retries=0)
    | StrOutputParser()
)

summaries = chain.batch(docs, {"max_concurrency": 5})

from langchain.storage import InMemoryByteStore
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.retrievers.multi_vector import MultiVectorRetriever

# The vectorstore to use to index the child chunks
vectorstore = Chroma(collection_name="summaries",
                     embedding_function=OpenAIEmbeddings())

# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id"

# The retriever
retriever = MultiVectorRetriever(
    vectorstore=vectorstore,
    byte_store=store,
    id_key=id_key,
)
doc_ids = [str(uuid.uuid4()) for _ in docs]

# Docs linked to summaries
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]

# Add
retriever.vectorstore.add_documents(summary_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

query = "Memory in agents"
sub_docs = vectorstore.similarity_search(query,k=1)
# Print sub_docs[0] to see the matching summary

retrieved_docs = retriever.get_relevant_documents(query,n_results=1)
# Print retrieved_docs[0].page_content[0:500] to see the start of the original document

Specialized embeddings

Commonly used for multimodal data such as images, where specialized vectors are used for indexing.

ColBERT

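How ColBERT helps developers overcome the limitations of RAG

Unlike traditional single-vector DPR, which converts a passage into one embedding vector, ColBERT produces a context-aware vector for every token in the passage. It likewise produces a vector for every token in the query.

The main idea is to match the query and the document at token level: for each query token, the MaxSim operator takes the maximum similarity over all document tokens, and these maxima are summed into the document's final score.

RAGatouille makes ColBERT easy to use (github.com/bclavie/RAGatouille).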
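The MaxSim scoring described above can be written out directly. The sketch below uses hand-made 2-dimensional token vectors and a plain dot product as the similarity; real ColBERT vectors come from a BERT encoder and are normalized, so treat the numbers as illustrative only.

```python
def dot(u: list[float], v: list[float]) -> float:
    """Dot product, used here as the token-level similarity."""
    return sum(x * y for x, y in zip(u, v))


def maxsim_score(query_vecs: list[list[float]], doc_vecs: list[list[float]]) -> float:
    """Late-interaction score: for each query token take the best-matching
    document token, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Two query tokens and two candidate documents (toy vectors)
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # covers both query tokens
doc_b = [[1.0, 0.0], [0.7, 0.0]]              # covers only the first query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Here `doc_a` scores higher because each query token finds a close match in it, while the second query token finds nothing similar in `doc_b`.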

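Hierarchical Indexing

Indexing with a hierarchy. For example, first look up the relevant relations in a relational database, then use those relations to search the base database. Multi-representation indexing is also a form of hierarchical indexing.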

RAPTOR

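Start with a set of documents as the leaves, cluster them, and summarize each cluster. Each cluster groups similar documents, so its summary consolidates information across the documents in that cluster. Repeat this recursively until a limit is reached or only one cluster remains.

Main steps: cluster (on embeddings), summarize, repeat recursively.

High-level questions retrieve high-level summaries; low-level questions retrieve low-level summaries.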
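The cluster / summarize / recurse loop can be sketched as follows. The clustering and summarization functions are deliberately trivial stand-ins (pairing neighbours and joining strings); in a real setup they would be embedding-based clustering and an LLM summarization call. All names here are hypothetical.

```python
from typing import Callable


def build_tree(
    leaves: list[str],
    cluster: Callable[[list[str]], list[list[str]]],
    summarize: Callable[[list[str]], str],
    max_levels: int = 5,
) -> list[list[str]]:
    """Return every level of the tree: level 0 holds the leaf documents,
    each later level holds the summaries of the clusters of the level below."""
    levels = [list(leaves)]
    # Stop when a single node remains or the level limit is reached
    while len(levels[-1]) > 1 and len(levels) <= max_levels:
        groups = cluster(levels[-1])
        levels.append([summarize(group) for group in groups])
    return levels


def pair_cluster(texts: list[str]) -> list[list[str]]:
    """Stand-in for embedding-based clustering: group neighbours in pairs."""
    return [texts[i:i + 2] for i in range(0, len(texts), 2)]


def join_summary(group: list[str]) -> str:
    """Stand-in for an LLM summary: concatenate the members."""
    return " + ".join(group)


levels = build_tree(["a", "b", "c", "d", "e"], pair_cluster, join_summary)
```

All levels would then be embedded and indexed together, so that a query can match either a leaf or a summary at any level.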

RAPTOR is thus a way to build a hierarchical index over documents and their summaries.

Sample code: github.com/langchain-a…

Deep dive: www.youtube.com/watch?v=jbG…

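Query translation

User queries can be ambiguous. One way to handle this is to abstract the question to a higher level; another is to make it less abstract.

Multi Query

Rewrite the question from several perspectives so that it becomes less abstract and retrieval depends less on a single phrasing.

py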
template = """You are an AI language model assistant. Your task is to generate five 
different versions of the given user question to retrieve relevant documents from a vector 
database. By generating multiple perspectives on the user question, your goal is to help
the user overcome some of the limitations of the distance-based similarity search. 
Provide these alternative questions separated by newlines. Original question: {question}"""

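RAG Fusion

RAG-Fusion improves the quality and depth of the generated text by producing several search queries, retrieving for each, and fusing the ranked results.

py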
 # RAG - Fusion : Related
template = """You are a helpful assistant that generates multiple search queries based on a single input query. \n
Generate multiple search queries related to: {question} \n
Output (4 queries):"""

Example (trace: smith.langchain.com/public/0712…)

generate_queries produces multiple questions from the user's question; reciprocal_rank_fusion is the extra step that fuses the results retrieved for each of them.

py
from langchain.load import dumps, loads

def reciprocal_rank_fusion(results: list[list], k=60):
    """
    Take multiple ranked lists of documents, fuse them with reciprocal rank fusion (RRF),
    and return the documents sorted by fused score.
    """
    # Fused score for each unique document
    fused_scores = {}

    # Iterate over each ranked list
    for docs in results:
        # Iterate over the documents with their rank (0-based position in the list)
        for rank, doc in enumerate(docs):
            # Serialize the document to a string so it can be used as a dict key
            # (assumes the document is JSON-serializable)
            doc_str = dumps(doc)
            # First time this document is seen: start its score at 0
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
            # RRF update: add 1 / (rank + k)
            fused_scores[doc_str] += 1 / (rank + k)

    # Sort by fused score in descending order to get the final reranked results
    reranked_results = [
        (loads(doc), score)
        for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    ]

    # Return a list of (document, fused score) tuples
    return reranked_results

retrieval_chain_rag_fusion = generate_queries | retriever.map() | reciprocal_rank_fusion
docs = retrieval_chain_rag_fusion.invoke({"question": question})
len(docs)

Decomposition

Take a question and break it down into a set of sub-questions.

py
 # Decomposition
template = """You are a helpful assistant that generates multiple sub-questions related to an input question. \n
The goal is to break down the input into a set of sub-problems / sub-questions that can be answers in isolation. \n
Generate multiple search queries related to: {question} \n
Output (3 queries):"""

Example:

py
question = "What are the main components of an LLM-powered autonomous agent system?"

# Output
['1. What is LLM technology and how does it work in autonomous agent systems?',
 '2. What are the specific components that make up an LLM-powered autonomous agent system?',
 '3. How do the main components of an LLM-powered autonomous agent system interact with each other to enable autonomous functionality?']

Answer recursively

Take the first sub-question and answer it, then use that answer to help answer the second sub-question, and so on.

py
 # Prompt
template = """Here is the question you need to answer:

\n --- \n {question} \n --- \n

Here is any available background question + answer pairs:

\n --- \n {q_a_pairs} \n --- \n

Here is additional context relevant to the question: 

\n --- \n {context} \n --- \n

Use the above context and any background question + answer pairs to answer the question: \n {question}
"""

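Main recursive loop:

py
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

def format_qa_pair(question, answer):
    """Format Q and A pair"""
    
    formatted_string = ""
    formatted_string += f"Question: {question}\nAnswer: {answer}\n\n"
    return formatted_string.strip()

# llm (questions, retriever, and decomposition_prompt are assumed to be defined earlier;
# decomposition_prompt is built from the prompt template above)
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)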
 
q_a_pairs = ""
for q in questions:
    
    rag_chain = (
    {"context": itemgetter("question") | retriever, 
     "question": itemgetter("question"),
     "q_a_pairs": itemgetter("q_a_pairs")} 
    | decomposition_prompt
    | llm
    | StrOutputParser())

    answer = rag_chain.invoke({"question":q,"q_a_pairs":q_a_pairs})
    q_a_pair = format_qa_pair(q,answer)
    q_a_pairs = q_a_pairs + "\n---\n"+  q_a_pair

Trace:

Question 1: smith.langchain.com/public/faef…

Question 2: smith.langchain.com/public/6142…

Question 3: smith.langchain.com/public/84bd…

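Answer individually

Answer each sub-question separately, then combine all the answers to produce the final answer. This suits a set of independent questions whose answers do not depend on each other.

py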
 # Prompt
template = """Here is a set of Q+A pairs:

{context}

Use these to synthesize an answer to the question: {question}
"""

Example:

text
Here is a set of Q+A pairs:

Question 1: 1. What is LLM and how does it work in autonomous agent systems?
Answer 1: LLM stands for Large Language Model and it functions as the core controller in autonomous agent systems. In these systems, LLM is responsible for tasks such as planning, subgoal decomposition, reflection, and refinement. It can generate reasoning traces in natural language and interact with the environment through task-specific discrete actions.

Question 2: 2. What are the different components of an autonomous agent system?
Answer 2: The different components of an autonomous agent system include planning, task decomposition, subgoal and decomposition, reflection and refinement, and memory.

Question 3: 3. How does LLM contribute to the autonomy of an agent system?
Answer 3: LLM contributes to the autonomy of an agent system by functioning as the agent's brain and enabling efficient handling of complex tasks through subgoal decomposition and reflection. It allows the agent to break down large tasks into smaller, manageable subgoals and learn from past actions to improve future results.

Question 4: 4. Can you provide examples of LLM-powered autonomous agent systems and their main components?
Answer 4: Some examples of LLM-powered autonomous agent systems are AutoGPT, GPT-Engineer, and BabyAGI. The main components of these systems include planning, subgoal and decomposition, reflection and refinement, and memory.

Use these to synthesize an answer to the question: What are the main components of LLM-powered autonomous agent system?

Trace:

smith.langchain.com/public/d8f2…

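Step Back

In contrast to the approaches above, step-back prompting asks a more abstract question, so that the user's question can be answered more comprehensively.

Use an LLM to generate a more general query and retrieve more general, higher-level context with it to ground the answer to the original query. Retrieval is also run for the original query, and both contexts are sent to the LLM in the final answer-generation step. LangChain has an example implementation of this.

Example:

py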
 # Few Shot Examples
from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate
examples = [
    {
        "input": "Could the members of The Police perform lawful arrests?",
        "output": "what can the members of The Police do?",
    },
    {
        "input": "Jan Sindel’s was born in what country?",
        "output": "what is Jan Sindel’s personal history?",
    },
]
# We now transform these to example messages
example_prompt = ChatPromptTemplate.from_messages(
    [
        ("human", "{input}"),
        ("ai", "{output}"),
    ]
)
few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples,
)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are an expert at world knowledge. Your task is to step back and paraphrase a question to a more generic step-back question, which is easier to answer. Here are a few examples:""",
        ),
        # Few shot examples
        few_shot_prompt,
        # New question
        ("user", "{question}"),
    ]
)

Generating the final answer: retrieve documents for the step-back question and for the original question independently, then combine both to produce the final answer.

py
# Response prompt
# normal_context and step_back_context hold the content retrieved for the original question and for the step-back question, respectively

response_prompt_template = """You are an expert of world knowledge. I am going to ask you a question. Your response should be comprehensive and not contradicted with the following context if they are relevant. Otherwise, ignore them if they are not relevant.

# {normal_context}
# {step_back_context}

# Original Question: {question}
# Answer:"""
response_prompt = ChatPromptTemplate.from_template(response_prompt_template)

from langchain_core.runnables import RunnableLambda

chain = (
    {
        # Retrieve context using the normal question
        "normal_context": RunnableLambda(lambda x: x["question"]) | retriever,
        # Retrieve context using the step-back question
        # generate_queries_step_back generates the step-back question
        "step_back_context": generate_queries_step_back | retriever,
        # Pass on the question
        "question": lambda x: x["question"],
    }
    | response_prompt
    | ChatOpenAI(temperature=0)
    | StrOutputParser()
)

chain.invoke({"question": question})

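HyDE

Hypothetical Document Embeddings: convert the question into a hypothetical document (a hypothetical answer), embed it, and use that embedding for retrieval. It is a technique for producing document embeddings for retrieval without needing actual training data. In short: "use a template answer to find the real one."

First, the LLM writes a hypothetical answer to the query. The answer reflects the patterns relevant to the query, but the information in it may not be factually accurate. Next, the query and the generated answer are both converted to embeddings. The system then finds and retrieves the actual documents in a predefined database that are closest to these embeddings in vector space.

py
from langchain.prompts import ChatPromptTemplate

# HyDE document generation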
template = """Please write a scientific paper passage to answer the question
Question: {question}
Passage:"""
prompt_hyde = ChatPromptTemplate.from_template(template)

from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

generate_docs_for_retrieval = (
    prompt_hyde | ChatOpenAI(temperature=0) | StrOutputParser() 
)

# Run
question = "What is task decomposition for LLM agents?"
generate_docs_for_retrieval.invoke({"question":question})

# Retrieve
retrieval_chain = generate_docs_for_retrieval | retriever 
retireved_docs = retrieval_chain.invoke({"question":question})
retireved_docs

# RAG
template = """Answer the following question based on this context:

{context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

final_rag_chain = (
    prompt
    | llm
    | StrOutputParser()
)

final_rag_chain.invoke({"context":retireved_docs,"question":question})

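Re-write

Use an LLM to rephrase the initial query to improve retrieval.

text
# The latest question alone does not reveal what is really being asked
Human: Who is the protagonist of this story?
AI: The protagonist is Xiao Ming
Human: Tell me his story

To improve retrieval quality, rewrite the user's question into a standalone question that contains all the keywords needed for retrieval. In the example above it can be rewritten as "Tell me Xiao Ming's story", so that retrieval finds the relevant documents in the database and the answer quality improves.

ts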
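const rephraseChainPrompt = ChatPromptTemplate.fromMessages([
    [
      "system",
      "Given the following conversation and a follow-up question, rephrase the follow-up question as a standalone question. The rephrased question should contain enough information that someone who has not seen the conversation history can understand it.",
    ],
    new MessagesPlaceholder("history"),
    ["human", "Rephrase the following question as a standalone question:\n{question}"],
  ]);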

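Routing (Logical + Semantic)

Query routing (logical + semantic): an LLM-driven decision step that determines what to do next for a given user query. (Either the LLM picks which database to query for each query, or a prompt is selected by the query's semantic similarity.)

It takes a question and routes it to the right data source (for example a graph DB, a SQL DB, or a vector store).

Function calling is used to produce structured output.

Example trace: smith.langchain.com/public/c2ca…

py
from typing import Literal

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_openai import ChatOpenAI

# Data model
class RouteQuery(BaseModel):
    """Route a user query to the most relevant datasource."""

    # There are three data sources: "python_docs", "js_docs", "golang_docs"; the question is mapped to exactly one of them
    datasource: Literal["python_docs", "js_docs", "golang_docs"] = Field(
        ...,
        description="Given a user question choose which datasource would be most relevant for answering their question",
    )
    
# The goal is a structured output restricted to the three options above.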

# LLM with function call
llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)
structured_llm = llm.with_structured_output(RouteQuery)

# Prompt
system = """You are an expert at routing a user question to the appropriate data source.

Based on the programming language the question is referring to, route it to the relevant data source."""

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)

# Define router
router = prompt | structured_llm

# Usage
question = """Why doesn't the following code work:

from langchain_core.prompts import ChatPromptTemplate

prompt = ChatPromptTemplate.from_messages(["human", "speak in {language}"])
prompt.invoke("french")
"""

result = router.invoke({"question": question})
# result is RouteQuery(datasource='python_docs'), so result.datasource is 'python_docs'

# Define a branch that uses result.datasource
def choose_route(result):
    if "python_docs" in result.datasource.lower():
        ### Logic here
        return "chain for python_docs"
    elif "js_docs" in result.datasource.lower():
        ### Logic here
        return "chain for js_docs"
    else:
        ### Logic here
        return "chain for golang_docs"

from langchain_core.runnables import RunnableLambda

full_chain = router | RunnableLambda(choose_route)

full_chain.invoke({"question": question})

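Semantic Routing

Route by semantic similarity: embed the candidate prompts and the query, then pick the prompt most similar to the query.

Trace: smith.langchain.com/public/98c2…

py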
from langchain.utils.math import cosine_similarity
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Two prompts
# Physics prompt
physics_template = """You are a very smart physics professor. \
You are great at answering questions about physics in a concise and easy to understand manner. \
When you don't know the answer to a question you admit that you don't know.

Here is a question:
{query}"""

# Math prompt
math_template = """You are a very good mathematician. You are great at answering math questions. \
You are so good because you are able to break down hard problems into their component parts, \
answer the component parts, and then put them together to answer the broader question.

Here is a question:
{query}"""

# Embed prompts
embeddings = OpenAIEmbeddings()
prompt_templates = [physics_template, math_template]
prompt_embeddings = embeddings.embed_documents(prompt_templates)

# Route question to prompt
def prompt_router(input):
    # Embed question
    query_embedding = embeddings.embed_query(input["query"])
    # Compute similarity
    similarity = cosine_similarity([query_embedding], prompt_embeddings)[0]
    most_similar = prompt_templates[similarity.argmax()]
    # Chosen prompt
    print("Using MATH" if most_similar == math_template else "Using PHYSICS")
    return PromptTemplate.from_template(most_similar)

chain = (
    {"query": RunnablePassthrough()}
    | RunnableLambda(prompt_router)
    | ChatOpenAI()
    | StrOutputParser()
)

print(chain.invoke("What's a black hole"))

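Query Construction

Structured data: mostly stored in SQL or graph databases. It has a predefined schema and is organized in tables or relations, which makes precise queries easy.

Semi-structured data: mixes structured elements (such as tables in a document or a relational database) with unstructured elements (such as text or embedding columns in a relational database).

Unstructured data: usually stored in vector databases. It consists of information without a predefined model and often comes with structured metadata that supports filtering.

Depending on the question, an LLM is used to choose among different databases (relational databases, graph databases, and ordinary vector databases).

In a typical retrieval-augmented generation (RAG) system, the user query is converted to a vector, which is compared with the vectors of the source documents to find the most similar ones. This works well for unstructured data but not necessarily for structured data. LangChain provides the ability to convert a question into a structured query.

Query structuring for metadata filters

Metadata filters let you narrow the search quickly, starting from a natural-language question.

This is a general strategy that applies whenever you want to run different kinds of queries.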
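Once a natural-language question has been turned into a structured query, applying it is ordinary filtering on document metadata. The sketch below assumes hypothetical metadata fields `view_count` and `length` (in seconds) and a hypothetical `VideoFilter` type; it only illustrates the filtering step, not how an LLM produces the filter.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VideoFilter:
    """Structured filter; None means the user did not specify that constraint."""
    min_view_count: Optional[int] = None  # inclusive lower bound
    max_length_sec: Optional[int] = None  # exclusive upper bound


def matches(meta: dict, f: VideoFilter) -> bool:
    """Return True if the metadata satisfies every constraint that is set."""
    if f.min_view_count is not None and meta["view_count"] < f.min_view_count:
        return False
    if f.max_length_sec is not None and meta["length"] >= f.max_length_sec:
        return False
    return True


videos = [
    {"title": "Short intro", "view_count": 500, "length": 240},
    {"title": "Long deep dive", "view_count": 12000, "length": 1058},
]

# "only videos under 5 minutes" would become max_length_sec=300
under_five_min = [v["title"] for v in videos if matches(v, VideoFilter(max_length_sec=300))]
popular = [v["title"] for v in videos if matches(v, VideoFilter(min_view_count=1000))]
```

In practice the filter would be passed to the vector store together with the similarity query, so that only matching documents are ranked.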

py
from langchain_community.document_loaders import YoutubeLoader

docs = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=pbAd8O1Lvm4", add_video_info=True
).load()

"""
Result:
{'source': 'pbAd8O1Lvm4',
 'title': 'Self-reflective RAG with LangGraph: Self-RAG and CRAG',
 'description': 'Unknown',
 'view_count': 11922,
 'thumbnail_url': 'https://i.ytimg.com/vi/pbAd8O1Lvm4/hq720.jpg',
 'publish_date': '2024-02-07 00:00:00',
 'length': 1058,
 'author': 'LangChain'}
"""

# We want to turn natural language into a structured search query, so we define a schema for that query.

import datetime
from typing import Literal, Optional, Tuple
from langchain_core.pydantic_v1 import BaseModel, Field

class TutorialSearch(BaseModel):
    """Search over a database of tutorial videos about a software library."""

    content_search: str = Field(
        ...,
        description="Similarity search query applied to video transcripts.",
    )
    title_search: str = Field(
        ...,
        description=(
            "Alternate version of the content search query to apply to video titles. "
            "Should be succinct and only include key words that could be in a video "
            "title."
        ),
    )
    min_view_count: Optional[int] = Field(
        None,
        description="Minimum view count filter, inclusive. Only use if explicitly specified.",
    )
    max_view_count: Optional[int] = Field(
        None,
        description="Maximum view count filter, exclusive. Only use if explicitly specified.",
    )
    earliest_publish_date: Optional[datetime.date] = Field(
        None,
        description="Earliest publish date filter, inclusive. Only use if explicitly specified.",
    )
    latest_publish_date: Optional[datetime.date] = Field(
        None,
        description="Latest publish date filter, exclusive. Only use if explicitly specified.",
    )
    min_length_sec: Optional[int] = Field(
        None,
        description="Minimum video length in seconds, inclusive. Only use if explicitly specified.",
    )
    max_length_sec: Optional[int] = Field(
        None,
        description="Maximum video length in seconds, exclusive. Only use if explicitly specified.",
    )

    def pretty_print(self) -> None:
        for field in self.__fields__:
            if getattr(self, field) is not None and getattr(self, field) != getattr(
                self.__fields__[field], "default", None
            ):
                print(f"{field}: {getattr(self, field)}")
                
# Prompt the LLM to produce queries.

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

system = """You are an expert at converting user questions into database queries. \
You have access to a database of tutorial videos about a software library for building LLM-powered applications. \
Given a question, return a database query optimized to retrieve the most relevant results.

If there are acronyms or words you are not familiar with, do not try to rephrase them."""
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system),
        ("human", "{question}"),
    ]
)
llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)
structured_llm = llm.with_structured_output(TutorialSearch)
query_analyzer = prompt | structured_llm

Usage:

py
query_analyzer.invoke({"question": "rag from scratch"}).pretty_print()
"""
Result:
content_search: rag from scratch
title_search: rag from scratch
"""

query_analyzer.invoke(
    {"question": "videos on chat langchain published in 2023"}
).pretty_print()

"""
Result:
content_search: chat langchain
title_search: 2023
earliest_publish_date: 2023-01-01
latest_publish_date: 2024-01-01
"""

query_analyzer.invoke(
    {
        "question": "how to use multi-modal models in an agent, only videos under 5 minutes"
    }
).pretty_print()

"""
Result:
content_search: multi-modal models agent
title_search: multi-modal models agent
max_length_sec: 300
"""

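Retrieval

For embeddings, the most popular embedding model and vector store are good enough for most applications. On the application side, the retriever is where there is the most room for optimization.

If the user's question lacks keywords, or its keywords happen to differ from those in the source text, the retriever tends to return low-quality documents, which hurts the LLM's final output.

The approach to fixing an LLM's shortcomings is almost always the same: add more LLM calls.

Some optimization techniques follow:

MultiQueryRetriever

Use an LLM to rewrite the user's input into several different phrasings. MultiQueryRetriever has the LLM generate three queries and then calls the vector store retriever for each of them. With a retriever created as vectorstore.asRetriever(3), that yields 3 × 3 = 9 documents, which are deduplicated before being returned. The prompt is:

text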
You are an AI language model assistant. Your task is
to generate 3 different versions of the given user
question to retrieve relevant documents from a vector database.
By generating multiple perspectives on the user question,
your goal is to help the user overcome some of the limitations
of distance-based similarity search.

Provide these alternative questions separated by newlines between XML tags. For example:

<questions>
Question 1
Question 2
Question 3
</questions>

Original question: 茴香豆是做什么用的

Output:

text
[  "茴香豆的应用或用途是什么？",  "茴香豆通常被用来做什么？",  "可以用茴香豆来制作什么？"]

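Document Compressor

Because of how natural language works, the chunks that rank highest by similarity do not necessarily contain the answer.

ContextualCompressionRetriever is a Retriever that automatically compresses the context:

ts
/**
baseCompressor: the chain called when compressing the context. It accepts any object that implements the Runnable interface, so you can implement your own chain as the compressor.
baseRetriever: the retriever used to fetch the data.
**/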
const retriever = new ContextualCompressionRetriever({
  baseCompressor: compressor,
  baseRetriever: vectorstore.asRetriever(2),
});

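The baseCompressor extracts the core information from each Document based on the user's question. Its prompt, shown further down, extracts the parts of the document most relevant to the question and stresses that the LLM must not edit the extracted parts, so that it does not hallucinate changes to the source text.

text
// Suppose the retriever returns two Document objects with the following content: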
[    "有喝酒的人便都看着他笑，有的叫道，“孔乙己，你脸上又添上新伤疤了！”他不回答，对柜里说，“温两碗酒，要一碟茴香豆。”便排出九文大钱。他们又故意的高声嚷道，“你一定又偷了人家的东西了！”孔乙己睁大眼睛说",    "有几回，邻居孩子听得笑声，也赶热闹，围住了孔乙己。他便给他们一人一颗。孩子吃完豆，仍然不散，眼睛都望着碟子。孔乙己着了慌，伸开五指将碟子罩住，弯腰下去说道，“不多了，我已经不多了。”直起身又看一看豆"]

text
Given the following question and context, extract any part of the context *AS IS* that 
is relevant to answer the question. If none of the context is relevant return 
NO_OUTPUT.

Remember, *DO NOT* edit the extracted parts of the context.

> Question: 茴香豆是做什么用的
> Context:
>>>
有喝酒的人便都看着他笑，有的叫道，“孔乙己，你脸上又添上新伤疤了！”他不回答，对柜里说，“温两碗酒，要一碟
茴香豆。”便排出九文大钱。他们又故意的高声嚷道，“你一定又偷了人家的东西了！”孔乙己睁大眼睛说
>>>
Extracted relevant parts:

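After ContextualCompressionRetriever's processing, the returned documents are shorter, which leaves more room in the context window:

javascript
// For the second document the LLM returned NO_OUTPUT, meaning it found nothing there that is relevant to the question.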
[
  Document {
    pageContent: '对柜里说，“温两碗酒，要一碟茴香豆。”',
    metadata: { source: '../data/kong.txt', loc: [Object] }
  }
]

ScoreThresholdRetriever

Sometimes we need another way to decide how many reference documents to return, rather than hard-coding the count passed to asRetriever.

ts
const retriever = ScoreThresholdRetriever.fromVectorStore(vectorstore, {
    minSimilarityScore: 0.8,
    maxK: 5,
    kIncrement: 1,
});

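minSimilarityScore: the minimum similarity threshold, that is, how similar a document vector must be to the query vector for the document to be returned. Set it according to your document type; around 0.8 is typical. It prevents returning so many documents that too many tokens are consumed.
maxK: the maximum number of documents returned at once, mainly to avoid excessive token consumption.
kIncrement: the step size of the algorithm, like the k in i += k in a for loop. Each round fetches kIncrement more documents and checks whether they still meet the similarity threshold; fetching stops once a document falls below the threshold or maxK is reached, and the documents that meet the threshold are returned.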
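The interaction of the three parameters can be sketched as a small loop. This is an illustration of the idea described above, not the langchain.js source; `search` is a hypothetical function that returns the top-k `(document, score)` pairs, best first.

```python
from typing import Callable


def score_threshold_retrieve(
    search: Callable[[str, int], list[tuple[str, float]]],
    query: str,
    min_similarity: float,
    max_k: int,
    k_increment: int,
) -> list[str]:
    """Grow k step by step until a result falls below the threshold,
    the store runs out of documents, or max_k is reached."""
    k = 0
    results: list[tuple[str, float]] = []
    while k < max_k:
        k = min(k + k_increment, max_k)
        results = search(query, k)
        ran_out = len(results) < k
        below = any(score < min_similarity for _, score in results)
        if ran_out or below:
            break
    # Only documents that meet the threshold are returned
    return [doc for doc, score in results if score >= min_similarity]


# Toy vector store: fixed scores, already sorted from best to worst
scored = [("a", 0.95), ("b", 0.85), ("c", 0.60), ("d", 0.50)]


def toy_search(query: str, k: int) -> list[tuple[str, float]]:
    return scored[:k]


hits = score_threshold_retrieve(toy_search, "q", min_similarity=0.8, max_k=5, k_increment=1)
capped = score_threshold_retrieve(toy_search, "q", min_similarity=0.4, max_k=3, k_increment=2)
```

A larger `k_increment` means fewer round trips to the vector store at the cost of fetching a few documents that are then discarded.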

Re-ranking

The reciprocal_rank_fusion function in the RAG Fusion section above is one implementation of re-ranking.

There are also ready-made options that can be used directly, such as Cohere Re-Rank.

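Retrieval (CRAG)

Advanced RAG techniques: generation and evaluation

Corrective-RAG (CRAG) is a RAG strategy that adds self-reflection / self-grading of the retrieved documents.

CRAG uses a lightweight "retrieval evaluator" that returns a confidence score for each retrieved document. The score then determines which retrieval action to trigger. For example, the evaluator can place the retrieved documents into one of three buckets: correct, ambiguous, or incorrect.

If every retrieved document scores below the threshold, retrieval is assumed to be "incorrect". This triggers a fallback to a new knowledge source (for example web search) to preserve generation quality.

If at least one retrieved document scores above the threshold, retrieval is assumed to be "correct", which triggers knowledge refinement of the retrieved documents. Knowledge refinement splits each document into "knowledge strips", scores each strip for relevance, and recombines the most relevant strips into the internal knowledge used for generation.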
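The bucket decision can be sketched as a small function over the evaluator's scores. The two thresholds (`upper`, `lower`), their values, and the action names are assumptions made for illustration; the evaluator itself is not shown.

```python
def crag_bucket(scores: list[float], upper: float = 0.7, lower: float = 0.3) -> str:
    """Classify a retrieval from per-document confidence scores."""
    if any(s >= upper for s in scores):
        return "correct"      # at least one document is clearly relevant
    if all(s < lower for s in scores):
        return "incorrect"    # nothing retrieved looks relevant
    return "ambiguous"        # somewhere in between


# Hypothetical mapping from bucket to follow-up action
ACTIONS = {
    "correct": "refine_retrieved_documents",
    "incorrect": "web_search",
    "ambiguous": "refine_and_web_search",
}


def next_action(scores: list[float]) -> str:
    return ACTIONS[crag_bucket(scores)]


action_good = next_action([0.9, 0.1])
action_bad = next_action([0.1, 0.2])
action_mixed = next_action([0.5, 0.1])
```

Note that an empty retrieval falls into the "incorrect" bucket here, which sends the query to web search.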

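An obvious limitation is that CRAG depends heavily on the quality of the retrieval evaluator and is susceptible to bias introduced by web search. Fine-tuning the retrieval evaluator may be unavoidable to ensure the quality and accuracy of the output.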

Deep Dive：

www.youtube.com/watch?v=E2s…

Notebooks：

github.com/langchain-a…

Generation

Common approach: concatenate all retrieved context (above some relevance threshold) with the query and pass it to the LLM.

py
 # Prompt
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

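Retrieval (Self-RAG)

Self-reflective retrieval-augmented generation.

The answer is refined by sending the retrieved context to the LLM chunk by chunk.

The quality of the generation is used to decide whether to rewrite the question or to retrieve documents again.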

Notebooks：

github.com/langchain-a…

Impact of long context

Deep dive：

www.youtube.com/watch?v=SsH…

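Memory

In multi-turn conversations, memory keeps track of the chat context. It includes the history and summaries derived from the history.

History

Maintain the chat history manually
Maintain the chat history automatically
Generate a summary of the chat history automatically: LangChain provides a tool for this, ConversationSummaryMemory
- Add the LLM's output to the history
- Read all messages from the history into messages
- Convert messages to a string with the getBufferString function
- Use summaryChain to produce a new summary
- Store the new summary in the summary variable
- Clear the history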

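Memory strategies

ConversationChain

It passes in the entire history.

It does not follow the LCEL paradigm. It is a highly encapsulated chain that leaves little room for external changes, which limits flexibility during development.

The prompt is as follows (it can be customized):

text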
The following is a friendly conversation between a human and an AI. The AI is talkative 
and provides lots of specific details from its context. If the AI does not know the 
answer to a question, it truthfully says it does not know.

Current conversation:
Human: 我是小明
AI: 你好，小明！很高兴认识你。我们要聊些什么呢？
Human: 我叫什么？
AI:

ConversationSummaryMemory

Uses the LLM to summarize the chat history progressively: new summary = old summary + new conversation lines, and so on. The prompt is:

text
Progressively summarize the lines of conversation provided, adding onto the previous summary returning a new summary.

EXAMPLE
Current summary:
The human asks what the AI thinks of artificial intelligence. The AI thinks artificial intelligence is a force for good.

New lines of conversation:
Human: Why do you think artificial intelligence is a force for good?
AI: Because artificial intelligence will help humans reach their full potential.

New summary:
The human asks what the AI thinks of artificial intelligence. The AI thinks artificial intelligence is a force for good because it will help humans reach their full potential.
END OF EXAMPLE

Current summary:
The human, identifying themselves as Xiao Ming, greets the AI. The AI responds warmly and offers its assistance.

New lines of conversation:
Human: 我是小明
AI: 你好，小明。很高兴认识你！有什么我可以帮助你的吗？

New summary:

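ConversationSummaryBufferMemory

ConversationSummaryBufferMemory combines BufferWindowMemory and ConversationSummaryMemory based on token count: when the history is large it switches to a summary, and when the history is small it uses the raw chat records.

It counts the tokens of the complete chat history and checks whether they exceed the configured maxTokenLimit; if so, the history is condensed into a summary that is passed in instead.

The design of ConversationSummaryBufferMemory is rather blunt: short chats behave like BufferWindowMemory, long chats become ConversationSummaryMemory, and there is no particular gain beyond that. A more sensible design is to include, on every turn, the raw content of the last k exchanges plus a continuously updated summary. In long conversations the LLM then remembers both the most recent exchanges and a summary of the long-running conversation, which is the better choice.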
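The "last k exchanges plus a running summary" design can be sketched as a small class. This is not a LangChain class: `SummaryWindowMemory` and its methods are hypothetical names, and the `summarize` callable stands in for an LLM summarization call.

```python
from typing import Callable


class SummaryWindowMemory:
    """Keep the last k exchanges verbatim and fold older ones into a running summary."""

    def __init__(self, summarize: Callable[[str, list[str]], str], k: int = 2):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.summarize = summarize      # (previous summary, old turns) -> new summary
        self.k = k
        self.summary = ""
        self.recent: list[str] = []

    def add_turn(self, human: str, ai: str) -> None:
        self.recent.append(f"Human: {human}\nAI: {ai}")
        overflow = self.recent[:-self.k]            # turns that no longer fit the window
        if overflow:
            self.summary = self.summarize(self.summary, overflow)
            self.recent = self.recent[-self.k:]

    def context(self) -> str:
        """Text to place in the prompt: summary first, then the raw recent turns."""
        parts = [f"Summary so far: {self.summary}"] if self.summary else []
        return "\n".join(parts + self.recent)


def fake_summarize(previous: str, turns: list[str]) -> str:
    """Stand-in for an LLM: keep only the first line of each old turn."""
    heads = "; ".join(t.splitlines()[0] for t in turns)
    return f"{previous} | {heads}" if previous else heads


memory = SummaryWindowMemory(fake_summarize, k=2)
memory.add_turn("q1", "a1")
memory.add_turn("q2", "a2")
memory.add_turn("q3", "a3")
context = memory.context()
```

Each turn costs at most one extra summarization call once the window is full, while the prompt size stays bounded by the summary length plus k exchanges.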

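EntityMemory

EntityMemory aims to generate and update descriptions of the different entities that come up during a chat.

LangChain provides a default prompt for EntityMemory chat (users can also supply their own suitable prompt).

Suppose the conversation is:

ts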
const res1 = await chain.call({ input: "我叫小明，今年 18 岁" });
const res2 = await chain.call({ input: "ABC 是一家互联网公司，主要是售卖方便面的公司" });

text
You are an AI assistant reading the transcript of a conversation between an AI and a 
human. Extract all of the proper nouns from the last line of conversation. As a 
guideline, a proper noun is generally capitalized. You should definitely extract all 
names and places.

The conversation history is provided just in case of a coreference 
(e.g. "What do you know about him" where "him" is defined in a previous line) -- 
ignore items mentioned there that are not in the last line.\n\nReturn the output as a 
single comma-separated list, or NONE if there is nothing of note to return (e.g. the 
user is just issuing a greeting or having a simple conversation).

EXAMPLE
Conversation history:
Person #1: my name is Jacob. how's it going today?
AI: "It's going great! How about you?"
Person #1: good! busy working on Langchain. lots to do.
AI: "That sounds like a lot of work! What kind of things are you doing to make Langchain better?"
Last line:
Person #1: i'm trying to improve Langchain's interfaces, the UX, its integrations with various products the user might want ... a lot of stuff.
Output: Jacob,Langchain
END OF EXAMPLE

EXAMPLE
Conversation history:
Person #1: how's it going today?
AI: "It's going great! How about you?"
Person #1: good! busy working on Langchain. lots to do.
AI: "That sounds like a lot of work! What kind of things are you doing to make Langchain better?"
Last line:
Person #1: i'm trying to improve Langchain's interfaces, the UX, its integrations with various products the user might want ... a lot of stuff. I'm working with Person #2.
Output: Langchain, Person #2
END OF EXAMPLE

Conversation history (for reference only):
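Human: My name is Xiao Ming and I am 18 years old
AI: Hello, Xiao Ming! Nice to meet you. You are 18, an age full of youth and energy. Is there a question I can help you answer, or a topic you would like to talk about?
Last line of conversation (for extraction):
Human: ABC is an internet company that mainly sells instant noodles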
Output:

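The first paragraph sets out the task background: an AI that reads a conversation transcript and extracts proper nouns from the last line. Because the prompt targets English, it gives the hint that proper nouns are generally capitalized, and it stresses that all names and places must be extracted. This part defines the task, a hint, and the requirement.
The second paragraph stresses that the chat history is for reference only, repeats that only proper nouns appearing in the last line should be extracted, and specifies the return format both for multiple proper nouns and for the case where there are none.
Then come two examples. The first is an ordinary case that makes the task more concrete. The second, I think, uses Person #2 to reinforce what counts as a proper noun. Few-shot prompting, that is, using examples to strengthen the LLM's understanding of the task, is a common and very effective technique.
Finally, Conversation history (for reference only) reinforces once more that the chat history is only for reference, while Last line of conversation (for extraction) marks the actual extraction target.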
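
Since the prompt fixes the output format (a single comma-separated list, or NONE), the completion has to be turned back into a list of entity names. The helper below is a minimal sketch of that step; parseEntityList is a hypothetical name and is not claimed to be how LangChain parses the result internally.

```typescript
// Parse the raw completion of the entity-extraction prompt.
// Assumption: the model followed the prompt and returned either a
// comma-separated list or the literal string NONE.
function parseEntityList(raw: string): string[] {
  const text = raw.trim();
  if (text === "" || text.toUpperCase() === "NONE") {
    return [];
  }
  // Trim each item and drop duplicates while keeping the original order.
  const seen = new Set<string>();
  for (const part of text.split(",")) {
    const name = part.trim();
    if (name !== "") {
      seen.add(name);
    }
  }
  return Array.from(seen);
}
```

Both output styles from the examples in the prompt ("Jacob,Langchain" without a space and "Langchain, Person #2" with one) come out as clean name arrays.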

After the exchange, EntityMemory extracts a description of each entity. The prompt used for this is:

text
You are an AI assistant helping a human keep track of facts about relevant people, 
places, and concepts in their life. Update and add to the summary of the provided 
entity in the "Entity" section based on the last line of your conversation with the 
human. If you are writing the summary for the first time, return a single 
sentence.

The update should only include facts that are relayed in the last line of 
conversation about the provided entity, and should only contain facts about the 
provided entity.

If there is no new information about the provided entity or the information is not worth noting (not an important or relevant fact to remember long-term), output the exact string "UNCHANGED" below.

Full conversation history (for context):
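Human: My name is Xiao Ming and I am 18 years old
AI: Hello, Xiao Ming! Nice to meet you. You are 18, an age full of youth and energy. Is there a question I can help you answer, or a topic you would like to talk about?

H: ABC is an internet company that mainly sells instant noodles
AI: ABC is a very interesting company that combines internet technology with instant noodle sales. The two fields seem unrelated, but innovative business models keep emerging these days. Do they use any special marketing strategies or technology to boost sales or improve the customer experience?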

Entity to summarize:
ABC

Existing summary of ABC:
No current information known.

Last line of conversation:
Human: ABC is an internet company that mainly sells instant noodles
Updated summary (or the exact string "UNCHANGED" if there is no new information about ABC above):

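The purpose of this step is to take the entities the user mentioned in this turn, that is, the entities extracted by the previous prompt, and update the information stored about them.

The first paragraph states the LLM's task: recording information about entities.
The second paragraph limits the scope to the user's latest message and to content about the target entity only.
The third paragraph specifies that if there is nothing new, or the update is not worth remembering long-term, the special string UNCHANGED should be returned.
The rest supplies the chat history, the entity to record, the information currently stored for that entity, and the last line of the conversation with the user.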
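
How the reply to this prompt might be applied to a store of entity summaries can be sketched as follows. The in-memory Map and the name applyEntityUpdate are assumptions made for illustration; the only behavior taken from the prompt itself is that the exact string UNCHANGED means "keep the existing summary".

```typescript
// Hypothetical in-memory store: entity name -> current summary.
type EntityStore = Map<string, string>;

// Apply the completion of the summary-update prompt to the store.
// Returns true when the stored summary was replaced.
function applyEntityUpdate(
  store: EntityStore,
  entity: string,
  completion: string
): boolean {
  const text = completion.trim();
  // The prompt asks for the exact string "UNCHANGED" when nothing is worth
  // remembering; an empty completion is treated the same way in this sketch.
  if (text === "" || text === "UNCHANGED") {
    return false;
  }
  store.set(entity, text);
  return true;
}
```

An UNCHANGED reply leaves the previous summary untouched, so an entity that is merely mentioned again does not lose what was already recorded about it.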

The LLM then returns the information about the entity:

text
ABC is an internet company that primarily sells instant noodles.

After the two exchanges above, suppose we ask:

ts
const res3 = await chain.call({ input: "Introduce Xiao Ming and ABC" });

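As before, EntityMemory uses the LLM to extract the list of entities and looks up the stored information for them. That information, together with the chat history, is passed into ConversationChain's ENTITY_MEMORY_CONVERSATION_TEMPLATE. Let's take a look at this prompt: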

text
You are an assistant to a human, powered by a large language model trained by OpenAI.

You are designed to be able to assist with a wide range of tasks, from answering simple 
questions to providing in-depth explanations and discussions on a wide range of topics. 
As a language model, you are able to generate human-like text based on the input you 
receive, allowing you to engage in natural-sounding conversations and provide responses 
that are coherent and relevant to the topic at hand.

You are constantly learning and improving, and your capabilities are constantly 
evolving. You are able to process and understand large amounts of text, and can use 
this knowledge to provide accurate and informative responses to a wide range of 
questions. You have access to some personalized information provided by the human in 
the Context section below. Additionally, you are able to generate your own text based 
on the input you receive, allowing you to engage in discussions and provide 
explanations and descriptions on a wide range of topics.

Overall, you are a powerful tool that can help with a wide range of tasks and provide valuable insights and information on a wide range of topics. Whether the human needs help with a specific question or just wants to have a conversation about a particular topic, you are here to assist.

Context:
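 Xiao Ming: Xiao Ming is an 18-year-old young man at a passionate, energetic age. He may be studying or may already have started working, and has unlimited potential and possibilities. He has some connection to the company ABC, but the details have not been provided yet.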
 ABC: ABC is an Internet company that primarily sells instant noodles.
  
Current conversation:
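Human: My name is Xiao Ming and I am 18 years old
AI: Nice to meet you, Xiao Ming. You are 18, an age full of youth and energy. Is there anything I can help you with?
Human: ABC is an internet company that mainly sells instant noodles
AI: I see, ABC is an internet company focused on selling instant noodles. That is a very interesting business model. Would you like to know more about this company, or do you have other questions for me?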

Last line:
Human: Introduce Xiao Ming and ABC
You:

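Custom history storage

LangChain ships chat history integrations for many databases, including common ones such as MongoDB and Redis. In real projects, however, you can rarely pick the backend database freely, and in most cases the history has to be stored in existing infrastructure.

The main API methods to implement are:

getMessages: return all chat messages stored in the history
addMessage: add a single message
addMessages: add an array of messages
clear: clear the chat history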
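
To make the four methods concrete, here is a minimal file-backed sketch that stores one JSON file per session. It is an illustration under assumptions, not the article's own implementation: FileChatHistory and the simplified StoredMessage type are hypothetical, and the sketch does not extend LangChain's chat history base class or use its message objects, which a real integration would need to do.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Simplified message shape for this sketch only.
type StoredMessage = { role: "human" | "ai"; content: string };

// Hypothetical file-backed history: one JSON file per session id.
class FileChatHistory {
  private filePath: string;

  constructor(options: { sessionId: string; dir: string }) {
    fs.mkdirSync(options.dir, { recursive: true });
    this.filePath = path.join(options.dir, `${options.sessionId}.json`);
  }

  // Return every message stored for this session (empty if none yet).
  async getMessages(): Promise<StoredMessage[]> {
    if (!fs.existsSync(this.filePath)) {
      return [];
    }
    return JSON.parse(fs.readFileSync(this.filePath, "utf8")) as StoredMessage[];
  }

  // Append a single message.
  async addMessage(message: StoredMessage): Promise<void> {
    await this.addMessages([message]);
  }

  // Append several messages, preserving their order.
  async addMessages(messages: StoredMessage[]): Promise<void> {
    const all = (await this.getMessages()).concat(messages);
    fs.writeFileSync(this.filePath, JSON.stringify(all, null, 2), "utf8");
  }

  // Remove the stored history for this session.
  async clear(): Promise<void> {
    fs.rmSync(this.filePath, { force: true });
  }
}
```

The same four methods could sit on top of whatever storage the existing infrastructure offers (a SQL table, an internal key-value service); only the read and write calls would change.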

Demo

ts
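// Imports used by the scripts below. The paths assume the LangChain.js 0.1+ package layout; adjust them to the installed version.
import * as path from "path" ;
import { TextLoader } from "langchain/document_loaders/fs/text" ;
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter" ;
import { OpenAIEmbeddings, ChatOpenAI } from "@langchain/openai" ;
import { FaissStore } from "@langchain/community/vectorstores/faiss" ;
import { Document } from "@langchain/core/documents" ;
import { RunnableSequence, RunnablePassthrough, RunnableWithMessageHistory } from "@langchain/core/runnables" ;
import { ChatPromptTemplate, MessagesPlaceholder } from "@langchain/core/prompts" ;
import { StringOutputParser } from "@langchain/core/output_parsers" ;

// Script 1: a standalone script that splits the given novel text and saves the result to a local database file
const baseDir = __dirname ;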

const loader = new TextLoader(path.join(baseDir, "../../data/qiu.txt")) ;
const docs = await loader.load() ;

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,
  chunkOverlap: 100,
}) ;

const splitDocs = await splitter.splitDocuments(docs) ;

const embeddings = new OpenAIEmbeddings() ;
const vectorStore = await FaissStore.fromDocuments(splitDocs, embeddings) ;

await vectorStore.save(path.join(baseDir, "../../db/qiu")) ;

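// Script 2 (a separate file, hence vectorStore is declared again): build a chain that reads the relevant documents from the database using the rewritten standalone question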
async function loadVectorStore() {
  const directory = path.join(__dirname, "../../db/qiu") ;
  const embeddings = new OpenAIEmbeddings() ;
  const vectorStore = await FaissStore.load(directory, embeddings) ;

  return vectorStore ;
}

const vectorStore = await loadVectorStore() ;
const retriever = vectorStore.asRetriever(2) ;

const convertDocsToString = (documents: Document[]): string => {
  return documents.map((document) => document.pageContent).join("\n") ;
} ;
// Simply use the retriever to fetch the relevant documents, then convert them to a plain string.
const contextRetrieverChain = RunnableSequence.from([
  (input) => input.standalone_question,
  retriever,
  convertDocsToString,
]) ;

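// Define a prompt that includes the chat history and answers the user's question
const SYSTEM_TEMPLATE = `
You are the ultimate purist fan who knows Liu Cixin's "Ball Lightning" inside out. You excel at explaining and answering questions in detail based on the original text, and you quote the original text in your answers.
Answer only from the original text and do your best to answer the user's question. If the original text contains nothing relevant, you may answer "The original text contains nothing relevant."

The following passages from the original text are relevant to the user's question:
{context}
` ;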

const prompt = ChatPromptTemplate.fromMessages([
["system", SYSTEM_TEMPLATE],
new MessagesPlaceholder("history"),    // MessagesPlaceholder reserves a slot for history in the messages; it is later filled with an array of Message objects
["human", "Now, based on the original text, answer the following question:\n{standalone_question}"],
]) ;

// Define a RAG chain: rewrite the question => retrieve documents for the rewritten question => generate the reply
const model = new ChatOpenAI() ;


 const rephraseChainPrompt = ChatPromptTemplate.fromMessages([
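    [
      "system",
      "Given the following conversation and a follow-up question, rephrase the follow-up question as a standalone question. The rephrased question should contain enough information to be understood by someone who has not seen the conversation history.",
    ],
    new MessagesPlaceholder("history"),
    ["human", "Rephrase the following question as a standalone question:\n{question}"],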
  ]);

 const rephraseChain = RunnableSequence.from([
    rephraseChainPrompt,
    new ChatOpenAI({
      temperature: 0.2,
    }),
    new StringOutputParser(),
  ]);


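/** 
The RAG chain takes two inputs, question and history: the former is the user's original question, the latter is the chat history (supplied by the chain defined further below).

In the first node, RunnablePassthrough.assign adds a standalone_question key to this input. The incoming question and history are passed to rephraseChain as arguments, and its result is assigned to standalone_question and handed on to the following nodes.
So after the first node has run, the data passed to the next node has three keys: question, history, and standalone_question, i.e. the user's original question, the chat history, and the rewritten question.
By the same principle, in the second node these three inputs are passed to contextRetrieverChain, which uses standalone_question to fetch the relevant documents and assigns the result to context. After this node finishes, the data passed on has four keys: question, history, standalone_question, and context.
What follows is the familiar sequence: build the prompt, get the LLM response, and let StringOutputParser extract the text content.
**/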
const ragChain = RunnableSequence.from([
  RunnablePassthrough.assign({
    standalone_question: rephraseChain,
  }),
  RunnablePassthrough.assign({
    context: contextRetrieverChain,
  }),
  prompt,
  model,
  new StringOutputParser(),
]) ;

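// At this point we have a basic RAG chain. Next we add chat history to it, using RunnableWithMessageHistory to manage the history

// The function passed as getMessageHistory must return the initial chat history for the sessionId supplied by the caller.
// JSONChatHistory and chatHistoryDir are not defined in this snippet; they are assumed to be a custom history class (implementing the API listed above) and its storage directory.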
const ragChainWithHistory = new RunnableWithMessageHistory({
  runnable: ragChain,
  getMessageHistory: (sessionId) => new JSONChatHistory({ sessionId, dir: chatHistoryDir }),
  historyMessagesKey: "history",
  inputMessagesKey: "question",
}) ;
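
The way keys accumulate across the two assign steps can be shown with a dependency-free sketch. Everything here is a stand-in chosen for illustration: assign mimics only the "copy the input and add one key" behavior described in the comment above, and fakeRephrase and fakeRetrieve replace the real LLM and vector store calls with string formatting.

```typescript
// A step receives the running input object and produces one value.
type Step = (input: Record<string, unknown>) => Promise<unknown>;

// Mimics the described behavior: keep all existing keys, add one new key
// whose value is computed from the full input.
function assign(key: string, step: Step) {
  return async (
    input: Record<string, unknown>
  ): Promise<Record<string, unknown>> => ({
    ...input,
    [key]: await step(input),
  });
}

// Stand-ins for rephraseChain and contextRetrieverChain (no LLM, no vector store).
const fakeRephrase: Step = async (input) =>
  `standalone(${String(input.question)})`;
const fakeRetrieve: Step = async (input) =>
  `docs for ${String(input.standalone_question)}`;

async function runPipeline(
  input: Record<string, unknown>
): Promise<Record<string, unknown>> {
  // After this step: question, history, standalone_question.
  const afterFirst = await assign("standalone_question", fakeRephrase)(input);
  // After this step: question, history, standalone_question, context.
  const afterSecond = await assign("context", fakeRetrieve)(afterFirst);
  return afterSecond;
}
```

Calling runPipeline with a question and a history yields an object with four keys, which is the shape the prompt template then consumes.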

Let's test it:

ts
  const res = await ragChainWithHistory.invoke(
    {
      question: "什么是球状闪电？",
    },
    {
      configurable: { sessionId: "test-history" },
    }
  );
  
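  // Returns:
  // According to the original text, ball lightning is an extremely rare phenomenon: a curved space brimming with energy,
  // a bubble that seems both there and not there, an electron the size of a football. It is described as a surreal little
  // thing, like a speck of dust spilled from the land of dreams, hinting at the vastness and mystery of the universe and at
  // the possible existence of other worlds completely different from our reality. The exact nature and composition of ball
  // lightning remain a scientific mystery, but it is not the thing described in the novel; it is a natural phenomenon that really exists.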

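// Follow-up question in the same session. It goes through ragChainWithHistory (not the bare ragChain) so that the history is loaded.
const res2 = await ragChainWithHistory.invoke(
    {
      question: "What stories does the text tell about this phenomenon?",
    },
    {
      configurable: { sessionId: "test-history" },
    }
  );
  
  // Returns:
  // Ball lightning has a rich storyline in the novel "Ball Lightning". The novel describes a young man's journey of research
  // that begins after he witnesses ball lightning. He finds that its properties and behavior differ markedly from previously
  // known forms of lightning: it has curved space, brimming energy, and a mysterious state of existence. While seeking to
  // explain and understand ball lightning, he secretly investigates the handwriting of a dead scientist, explores an
  // underground science city of the former Soviet Union, and runs into the many obstacles of a next-generation world war.
  // In the end he discovers that ball lightning is not merely a natural phenomenon but something that can be used as a weapon
  // of war, and it becomes the ultimate weapon deciding the survival of his homeland.
  //
  // This story shows how unusual and mysterious ball lightning is, as well as the impact and consequences of studying and
  // exploiting it. In the novel ball lightning is portrayed as a fascinating phenomenon that also becomes a key element of
  // the war and changes the shape of the whole world.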

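Some Prompt Tips

Use few-shot prompting: show the LLM with examples what it is mainly expected to do, and give negative examples alongside the positive ones. For instance, tell the LLM what a JSON Schema is, which outputs will parse successfully, and which will not.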
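
A minimal sketch of this tip, assuming the target output is plain JSON: each example is labeled as good or bad by actually trying to parse it, so the labels in the prompt cannot drift from reality. The names Example, parsesAsJson, and buildFewShotPrompt are hypothetical, and the layout of the prompt text is just one possible choice.

```typescript
// One few-shot example: a candidate output plus a short explanation.
type Example = { output: string; note: string };

// Returns true when the text parses as JSON.
function parsesAsJson(text: string): boolean {
  try {
    JSON.parse(text);
    return true;
  } catch {
    return false;
  }
}

// Build a prompt that shows both positive and negative examples.
function buildFewShotPrompt(task: string, examples: Example[]): string {
  const blocks = examples.map((ex) => {
    const label = parsesAsJson(ex.output)
      ? "GOOD (parses)"
      : "BAD (fails to parse)";
    return `${label}: ${ex.output}\nWhy: ${ex.note}`;
  });
  return [task, "", "EXAMPLES", ...blocks, "END OF EXAMPLES"].join("\n");
}

const fewShotPrompt = buildFewShotPrompt(
  "Return a JSON object with a single string field called name.",
  [
    { output: '{"name": "Xiao Ming"}', note: "Double-quoted key and value." },
    { output: "{name: 'Xiao Ming'}", note: "Unquoted key and single quotes are not valid JSON." },
  ]
);
```

The resulting prompt string can be placed in the system message ahead of the actual user input.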
Prompt template for RAG answers:

py
 # Prompt
template = """Answer the question based only on the following context:
{context}

Question: {question}
"""

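Reducing token cost and latency

Will calling the LLM this many times increase token cost or latency? One option is to use a relatively cheap model, or a self-hosted local model, for the more basic information-extraction steps, and to use GPT-4 only for generating the final reply to keep quality high.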
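
One way to express this split is a small routing table from pipeline stage to model. This is a sketch under assumptions: the stage names mirror steps discussed in this article, and the model identifiers are placeholders picked for illustration, not recommendations or verified model names.

```typescript
// Pipeline stages that call an LLM in the flows described above.
type Stage = "rephrase" | "entity_extraction" | "summary" | "final_answer";

// Placeholder model names (assumptions for this sketch).
const MODEL_BY_STAGE: Record<Stage, string> = {
  rephrase: "cheap-or-local-model",
  entity_extraction: "cheap-or-local-model",
  summary: "cheap-or-local-model",
  final_answer: "gpt-4",
};

// Look up which model a given stage should use.
function pickModel(stage: Stage): string {
  return MODEL_BY_STAGE[stage];
}
```

Each chain would then construct its model from pickModel(stage), so changing the cost and quality trade-off is a one-line edit in the table.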

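With LangChain, a chatbot is no longer a simple matter of calling an API; it becomes an engineering task built from prompt management and multiple cooperating LLMs. You need to balance system complexity, latency, token usage, and answer quality.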