LangChain Tutorial 9: Retriever

Preface

Code for this tutorial series: https://github.com/shar-pen/Langchain-MiniTutorial

I mainly followed the official LangChain tutorials and selectively noted down what I learned.

Here is the tutorial list:

  • 1. Getting started with LangChain
  • 2. Prompt
  • 3. OutputParser / output parsing
  • 4. Model / vLLM model deployment and LangChain invocation
  • 5. DocumentLoader / various document loaders
  • 6. TextSplitter / document splitting
  • 7. Embedding / text vectorization
  • 8. VectorStore / vector database storage and retrieval
  • 9. Retriever
  • 10. Reranker / document reranking
  • 11. RAG pipeline / multi-turn conversational RAG
  • 12. Agent / tool definition / agents calling tools / agentic RAG

VectorStore-backed Retriever

A VectorStore-backed retriever is a document retrieval system that uses a vector store to search documents by their vector representations. This makes similarity-based search efficient and is especially well suited to unstructured data.

The document-search and response-generation steps in a RAG system are:

  1. Document loading: import the raw documents.
  2. Text splitting: split the text into manageable chunks.
  3. Vector embedding: convert the text into numeric vectors using an embedding model.
  4. Storing in a vector database: store the resulting embeddings in a vector database for efficient retrieval.

At query time:

  • Flow: user query → embedding → search the vector store → retrieve relevant chunks → LLM generates the response
  • The user's query is converted into an embedding vector using the embedding model.
  • This query embedding is compared against the document vectors stored in the database to retrieve the most relevant results.
  • The retrieved document chunks are passed to a large language model (LLM), which generates the final response based on them.
import faiss
from langchain_core.documents import Document
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_openai import OpenAIEmbeddings

openai_embedding = OpenAIEmbeddings(
	model="bge-m3",
	base_url='http://localhost:9997/v1',
	api_key='cannot be empty',
	# dimensions=1024,
)

embed_dim = len(openai_embedding.embed_query("hello world"))
texts = [
    "AI helps doctors diagnose diseases faster, improving patient outcomes.",
    "AI can analyze medical images to detect conditions like cancer.",
    "Machine learning predicts patient outcomes based on health data.",
    "AI speeds up drug discovery by predicting the effectiveness of compounds.",
    "AI monitors patients remotely, enabling proactive care for chronic diseases.",
    "AI automates administrative tasks, saving time for healthcare workers.",
    "NLP extracts insights from electronic health records for better care.",
    "AI chatbots help with patient assessments and symptom checking.",
    "AI improves drug manufacturing, ensuring better quality and efficiency.",
    "AI optimizes hospital operations and reduces healthcare costs."
]

documents = [
	Document(page_content=text, metadata={"source": text})
	for text in texts
]

db = FAISS.from_documents(documents, openai_embedding)

Once the vector database has been created, you can load and query it using retrieval methods such as similarity search and Maximal Marginal Relevance (MMR) to find relevant text.

The as_retriever method converts a vector database into a retriever, enabling efficient search and retrieval of documents from the vector store.

How it works

  • The as_retriever() method turns a vector store (such as FAISS) into a retriever object compatible with LangChain's retrieval workflows.
  • This retriever can be used directly in a RAG pipeline, or combined with a large language model (LLM) to build intelligent search systems.
retriever = db.as_retriever()

Advanced retriever configuration

The as_retriever method lets you configure advanced retrieval strategies such as similarity search, Maximal Marginal Relevance (MMR), and similarity-score-threshold filtering.

Parameters:

  • **kwargs: keyword arguments passed to the retrieval function:
    • search_type: specifies the search method.
      • "similarity": returns the most relevant documents based on cosine similarity.
      • "mmr": uses the Maximal Marginal Relevance algorithm, balancing relevance and diversity.
      • "similarity_score_threshold": returns documents whose similarity score exceeds a specified threshold.
    • search_kwargs: additional options for fine-tuning the results:
      • k: number of documents to return (default: 4).
      • score_threshold: minimum similarity score for the "similarity_score_threshold" search type (e.g., 0.8).
      • fetch_k: number of documents initially fetched during MMR search (default: 20).
      • lambda_mult: controls diversity in MMR results (0 = maximum diversity, 1 = maximum relevance, default: 0.5).
      • filter: metadata filter for selective document retrieval.

Returns:

  • VectorStoreRetriever: the initialized retriever object, ready for document search tasks.

Notes:

  • Multiple search strategies are supported (similarity, MMR, similarity_score_threshold).
  • MMR improves result diversity while preserving relevance by reducing redundancy among results.
  • Metadata filtering makes it possible to retrieve documents selectively based on their attributes.
  • The tags parameter can be used to label retrievers for better organization and identification.

Warnings:

  • Diversity control with MMR:
    • Tune fetch_k (the number of documents fetched initially) and lambda_mult (the diversity factor) carefully to get the best balance.
    • lambda_mult:
      • Lower values (< 0.5) → prioritize diversity.
      • Higher values (> 0.5) → prioritize relevance.
    • For effective diversity control, set fetch_k larger than k.
  • Threshold setting:
    • A high score_threshold (e.g., 0.95) may return no results.
  • Metadata filtering:
    • Make sure the metadata structure is defined before applying filters.
  • Balanced configuration:
    • For the best retrieval performance, keep the search_type and search_kwargs settings appropriately balanced.
retriever = db.as_retriever(
    search_type="similarity_score_threshold", 
    search_kwargs={
        "k": 5,  # Return the top 5 most relevant documents
        "score_threshold": 0.5  # Only return documents with a similarity score of 0.4 or higher
    }
)

query = "How does AI improve healthcare?"
results = retriever.invoke(query)

# Display search results
for doc in results:
    print(doc.page_content)
No relevant docs were retrieved using the relevance score threshold 0.5

The retriever's invoke() method

The invoke() method is the main entry point for interacting with a retriever. It searches for and retrieves documents relevant to a given query.

How it works:

  1. Query submission: the user submits a query string as input.
  2. Embedding generation: if needed, the query is converted into a vector representation.
  3. Search: the retriever searches the vector database using the configured search strategy (similarity, MMR, etc.).
  4. Result return: the method returns a set of relevant document chunks.

Parameters:

  • input (required):

    • The query string provided by the user.
    • The query is converted into a vector and compared against the stored document vectors for similarity-based retrieval.
  • config (optional):

    • Allows fine-grained control over the retrieval process.
    • Can be used to specify tags, metadata, and search strategies.
  • **kwargs (optional):

    • Allows search_kwargs to be passed directly for advanced configuration.
    • Example options include:
      • k: number of documents to return.
      • score_threshold: minimum similarity score for a document to be included.
      • fetch_k: number of documents initially fetched in MMR search.

Returns:

  • List[Document]:
    • A list of document objects containing the retrieved text and metadata.
    • Each document object includes:
      • page_content: the main content of the document.
      • metadata: metadata associated with the document (e.g., source, tags).

Use case 1

docs = retriever.invoke("What is an embedding?")

for doc in docs:
    print(doc.page_content)
    print("=========================================================")
Machine learning predicts patient outcomes based on health data.
=========================================================
AI monitors patients remotely, enabling proactive care for chronic diseases.
=========================================================
AI chatbots help with patient assessments and symptom checking.
=========================================================

Use case 2

# search options: top 5 results with a similarity score ≥ 0.7
docs = retriever.invoke(
    "What is a vector database?",
    search_kwargs={"k": 5, "score_threshold": 0.7}
)
for doc in docs:
    print(doc.page_content)
    print("=========================================================")
Machine learning predicts patient outcomes based on health data.
=========================================================
AI monitors patients remotely, enabling proactive care for chronic diseases.
=========================================================
AI chatbots help with patient assessments and symptom checking.
=========================================================

Maximal Marginal Relevance (MMR)

The Maximal Marginal Relevance (MMR) search method is a document retrieval algorithm designed to reduce redundancy by balancing relevance and diversity, returning a more varied set of results.

How MMR works:
Unlike basic similarity search, which returns the most relevant documents based solely on similarity scores, MMR considers two key factors:

  1. Relevance: measures how well a document matches the user's query.
  2. Diversity: ensures the retrieved documents differ from one another, avoiding duplicate results.

Key parameters (a sketch follows this list):

  • search_type="mmr": enables the MMR retrieval strategy.
  • k: number of documents to return after diversity filtering (default: 4).
  • fetch_k: number of documents initially fetched before diversity filtering (default: 20).
  • lambda_mult: diversity control factor (0 = maximum diversity, 1 = maximum relevance, default: 0.5).
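
A minimal sketch of MMR retrieval, reusing the db built from the healthcare texts above:

# MMR: fetch 10 candidates, then return 3 documents that balance
# relevance against mutual diversity
mmr_retriever = db.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 3, "fetch_k": 10, "lambda_mult": 0.5},
)
docs = mmr_retriever.invoke("How does AI improve healthcare?")
for doc in docs:
    print(doc.page_content)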

Similarity score threshold search

Similarity score threshold search is a retrieval method that returns documents only when their similarity score exceeds a predefined threshold. It helps filter out low-relevance results, ensuring the returned documents are highly relevant to the query.

Key features:

  • Relevance filtering: only documents with a similarity score above the specified threshold are returned.
  • Tunable precision: adjust the threshold via the score_threshold parameter.
  • Enabling the search type: set search_type="similarity_score_threshold" to use this method.

This search method is well suited to tasks that demand highly precise results, such as fact checking or answering technical queries. A sketch with a relaxed threshold follows.
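
Since the earlier demo with score_threshold=0.5 returned nothing, lowering the threshold illustrates the precision/recall trade-off. A minimal sketch reusing db (the 0.3 cutoff is an assumption chosen for illustration):

retriever = db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.3},  # relax the cutoff to admit more documents
)
docs = retriever.invoke("How does AI improve healthcare?")
for doc in docs:
    print(doc.page_content)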

Configuring top_k (adjusting the number of returned documents)

  • The parameter k specifies how many documents are returned during vector search. It determines the number of top-ranked documents (by similarity score) retrieved from the vector database.

  • You can adjust the number of retrieved documents by setting k in search_kwargs.

  • For example, setting k=1 returns only the single most relevant document, based on similarity ranking, as sketched below.
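
A minimal sketch, reusing db:

# Return only the single most similar document
retriever = db.as_retriever(search_kwargs={"k": 1})
docs = retriever.invoke("How does AI improve healthcare?")
print(len(docs))  # 1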

ContextualCompressionRetriever

The ContextualCompressionRetriever in LangChain is a powerful tool designed to optimize retrieval by compressing retrieved documents according to context. It is particularly useful in scenarios that require dynamic summarization or filtering of large amounts of data, ensuring that only the most relevant information is passed to subsequent processing steps.

Key features of ContextualCompressionRetriever:

  • Context-aware compression: documents are compressed with respect to the specific context or query, ensuring relevance and reducing redundancy.
  • Flexible integration: works seamlessly with other LangChain components and is easy to plug into existing pipelines.
  • Customizable compression: supports different compression techniques, including summarization models and embedding-based methods, to tailor retrieval to your needs.

ContextualCompressionRetriever is especially useful for:

  • Summarizing large amounts of data for question-answering systems.
  • Improving chatbot performance by providing concise, relevant answers.
  • Increasing efficiency in document-heavy tasks such as legal analysis or academic research.

With this retriever, developers can significantly reduce computational overhead and improve the quality of the information delivered to end users.

from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

# 1. Load the text file using TextLoader
loader = TextLoader("./data/appendix-keywords.txt")

# 2. Generate text chunks using CharacterTextSplitter, splitting the text into chunks of 400 characters with no overlap
text_splitter = CharacterTextSplitter(chunk_size=400, chunk_overlap=0)
texts = loader.load_and_split(text_splitter)

# 3. Generate vector store using FAISS and convert it to retriever
embedder = OpenAIEmbeddings(
	model="bge-m3",
	base_url='http://localhost:9997/v1',
	api_key='cannot be empty',
	# dimensions=1024,
)
retriever = FAISS.from_documents(texts, embedder).as_retriever(search_kwargs={"k": 10})

# 4. Query the retriever to find relevant documents
docs = retriever.invoke("What is the definition of Multimodal?")

# 5. Print the relevant documents
for i, d in enumerate(docs):
	print(f"document {i+1}:\n\n" + d.page_content)
Created a chunk of size 419, which is longer than the specified 400


document 1:

Semantic Search
document 2:

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.
Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”
Related Keywords: Natural Language Processing, Search Algorithms, Data Mining
document 3:

Definition: A token refers to a smaller unit of text obtained by splitting a larger text. It can be a word, sentence, or phrase.
Example: The sentence “I go to school” can be split into tokens: “I”, “go”, “to”, “school”.
Related Keywords: Tokenization, Natural Language Processing, Parsing

Tokenizer
document 4:

Definition: A tokenizer is a tool that splits text data into tokens. It is commonly used in natural language processing for data preprocessing.
Example: The sentence “I love programming.” can be tokenized into [“I”, “love”, “programming”, “.”].
Related Keywords: Tokenization, Natural Language Processing, Parsing

VectorStore
document 5:

Definition: A vector store is a system for storing data in vector form. It is used for tasks like retrieval, classification, and other data analysis.
Example: Word embedding vectors can be stored in a database for quick access.
Related Keywords: Embedding, Database, Vectorization

SQL
document 6:

Definition: SQL (Structured Query Language) is a programming language for managing data in databases. It supports operations like querying, modifying, inserting, and deleting data.
Example: SELECT * FROM users WHERE age > 18; retrieves information about users older than 18.
Related Keywords: Database, Query, Data Management

CSV
document 7:

Definition: CSV (Comma-Separated Values) is a file format for storing data where each value is separated by a comma. It is often used for simple data storage and exchange in tabular form.
Example: A CSV file with headers “Name, Age, Job” might contain data like “John Doe, 30, Developer”.
Related Keywords: File Format, Data Handling, Data Exchange

JSON
document 8:

Definition: JSON (JavaScript Object Notation) is a lightweight data exchange format that represents data objects in a human- and machine-readable text format.
Example: {"name": "John Doe", "age": 30, "job": "Developer"} is an example of JSON data.
Related Keywords: Data Exchange, Web Development, API

Transformer
document 9:

Definition: A transformer is a type of deep learning model used in natural language processing for tasks like translation, summarization, and text generation. It is based on the attention mechanism.
Example: Google Translate uses transformer models to perform translations between languages.
Related Keywords: Deep Learning, Natural Language Processing, Attention

HuggingFace
document 10:

Definition: HuggingFace is a library that provides pre-trained models and tools for natural language processing, making NLP tasks more accessible to researchers and developers.
Example: HuggingFace’s Transformers library can be used for tasks like sentiment analysis and text generation.
Related Keywords: Natural Language Processing, Deep Learning, Library

Digital Transformation

The DocumentCompressor created with LLMChainExtractor is what gets applied to the retriever, namely via the ContextualCompressionRetriever.

The ContextualCompressionRetriever compresses documents by removing irrelevant information and focusing on the most relevant content.

LLMChainFilter

LLMChainFilter is a simple yet powerful compressor that uses an LLM chain to decide which of the initially retrieved documents should be filtered out and which should be returned.
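
Note that the end-to-end example below actually demonstrates LLMChainExtractor, which rewrites documents by extracting relevant spans; LLMChainFilter instead keeps or drops whole documents unchanged. For reference, a minimal LLMChainFilter sketch (assuming the same llm and retriever objects defined in this section) might look like this:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainFilter

# Keep or drop whole documents based on an LLM yes/no judgment
llm_filter = LLMChainFilter.from_llm(llm)
filter_retriever = ContextualCompressionRetriever(
    base_compressor=llm_filter,
    base_retriever=retriever,
)
filtered_docs = filter_retriever.invoke("What is the definition of Multimodal?")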

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

# Before applying ContextualCompressionRetriever
docs = retriever.invoke("What is the definition of Multimodal?")
for i, d in enumerate(docs):
	print(f"document {i+1}:\n\n" + d.page_content)
print("="*62)
print("="*15 + "After applying LLMChainExtractor" + "="*15)


# After applying ContextualCompressionRetriever
# 1. Generate LLM
llm = ChatOpenAI(
	base_url='http://localhost:5551/v1',
	api_key='EMPTY',
	model_name='Qwen2.5-7B-Instruct',
	temperature=0.2,
)


# 2. Generate compressor using LLMChainExtractor
compressor = LLMChainExtractor.from_llm(llm)

# 3. Generate compression retriever using ContextualCompressionRetriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,
)

# 4. Query the compression retriever to find relevant documents
compressed_docs = (
    compression_retriever.invoke( 
        "What is the definition of Multimodal?"
    )
)

# 5. Print the relevant documents
for i, d in enumerate(compressed_docs):
	print(f"document {i+1}:\n\n" + d.page_content)
document 1:

Semantic Search
document 2:

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user’s query to return relevant results.
Example: If a user searches for “planets in the solar system,” the system might return information about related planets such as “Jupiter” or “Mars.”
Related Keywords: Natural Language Processing, Search Algorithms, Data Mining
document 3:

Definition: A token refers to a smaller unit of text obtained by splitting a larger text. It can be a word, sentence, or phrase.
Example: The sentence “I go to school” can be split into tokens: “I”, “go”, “to”, “school”.
Related Keywords: Tokenization, Natural Language Processing, Parsing

Tokenizer
document 4:

Definition: A tokenizer is a tool that splits text data into tokens. It is commonly used in natural language processing for data preprocessing.
Example: The sentence “I love programming.” can be tokenized into [“I”, “love”, “programming”, “.”].
Related Keywords: Tokenization, Natural Language Processing, Parsing

VectorStore
document 5:

Definition: A vector store is a system for storing data in vector form. It is used for tasks like retrieval, classification, and other data analysis.
Example: Word embedding vectors can be stored in a database for quick access.
Related Keywords: Embedding, Database, Vectorization

SQL
document 6:

Definition: SQL (Structured Query Language) is a programming language for managing data in databases. It supports operations like querying, modifying, inserting, and deleting data.
Example: SELECT * FROM users WHERE age > 18; retrieves information about users older than 18.
Related Keywords: Database, Query, Data Management

CSV
document 7:

Definition: CSV (Comma-Separated Values) is a file format for storing data where each value is separated by a comma. It is often used for simple data storage and exchange in tabular form.
Example: A CSV file with headers “Name, Age, Job” might contain data like “John Doe, 30, Developer”.
Related Keywords: File Format, Data Handling, Data Exchange

JSON
document 8:

Definition: JSON (JavaScript Object Notation) is a lightweight data exchange format that represents data objects in a human- and machine-readable text format.
Example: {"name": "John Doe", "age": 30, "job": "Developer"} is an example of JSON data.
Related Keywords: Data Exchange, Web Development, API

Transformer
document 9:

Definition: A transformer is a type of deep learning model used in natural language processing for tasks like translation, summarization, and text generation. It is based on the attention mechanism.
Example: Google Translate uses transformer models to perform translations between languages.
Related Keywords: Deep Learning, Natural Language Processing, Attention

HuggingFace
document 10:

Definition: HuggingFace is a library that provides pre-trained models and tools for natural language processing, making NLP tasks more accessible to researchers and developers.
Example: HuggingFace’s Transformers library can be used for tasks like sentiment analysis and text generation.
Related Keywords: Natural Language Processing, Deep Learning, Library

Digital Transformation
==============================================================
===============After applying LLMChainExtractor===============

The LLM filtered out all the irrelevant content, although my embedding model is weak and failed to retrieve the chunk that actually answers the query.
Below is a demonstration of the filtering effect: the definition is successfully kept, while the example sentence is filtered out.

text = \
"""
Multimodal
Definition: Multimodal refers to the technology that combines multiple types of data modes (e.g., text, images, sound) to process and extract richer and more accurate information or predictions.
Example: A system that analyzes both images and descriptive text to perform more accurate image classification is an example of multimodal technology.
Relate
"""
docs = [Document(page_content=text)]
query = "What is the definition of Multimodal?"
compressed_docs = compressor.compress_documents(docs, query)
print(compressed_docs[0].page_content)
Multimodal
Definition: Multimodal refers to the technology that combines multiple types of data modes (e.g., text, images, sound) to process and extract richer and more accurate information or predictions.

Source code analysis

Here is the key code of ContextualCompressionRetriever's retrieval function _get_relevant_documents:

	docs = self.base_retriever.invoke(
		query, config={"callbacks": run_manager.get_child()}, **kwargs
	)
	if docs:
		compressed_docs = self.base_compressor.compress_documents(
			docs, query, callbacks=run_manager.get_child()
		)
		return list(compressed_docs)
	else:
		return []

The base_retriever first returns its retrieval results, which are then compressed by the base_compressor.

Here is the key part of the compress_documents function of the base_compressor class LLMChainExtractor:

	compressed_docs = []
	for doc in documents:
		_input = self.get_input(query, doc) # builds {"question": query, "context": doc.page_content}
		output_ = self.llm_chain.invoke(_input, config={"callbacks": callbacks}) # call the LLM to extract relevant content
		if isinstance(self.llm_chain, LLMChain):
			output = output_[self.llm_chain.output_key]
			if self.llm_chain.prompt.output_parser is not None:
				output = self.llm_chain.prompt.output_parser.parse(output)
		else:
			output = output_
		if len(output) == 0:
			continue
		compressed_docs.append(
			Document(page_content=cast(str, output), metadata=doc.metadata)
		)
	return compressed_docs

This is the prompt template used when calling the LLM to extract content:

"""
Given the following question and context, extract any part of the context *AS IS* that is relevant to answer the question. If none of the context is relevant return {no_output_str}. 

Remember, *DO NOT* edit the extracted parts of the context.

> Question: {{question}}
> Context:
>>>
{{context}}
>>>
Extracted relevant parts:
"""

EmbeddingsFilter

Making an additional LLM call for every retrieved document is both expensive and slow.
EmbeddingsFilter offers a cheaper, faster option: it embeds the documents and the query, and returns only those documents whose embedding similarity to the query is high enough.

This preserves the relevance of the search results while saving computation and time.
The process below uses EmbeddingsFilter together with ContextualCompressionRetriever to compress and retrieve relevant documents.

  • The EmbeddingsFilter keeps documents above a specified similarity threshold (0.86).
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain_openai import OpenAIEmbeddings

# 1. Generate embeddings using OpenAIEmbeddings
embeddings = OpenAIEmbeddings(
	model="bge-m3",
	base_url='http://localhost:9997/v1',
	api_key='cannot be empty',
	# dimensions=1024,
)

# 2. Create an EmbeddingsFilter object with a similarity threshold of 0.86
embeddings_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.86)

# 3. Generate ContextualCompressionRetriever object using EmbeddingsFilter and retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter, 
    base_retriever=retriever
)

# 4. Query the compression retriever to find relevant documents
compressed_docs = compression_retriever.invoke(
    "What is the definition of Multimodal?"
)

# 5. Print the relevant documents
for i, d in enumerate(compressed_docs):
	print(f"document {i+1}:\n\n" + d.page_content)

This method simply filters the base_retriever's results through the EmbeddingsFilter similarity threshold; you can switch to a stronger embedding model to improve similarity accuracy. The filter can also be called directly, as sketched below.
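
A minimal sketch of using the compressor directly on a handful of documents, without a retriever (the sample documents here are made up for illustration):

from langchain_core.documents import Document

sample_docs = [
    Document(page_content="AI helps doctors diagnose diseases faster."),
    Document(page_content="The FIFA World Cup is held every four years."),
]
filtered = embeddings_filter.compress_documents(
    sample_docs, "How does AI help healthcare?"
)
for d in filtered:
    print(d.page_content)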

Ensemble Retriever (multi-path recall)

EnsembleRetriever combines the strengths of sparse and dense retrieval algorithms, using weights and runtime configuration to tune performance.

Key features

  1. Integrates multiple retrievers: accepts different types of retrievers as input and combines their results.
  2. Result re-ranking: re-ranks results using the Reciprocal Rank Fusion (RRF) algorithm.
  3. Hybrid retrieval: typically combines a sparse retriever (e.g., BM25) with a dense retriever (e.g., embedding similarity).

Advantages

  • Sparse retriever: effective for keyword-based retrieval.
  • Dense retriever: effective for semantic-similarity-based retrieval.

Thanks to these complementary characteristics, EnsembleRetriever can deliver better performance across a wide range of retrieval scenarios.

from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# list sample documents
doc_list = [
    "I like apples",
    "I like apple company",
    "I like apple's iphone",
    "Apple is my favorite company",
    "I like apple's ipad",
    "I like apple's macbook",
]

# Initialize the bm25 retriever and faiss retriever.
bm25_retriever = BM25Retriever.from_texts(
    doc_list,
)
bm25_retriever.k = 2  # Set the number of search results for BM25Retriever to 2.

embedding = OpenAIEmbeddings(
	model="bge-m3",
	base_url='http://localhost:9997/v1',
	api_key='cannot be empty',
	# dimensions=1024,
	)

faiss_vectorstore = FAISS.from_texts(
    doc_list,
    embedding,
)
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 2})

# Initialize the ensemble retriever.
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever],
    weights=[0.7, 0.3],
)
# Get the search results document.
query = "my favorite fruit is apple"
ensemble_result = ensemble_retriever.invoke(query)
bm25_result = bm25_retriever.invoke(query)
faiss_result = faiss_retriever.invoke(query)

# Output the fetched documents.
print("[Ensemble Retriever]")
for doc in ensemble_result:
    print(f"Content: {doc.page_content}")
    print()

print("[BM25 Retriever]")
for doc in bm25_result:
    print(f"Content: {doc.page_content}")
    print()

print("[FAISS Retriever]")
for doc in faiss_result:
    print(f"Content: {doc.page_content}")
    print()
[Ensemble Retriever]
Content: Apple is my favorite company

Content: I like apple company

Content: I like apples

[BM25 Retriever]
Content: Apple is my favorite company

Content: I like apple company

[FAISS Retriever]
Content: Apple is my favorite company

Content: I like apples

Source code analysis

The rank_fusion function of EnsembleRetriever:

retriever_docs = [
	retriever.invoke(
		query,
		patch_config(
			config, callbacks=run_manager.get_child(tag=f"retriever_{i + 1}")
		),
	)
	for i, retriever in enumerate(self.retrievers)
]

# Enforce that retrieved docs are Documents for each list in retriever_docs
for i in range(len(retriever_docs)):
	retriever_docs[i] = [
		Document(page_content=cast(str, doc)) if isinstance(doc, str) else doc
		for doc in retriever_docs[i]
	]

# apply rank fusion
fused_documents = self.weighted_reciprocal_rank(retriever_docs)

Each retriever is invoked separately, producing multiple lists of Documents, which are then passed through weighted_reciprocal_rank:

rrf_score: Dict[str, float] = defaultdict(float)
for doc_list, weight in zip(doc_lists, self.weights):
	for rank, doc in enumerate(doc_list, start=1):
		rrf_score[
			(
				doc.page_content
				if self.id_key is None
				else doc.metadata[self.id_key]
			)
		] += weight / (rank + self.c)

# Docs are deduplicated by their contents then sorted by their scores
all_docs = chain.from_iterable(doc_lists)
sorted_docs = sorted(
	unique_by_key(
		all_docs,
		lambda doc: (
			doc.page_content
			if self.id_key is None
			else doc.metadata[self.id_key]
		),
	),
	reverse=True,
	key=lambda doc: rrf_score[
		doc.page_content if self.id_key is None else doc.metadata[self.id_key]
	],
)

Documents are deduplicated by content and re-ranked by their weighted RRF scores. A standalone sketch follows.
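
As a sanity check, here is a minimal standalone sketch of weighted reciprocal rank fusion, with the constant c assumed at LangChain's default of 60. Applied to the BM25 and FAISS results above, it reproduces the ensemble ordering:

from collections import defaultdict

def weighted_rrf(doc_lists, weights, c=60):
    # Accumulate weight / (rank + c) for each document across all result lists
    score = defaultdict(float)
    for docs, weight in zip(doc_lists, weights):
        for rank, doc in enumerate(docs, start=1):
            score[doc] += weight / (rank + c)
    # Deduplicate while preserving first-seen order, then sort by fused score
    unique = list(dict.fromkeys(d for docs in doc_lists for d in docs))
    return sorted(unique, key=lambda d: score[d], reverse=True)

bm25 = ["Apple is my favorite company", "I like apple company"]
faiss = ["Apple is my favorite company", "I like apples"]
print(weighted_rrf([bm25, faiss], weights=[0.7, 0.3]))
# ['Apple is my favorite company', 'I like apple company', 'I like apples']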

Long Context Reorder

Regardless of model architecture, performance degrades significantly once more than ten retrieved documents are included.

Put simply, when a model must access relevant information in the middle of a long context, it tends to ignore the provided documents.

For more detail, see the following paper:

  • https://arxiv.org/abs/2307.03172

To avoid this problem, you can reorder the documents after retrieval, preventing the performance drop.

You can create a retriever that stores and searches text data using a Chroma vector database, then use its invoke method to search for documents highly relevant to a given query.

from langchain_core.prompts import PromptTemplate
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Get embeddings
embeddings = OpenAIEmbeddings(
	model="bge-m3",
	base_url='http://localhost:9997/v1',
	api_key='cannot be empty',
	# dimensions=1024,
	)

texts = [
    "This is just a random text I wrote.",
    "ChatGPT, an AI designed to converse with users, can answer various questions.",
    "iPhone, iPad, MacBook are representative products released by Apple.",
    "ChatGPT was developed by OpenAI and is continuously being improved.",
    "ChatGPT has learned from vast amounts of data to understand user questions and generate appropriate answers.",
    "Wearable devices like Apple Watch and AirPods are also part of Apple's popular product line.",
    "ChatGPT can be used to solve complex problems or suggest creative ideas.",
    "Bitcoin is also called digital gold and is gaining popularity as a store of value.",
    "ChatGPT's capabilities are continuously evolving through ongoing learning and updates.",
    "The FIFA World Cup is held every four years and is the biggest event in international football.",
]



# Create a retriever (Set K to 10)
retriever = Chroma.from_texts(texts, embedding=embeddings).as_retriever(
    search_kwargs={"k": 10}
)
query = "What can you tell me about ChatGPT?"

# Retrieves relevant documents sorted by relevance score.
docs = retriever.invoke(query)
docs
[Document(metadata={}, page_content='Bitcoin is also called digital gold and is gaining popularity as a store of value.'),
 Document(metadata={}, page_content='The FIFA World Cup is held every four years and is the biggest event in international football.'),
 Document(metadata={}, page_content="Wearable devices like Apple Watch and AirPods are also part of Apple's popular product line."),
 Document(metadata={}, page_content='iPhone, iPad, MacBook are representative products released by Apple.'),
 Document(metadata={}, page_content='This is just a random text I wrote.'),
 Document(metadata={}, page_content='ChatGPT, an AI designed to converse with users, can answer various questions.'),
 Document(metadata={}, page_content='ChatGPT was developed by OpenAI and is continuously being improved.'),
 Document(metadata={}, page_content='ChatGPT has learned from vast amounts of data to understand user questions and generate appropriate answers.'),
 Document(metadata={}, page_content='ChatGPT can be used to solve complex problems or suggest creative ideas.'),
 Document(metadata={}, page_content="ChatGPT's capabilities are continuously evolving through ongoing learning and updates.")]

Create an instance of the LongContextReorder class.

  • Call reordering.transform_documents(docs) to reorder the document list.
  • Less relevant documents are placed toward the middle of the list, while more relevant ones are placed at the beginning and the end.
from langchain_community.document_transformers import LongContextReorder
reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(docs)

reordered_docs
[Document(metadata={}, page_content='The FIFA World Cup is held every four years and is the biggest event in international football.'),
 Document(metadata={}, page_content='iPhone, iPad, MacBook are representative products released by Apple.'),
 Document(metadata={}, page_content='ChatGPT, an AI designed to converse with users, can answer various questions.'),
 Document(metadata={}, page_content='ChatGPT has learned from vast amounts of data to understand user questions and generate appropriate answers.'),
 Document(metadata={}, page_content="ChatGPT's capabilities are continuously evolving through ongoing learning and updates."),
 Document(metadata={}, page_content='ChatGPT can be used to solve complex problems or suggest creative ideas.'),
 Document(metadata={}, page_content='ChatGPT was developed by OpenAI and is continuously being improved.'),
 Document(metadata={}, page_content='This is just a random text I wrote.'),
 Document(metadata={}, page_content="Wearable devices like Apple Watch and AirPods are also part of Apple's popular product line."),
 Document(metadata={}, page_content='Bitcoin is also called digital gold and is gaining popularity as a store of value.')]

Source code analysis

documents.reverse()
reordered_result = []
for i, value in enumerate(documents):
	if i % 2 == 1:
		reordered_result.append(value)
	else:
		reordered_result.insert(0, value)

The input order is sorted by similarity from high to low; the transform simply redistributes that order, scattering the most similar documents to the head and tail of the list and leaving the less relevant ones in the middle.

"When models must access relevant information in the middle of a long context, they tend to ignore the provided documents."

By this account, the model pays more attention to the documents at the head and tail. The loop above can be traced with the small example below.
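
A quick trace of the reordering loop with five placeholder documents (d1 most relevant) makes the pattern concrete:

documents = ["d1", "d2", "d3", "d4", "d5"]  # sorted most -> least relevant
documents.reverse()
reordered_result = []
for i, value in enumerate(documents):
    if i % 2 == 1:
        reordered_result.append(value)
    else:
        reordered_result.insert(0, value)
print(reordered_result)  # ['d1', 'd3', 'd5', 'd4', 'd2'] -- most relevant at both ends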
