Advanced RAG 05: Exploring Chunking Methods Based on the Intrinsic Semantics of Text

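Editor's note: In a RAG (Retrieval Augmented Generation) system, efficiently splitting text into chunks that are relatively self-contained and semantically rich is a key task. Traditional rule-based chunking has known weaknesses, which makes chunking methods based on the intrinsic semantics of the text well worth exploring.

In this article, the author examines three such methods: embedding-based, model-based, and LLM-based. For each one, the author analyzes the underlying technique and how it is used in practice.

Key points of the article: (1) Embedding-based methods chunk text by computing the similarity between sentence embeddings, but the resulting chunks are relatively coarse-grained. (2) Model-based methods use a pre-trained model such as BERT for chunking, but need training tailored to the specific scenario to perform at their best. (3) LLM-based methods use an LLM to generate "propositions" as the basic unit of chunking, which gives finer-grained chunks at a higher cost. Each method has its own strengths and weaknesses, so the choice should be weighed against the actual application scenario.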

Author | Florian June

Translator | 岳扬

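Once a document has been parsed, we obtain structured or semi-structured data. The main task now is to break this data into smaller chunks (translator's note: a term for data or information that has been split into small pieces), extract features from them, and then convert those features into a form that captures their semantics. The position of this step in a RAG system is shown in Figure 1.

Figure 1: The position of the chunking process (red box) in a RAG system. Image provided by the original author.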

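Most commonly used chunking methods are rule-based, relying on techniques such as a fixed chunk size (translator's note: splitting the data or text into pieces of a fixed size) or an overlap of adjacent chunks (translator's note: letting neighboring chunks share some content so that information is not lost at the boundaries). For documents with multiple levels of structure, we can use the RecursiveCharacterTextSplitter[1] provided by Langchain, which allows the document to be split at different levels.

In practice, however, predefined rules such as the chunk size or the size of the overlapping parts are too rigid. Rule-based chunking can therefore easily lead to problems such as incomplete retrieval contexts, or oversized chunks that contain noise (translator's note: unwanted, distracting information that can interfere with or mislead analysis and processing).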
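To make the rule-based baseline concrete, here is a minimal sketch of fixed-size chunking with overlap. It is an illustration only, not Langchain's implementation: the function name fixed_size_chunks and its parameters are hypothetical, and it counts characters rather than tokens.

```python
def fixed_size_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters.

    Each window shares `overlap` characters with the previous one.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    sample = "BERT is a bidirectional model. It uses a masked language model objective."
    for chunk in fixed_size_chunks(sample, chunk_size=30, overlap=10):
        print(repr(chunk))
```

Because the boundaries depend only on character offsets, a chunk can start or end in the middle of a sentence, which is exactly the rigidity described above.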

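The most elegant approach is therefore clearly to chunk based on semantics. With semantic chunking (translator's note: splitting text into meaningful pieces according to its semantic information), the goal is for the information in each chunk to be as semantically independent as possible, which makes it easier to analyze and process.

This article explores chunking methods that split text according to its intrinsic semantics, explaining their principles and how they are used in practice. We will cover three methods:

Embedding-based (translator's note: embedding-based chunking, in which the data is mapped into a low-dimensional space to better capture its semantics)
Model-based (translator's note: model-based chunking, which uses a pre-trained model to perform semantic chunking)
LLM-based (translator's note: LLM-based chunking, which uses an LLM to capture the semantic information in the text)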

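01 Embedding-based methods

Both LlamaIndex[2] and Langchain[3] provide an embedding-based semantic chunker (translator's note: a tool or algorithm that splits text or data according to its semantic information). The ideas behind their algorithms are broadly the same, so this article uses LlamaIndex as the example.

Note that the semantic chunker is only available in recent versions of LlamaIndex. The version I had installed, 0.9.45, did not include this algorithm, so I created a new conda virtual environment and installed a newer version, 0.10.12: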

pip install llama-index-core

pip install llama-index-readers-file

pip install llama-index-embeddings-openai

It is worth mentioning that LlamaIndex 0.10.12 lets you install only the components or modules you need, so only a few key components are installed here.

(llamaindex_010) Florian:~ Florian$ pip list | grep llama
llama-index-core              0.10.12
llama-index-embeddings-openai 0.1.6
llama-index-readers-file 0.1.5
llamaindex-py-client          0.1.13

The test code is as follows:

from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import SimpleDirectoryReader


import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPEN_AI_KEY"

# load documents
dir_path = "YOUR_DIR_PATH"
documents = SimpleDirectoryReader(dir_path).load_data()


embed_model = OpenAIEmbedding()
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)

nodes = splitter.get_nodes_from_documents(documents)
for node in nodes:
    print('-' * 100)
    print(node.get_content())

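I looked into the internal logic of the splitter.get_nodes_from_documents[4] function. Its main flow is shown in Figure 2:

Figure 2: The main logic of the splitter.get_nodes_from_documents[4] function. Image provided by the original author.

The "sentences" mentioned in Figure 2 is a Python list in which each element is a dictionary with four key-value pairs. The keys have the following meanings:

sentence: the current sentence
index: the position of the current sentence
combined_sentence: a sliding window made up of the three sentences [index - self.buffer_size, index, index + self.buffer_size] (self.buffer_size = 1 by default), used to compute the semantic relatedness between sentences. Merging the current sentence with its neighbors reduces noise and captures the relationship between consecutive sentences more effectively.
combined_sentence_embedding: the embedding vector of combined_sentence

This analysis makes it clear that embedding-based semantic chunking (translator's note: splitting text or data into meaningful pieces according to its semantic information) essentially computes text similarity over a sliding window (combined_sentence). Adjacent sentences whose similarity meets the threshold are grouped into the same semantic chunk.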
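The sliding-window idea can be sketched in plain Python. This is a simplified illustration and not the LlamaIndex source: toy_embed is a hypothetical bag-of-words stand-in for a real embedding model, and the helper names (cosine, percentile, semantic_chunks) are made up for this example. It assumes, as the buffer_size and breakpoint_percentile_threshold parameters suggest, that a break is placed wherever the distance between neighboring windows exceeds a percentile of all such distances.

```python
import math
from collections import Counter
from typing import Callable


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def percentile(values: list[float], q: float) -> float:
    """Percentile with linear interpolation between neighboring ranks."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def semantic_chunks(
    sentences: list[str],
    embed: Callable[[str], Counter] = toy_embed,
    buffer_size: int = 1,
    percentile_threshold: float = 95,
) -> list[str]:
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []

    # combined_sentence: each sentence merged with its neighbors in the window.
    combined = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    vectors = [embed(c) for c in combined]

    # Distance between each window and the next one.
    distances = [
        1 - cosine(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, percentile_threshold)

    # Start a new chunk after every sentence whose distance exceeds the threshold.
    chunks, start = [], 0
    for i, distance in enumerate(distances):
        if distance > threshold:
            chunks.append(" ".join(sentences[start:i + 1]))
            start = i + 1
    chunks.append(" ".join(sentences[start:]))
    return chunks


if __name__ == "__main__":
    demo = [
        "BERT is a language model.",
        "BERT is trained with a masked language model objective.",
        "Hares were frequently seen in gardens.",
        "Hares may explain the hidden eggs in gardens.",
    ]
    for chunk in semantic_chunks(demo, buffer_size=0, percentile_threshold=50):
        print(repr(chunk))
```

With a high percentile such as 95, only the few largest jumps become breakpoints, which is consistent with the coarse chunks observed in the test run below.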

The project directory contains a single document, the BERT paper[5]. Here is the result of running the code:

(llamaindex_010) Florian:~ Florian$ python /Users/Florian/Documents/june_pdf_loader/test_semantic_chunk.py 
...
...
----------------------------------------------------------------------------------------------------
We argue that current techniques restrict the
power of the pre-trained representations, espe-
cially for the ﬁne-tuning approaches. The ma-
jor limitation is that standard language models are
unidirectional, and this limits the choice of archi-
tectures that can be used during pre-training. For
example, in OpenAI GPT, the authors use a left-to-
right architecture, where every token can only at-
tend to previous tokens in the self-attention layers
of the Transformer (Vaswani et al., 2017). Such re-
strictions are sub-optimal for sentence-level tasks,
and could be very harmful when applying ﬁne-
tuning based approaches to token-level tasks such
as question answering, where it is crucial to incor-
porate context from both directions.
In this paper, we improve the ﬁne-tuning based
approaches by proposing BERT: Bidirectional
Encoder Representations from Transformers.
BERT alleviates the previously mentioned unidi-
rectionality constraint by using a “masked lan-
guage model” (MLM) pre-training objective, in-
spired by the Cloze task (Taylor, 1953). The
masked language model randomly masks some of
the tokens from the input, and the objective is to


predict the original vocabulary id of the

maskedarXiv:1810.04805v2  [cs.CL]  24 May 2019

----------------------------------------------------------------------------------------------------
word based only on its context. Unlike left-to-
right language model pre-training, the MLM ob-
jective enables the representation to fuse the left
and the right context, which allows us to pre-
train a deep bidirectional Transformer. In addi-
tion to the masked language model, we also use
a “next sentence prediction” task that jointly pre-
trains text-pair representations. The contributions
of our paper are as follows:
• We demonstrate the importance of bidirectional
pre-training for language representations. Un-
like Radford et al. (2018), which uses unidirec-
tional language models for pre-training, BERT
uses masked language models to enable pre-
trained deep bidirectional representations. This
is also in contrast to Peters et al. 
----------------------------------------------------------------------------------------------------
...
...

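The test shows that the chunks produced by this method are relatively coarse-grained.

Figure 2 also shows that this method is page-based (translator's note: the text is chunked page by page rather than in other, smaller units such as sentences or paragraphs) and does not directly address chunks that span multiple pages.

In general, the performance of embedding-based chunking depends heavily on the embedding model. How well it works in practice still needs further evaluation.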

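02 Model-based methods

2.1 Naive BERT: applying the BERT model directly to chunking

Recall the pre-training process of BERT[5]. One of its tasks, the binary classification task "Next Sentence Prediction (NSP)", is designed to teach the model the relationship between two sentences. Two sentences are fed into BERT at the same time, and the model predicts whether the second sentence follows the first.

We can apply this principle to design a straightforward chunking method. A document is first split into sentences. Then, using a sliding window, adjacent sentences are fed into the BERT model for NSP judgement (translator's note: deciding from BERT's prediction whether the two sentences are consecutive), as shown in Figure 3:

Figure 3: Chunking with BERT. Image provided by the original author.

If the score BERT predicts for the two sentences is below a preset threshold, the semantic relationship between them is weak. That position can then serve as a text segmentation point (translator's note: a position, determined from the prediction, where the text can be split between two sentences), as shown by the dashed line between sentence 2 and sentence 3 in Figure 3.

The advantage of this method is that it can be used out of the box, with no further training or fine-tuning.

However, when deciding on a text segmentation point, this method only looks at the sentence immediately following the current one and ignores information from more distant passages. Its prediction efficiency is also relatively low.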
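The splitting loop itself is simple once a scoring function is available. The sketch below keeps the scorer pluggable: nsp_chunks and toy_score are hypothetical names, and toy_score is only a word-overlap placeholder so that the example runs without downloading a model. In a real setup the scorer would presumably wrap a BERT NSP head (for example, Hugging Face's BertForNextSentencePrediction) and return the probability that the second sentence follows the first.

```python
from typing import Callable


def toy_score(first: str, second: str) -> float:
    """Placeholder scorer: Jaccard word overlap. NOT a real NSP model."""
    a, b = set(first.lower().split()), set(second.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def nsp_chunks(
    sentences: list[str],
    nsp_score: Callable[[str, str], float],
    threshold: float = 0.5,
) -> list[list[str]]:
    """Group sentences, starting a new chunk wherever the score drops below threshold."""
    if not sentences:
        return []

    chunks = [[sentences[0]]]
    # Slide a window of two adjacent sentences over the document.
    for previous, current in zip(sentences, sentences[1:]):
        if nsp_score(previous, current) < threshold:
            chunks.append([current])    # weak relation: segmentation point
        else:
            chunks[-1].append(current)  # strong relation: same chunk
    return chunks


if __name__ == "__main__":
    demo = [
        "BERT uses a masked language model.",
        "The masked language model masks some tokens.",
        "Today is a good day.",
    ]
    for chunk in nsp_chunks(demo, toy_score, threshold=0.2):
        print(chunk)
```

Each decision looks at exactly one pair of sentences, which illustrates the limitation noted above: context further away never influences the split.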

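2.2 Chunking based on Cross Segment Attention

The paper "Text Segmentation by Cross Segment Attention"[6] proposes three text segmentation models based on cross-segment attention, as shown in Figure 4:

Figure 4: In the cross-segment BERT model (a), we feed the model the local context around a potential segmentation point: k tokens on the left and k tokens on the right. In the BERT+Bi-LSTM model (b), we first encode each sentence with BERT and then feed the sentence representations into a Bi-LSTM. In the hierarchical BERT model (c), we first encode each sentence with BERT and then feed the resulting sentence representations into another transformer-based model. Source: Text Segmentation by Cross Segment Attention.[6]

The cross-segment BERT model in Figure 4 (a) treats text segmentation as a sentence-by-sentence classification task (translator's note: deciding for each sentence whether it is a segmentation point). The local context around a potential segmentation point (k tokens on each side) is fed into the model. The hidden state corresponding to [CLS] is passed to a softmax classifier, which decides whether to split at that potential break.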
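As a rough illustration of how the input for model (a) could be assembled, the sketch below builds one token window per candidate break. It is not the paper's code: cross_segment_inputs is a hypothetical helper, whitespace splitting stands in for BERT's WordPiece tokenizer, and the trailing [SEP] follows the usual BERT two-segment convention as an assumption.

```python
def cross_segment_inputs(sentences: list[str], k: int = 8) -> list[list[str]]:
    """Build one classifier input per candidate break between sentences.

    Each input holds up to k tokens to the left of the break and up to
    k tokens to the right of it.
    """
    if k <= 0:
        raise ValueError("k must be positive")

    tokenized = [sentence.split() for sentence in sentences]
    inputs = []
    for i in range(1, len(tokenized)):
        # Flatten everything before / after the candidate break, then trim to k tokens.
        left = [tok for sent in tokenized[:i] for tok in sent][-k:]
        right = [tok for sent in tokenized[i:] for tok in sent][:k]
        inputs.append(["[CLS]", *left, "[SEP]", *right, "[SEP]"])
    return inputs


if __name__ == "__main__":
    demo = ["BERT uses masked language models.", "Today is a good day."]
    for window in cross_segment_inputs(demo, k=3):
        print(window)
```

Each window would then be encoded, and the hidden state at the [CLS] position passed to the classifier that decides whether to split there.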

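The paper also introduces two other models. Both first use a BERT model to obtain a vector representation of each sentence. The representations of several consecutive sentences are then fed into a Bi-LSTM (Figure 4 (b)) or another BERT model (Figure 4 (c)) to predict whether each sentence is a text segmentation point.

As shown in Figure 5, all three models achieved good results at the time:

Figure 5: Evaluation results on the test sets of the text segmentation and discourse segmentation tasks. Source: Text Segmentation by Cross Segment Attention.[6]

So far, however, only the training method for this paper has been made public[7]; no publicly available, ready-to-use open-source model for inference has been found.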

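2.3 Chunking based on SeqModel

The cross-segment model converts each sentence into a vector representation on its own, without considering the relationships between sentences or the wider context. The paper "Sequence Model with Self-Adaptive Sliding Window for Efficient Spoken Document Segmentation"[8] proposes a further improvement, SeqModel.

SeqModel[9] uses BERT to encode multiple sentences at once, so that each sentence's context is taken into account and the semantic dependencies between sentences are captured before the sentence is turned into a vector representation. It then predicts, for each sentence, whether that sentence is a segmentation point. The model also uses a self-adaptive sliding window to speed up inference without hurting accuracy. A diagram of SeqModel is shown in Figure 6.

Figure 6: The SeqModel architecture and the self-adaptive sliding window used for inference.

Source: Sequence Model with Self-Adaptive Sliding Window for Efficient Spoken Document Segmentation[8]

SeqModel can be used through the ModelScope[10] framework. The code is as follows: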

from modelscope.outputs import OutputKeys
from modelscope.pipelines import pipeline
from modelscope.utils.constant import Tasks

p = pipeline(
    task = Tasks.document_segmentation,
    model = 'damo/nlp_bert_document-segmentation_english-base'
)

print('-' * 100)

result = p(documents='We demonstrate the importance of bidirectional pre-training for language representations. Unlike Radford et al. (2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pretrained deep bidirectional representations. This is also in contrast to Peters et al. (2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs. • We show that pre-trained representations reduce the need for many heavily-engineered taskspecific architectures. BERT is the first finetuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures. Today is a good day')

print(result[OutputKeys.TEXT])

The sentence "Today is a good day" was appended to the end of the test data, but the result returned by the segmentation does not separate "Today is a good day" from the rest of the text in any way.

(modelscope) Florian:~ Florian$ python /Users/Florian/Documents/june_pdf_loader/test_seqmodel.py 
2024-02-24 17:09:36,288 - modelscope - INFO - PyTorch version 2.2.1 Found.
2024-02-24 17:09:36,288 - modelscope - INFO - Loading ast index from /Users/Florian/.cache/modelscope/ast_indexer
...
...
----------------------------------------------------------------------------------------------------
...
... 
We demonstrate the importance of bidirectional pre-training for language representations.Unlike Radford et al.(2018), which uses unidirectional language models for pre-training, BERT uses masked language models to enable pretrained deep bidirectional representations.This is also in contrast to Peters et al.(2018a), which uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.• We show that pre-trained representations reduce the need for many heavily-engineered taskspecific architectures.BERT is the first finetuning based representation model that achieves state-of-the-art performance on a large suite of sentence-level and token-level tasks, outperforming many task-specific architectures.Today is a good day

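Overall, model-based semantic chunking (translator's note: splitting text into semantically coherent pieces) still has considerable room for improvement.

One improvement I would suggest is to build training data tailored to a specific project or task and use it for domain fine-tuning, which can improve the model's performance. Optimizing the model architecture is another avenue.

As long as we can find a model that suits a particular business scenario well, model-based methods remain a viable option.

03 LLM-based methods

The paper "Dense X Retrieval: What Retrieval Granularity Should We Use?" introduces a new retrieval unit called the proposition. Propositions are defined as atomic expressions within the text (translator's note: single semantic elements that cannot be broken down further and that can be combined into larger semantic units). Each one captures a distinct fact or a specific concept and presents it concisely, in natural language, as a self-contained statement that needs no additional information to interpret.

So how do we obtain these propositions? In the paper, they are obtained by constructing a prompt and interacting with an LLM.

Both LlamaIndex and Langchain have implemented the relevant algorithm. The demonstration below uses LlamaIndex.

LlamaIndex's implementation uses the prompt provided in the paper to generate propositions: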

PROPOSITIONS_PROMPT = PromptTemplate(
    """Decompose the "Content" into clear and simple propositions, ensuring they are interpretable out of
context.
1. Split compound sentence into simple sentences. Maintain the original phrasing from the input
whenever possible.
2. For any named entity that is accompanied by additional descriptive information, separate this
information into its own distinct proposition.
3. Decontextualize the proposition by adding necessary modifier to nouns or entire sentences
and replacing pronouns (e.g., "it", "he", "she", "they", "this", "that") with the full name of the
entities they refer to.
4. Present the results as a list of strings, formatted in JSON.

Input: Title: ¯Eostre. Section: Theories and interpretations, Connection to Easter Hares. Content:
The earliest evidence for the Easter Hare (Osterhase) was recorded in south-west Germany in
1678 by the professor of medicine Georg Franck von Franckenau, but it remained unknown in
other parts of Germany until the 18th century. Scholar Richard Sermon writes that "hares were
frequently seen in gardens in spring, and thus may have served as a convenient explanation for the
origin of the colored eggs hidden there for children. Alternatively, there is a European tradition
that hares laid eggs, since a hare’s scratch or form and a lapwing’s nest look very similar, and
both occur on grassland and are first seen in the spring. In the nineteenth century the influence
of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular throughout Europe.
German immigrants then exported the custom to Britain and America where it evolved into the
Easter Bunny."
Output: [ "The earliest evidence for the Easter Hare was recorded in south-west Germany in
1678 by Georg Franck von Franckenau.", "Georg Franck von Franckenau was a professor of
medicine.", "The evidence for the Easter Hare remained unknown in other parts of Germany until
the 18th century.", "Richard Sermon was a scholar.", "Richard Sermon writes a hypothesis about
the possible explanation for the connection between hares and the tradition during Easter", "Hares
were frequently seen in gardens in spring.", "Hares may have served as a convenient explanation
for the origin of the colored eggs hidden in gardens for children.", "There is a European tradition
that hares laid eggs.", "A hare’s scratch or form and a lapwing’s nest look very similar.", "Both
hares and lapwing’s nests occur on grassland and are first seen in the spring.", "In the nineteenth
century the influence of Easter cards, toys, and books was to make the Easter Hare/Rabbit popular
throughout Europe.", "German immigrants exported the custom of the Easter Hare/Rabbit to
Britain and America.", "The custom of the Easter Hare/Rabbit evolved into the Easter Bunny in
Britain and America." ]

Input: {node_text}
Output:"""
)

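In the earlier section "01 Embedding-based methods", we already installed the key components of LlamaIndex 0.10.12. To use DenseXRetrievalPack, however, we also need to run pip install llama-index-llms-openai. After installation, the LlamaIndex-related components are as follows: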

(llamaindex_010) Florian:~ Florian$ pip list | grep llama
llama-index-core                    0.10.12
llama-index-embeddings-openai       0.1.6
llama-index-llms-openai             0.1.6
llama-index-readers-file            0.1.5
llamaindex-py-client                0.1.13

In LlamaIndex, DenseXRetrievalPack is a package that has to be downloaded separately. Here it is downloaded directly in the test code, which is as follows:

from llama_index.core.readers import SimpleDirectoryReader
from llama_index.core.llama_pack import download_llama_pack

import os
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"

# Download and install dependencies
DenseXRetrievalPack = download_llama_pack(
    "DenseXRetrievalPack", "./dense_pack"
)

# If you have already downloaded DenseXRetrievalPack, you can import it directly.
# from llama_index.packs.dense_x_retrieval import DenseXRetrievalPack

# Load documents
dir_path = "YOUR_DIR_PATH"
documents = SimpleDirectoryReader(dir_path).load_data()

# Use LLM to extract propositions from every document/node
dense_pack = DenseXRetrievalPack(documents)

response = dense_pack.run("YOUR_QUERY")

This test code mainly relies on the constructor of the DenseXRetrievalPack class, so it is worth analyzing the source code of that class[11].

class DenseXRetrievalPack(BaseLlamaPack):
    def __init__(
        self,
        documents: List[Document],
        proposition_llm: Optional[LLM] = None,
        query_llm: Optional[LLM] = None,
        embed_model: Optional[BaseEmbedding] = None,
        text_splitter: TextSplitter = SentenceSplitter(),
        similarity_top_k: int = 4,
    ) -> None:
        """Init params."""
        self._proposition_llm = proposition_llm or OpenAI(
            model="gpt-3.5-turbo",
            temperature=0.1,
            max_tokens=750,
        )

        embed_model = embed_model or OpenAIEmbedding(embed_batch_size=128)

        nodes = text_splitter.get_nodes_from_documents(documents)
        sub_nodes = self._gen_propositions(nodes)

        all_nodes = nodes + sub_nodes
        all_nodes_dict = {n.node_id: n for n in all_nodes}

        service_context = ServiceContext.from_defaults(
            llm=query_llm or OpenAI(),
            embed_model=embed_model,
            num_output=self._proposition_llm.metadata.num_output,
        )

        self.vector_index = VectorStoreIndex(
            all_nodes, service_context=service_context, show_progress=True
        )

        self.retriever = RecursiveRetriever(
            "vector",
            retriever_dict={
                "vector": self.vector_index.as_retriever(
                    similarity_top_k=similarity_top_k
                )
            },
            node_dict=all_nodes_dict,
        )

        self.query_engine = RetrieverQueryEngine.from_args(
            self.retriever, service_context=service_context
        )

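As the code shows, the constructor first uses text_splitter to split the documents into nodes (translator's note: the smallest units obtained by splitting the documents in their original form), then calls self._gen_propositions to generate propositions and obtain the corresponding sub_nodes (translator's note: the pieces derived from the nodes, one per generated proposition). It then builds a VectorStoreIndex from nodes + sub_nodes and retrieves through a RecursiveRetriever. The recursive retriever can search over the small chunks to find the information it needs, much like going straight to a particular section or paragraph of a book. If the small chunks do not contain the complete information, it passes the associated larger chunks on to the generation stage, just as we would turn to the surrounding chapter or the whole book when a single section or paragraph is not enough.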
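The small-to-big lookup can be sketched without any LlamaIndex dependency. The example below is a conceptual illustration, not the RecursiveRetriever implementation: Chunk, small_to_big, and overlap_score are hypothetical names, and word overlap stands in for vector similarity.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Chunk:
    node_id: str
    text: str
    parent_id: Optional[str] = None  # set for propositions, None for original nodes


def overlap_score(query: str, text: str) -> float:
    """Placeholder for vector similarity: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(text.lower().split())))


def small_to_big(
    query: str,
    chunks: list[Chunk],
    score: Callable[[str, str], float] = overlap_score,
    top_k: int = 2,
) -> list[Chunk]:
    """Rank all chunks, then replace each matching proposition by its parent node."""
    by_id = {chunk.node_id: chunk for chunk in chunks}
    ranked = sorted(chunks, key=lambda c: score(query, c.text), reverse=True)[:top_k]

    results, seen = [], set()
    for hit in ranked:
        # A proposition points back to the larger node it was generated from.
        target = by_id[hit.parent_id] if hit.parent_id else hit
        if target.node_id not in seen:
            seen.add(target.node_id)
            results.append(target)
    return results


if __name__ == "__main__":
    index = [
        Chunk("n1", "We demonstrate the importance of bidirectional pre-training ..."),
        Chunk("p1", "BERT uses masked language models.", parent_id="n1"),
        Chunk("n2", "Today is a good day."),
    ]
    for chunk in small_to_big("masked language models", index, top_k=1):
        print(chunk.node_id, "->", chunk.text)
```

The query is matched against the precise proposition, while the text handed to the generation stage is the larger original node.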

The project directory again contains only the BERT paper. Debugging shows that the content of sub_nodes[].text is not the original text; it has been rewritten:

> /Users/Florian/anaconda3/envs/llamaindex_010/lib/python3.11/site-packages/llama_index/packs/dense_x_retrieval/base.py(91)__init__()
     90 
---> 91         all_nodes = nodes + sub_nodes
     92         all_nodes_dict = {n.node_id: n for n in all_nodes}


ipdb> sub_nodes[20]
IndexNode(id_='ecf310c7-76c8-487a-99f3-f78b273e00d9', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Our paper demonstrates the importance of bidirectional pre-training for language representations.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
ipdb> sub_nodes[21]
IndexNode(id_='4911332e-8e30-47d8-a5bc-ed7cbaa8e042', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Radford et al. (2018) uses unidirectional language models for pre-training.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
ipdb> sub_nodes[22]
IndexNode(id_='83aa82f8-384a-4b06-92c8-d6277c4162bf', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='BERT uses masked language models to enable pre-trained deep bidirectional representations.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
ipdb> sub_nodes[23]
IndexNode(id_='2ac635c2-ccb0-4e62-88c7-bcbaef3ef38a', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Peters et al. (2018a) uses a shallow concatenation of independently trained left-to-right and right-to-left LMs.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)
ipdb> sub_nodes[24]
IndexNode(id_='e37b17cf-30dd-4114-a3c5-9921b8cf0a77', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='Pre-trained representations reduce the need for many heavily-engineered task-specific architectures.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n', index_id='8deca706-fe97-412c-a13f-950a19a594d1', obj=None)

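The relationship between sub_nodes and nodes is shown in Figure 7. The index is organized in a small-to-big structure.

Figure 7: An index organized in a small-to-big structure. Image provided by the author.

This index structure is built by self._gen_propositions[12]. The code is as follows: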

    async def _aget_proposition(self, node: TextNode) -> List[TextNode]:
        """Get proposition."""
        inital_output = await self._proposition_llm.apredict(
            PROPOSITIONS_PROMPT, node_text=node.text
        )
        outputs = inital_output.split("\n")

        all_propositions = []

        for output in outputs:
            if not output.strip():
                continue
            if not output.strip().endswith("]"):
                if not output.strip().endswith('"') and not output.strip().endswith(
                    ","
                ):
                    output = output + '"'
                output = output + " ]"
            if not output.strip().startswith("["):
                if not output.strip().startswith('"'):
                    output = '"' + output
                output = "[ " + output

            try:
                propositions = json.loads(output)
            except Exception:
                # fallback to yaml
                try:
                    propositions = yaml.safe_load(output)
                except Exception:
                    # fallback to next output
                    continue

            if not isinstance(propositions, list):
                continue

            all_propositions.extend(propositions)

        assert isinstance(all_propositions, list)
        nodes = [TextNode(text=prop) for prop in all_propositions if prop]

        return [IndexNode.from_text_node(n, node.node_id) for n in nodes]

    def _gen_propositions(self, nodes: List[TextNode]) -> List[TextNode]:
        """Get propositions."""
        sub_nodes = asyncio.run(
            run_jobs(
                [self._aget_proposition(node) for node in nodes],
                show_progress=True,
                workers=8,
            )
        )

        # Flatten list
        return [node for sub_node in sub_nodes for node in sub_node]

For each original node, self._aget_proposition is called asynchronously. It sends PROPOSITIONS_PROMPT to the LLM to obtain inital_output, extracts the propositions from inital_output, and builds a TextNode for each of them. Finally, these TextNodes are linked back to the original node with [IndexNode.from_text_node(n, node.node_id) for n in nodes].
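The line-by-line repair of the LLM output is the least obvious part of this code, so here is a standalone sketch of it. It mirrors the bracket and quote patching shown above in simplified form: parse_propositions is a hypothetical name, and the YAML fallback is deliberately left out, so a line that still fails to parse as JSON is simply skipped.

```python
import json


def parse_propositions(raw_output: str) -> list[str]:
    """Parse an LLM response that should contain JSON lists of strings.

    Lines cut off before the closing quote or bracket are patched so
    that they can still be parsed.
    """
    propositions: list[str] = []
    for line in raw_output.split("\n"):
        line = line.strip()
        if not line:
            continue
        # Close an unterminated string and/or list.
        if not line.endswith("]"):
            if not line.endswith('"') and not line.endswith(","):
                line += '"'
            line += " ]"
        # Open a list (and a string) if the line starts mid-way.
        if not line.startswith("["):
            if not line.startswith('"'):
                line = '"' + line
            line = "[ " + line
        try:
            parsed = json.loads(line)
        except json.JSONDecodeError:
            continue  # simplified: the original falls back to YAML here
        if isinstance(parsed, list):
            propositions.extend(str(item) for item in parsed if item)
    return propositions


if __name__ == "__main__":
    complete = '["BERT uses masked language models.", "Richard Sermon was a scholar."]'
    truncated = '["BERT uses masked language models.", "Richard Sermon was'
    print(parse_propositions(complete))
    print(parse_propositions(truncated))
```

This matters in practice because max_tokens=750 in the constructor can cut a response off mid-list, and the patching salvages whatever was generated up to that point.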

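One more thing worth mentioning: the original paper used the LLM-generated propositions as training data to fine-tune a text generation model. That model is now publicly accessible[13], and interested readers can try it out.

Overall, chunking by having an LLM construct propositions yields finer-grained chunks. Together with the original nodes, they form a small-to-big index structure, which offers a new way of thinking about semantic chunking (translator's note: splitting text into semantically coherent pieces).

However, this method depends on an LLM, so its cost is relatively high.

If resources allow, this LLM-based chunking approach is worth following and monitoring over time.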

04 Conclusion

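This article explored the principles and implementations of three semantic chunking methods and offered some commentary on each. Semantic chunking is a more elegant approach and one of the key points in optimizing RAG.

If you are interested in RAG, feel free to read the other articles in this series! If you have any questions, please raise them in the comments.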

Thanks for reading!

——

Florian June

An artificial intelligence researcher who mainly writes articles about Large Language Models, data structures and algorithms, and NLP.

END

References

[1]https://github.com/langchain-ai/langchain/blob/v0.1.9/libs/langchain/langchain/text_splitter.py#L851C1-L851C6

[2]https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_chunking.html

[3]https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker

[4]https://github.com/run-llama/llama_index/blob/v0.10.12/llama-index-core/llama_index/core/node_parser/text/semantic_splitter.py

[5]https://arxiv.org/pdf/1810.04805.pdf

[6]https://arxiv.org/abs/2004.14535

[7]https://github.com/aakash222/text-segmentation-NLP/

[8]https://arxiv.org/pdf/2107.09278.pdf

[9]https://github.com/alibaba-damo-academy/SpokenNLP

[10]https://github.com/modelscope/modelscope/

[11]https://github.com/run-llama/llama_index/blob/v0.10.12/llama-index-packs/llama-index-packs-dense-x-retrieval/llama_index/packs/dense_x_retrieval/base.py

[12]https://github.com/run-llama/llama_index/blob/v0.10.12/llama-index-packs/llama-index-packs-dense-x-retrieval/llama_index/packs/dense_x_retrieval/base.py#L161

[13]https://github.com/chentong0/factoid-wiki

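This article was translated by Baihai IDP with the original author's permission. To republish the translation, please contact us for authorization.

Original article: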

https://medium.com/towards-artificial-intelligence/advanced-rag-05-exploring-semantic-chunking-97c12af20a4d

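Other articles in this series:

Advanced RAG 01: Problems and Challenges of Unoptimized RAG Systems

Advanced RAG 02: Demystifying PDF Document Parsing

Advanced RAG 03: Evaluating RAG Applications with RAGAs and LlamaIndex

Advanced RAG 04: A Discussion of Re-ranking Techniques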