速度提高100倍 - 扩展 RAG 应用程序，以实现数十亿个嵌入，并行计算余弦相似度

原文链接：100x Faster — Scaling Your RAG App for Billions of Embeddings

2024 年 2 月 15 日

RAG应用程序最大的问题之一是它们的计算检索时间。想象一下，你有一个向量数据库，包含一万亿条Embedding向量的记录。当您尝试将用户查询与一万亿向量匹配时，检索正确的信息肯定要花费一分钟以上的时间。

我们能否在CPU内核上使用并行处理将检索正确信息的时间缩短到几秒钟?

减少时间包括找到有效的方法来计算用户查询Embedding向量与存储在向量数据库中的百万，十亿，甚至万亿个其他Embedding向量之间的余弦相似度。

Chunkdot，在MIT许可下，是专门为此目的而设计的，为密集矩阵和稀疏矩阵提供多线程矩阵乘法。它适用于对项目矩阵表示进行分段(Embeddings)，并使用Numba加速计算，从而计算出大量项目中K个最相似的项目。

HuggingFace上有很多数据集，提供了超过100万个条目的Embedding向量，比如这个来自Qdrant的dataset。你可以用它来测试Chunkdot的性能。然而，对于详细的性能测量，我们将使用NumPy库来生成各种维度的随机Embedding向量。

我们将比较两种方法，一种来自Chunkdot，另一种是余弦相似度的伪代码。我们将观察增加大小和维度对性能的影响。我将使用Kaggle(无GPU)笔记本来完成这项任务，以确保一致性。

这个博客的所有代码都可以在我的GitHub存储库中找到。

目录表

舞台设置
编码伪码算法
编码块点算法
编码计算时间函数
测试10k向量Embeddings
测试100k向量Embeddings
测试100万个向量Embeddings
可视化可扩展性影响
Chunkdot功能
下一步是什么

搭建舞台

Chunkdot需要与其他库类似的安装过程。

1 2	# installing chunkdot pip install chunkdot

在运行任何东西之前，我们必须首先检查Kaggle环境中的可用内存。

1 2	# Checking available memory !free -h

可用内存在Kaggle笔记本

检查可用内存对Chunkdot至关重要。随着向量数据库大小的增加，计算内存也会增加。为了防止超出可用内存，监控硬件中的剩余内存非常重要。在我的情况下，可用空间是25GB，不包括Buff/Cache。

让我们导入必要的库。

# to matrix generate matrices
import numpy as np

# importing cosine similarity module from chunkdot
from chunkdot import cosine_similarity_top_k

# to calculate computation time
import timeit

伪代码算法

我们将首先构建一个伪代码算法，计算用户查询向量与其他数百万个可能存储在数据库或本地的向量之间的余弦相似度。

def cosine_pseudocode(query_v, doc_v, num_indices):
    """
    Retrieve indices of the highest cosine similarity values between
    the query vector and embeddings.
    
    Parameters:
        query_v (numpy.ndarray): Query vector.
        doc_v (list of numpy.ndarray): List of embedding vectors.
        num_indices (int): Number of Top indices to retrieve.
        
    Returns:
        list of int: Indices of the highest cosine similarity values.
    """
    cosine_similarities = []  # Initialize an empty list to store cosine similarities

    query_norm = np.linalg.norm(query_v)  # Calculate the norm of the query vector
    
    # Iterate over each documents embedding vectors in the list
    for vec in doc_v:
        dot_product = np.dot(vec, query_v.T)  # Calculate dot product between embedding vector and query vector
        embedding_norm = np.linalg.norm(vec)  # Calculate the norm of the embedding vector
        cosine_similarity = dot_product / (embedding_norm * query_norm)  # Calculate cosine similarity
        cosine_similarities.append(cosine_similarity)  # Append cosine similarity to the list
    
    cosine_similarities = np.array(cosine_similarities)  # Convert the list to a numpy array
    
    # Sort the array in descending order
    sorted_array = sorted(range(len(cosine_similarities)), key=lambda i: cosine_similarities[i], reverse=True)

    # Get indices of the top two values
    top_indices = sorted_array[:num_indices]
    
    # Return the indices of highest cosine similarity values
    return top_indices

这个余弦相似度函数，独立于除NumPy以外的任何库，接受三个输入:

query_v用户查询的Embedding向量
doc_v存储在某处的文档的Embedding向量
num_indices类似top_k结果的文档索引号

Chunkdot算法

现在我们已经编写了伪代码算法，下一步是编写Chunkdot余弦相似度函数。

def cosine_chunkdot(query_v, doc_v, num_indices, max_memory):
    """
    Calculate cosine similarity using the chunkdot library.
    
    Parameters:
        query_v (numpy.ndarray): Query vector.
        doc_v (numpy.ndarray): List of Embedding vectors.
        num_indices (int): Number of top indices to retrieve.
        max_memory (float): Maximum memory to use.
        
    Returns:
        numpy.ndarray: Top k indices.
    """
    
    # Calculate Cosine Similarity
    cosine_array = cosine_similarity_top_k(embeddings=query_v, embeddings_right=doc_v, 
                                          top_k=num_indices, max_memory=max_memory)  # Calculate cosine similarity using chunkdot

    # Get indices of the top values
    top_indices = cosine_array.nonzero()[1]
    
    # return the top similar results
    return top_indices

这个Chunkdot函数有四个输入:

query_v用户查询的Embedding向量
doc_v存储在某处的文档的Embedding向量
num_indices类似top_k结果的文档索引号
max_memory表示计算的可用内存，其值以字节为单位。例如，1E9表示1GB, 10E9表示10GB，以此类推。

让我们在一个样本数据集上测试这两个函数，观察它们的输出。

doc_embeddings = np.random.randn(10, 100) # 10 document embeddings (100 dim)

user_query = np.random.rand(1,100) # 1 user query (100 dim)

top_indices = 1 # number of top indices to retrieve

max_memory = 5E9 # maximum memory to use (5GB)

# retrieve indices of the highest cosine similarity values using pseudocode
print("top indices using pseudocode:", cosine_pseudocode(user_query, doc_embeddings, top_indices))

# retrieve indices of the highest cosine similarity values using chunkdot
print("top indices using chunkdot:", cosine_chunkdot(user_query, doc_embeddings, top_indices, max_memory))
### OUTPUT ###
top indices using pseudocode: [4]
top indices using chunkdot: [4]
### OUTPUT ###

我为文档Embeddings生成了10个随机Embedding向量，每个向量的维度为100，还有一个用户查询，它是具有相同维度的单个Embedding向量。’ top_indices ‘参数设置为1，这意味着它将根据最高余弦相似度返回文档Embeddings中仅一个相似项的索引。内存使用率设置为5E9，等于5GB。我们的两个函数都返回相同的索引4，这表明我们对两个函数都进行了准确的编码。

编码计算时间函数

我们还需要创建一个计时函数，它可以测量这两个函数输出结果所花费的计算时间。

# calculate time taken
def calculate_execution_time(query_v, doc_v, num_indices, max_memory, times):
    
    # calculate time taken to execute the pseudocode function
    pseudocode_time = round(timeit.timeit(lambda: cosine_pseudocode(query_v, doc_v, num_indices), number=times), 5)

    # calculate time taken to execute the chunkdot function
    chunkdot_time = round(timeit.timeit(lambda: cosine_chunkdot(query_v, doc_v, num_indices, max_memory), number=times), 5)

    # print the time taken
    print("Time taken for pseudocode function:", pseudocode_time, "seconds")
    print("Time taken for chunkdot function:", chunkdot_time, "seconds")

我们已经回顾了传递给这个函数的参数。这里唯一的新参数是times，它告诉函数你想运行代码多少次。让我们在更大的规模上测试Chunkdot性能的效率。

测试10k向量Embeddings

我们将从合理数量的文档Embeddings开始，10000个，这相当于一个小规模的特定于领域的RAG应用程序。我将每个Embedding向量的维度设置为1536，这相当于OpenAIEmbedding模型text-embedding-3-small。

让我们通过运行100次来计算每种方法的计算时间。

doc_embeddings = np.random.randn(10000, 1536) # 10K document embeddings (1536 dim)

user_query = np.random.rand(1,1536) # user query (1536 dim)

top_indices = 1 # number of top indices to retrieve 

max_memory = 5E9 # maximum memory set to 5GB

# compute the time taken to execute the functions
calculate_execution_time(user_query, doc_embeddings, top_indices, max_memory, 100)

对于10k个文档Embeddings，维度为1536，两种算法运行100次，对比如下:

10k文档计算时间

与我们的伪代码相比，Chunkdot需要更多的时间。这是因为它首先创建块，并在合并它们之前对每个块执行计算。因此，对于这个小规模的例子，它可能不是一个合适的解决方案。但是，当我们稍后使用更大的示例时，您将看到Chunkdot的好处。

测试100k向量Embeddings

对于10K，我们的伪代码方法获胜，但是现在让我们将文档Embedding向量增加到100K向量，这与中等规模的RAG应用程序相当。

让我们计算每种方法的计算时间，但这次我们将“times”参数设置为1(只运行一次代码)，因为向量的数量相当大，并且不需要多次执行计算。

doc_embeddings = np.random.randn(100000, 1536) # 100K document embeddings (1536 dim)

user_query = np.random.rand(1,1536) # user query (1536 dim)

top_indices = 1 # number of top indices to retrieve 

max_memory = 5E9 # maximum memory set to 5GB

times = 1 # number of times to execute the functions

# compute the time taken to execute the functions
calculate_execution_time(user_query, doc_embeddings, top_indices, max_memory, times)

对于100k的文档Embeddings，维度为1536，运行两种算法一次，下面是比较:

100k文档计算时间

与我们的伪代码相比，Chunkdot花费的时间更少，几乎是一半。现在我们看到了Chunkdot带来的积极影响。

一百万向量Embeddings的测试

处理涉及数百万个Embeddings的任务时，您需要检查的第一件事是文档Embedding向量占用了多少内存。

# 1 Million document embeddings (1536 dim)
doc_embeddings = np.random.randn(1000000, 1536)

# user query (1536 dim)
user_query = np.random.rand(1,1536)

# Check the memory size of doc_embeddings and user_query embedding
print(doc_embeddings.nbytes / (1024 * 1024 * 1024),
      user_query.nbytes / (1024 * 1024))

100万个Embedding向量的内存大小

我们的文档Embeddings大约占用12GB。让我们检查一下剩余的可用空间。

检查可用空间

我们有高达17GB的可用内存。为了避免任何内存错误，我们将为max_memory参数设置一个安全值，即12GB。让我们看看结果。

# 1 Million document embeddings (1536 dim)
doc_embeddings = np.random.randn(1000000, 1536)

# user query (1536 dim)
user_query = np.random.rand(1,1536)

top_indices = 1 # number of top indices to retrieve 

max_memory = 12E9 # maximum memory set to  --- 12GB ---

times = 1 # number of times to execute the functions

# compute the time taken to execute the functions
calculate_execution_time(user_query, doc_embeddings, top_indices, max_memory, times)

100万个文档计算时间

ChunkDot确实有效地减少了计算量。当你打算构建一个严肃的RAG应用程序时，你应该考虑从至少一百万个查询开始。工作与更高维度的Embedding模型，高达4000。这种方法将变得更加有效。

可视化可伸缩性影响

让我们可视化增加文档Embedding向量数量的影响，从10,000开始到一个非常大的数字。

不同数量文档的计算时间

我绘制了三种方法，在增加文档Embeddings数量的基础上，Chunkdot是所有方法中最优越的。现在，让我们看看Embedding向量的维数是如何影响计算时间的。

不同维度的计算时间

我在增加向量维度的同时使用了100K个文档，在增加文档数量时观察到的行为与我们看到的相同。

Chunkdot的特点

Chunkdot有一个可以显示进度条的功能，它可以帮助您跟踪剩余的计算量。

doc_embeddings = np.random.randn(100000, 1536) # 100K document embeddings (1536 dim)

user_query = np.random.rand(1,1536) # user query (1536 dim)

top_indices = 100 # number of top indices to retrieve 

max_memory = 5E9 # maximum memory set to 5GB

# with progress bar
output_array = cosine_similarity_top_k(user_query, doc_embeddings, 
                        top_k=top_indices, 
                        show_progress=True)

进度条示例

Chunkdot的输出是一个稀疏矩阵，您可以使用以下命令将其转换为数组:

1 2	# converting the ouput output_array.toarray()

您可以仅对文档Embeddings使用Chunkdot，它将为文档Embeddings的每个元素返回top_k个最相似的元素。

# total 5 documents embeddings
embeddings = np.random.randn(5, 256)

# return top 2 most similar item index for each
cosine_similarity_top_k(embeddings, top_k=2).toarray()
### OUTPUT ###
array([[1.        , 0.        , 0.        , 0.        , 0.09924064],
       [0.        , 1.        , 0.        , 0.09935381, 0.        ],
       [0.02358785, 0.        , 1.        , 0.        , 0.        ],
       [0.        , 0.09935381, 0.        , 1.        , 0.        ],
       [0.09924064, 0.        , 0.        , 0.        , 1.        ]])
### OUTPUT ###

类似地，您可以通过向top_k参数提供负值来返回最不相似的项

# total 5 documents embeddings
embeddings = np.random.randn(5, 256)

# return top 2 most dissimilar item index for each 
# Top_K = -2
cosine_similarity_top_k(embeddings, top_k=-2).toarray()
### OUTPUT ###
array([[ 0.        ,  0.        , -0.04357524,  0.        , -0.05118288],
       [ 0.        ,  0.        ,  0.        ,  0.01619543, -0.01836534],
       [-0.04357524,  0.        ,  0.        , -0.02466613,  0.        ],
       [ 0.        ,  0.01619543, -0.02466613,  0.        ,  0.        ],
       [-0.05118288, -0.01836534,  0.        ,  0.        ,  0.        ]])
### OUTPUT ###

这可能不是你的情况，但如果你处理的是10K维的稀疏Embeddings，你可以使用“密度”参数来更有效地减少计算。

# for creating sparse embeddings
from scipy import sparse

# creating spare matrix with 100K documents (10K dim each)
# defining density of 0.005
embeddings = sparse.rand(100000, 10000, density=0.005)

# using all you system's memory
cosine_similarity_top_k(embeddings, top_k=50)

接下来

如果您想了解Chunkdot算法是如何工作的，请查看作者的这个有意思的博客。Chunkdot最大的好处之一是它可以在CPU内核上工作。未来，他们计划集成GPU支持，这将大大减少计算时间。如果你的本地环境没有足够的RAM，你可以使用像Kaggle或GitHub Codespaces这样的平台，与GPU成本相比，云CPU内核和RAM的成本非常低。不要忘记查看官方GitHub存储库和他们的博客，因为它非常好地解释了Chunkdot是如何工作的。