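Ask questions about your GitHub repository using Elasticsearch as a vector database

By: Fram Souza, Elastic

This blog introduces a GitHub Assistant that uses RAG and Elasticsearch to answer semantic queries about code. It provides insights into GitHub repositories and can be extended to PR feedback, issue handling, and production readiness reviews.

The project lets you interact directly with a GitHub repository and use semantic search to understand the codebase. You will learn how to ask specific questions about the repository's code and receive meaningful, context-aware responses. The code is available on GitHub.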

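Key considerations:

Data quality: the output is only as good as the input, so make sure your data is clean and well structured.
Chunk size: how the data is split into chunks has a direct effect on retrieval quality and performance.
Performance evaluation: evaluate the performance of your RAG-based application regularly.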
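
To make the chunk-size consideration concrete, here is a minimal sketch of fixed-size chunking with overlap. It is only an illustration: `chunk_text` and its parameters are hypothetical names, and the project itself relies on LlamaIndex splitters rather than on this function.

```python
def chunk_text(text, chunk_size, overlap):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that context
    spanning a chunk boundary is not lost entirely.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=1):
        print(chunk)
```

Smaller chunks give more precise matches but carry less surrounding context per chunk; larger chunks do the opposite. The right trade-off depends on the repository and is worth checking with the evaluation step described later.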

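Components

Elasticsearch: serves as the vector database, storing and retrieving embeddings efficiently.
LlamaIndex: a framework for building LLM-powered applications.
OpenAI: used for the LLM and for generating embeddings.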

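Architecture

Ingestion

The process starts by cloning a GitHub repository locally into the /tmp directory. SimpleDirectoryReader then loads the cloned repository for indexing, and the documents are split into chunks according to file type: CodeSplitter for code files, JSONNodeParser for JSON files, and SentenceSplitter for Markdown files. See: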

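import glob
import os

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import CodeSplitter, JSONNodeParser, SentenceSplitter
from tree_sitter_languages import get_parser  # assumed source of get_parser

# clone_repository, print_docs_and_nodes and collect_and_print_file_summary
# are helper functions defined elsewhere in the project.

def parse_documents():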
    owner = os.getenv('GITHUB_OWNER')
    repo = os.getenv('GITHUB_REPO')
    branch = os.getenv('GITHUB_BRANCH')
    base_path = os.getenv('BASE_PATH', "/tmp")  

    if not owner or not repo:
        raise ValueError("GITHUB_OWNER and GITHUB_REPO environment variables must be set.")

    local_repo_path = clone_repository(owner, repo, branch, base_path)

    nodes = []
    file_summary = []

    ts_parser = get_parser('typescript')
    py_parser = get_parser('python')
    go_parser = get_parser('go')
    js_parser = get_parser('javascript')
    bash_parser = get_parser('bash')
    yaml_parser = get_parser('yaml')

    parsers_and_extensions = [
        (SentenceSplitter(), [".md"]),
        (CodeSplitter(language='python', parser=py_parser), [".py", ".ipynb"]),
        (CodeSplitter(language='typescript', parser=ts_parser), [".ts"]),
        (CodeSplitter(language='go', parser=go_parser), [".go"]),
        (CodeSplitter(language='javascript', parser=js_parser), [".js"]),
        (CodeSplitter(language='bash', parser=bash_parser), [".bash", ".sh"]),
        (CodeSplitter(language='yaml', parser=yaml_parser), [".yaml", ".yml"]),
        (JSONNodeParser(), [".json"]),
    ]

    for parser, extensions in parsers_and_extensions:
        matching_files = []
        for ext in extensions:
            matching_files.extend(glob.glob(f"{local_repo_path}/**/*{ext}", recursive=True))

        if len(matching_files) > 0:
            file_summary.append(f"Found {len(matching_files)} {', '.join(extensions)} files in the repository.")
            loader = SimpleDirectoryReader(input_dir=local_repo_path, required_exts=extensions, recursive=True)
            docs = loader.load_data()
            parsed_nodes = parser.get_nodes_from_documents(docs)

            print_docs_and_nodes(docs, parsed_nodes)

            nodes.extend(parsed_nodes)
        else:
            file_summary.append(f"No {', '.join(extensions)} files found in the repository.")

    collect_and_print_file_summary(file_summary)
    print("\n")
    return nodes
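
The `clone_repository` helper called above is not shown in this post. A minimal sketch of what such a helper could look like follows. The target layout (`<base_path>/<owner>/<repo>`) matches the file paths in the indexed document example later in the post, but the shallow clone (`--depth 1`), the `build_clone_command` function, and the choice to skip an existing directory are assumptions. Authentication with GITHUB_TOKEN for private repositories is left out.

```python
import os
import subprocess


def build_clone_command(owner, repo, branch, base_path="/tmp"):
    """Return (git command, local path) for cloning github.com/<owner>/<repo>."""
    local_repo_path = os.path.join(base_path, owner, repo)
    command = ["git", "clone", "--depth", "1"]  # shallow clone: an assumption
    if branch:
        command += ["--branch", branch]
    command += [f"https://github.com/{owner}/{repo}.git", local_repo_path]
    return command, local_repo_path


def clone_repository(owner, repo, branch, base_path="/tmp"):
    """Clone the repository unless the target directory already exists."""
    command, local_repo_path = build_clone_command(owner, repo, branch, base_path)
    if not os.path.exists(local_repo_path):
        # check=True raises CalledProcessError if git exits with an error.
        subprocess.run(command, check=True)
    return local_repo_path
```

Separating command construction from execution keeps the path and argument logic easy to check without network access.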

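To support more languages, add a new parser and its file extensions to the parsers_and_extensions list. Once the nodes are parsed, embeddings are generated with the text-embedding-3-large model and stored in Elasticsearch. The embedding model is declared through the Settings object, which acts as global configuration: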

Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-large")

It is then used in the main function as part of the IngestionPipeline. Because the setting is global, it does not need to be passed again during ingestion:

    nodes = parse_documents()
    es_vector_store = get_es_vector_store()

    try:
        pipeline = IngestionPipeline(
            vector_store=es_vector_store,
        )

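        pipeline.run(documents=nodes, show_progress=True)
    except Exception as exc:
        # Error handling is not shown in the original excerpt; this clause is an assumption.
        print(f"Ingestion failed: {exc}")
        raise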

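The code block above first parses the documents into smaller chunks (nodes) and then initializes the connection to Elasticsearch. It creates an IngestionPipeline with the given Elasticsearch vector store and runs the pipeline, which processes the nodes and stores their embeddings in Elasticsearch while showing progress. At this point your data should be indexed in Elasticsearch, with the embeddings generated and stored. Here is an example of what a document looks like in ESS: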

        "_source": {
          "content": """Changelog

All notable changes to this project will be documented in this file.

**For detailed release notes, please refer to the [GitHub
releases](https://github.com/elastic/synthetics/releases) page.**""",
          "metadata": {
            "file_path": "/tmp/elastic/synthetics/CHANGELOG.md",
            "file_name": "CHANGELOG.md",
            "file_size": 23162,
            "creation_date": "2024-10-08",
            "last_modified_date": "2024-10-08",
            "_node_content": """{"id_": "2918efbb-b1aa-4afa-a505-d584e62d0d87", "embedding": null, "metadata": {"file_path": "/tmp/elastic/synthetics/CHANGELOG.md", "file_name": "CHANGELOG.md", "file_size": 23162, "creation_date": "2024-10-08", "last_modified_date": "2024-10-08"}, "excluded_embed_metadata_keys": ["file_name", "file_type", "file_size", "creation_date", "last_modified_date", "last_accessed_date"], "excluded_llm_metadata_keys": ["file_name", "file_type", "file_size", "creation_date", "last_modified_date", "last_accessed_date"], "relationships": {"1": {"node_id": "b0574471-c909-4fc8-ab82-2165c45ba72a", "node_type": "4", "metadata": {"file_path": "/tmp/elastic/synthetics/CHANGELOG.md", "file_name": "CHANGELOG.md", "file_size": 23162, "creation_date": "2024-10-08", "last_modified_date": "2024-10-08"}, "hash": "58b8f33fdb38603530f1d06333a6d84614d21bb305a2aee4cb74f174fd5037aa", "class_name": "RelatedNodeInfo"}}, "text": "", "mimetype": "text/plain", "start_char_idx": 0, "end_char_idx": 204, "text_template": "{metadata_str}\n\n{content}", "metadata_template": "{key}: {value}", "metadata_seperator": "\n", "class_name": "TextNode"}""",
            "_node_type": "TextNode",
            "document_id": "b0574471-c909-4fc8-ab82-2165c45ba72a",
            "doc_id": "b0574471-c909-4fc8-ab82-2165c45ba72a",
            "ref_doc_id": "b0574471-c909-4fc8-ab82-2165c45ba72a"
          },
          "embeddings": []
        }
     }
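
Note that `_node_content` in the document above is itself a JSON-encoded string, not a nested object. If you read documents back from Elasticsearch and want fields such as `start_char_idx`, you need to decode it first. The sketch below shows this on a trimmed-down stand-in for the `_source` above; `decode_node_content` is a hypothetical helper name.

```python
import json


def decode_node_content(source):
    """Parse the JSON string stored under metadata._node_content."""
    return json.loads(source["metadata"]["_node_content"])


# A trimmed-down stand-in for the _source shown above.
example_source = {
    "content": "Changelog",
    "metadata": {
        "file_name": "CHANGELOG.md",
        "_node_content": json.dumps(
            {
                "id_": "2918efbb-b1aa-4afa-a505-d584e62d0d87",
                "start_char_idx": 0,
                "end_char_idx": 204,
                "class_name": "TextNode",
            }
        ),
    },
}

node = decode_node_content(example_source)
print(node["class_name"], node["start_char_idx"], node["end_char_idx"])
```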

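Query

Once the data is indexed, you can query the Elasticsearch index to ask questions about the codebase. The query.py script lets you interact with the indexed data. It reads a query from the user, creates an embedding with the same OpenAIEmbedding model used in index.py, and sets up a query engine on top of a VectorStoreIndex loaded from the Elasticsearch vector store. The query engine runs a similarity search and retrieves the 3 documents whose stored embeddings are most similar to the query. The results are then summarized with response_mode="tree_summarize", which condenses the retrieved chunks into a single answer in a tree-like fashion. The code snippet is shown below: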

    query = input("Please enter your query: ")

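    openai_llm = OpenAI(model="gpt-4o")
    # Same embedding model as index.py, so query and document vectors are comparable.
    embed_model = OpenAIEmbedding(model="text-embedding-3-large")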

    es_vector_store = get_es_vector_store()
    index = VectorStoreIndex.from_vector_store(es_vector_store)

    try:
        query_engine = index.as_query_engine(
            llm=openai_llm,
            similarity_top_k=3,
            streaming=False, 
            response_mode="tree_summarize"
        )

        bundle = QueryBundle(query, embedding=embed_model.get_query_embedding(query))

        result = query_engine.query(bundle)
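        return result.response
    except Exception as exc:
        # Error handling is not shown in the original excerpt; this clause is an assumption.
        print(f"Query failed: {exc}")
        raise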
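
To illustrate what retrieving the top 3 most similar documents means, here is a small pure-Python sketch of a brute-force similarity search. It is not how Elasticsearch executes the search internally, and cosine similarity is assumed as the metric. The function names and the toy two-dimensional embeddings are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding, stored, k=3):
    """Return the k (doc_id, score) pairs most similar to the query."""
    scored = [
        (doc_id, cosine_similarity(query_embedding, embedding))
        for doc_id, embedding in stored.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings; real ones have thousands of dimensions.
stored_embeddings = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [1.0, 1.0],
    "d": [-1.0, 0.0],
}

results = top_k([1.0, 0.0], stored_embeddings, k=3)
print([doc_id for doc_id, _ in results])
```

The retrieved chunks, not the whole repository, are what the LLM sees when it composes the answer, which is why `similarity_top_k` and chunk size together bound how much context each response can draw on.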

Installation

1. Clone the repository:

git clone https://github.com/framsouza/github-assistant.git
cd github-assistant

2. Install the required libraries:

pip install -r requirements.txt

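3. Set the environment variables:

Update the .env file with your Elasticsearch credentials and the details of the target GitHub repository (for example GITHUB_TOKEN, GITHUB_OWNER, GITHUB_REPO, GITHUB_BRANCH, ELASTIC_CLOUD_ID, ELASTIC_USER, ELASTIC_PASSWORD, ELASTIC_INDEX).

Here is an example .env file: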

GITHUB_TOKEN=""
GITHUB_OWNER=""
GITHUB_REPO=""
GITHUB_BRANCH=""
ELASTIC_CLOUD_ID=""
ELASTIC_USER=""
ELASTIC_PASSWORD=""
ELASTIC_INDEX=""
OPENAI_API_KEY=""
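
Before running the scripts, it can help to check that none of these values were left empty. The following is a minimal sketch using only the standard library. Treating all nine variables as required is an assumption (the ingestion code shown earlier only insists on GITHUB_OWNER and GITHUB_REPO), and `missing_settings` is a hypothetical helper name.

```python
import os

# Variable names taken from the example .env file above.
REQUIRED_VARS = [
    "GITHUB_TOKEN",
    "GITHUB_OWNER",
    "GITHUB_REPO",
    "GITHUB_BRANCH",
    "ELASTIC_CLOUD_ID",
    "ELASTIC_USER",
    "ELASTIC_PASSWORD",
    "ELASTIC_INDEX",
    "OPENAI_API_KEY",
]


def missing_settings(environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings: " + ", ".join(missing))
    else:
        print("All settings are present.")
```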

Usage

1. Index your data and create the embeddings by running:

python index.py

2. Ask questions about your codebase by running:

python query.py

Example:

python query.py                                    
Please enter your query: Give me a detailed list of the external dependencies being used in this repository
 Based on the provided context, the following is a list of third-party dependencies used in the given Elastic Cloud on K8s project:
1. dario.cat/mergo (BSD-3-Clause, v1.0.0)
2. Masterminds/sprig (MIT, v3.2.3)
3. Masterminds/semver (MIT, v4.0.0)
4. go-spew (ISC, v1.1.2-0.20180830191138-d8f796af33cc)
5. elastic/go-ucfg (Apache-2.0, v0.8.8)
6. ghodss/yaml (MIT, v1.0.0)
7. go-logr/logr (Apache-2.0, v1.4.1)
8. go-test/deep (MIT, v1.1.0)
9. gobuffalo/flect (MIT, v1.0.2)
10. google/go-cmp (BSD-3-Clause, v0.6.0)
...
This list includes both direct and indirect dependencies as identified in the context.None

Questions you might want to ask:

Give me a detailed description of what are the main functionalities implemented in the code?
How does the code handle errors and exceptions?
Could you evaluate the test coverage of this codebase and also provide detailed insights into potential enhancements to improve test coverage significantly?

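Evaluation

The evaluation.py script processes documents, generates evaluation questions from their content, and then uses an LLM to evaluate the responses for relevancy (whether the response is relevant to the question) and faithfulness (whether the response is faithful to the source content). Here is how to run it: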

python evaluation.py --num_documents 5 --skip_documents 2 --num_questions 3 --skip_questions 1 --process_last_questions

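You can run the script without any arguments; the example above shows how the arguments are used. Here is what each one does:

Document processing:

--num_documents 5: the script processes 5 documents in total.
--skip_documents 2: the first 2 documents are skipped and processing starts at the 3rd document. The script therefore processes documents 3, 4, 5, 6 and 7.

Question generation:

After the documents are loaded, the script generates a list of questions based on their content.

--num_questions 3: only 3 of the generated questions are processed.
--skip_questions 1: the script skips the first question in the list and starts processing from the second one.
--process_last_questions: instead of processing the first 3 questions after skipping the first one, the script processes the last 3 questions in the list.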
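
The selection logic described by these flags can be sketched with argparse and list slicing. This is an illustration of the behavior described above, not the code of evaluation.py: the defaults, the function names, and the exact interaction between the flags are assumptions.

```python
import argparse


def build_parser():
    """Argument parser mirroring the flags described above (defaults assumed)."""
    parser = argparse.ArgumentParser(description="Sketch of the evaluation flags")
    parser.add_argument("--num_documents", type=int, default=None)
    parser.add_argument("--skip_documents", type=int, default=0)
    parser.add_argument("--num_questions", type=int, default=None)
    parser.add_argument("--skip_questions", type=int, default=0)
    parser.add_argument("--process_last_questions", action="store_true")
    return parser


def select_documents(documents, num_documents, skip_documents):
    """Skip the first skip_documents items, then take num_documents items."""
    end = None if num_documents is None else skip_documents + num_documents
    return documents[skip_documents:end]


def select_questions(questions, num_questions, skip_questions, process_last):
    """Pick questions from the front (after skipping) or from the end of the list."""
    if num_questions is None:
        return questions[skip_questions:]
    if process_last:
        return questions[-num_questions:] if num_questions > 0 else []
    return questions[skip_questions:skip_questions + num_questions]


if __name__ == "__main__":
    args = build_parser().parse_args()
    documents = [f"doc-{i}" for i in range(1, 11)]       # placeholder documents
    questions = [f"question-{i}" for i in range(21)]     # placeholder questions
    print(select_documents(documents, args.num_documents, args.skip_documents))
    print(select_questions(questions, args.num_questions,
                           args.skip_questions, args.process_last_questions))
```

With the flags from the example command, this sketch selects documents 3 through 7 and the last 3 questions in the list.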

Number of documents loaded: 5
All available questions generated:
0. What is the purpose of chunking monitors in the updated push command as mentioned in the changelog?
1. How does the changelog describe the improvement made to the performance of the push command?
2. What new feature is added to the synthetics project when it is created via the `init` command?
3. According to the changelog, what is the file size of the CHANGELOG.md document?
4. On what date was the CHANGELOG.md file last modified?
5. What is the significance of the example lightweight monitor yaml file mentioned in the changelog?
6. How might the changes described in the changelog impact the workflow of users creating or updating monitors?
7. What is the file path where the CHANGELOG.md document is located?
8. Can you identify the issue numbers associated with the changes mentioned in the changelog?
9. What is the creation date of the CHANGELOG.md file as per the context information?
10. What type of file is the document described in the context information?
11. On what date was the CHANGELOG.md file last modified?
12. What is the file size of the CHANGELOG.md document?
13. Identify one of the bug fixes mentioned in the CHANGELOG.md file.
14. What command is referenced in the context of creating new synthetics projects?
15. How does the CHANGELOG.md file address the issue of varying NDJSON chunked response sizes?
16. What is the significance of the number #680 in the context of the document?
17. What problem is addressed by skipping the addition of empty values for locations?
18. How many bug fixes are explicitly mentioned in the provided context?
19. What is the file path of the CHANGELOG.md document?
20. What is the file path of the document being referenced in the context information?
...

Generated questions:
1. What command is referenced in relation to the bug fix in the CHANGELOG.md?
2. On what date was the CHANGELOG.md file created?
3. What is the primary purpose of the document based on the context provided?

Total number of questions generated: 3

Processing Question 1 of 3:

Evaluation Result:
+---------------------------------------------------+-------------------------------------------------+----------------------------------------------------+----------------------+----------------------+-------------------+------------------+------------------+
| Query                                             | Response                                        | Source                                             | Relevancy Response   | Relevancy Feedback   |   Relevancy Score | Faith Response   | Faith Feedback   |
+===================================================+=================================================+====================================================+======================+======================+===================+==================+==================+
| What command is referenced in relation to the bug | The `init` command is referenced in relation to | Bug Fixes                                          | Pass                 | YES                  |                 1 | Pass             | YES              |
| fix in the CHANGELOG.md?                          | the bug fix in the CHANGELOG.md.                |                                                    |                      |                      |                   |                  |                  |
|                                                   |                                                 |                                                    |                      |                      |                   |                  |                  |
|                                                   |                                                 | - Pick the correct loader when bundling TypeScript |                      |                      |                   |                  |                  |
|                                                   |                                                 | or JavaScript journey files                        |                      |                      |                   |                  |                  |
|                                                   |                                                 |                                                    |                      |                      |                   |                  |                  |
|                                                   |                                                 |   during push command #626                         |                      |                      |                   |                  |                  |
+---------------------------------------------------+-------------------------------------------------+----------------------------------------------------+----------------------+----------------------+-------------------+------------------+------------------+

Processing Question 2 of 3:

Evaluation Result:
+-------------------------------------------------+------------------------------------------------+------------------------------+----------------------+----------------------+-------------------+------------------+------------------+
| Query                                           | Response                                       | Source                       | Relevancy Response   | Relevancy Feedback   |   Relevancy Score | Faith Response   | Faith Feedback   |
+=================================================+================================================+==============================+======================+======================+===================+==================+==================+
| On what date was the CHANGELOG.md file created? | The date mentioned in the CHANGELOG.md file is | v1.0.0-beta-38 (20222-11-02) | Pass                 | YES                  |                 1 | Pass             | YES              |
|                                                 | November 2, 2022.                              |                              |                      |                      |                   |                  |                  |
+-------------------------------------------------+------------------------------------------------+------------------------------+----------------------+----------------------+-------------------+------------------+------------------+

Processing Question 3 of 3:

Evaluation Result:
+---------------------------------------------------+---------------------------------------------------+------------------------------+----------------------+----------------------+-------------------+------------------+------------------+
| Query                                             | Response                                          | Source                       | Relevancy Response   | Relevancy Feedback   |   Relevancy Score | Faith Response   | Faith Feedback   |
+===================================================+===================================================+==============================+======================+======================+===================+==================+==================+
| What is the primary purpose of the document based | The primary purpose of the document is to provide | v1.0.0-beta-38 (20222-11-02) | Pass                 | YES                  |                 1 | Pass             | YES              |
| on the context provided?                          | a changelog detailing the features and            |                              |                      |                      |                   |                  |                  |
|                                                   | improvements made in version 1.0.0-beta-38 of a   |                              |                      |                      |                   |                  |                  |
|                                                   | software project. It highlights specific          |                              |                      |                      |                   |                  |                  |
|                                                   | enhancements such as improved validation for      |                              |                      |                      |                   |                  |                  |
|                                                   | monitor schedules and an enhanced push command    |                              |                      |                      |                   |                  |                  |
|                                                   | experience.                                       |                              |                      |                      |                   |                  |                  |
+---------------------------------------------------+---------------------------------------------------+------------------------------+----------------------+----------------------+-------------------+------------------+------------------+