Elasticsearch: Chunking large documents via ingest pipelines plus nested vectors for easy passage search


Vector search is a powerful way to search data by meaning rather than by exact or fuzzy token matching. However, the text embedding models that power vector search can only process short passages of text, on the order of a few sentences, unlike BM25-based techniques that can handle arbitrarily long text. Elasticsearch can now combine large documents with vector search seamlessly.

In short, how does it work?

The combination of Elasticsearch features such as ingest pipelines, the flexibility of the script processor, and the new support for nested documents with dense vectors provides a straightforward way, at ingest time, to split a large document into passages small enough for a text embedding model to process, which then generates all the vectors needed to represent the full meaning of the large document.

Ingest your document data as usual, and add a script processor to the ingest pipeline that breaks the large text field into an array of sentences or other kinds of chunks, followed by a for_each processor that runs an inference processor on every chunk. The index mapping defines the chunk array as a nested object with a dense_vector mapping as a sub-object, which indexes each vector correctly and makes them all searchable.

Let's walk through the details:

Load the text embedding model

The first thing you need is a model to create text embeddings from the chunks. You can use whichever model you want, but this example runs end to end with the all-MiniLM-L6-v2 model. After creating an Elastic Cloud cluster or preparing another Elasticsearch cluster, we can upload the text embedding model with the eland library.
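
If eland is not already installed on your machine, the eland_import_hub_model CLI typically comes with eland's PyTorch extras. A minimal install sketch (exact extras and versions may vary with your Elasticsearch release):

python -m pip install 'eland[pytorch]'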

MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
ELASTIC_PASSWORD = "YOURPASSWORD"
CLOUD_ID = "YOURCLOUDID"

eland_import_hub_model \
    --cloud-id $CLOUD_ID \
    --es-username elastic \
    --es-password $ELASTIC_PASSWORD \
    --hub-model-id $MODEL_ID \
    --task-type text_embedding \
    --start
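
To confirm the model was uploaded and started, check the Trained Models UI in Kibana or query the trained models stats API. The model ID below follows eland's naming convention (lowercased, with / replaced by __) and matches the ID used later in the ingest pipeline:

GET _ml/trained_models/sentence-transformers__all-minilm-l6-v2/_stats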

Mapping example

The next step is to prepare the mapping to handle the array of sentences and vector objects that will be created during the ingest pipeline. For this particular text embedding model the dimension is 384, and dot_product similarity will be used for the nearest-neighbor calculations:

PUT chunker
{
  "mappings": {
    "dynamic": "true",
    "properties": {
      "passages": {
        "type": "nested",
        "properties": {
          "vector": {
            "properties": {
              "predicted_value": {
                "type": "dense_vector",
                "index": true,
                "dims": 384,
                "similarity": "dot_product"
              }
            }
          }
        }
      }
    }
  }
}
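
Note that dot_product similarity assumes the model emits unit-length vectors, which holds for this model. If you swap in a model that does not normalize its output, cosine is the safer choice; only the similarity setting of the dense_vector field would change (an illustrative variation, not part of the original example):

"predicted_value": {
  "type": "dense_vector",
  "index": true,
  "dims": 384,
  "similarity": "cosine"
}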

Ingest pipeline example

The final preparation step is to define an ingest pipeline that breaks the body_content field into chunks of text stored in the passages field. The pipeline has two processors. The first is a script processor that splits the body_content field into an array of sentences, stored in the passages field, via a regular expression. For further study, read up on advanced regex features such as negative lookbehind and positive lookbehind to understand how it tries to split correctly on sentence boundaries without splitting on Mr. or Mrs. or Ms., while keeping the punctuation with the sentence. It also tries to concatenate sentence chunks together as long as the total string length stays below the parameter passed to the script. The second is a foreach processor that runs the text embedding model on each sentence via an inference processor:

PUT _ingest/pipeline/chunker
{
  "processors": [
    {
      "script": {
        "description": "Chunk body_content into sentences by looking for . followed by a space",
        "lang": "painless",
        "source": """
          String[] envSplit = /((?<!M(r|s|rs)\.)(?<=\.) |(?<=\!) |(?<=\?) )/.split(ctx['body_content']);
          ctx['passages'] = new ArrayList();
          int i = 0;
          boolean remaining = true;
          if (envSplit.length == 0) {
            return
          } else if (envSplit.length == 1) {
            Map passage = ['text': envSplit[0]];ctx['passages'].add(passage)
          } else {
            while (remaining) {
              Map passage = ['text': envSplit[i++]];
              while (i < envSplit.length && passage.text.length() + envSplit[i].length() < params.model_limit) {passage.text = passage.text + ' ' + envSplit[i++]}
              if (i == envSplit.length) {remaining = false}
              ctx['passages'].add(passage)
            }
          }
          """,
        "params": {
          "model_limit": 400
        }
      }
    },
    {
      "foreach": {
        "field": "passages",
        "processor": {
          "inference": {
            "field_map": {
              "_ingest._value.text": "text_field"
            },
            "model_id": "sentence-transformers__all-minilm-l6-v2",
            "target_field": "_ingest._value.vector",
            "on_failure": [
              {
                "append": {
                  "field": "_source._ingest.inference_errors",
                  "value": [
                    {
                      "message": "Processor 'inference' in pipeline 'ml-inference-title-vector' failed with message '{{ _ingest.on_failure_message }}'",
                      "pipeline": "ml-inference-title-vector",
                      "timestamp": "{{{ _ingest.timestamp }}}"
                    }
                  ]
                }
              }
            ]
          }
        }
      }
    }
  ]
}
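
Before indexing real documents, you can dry-run the pipeline with the simulate API and check that the passages array, with one vector per passage, comes back as expected. A small sanity-check example (not part of the original walkthrough):

POST _ingest/pipeline/chunker/_simulate
{
  "docs": [
    {
      "_source": {
        "title": "A short test",
        "body_content": "This is the first sentence. Here is a second one! Mr. Smith wrote a third? And a fourth."
      }
    }
  ]
}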

Add some documents

Now we can add documents with large amounts of text in body_content; they will be chunked automatically, and each chunk's text will be run through the model and embedded into a vector:

PUT chunker/_doc/1?pipeline=chunker
{
"title": "Adding passage vector search to Lucene",
"body_content": "Vector search is a powerful tool in the information retrieval tool box. Using vectors alongside lexical search like BM25 is quickly becoming commonplace. But there are still a few pain points within vector search that need to be addressed. A major one is text embedding models and handling larger text input. Where lexical search like BM25 is already designed for long documents, text embedding models are not. All embedding models have limitations on the number of tokens they can embed. So, for longer text input it must be chunked into passages shorter than the model’s limit. Now instead of having one document with all its metadata, you have multiple passages and embeddings. And if you want to preserve your metadata, it must be added to every new document. A way to address this is with Lucene's “join” functionality. This is an integral part of Elasticsearch’s nested field type. It makes it possible to have a top-level document with multiple nested documents, allowing you to search over nested documents and join back against their parent documents. This sounds perfect for multiple passages and vectors belonging to a single top-level document! This is all awesome! But, wait, Elasticsearch® doesn’t support vectors in nested fields. Why not, and what needs to change? The key issue is how Lucene can join back to the parent documents when searching child vector passages. Like with kNN pre-filtering versus post-filtering, when the joining occurs determines the result quality and quantity. If a user searches for the top four nearest parent documents (not passages) to a query vector, they usually expect four documents. But what if they are searching over child vector passages and all four of the nearest vectors are from the same parent document? This would end up returning just one parent document, which would be surprising. This same kind of issue occurs with post-filtering."
}

PUT chunker/_doc/3?pipeline=chunker
{
"title": "Automatic Byte Quantization in Lucene",
"body_content": "While HNSW is a powerful and flexible way to store and search vectors, it does require a significant amount of memory to run quickly. For example, querying 1MM float32 vectors of 768 dimensions requires roughly 1,000,000∗4∗(768+12)=3120000000≈31,000,000∗4∗(768+12)=3120000000bytes≈3GB of ram. Once you start searching a significant number of vectors, this gets expensive. One way to use around 75% less memory is through byte quantization. Lucene and consequently Elasticsearch has supported indexing byte vectors for some time, but building these vectors has been the user's responsibility. This is about to change, as we have introduced int8 scalar quantization in Lucene. All quantization techniques are considered lossy transformations of the raw data. Meaning some information is lost for the sake of space. For an in depth explanation of scalar quantization, see: Scalar Quantization 101. At a high level, scalar quantization is a lossy compression technique. Some simple math gives significant space savings with very little impact on recall. Those used to working with Elasticsearch may be familiar with these concepts already, but here is a quick overview of the distribution of documents for search. Each Elasticsearch index is composed of multiple shards. While each shard can only be assigned to a single node, multiple shards per index gives you compute parallelism across nodes. Each shard is composed as a single Lucene Index. A Lucene index consists of multiple read-only segments. During indexing, documents are buffered and periodically flushed into a read-only segment. When certain conditions are met, these segments can be merged in the background into a larger segment. All of this is configurable and has its own set of complexities. But, when we talk about segments and merging, we are talking about read-only Lucene segments and the automatic periodic merging of these segments. Here is a deeper dive into segment merging and design decisions."
}

PUT chunker/_doc/2?pipeline=chunker
{
"title": "Use a Japanese language NLP model in Elasticsearch to enable semantic searches",
"body_content": "Quickly finding necessary documents from among the large volume of internal documents and product information generated every day is an extremely important task in both work and daily life. However, if there is a high volume of documents to search through, it can be a time-consuming process even for computers to re-read all of the documents in real time and find the target file. That is what led to the appearance of Elasticsearch® and other search engine software. When a search engine is used, search index data is first created so that key search terms included in documents can be used to quickly find those documents. However, even if the user has a general idea of what type of information they are searching for, they may not be able to recall a suitable keyword or they may search for another expression that has the same meaning. Elasticsearch enables synonyms and similar terms to be defined to handle such situations, but in some cases it can be difficult to simply use a correspondence table to convert a search query into a more suitable one. To address this need, Elasticsearch 8.0 released the vector search feature, which searches by the semantic content of a phrase. Alongside that, we also have a blog series on how to use Elasticsearch to perform vector searches and other NLP tasks. However, up through the 8.8 release, it was not able to correctly analyze text in languages other than English. With the 8.9 release, Elastic added functionality for properly analyzing Japanese in text analysis processing. This functionality enables Elasticsearch to perform semantic searches like vector search on Japanese text, as well as natural language processing tasks such as sentiment analysis in Japanese. In this article, we will provide specific step-by-step instructions on how to use these features."
}

PUT chunker/_doc/5?pipeline=chunker
{
"title": "We can chunk whatever we want now basically to the limits of a document ingest",
"body_content": """Chonk is an internet slang term used to describe overweight cats that grew popular in the late summer of 2018 after a photoshopped chart of cat body-fat indexes renamed the "Chonk" scale grew popular on Twitter and Reddit. Additionally, "Oh Lawd He Comin'," the final level of the Chonk Chart, was adopted as an online catchphrase used to describe large objects, animals or people. It is not to be confused with the Saturday Night Live sketch of the same name. The term "Chonk" was popularized in a photoshopped edit of a chart illustrating cat body-fat indexes and the risk of health problems for each class (original chart shown below). The first known post of the "Chonk" photoshop, which classifies each cat to a certain level of "chonk"-ness ranging from "A fine boi" to "OH LAWD HE COMIN," was posted to Facebook group THIS CAT IS C H O N K Y on August 2nd, 2018 by Emilie Chang (shown below). The chart surged in popularity after it was tweeted by @dreamlandtea[1] on August 10th, 2018, gaining over 37,000 retweets and 94,000 likes (shown below). After the chart was posted there, it began growing popular on Reddit. It was reposted to /r/Delighfullychubby[2] on August 13th, 2018, and /r/fatcats on August 16th.[3] Additionally, cats were shared with variations on the phrase "Chonk." In @dreamlandtea's Twitter thread, she rated several cats on the Chonk scale (example, shown below, left). On /r/tumblr, a screenshot of a post featuring a "good luck cat" titled "Lucky Chonk" gained over 27,000 points (shown below, right). The popularity of the phrase led to the creation of a subreddit, /r/chonkers,[4] that gained nearly 400 subscribers in less than a month. Some photoshops of the chonk chart also spread on Reddit. For example, an edit showing various versions of Pikachu on the chart posted to /r/me_irl gained over 1,200 points (shown below, left). The chart gained further popularity when it was posted to /r/pics[5] September 29th, 2018."""
}
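
If you want to inspect what the pipeline produced, fetching one of the documents shows the passages array, with each entry holding the chunk text and the predicted_value vector written by the inference processor:

GET chunker/_doc/1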

Search those documents

To search the data and return the chunk that best matches the query, use the knn clause together with inner_hits, which returns the document's best-matching chunk in the hits output of the query:

GET chunker/_search
{
  "_source": false,
  "fields": [
    "title"
  ],
  "knn": {
    "inner_hits": {
      "_source": false,
      "fields": [
        "passages.text"
      ]
    },
    "field": "passages.vector.predicted_value",
    "k": 1,
    "num_candidates": 100,
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "sentence-transformers__all-minilm-l6-v2",
        "model_text": "Can I use multiple vectors per document now?"
      }
    }
  }
}

This returns the best document and the relevant portion of the larger document's text:

{
  "took": 4,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": 0.75261426,
    "hits": [
      {
        "_index": "chunker",
        "_id": "1",
        "_score": 0.75261426,
        "_ignored": [
          "body_content.keyword",
          "passages.text.keyword"
        ],
        "fields": {
          "title": [
            "Adding passage vector search to Lucene"
          ]
        },
        "inner_hits": {
          "passages": {
            "hits": {
              "total": {
                "value": 1,
                "relation": "eq"
              },
              "max_score": 0.75261426,
              "hits": [
                {
                  "_index": "chunker",
                  "_id": "1",
                  "_nested": {
                    "field": "passages",
                    "offset": 3
                  },
                  "_score": 0.75261426,
                  "fields": {
                    "passages": [
                      {
                        "text": [
                          "This sounds perfect for multiple passages and vectors belonging to a single top-level document! This is all awesome! But, wait, Elasticsearch® doesn’t support vectors in nested fields. Why not, and what needs to change? The key issue is how Lucene can join back to the parent documents when searching child vector passages."
                        ]
                      }
                    ]
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}
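
The knn clause can also be combined with a regular lexical query in the same request, so BM25 and vector scores both contribute to the final ranking. A sketch of such a hybrid search, reusing the same index and model (boosts left at their defaults):

GET chunker/_search
{
  "_source": false,
  "fields": [
    "title"
  ],
  "query": {
    "match": {
      "body_content": "multiple vectors per document"
    }
  },
  "knn": {
    "inner_hits": {
      "_source": false,
      "fields": [
        "passages.text"
      ]
    },
    "field": "passages.vector.predicted_value",
    "k": 2,
    "num_candidates": 100,
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "sentence-transformers__all-minilm-l6-v2",
        "model_text": "Can I use multiple vectors per document now?"
      }
    }
  }
}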

A quick recap

The approach shown here demonstrates the power of combining different Elasticsearch features to solve a larger problem.

Ingest pipelines let you preprocess documents before they are indexed. While there are many processors that perform specific, targeted tasks, sometimes you need the power of a scripting language to do things like break text into an array of sentences. Because you have access to the document before it is indexed, you can reshape the data in almost any way you can imagine, as long as all the information is in the document itself. The foreach processor lets us wrap something that may run zero to N times without knowing in advance how many executions are needed. In this case we use it to run the inference processor once per extracted sentence, generating a vector for each.

The index mapping is prepared to handle the array of text-and-vector objects that did not exist in the original document, using nested objects to index the data so that the documents can be searched correctly.

Using knn with nested vector support allows inner_hits to surface the best-scoring passage of a document, which can replace what is usually done with highlight in BM25 queries.

Original: Chunking Large Documents via Ingest pipelines plus nested vectors equals easy passage search — Elastic Search Labs
