英伟达基于Mistral 7B开发新一代Embedding模型——NV-Embed-v2

在这里插入图片描述

我们介绍的 NV-Embed-v2 是一种通用嵌入模型,它在大规模文本嵌入基准(MTEB 基准)(截至 2024 年 8 月 30 日)的 56 项文本嵌入任务中以 72.31 的高分排名第一。此外,它还在检索子类别中排名第一(在 15 项任务中获得 62.65 分),这对 RAG 技术的发展至关重要。

NV-Embed-v2 采用了多项新设计,包括让 LLM 关注潜在向量,以获得更好的池化嵌入输出,并展示了一种两阶段指令调整方法,以提高检索和非检索任务的准确性。此外,NV-Embed-v2 还采用了一种新颖的硬阴性挖掘方法,该方法考虑了正相关性得分,能更好地去除假阴性。

有关更多技术细节,请参阅我们的论文: NV-Embed:将 LLM 训练为通用嵌入模型的改进技术。

型号详情

  • 仅用于解码器的基本 LLM:Mistral-7B-v0.1
  • 池类型: Latent-Attention
  • 嵌入尺寸: 4096

如何使用

所需软件包

如果遇到问题,请尝试安装以下 python 软件包

pip uninstall -y transformer-engine
pip install torch==2.2.0
pip install transformers==4.42.4
pip install flash-attn==2.2.0
pip install sentence-transformers==2.7.0

以下是如何使用 Huggingface-transformer 和 Sentence-transformer 对查询和段落进行编码的示例。

HuggingFace Transformers

import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

# Each query needs to be accompanied by an corresponding instruction describing the task.
task_name_to_instruct = {"example": "Given a question, retrieve passages that answer the question",}

query_prefix = "Instruct: "+task_name_to_instruct["example"]+"\nQuery: "
queries = [
    'are judo throws allowed in wrestling?', 
    'how to become a radiology technician in michigan?'
    ]

# No instruction needed for retrieval passages
passage_prefix = ""
passages = [
    "Since you're reading this, you are probably someone from a judo background or someone who is just wondering how judo techniques can be applied under wrestling rules. So without further ado, let's get to the question. Are Judo throws allowed in wrestling? Yes, judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force.",
    "Below are the basic steps to becoming a radiologic technologist in Michigan:Earn a high school diploma. As with most careers in health care, a high school education is the first step to finding entry-level employment. Taking classes in math and science, such as anatomy, biology, chemistry, physiology, and physics, can help prepare students for their college studies and future careers.Earn an associate degree. Entry-level radiologic positions typically require at least an Associate of Applied Science. Before enrolling in one of these degree programs, students should make sure it has been properly accredited by the Joint Review Committee on Education in Radiologic Technology (JRCERT).Get licensed or certified in the state of Michigan."
]

# load model with tokenizer
model = AutoModel.from_pretrained('nvidia/NV-Embed-v2', trust_remote_code=True)

# get the embeddings
max_length = 32768
query_embeddings = model.encode(queries, instruction=query_prefix, max_length=max_length)
passage_embeddings = model.encode(passages, instruction=passage_prefix, max_length=max_length)

# normalize embeddings
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
passage_embeddings = F.normalize(passage_embeddings, p=2, dim=1)

# get the embeddings with DataLoader (spliting the datasets into multiple mini-batches)
# batch_size=2
# query_embeddings = model._do_encode(queries, batch_size=batch_size, instruction=query_prefix, max_length=max_length, num_workers=32, return_numpy=True)
# passage_embeddings = model._do_encode(passages, batch_size=batch_size, instruction=passage_prefix, max_length=max_length, num_workers=32, return_numpy=True)

scores = (query_embeddings @ passage_embeddings.T) * 100
print(scores.tolist())
# [[87.42693328857422, 0.46283677220344543], [0.965264618396759, 86.03721618652344]]

Sentence-Transformers

import torch
from sentence_transformers import SentenceTransformer

# Each query needs to be accompanied by an corresponding instruction describing the task.
task_name_to_instruct = {"example": "Given a question, retrieve passages that answer the question",}

query_prefix = "Instruct: "+task_name_to_instruct["example"]+"\nQuery: "
queries = [
    'are judo throws allowed in wrestling?', 
    'how to become a radiology technician in michigan?'
    ]

# No instruction needed for retrieval passages
passages = [
    "Since you're reading this, you are probably someone from a judo background or someone who is just wondering how judo techniques can be applied under wrestling rules. So without further ado, let's get to the question. Are Judo throws allowed in wrestling? Yes, judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force.",
    "Below are the basic steps to becoming a radiologic technologist in Michigan:Earn a high school diploma. As with most careers in health care, a high school education is the first step to finding entry-level employment. Taking classes in math and science, such as anatomy, biology, chemistry, physiology, and physics, can help prepare students for their college studies and future careers.Earn an associate degree. Entry-level radiologic positions typically require at least an Associate of Applied Science. Before enrolling in one of these degree programs, students should make sure it has been properly accredited by the Joint Review Committee on Education in Radiologic Technology (JRCERT).Get licensed or certified in the state of Michigan."
]

# load model with tokenizer
model = SentenceTransformer('nvidia/NV-Embed-v2', trust_remote_code=True)
model.max_seq_length = 32768
model.tokenizer.padding_side="right"

def add_eos(input_examples):
  input_examples = [input_example + model.tokenizer.eos_token for input_example in input_examples]
  return input_examples

# get the embeddings
batch_size = 2
query_embeddings = model.encode(add_eos(queries), batch_size=batch_size, prompt=query_prefix, normalize_embeddings=True)
passage_embeddings = model.encode(add_eos(passages), batch_size=batch_size, normalize_embeddings=True)

scores = (query_embeddings @ passage_embeddings.T) * 100
print(scores.tolist())

MTEB 基准的指令模板

对于检索、STS 和摘要的 MTEB 子任务,请使用 instructions.json 中的指令前缀模板。 对于分类、聚类和重排,请使用 NV-Embed 论文表 7 中提供的说明。 7 中提供的说明。

instructions.json

{
    "ClimateFEVER":
            {
                "query": "Given a claim about climate change, retrieve documents that support or refute the claim",
                "corpus": ""
            },
    "HotpotQA":
        {
            "query": "Given a multi-hop question, retrieve documents that can help answer the question",
            "corpus": ""
        },
    "FEVER":
        {
            "query": "Given a claim, retrieve documents that support or refute the claim",
            "corpus": ""
        },
    "MSMARCO":
        {
            "query": "Given a web search query, retrieve relevant passages that answer the query",
            "corpus": ""
        },
    "DBPedia":
        {
            "query": "Given a query, retrieve relevant entity descriptions from DBPedia",
            "corpus": ""
        },
    "NQ":
        {
            "query": "Given a question, retrieve passages that answer the question",
            "corpus": ""
        },
    "QuoraRetrieval":
        {
            "query": "Given a question, retrieve questions that are semantically equivalent to the given question",
            "corpus": "Given a question, retrieve questions that are semantically equivalent to the given question"
        },
    "SCIDOCS":
        {
            "query": "Given a scientific paper title, retrieve paper abstracts that are cited by the given paper",
            "corpus": ""
        },
    "TRECCOVID":
        {
            "query": "Given a query on COVID-19, retrieve documents that answer the query",
            "corpus": ""
        },
    "Touche2020":
        {
            "query": "Given a question, retrieve passages that answer the question",
            "corpus": ""
        },
    "SciFact":
        {
            "query": "Given a scientific claim, retrieve documents that support or refute the claim",
            "corpus": ""
        },
    "NFCorpus":
        {
            "query": "Given a question, retrieve relevant documents that answer the question",
            "corpus": ""
        },
    "ArguAna":
        {
            "query": "Given a claim, retrieve documents that support or refute the claim",
            "corpus": ""
        },
    "FiQA2018":
        {
            "query": "Given a financial question, retrieve relevant passages that answer the query",
            "corpus": ""
        },
    "STS":
        {
            "text": "Retrieve semantically similar text"
        },
    "SUMM":
        {
            "text": "Given a news summary, retrieve other semantically similar summaries"
        }
}

如何启用多 GPU(注意,这是 HuggingFace Transformers的情况)

from transformers import AutoModel
from torch.nn import DataParallel

embedding_model = AutoModel.from_pretrained("nvidia/NV-Embed-v2")
for module_key, module in embedding_model._modules.items():
    embedding_model._modules[module_key] = DataParallel(module)

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:/a/917199.html

如若内容造成侵权/违法违规/事实不符,请联系我们进行投诉反馈qq邮箱809451989@qq.com,一经查实,立即删除!

相关文章

24 年第十届数维杯国际数模竞赛赛题浅析

本次万众瞩目的数维杯国际大学生数学建模赛题已正式出炉,无论是赛题难度还是认可度,该比赛都是数模届的独一档,含金量极高,可以用于综测加分、保研、简历添彩等各方面。考虑到大家解题实属不易,为了帮助大家取得好成绩…

SOP搭建:企业标准化操作程序构建与实施指南

一、引言 在当今充满竞争的商业领域,实现企业运营的标准化、高效化和高质量化是提升企业市场竞争力的关键所在。标准操作程序(SOP)作为一种至关重要的管理工具,能够清晰地阐述业务流程,规范操作行为,并促进…

【每日刷题】Day156

【每日刷题】Day156 🥕个人主页:开敲🍉 🔥所属专栏:每日刷题🍍 🌼文章目录🌼 1. 1020. 飞地的数量 - 力扣(LeetCode) 2. 1765. 地图中的最高点 - 力扣(LeetCode) 3. 1162. 地图分…

J.U.C - 深入解读Condition条件变量原理源码

文章目录 Pre概述Condition 主要方法Condition案例Condition的源码解析1. 等待:condition. await2. 唤醒Condition. signal Condition总结 Pre J.U.C - 深入解析ReentrantLock原理&源码 概述 配合synchronized同步锁在同步代码块中调用加锁对象notify和wait方…

c++ 类和对象(中)

前言 我们看看下面的代码以及代码运行结果 代码1 我们可以看到在我们的类Data中的函数成员print中,我们并没有设置形参,在调用此函数时,也并没有多余传参,但是我们调用它时,却能准确打印出我们的_year、_month、_day…

python:用 sklearn 构建 K-Means 聚类模型

pip install scikit-learn 或者 直接用 Anaconda3 sklearn 提供了 preprocessing 数据预处理模块、cluster 聚类模型、manifold.TSNE 数据降维模块。 编写 test_sklearn_3.py 如下 # -*- coding: utf-8 -*- """ 使用 sklearn 构建 K-Means 聚类模型 "&…

使用python编写工具:快速生成chrome插件相关文件结构

本文将详细分析一段用 wxPython 编写的 Python 应用程序代码。该程序允许用户创建一些特定文件并将它们保存在指定的文件夹中,同时也能够启动 Google Chrome 浏览器并打开扩展页面,自动执行一些操作。 C:\pythoncode\new\crxiterationtaburl.py 全部代码…

使用Web Components构建模块化Web应用

💓 博客主页:瑕疵的CSDN主页 📝 Gitee主页:瑕疵的gitee主页 ⏩ 文章专栏:《热点资讯》 使用Web Components构建模块化Web应用 使用Web Components构建模块化Web应用 使用Web Components构建模块化Web应用 引言 Web Co…

谷歌浏览器的自动翻译功能如何开启

在当今全球化的网络环境中,能够流畅地浏览不同语言的网页是至关重要的。谷歌浏览器(Google Chrome)提供了一项强大的自动翻译功能,可以帮助用户轻松跨越语言障碍。本文将详细介绍如何开启和使用谷歌浏览器的自动翻译功能&#xff…

算法---解决“汉诺塔”问题

# 初始化步骤计数器 i 1 # 定义移动盘子的函数 def move(n, mfrom, mto): global i # 使用全局变量i来跟踪步骤 print("第%d步:将%d号盘子从%s->%s" % (i, n, mfrom, mto)) # 打印移动步骤 i 1 # 步骤计数器加1 #第一种方法 # 定义汉诺塔问题的递归…

uniapp对接极光推送,实现消息推送功能

通过集成JG-JPush和JG-JCore插件,可以在应用中添加消息推送功能,向用户发送通知、消息等。这对于提升用户体验、增加用户粘性非常有帮助‌。 效果图: 一、登录极光官网 进入【服务中心】-【开发者平台】 创建应用:【概览】- 【创…

redis高性能键值数据库技术简介

什么是redis redis是远程字典服务(Remote Dictionary Server )的简写,是一个完全开源的高性能的Key-Value数据库,提供了丰富的数据结构如string、Hash、List、SetSortedset等等。数据是存在内存中的,同时Redis支持事务…

进程信号

目录 信号入门 1. 生活角度的信号 2. 技术应用角度的信号 3. 注意 4. 信号概念 5. 用kill -l命令可以察看系统定义的信号列表 6. 信号处理常见方式概览 产生信号 1. 通过终端按键产生信号 Core Dump 2. 调用系统函数向进程发信号 3. 由软件条件产生信号 4. 硬件异…

NotePad++中安装XML Tools插件

一、概述 作为开发人员,日常开发中大部的数据是标准的json格式,但是对于一些古老的应用,例如webservice接口,由于其响应结果是xml,那么我们拿到xml格式的数据后,常常会对其进行格式化,以便阅读。…

Java基础——多线程

1. 线程 是一个程序内部的一条执行流程程序中如果只有一条执行流程,那这个程序就是单线程的程序 2. 多线程 指从软硬件上实现的多条执行流程的技术(多条线程由CPU负责调度执行) 2.1. 如何创建多条线程 Java通过java.lang.Thread类的对象…

HarmonyOS ArkUI(基于ArkTS) 常用组件

一 Button 按钮 Button是按钮组件,通常用于响应用户的点击操作,可以加子组件 Button(我是button)Button(){Text(我是button)}type 按钮类型 Button有三种可选类型,分别为胶囊类型(Capsule)、圆形按钮(Circle&#xf…

【FPGA开发】AXI-Stream总线协议解读

文章目录 AXI-Stream概述协议中一些定义字节定义流的定义 数据流类别字节流连续对齐流连续不对齐流稀疏流 协议的信号信号列表 文章为个人理解整理,如有错误,欢迎指正! 参考文献 ARM官方手册 《IHI0051B》 AXI-Stream概述 协议中一些定义 A…

谷粒商城のMySQL集群分库分表

文章目录 前言一、MySQL的集群架构二、MySQL主从同步实践1.创建主节点实例2.创建从节点实例3.修改配置4.开始同步4.测试主从同步效果5.小结 三、MySQL分库分表1.配置sharding-proxy2.测试sharding-proxy3.小结 前言 本篇是谷粒商城集群部署篇,搭建MySQL集群以及分库…

计算机组成原理对于学习嵌入式开发的意义

计算机组成原理对于学习嵌入式开发的意义 前言 最近有位同学向我咨询,问学习嵌入式开发需不需要学习硬件?进而引申到了需不需要学习计算机组成原理呢? 正文 首先计算机组成原理是计算机科学与技术专业的一门核心基础课程,它深入…

npm list -g --depth=0(用来列出全局安装的所有 npm 软件包而不显示它们的依赖项)

您提供的命令 npm list -g --depth0 是在 Node Package Manager (npm) 的上下文中使用的,用来列出全局安装的所有 npm 软件包而不显示它们的依赖项。 这是它的运作方式: npm list -g --depth0-g: 指定列表应包括全局安装的软件包。--depth0: 限制树形结…