Weaviate

在这里插入图片描述

文章目录

- 关于 Weaviate
- - 核心功能
  - 部署方式
  - 使用场景
- 快速上手（Python)
- - 1、创建 Weaviate 数据库
  - 2、安装
  - 3、连接到 Weaviate
  - 4、定义数据集
  - 5、添加对象
  - 6、查询
  - - 1）Semantic search
    - 2) Semantic search with a filter
- 使用示例
- - Similarity search
  - LLMs and search
  - Classification
  - Other use cases

关于 Weaviate

Weaviate is a cloud-native, open source vector database that is robust, fast, and scalable.

官网：https://weaviate.io
github : https://github.com/weaviate/weaviate
官方文档：https://weaviate.io/developers/weaviate

核心功能

在这里插入图片描述

部署方式

Multiple deployment options are available to cater for different users and use cases.

All options offer vectorizer and RAG module integration.

在这里插入图片描述

使用场景

Weaviate is flexible and can be used in many contexts and scenarios.

在这里插入图片描述

快速上手（Python)

参考：https://weaviate.io/developers/weaviate/quickstart

1、创建 Weaviate 数据库

你可以在 Weaviate Cloud Services (WCS). 创建一个免费的 cloud sandbox 实例

方式如：https://weaviate.io/developers/wcs/quickstart

从WCS 的Details tab 拿到 API key 和 URL。

2、安装

使用 v4 client， Weaviate 1.23.7 及以上：

pip install -U weaviate-client

使用 v3

pip install "weaviate-client==3.*"

3、连接到 Weaviate

使用步骤一拿到的 API Key 和 URL，以及 OpenAI 的推理 API Key：https://platform.openai.com/signup

运行以下代码：

import weaviate
import weaviate.classes as wvc
import os
import requests
import json

client = weaviate.connect_to_wcs(
    cluster_url=os.getenv("WCS_CLUSTER_URL"),
    auth_credentials=weaviate.auth.AuthApiKey(os.getenv("WCS_API_KEY")),
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"]  # Replace with your inference API key
    }
)

try:
    pass # Replace with your code. Close client gracefully in the finally block.

finally:
    client.close()  # Close client gracefully

import weaviate
import json

client = weaviate.Client(
    url = "https://some-endpoint.weaviate.network",  # Replace with your endpoint
    auth_client_secret=weaviate.auth.AuthApiKey(api_key="YOUR-WEAVIATE-API-KEY"),  # Replace w/ your Weaviate instance API key
    additional_headers = {
        "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY"  # Replace with your inference API key
    }
)

4、定义数据集

Next, we define a data collection (a “class” in Weaviate) to store objects in.

This is analogous to creating a table in relational (SQL) databases.

The following code:

Configures a class object with:
- Name Question
- Vectorizer module text2vec-openai
- Generative module generative-openai
Then creates the class.

    questions = client.collections.create(
        name="Question",
        vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),  # If set to "none" you must always provide vectors yourself. Could be any other "text2vec-*" also.
        generative_config=wvc.config.Configure.Generative.openai()  # Ensure the `generative-openai` module is used for generative queries
    )

class_obj = {
    "class": "Question",
    "vectorizer": "text2vec-openai",  # If set to "none" you must always provide vectors yourself. Could be any other "text2vec-*" also.
    "moduleConfig": {
        "text2vec-openai": {},
        "generative-openai": {}  # Ensure the `generative-openai` module is used for generative queries
    }
}

client.schema.create_class(class_obj)

5、添加对象

You can now add objects to Weaviate. You will be using a batch import (read more) process for maximum efficiency.

The guide covers using the vectorizer defined for the class to create a vector embedding for each object.

The above code:

Loads objects, and
Adds objects to the target class (Question) one by one.

    resp = requests.get('https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json')
    data = json.loads(resp.text)  # Load data

    question_objs = list()
    for i, d in enumerate(data):
        question_objs.append({
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        })

    questions = client.collections.get("Question")
    questions.data.insert_many(question_objs)  # This uses batching under the hood

import requests
import json
resp = requests.get('https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/jeopardy_tiny.json')
data = json.loads(resp.text)  # Load data

client.batch.configure(batch_size=100)  # Configure batch
with client.batch as batch:  # Initialize a batch process
    for i, d in enumerate(data):  # Batch import data
        print(f"importing question: {i+1}")
        properties = {
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        }
        batch.add_data_object(
            data_object=properties,
            class_name="Question"
        )

6、查询

1）Semantic search

Let’s start with a similarity search. A nearText search looks for objects in Weaviate whose vectors are most similar to the vector for the given input text.

Run the following code to search for objects whose vectors are most similar to that of biology.

import weaviate
import weaviate.classes as wvc
import os

client = weaviate.connect_to_wcs(
    cluster_url=os.getenv("WCS_CLUSTER_URL"),
    auth_credentials=weaviate.auth.AuthApiKey(os.getenv("WCS_API_KEY")),
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_APIKEY"]  # Replace with your inference API key
    }
)

try:
    pass # Replace with your code. Close client gracefully in the finally block.
    questions = client.collections.get("Question")

    response = questions.query.near_text(
        query="biology",
        limit=2
    )

    print(response.objects[0].properties)  # Inspect the first object

finally:
    client.close()  # Close client gracefully

import weaviate
import json

client = weaviate.Client(
    url = "https://some-endpoint.weaviate.network",  # Replace with your endpoint
    auth_client_secret=weaviate.auth.AuthApiKey(api_key="YOUR-WEAVIATE-API-KEY"),  # Replace w/ your Weaviate instance API key
    additional_headers = {
        "X-OpenAI-Api-Key": "YOUR-OPENAI-API-KEY"  # Replace with your inference API key
    }
)

response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text({"concepts": ["biology"]})
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=4))

结果如下

{
    "data": {
        "Get": {
            "Question": [
                {
                    "answer": "DNA",
                    "category": "SCIENCE",
                    "question": "In 1953 Watson & Crick built a model of the molecular structure of this, the gene-carrying substance"
                },
                {
                    "answer": "Liver",
                    "category": "SCIENCE",
                    "question": "This organ removes excess glucose from the blood & stores it as glycogen"
                }
            ]
        }
    }
}

2) Semantic search with a filter

You can add Boolean filters to searches. For example, the above search can be modified to only in objects that have a “category” value of “ANIMALS”. Run the following code to see the results:

    questions = client.collections.get("Question")

    response = questions.query.near_text(
        query="biology",
        limit=2,
        filters=wvc.query.Filter.by_property("category").equal("ANIMALS")
    )

    print(response.objects[0].properties)  # Inspect the first object

response = (
    client.query
    .get("Question", ["question", "answer", "category"])
    .with_near_text({"concepts": ["biology"]})
    .with_where({
        "path": ["category"],
        "operator": "Equal",
        "valueText": "ANIMALS"
    })
    .with_limit(2)
    .do()
)

print(json.dumps(response, indent=4))

结果如下：

{
    "data": {
        "Get": {
            "Question": [
                {
                    "answer": "Elephant",
                    "category": "ANIMALS",
                    "question": "It's the only living mammal in the order Proboseidea"
                },
                {
                    "answer": "the nose or snout",
                    "category": "ANIMALS",
                    "question": "The gavial looks very much like a crocodile except for this bodily feature"
                }
            ]
        }
    }
}

更多可见：https://weaviate.io/developers/weaviate/quickstart

使用示例

This page illustrates various use cases for vector databases by way of open-source demo projects. You can fork and modify any of them.

If you would like to contribute your own project to this page, please let us know by creating an issue on GitHub.

Similarity search

https://weaviate.io/developers/weaviate/more-resources/example-use-cases#similarity-search

A vector databases enables fast, efficient similarity searches on and across any modalities, such as text or images, as well as their combinations. Vector database’ similarity search capabilities can be used for other complex use cases, such as recommendation systems in classical machine learning applications.

Title	Description	Modality	Code
Plant search	Semantic search over plants.	Text	Javascript
Wine search	Semantic search over wines.	Text	Python
Book recommender system (Video, Demo)	Find book recommendations based on search query.	Text	TypeScript
Movie recommender system (Blog)	Find similar movies.	Text	Javascript
Multilingual Wikipedia Search	Search through Wikipedia in multiple languages.	Text	TypeScript
Podcast search	Semantic search over podcast episodes.	Text	Python
Video Caption Search	Find the timestamp of the answer to your question in a video.	Text	Python
Facial Recognition	Identify people in images	Image	Python
Image Search over dogs (Blog)	Find images of similar dog breeds based on uploaded image.	Image	Python
Text to image search	Find images most similar to a text query.	Multimodal	Javascript
Text to image and image to image search	Find images most similar to a text or image query.	Multimodal	Python

LLMs and search

https://weaviate.io/developers/weaviate/more-resources/example-use-cases#llms-and-search

Vector databases and LLMs go together like cookies and milk!

Vector databases help to address some of large language models (LLMs) limitations, such as hallucinations, by helping to retrieve the relevant information to provide to the LLM as a part of its input.

Title	Description	Modality	Code
Verba, the golden RAGtriever (Video, Demo)	Retrieval-Augmented Generation (RAG) system to chat with Weaviate documentation and blog posts.	Text	Python
HealthSearch (Blog, Demo)	Recommendation system of health products based on symptoms.	Text	Python
Magic Chat	Search through Magic The Gathering cards	Text	Python
AirBnB Listings (Blog)	Generation of customized advertisements for AirBnB listings with Generative Feedback Loops	Text	Python
Distyll	Summarize text or video content.	Text	Python

Learn more in our LLMs and Search blog post.

Classification

https://weaviate.io/developers/weaviate/more-resources/example-use-cases#classification

Weaviate can leverage its vectorization capabilities to enable automatic, real-time classification of unseen, new concepts based on its semantic understanding.

Title	Description	Modality	Code
Toxic Comment Classification	Clasify whether a comment is toxic or non-toxic.	Text	Python
Audio Genre Classification	Classify the music genre of an audio file.	Image	Python