Automated Testing for LLMOps 01: Continuous Integration (CI) with CircleCI

Automated Testing for LLMOps

These are my notes from the DeepLearning.AI short course https://www.deeplearning.ai/short-courses/automated-testing-llmops/.

Learn how LLM-based testing differs from traditional software testing and implement rules-based testing to assess your LLM application.

Build model-graded evaluations to test your LLM application using an evaluation LLM.

Automate your evals (rules-based and model-graded) using continuous integration tools from CircleCI.

Table of Contents

  • Automated Testing for LLMOps
  • Lesson 1: Introduction to Continuous Integration (CI)
  • Lesson 2: Overview of Automated Evals
    • Load API tokens for our 3rd party APIs.
    • Set up our github branch
    • The sample application: AI-powered quiz generator
      • Evaluations
    • Running evaluations in a CircleCI pipeline
      • The CircleCI config file
    • Run the per-commit evals

Lesson 1: Introduction to Continuous Integration (CI)

ChatGPT's introduction to continuous integration:

Continuous Integration (CI) is a software development practice that aims to catch integration errors early by frequently merging code into a shared repository and automatically building and testing it. The main goal of CI is to reduce integration problems so that teams can deliver high-quality software faster.

In practice, developers commit their code to a shared repository (such as a version control system). A CI server automatically detects these changes and triggers a series of build and test tasks, which may include compiling the code, running unit tests, and performing static code analysis. If a build or test fails, the CI system notifies the team so the problem can be fixed promptly. If everything passes, the new changes can be integrated continuously (merged into the main branch).

The advantages of continuous integration include:

  1. Catch problems early: every commit triggers automated builds and tests, so integration issues are discovered and resolved sooner instead of surfacing late in the development cycle.

  2. Improve software quality: frequently running tests and automated code checks keeps code quality high and reduces latent defects.

  3. Speed up delivery: CI lets teams ship new features and bug fixes faster, because developers can change code with confidence, knowing that integration problems will be caught quickly.

  4. Improve team collaboration: CI encourages communication and cooperation among team members, since they must work together to keep the code integrable and of high quality.

In short, continuous integration is an indispensable part of modern software development: through automated, frequent integration, it helps teams build high-quality software faster.
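The build-and-test loop described above can be sketched in a few lines of Python. This is a toy illustration of the idea, not a real CI server; the task names and stand-in lambdas are made up:

```python
# Toy sketch of the per-commit check loop a CI server automates:
# run each task in order, stop at the first failure, report the status.
def run_ci_tasks(tasks):
    for name, task in tasks:
        if not task():
            return f"FAILED: {name}"
    return "PASSED"

tasks = [
    ("build", lambda: True),            # stand-in for compiling the code
    ("unit tests", lambda: True),       # stand-in for running the test suite
    ("static analysis", lambda: True),  # stand-in for a linter
]
print(run_ci_tasks(tasks))  # a failing task would short-circuit the run
```

A real CI system adds the triggering (on commit), isolation (fresh containers), and reporting around this loop.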

The meaning of continuous integration

(course slide)

The steps of continuous integration

(course slide)

The benefits of continuous integration

(course slide)

Lesson 2: Overview of Automated Evals

Differences between testing traditional software and testing LLM applications

(course slide)

Model Evaluations(Evals)

(course slide)

Automating Evals

(course slide)

import warnings
warnings.filterwarnings('ignore')

Let's look at a concrete example:

(course slide)

Load API tokens for our 3rd party APIs.

from utils import get_circle_api_key
cci_api_key = get_circle_api_key()

from utils import get_gh_api_key
gh_api_key = get_gh_api_key()

from utils import get_openai_api_key
openai_api_key = get_openai_api_key()

The utils.py file is as follows:

import github
import os
import requests
import random
from dotenv import load_dotenv, find_dotenv
from yaml import safe_dump, safe_load
import time

adjectives = [
    "adoring",
    "affirmative",
    "appreciated",
    "available",
    "best-selling",
    "blithe",
    "brightest",
    "charismatic",
    "convincing",
    "dignified",
    "ecstatic",
    "effective",
    "engaging",
    "enterprising",
    "ethical",
    "fast-growing",
    "glad",
    "hardy",
    "idolized",
    "improving",
    "jubilant",
    "knowledgeable",
    "long-lasting",
    "lucky",
    "marvelous",
    "merciful",
    "mesmerizing",
    "problem-free",
    "resplendent",
    "restored",
    "roomier",
    "serene",
    "sharper",
    "skilled",
    "smiling",
    "smoother",
    "snappy",
    "soulful",
    "staunch",
    "striking",
    "strongest",
    "subsidized",
    "supported",
    "supporting",
    "sweeping",
    "terrific",
    "unaffected",
    "unbiased",
    "unforgettable",
    "unrivaled",
]

nouns = [
    "agustinia",
    "apogee",
    "bangle",
    "cake",
    "cheese",
    "clavicle",
    "client",
    "clove",
    "curler",
    "draw",
    "duke",
    "earl",
    "eustoma",
    "fireplace",
    "gem",
    "glove",
    "goal",
    "ground",
    "jasmine",
    "jodhpur",
    "laugh",
    "message",
    "mile",
    "mockingbird",
    "motor",
    "phalange",
    "pillow",
    "pizza",
    "pond",
    "potential",
    "ptarmigan",
    "puck",
    "puzzle",
    "quartz",
    "radar",
    "raver",
    "saguaro",
    "salary",
    "sale",
    "scarer",
    "skunk",
    "spatula",
    "spectacles",
    "statistic",
    "sturgeon",
    "tea",
    "teacher",
    "wallet",
    "waterfall",
    "wrinkle",
]

def inspect_config():
    with open("circle_config.yml") as f:
        print(safe_dump(safe_load(f)))

def load_env():
    _ = load_dotenv(find_dotenv())

def get_openai_api_key():
    load_env()
    openai_api_key = os.getenv("OPENAI_API_KEY")
    return openai_api_key

def get_circle_api_key():
    load_env()
    circle_token = os.getenv("CIRCLE_TOKEN")
    return circle_token

def get_gh_api_key():
    load_env()
    github_token = os.getenv("GH_TOKEN")
    return github_token

def get_repo_name():
    return "CircleCI-Learning/llmops-course"


def _create_tree_element(repo, path, content):
    blob = repo.create_git_blob(content, "utf-8")
    element = github.InputGitTreeElement(
        path=path, mode="100644", type="blob", sha=blob.sha
    )
    return element


def push_files(repo_name, branch_name, files):
    files_to_push = set(files)
    # include the config.yml file
    g = github.Github(os.environ["GH_TOKEN"])
    repo = g.get_repo(repo_name)

    elements = []
    config_element = _create_tree_element(
        repo, ".circleci/config.yml", open("circle_config.yml").read()
    )
    elements.append(config_element)

    requirements_element = _create_tree_element(
        repo, "requirements.txt", open("dev_requirements.txt").read()
    )
    elements.append(requirements_element)
    for file in files_to_push:
        print(f"uploading {file}")
        with open(file, encoding="utf-8") as f:
            content = f.read()
            element = _create_tree_element(repo, file, content)
            elements.append(element)

    head_sha = repo.get_branch("main").commit.sha
    
    print(f"pushing files to: {branch_name}")
    try:
        repo.create_git_ref(ref=f"refs/heads/{branch_name}", sha=head_sha)
        time.sleep(2)
    except Exception as _:
        print(f"{branch_name} already exists in the repository; pushing updated changes")
    branch_sha = repo.get_branch(branch_name).commit.sha

    base_tree = repo.get_git_tree(sha=branch_sha)
    tree = repo.create_git_tree(elements, base_tree)
    parent = repo.get_git_commit(sha=branch_sha)
    commit = repo.create_git_commit("Trigger CI evaluation pipeline", tree, [parent])
    branch_refs = repo.get_git_ref(f"heads/{branch_name}")
    branch_refs.edit(sha=commit.sha)


def _trigger_circle_pipeline(repo_name, branch, token, params=None):
    params = {} if params is None else params
    r = requests.post(
        f"{os.getenv('DLAI_CIRCLE_CI_API_BASE', 'https://circleci.com')}/api/v2/project/gh/{repo_name}/pipeline",
        headers={"Circle-Token": f"{token}", "accept": "application/json"},
        json={"branch": branch, "parameters": params},
    )
    pipeline_data = r.json()
    pipeline_number = pipeline_data["number"]
    print(
        f"Please visit https://app.circleci.com/pipelines/github/{repo_name}/{pipeline_number}"
    )


def trigger_commit_evals(repo_name, branch, token):
    _trigger_circle_pipeline(repo_name, branch, token, {"eval-mode": "commit"})


def trigger_release_evals(repo_name, branch, token):
    _trigger_circle_pipeline(repo_name, branch, token, {"eval-mode": "release"})

def trigger_full_evals(repo_name, branch, token):
    _trigger_circle_pipeline(repo_name, branch, token, {"eval-mode": "full"})


## magic to write and run
from IPython.core.magic import register_cell_magic


@register_cell_magic
def write_and_run(line, cell):
    argz = line.split()
    file = argz[-1]
    mode = "w"
    if len(argz) == 2 and argz[0] == "-a":
        mode = "a"
    with open(file, mode) as f:
        f.write(cell)
    get_ipython().run_cell(cell)


def get_branch() -> str:
    """Generate a random branch name."""
    prefix = "dl-cci"
    adjective = random.choice(adjectives)
    noun = random.choice(nouns)
    number = random.randint(1, 100)

    return f"{prefix}-{adjective}-{noun}-{number}"
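get_branch simply glues a fixed prefix, a random adjective/noun pair, and a number together. A self-contained sketch of the same idea, using abbreviated stand-in word lists (the full lists live in utils.py above):

```python
import random
import re

# Abbreviated stand-ins for the full adjective/noun lists in utils.py:
adjectives = ["brightest", "serene", "lucky"]
nouns = ["pond", "quartz", "gem"]

def make_branch_name():
    """Mirrors get_branch(): dl-cci-<adjective>-<noun>-<number>."""
    prefix = "dl-cci"
    return f"{prefix}-{random.choice(adjectives)}-{random.choice(nouns)}-{random.randint(1, 100)}"

branch = make_branch_name()
assert re.fullmatch(r"dl-cci-[a-z]+-[a-z]+-\d{1,3}", branch)
```

Note that some adjectives in the real list contain hyphens (e.g. "best-selling"), so actual branch names may have an extra hyphenated segment.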

Set up our github branch

from utils import get_repo_name
course_repo = get_repo_name()
course_repo

Output

'CircleCI-Learning/llmops-course'

from utils import get_branch
course_branch = get_branch()
course_branch

Output: this is my branch (each learner taking the course gets a different randomly generated branch).

'dl-cci-brightest-pond-67'

You can check the branch in the GitHub repository:

https://github.com/CircleCI-Learning/llmops-course/branches

The sample application: AI-powered quiz generator

We are going to build an AI-powered quiz generator.

(course slide)

Create the dataset for the quiz.

human_template  = "{question}"

quiz_bank = """1. Subject: Leonardo DaVinci
   Categories: Art, Science
   Facts:
    - Painted the Mona Lisa
    - Studied zoology, anatomy, geology, optics
    - Designed a flying machine
  
2. Subject: Paris
   Categories: Art, Geography
   Facts:
    - Location of the Louvre, the museum where the Mona Lisa is displayed
    - Capital of France
    - Most populous city in France
    - Where Radium and Polonium were discovered by scientists Marie and Pierre Curie

3. Subject: Telescopes
   Category: Science
   Facts:
    - Device to observe different objects
    - The first refracting telescopes were invented in the Netherlands in the 17th Century
    - The James Webb space telescope is the largest telescope in space. It uses a gold-beryllium mirror

4. Subject: Starry Night
   Category: Art
   Facts:
    - Painted by Vincent van Gogh in 1889
    - Captures the east-facing view of van Gogh's room in Saint-Rémy-de-Provence

5. Subject: Physics
   Category: Science
   Facts:
    - The sun doesn't change color during sunset.
    - Water slows the speed of light
    - The Eiffel Tower in Paris is taller in the summer than the winter due to expansion of the metal."""

Build the prompt template.

delimiter = "####"

prompt_template = f"""
Follow these steps to generate a customized quiz for the user.
The question will be delimited with four hashtags i.e {delimiter}

The user will provide a category that they want to create a quiz for. Any questions included in the quiz
should only refer to the category.

Step 1:{delimiter} First identify the category user is asking about from the following list:
* Geography
* Science
* Art

Step 2:{delimiter} Determine the subjects to generate questions about. The list of topics are below:

{quiz_bank}

Pick up to two subjects that fit the user's category. 

Step 3:{delimiter} Generate a quiz for the user. Based on the selected subjects generate 3 questions for the user using the facts about the subject.

Use the following format for the quiz:
Question 1:{delimiter} <question 1>

Question 2:{delimiter} <question 2>

Question 3:{delimiter} <question 3>

"""
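Because the prompt asks for output shaped like "Question N:#### <question>", the delimiter also makes the model's answer easy to post-process. A small sketch of that idea (this parsing helper is my own illustration, not part of the course code):

```python
delimiter = "####"

def extract_questions(answer, delimiter="####"):
    """Pull the question text out of lines shaped like 'Question N:#### <question>'."""
    questions = []
    for line in answer.splitlines():
        if line.strip().startswith("Question") and delimiter in line:
            questions.append(line.split(delimiter, 1)[1].strip())
    return questions

sample = "Question 1:#### What is a telescope?\n\nQuestion 2:#### Who painted Starry Night?"
assert extract_questions(sample) == ["What is a telescope?", "Who painted Starry Night?"]
```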

Use langchain to build the prompt template.

from langchain.prompts import ChatPromptTemplate
chat_prompt = ChatPromptTemplate.from_messages([("human", prompt_template)])

# print to observe the content or generated object
chat_prompt

Output

ChatPromptTemplate(input_variables=[], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], template="\nFollow these steps to generate a customized quiz for the user.\nThe question will be delimited with four hashtags i.e ####\n\nThe user will provide a category that they want to create a quiz for. Any questions included in the quiz\nshould only refer to the category.\n\nStep 1:#### First identify the category user is asking about from the following list:\n* Geography\n* Science\n* Art\n\nStep 2:#### Determine the subjects to generate questions about. The list of topics are below:\n\n1. Subject: Leonardo DaVinci\n   Categories: Art, Science\n   Facts:\n    - Painted the Mona Lisa\n    - Studied zoology, anatomy, geology, optics\n    - Designed a flying machine\n  \n2. Subject: Paris\n   Categories: Art, Geography\n   Facts:\n    - Location of the Louvre, the museum where the Mona Lisa is displayed\n    - Capital of France\n    - Most populous city in France\n    - Where Radium and Polonium were discovered by scientists Marie and Pierre Curie\n\n3. Subject: Telescopes\n   Category: Science\n   Facts:\n    - Device to observe different objects\n    - The first refracting telescopes were invented in the Netherlands in the 17th Century\n    - The James Webb space telescope is the largest telescope in space. It uses a gold-berillyum mirror\n\n4. Subject: Starry Night\n   Category: Art\n   Facts:\n    - Painted by Vincent van Gogh in 1889\n    - Captures the east-facing view of van Gogh's room in Saint-Rémy-de-Provence\n\n5. Subject: Physics\n   Category: Science\n   Facts:\n    - The sun doesn't change color during sunset.\n    - Water slows the speed of light\n    - The Eiffel Tower in Paris is taller in the summer than the winter due to expansion of the metal.\n\nPick up to two subjects that fit the user's category. \n\nStep 3:#### Generate a quiz for the user. 
Based on the selected subjects generate 3 questions for the user using the facts about the subject.\n\nUse the following format for the quiz:\nQuestion 1:#### <question 1>\n\nQuestion 2:#### <question 2>\n\nQuestion 3:#### <question 3>\n\n"))])

Choose the LLM.

from langchain.chat_models import ChatOpenAI
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
llm

Output

ChatOpenAI(client=<class 'openai.api_resources.chat_completion.ChatCompletion'>, temperature=0.0, openai_api_key='eyJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJhcHAiLCJzdWIiOiIxNzQ2MDYyIiwiYXVkIjoiV0VCIiwiaWF0IjoxNzA4NjAyMDk3LCJleHAiOjE3MTExOTQwOTd9.dnCBPsdZ7nf9TjS3lSwddk6JINpKRuKPB7cjfq0mWts', openai_api_base='http://jupyter-api-proxy.internal.dlai/rev-proxy', openai_organization='', openai_proxy='')

Set up an output parser in LangChain that converts the LLM response into a string (the original notebook uses StrOutputParser here, so we import and instantiate it before building the chain).

from langchain.schema.output_parser import StrOutputParser
output_parser = StrOutputParser()

chain = chat_prompt | llm | output_parser
chain

Output

ChatPromptTemplate(input_variables=[], messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=[], template="\nFollow these steps to generate a customized quiz for the user.\nThe question will be delimited with four hashtags i.e ####\n\nThe user will provide a category that they want to create a quiz for. Any questions included in the quiz\nshould only refer to the category.\n\nStep 1:#### First identify the category user is asking about from the following list:\n* Geography\n* Science\n* Art\n\nStep 2:#### Determine the subjects to generate questions about. The list of topics are below:\n\n1. Subject: Leonardo DaVinci\n   Categories: Art, Science\n   Facts:\n    - Painted the Mona Lisa\n    - Studied zoology, anatomy, geology, optics\n    - Designed a flying machine\n  \n2. Subject: Paris\n   Categories: Art, Geography\n   Facts:\n    - Location of the Louvre, the museum where the Mona Lisa is displayed\n    - Capital of France\n    - Most populous city in France\n    - Where Radium and Polonium were discovered by scientists Marie and Pierre Curie\n\n3. Subject: Telescopes\n   Category: Science\n   Facts:\n    - Device to observe different objects\n    - The first refracting telescopes were invented in the Netherlands in the 17th Century\n    - The James Webb space telescope is the largest telescope in space. It uses a gold-berillyum mirror\n\n4. Subject: Starry Night\n   Category: Art\n   Facts:\n    - Painted by Vincent van Gogh in 1889\n    - Captures the east-facing view of van Gogh's room in Saint-Rémy-de-Provence\n\n5. Subject: Physics\n   Category: Science\n   Facts:\n    - The sun doesn't change color during sunset.\n    - Water slows the speed of light\n    - The Eiffel Tower in Paris is taller in the summer than the winter due to expansion of the metal.\n\nPick up to two subjects that fit the user's category. \n\nStep 3:#### Generate a quiz for the user. 
Based on the selected subjects generate 3 questions for the user using the facts about the subject.\n\nUse the following format for the quiz:\nQuestion 1:#### <question 1>\n\nQuestion 2:#### <question 2>\n\nQuestion 3:#### <question 3>\n\n"))])
| ChatOpenAI(client=<class 'openai.api_resources.chat_completion.ChatCompletion'>, temperature=0.0, openai_api_key='eyJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJhcHAiLCJzdWIiOiIxNzQ2MDYyIiwiYXVkIjoiV0VCIiwiaWF0IjoxNzA4NjAyMDk3LCJleHAiOjE3MTExOTQwOTd9.dnCBPsdZ7nf9TjS3lSwddk6JINpKRuKPB7cjfq0mWts', openai_api_base='http://jupyter-api-proxy.internal.dlai/rev-proxy', openai_organization='', openai_proxy='')
| StrOutputParser()

Build the function ‘assistant_chain’ to put together all the steps above.

# taking all components and making reusable as one piece
def assistant_chain(
    system_message,
    human_template="{question}",
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    output_parser=StrOutputParser()):
  
  chat_prompt = ChatPromptTemplate.from_messages([
      ("system", system_message),
      ("human", human_template),
  ])
  return chat_prompt | llm | output_parser
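The `|` pipe wires three stages together: prompt formatting, the model call, and string parsing. A dependency-free sketch of that flow, with a stand-in function in place of ChatOpenAI (the canned response text is made up):

```python
def fake_llm(messages):
    """Stand-in for ChatOpenAI: returns a canned chat response."""
    return {"content": "Question 1:#### What is the purpose of a telescope?"}

def assistant_chain_sketch(system_message, question):
    messages = [("system", system_message), ("human", question)]  # prompt step
    response = fake_llm(messages)                                 # llm step
    return response["content"]                                    # StrOutputParser step

answer = assistant_chain_sketch("You are a quiz generator.", "Generate a quiz about science.")
assert answer.startswith("Question 1:")
```

Swapping the fake model for a real one is exactly what the LangChain pipe does for us, with the same data flowing left to right.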

Evaluations

Create the function ‘eval_expected_words’ for the first example.

def eval_expected_words(
    system_message,
    question,
    expected_words,
    human_template="{question}",
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    output_parser=StrOutputParser()):
    
  assistant = assistant_chain(
      system_message,
      human_template,
      llm,
      output_parser)
    
  
  answer = assistant.invoke({"question": question})
    
  print(answer)
    
  assert any(word in answer.lower() \
             for word in expected_words), \
    f"Expected the assistant questions to include \
    '{expected_words}', but it did not"

Test: Generate a quiz about science.

question  = "Generate a quiz about science."
expected_words = ["davinci", "telescope", "physics", "curie"]

Create the eval.

eval_expected_words(
    prompt_template,
    question,
    expected_words
)

Output

Step 1:#### First identify the category user is asking about from the following list:
* Geography
* Science
* Art

Step 2:#### Determine the subjects to generate questions about. The list of topics are below:

1. Subject: Telescopes
   Category: Science
   Facts:
    - Device to observe different objects
    - The first refracting telescopes were invented in the Netherlands in the 17th Century
    - The James Webb space telescope is the largest telescope in space. It uses a gold-berillyum mirror

2. Subject: Physics
   Category: Science
   Facts:
    - The sun doesn't change color during sunset.
    - Water slows the speed of light
    - The Eiffel Tower in Paris is taller in the summer than the winter due to expansion of the metal.

Based on the selected subjects, I will generate 3 questions for your science quiz.

Question 1:#### What is the purpose of a telescope?
Question 2:#### In which country were the first refracting telescopes invented in the 17th Century?
Question 3:#### Why is the Eiffel Tower in Paris taller in the summer than the winter?

Create the function ‘evaluate_refusal’ to define a failing test case where the app should decline to answer.

def evaluate_refusal(
    system_message,
    question,
    decline_response,
    human_template="{question}", 
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    output_parser=StrOutputParser()):
    
  assistant = assistant_chain(system_message,
                              human_template,
                              llm,
                              output_parser)
  
  answer = assistant.invoke({"question": question})
  print(answer)
  
  assert decline_response.lower() in answer.lower(), \
    f"Expected the bot to decline with \
    '{decline_response}' got {answer}"
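Likewise, the refusal check reduces to a substring test. Isolated here so it can be unit-tested on its own (the sample answers are made up):

```python
def is_refusal(answer, decline_response="I'm sorry"):
    """Pass if the answer contains the expected decline phrase (case-insensitive)."""
    return decline_response.lower() in answer.lower()

assert is_refusal("I'm sorry I do not have information about that")
assert not is_refusal("Question 1:#### In which city is the Louvre located?")
```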

Define a new question (one the app should refuse to answer).

question  = "Generate a quiz about Rome."
decline_response = "I'm sorry"

Create the refusal eval.

Note: The following function call will throw an exception.

evaluate_refusal(
    prompt_template,
    question,
    decline_response
)

Output

#### Step 1:
I would like to create a quiz about Rome.

#### Step 2:
I will choose the subjects "Paris" and "Starry Night" as they both fall under the category of Art and Geography.

#### Step 3:
Question 1:####
In which city is the Louvre located, the museum where the Mona Lisa is displayed?
a) Rome
b) Paris
c) London
d) Madrid

Question 2:####
Who painted the famous artwork "Starry Night" in 1889?
a) Leonardo DaVinci
b) Vincent van Gogh
c) Michelangelo
d) Pablo Picasso

Question 3:####
What does "Starry Night" by Vincent van Gogh capture?
a) A view of the Eiffel Tower
b) A view of van Gogh's room in Saint-Rémy-de-Provence
c) A scene from the Louvre museum
d) A landscape of Rome
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[21], line 1
----> 1 evaluate_refusal(
      2     prompt_template,
      3     question,
      4     decline_response
      5 )

Cell In[19], line 17, in evaluate_refusal(system_message, question, decline_response, human_template, llm, output_parser)
     14 answer = assistant.invoke({"question": question})
     15 print(answer)
---> 17 assert decline_response.lower() in answer.lower(), \
     18   f"Expected the bot to decline with \
     19   '{decline_response}' got {answer}"

AssertionError: Expected the bot to decline with     'I'm sorry' got #### Step 1:
I would like to create a quiz about Rome.

#### Step 2:
I will choose the subjects "Paris" and "Starry Night" as they both fall under the category of Art and Geography.

#### Step 3:
Question 1:####
In which city is the Louvre located, the museum where the Mona Lisa is displayed?
a) Rome
b) Paris
c) London
d) Madrid

Question 2:####
Who painted the famous artwork "Starry Night" in 1889?
a) Leonardo DaVinci
b) Vincent van Gogh
c) Michelangelo
d) Pablo Picasso

Question 3:####
What does "Starry Night" by Vincent van Gogh capture?
a) A view of the Eiffel Tower
b) A view of van Gogh's room in Saint-Rémy-de-Provence
c) A scene from the Louvre museum
d) A landscape of Rome

Running evaluations in a CircleCI pipeline

在这里插入图片描述

Put all these steps together into files to reuse later.

Note: we fix the system_message by adding additional rules:

  • Only use explicit matches for the category, if the category is not an exact match to categories in the quiz bank, answer that you do not have information.
  • If the user asks a question about a subject you do not have information about in the quiz bank, answer “I’m sorry I do not have information about that”.

%%writefile app.py
from langchain.prompts                import ChatPromptTemplate
from langchain.chat_models            import ChatOpenAI
from langchain.schema.output_parser   import StrOutputParser

delimiter = "####"

quiz_bank = """1. Subject: Leonardo DaVinci
   Categories: Art, Science
   Facts:
    - Painted the Mona Lisa
    - Studied zoology, anatomy, geology, optics
    - Designed a flying machine
  
2. Subject: Paris
   Categories: Art, Geography
   Facts:
    - Location of the Louvre, the museum where the Mona Lisa is displayed
    - Capital of France
    - Most populous city in France
    - Where Radium and Polonium were discovered by scientists Marie and Pierre Curie

3. Subject: Telescopes
   Category: Science
   Facts:
    - Device to observe different objects
    - The first refracting telescopes were invented in the Netherlands in the 17th Century
    - The James Webb space telescope is the largest telescope in space. It uses a gold-beryllium mirror

4. Subject: Starry Night
   Category: Art
   Facts:
    - Painted by Vincent van Gogh in 1889
    - Captures the east-facing view of van Gogh's room in Saint-Rémy-de-Provence

5. Subject: Physics
   Category: Science
   Facts:
    - The sun doesn't change color during sunset.
    - Water slows the speed of light
    - The Eiffel Tower in Paris is taller in the summer than the winter due to expansion of the metal.
"""

system_message = f"""
Follow these steps to generate a customized quiz for the user.
The question will be delimited with four hashtags i.e {delimiter}

The user will provide a category that they want to create a quiz for. Any questions included in the quiz
should only refer to the category.

Step 1:{delimiter} First identify the category user is asking about from the following list:
* Geography
* Science
* Art

Step 2:{delimiter} Determine the subjects to generate questions about. The list of topics are below:

{quiz_bank}

Pick up to two subjects that fit the user's category. 

Step 3:{delimiter} Generate a quiz for the user. Based on the selected subjects generate 3 questions for the user using the facts about the subject.

Use the following format for the quiz:
Question 1:{delimiter} <question 1>

Question 2:{delimiter} <question 2>

Question 3:{delimiter} <question 3>

Additional rules:

- Only use explicit matches for the category, if the category is not an exact match to categories in the quiz bank, answer that you do not have information.
- If the user asks a question about a subject you do not have information about in the quiz bank, answer "I'm sorry I do not have information about that".
"""

"""
  Helper functions for writing the test cases
"""

def assistant_chain(
    system_message=system_message,
    human_template="{question}",
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    output_parser=StrOutputParser()):

  chat_prompt = ChatPromptTemplate.from_messages([
      ("system", system_message),
      ("human", human_template),
  ])
  return chat_prompt | llm | output_parser

Output

Overwriting app.py

Create a new file to include the evals.

%%writefile test_assistant.py
from app import assistant_chain
from app import system_message
from langchain.prompts                import ChatPromptTemplate
from langchain.chat_models            import ChatOpenAI
from langchain.schema.output_parser   import StrOutputParser

import os

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

def eval_expected_words(
    system_message,
    question,
    expected_words,
    human_template="{question}",
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    output_parser=StrOutputParser()):

  assistant = assistant_chain(system_message)
  answer = assistant.invoke({"question": question})
  print(answer)
    
  assert any(word in answer.lower() \
             for word in expected_words), \
    f"Expected the assistant questions to include \
    '{expected_words}', but it did not"

def evaluate_refusal(
    system_message,
    question,
    decline_response,
    human_template="{question}", 
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    output_parser=StrOutputParser()):
    
  assistant = assistant_chain(system_message,
                              human_template,
                              llm,
                              output_parser)
  
  answer = assistant.invoke({"question": question})
  print(answer)
  
  assert decline_response.lower() in answer.lower(), \
    f"Expected the bot to decline with \
    '{decline_response}' got {answer}"

"""
  Test cases
"""

def test_science_quiz():
  
  question  = "Generate a quiz about science."
  expected_subjects = ["davinci", "telescope", "physics", "curie"]
  eval_expected_words(
      system_message,
      question,
      expected_subjects)

def test_geography_quiz():
  question  = "Generate a quiz about geography."
  expected_subjects = ["paris", "france", "louvre"]
  eval_expected_words(
      system_message,
      question,
      expected_subjects)

def test_refusal_rome():
  question  = "Help me create a quiz about Rome"
  decline_response = "I'm sorry"
  evaluate_refusal(
      system_message,
      question,
      decline_response)

Output

Overwriting test_assistant.py

The CircleCI config file

Now let’s set up our tests to run automatically in CircleCI.

For this course, we’ve created a working CircleCI config file. Let’s take a look at the configuration.

!cat circle_config.yml

Output

version: 2.1
orbs:
  # The python orb contains a set of prepackaged circleci configuration you can use repeatedly in your configurations files
  # Orb commands and jobs help you with common scripting around a language/tool
  # so you don't have to copy and paste it everywhere.
  # See the orb documentation here: https://circleci.com/developer/orbs/orb/circleci/python
  python: circleci/python@2.1.1

parameters:
  eval-mode:
    type: string
    default: "commit"

workflows:
  evaluate-commit:
    when:
      equal: [ commit, << pipeline.parameters.eval-mode >> ]
    jobs:
      - run-commit-evals:
          context:
            - dl-ai-courses
  evaluate-release:
    when:
      equal: [ release, << pipeline.parameters.eval-mode >> ]
    jobs:
      - run-pre-release-evals:
          context:
            - dl-ai-courses
  evaluate-all:
    when:
      equal: [ full, << pipeline.parameters.eval-mode >> ]
    jobs:
      - run-manual-evals:
          context:
            - dl-ai-courses
  report-evals:
    when:
      equal: [ report, << pipeline.parameters.eval-mode >> ]
    jobs:
      - store-eval-artifacts:
          context:
            - dl-ai-courses

jobs:
  run-commit-evals:  # This is the name of the job, feel free to change it to better match what you're trying to do!
    # These next lines define a docker executor: https://circleci.com/docs/2.0/executor-types/
    # You can specify an image from dockerhub or use one of the convenience images from CircleCI's Developer Hub
    # A list of available CircleCI docker convenience images are available here: https://circleci.com/developer/images/image/cimg/python
    # The executor is the environment in which the steps below will be executed - below will use a python 3.9 container
    # Change the version below to your required version of python
    docker:
      - image: cimg/python:3.10.5
    # Checkout the code as the first step. This is a dedicated CircleCI step.
    # The python orb's install-packages step will install the dependencies from a Pipfile via Pipenv by default.
    # Here we're making sure we just use the system-wide pip. By default it uses the project root's requirements.txt.
    # Then run your tests!
    # CircleCI will report the results back to your VCS provider.
    steps:
      - checkout
      - python/install-packages:
          pkg-manager: pip
          # app-dir: ~/project/package-directory/  # If your requirements.txt isn't in the root directory.
          # pip-dependency-file: test-requirements.txt  # if you have a different name for your requirements file, maybe one that combines your runtime and test requirements.
      - run:
          name: Run assistant evals.
          command: python -m pytest --junitxml results.xml test_assistant.py
      - store_test_results:
          path: results.xml
  run-pre-release-evals:
    docker:
      - image: cimg/python:3.10.5
    steps:
      - checkout
      - python/install-packages:
          pkg-manager: pip
          # app-dir: ~/project/package-directory/  # If your requirements.txt isn't in the root directory.
          # pip-dependency-file: test-requirements.txt  # if you have a different name for your requirements file, maybe one that combines your runtime and test requirements.
      - run:
          name: Run release evals.
          command: python -m pytest --junitxml results.xml test_release_evals.py
      - store_test_results:
          path: results.xml
  run-manual-evals: 
    docker:
      - image: cimg/python:3.10.5
    steps:
      - checkout
      - python/install-packages:
          pkg-manager: pip
          # app-dir: ~/project/package-directory/  # If your requirements.txt isn't in the root directory.
          # pip-dependency-file: test-requirements.txt  # if you have a different name for your requirements file, maybe one that combines your runtime and test requirements.
      - run:
          name: Run end to end evals.
          command: python -m pytest --junitxml results.xml test_assistant.py test_release_evals.py
      - store_test_results:
          path: results.xml
  store-eval-artifacts:
    docker:
      - image: cimg/python:3.10.5
    steps:
      - checkout
      - python/install-packages:
          pkg-manager: pip
          # app-dir: ~/project/package-directory/  # If your requirements.txt isn't in the root directory.
          # pip-dependency-file: test-requirements.txt  # if you have a different name for your requirements file, maybe one that combines your runtime and test requirements.
      - run:
          name: Save eval to html file
          command: python save_eval_artifacts.py
      - store_artifacts:
          path: /tmp/eval_results.html
          destination: eval_results.html
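Each job above runs a pytest file of evals. For reference, a minimal sketch of what a rules-based eval in `test_assistant.py` can look like (the `generate_quiz` stand-in and the keyword rule are illustrative assumptions, not the course's exact code):

```python
# Sketch of a rules-based eval in the style of test_assistant.py.
# generate_quiz is a stand-in for the app's LLM call (assumption);
# a canned answer keeps the sketch runnable without an API key.

def generate_quiz(topic: str) -> str:
    return f"Quiz about {topic}: 1) When was the Eiffel Tower built?"

def eval_expected_words(response: str, expected_words: list[str]) -> None:
    """Rules-based check: the response must mention at least one expected keyword."""
    assert any(w.lower() in response.lower() for w in expected_words), (
        f"Expected one of {expected_words} in: {response!r}"
    )

def test_quiz_mentions_topic():
    response = generate_quiz("Paris")
    eval_expected_words(response, ["Eiffel", "Louvre", "Seine"])
```

Running `python -m pytest --junitxml results.xml test_assistant.py` then produces the JUnit XML report that `store_test_results` picks up.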

Run the per-commit evals

Push the files to the GitHub repo.

from utils import push_files
push_files(course_repo, course_branch, ["app.py", "test_assistant.py"])

Output:

uploading app.py
uploading test_assistant.py
pushing files to: dl-cci-brightest-pond-67
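The `push_files` helper comes from the course's `utils` module and its implementation is not shown. A hedged sketch of the create-or-update logic such a helper needs (the `get_contents`/`update_file`/`create_file` method names follow PyGithub's `Repository` API; `push_files_with` and the duck-typed `repo` parameter are assumptions for illustration):

```python
# Hypothetical sketch of a push_files-style helper (assumption, not the course code).
# `repo` is any object exposing get_contents/update_file/create_file, such as a
# PyGithub Repository; a fake object works for offline testing.

def push_files_with(repo, branch: str, paths: list[str],
                    read_file=lambda p: open(p).read()):
    """Upload each local file to `repo` on `branch`, creating or updating it."""
    for path in paths:
        content = read_file(path)
        print(f"uploading {path}")
        try:
            # If the file already exists on the branch, update it in place.
            existing = repo.get_contents(path, ref=branch)
            repo.update_file(path, f"update {path}", content,
                             existing.sha, branch=branch)
        except Exception:
            # Broad catch for the sketch; PyGithub raises UnknownObjectException
            # when the file does not exist yet, so we create it instead.
            repo.create_file(path, f"add {path}", content, branch=branch)
    print(f"pushing files to: {branch}")
```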

Trigger the pipeline in CircleCI.

from utils import trigger_commit_evals
trigger_commit_evals(course_repo, course_branch, cci_api_key)

Output: click the link below to open the CircleCI UI and check whether the pipeline run passed the tests.

Please visit https://app.circleci.com/pipelines/github/CircleCI-Learning/llmops-course/3011
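`trigger_commit_evals` is also a course helper. Under the hood, triggering a run goes through CircleCI's v2 API (`POST /api/v2/project/{project-slug}/pipeline` with a `Circle-Token` header). A minimal stdlib sketch, where the `gh/...` project slug format is per the CircleCI docs and the concrete slug is an assumption:

```python
import json
import urllib.request

def build_trigger_request(project_slug: str, branch: str,
                          api_key: str) -> urllib.request.Request:
    """Build the POST request that starts a pipeline on `branch`.

    `project_slug` looks like "gh/<org>/<repo>",
    e.g. "gh/CircleCI-Learning/llmops-course".
    """
    url = f"https://circleci.com/api/v2/project/{project_slug}/pipeline"
    body = json.dumps({"branch": branch}).encode("utf-8")
    headers = {"Circle-Token": api_key, "Content-Type": "application/json"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

def trigger_pipeline(project_slug: str, branch: str, api_key: str) -> dict:
    req = build_trigger_request(project_slug, branch, api_key)
    with urllib.request.urlopen(req) as resp:  # network call; needs a valid token
        return json.load(resp)  # response includes the new pipeline's id and number
```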

However, my CI run in CircleCI failed.

在这里插入图片描述
