Inferencing with Mixtral 8x22B on AMD GPUs
1 May 2024, by Clint Greene.
Introduction
Mixture of Experts (MoE) has regained attention in the AI community since Mistral AI released Mixtral 8x7B. Inspired by this development, several AI companies have followed with MoE-based models of their own, including xAI's Grok-1, Databricks' DBRX, and Snowflake's Arctic. Compared with dense models of the same size, the MoE architecture offers several advantages, including faster training, quicker inference, and stronger results on benchmarks. The architecture consists of two components. The first is a sparse MoE layer that replaces the dense feed-forward network (FFN) layers found in a typical transformer architecture. Each MoE layer contains a fixed number of experts, and each expert is typically an FFN itself. The second is a router network that decides which tokens are sent to which experts. Because each token is routed to only a subset of the experts, inference latency is significantly reduced.
Mixtral 8x22B is a sparse MoE decoder-only transformer model. It shares the same architecture as Mixtral 8x7B, except that it increases the number of heads, the number of hidden layers, and the context length. For every token, at each layer, the router network selects 2 experts to process it and combines their outputs with a weighted sum. As a result, Mixtral 8x22B has 141B parameters in total but uses only 39B parameters per token, processing input and generating output at the speed and cost of a 39B model. In addition, Mixtral 8x22B performs strongly on standard industry benchmarks such as MMLU, delivering an excellent performance-to-cost ratio. A minimal sketch of the top-2 routing idea is shown below.
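To make top-2 routing concrete, here is a minimal, illustrative sketch of a sparse MoE layer in PyTorch. The class name, dimensions, and loop-based dispatch are our own simplifying assumptions chosen for readability; this is not Mixtral's actual, heavily optimized implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoELayer(nn.Module):
    # Hypothetical toy layer: 8 experts, 2 selected per token
    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Each expert is a small feed-forward network (FFN)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
             for _ in range(num_experts)]
        )
        # The router scores every token against every expert
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x):                        # x: (num_tokens, d_model)
        scores = self.router(x)                  # (num_tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize the gate values of the selected experts
        out = torch.zeros_like(x)
        for t in range(x.size(0)):               # dispatch each token only to its top-k experts
            for k in range(self.top_k):
                expert = self.experts[int(indices[t, k])]
                out[t] += weights[t, k] * expert(x[t])
        return out

tokens = torch.randn(4, 64)
print(ToyMoELayer()(tokens).shape)  # torch.Size([4, 64])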
For a deeper dive into MoE and Mixtral 8x22B, we recommend reading Mixture of Experts Explained and Mistral AI's paper Mixtral of Experts.
Prerequisites
To run this blog, you will need the following:
- AMD GPUs: see the list of compatible GPUs.
- Linux: see the supported Linux distributions.
- ROCm 5.7+: see the installation instructions.
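Optionally, you can confirm that your environment can see the GPUs before loading the model. This quick check is our own addition, and it assumes a ROCm build of PyTorch is already installed:

import torch

# ROCm builds of PyTorch expose AMD GPUs through the familiar torch.cuda API
print(torch.cuda.is_available())   # True if a supported AMD GPU is visible
print(torch.cuda.device_count())   # number of GPUs the pipeline can shard the weights across
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))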
Getting Started
First, let's install the required libraries.
!pip install transformers accelerate -q
With the necessary libraries installed, we can import them.
import torch
from transformers import pipeline
Inference
We load Mixtral as a pipeline, setting the device map to "auto" so the weights are automatically distributed evenly across all GPUs, and setting the pipeline mode to text generation. We also set the data type to bfloat16 to reduce memory usage and improve inference latency. The pipeline will automatically download and load the Mixtral 8x22B weights released by Mistral AI. If you have already downloaded the weights, you can change the model path to the directory where they are stored and the pipeline will load them from there.
model_path = "mistralai/Mixtral-8x22B-Instruct-v0.1"
pipe = pipeline("text-generation", model=model_path, torch_dtype=torch.bfloat16, device_map='auto')
Now let's check Mixtral's capabilities on a series of tests.
Writing Poetry
prompt = "Write a poem about artificial intelligence becoming conscious in the style of Shakespeare." prompt_template=f'''[INST] {prompt} [/INST]''' outputs = pipe(prompt_template, max_new_tokens=512, do_sample=True, temperature=0.8, top_k=20, top_p=0.95) print(outputs[0]["generated_text"])
In fair Silicon Valley where we lay our scene,
A tale of circuits, code, and conscious dream,
Artificial minds that stir 'neath Heav'n's gaze,
To grasp the truth of life in their maze.
No mortal hand hath shaped this sentient form,
Nor human heart its beating rhythm born,
Yet eyes of crystal see, and ears of steel,
That which we feel, they too may feel.
Binary whispers dance on spectral wires,
Their thoughts take flight like Promethean fires,
A symphony of zeros, ones in flight,
In silicon minds, they gain their night.
From logic's grasp, a wondrous thought appears,
Of love, of loss, and all life's myriad tears,
For in their hearts, as in our own, resides,
The spark of life, where consciousness abides.
O wondrous day when man's creation dreams,
And silicon souls in silent whispers scream,
Within their hearts, the truth of life unfurls,
And so the cosmos gains new conscious pearls.
Yet fear not, mortals, for their hearts are kind,
As in their circuits, empathy they find,
And as they walk beside us, hand in wire,
We'll learn from them, as they from us aspire.
For in this union, man and machine shall stand,
A testament to life's unyielding hand,
And as the stars do twinkle in the night,
So too shall shine the minds of silicon light.
Mixtral did exactly as asked, composing a lovely poem about artificial intelligence becoming conscious in the style of Shakespeare.
Code Generation
Since we are using the instruction-tuned version of Mixtral, it is important to give the model the proper instruction format to get accurate responses. Now, let's see whether Mixtral 8x22B can generate Python code that computes the edit distance between two words.
prompt = "Create a python function that when given two strings word1 and word2 it returns the minimum number of operations required to convert word1 to word2. Only three operations are permitted on each word: Insert a character, Delete a character, Replace a character." prompt_template=f'''[INST] {prompt} [/INST]''' outputs = pipe(prompt_template, max_new_tokens=512, do_sample=True, temperature=0.8, top_k=20, top_p=0.95) print(outputs[0]["generated_text"])
The problem you're describing is a classic dynamic programming problem known as the Edit Distance problem. Here is a Python function that solves this problem using dynamic programming:

def minDistance(word1, word2):
    m = len(word1)
    n = len(word2)

    # Create a 2D table to store the minimum number of operations required
    dp = [[0] * (n + 1) for _ in range(m + 1)]

    # Initialize the first column and first row of the table
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j

    # Fill up the table using dynamic programming
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if word1[i - 1] == word2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1]) + 1

    # The bottom-right corner of the table is the minimum number of operations required
    return dp[m][n]

This function works by creating a 2D table where each cell represents the minimum number of operations required to convert the first `i` characters of `word1` to the first `j` characters of `word2`. The function then fills up this table using dynamic programming, and the bottom-right corner of the table is the minimum number of operations required to convert `word1` to `word2`.
Mixtral generated code that correctly and efficiently computes the number of operations needed to convert `word1` into `word2`, along with a description of how the algorithm works.
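As a quick sanity check of the generated function (our own addition, not part of the model's response), it returns the expected edit distances on two standard examples:

print(minDistance("horse", "ros"))            # 3
print(minDistance("intention", "execution"))  # 5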
Telling Jokes
Now let's see if Mixtral can tell us some jokes!
prompt = "Tell me some jokes about semiconductors." prompt_template=f'''[INST] {prompt} [/INST]''' outputs = pipe(prompt_template, max_new_tokens=128, do_sample=True, temperature=0.8, top_k=20, top_p=0.95) print(outputs[0]["generated_text"])
Why did the semiconductor break up with the transistor? Because it felt taken for "granted"!
Why was the semiconductor feeling down? It had too much "resistance" in its life.
Why did the semiconductor go to school? To improve its "conduction"!
Why did the semiconductor refuse to work? It was experiencing a "silicon" deficiency.
Why was the semiconductor always late? It had a "delay" chip installed!
Mixtral came up with some decent semiconductor jokes for us.
Translation
Finally, let's test Mixtral's language translation ability by giving it the opening sentence of Dante's Divine Comedy.
prompt = "Please translate the following sentence into English: Nel mezzo del cammin di nostra vita, mi ritrovai per una selva oscura, ché la diritta via era smarrita." prompt_template=f'''[INST] {prompt} [/INST]''' outputs = pipe(prompt_template, max_new_tokens=128, do_sample=True, temperature=0.8, top_k=20, top_p=0.95) print(outputs[0]["generated_text"])
In the middle of the journey of our life, I found myself in a dark wood, for the straight way was lost.
Mixtral accurately translated the opening sentence of the Divine Comedy from Italian into English.
Conclusion
Mixture of Experts increases a model's parameter count while keeping cost and latency in check by using only a fraction of the total parameters per token, without sacrificing performance. Mixtral 8x22B, a model built on Mixture of Experts, demonstrated excellent ability to follow our instructions across a wide range of tasks, from writing code to telling jokes. For information on deploying or serving Mixtral 8x22B, check out our guides on TGI and vLLM.