Step-by-Step Guide to Use OpenLLM on AMD GPUs (ROCm Blogs)

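Introduction

OpenLLM is an open-source platform designed to facilitate the deployment and use of large language models (LLMs). It supports a wide range of models for different applications, in both cloud and on-premises environments. In this tutorial, we guide you through starting an LLM server with OpenLLM and interacting with it from your local machine, with an emphasis on leveraging the capabilities of AMD GPUs.

You can find the files related to this blog post in the GitHub folder.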

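Requirements

Operating system, hardware, and software requirements

• AMD GPU: See the ROCm documentation page for the list of supported operating systems and hardware.
• Anaconda: Install Anaconda for Linux.
• ROCm version: 6.0. Refer to the ROCm installation instructions.
• Docker: Docker Engine for Ubuntu.
• OpenLLM: version 0.4.44. See the official documentation.
• vLLM: used as a runtime. See the vLLM official documentation.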

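Preliminaries

To keep the development process smooth and efficient, we split the setup into two steps: first, create a dedicated Python environment for API testing; second, use a Docker image to host our OpenLLM server.

Create a conda environment

Let's start by setting up a Conda environment for OpenLLM. Open your Linux terminal and run the following command.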

conda create --name openllm_env python=3.11

Let's also install OpenLLM and JupyterLab inside the environment. First, activate the environment:

conda activate openllm_env

Then run:

pip install openllm==0.4.44
pip install jupyterlab
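
As an optional sanity check, the following minimal sketch reports which versions of the two packages ended up in the active environment. The helper name `installed_version` is introduced here for illustration only; it relies solely on Python's standard `importlib.metadata` module.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Package names taken from the pip commands in this post.
    for name in ("openllm", "jupyterlab"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

If `openllm` reports something other than 0.4.44, the environment differs from the one this post was written against.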

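OpenLLM runtimes: PyTorch and vLLM

Different LLMs may support multiple runtime implementations, which allow faster computation or a reduced memory footprint. A runtime is the underlying framework that provides the compute resources needed to run an LLM and also handles required tasks such as input processing and response generation.
OpenLLM provides integrated support for several model runtimes, including PyTorch and vLLM. With the PyTorch runtime (backend), OpenLLM performs its computation within the PyTorch framework.
With vLLM as the backend, by contrast, OpenLLM uses a runtime purpose-built to execute and serve LLMs with high throughput and efficient memory management. vLLM is optimized for inference and incorporates enhancements such as continuous batching and PagedAttention, which result in fast prediction times.
OpenLLM lets us choose the desired runtime by passing the option `--backend pt` (PyTorch) or `--backend vllm` (vLLM) when starting the OpenLLM server.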
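
To make the backend choice concrete, here is a small sketch that assembles the `openllm start` command line for a given model and backend. The helper `build_start_command` and the `SUPPORTED_BACKENDS` tuple are hypothetical names introduced for this example; only the `openllm start <model> --backend <pt|vllm>` form comes from this post.

```python
import shlex

# The two backend values named in this post (assumption: other values are rejected here
# only to keep the sketch strict, not because OpenLLM supports nothing else).
SUPPORTED_BACKENDS = ("pt", "vllm")


def build_start_command(model_id: str, backend: str = "pt") -> list:
    """Return the argument list for starting an OpenLLM server with the given backend."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(f"backend must be one of {SUPPORTED_BACKENDS}, got {backend!r}")
    return ["openllm", "start", model_id, "--backend", backend]


if __name__ == "__main__":
    # Print the commands as they would be typed in a shell.
    print(shlex.join(build_start_command("facebook/opt-1.3b", "pt")))
    print(shlex.join(build_start_command("mistralai/Mistral-7B-Instruct-v0.1", "vllm")))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues.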

For more information on the options available in OpenLLM, run `openllm -h` and read OpenLLM's official documentation.

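Building a custom Docker image with PyTorch and vLLM backend support

Let's start by creating a custom Docker image that will serve as the runtime environment for our OpenLLM server. We build this image on top of vLLM's ROCm support.

First, clone the official vLLM GitHub repository. In the same terminal we used to create the Python environment (or in a new one), run the following command: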

git clone https://github.com/vllm-project/vllm.git && cd vllm

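Inside the `vllm` directory, let's modify the contents of the `Dockerfile.rocm` file. Add the following code just before the final `CMD ["/bin/bash"]` instruction (between lines 107 and 109).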

# Install OpenLLM and additional Python packages
RUN python3 -m pip install openllm==0.4.44
RUN python3 -m pip install -U pydantic

# Set the devices that should be visible when running OpenLLM
ENV CUDA_VISIBLE_DEVICES=0

# By default, the OpenLLM server runs on port 3000
EXPOSE 3000
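
If you prefer not to edit the file by hand, the same change can be scripted. The sketch below inserts the extra instructions immediately before the last `CMD` line of a Dockerfile's text. The function `insert_before_last_cmd` is a hypothetical helper, and the sketch assumes the final `CMD` instruction sits on a single line, as it does in the snippet shown in this post.

```python
# Lines to add, copied from the snippet in this post.
OPENLLM_LINES = [
    "# Install OpenLLM and additional Python packages",
    "RUN python3 -m pip install openllm==0.4.44",
    "RUN python3 -m pip install -U pydantic",
    "",
    "# Set the devices that should be visible when running OpenLLM",
    "ENV CUDA_VISIBLE_DEVICES=0",
    "",
    "# By default, the OpenLLM server runs on port 3000",
    "EXPOSE 3000",
]


def insert_before_last_cmd(dockerfile_text: str, extra_lines: list) -> str:
    """Return the Dockerfile text with `extra_lines` placed before the last CMD line."""
    lines = dockerfile_text.splitlines()
    cmd_positions = [i for i, line in enumerate(lines) if line.lstrip().startswith("CMD")]
    if not cmd_positions:
        raise ValueError("no CMD instruction found")
    last_cmd = cmd_positions[-1]
    # Keep a blank line between the inserted block and the CMD instruction.
    patched = lines[:last_cmd] + list(extra_lines) + [""] + lines[last_cmd:]
    return "\n".join(patched) + "\n"


if __name__ == "__main__":
    # Tiny stand-in Dockerfile used only to demonstrate the helper.
    sample = 'FROM base-image\nRUN echo "original steps"\nCMD ["/bin/bash"]\n'
    print(insert_before_last_cmd(sample, OPENLLM_LINES))
```

To apply it to the real file, read `Dockerfile.rocm`, pass its contents through the helper, and write the result back; review the diff before building.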

Our custom `Dockerfile.rocm` should then look like this:

# Default base image
ARG BASE_IMAGE="rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1"

FROM $BASE_IMAGE
...
# Remainder of the original Dockerfile.rocm
...

# Install OpenLLM and additional Python packages
RUN python3 -m pip install openllm==0.4.44
RUN python3 -m pip install -U pydantic

# Set the devices that should be visible when running OpenLLM
ENV CUDA_VISIBLE_DEVICES=0

# By default, the OpenLLM server runs on port 3000
EXPOSE 3000

CMD ["/bin/bash"]

Let's build a new Docker image from the Dockerfile above. Alternatively, you can get the `Dockerfile.rocm` file from here and replace the existing file in the `vllm` folder.

docker build -t openllm_vllm_rocm -f Dockerfile.rocm .

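We name the new image `openllm_vllm_rocm`. Building the image can take some time. If the process completes without errors, a new Docker image will be available on your local system. To verify this, run the following command: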

sudo docker images

The output will contain something similar to:

REPOSITORY                TAG       IMAGE ID       CREATED       SIZE
openllm_vllm_rocm         latest    695ed0675edf   2 hours ago   56.3GB
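
The check can also be automated. The following sketch scans text in the tabular format shown above for a given repository and tag. The function name `image_present` is an assumption made for this example, and the sample text is the listing from this post; in practice you would feed it the captured output of `docker images`.

```python
def image_present(docker_images_output: str, repository: str, tag: str = "latest") -> bool:
    """Return True if a row with the given repository and tag appears in the listing."""
    for line in docker_images_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 2 and fields[0] == repository and fields[1] == tag:
            return True
    return False


# Sample listing copied from this post.
SAMPLE = """REPOSITORY                TAG       IMAGE ID       CREATED       SIZE
openllm_vllm_rocm         latest    695ed0675edf   2 hours ago   56.3GB"""

if __name__ == "__main__":
    print(image_present(SAMPLE, "openllm_vllm_rocm"))
```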

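Starting the server and testing different models

First, let's go back to our original working directory:

cd ..

Start a container with the following command: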

sudo docker run -it --rm -p 3000:3000 --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --shm-size 8G -v $(pwd):/root/bentoml openllm_vllm_rocm

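Let's explain some of the options used in the command that starts the container:
- -p 3000:3000: Publishes a container port to the host. Here it maps port 3000 on the container to port 3000 on the host, so that we can reach the OpenLLM server running on port 3000 inside the container.
- --device=/dev/kfd --device=/dev/dri: Give the container access to specific devices on the host. `--device=/dev/kfd` is associated with AMD GPU devices, and `--device=/dev/dri` is associated with devices that provide direct access to the graphics hardware.
- --group-add=video: Grants the container the permissions it needs to access the video hardware directly.
- -v $(pwd):/root/bentoml: Mounts a volume from the host into the container. It maps the current directory on the host ($(pwd)) to `/root/bentoml` inside the container. When OpenLLM downloads new models, they are stored in the container's `/root/bentoml` directory. Setting up the volume keeps the models on the host, so they do not have to be downloaded again.
- openllm_vllm_rocm: The name of our custom Docker image.

The remaining options configure security preferences, grant additional privileges, and adjust resource usage.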
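
The options described above can be laid out one per line, which makes it easier to see what each contributes. This sketch rebuilds the same argument list in Python; `build_docker_run` and its parameters are hypothetical names, `sudo` is left out, and every flag is copied from the command in this post.

```python
import shlex


def build_docker_run(image: str, host_dir: str, port: int = 3000, shm_size: str = "8G") -> list:
    """Return the `docker run` argument list used in this post, one option per entry."""
    return [
        "docker", "run", "-it", "--rm",
        "-p", f"{port}:{port}",                 # publish the OpenLLM server port
        "--cap-add=SYS_PTRACE",
        "--security-opt", "seccomp=unconfined",
        "--device=/dev/kfd",                    # AMD GPU device
        "--device=/dev/dri",                    # direct access to graphics hardware
        "--group-add=video",
        "--ipc=host",
        "--shm-size", shm_size,
        "-v", f"{host_dir}:/root/bentoml",      # keep downloaded models on the host
        image,
    ]


if __name__ == "__main__":
    # "/path/to/workdir" is a placeholder for the output of $(pwd).
    print(shlex.join(build_docker_run("openllm_vllm_rocm", "/path/to/workdir")))
```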

Serving the `facebook/opt-1.3b` model

Let's start an OpenLLM server with the `facebook/opt-1.3b` model and the PyTorch backend. Inside the running container, use the following command:

openllm start facebook/opt-1.3b --backend pt

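The command above starts an OpenLLM server with the `facebook/opt-1.3b` model and the PyTorch backend (`--backend pt`). It also downloads the model automatically if it is not already present. OpenLLM supports many models; consult the official documentation for the list of supported models.

If the server starts successfully, you will see output similar to this: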

🚀Tip: run 'openllm build facebook/opt-1.3b --backend pt --serialization legacy' to create a BentoLLM for 'facebook/opt-1.3b'
2024-04-11T17:04:18+0000 [INFO] [cli] Prometheus metrics for HTTP BentoServer from "_service:svc" can be accessed at http://localhost:3000/metrics.
2024-04-11T17:04:20+0000 [INFO] [cli] Starting production HTTP BentoServer from "_service:svc" listening on http://0.0.0.0:3000 (Press CTRL+C to quit)
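
Besides reading the log, you can confirm from the host that something is listening on the published port. The sketch below attempts a plain TCP connection; `port_is_open` is a hypothetical helper built on Python's standard `socket` module, and a successful connection only shows that the port accepts connections, not that the model has finished loading.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Port 3000 is the default OpenLLM server port mentioned in this post.
    print("server reachable:", port_is_open("localhost", 3000))
```

The same helper is handy before starting a new server, to verify that a previous one has actually released the port.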

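The `openllm start` command above starts the server on the default port 3000 (http://0.0.0.0:3000/). If the model is not present, OpenLLM downloads it to `/root/bentoml` in the container.
With the server running, we can interact with it through the web UI at http://0.0.0.0:3000/ or through OpenLLM's built-in Python client.
Let's try the Python client. For this we use the Python environment created earlier. Open a new terminal and activate the environment: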

conda activate openllm_env

Finally, start a new JupyterLab session:

jupyter lab

In the notebook, run:

import openllm

# Synchronous API
client = openllm.HTTPClient('http://localhost:3000', timeout=120)

# Stream the generated text
for it in client.generate_stream('What is a Large Language Model?', max_new_tokens=120):
  print(it.text, end="")

The output will be similar to:

A Large Language Model (LLM) is a model that uses a large number of input languages to produce a large number of output languages. The number of languages used in the model is called the language model size.

In a large language model, the number of input languages is typically large, but not necessarily unlimited. For example, a large language model can be used to model the number of languages that a person can speak, or the number of languages that a person can read, or the number of languages that a person can understand.

A large language model can be used to model the number of languages that a
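
The streaming loop prints each chunk as it arrives. If you also want the complete response as a single string (for logging or post-processing), you can join the chunks. The sketch below shows the pattern without a running server: `FakeChunk` and `fake_stream` are hypothetical stand-ins that mirror only the `.text` attribute used in this post, and `collect_text` is a helper name introduced here.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class FakeChunk:
    """Hypothetical stand-in for a streamed chunk; only the `.text` attribute is modeled."""
    text: str


def fake_stream(pieces: Iterable[str]) -> Iterator[FakeChunk]:
    """Yield chunks one at a time, imitating a streaming response."""
    for piece in pieces:
        yield FakeChunk(piece)


def collect_text(stream: Iterable) -> str:
    """Concatenate the `.text` of every chunk in the stream."""
    return "".join(chunk.text for chunk in stream)


full = collect_text(fake_stream(["A Large ", "Language ", "Model"]))
print(full)
```

With a live server, the same helper would be called as `collect_text(client.generate_stream(prompt, max_new_tokens=120))`, assuming the chunks expose `.text` as in the loop above.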

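Serving the `databricks/dolly-v2-3b` model

Let's try a different model and reduce the randomness of generation by setting a lower temperature. First, stop the previous server (you may also need to run `kill -9 $(lsof -ti :3000)` to kill any processes on port 3000), then start a new server with the following command: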

openllm start databricks/dolly-v2-3b --backend pt --temperature=0.1
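
To see why a lower temperature reduces randomness, consider the standard temperature-scaled softmax, in which logits are divided by the temperature before being turned into probabilities. The sketch below is a generic illustration of that formula with made-up logits; it is not taken from OpenLLM's code, and how the `--temperature` option is applied internally is not covered in this post.

```python
import math


def softmax_with_temperature(logits: list, temperature: float) -> list:
    """Convert logits to probabilities after dividing them by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5]  # illustrative values for three candidate tokens
    for t in (1.0, 0.1):
        probs = softmax_with_temperature(logits, t)
        print(t, [round(p, 4) for p in probs])
```

At temperature 0.1 almost all of the probability mass lands on the highest-scoring token, so sampling becomes close to deterministic.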

Now let's test this model. Return to the Jupyter notebook and run:

import openllm

# Synchronous API
client = openllm.HTTPClient('http://localhost:3000', timeout=120)

# Stream the generated text
for it in client.generate_stream('What industry is Advanced Micro Devices part of?', max_new_tokens=120):
  print(it.text, end="")

The output will be similar to:

AMD is a semiconductor company based in Sunnyvale, California. AMD designs and manufactures microprocessors, GPUs (graphics processing units), and memory controllers. AMD's largest product line is its microprocessors for personal computers and video game consoles. AMD's Radeon graphics processing unit (GPU) is used in many personal computers, video game consoles, and televisions.

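Serving the `Mistral-7B-Instruct-v0.1` model

Finally, let's serve a more capable model with vLLM as the backend (using the `--backend vllm` option). Stop the previous server, then run the following command: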

openllm start mistralai/Mistral-7B-Instruct-v0.1 --backend vllm

Then test it by running the following code:

import openllm

# Synchronous API
client = openllm.HTTPClient('http://localhost:3000', timeout=120)

# Stream the generated text
for it in client.generate_stream('Create the python code for an autoencoder neural network', max_new_tokens=1000):
  print(it.text, end="")

The output will be similar to:

from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy
from tensorflow.keras.metrics import SparseCategoricalAccuracy

# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Normalize the pixel values to be between 0 and 1
x_train = x_train / 255.0
x_test = x_test / 255.0

# Define the autoencoder model
model = Sequential([
    Flatten(input_shape=(28, 28)), # Flatten the input image of size 28x28
    Dense(128, activation='relu'), # Add a dense layer with 128 neurons and ReLU activation
    Dense(64, activation='relu'), # Add another dense layer with 64 neurons and ReLU activation
    Dense(128, activation='relu'), # Add a third dense layer with 128 neurons and ReLU activation
    Flatten(input_shape=(128,)), # Flatten the output of the previous dense layer
    Dense(10, activation='softmax') # Add a final dense layer with 10 neurons (for each digit) and softmax activation
])

# Define the optimizer, loss function, and metric
optimizer = Adam(learning_rate=0.001)
loss_function = SparseCategoricalCrossentropy(from_logits=True) # From logits
metric = SparseCategoricalAccuracy()

# Compile the model
model.compile(optimizer=optimizer,
              loss=loss_function,
              metrics=[metric])

# Train the model on the training data
model.fit(x_train, y_train, epochs=10, batch_size=32)

# Evaluate the model on the test data
test_loss, test_acc = model.evaluate(x_test, y_test)
print(f'Test accuracy: {test_acc}')