Our lab's servers have no internet access, so the only option is to deploy inside Docker: any extra packages have to be downloaded locally first, then uploaded to the server and installed offline.
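For the offline install step, the wheels have to be fetched on a machine that does have internet access. A minimal sketch (the platform and Python version here are assumptions; match them to the server's actual environment):

```shell
# On an internet-connected machine: download wheels for the target platform
pip download -r requirements.txt -d ./wheels \
    --platform manylinux2014_x86_64 --python-version 3.10 --only-binary=:all:

# After copying ./wheels to the offline server: install from the local directory only
pip install --no-index --find-links ./wheels -r requirements.txt
```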
Dockerfile
# Use the PyTorch image as the base image
FROM pytorch/pytorch:2.2.1-cuda12.1-cudnn8-runtime
ARG DEBIAN_FRONTEND=noninteractive
# Set the working directory
WORKDIR /app
# Update the package index and install required tools
RUN apt-get update && apt-get install -y \
    openssh-server \
    vim \
    curl \
    git && \
    rm -rf /var/lib/apt/lists/*
# Install Python dependencies (Aliyun mirror; trusted-host is required because the index URL is plain HTTP)
RUN pip config set global.index-url http://mirrors.aliyun.com/pypi/simple
RUN pip config set install.trusted-host mirrors.aliyun.com
RUN pip install jupyter && \
    pip install --upgrade ipython && \
    ipython kernel install --user
# Install Node.js and pnpm
RUN cd /tmp && \
    curl -fsSL https://deb.nodesource.com/setup_22.x -o nodesource_setup.sh && \
    bash nodesource_setup.sh && \
    apt-get install -y nodejs && \
    rm -f nodesource_setup.sh && \
    node -v
RUN npm config set registry https://registry.npmmirror.com
RUN npm install -g pnpm
RUN node -v && pnpm -v
# Create target directories and copy the dependency lists into the container
RUN mkdir -p /app/GLM-4/composite_demo /app/GLM-4/basic_demo /app/GLM-4/finetune_demo
#COPY ./requirements.txt /app/GLM-4
COPY ./composite_demo/requirements.txt /app/GLM-4/composite_demo
COPY ./basic_demo/requirements.txt /app/GLM-4/basic_demo
COPY ./finetune_demo/requirements.txt /app/GLM-4/finetune_demo
# Install the GLM-4 dependencies
WORKDIR /app/GLM-4
RUN pip install --verbose --use-pep517 -r composite_demo/requirements.txt
RUN pip install --verbose --use-pep517 -r basic_demo/requirements.txt
RUN pip install --verbose --use-pep517 -r finetune_demo/requirements.txt
RUN pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# Quote the requirement so the shell does not treat ">=" as a redirection
RUN pip install "vllm>=0.5.2"
# Expose the port
EXPOSE 8501
Build and package
# Build the image
docker build -t chatglm4:v1.0 .
# Run the container (CPU)
docker run -it -p 8501:8501 chatglm4:v1.0 python
# Run the container (GPU)
docker run -it --gpus all -p 8501:8501 chatglm4:v1.0 python
# Enter the container
docker run -it -p 8501:8501 chatglm4:v1.0 /bin/bash
# Look up the container ID
docker ps -a
# Commit the container (replace 368369a3c853 with the ID you just looked up)
docker commit 368369a3c853 ubuntu:test
# Export the image as a tar archive (the file name is up to you)
docker image save ubuntu:test -o sl_sum.tar
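On the offline server, the exported archive can then be restored with `docker load` (`sl_sum.tar` being the file name chosen above):

```shell
# Copy sl_sum.tar to the server first (e.g. with scp), then restore the image
docker load -i sl_sum.tar
# Verify the image is available
docker images | grep ubuntu
```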
Deploy to the server
Run directly
python -m vllm.entrypoints.openai.api_server \
    --model /root/data1/GLM-4/ZhipuAI/glm-4-9b-chat \
    --served-model-name glm-4-9b-chat \
    --max-model-len=2048 --trust-remote-code --dtype=half --port=8000
--host and --port set the listen address and port.
--model sets the model path (my model is stored at /root/data1/GLM-4/ZhipuAI/glm-4-9b-chat).
--chat-template sets the chat template.
--served-model-name sets the name the model is served under.
--max-model-len sets the maximum context length (glm-4-9b-chat defaults to a maximum of 128k).
Quick test from the command line (with a port mapping of 8000 → 18000 in place):
curl http://10.20.26.187:18000/v1/completions -H "Content-Type: application/json" -d "{\"model\": \"glm-4-9b-chat\", \"prompt\": \"可以介绍一下神经网络吗\", \"max_tokens\": 7, \"temperature\": 0}"
Response:
{"id":"cmpl-512bbfc33b874424a61385b15e75771c","object":"text_completion","created":1732022395,"model":"glm-4-9b-chat","choices":[{"index":0,"text":"?\n\n当然可以。神经网络是一种模仿","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":6,"total_tokens":13,"completion_tokens":7,"prompt_tokens_details":null}}
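The fields of this response can be pulled out programmatically. A small example parsing the exact JSON shown above (note that finish_reason is "length" because max_tokens was capped at 7):

```python
import json

# The raw response returned by the completions endpoint above
raw = '{"id":"cmpl-512bbfc33b874424a61385b15e75771c","object":"text_completion","created":1732022395,"model":"glm-4-9b-chat","choices":[{"index":0,"text":"?\\n\\n当然可以。神经网络是一种模仿","logprobs":null,"finish_reason":"length","stop_reason":null,"prompt_logprobs":null}],"usage":{"prompt_tokens":6,"total_tokens":13,"completion_tokens":7,"prompt_tokens_details":null}}'

resp = json.loads(raw)
text = resp["choices"][0]["text"]            # the generated continuation
finish = resp["choices"][0]["finish_reason"]  # "length": hit the max_tokens cap
total = resp["usage"]["total_tokens"]         # prompt + completion tokens
print(text, finish, total)
```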
Usage
Via vLLM, a minimal template:
vLLM, non-streaming
from openai import OpenAI

client = OpenAI(
    base_url="http://101.201.68.250:8082/v1",
    api_key="123",  # any value works; it only has to pass the interface's parameter check
)
# System prompt
prompt_message = {"role": "system", "content": "你是一个专业的材料科学家。请详细解释什么是高温合金,并讨论它们的应用和特性。"}
completion = client.chat.completions.create(
    model="glm4:9b-chat-fp16",
    messages=[
        prompt_message,  # include the system prompt defined above
        {"role": "user", "content": "你好"}
    ],
    # Extra parameters
    extra_body={
        "stop_token_ids": [151329, 151336, 151338]
    }
)
print(completion.choices[0].message)
vLLM, streaming
Multi-GPU inference additionally requires:
pip install nvidia-nccl-cu12==2.20.5
(see https://blog.csdn.net/weixin_46398647/article/details/139963697)
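For multi-GPU inference, vLLM splits the model across cards via tensor parallelism. A sketch of the launch command from above extended to 2 GPUs (adjust --tensor-parallel-size to your actual GPU count):

```shell
python -m vllm.entrypoints.openai.api_server \
    --model /root/data1/GLM-4/ZhipuAI/glm-4-9b-chat \
    --served-model-name glm-4-9b-chat \
    --max-model-len=2048 --trust-remote-code --dtype=half --port=8000 \
    --tensor-parallel-size 2
```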
import requests

url = "http://101.201.68.250:8082/v1/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": "EMPTY"
}
data = {
    "model": "glm-4-9b-chat",
    "messages": [{"role": "user", "content": "你是谁"}],
    "stream": True
}
response = requests.post(url, headers=headers, json=data, stream=True)
# Iterate over the SSE stream line by line
for chunk in response.iter_lines():
    if chunk:
        print(chunk.decode('utf-8', errors='ignore').strip())
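Each line of the stream arrives as a `data: {...}` SSE record, ending with a `data: [DONE]` sentinel. A small helper can extract just the generated text (the sample chunk below is illustrative of the OpenAI-compatible format, not a captured response):

```python
import json

def parse_sse_line(line: str):
    """Extract the delta content from one `data: {...}` SSE line;
    returns None for non-data lines and the final [DONE] sentinel."""
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    delta = json.loads(payload)["choices"][0]["delta"]
    return delta.get("content")

# A sample chunk in the OpenAI-compatible streaming format
sample = 'data: {"choices":[{"index":0,"delta":{"content":"你好"}}]}'
print(parse_sse_line(sample))          # the generated fragment
print(parse_sse_line("data: [DONE]"))  # None: end of stream
```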
Wrapped as a Flask service:
from flask import Flask, request, jsonify, Response
from urllib.parse import unquote
import requests

app = Flask(__name__)

modelname = "glm-4-9b-chat"  # replace with the model name you are actually serving

@app.route('/operation_get_answer', methods=['GET'])
def get_answer():
    try:
        # Get the user's question
        question = request.args.get('question')
        if not question:
            return jsonify({"error": "missing question parameter"}), 400
        # Decode the question
        decoded_question = unquote(question, encoding='utf-8')
        # Build the request payload
        data = {
            "model": modelname,
            "messages": [{"role": "user", "content": decoded_question}],
            "stream": True
        }
        # Send the request and stream back the response
        url = "http://101.201.68.250:8082/v1/chat/completions"
        headers = {
            "Content-Type": "application/json",
            "Authorization": "EMPTY"  # replace with real credentials if required
        }

        def generate_stream_response():
            with requests.post(url, headers=headers, json=data, stream=True) as r:
                for chunk in r.iter_lines():
                    if chunk:
                        line = chunk.decode('utf-8', errors='ignore').strip()
                        # Upstream lines already carry a "data:" prefix;
                        # strip it before re-framing to avoid "data: data: ..."
                        if line.startswith("data:"):
                            line = line[len("data:"):].strip()
                        yield f"data: {line}\n\n"

        # Return the streaming response
        return Response(generate_stream_response(), mimetype='text/event-stream')
    except Exception as e:
        app.logger.error(str(e))
        return jsonify({"error": "failed to get an answer"}), 500

if __name__ == '__main__':
    app.run(debug=True)
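The endpoint expects the question to be URL-encoded. For example (assuming Flask's default local address, 127.0.0.1:5000):

```python
from urllib.parse import quote, unquote

# Percent-encode a Chinese question for use in the query string
question = "什么是高温合金?"
encoded = quote(question, encoding="utf-8")
url = f"http://127.0.0.1:5000/operation_get_answer?question={encoded}"
print(url)

# The service's unquote() call recovers the original text
assert unquote(encoded, encoding="utf-8") == question
```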