InternLM (书生·浦语) Large Model Practical Camp: LMDeploy Quantized Deployment of LLMs and VLMs in Practice

Create a development machine
Open the InternStudio platform and create a development machine.
Fill in a name for the machine, choose the Cuda12.2-conda image, select the 10% A100*1 GPU option, and click "Create Now". Note: do not choose the Cuda11.7-conda image, since newer versions of lmdeploy have compatibility problems with it.

Create the conda environment for this exercise:

studio-conda -t lmdeploy -o pytorch-2.1.2

Install LMDeploy
Next, activate the virtual environment that was just created:

conda activate lmdeploy

Install lmdeploy version 0.3.0:

pip install lmdeploy[all]==0.3.0


Chatting with a model using LMDeploy (chat)

HuggingFace and TurboMind

  • HuggingFace
    HuggingFace is a fast-growing community; more than 5,000 organizations, including Meta, Google, Microsoft, and Amazon, contribute code, datasets, and models to the HuggingFace open-source community.

Models hosted on HuggingFace are usually stored in the HuggingFace format, abbreviated as the HF format.

However, the HuggingFace servers are hosted overseas and are not always convenient to access from mainland China. Domestic alternatives include Alibaba's ModelScope community and the OpenXLab community built by Shanghai AI Lab; models hosted there also typically use the HF format.

  • TurboMind
    TurboMind is an efficient LLM inference engine developed by the LMDeploy team. Its main features include support for LLaMA-architecture models, continuous-batching inference, and a scalable KV cache manager.

The TurboMind engine can only run models stored in the TurboMind format. When asked to serve an HF-format model, it therefore first converts it to the TurboMind format automatically; in recent LMDeploy versions this conversion is transparent and requires no user action.

  • TurboMind vs. LMDeploy: LMDeploy is an integrated toolkit covering the full range of LLM lightweighting, deployment, and serving; TurboMind is one of LMDeploy's inference engines, i.e. a sub-module. LMDeploy can also use PyTorch as its inference engine.
  • TurboMind (engine) vs. TurboMind (model format): TurboMind is the name of the inference engine, while a "TurboMind model" refers to a model storage format; the TurboMind engine can only run models stored in the TurboMind format.

Download the model
The camp has already placed the common pretrained models in the development machine's shared directory; list them with

ls /root/share/new_models/Shanghai_AI_Laboratory/

Run the following to create a symlink to the model from the shared directory (or use the commented-out cp command to copy it instead):

ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b /root/
# cp -r /root/share/new_models/Shanghai_AI_Laboratory/internlm2-chat-1_8b /root/

In a terminal, run the following to create pipeline_transformer.py and fill it with the code below:

vim /root/pipeline_transformer.py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/root/internlm2-chat-1_8b", trust_remote_code=True)

# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and cause OOM Error.
model = AutoModelForCausalLM.from_pretrained("/root/internlm2-chat-1_8b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()

inp = "hello"
print("[INPUT]", inp)
response, history = model.chat(tokenizer, inp, history=[])
print("[OUTPUT]", response)

inp = "please provide three suggestions about time management"
print("[INPUT]", inp)
response, history = model.chat(tokenizer, inp, history=history)
print("[OUTPUT]", response)


 conda activate lmdeploy

Run the Python script:

python /root/pipeline_transformer.py


Chat with the model using LMDeploy

Use LMDeploy to talk to the model directly.
The general command format for chatting with a model via LMDeploy is

lmdeploy chat [path to HF-format or TurboMind-format model]

Run the downloaded 1.8B model with:

lmdeploy chat /root/internlm2-chat-1_8b

(lmdeploy) root@intern-studio-061925:~# lmdeploy chat /root/internlm2-chat-1_8b
2024-04-11 16:21:37,565 - lmdeploy - WARNING - model_source: hf_model
2024-04-11 16:21:37,565 - lmdeploy - WARNING - kwargs max_batch_size is deprecated to initialize model, use TurbomindEngineConfig instead.
2024-04-11 16:21:37,565 - lmdeploy - WARNING - kwargs cache_max_entry_count is deprecated to initialize model, use TurbomindEngineConfig instead.
2024-04-11 16:21:40,956 - lmdeploy - WARNING - model_config:

[llama]
model_name = internlm2
tensor_para_size = 1
head_num = 16
kv_head_num = 8
vocab_size = 92544
num_layer = 24
inter_size = 8192
norm_eps = 1e-05
attn_bias = 0
start_id = 1
end_id = 2
session_len = 32776
weight_type = bf16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 128
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.8
cache_block_seq_len = 64
cache_chunk_size = -1
num_tokens_per_iter = 0
max_prefill_iters = 1
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
use_logn_attn = 0


2024-04-11 16:21:41,628 - lmdeploy - WARNING - get 195 model params
2024-04-11 16:22:00,321 - lmdeploy - WARNING - Input chat template with model_name is None. Forcing to use internlm2
[WARNING] gemm_config.in is not found; using default GEMM algo
session 1

double enter to end input >>>


Chat with the InternLM2-Chat-1.8B model: type "请给我讲一个小故事吧" (please tell me a short story), then press Enter twice.

double enter to end input >>> 请给我讲一个小故事吧

<|im_start|>system
You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.
<|im_end|>
<|im_start|>user
请给我讲一个小故事吧<|im_end|>
<|im_start|>assistant
 2024-04-11 16:27:46,877 - lmdeploy - WARNING - kwargs ignore_eos is deprecated for inference, use GenerationConfig instead.
2024-04-11 16:27:46,877 - lmdeploy - WARNING - kwargs random_seed is deprecated for inference, use GenerationConfig instead.


当然,我可以给你讲一个关于一只小蝴蝶的故事。

从前,有一只非常美丽的小蝴蝶。它拥有五彩斑斓的翅膀,每只翅膀上都缀满了闪亮的斑点,如同天空中最璀璨的星辰。小蝴蝶非常喜欢飞舞在花丛中,享受着阳光、微风和花香的包围。

然而,有一天,小蝴蝶不小心掉到了一片黑暗的草丛中,它感到害怕和孤独。就在这时,一位智者出现了,他告诉小蝴蝶,要克服恐惧,需要勇气。于是,小蝴蝶鼓起勇气,开始努力地飞向远方。

在接下来的几天里,小蝴蝶不仅学会了飞翔,也学会了在黑暗中寻找光明。它变得越来越自信,不再害怕那些未知的事物。最终,小蝴蝶成功地逃脱了草丛的束缚,飞向了广阔的天空。

小蝴蝶的故事告诉我们,勇气和坚持是克服困难的关键,只要我们相信自己,勇敢前行,就能到达远方。无论是在哪里,只要你充满勇气,你就能找到属于自己的未来。

double enter to end input >>>
<|im_start|>user
<|im_end|>
<|im_start|>assistant
 好的,这是一个非常励志的故事,我很高兴能够为你讲述它。如果你还有其他问题,或者需要更多故事内容,请随时告诉我,我会尽力帮助你。

double enter to end input >>>
<|im_start|>user
<|im_end|>
<|im_start|>assistant
 好的,接下来再给你讲一个关于一只小老鼠的故事。

从前,住在一座古老村庄里的有一只名叫小黄的小老鼠。小黄非常聪明,它常常能够找到宝藏和美味的点心。然而,小黄也有一个缺点,那就是它非常胆小,总是不敢靠近人类。

有一天,村庄里来了一位新的居民,一位年轻的画家。他的画作吸引了每个人的注意力,他们都被他独特的风格和色彩所吸引。小黄很羡慕那些能见到这位新居民的人,它不禁感到有些羡慕和嫉妒。

然而,小黄并没有放弃,它决定要变得更勇敢。它开始学习画,努力寻找灵感。最终,小黄的画作被一位富有的商人看中,他邀请小黄到他家中展示画作。

在商人家的客厅里,小黄看到了最美丽的画作,它惊叹于画家的技艺和创意。然而,当它看到画作上的文字时,小黄才意识到自己原来也可以有自己的才艺。它开始学习文字,开始写自己的故事,并展示给人们欣赏。

在艺术家的帮助下,小黄的画作变得更加丰富多彩,它也变得更加自信。最终,它的才艺被人们知道了,它也成为了村庄里最受欢迎的居民之一。

小黄的故事告诉我们,每个人都有自己的特长和机会,只要我们不畏惧困难和挑战,只要我们坚持努力,最终,我们也能实现自己的梦想。无论我们面临什么困难和挑战,只要我们勇敢地面对,我们就能找到属于自己的成功之路。

double enter to end input >>>

Type "exit" and press Enter twice to quit the conversation.

Going further: the full list of parameters for LMDeploy's chat command can be viewed with -h:

lmdeploy chat -h
(lmdeploy) root@intern-studio-061925:~# lmdeploy chat -h





usage: lmdeploy chat [-h] [--backend {pytorch,turbomind}] [--trust-remote-code] [--meta-instruction META_INSTRUCTION]
                     [--cap {completion,infilling,chat,python}] [--adapters [ADAPTERS ...]] [--tp TP] [--model-name MODEL_NAME]
                     [--session-len SESSION_LEN] [--max-batch-size MAX_BATCH_SIZE] [--cache-max-entry-count CACHE_MAX_ENTRY_COUNT]
                     [--model-format {hf,llama,awq}] [--quant-policy QUANT_POLICY] [--rope-scaling-factor ROPE_SCALING_FACTOR]
                     model_path

Chat with pytorch or turbomind engine.

positional arguments:
  model_path            The path of a model. it could be one of the following options: - i) a local directory path of a turbomind
                        model which is converted by `lmdeploy convert` command or download from ii) and iii). - ii) the model_id of
                        a lmdeploy-quantized model hosted inside a model repo on huggingface.co, such as "internlm/internlm-
                        chat-20b-4bit", "lmdeploy/llama2-chat-70b-4bit", etc. - iii) the model_id of a model hosted inside a model
                        repo on huggingface.co, such as "internlm/internlm-chat-7b", "qwen/qwen-7b-chat ", "baichuan-
                        inc/baichuan2-7b-chat" and so on. Type: str

options:
  -h, --help            show this help message and exit
  --backend {pytorch,turbomind}
                        Set the inference backend. Default: turbomind. Type: str
  --trust-remote-code   Trust remote code for loading hf models. Default: True
  --meta-instruction META_INSTRUCTION
                        System prompt for ChatTemplateConfig. Deprecated. Please use --chat-template instead. Default: None. Type:
                        str
  --cap {completion,infilling,chat,python}
                        The capability of a model. Deprecated. Please use --chat-template instead. Default: chat. Type: str

PyTorch engine arguments:
  --adapters [ADAPTERS ...]
                        Used to set path(s) of lora adapter(s). One can input key-value pairs in xxx=yyy format for multiple lora
                        adapters. If only have one adapter, one can only input the path of the adapter.. Default: None. Type: str
  --tp TP               GPU number used in tensor parallelism. Should be 2^n. Default: 1. Type: int
  --model-name MODEL_NAME
                        The name of the to-be-deployed model, such as llama-7b, llama-13b, vicuna-7b and etc. You can run `lmdeploy
                        list` to get the supported model names. Default: None. Type: str
  --session-len SESSION_LEN
                        The max session length of a sequence. Default: None. Type: int
  --max-batch-size MAX_BATCH_SIZE
                        Maximum batch size. Default: 128. Type: int
  --cache-max-entry-count CACHE_MAX_ENTRY_COUNT
                        The percentage of gpu memory occupied by the k/v cache. Default: 0.8. Type: float

TurboMind engine arguments:
  --tp TP               GPU number used in tensor parallelism. Should be 2^n. Default: 1. Type: int
  --model-name MODEL_NAME
                        The name of the to-be-deployed model, such as llama-7b, llama-13b, vicuna-7b and etc. You can run `lmdeploy
                        list` to get the supported model names. Default: None. Type: str
  --session-len SESSION_LEN
                        The max session length of a sequence. Default: None. Type: int
  --max-batch-size MAX_BATCH_SIZE
                        Maximum batch size. Default: 128. Type: int
  --cache-max-entry-count CACHE_MAX_ENTRY_COUNT
                        The percentage of gpu memory occupied by the k/v cache. Default: 0.8. Type: float
  --model-format {hf,llama,awq}
                        The format of input model. `hf` meaning `hf_llama`, `llama` meaning `meta_llama`, `awq` meaning the
                        quantized model by awq. Default: None. Type: str
  --quant-policy QUANT_POLICY
                        Whether to use kv int8. Default: 0. Type: int
  --rope-scaling-factor ROPE_SCALING_FACTOR
                        Rope scaling factor. Default: 0.0. Type: float

LMDeploy model quantization (lite)

This part covers how to quantize a model, focusing on KV8 quantization and W4A16 quantization. Quantization is a strategy that trades lower precision of parameters or intermediate results for memory savings (and the performance gains that come with them).

  • Compute-bound: most of the inference time is spent on numerical computation. Compute-bound workloads are accelerated mainly by using faster compute hardware.
  • Memory-bound: most of the inference time is spent reading data from memory. Memory-bound workloads are optimized by reducing the number of memory accesses, raising the compute-to-memory-access ratio, or lowering the total amount of data moved.
    Because of their decoder-only architecture, common LLMs spend most of their inference time in the token-by-token generation (decoding) phase, which is a typical memory-bound scenario; a rough back-of-the-envelope check is sketched below.
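
To make the memory-bound claim concrete, here is an order-of-magnitude estimate for a 1.8B-parameter FP16 model decoding one token at batch size 1. The GPU numbers are ballpark A100 figures and the whole calculation is a sketch, not a measured profile.

# Order-of-magnitude check: is single-batch decoding compute-bound or memory-bound?
params = 1.8e9            # model parameters
bytes_per_param = 2       # FP16/BF16 weights

flops_per_token = 2 * params                  # roughly 2 FLOPs per parameter per generated token
bytes_per_token = params * bytes_per_param    # every weight is read once per generated token

arithmetic_intensity = flops_per_token / bytes_per_token   # FLOPs per byte moved
print(f"arithmetic intensity: {arithmetic_intensity:.1f} FLOPs/byte")      # ~1

# Ballpark A100 figures: ~312 TFLOPS (FP16 tensor cores) vs ~2 TB/s HBM bandwidth.
balance_point = 312e12 / 2e12
print(f"GPU compute/bandwidth balance point: ~{balance_point:.0f} FLOPs/byte")
# Decoding sits far below the balance point, so the GPU mostly waits on memory:
# this is exactly the memory-bound situation that KV8 and W4A16 target.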

So how do we address the memory-bound bottleneck in LLM inference? We can use KV8 quantization and W4A16 quantization.

  • KV8 quantization stores the intermediate K and V tensors produced during token-by-token decoding as INT8 (dequantizing them at compute time), which lowers GPU memory usage during generation.

  • W4A16 quantization converts the FP16 model weights to INT4, so the memory traffic for the weights during kernel execution drops to 1/4 of the FP16 model, greatly reducing the memory-access cost. "Weight only" means that only the weights are quantized; the arithmetic is still performed in FP16 (the INT4 weights are dequantized on the fly). A small numeric sketch of group-wise INT4 weight quantization follows.
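
To make the "weights shrink to roughly 1/4" claim concrete, below is a minimal, self-contained sketch of asymmetric, group-wise 4-bit weight quantization with a per-group scale and zero-point. It only illustrates the general W4A16 idea; it is not LMDeploy's or AWQ's actual implementation.

import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 128):
    """Asymmetric 4-bit group-wise quantization of a 2-D weight matrix."""
    out_features, in_features = w.shape
    groups = w.reshape(out_features, in_features // group_size, group_size)
    w_min = groups.min(dim=-1, keepdim=True).values
    w_max = groups.max(dim=-1, keepdim=True).values
    scale = (w_max - w_min) / 15.0                              # 4 bits -> 16 levels (0..15)
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(groups / scale + zero), 0, 15)  # stored as 4-bit integers
    w_hat = (q - zero) * scale                                  # what the kernel reconstructs at compute time
    return q.to(torch.uint8), w_hat.reshape(out_features, in_features)

w = torch.randn(16, 256)
q, w_hat = quantize_int4_groupwise(w, group_size=128)
n_groups = q.shape[0] * q.shape[1]
print("max abs reconstruction error:", (w - w_hat).abs().max().item())
# 0.5 byte per weight plus an fp16 scale and zero-point per group, vs 2 bytes per fp16 weight:
print("storage ratio vs fp16:", (0.5 * w.numel() + 4 * n_groups) / (2 * w.numel()))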

Set the maximum KV Cache size
KV Cache is a caching technique that stores key/value pairs so that earlier computation can be reused, improving throughput and reducing memory consumption. In large-scale training and inference, the KV Cache avoids a large amount of recomputation and therefore speeds up inference. Ideally the entire KV Cache resides in GPU memory for fast access; when GPU memory is insufficient, part of it can be kept in host memory and a cache manager moves the currently needed entries into GPU memory.

At runtime, a model's GPU memory usage falls roughly into three parts:

  • memory occupied by the model parameters themselves,
  • memory occupied by the KV Cache,
  • and memory occupied by intermediate activations.
    LMDeploy's KV Cache manager exposes the --cache-max-entry-count parameter, which sets the maximum fraction of the remaining GPU memory that the KV cache may occupy. The default is 0.8. A rough per-token estimate of the KV cache footprint for this model is sketched below.
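
As a sanity check on why the KV cache matters, here is a rough estimate of its per-token footprint for internlm2-chat-1_8b, using the values printed in model_config above (num_layer=24, kv_head_num=8, size_per_head=128, bf16 = 2 bytes). This is a back-of-the-envelope sketch, not an exact accounting of LMDeploy's block allocator.

# Rough per-token KV cache footprint for internlm2-chat-1_8b (values taken from model_config above).
num_layer = 24        # transformer layers
kv_head_num = 8       # grouped-query attention: 8 KV heads
size_per_head = 128   # head dimension
bytes_per_value = 2   # bf16

# Both K and V are cached, for every layer and every KV head.
bytes_per_token = 2 * num_layer * kv_head_num * size_per_head * bytes_per_value
print(f"KV cache per token: {bytes_per_token / 1024:.1f} KiB")              # ~96 KiB

# A full 32k-token context for a single sequence:
session_len = 32768
print(f"KV cache for one 32k session: {bytes_per_token * session_len / 1024**3:.2f} GiB")  # ~3 GiB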

The examples below show the effect of adjusting --cache-max-entry-count. First run the 1.8B model without the parameter (default 0.8):

lmdeploy chat /root/internlm2-chat-1_8b
(lmdeploy) root@intern-studio-061925:~# lmdeploy chat /root/internlm2-chat-1_8b
2024-04-11 16:38:07,201 - lmdeploy - WARNING - model_source: hf_model
2024-04-11 16:38:07,202 - lmdeploy - WARNING - kwargs max_batch_size is deprecated to initialize model, use TurbomindEngineConfig instead.
2024-04-11 16:38:07,202 - lmdeploy - WARNING - kwargs cache_max_entry_count is deprecated to initialize model, use TurbomindEngineConfig instead.
2024-04-11 16:38:10,353 - lmdeploy - WARNING - model_config:

[llama]
model_name = internlm2
tensor_para_size = 1
head_num = 16
kv_head_num = 8
vocab_size = 92544
num_layer = 24
inter_size = 8192
norm_eps = 1e-05
attn_bias = 0
start_id = 1
end_id = 2
session_len = 32776
weight_type = bf16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 128
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.8
cache_block_seq_len = 64
cache_chunk_size = -1
num_tokens_per_iter = 0
max_prefill_iters = 1
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
use_logn_attn = 0


2024-04-11 16:38:11,885 - lmdeploy - WARNING - get 195 model params
2024-04-11 16:38:30,314 - lmdeploy - WARNING - Input chat template with model_name is None. Forcing to use internlm2
[WARNING] gemm_config.in is not found; using default GEMM algo
session 1

double enter to end input >>>

At this point GPU memory usage is 7816 MB.
Now change the --cache-max-entry-count parameter to 0.5:

lmdeploy chat /root/internlm2-chat-1_8b --cache-max-entry-count 0.5

GPU memory usage drops noticeably, to about 6600 MB.
Next set --cache-max-entry-count to 0.01, which effectively stops the KV Cache from using GPU memory:

lmdeploy chat /root/internlm2-chat-1_8b --cache-max-entry-count 0.01
(lmdeploy) root@intern-studio-061925:~# lmdeploy chat /root/internlm2-chat-1_8b --cache-max-entry-count 0.01


2024-04-11 16:45:14,771 - lmdeploy - WARNING - model_source: hf_model
2024-04-11 16:45:14,772 - lmdeploy - WARNING - kwargs max_batch_size is deprecated to initialize model, use TurbomindEngineConfig instead.
2024-04-11 16:45:14,772 - lmdeploy - WARNING - kwargs cache_max_entry_count is deprecated to initialize model, use TurbomindEngineConfig instead.
2024-04-11 16:45:18,016 - lmdeploy - WARNING - model_config:

[llama]
model_name = internlm2
tensor_para_size = 1
head_num = 16
kv_head_num = 8
vocab_size = 92544
num_layer = 24
inter_size = 8192
norm_eps = 1e-05
attn_bias = 0
start_id = 1
end_id = 2
session_len = 32776
weight_type = bf16
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 0
max_batch_size = 128
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.01
cache_block_seq_len = 64
cache_chunk_size = -1
num_tokens_per_iter = 0
max_prefill_iters = 1
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
use_logn_attn = 0


2024-04-11 16:45:18,734 - lmdeploy - WARNING - get 195 model params
2024-04-11 16:45:38,126 - lmdeploy - WARNING - Input chat template with model_name is None. Forcing to use internlm2
[WARNING] gemm_config.in is not found; using default GEMM algo
session 1

double enter to end input >>> <|im_start|>system
You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.
<|im_end|>
<|im_start|>user
<|im_end|>
<|im_start|>assistant
 2024-04-11 16:45:38,612 - lmdeploy - WARNING - kwargs ignore_eos is deprecated for inference, use GenerationConfig instead.
2024-04-11 16:45:38,613 - lmdeploy - WARNING - kwargs random_seed is deprecated for inference, use GenerationConfig instead.
您好!我可以帮助您解答任何问题,无论是关于生活、科学、技术还是其他任何您感兴趣的主题。请随时向我提问,我会尽力为您提供有用和准确的信息。

double enter to end input >>>
<|im_start|>user
<|im_end|>
<|im_start|>assistant
 您好!我可以帮助您解决任何问题。如果您需要获取任何信息,请告诉我您需要了解什么,我将为您提供尽可能准确和有用的信息。无论是关于生活、科学、技术还是其他任何您感兴趣的主题,我都会尽我所能地为您提供帮助。

double enter to end input >>> hello


<|im_start|>user
hello<|im_end|>
<|im_start|>assistant
 好的,您好!有什么我可以帮助
您的吗?

double enter to end input >>>
<|im_start|>user
<|im_end|>
<|im_start|>assistant
 您好!请问有什么我可以帮您的吗?

double enter to end input >>>

GPU memory usage is now only 4560 MB; the trade-off is slower inference.

Using W4A16 quantization


LMDeploy uses the AWQ algorithm to quantize model weights to 4 bits. The TurboMind inference engine provides highly efficient 4-bit CUDA kernels whose performance is more than 2.4x that of FP16. The following NVIDIA GPUs are supported:

  • Turing (sm75): 20 series, T4
  • Ampere (sm80, sm86): 30 series, A10, A16, A30, A100
  • Ada Lovelace (sm89): 40 series

Before running, install one dependency:

pip install einops==0.7.0

Quantization is done with a single command:

lmdeploy lite auto_awq \
   /root/internlm2-chat-1_8b \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 1024 \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir /root/internlm2-chat-1_8b-4bit

This takes quite a while, so be patient. When quantization finishes, the new HF-format model is saved to the internlm2-chat-1_8b-4bit directory.

(lmdeploy) root@intern-studio-061925:~# lmdeploy lite auto_awq \
>    /root/internlm2-chat-1_8b \
>   --calib-dataset 'ptb' \
>   --calib-samples 128 \
>   --calib-seqlen 1024 \
>   --w-bits 4 \
>   --w-group-size 128 \
>   --work-dir /root/internlm2-chat-1_8b-4bit




Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████| 2/2 [00:33<00:00, 16.93s/it]
Move model.tok_embeddings to GPU.
Move model.layers.0 to CPU.
Move model.layers.1 to CPU.
Move model.layers.2 to CPU.
Move model.layers.3 to CPU.
Move model.layers.4 to CPU.
Move model.layers.5 to CPU.
Move model.layers.6 to CPU.
Move model.layers.7 to CPU.
Move model.layers.8 to CPU.
Move model.layers.9 to CPU.
Move model.layers.10 to CPU.
Move model.layers.11 to CPU.
Move model.layers.12 to CPU.
Move model.layers.13 to CPU.
Move model.layers.14 to CPU.
Move model.layers.15 to CPU.
Move model.layers.16 to CPU.
Move model.layers.17 to CPU.
Move model.layers.18 to CPU.
Move model.layers.19 to CPU.
Move model.layers.20 to CPU.
Move model.layers.21 to CPU.
Move model.layers.22 to CPU.
Move model.layers.23 to CPU.
Move model.norm to GPU.
Move output to CPU.
Loading calibrate dataset ...
/root/.conda/envs/lmdeploy/lib/python3.10/site-packages/datasets/load.py:1461: FutureWarning: The repository for ptb_text_only contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/ptb_text_only
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Downloading builder script: 6.50kB [00:00, 23.2MB/s]
Downloading readme: 4.21kB [00:00, 16.3MB/s]
Downloading data: 5.10MB [00:00, 16.5MB/s]
Downloading data: 400kB [00:06, 57.5kB/s]
Downloading data: 450kB [00:00, 624kB/s]
Generating train split: 100%|█████████████████████████████████████████████████████████| 42068/42068 [00:00<00:00, 94649.96 examples/s]
Generating test split: 100%|███████████████████████████████████████████████████████████| 3761/3761 [00:00<00:00, 107928.88 examples/s]
Generating validation split: 100%|██████████████████████████████████████████████████████| 3370/3370 [00:00<00:00, 94977.28 examples/s]
model.layers.0, samples: 128, max gpu memory: 2.25 GB
model.layers.1, samples: 128, max gpu memory: 2.75 GB
model.layers.2, samples: 128, max gpu memory: 2.75 GB
model.layers.3, samples: 128, max gpu memory: 2.75 GB
model.layers.4, samples: 128, max gpu memory: 2.75 GB
model.layers.5, samples: 128, max gpu memory: 2.75 GB
model.layers.6, samples: 128, max gpu memory: 2.75 GB
model.layers.7, samples: 128, max gpu memory: 2.75 GB
model.layers.8, samples: 128, max gpu memory: 2.75 GB
model.layers.9, samples: 128, max gpu memory: 2.75 GB
model.layers.10, samples: 128, max gpu memory: 2.75 GB
model.layers.11, samples: 128, max gpu memory: 2.75 GB
model.layers.12, samples: 128, max gpu memory: 2.75 GB
model.layers.13, samples: 128, max gpu memory: 2.75 GB
model.layers.14, samples: 128, max gpu memory: 2.75 GB
model.layers.15, samples: 128, max gpu memory: 2.75 GB
model.layers.16, samples: 128, max gpu memory: 2.75 GB
model.layers.17, samples: 128, max gpu memory: 2.75 GB
model.layers.18, samples: 128, max gpu memory: 2.75 GB
model.layers.19, samples: 128, max gpu memory: 2.75 GB
model.layers.20, samples: 128, max gpu memory: 2.75 GB
model.layers.21, samples: 128, max gpu memory: 2.75 GB
model.layers.22, samples: 128, max gpu memory: 2.75 GB
model.layers.23, samples: 128, max gpu memory: 2.75 GB
model.layers.0 smooth weight done.
model.layers.1 smooth weight done.
model.layers.2 smooth weight done.
model.layers.3 smooth weight done.
model.layers.4 smooth weight done.
model.layers.5 smooth weight done.
model.layers.6 smooth weight done.
model.layers.7 smooth weight done.
model.layers.8 smooth weight done.
model.layers.9 smooth weight done.
model.layers.10 smooth weight done.
model.layers.11 smooth weight done.
model.layers.12 smooth weight done.
model.layers.13 smooth weight done.
model.layers.14 smooth weight done.
model.layers.15 smooth weight done.
model.layers.16 smooth weight done.
model.layers.17 smooth weight done.
model.layers.18 smooth weight done.
model.layers.19 smooth weight done.
model.layers.20 smooth weight done.
model.layers.21 smooth weight done.
model.layers.22 smooth weight done.
model.layers.23 smooth weight done.
model.layers.0.attention.wqkv weight packed.
model.layers.0.attention.wo weight packed.
model.layers.0.feed_forward.w1 weight packed.
model.layers.0.feed_forward.w3 weight packed.
model.layers.0.feed_forward.w2 weight packed.
model.layers.1.attention.wqkv weight packed.
model.layers.1.attention.wo weight packed.
model.layers.1.feed_forward.w1 weight packed.
model.layers.1.feed_forward.w3 weight packed.
model.layers.1.feed_forward.w2 weight packed.
model.layers.2.attention.wqkv weight packed.
model.layers.2.attention.wo weight packed.
model.layers.2.feed_forward.w1 weight packed.
model.layers.2.feed_forward.w3 weight packed.
model.layers.2.feed_forward.w2 weight packed.
model.layers.3.attention.wqkv weight packed.
model.layers.3.attention.wo weight packed.
model.layers.3.feed_forward.w1 weight packed.
model.layers.3.feed_forward.w3 weight packed.
model.layers.3.feed_forward.w2 weight packed.
model.layers.4.attention.wqkv weight packed.
model.layers.4.attention.wo weight packed.
model.layers.4.feed_forward.w1 weight packed.
model.layers.4.feed_forward.w3 weight packed.
model.layers.4.feed_forward.w2 weight packed.
model.layers.5.attention.wqkv weight packed.
model.layers.5.attention.wo weight packed.
model.layers.5.feed_forward.w1 weight packed.
model.layers.5.feed_forward.w3 weight packed.
model.layers.5.feed_forward.w2 weight packed.
model.layers.6.attention.wqkv weight packed.
model.layers.6.attention.wo weight packed.
model.layers.6.feed_forward.w1 weight packed.
model.layers.6.feed_forward.w3 weight packed.
model.layers.6.feed_forward.w2 weight packed.
model.layers.7.attention.wqkv weight packed.
model.layers.7.attention.wo weight packed.
model.layers.7.feed_forward.w1 weight packed.
model.layers.7.feed_forward.w3 weight packed.
model.layers.7.feed_forward.w2 weight packed.
model.layers.8.attention.wqkv weight packed.
model.layers.8.attention.wo weight packed.
model.layers.8.feed_forward.w1 weight packed.
model.layers.8.feed_forward.w3 weight packed.
model.layers.8.feed_forward.w2 weight packed.
model.layers.9.attention.wqkv weight packed.
model.layers.9.attention.wo weight packed.
model.layers.9.feed_forward.w1 weight packed.
model.layers.9.feed_forward.w3 weight packed.
model.layers.9.feed_forward.w2 weight packed.
model.layers.10.attention.wqkv weight packed.
model.layers.10.attention.wo weight packed.
model.layers.10.feed_forward.w1 weight packed.
model.layers.10.feed_forward.w3 weight packed.
model.layers.10.feed_forward.w2 weight packed.
model.layers.11.attention.wqkv weight packed.
model.layers.11.attention.wo weight packed.
model.layers.11.feed_forward.w1 weight packed.
model.layers.11.feed_forward.w3 weight packed.
model.layers.11.feed_forward.w2 weight packed.
model.layers.12.attention.wqkv weight packed.
model.layers.12.attention.wo weight packed.
model.layers.12.feed_forward.w1 weight packed.
model.layers.12.feed_forward.w3 weight packed.
model.layers.12.feed_forward.w2 weight packed.
model.layers.13.attention.wqkv weight packed.
model.layers.13.attention.wo weight packed.
model.layers.13.feed_forward.w1 weight packed.
model.layers.13.feed_forward.w3 weight packed.
model.layers.13.feed_forward.w2 weight packed.
model.layers.14.attention.wqkv weight packed.
model.layers.14.attention.wo weight packed.
model.layers.14.feed_forward.w1 weight packed.
model.layers.14.feed_forward.w3 weight packed.
model.layers.14.feed_forward.w2 weight packed.
model.layers.15.attention.wqkv weight packed.
model.layers.15.attention.wo weight packed.
model.layers.15.feed_forward.w1 weight packed.
model.layers.15.feed_forward.w3 weight packed.
model.layers.15.feed_forward.w2 weight packed.
model.layers.16.attention.wqkv weight packed.
model.layers.16.attention.wo weight packed.
model.layers.16.feed_forward.w1 weight packed.
model.layers.16.feed_forward.w3 weight packed.
model.layers.16.feed_forward.w2 weight packed.
model.layers.17.attention.wqkv weight packed.
model.layers.17.attention.wo weight packed.
model.layers.17.feed_forward.w1 weight packed.
model.layers.17.feed_forward.w3 weight packed.
model.layers.17.feed_forward.w2 weight packed.
model.layers.18.attention.wqkv weight packed.
model.layers.18.attention.wo weight packed.
model.layers.18.feed_forward.w1 weight packed.
model.layers.18.feed_forward.w3 weight packed.
model.layers.18.feed_forward.w2 weight packed.
model.layers.19.attention.wqkv weight packed.
model.layers.19.attention.wo weight packed.
model.layers.19.feed_forward.w1 weight packed.
model.layers.19.feed_forward.w3 weight packed.
model.layers.19.feed_forward.w2 weight packed.
model.layers.20.attention.wqkv weight packed.
model.layers.20.attention.wo weight packed.
model.layers.20.feed_forward.w1 weight packed.
model.layers.20.feed_forward.w3 weight packed.
model.layers.20.feed_forward.w2 weight packed.
model.layers.21.attention.wqkv weight packed.
model.layers.21.attention.wo weight packed.
model.layers.21.feed_forward.w1 weight packed.
model.layers.21.feed_forward.w3 weight packed.
model.layers.21.feed_forward.w2 weight packed.
model.layers.22.attention.wqkv weight packed.
model.layers.22.attention.wo weight packed.
model.layers.22.feed_forward.w1 weight packed.
model.layers.22.feed_forward.w3 weight packed.
model.layers.22.feed_forward.w2 weight packed.
model.layers.23.attention.wqkv weight packed.
model.layers.23.attention.wo weight packed.
model.layers.23.feed_forward.w1 weight packed.
model.layers.23.feed_forward.w3 weight packed.
model.layers.23.feed_forward.w2 weight packed.
(lmdeploy) root@intern-studio-061925:~#

Next, run the W4A16-quantized model with the chat command.


(lmdeploy) root@intern-studio-061925:~# lmdeploy chat /root/internlm2-chat-1_8b-4bit --model-format awq


2024-04-11 17:18:54,874 - lmdeploy - WARNING - model_source: hf_model
2024-04-11 17:18:54,874 - lmdeploy - WARNING - kwargs model_format is deprecated to initialize model, use TurbomindEngineConfig instead.
2024-04-11 17:18:54,874 - lmdeploy - WARNING - kwargs max_batch_size is deprecated to initialize model, use TurbomindEngineConfig instead.
2024-04-11 17:18:54,874 - lmdeploy - WARNING - kwargs cache_max_entry_count is deprecated to initialize model, use TurbomindEngineConfig instead.
2024-04-11 17:19:03,771 - lmdeploy - WARNING - model_config:

[llama]
model_name = internlm2
tensor_para_size = 1
head_num = 16
kv_head_num = 8
vocab_size = 92544
num_layer = 24
inter_size = 8192
norm_eps = 1e-05
attn_bias = 0
start_id = 1
end_id = 2
session_len = 32776
weight_type = int4
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 128
max_batch_size = 128
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.8
cache_block_seq_len = 64
cache_chunk_size = -1
num_tokens_per_iter = 0
max_prefill_iters = 1
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
use_logn_attn = 0


2024-04-11 17:19:04,497 - lmdeploy - WARNING - get 267 model params
2024-04-11 17:19:18,963 - lmdeploy - WARNING - Input chat template with model_name is None. Forcing to use internlm2
[WARNING] gemm_config.in is not found; using default GEMM algo
session 1

double enter to end input >>> <|im_start|>system
You are an AI assistant whose name is InternLM (书生·浦语).
- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.
- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.
<|im_end|>
<|im_start|>user
<|im_end|>
<|im_start|>assistant
 2024-04-11 17:19:19,768 - lmdeploy - WARNING - kwargs ignore_eos is deprecated for inference, use GenerationConfig instead.
2024-04-11 17:19:19,768 - lmdeploy - WARNING - kwargs random_seed is deprecated for inference, use GenerationConfig instead.
好的,我可以帮您回答问题。请问您有什么问题需要我回答吗?

double enter to end input >>>
<|im_start|>user
<|im_end|>
<|im_start|>assistant
 当然,我会尽我最大的努力来回答您的问题。有什么事情困扰着您吗?

To see the effect of W4A16 more clearly, set the KV Cache fraction back down to 0.01 and check the GPU memory usage:

lmdeploy chat /root/internlm2-chat-1_8b-4bit --model-format awq --cache-max-entry-count 0.01

(lmdeploy) root@intern-studio-061925:~# lmdeploy chat /root/internlm2-chat-1_8b-4bit --model-format awq --cache-max-entry-count 0.01
2024-04-11 17:22:12,505 - lmdeploy - WARNING - model_source: hf_model
2024-04-11 17:22:12,506 - lmdeploy - WARNING - kwargs model_format is deprecated to initialize model, use TurbomindEngineConfig instead.
2024-04-11 17:22:12,506 - lmdeploy - WARNING - kwargs max_batch_size is deprecated to initialize model, use TurbomindEngineConfig instead.
2024-04-11 17:22:12,506 - lmdeploy - WARNING - kwargs cache_max_entry_count is deprecated to initialize model, use TurbomindEngineConfig instead.
2024-04-11 17:22:18,574 - lmdeploy - WARNING - model_config:

[llama]
model_name = internlm2
tensor_para_size = 1
head_num = 16
kv_head_num = 8
vocab_size = 92544
num_layer = 24
inter_size = 8192
norm_eps = 1e-05
attn_bias = 0
start_id = 1
end_id = 2
session_len = 32776
weight_type = int4
rotary_embedding = 128
rope_theta = 1000000.0
size_per_head = 128
group_size = 128
max_batch_size = 128
max_context_token_num = 1
step_length = 1
cache_max_entry_count = 0.01
cache_block_seq_len = 64
cache_chunk_size = -1
num_tokens_per_iter = 0
max_prefill_iters = 1
extra_tokens_per_iter = 0
use_context_fmha = 1
quant_policy = 0
max_position_embeddings = 32768
rope_scaling_factor = 0.0
use_dynamic_ntk = 0
use_logn_attn = 0


2024-04-11 17:22:19,471 - lmdeploy - WARNING - get 267 model params
2024-04-11 17:22:31,462 - lmdeploy - WARNING - Input chat template with model_name is None. Forcing to use internlm2
[WARNING] gemm_config.in is not found; using default GEMM algo
session 1

double enter to end input >>> hello

GPU memory usage is now 2346 MB, a clear reduction.

LMDeploy serving (serve)

Wrap the model as an API service that clients can access.

Architecturally, the serving pipeline can be split into the following modules:

  • Model inference/serving: runs the model itself. It can usually be decoupled from the business logic so that optimization focuses purely on inference performance, and it can be exposed as a module, an API, and so on.
  • API Server: the middle protocol layer that exposes the backend inference service to the frontend over HTTP, gRPC, or other interfaces.
  • Client: the frontend, where the user interacts, calling the API from a web page or the command line to obtain inference results.
    Note that this split is a fairly complete reference model but is not absolute in practice: "model inference" and the "API Server" are sometimes merged, and sometimes all three stages are packaged into a single service.

Start the API server

Start the API server and serve the internlm2-chat-1_8b model with the following command:

lmdeploy serve api_server \
    /root/internlm2-chat-1_8b \
    --model-format hf \
    --quant-policy 0 \
    --server-name 0.0.0.0 \
    --server-port 23333 \
    --tp 1

(lmdeploy) root@intern-studio-061925:~# lmdeploy serve api_server \
>     /root/internlm2-chat-1_8b \
>     --model-format hf \
>     --quant-policy 0 \
>     --server-name 0.0.0.0 \
>     --server-port 23333 \
>     --tp 1

[WARNING] gemm_config.in is not found; using default GEMM algo
HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT:    Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
INFO:     Started server process [104788]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)

ssh -CNg -L 23333:127.0.0.1:23333 root@ssh.intern-ai.org.cn -p 42659 
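
With port 23333 forwarded to the local machine by the ssh command above, the server can be called over HTTP. The sketch below assumes the server exposes an OpenAI-compatible /v1/chat/completions route (as recent LMDeploy api_server versions do), that the requests package is installed, and that the model name matches what GET /v1/models reports; adjust these as needed.

import requests

resp = requests.post(
    "http://localhost:23333/v1/chat/completions",
    json={
        "model": "internlm2-chat-1_8b",   # model name as reported by GET /v1/models (assumption)
        "messages": [{"role": "user", "content": "请给我讲一个小故事吧"}],
        "max_tokens": 256,
        "temperature": 0.8,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])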


Connect to the API server with the command-line client

Run the command-line client:

lmdeploy serve api_client http://localhost:23333

Chat with the model directly from the command-line window:


(lmdeploy) root@intern-studio-061925:~# lmdeploy serve api_client http://localhost:23333



double enter to end input >>> 好的,我会尽力帮助您。请问您需要什么帮助?
double enter to end input >>> 请问您有什么问题需要我解答吗?
double enter to end input >>> hello

您好,请问有什么需要帮助的吗?
double enter to end input >>>

Connect to the API server from a web client

Use Gradio as the frontend and start the web client:

lmdeploy serve gradio http://localhost:23333 \
    --server-name 0.0.0.0 \
    --server-port 6006

ssh -CNg -L 6006:127.0.0.1:6006 root@ssh.intern-ai.org.cn -p 42659

After forwarding port 6006 locally with the ssh command above, open a browser and visit http://127.0.0.1:6006.

Python code integration

(lmdeploy) root@intern-studio-061925:~# python
Python 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from lmdeploy import pipeline
>>> pipe = pipeline('/root/internlm2-chat-1_8b')
[WARNING] gemm_config.in is not found; using default GEMM algo
>>> response = pipe(['Hi, pls intro yourself', '上海是'])
>>> print(response)
[Response(text='Hello! My name is InternLM, and I am a conversational language model developed by Shanghai AI Laboratory. I am designed to be helpful, honest, and harmless. I can understand and communicate fluently in English and Chinese. If you have any questions or need assistance, feel free to ask me!', generate_token_len=62, input_token_len=108, session_id=0, finish_reason='stop'), Response(text='上海,是中华人民共和国的经济、文化、科技、教育、金融、旅游中心,也是国际交往的重要门户。它位于长江入海口,是中国最大的城市之一,也是全球最年轻的经济体。上海是中国的重要门户,也是中国经济的重要中心。', generate_token_len=52, input_token_len=104, session_id=1, finish_reason='stop')]
>>>


Passing parameters to the TurboMind backend

By passing extra parameters to lmdeploy you can enable quantized inference, set the maximum KV Cache fraction, and so on. In Python code this is done by creating a TurbomindEngineConfig and passing it to the pipeline.

from lmdeploy import pipeline, TurbomindEngineConfig

# lower the k/v cache budget to 20% of total GPU memory
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)

pipe = pipeline('/root/internlm2-chat-1_8b',
                backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', '上海是'])
print(response)
(lmdeploy) root@intern-studio-061925:~# python
Python 3.10.14 (main, Mar 21 2024, 16:24:04) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from lmdeploy import pipeline, TurbomindEngineConfig
>>>
>>> # 调低 k/v cache内存占比调整为总显存的 20%
>>> backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)
>>>
>>> pipe = pipeline('/root/internlm2-chat-1_8b',   backend_config=backend_config)
[WARNING] gemm_config.in is not found; using default GEMM algo
>>> response = pipe(['Hi, pls intro yourself', '上海是'])
>>> print(response)
[Response(text='你好,我是书生·浦语,一个由上海人工智能实验室开发的人工智能助手。我致力于帮助您解答各种问题,并提供有用的信息和建议。我使用了Transformer模型和深度学习技术,并使用语言模型作为预训练任务。如果您有任何问题,请随时向我提问。', generate_token_len=60, input_token_len=108, session_id=0, finish_reason='stop'), Response(text='你好,请问有什么可以帮助你的?', generate_token_len=8, input_token_len=104, session_id=1, finish_reason='stop')]
>>>
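
Sampling parameters can be passed in a similar way. The sketch below is a hedged example that assumes lmdeploy's GenerationConfig and the pipeline's gen_config argument behave as in the version installed here, and it reuses the W4A16 model produced earlier via model_format='awq'.

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

# Engine-level settings: load the AWQ-quantized weights and cap the KV cache at 20%.
backend_config = TurbomindEngineConfig(model_format='awq',
                                       cache_max_entry_count=0.2)

# Request-level settings: sampling parameters for this particular call.
gen_config = GenerationConfig(max_new_tokens=256,
                              top_p=0.8,
                              temperature=0.7)

pipe = pipeline('/root/internlm2-chat-1_8b-4bit', backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', '上海是'], gen_config=gen_config)
print(response)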


Running the vision-language model llava with LMDeploy

Running this pipeline requires at least a 30% InternStudio development machine.

Activate the conda environment:

conda activate lmdeploy
Install the llava dependencies:

pip install git+https://github.com/haotian-liu/LLaVA.git@4e2277a060da264c4f21b364c867cc622c945874

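The original post shows the contents of /root/pipeline_llava.py only as a screenshot. The code below is a hedged reconstruction based on the rest of the post (the same shared llava-v1.6-vicuna-7b path as the Gradio example later, session_len raised for high-resolution images, and a tiger image fetched from GitHub); treat the exact image URL as an assumption.

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Raise session_len because high-resolution images consume many visual tokens.
backend_config = TurbomindEngineConfig(session_len=8192)

# On the InternStudio machine the model is available in the shared directory;
# outside it, 'liuhaotian/llava-v1.6-vicuna-7b' could be used instead.
pipe = pipeline('/share/new_models/liuhaotian/llava-v1.6-vicuna-7b',
                backend_config=backend_config)

# Example tiger image hosted on GitHub (URL is an assumption; replace as needed).
image = load_image('https://raw.githubusercontent.com/open-mmlab/mmdeploy/main/tests/data/tiger.jpeg')
response = pipe(('describe this image', image))
print(response)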
Save the file, then run the pipeline:

python /root/pipeline_llava.py

The script downloads a picture of a tiger from GitHub and produces the following output:

(lmdeploy) root@intern-studio-061925:~# python /root/pipeline_llava.py
[WARNING] gemm_config.in is not found; using default GEMM algo
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
preprocessor_config.json: 100%|████████████████████████████████████████████████| 316/316 [00:00<00:00, 1.89MB/s]
config.json: 4.76kB [00:00, 17.6MB/s]
pytorch_model.bin: 100%|███████████████████████████████████████████████████| 1.71G/1.71G [01:22<00:00, 20.8MB/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 3/3 [00:00<00:00,  3.24it/s]
Response(text="The image shows a tiger lying on the grass. The tiger is facing towards the camera, with its head slightly tilted to the left and its front paws extended outwards. Its fur is a mix of orange and black stripes, typical of a tiger's pattern. The background is a blurred green, suggesting a natural outdoor setting, possibly a grassy field or a forest. The lighting appears to be natural, indicating the photo was taken during the day. The focus is sharp on the tiger's face and front legs, while the background is slightly out of focus, which is a common technique in wildlife photography to draw attention to the subject.", generate_token_len=143, input_token_len=1023, session_id=0, finish_reason='stop')
(lmdeploy) root@intern-studio-061925:~#


The llava model can also be run through Gradio. Create a new Python file, gradio_llava.py:

import gradio as gr
from lmdeploy import pipeline, TurbomindEngineConfig


backend_config = TurbomindEngineConfig(session_len=8192)  # raise session_len when images have high resolution
# pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b', backend_config=backend_config)  # use this line when not running on the dev machine
pipe = pipeline('/share/new_models/liuhaotian/llava-v1.6-vicuna-7b', backend_config=backend_config)

def model(image, text):
    if image is None:
        return [(text, "请上传一张图片。")]
    else:
        response = pipe((text, image)).text
        return [(text, response)]

demo = gr.Interface(fn=model, inputs=[gr.Image(type="pil"), gr.Textbox()], outputs=gr.Chatbot())
demo.launch()   

Run the script:

python /root/gradio_llava.py
(lmdeploy) root@intern-studio-061925:~# python /root/gradio_llava.py
[WARNING] gemm_config.in is not found; using default GEMM algo
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
You are using a model of type llava to instantiate a model of type llava_llama. This is not supported for all configurations of models and can yield errors.
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 3/3 [00:00<00:00,  3.24it/s]
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


Forward port 7860 over ssh:
ssh -CNg -L 7860:127.0.0.1:7860 root@ssh.intern-ai.org.cn -p 42659
Then open http://127.0.0.1:7860/ in a browser.

Running third-party models with LMDeploy

LMDeploy supports not only the InternLM family of models but also many third-party models. The supported models are listed below:

Model Size
Llama 7B - 65B
Llama2 7B - 70B
InternLM 7B - 20B
InternLM2 7B - 20B
InternLM-XComposer 7B
QWen 7B - 72B
QWen-VL 7B
QWen1.5 0.5B - 72B
QWen1.5-MoE A2.7B
Baichuan 7B - 13B
Baichuan2 7B - 13B
Code Llama 7B - 34B
ChatGLM2 6B
Falcon 7B - 180B
YI 6B - 34B
Mistral 7B
DeepSeek-MoE 16B
DeepSeek-VL 7B
Mixtral 8x7B
Gemma 2B-7B
Dbrx 132B
The corresponding HF-format models can be downloaded from ModelScope or OpenXLab; a hedged example of downloading and running one follows.
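
Purely as an illustration (not from the original post), a third-party model could be fetched from ModelScope and run through the lmdeploy pipeline roughly as below. Both the snapshot_download import path and the model id are assumptions, so check them against the ModelScope documentation.

# Hypothetical example: download a third-party chat model from ModelScope and run it with LMDeploy.
from modelscope import snapshot_download   # import path is an assumption
from lmdeploy import pipeline

model_dir = snapshot_download('qwen/Qwen1.5-1.8B-Chat', cache_dir='/root/models')  # model id is an assumption
pipe = pipeline(model_dir)
print(pipe(['请介绍一下你自己']))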

Quantitative comparison of inference speed: LMDeploy vs. the Transformers library

First measure the speed of the Transformers library running Internlm2-chat-1.8b. Create a new Python file named benchmark_transformer.py with the following content:

import torch
import datetime
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("/root/internlm2-chat-1_8b", trust_remote_code=True)

# Set `torch_dtype=torch.float16` to load model in float16, otherwise it will be loaded as float32 and cause OOM Error.
model = AutoModelForCausalLM.from_pretrained("/root/internlm2-chat-1_8b", torch_dtype=torch.float16, trust_remote_code=True).cuda()
model = model.eval()

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i+1))
    response, history = model.chat(tokenizer, inp, history=[])

# test speed
inp = "请介绍一下你自己。"
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response, history = model.chat(tokenizer, inp, history=history)
    total_words += len(response)
end_time = datetime.datetime.now()

delta_time = end_time - start_time
delta_time = delta_time.seconds + delta_time.microseconds / 1000000.0
speed = total_words / delta_time
print("Speed: {:.3f} words/s".format(speed))

Run the script:

python benchmark_transformer.py


(lmdeploy) root@intern-studio-061925:~# python benchmark_transformer.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████| 2/2 [00:14<00:00,  7.44s/it]
Warm up...[1/5]
Warm up...[2/5]
Warm up...[3/5]
Warm up...[4/5]
Warm up...[5/5]
Speed: 53.092 words/s


The Transformers library runs at about 53.092 words/s. Note that the unit is words/s, not tokens/s; word count and token count can be treated as roughly proportional.

Now measure LMDeploy's inference speed. Create a new Python file, benchmark_lmdeploy.py, with the following content:

import datetime
from lmdeploy import pipeline

pipe = pipeline('/root/internlm2-chat-1_8b')

# warmup
inp = "hello"
for i in range(5):
    print("Warm up...[{}/5]".format(i+1))
    response = pipe([inp])

# test speed
inp = "请介绍一下你自己。"
times = 10
total_words = 0
start_time = datetime.datetime.now()
for i in range(times):
    response = pipe([inp])
    total_words += len(response[0].text)
end_time = datetime.datetime.now()

delta_time = end_time - start_time
delta_time = delta_time.seconds + delta_time.microseconds / 1000000.0
speed = total_words / delta_time
print("Speed: {:.3f} words/s".format(speed))

Run the script: python benchmark_lmdeploy.py


LMDeploy runs at about 487.680 words/s, roughly 9 times faster than the Transformers library.
