LLM记录：五一 Llama 3 超级课堂

想玩大模型，自己又没那个环境，参加五一 Llama 3 超级课堂，简单记录一下llama3-8b的相关体验，实在是邀请不到人，还好后面开放了24G显存，好歹模型能跑起来了，只能说感谢大佬！

Llama 3 超级课堂 git地址：https://github.com/SmartFlowAI/Llama3-Tutorial/

第一节：Llama 3 本地 Web Demo 部署

https://github.com/SmartFlowAI/Llama3-Tutorial/blob/main/docs/hello_world.md

比较简单的操作：

就是按照文档按照环境，克隆下源码，启动运行一下就可以了

我这边遇到一个小问题：

说软连接的目录找不到config.json文件，干脆直接改成模型路径好了

streamlit run ~/Llama3-Tutorial/tools/internstudio_web_demo.py \
  ~/model/Meta-Llama-3-8B-Instruct

在这里插入图片描述

streamlit run ~/Llama3-Tutorial/tools/internstudio_web_demo.py /root/share/new_models/meta-llama/Meta-Llama-3-8B-Instruct/

在这里插入图片描述

周杰伦（但有点出入）

在这里插入图片描述

第二节：Llama 3 微调个人小助手认知（XTuner 版）

https://github.com/SmartFlowAI/Llama3-Tutorial/blob/main/docs/assistant.md

数据集准备：稍微做点修改，gdata.py文件里面把名字改成我自己的了

configs/assistant/llama3_8b_instruct_qlora_assistant.py

此文件里面把软连接的模型路径换成了实际的路径/root/share/new_models/meta-llama/Meta-Llama-3-8B-Instruct/

后面就是按照步骤微调

cd ~/Llama3-Tutorial

# 开始训练,使用 deepspeed 加速，A100 40G显存 耗时24分钟
xtuner train configs/assistant/llama3_8b_instruct_qlora_assistant.py --work-dir /root/llama3_pth

# Adapter PTH 转 HF 格式
xtuner convert pth_to_hf /root/llama3_pth/llama3_8b_instruct_qlora_assistant.py \
  /root/llama3_pth/iter_500.pth \
  /root/llama3_hf_adapter

# 模型合并
export MKL_SERVICE_FORCE_INTEL=1
xtuner convert merge /root/model/Meta-Llama-3-8B-Instruct \
  /root/llama3_hf_adapter\
  /root/llama3_hf_merged

qlora微调时显存大概在12GB左右

大概20min不到自我认知微调结束

合并还是那个软链接的问题，改成实际路径就行

xtuner convert merge /root/share/new_models/meta-llama/Meta-Llama-3-8B-Instruct /root/llama3_hf_adapter /root/llama3_hf_merged

然后部署

好吧，显然微调完毕后只能回答这一句话了。估计是2000条同样的数据导致过拟合了。想真正调出一个良好的模型好像还不容易。

第三节：Llama 3 图片理解能力微调（XTuner+LLaVA 版）

https://github.com/SmartFlowAI/Llama3-Tutorial/blob/main/docs/llava.md

由于哥们只有24GB显存玩不了一点，就不做这个多模态的了。跳过！

第四节：Llama 3 高效部署实践（LMDeploy 版）

https://github.com/SmartFlowAI/Llama3-Tutorial/blob/main/docs/lmdeploy.md

也是按照文档来就行

LMDeploy Chat CLI 工具

部署改成模型的路径

lmdeploy chat /root/share/new_models/meta-llama/Meta-Llama-3-8B-Instruct

在这里插入图片描述

LMDeploy模型量化(lite)

设置最大KV Cache缓存大小

lmdeploy chat /root/share/new_models/meta-llama/Meta-Llama-3-8B-Instruct --cache-max-entry-count 0.01

在这里插入图片描述

推理速度也还好没有很慢

使用W4A16量化

lmdeploy lite auto_awq \
   /root/share/new_models/meta-llama/Meta-Llama-3-8B-Instruct \
  --calib-dataset 'ptb' \
  --calib-samples 128 \
  --calib-seqlen 1024 \
  --w-bits 4 \
  --w-group-size 128 \
  --work-dir /root/model/Meta-Llama-3-8B-Instruct_4bit

这块耗时蛮久的可以干别的去

恩，确实6GB显存就可以推理了

在这里插入图片描述

回答速度也挺快的，量化后精度有所下降，不过简简单单的问题感知不高

在这里插入图片描述

LMDeploy服务（serve）

API启动

lmdeploy serve api_server \
    /root/share/new_models/meta-llama/Meta-Llama-3-8B-Instruct \
    --model-format hf \
    --quant-policy 0 \
    --server-name 0.0.0.0 \
    --server-port 23333 \
    --tp 1

启动后转发一下端口

ssh -CNg -L 23333:127.0.0.1:23333 root@ssh.intern-ai.org.cn -p 48212

在这里插入图片描述

给出了api接口，尝试用postman测试一下，调通了但是model应该是要固定传响应模型的type，这里就不管了，安装教程继续

在这里插入图片描述

客户端

有命令行和web端，起了gradio的web端看下，都是大佬写好的，跟着操作就没什么问题

在这里插入图片描述

第五节：Llama 3 Agent 能力体验与微调

https://github.com/SmartFlowAI/Llama3-Tutorial/blob/main/docs/agent.md

前面按部就班操作，到下面这步前先运行pip install deepspeed再继续操作，我这里还是改掉了模型软链接路径。

export MKL_SERVICE_FORCE_INTEL=1
xtuner train ~/Llama3-Tutorial/configs/llama3-agentflan/llama3_8b_instruct_qlora_agentflan_3e.py --work-dir ~/llama3_agent_pth --deepspeed deepspeed_zero2

好吧，大佬已经训练好给出了，那就merge

export MKL_SERVICE_FORCE_INTEL=1
xtuner convert merge /root/share/new_models/meta-llama/Meta-Llama-3-8B-Instruct \
    /share/new_models/agent-flan/iter_2316_hf \
    ~/llama3_agent_pth/merged

运行看一下

streamlit run ~/Llama3-Tutorial/tools/agent_web_demo.py /root/llama3_agent_pth/merged

在这里插入图片描述

第六节：Llama 3 能力评测（OpenCompass 版）

https://github.com/SmartFlowAI/Llama3-Tutorial/blob/main/docs/opencompass.md

按照步骤就好评测路径稍微更改了一下

python run.py --datasets ceval_gen --hf-path /root/share/new_models/meta-llama/Meta-Llama-3-8B-Instruct --tokenizer-path /root/model/Meta-Llama-3-8B-Instruct --tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True --model-kwargs trust_remote_code=True device_map='auto' --max-seq-len 2048 --max-out-len 16 --batch-size 4 --num-gpus 1 --debug