Fine-tuning InternLM2.5-Chat-7B with XTuner to Give It a Custom Assistant Identity
- 1 Environment Setup and Data Preparation
  - Step 0. Create a Python 3.10 virtual environment with conda
  - Step 1. Install XTuner
- Modifying the Provided Data
  - Step 0. Create a new folder for the fine-tuning data
  - Step 1. Create the modification script
  - Step 2. Run the script
  - Step 3. Inspect the data
- Launching Training
  - Step 0. Link the model
  - Step 1. Modify the Config
  - Step 2. Start fine-tuning
  - Step 3. Convert the weights
  - Step 4. Merge the model
- WebUI Chat with the Model
Reference: InternLM training camp (书生训练营) tutorial
XTuner documentation
1 Environment Setup and Data Preparation
Step 0. Create a Python 3.10 virtual environment with conda
cd ~
# git clone this repo
git clone https://github.com/InternLM/Tutorial.git -b camp4
mkdir -p /root/finetune && cd /root/finetune
conda create -n xtuner-env python=3.10 -y
conda activate xtuner-env
Step 1. Install XTuner
cd /root/Tutorial/docs/L1/XTuner
pip install -r requirements.txt
pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
# requirements.txt
accelerate==0.27.0
addict==2.4.0
aiohttp==3.9.3
aiosignal==1.3.1
aliyun-python-sdk-core==2.14.0
aliyun-python-sdk-kms==2.16.2
altair==5.2.0
annotated-types==0.6.0
anyio==4.2.0
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
arxiv==2.1.0
asttokens==2.4.1
async-lru==2.0.4
async-timeout==4.0.3
attrs==23.2.0
Babel==2.14.0
beautifulsoup4==4.12.3
bitsandbytes==0.42.0
bleach==6.1.0
blinker==1.7.0
cachetools==5.3.2
certifi==2024.2.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
colorama==0.4.6
comm==0.2.1
contourpy==1.2.0
crcmod==1.7
cryptography==42.0.2
cycler==0.12.1
datasets==2.17.0
debugpy==1.8.1
decorator==5.1.1
deepspeed==0.14.4
defusedxml==0.7.1
dill==0.3.8
distro==1.9.0
einops==0.8.0
einx==0.3.0
et-xmlfile==1.1.0
exceptiongroup==1.2.0
executing==2.0.1
fastapi==0.112.0
fastjsonschema==2.19.1
feedparser==6.0.10
filelock==3.14.0
fonttools==4.48.1
fqdn==1.5.1
frozendict==2.4.4
frozenlist==1.4.1
fsspec==2023.10.0
func-timeout==4.3.5
gast==0.5.4
gitdb==4.0.11
GitPython==3.1.41
google-search-results==2.4.2
griffe==0.40.1
h11==0.14.0
hjson==3.1.0
httpcore==1.0.3
httpx==0.26.0
huggingface-hub==0.24.2
idna==3.6
imageio==2.34.2
importlib-metadata==7.0.1
ipykernel==6.29.2
ipython==8.21.0
ipywidgets==8.1.2
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.3
jmespath==0.10.0
json5==0.9.14
jsonpointer==2.4
jsonschema==4.21.1
jsonschema-specifications==2023.12.1
kiwisolver==1.4.5
lagent==0.2.1
lazy_loader==0.4
llvmlite==0.43.0
lxml==5.1.0
markdown-it-py==3.0.0
MarkupSafe==2.1.5
matplotlib==3.8.2
matplotlib-inline==0.1.6
mdurl==0.1.2
mistune==3.0.2
mmengine==0.10.3
modelscope==1.12.0
mpi4py_mpich==3.1.5
mpmath==1.3.0
multidict==6.0.5
multiprocess==0.70.16
nbclient==0.9.0
nbconvert==7.16.0
nbformat==5.9.2
nest-asyncio==1.6.0
networkx==3.2.1
ninja==1.11.1.1
notebook==7.0.8
notebook_shim==0.2.3
numba==0.60.0
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
openai==1.12.0
opencv-python==4.9.0.80
openpyxl==3.1.2
oss2==2.17.0
overrides==7.7.0
packaging==24.1
pandas==2.2.0
pandocfilters==1.5.1
parso==0.8.3
peft==0.8.2
pexpect==4.9.0
phx-class-registry==4.1.0
pillow==10.2.0
platformdirs==4.2.0
prometheus-client==0.19.0
prompt-toolkit==3.0.43
protobuf==4.25.2
psutil==5.9.8
ptyprocess==0.7.0
pure-eval==0.2.2
py-cpuinfo==9.0.0
pyarrow==15.0.0
pyarrow-hotfix==0.6
pybase16384==0.3.7
pycparser==2.21
pycryptodome==3.20.0
pydantic==2.6.1
pydantic_core==2.16.2
pydeck==0.8.1b0
Pygments==2.17.2
pynvml==11.5.0
pyparsing==3.1.1
python-dateutil==2.8.2
python-json-logger==2.0.7
python-pptx==0.6.23
PyYAML==6.0.1
pyzmq==25.1.2
qtconsole==5.5.1
QtPy==2.4.1
referencing==0.33.0
regex==2023.12.25
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.4.2
rpds-py==0.17.1
safetensors==0.4.2
scikit-image==0.24.0
scipy==1.12.0
seaborn==0.13.2
Send2Trash==1.8.2
sentencepiece==0.1.99
sgmllib3k==1.0.0
simplejson==3.19.2
six==1.16.0
smmap==5.0.1
sniffio==1.3.0
sortedcontainers==2.4.0
soupsieve==2.5
stack-data==0.6.3
starlette==0.37.2
sympy==1.12
tenacity==8.2.3
termcolor==2.4.0
terminado==0.18.0
tifffile==2024.7.24
tiktoken==0.6.0
timeout-decorator==0.5.0
tinycss2==1.2.1
tokenizers==0.15.2
toml==0.10.2
tomli==2.0.1
toolz==0.12.1
torch==2.2.1
torchvision==0.17.1
tornado==6.4
tqdm==4.65.2
traitlets==5.14.1
transformers==4.39.0
transformers-stream-generator==0.0.4
triton==2.2.0
types-python-dateutil==2.8.19.20240106
typing_extensions==4.9.0
tzdata==2024.1
tzlocal==5.2
uri-template==1.3.0
urllib3==1.26.18
uvicorn==0.30.6
validators==0.22.0
watchdog==4.0.0
wcwidth==0.2.13
webcolors==1.13
webencodings==0.5.1
websocket-client==1.7.0
widgetsnbextension==4.0.10
XlsxWriter==3.1.9
xtuner==0.1.23
xxhash==3.4.1
yapf==0.40.2
yarl==1.9.4
zipp==3.17.0
Verify the installation:
xtuner list-cfg
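If `xtuner list-cfg` prints a long list of built-in config names, the installation works. As an optional extra check, the short Python sketch below (run inside the activated xtuner-env; the expected versions come from the install step above) confirms that PyTorch sees the GPU:

# Optional environment sanity check; run inside xtuner-env.
from importlib.metadata import version

import torch

print("torch:", torch.__version__)            # 2.4.1 per the pip install above
print("cuda available:", torch.cuda.is_available())
print("xtuner:", version("xtuner"))           # 0.1.23 per requirements.txt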
Modifying the Provided Data
Step 0. Create a new folder for the fine-tuning data
mkdir -p /root/finetune/data && cd /root/finetune/data
cp -r /root/Tutorial/data/assistant_Tuner.jsonl /root/finetune/data
Step 1. Create the modification script
# Create the `change_script.py` file
touch /root/finetune/data/change_script.py
import json
import argparse
from tqdm import tqdm


def process_line(line, old_text, new_text):
    # Parse the JSON line
    data = json.loads(line)

    # Recursively walk nested dicts and lists
    def replace_text(obj):
        if isinstance(obj, dict):
            return {k: replace_text(v) for k, v in obj.items()}
        elif isinstance(obj, list):
            return [replace_text(item) for item in obj]
        elif isinstance(obj, str):
            return obj.replace(old_text, new_text)
        else:
            return obj

    # Process the whole JSON object
    processed_data = replace_text(data)

    # Serialize the processed object back to a JSON string
    return json.dumps(processed_data, ensure_ascii=False)


def main(input_file, output_file, old_text, new_text):
    with open(input_file, 'r', encoding='utf-8') as infile, \
         open(output_file, 'w', encoding='utf-8') as outfile:
        # Count the total number of lines for the progress bar
        total_lines = sum(1 for _ in infile)
        infile.seek(0)  # rewind to the start of the file

        # Use tqdm to show a progress bar
        for line in tqdm(infile, total=total_lines, desc="Processing"):
            processed_line = process_line(line.strip(), old_text, new_text)
            outfile.write(processed_line + '\n')


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Replace text in a JSONL file.")
    parser.add_argument("input_file", help="Input JSONL file to process")
    parser.add_argument("output_file", help="Output file for processed JSONL")
    parser.add_argument("--old_text", default="尖米", help="Text to be replaced")
    parser.add_argument("--new_text", default="闻星", help="Text to replace with")
    args = parser.parse_args()

    main(args.input_file, args.output_file, args.old_text, args.new_text)
Then open change_script.py and change the `--new_text` default from `default="闻星"` to your own name.
Step 2. Run the script
# usage: python change_script.py {input_file.jsonl} {output_file.jsonl}
cd ~/finetune/data
python change_script.py ./assistant_Tuner.jsonl ./assistant_Tuner_change.jsonl
Step 3. Inspect the data
cat assistant_Tuner_change.jsonl | head -n 3
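If you prefer a programmatic check, the minimal sketch below (a quick ad-hoc script, not part of the tutorial files) verifies that every line still parses as JSON and counts how many lines contain the new name:

# Minimal JSONL sanity check: every line must parse as JSON,
# and the replacement name should appear in the data.
import json

new_text = "闻星"  # whatever you passed as --new_text
total = hits = 0
with open("assistant_Tuner_change.jsonl", encoding="utf-8") as f:
    for line in f:
        json.loads(line)        # raises an error on any malformed line
        total += 1
        hits += new_text in line
print(f"{total} lines parsed, {hits} contain {new_text!r}")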
Launching Training
Step 0. Link the model
The base model is already provided on the InternStudio dev machine, so a symlink is enough; there is no need to download or copy it.
The model is located at /root/share/new_models/Shanghai_AI_Laboratory/internlm2_5-7b-chat
mkdir /root/finetune/models
ln -s /root/share/new_models/Shanghai_AI_Laboratory/internlm2_5-7b-chat /root/finetune/models/internlm2_5-7b-chat
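A quick way to confirm the symlink resolves to a readable model directory (a minimal sketch; the expected file names are just the usual HuggingFace layout) is:

# Confirm the symlink points at a readable model directory.
import os

model_dir = "/root/finetune/models/internlm2_5-7b-chat"
print("resolves to:", os.path.realpath(model_dir))
# expect config.json, tokenizer files, and weight shards among the entries
print("contains:", sorted(os.listdir(model_dir))[:8])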
Step 1. Modify the Config
# cd {path/to/finetune}
cd /root/finetune
mkdir ./config
cd config
xtuner copy-cfg internlm2_5_chat_7b_qlora_alpaca_e3 ./
Edit the copied config file as follows (shown as a diff):
#######################################################################
# PART 1 Settings #
#######################################################################
- pretrained_model_name_or_path = 'internlm/internlm2_5-7b-chat'
+ pretrained_model_name_or_path = '/root/finetune/models/internlm2_5-7b-chat'
- alpaca_en_path = 'tatsu-lab/alpaca'
+ alpaca_en_path = '/root/finetune/data/assistant_Tuner_change.jsonl'
evaluation_inputs = [
- '请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
+ '请介绍一下你自己', 'Please introduce yourself'
]
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
- dataset=dict(type=load_dataset, path=alpaca_en_path),
+ dataset=dict(type=load_dataset, path='json', data_files=dict(train=alpaca_en_path)),
tokenizer=tokenizer,
max_length=max_length,
- dataset_map_fn=alpaca_map_fn,
+ dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
The complete file after the edits:
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
                            LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig)

from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
                                 VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE

#######################################################################
#                          PART 1  Settings                           #
#######################################################################
# Model
pretrained_model_name_or_path = '/root/finetune/models/internlm2_5-7b-chat'
use_varlen_attn = False

# Data
alpaca_en_path = '/root/finetune/data/assistant_Tuner_change.jsonl'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
pack_to_max_length = True

# parallel
sequence_parallel_size = 1

# Scheduler & Optimizer
batch_size = 1  # per_device
accumulative_counts = 1
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1  # grad clip
warmup_ratio = 0.03

# Save
save_steps = 500
save_total_limit = 2  # Maximum checkpoints to keep (-1 means unlimited)

# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
    '请介绍一下你自己', 'Please introduce yourself'
]

#######################################################################
#                      PART 2  Model & Tokenizer                      #
#######################################################################
tokenizer = dict(
    type=AutoTokenizer.from_pretrained,
    pretrained_model_name_or_path=pretrained_model_name_or_path,
    trust_remote_code=True,
    padding_side='right')

model = dict(
    type=SupervisedFinetune,
    use_varlen_attn=use_varlen_attn,
    llm=dict(
        type=AutoModelForCausalLM.from_pretrained,
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        trust_remote_code=True,
        torch_dtype=torch.float16,
        quantization_config=dict(
            type=BitsAndBytesConfig,
            load_in_4bit=True,
            load_in_8bit=False,
            llm_int8_threshold=6.0,
            llm_int8_has_fp16_weight=False,
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type='nf4')),
    lora=dict(
        type=LoraConfig,
        r=64,
        lora_alpha=16,
        lora_dropout=0.1,
        bias='none',
        task_type='CAUSAL_LM'))

#######################################################################
#                      PART 3  Dataset & Dataloader                   #
#######################################################################
alpaca_en = dict(
    type=process_hf_dataset,
    dataset=dict(type=load_dataset, path='json', data_files=dict(train=alpaca_en_path)),
    tokenizer=tokenizer,
    max_length=max_length,
    dataset_map_fn=None,
    template_map_fn=dict(
        type=template_map_fn_factory, template=prompt_template),
    remove_unused_columns=True,
    shuffle_before_pack=True,
    pack_to_max_length=pack_to_max_length,
    use_varlen_attn=use_varlen_attn)

sampler = SequenceParallelSampler \
    if sequence_parallel_size > 1 else DefaultSampler

train_dataloader = dict(
    batch_size=batch_size,
    num_workers=dataloader_num_workers,
    dataset=alpaca_en,
    sampler=dict(type=sampler, shuffle=True),
    collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))

#######################################################################
#                    PART 4  Scheduler & Optimizer                    #
#######################################################################
# optimizer
optim_wrapper = dict(
    type=AmpOptimWrapper,
    optimizer=dict(
        type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
    clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
    accumulative_counts=accumulative_counts,
    loss_scale='dynamic',
    dtype='float16')

# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md  # noqa: E501
param_scheduler = [
    dict(
        type=LinearLR,
        start_factor=1e-5,
        by_epoch=True,
        begin=0,
        end=warmup_ratio * max_epochs,
        convert_to_iter_based=True),
    dict(
        type=CosineAnnealingLR,
        eta_min=0.0,
        by_epoch=True,
        begin=warmup_ratio * max_epochs,
        end=max_epochs,
        convert_to_iter_based=True)
]

# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)

#######################################################################
#                           PART 5  Runtime                           #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
    dict(type=DatasetInfoHook, tokenizer=tokenizer),
    dict(
        type=EvaluateChatHook,
        tokenizer=tokenizer,
        every_n_iters=evaluation_freq,
        evaluation_inputs=evaluation_inputs,
        system=SYSTEM,
        prompt_template=prompt_template)
]

if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]

# configure default hooks
default_hooks = dict(
    # record the time of every iteration.
    timer=dict(type=IterTimerHook),
    # print log every 10 iterations.
    logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
    # enable the parameter scheduler.
    param_scheduler=dict(type=ParamSchedulerHook),
    # save checkpoint per `save_steps`.
    checkpoint=dict(
        type=CheckpointHook,
        by_epoch=False,
        interval=save_steps,
        max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
    sampler_seed=dict(type=DistSamplerSeedHook),
)

# configure environment
env_cfg = dict(
    # whether to enable cudnn benchmark
    cudnn_benchmark=False,
    # set multi process parameters
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
    # set distributed parameters
    dist_cfg=dict(backend='nccl'),
)

# set visualizer
visualizer = None

# set log level
log_level = 'INFO'

# load from which checkpoint
load_from = None

# whether to resume training from the loaded checkpoint
resume = False

# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)

# set log processor
log_processor = dict(by_epoch=False)
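Before launching the run, it can be worth checking that the `load_dataset` call in PART 3 reads the JSONL the way the config expects. The standalone sketch below uses the same `datasets` API the config references; the printed column names depend on your JSONL schema:

# Standalone check of the dataset loading configured in PART 3.
from datasets import load_dataset

ds = load_dataset(
    "json",
    data_files=dict(train="/root/finetune/data/assistant_Tuner_change.jsonl"),
)
print(ds)                        # split names and sizes
print(ds["train"].column_names)  # top-level fields XTuner will see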
Step 2. Start fine-tuning
cd /root/finetune
conda activate xtuner-env
xtuner train ./config/internlm2_5_chat_7b_qlora_alpaca_e3_copy.py --deepspeed deepspeed_zero2 --work-dir ./work_dirs/assistTuner
The xtuner train command launches a fine-tuning run.
It takes one required argument, CONFIG, which specifies the fine-tuning config file; here we use the modified internlm2_5_chat_7b_qlora_alpaca_e3_copy.py.
All files produced during training, including logs, a copy of the config, checkpoints, and the fine-tuned weights, are saved under the work_dirs directory by default; --work-dir selects a different save location.
--deepspeed enables DeepSpeed, which saves GPU memory; deepspeed_zero2 selects the ZeRO-2 strategy.
Step 3. Convert the weights
Weight conversion simply turns the PyTorch checkpoint produced by training into the widely used HuggingFace format, which can be done in one command.
We use xtuner convert pth_to_hf for the format conversion.
The command takes three arguments:
CONFIG: the fine-tuning config file;
PATH_TO_PTH_MODEL: the path to the fine-tuned .pth weights to convert;
SAVE_PATH_TO_HF_MODEL: where to save the converted HuggingFace-format files.
Optional flags:
--fp32: convert at fp32 precision (defaults to fp16 when omitted);
--max-shard-size {GB}: the maximum size of each weight shard (defaults to 2GB).
cd /root/finetune/work_dirs/assistTuner
conda activate xtuner-env
# grab the most recently saved .pth checkpoint
pth_file=`ls -t /root/finetune/work_dirs/assistTuner/*.pth | head -n 1 | sed 's/:$//'`
export MKL_SERVICE_FORCE_INTEL=1
export MKL_THREADING_LAYER=GNU
xtuner convert pth_to_hf ./internlm2_5_chat_7b_qlora_alpaca_e3_copy.py ${pth_file} ./hf
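To confirm the conversion produced a loadable LoRA adapter, a minimal sketch with peft (already installed as an XTuner dependency) can read the adapter config back:

# Quick check that ./hf is a valid PEFT (LoRA) adapter.
from peft import PeftConfig

cfg = PeftConfig.from_pretrained("./hf")
print(cfg.peft_type)                     # expect LORA
print("r =", cfg.r, "alpha =", cfg.lora_alpha)
print("base model:", cfg.base_model_name_or_path)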
Step 4. Merge the model
A model fine-tuned with LoRA or QLoRA is not a complete model on its own; it is an extra Adapter layer that must be merged back into the base model before it can be used normally.
Fully fine-tuned (full) models skip this step: full fine-tuning updates the base model's own weights rather than training a separate Adapter, so there is nothing to merge.
XTuner provides a one-step merge command, xtuner convert merge. It needs three paths: the base model, the trained (and format-converted) Adapter, and the output location.
The xtuner convert merge command takes three arguments: LLM, the base model path; ADAPTER, the Adapter path; SAVE_PATH, where to save the merged model.
Optional flags:
--max-shard-size {GB}: the maximum size of each weight shard (defaults to 2GB);
--device {device_name}: cuda, cpu, or auto; defaults to cuda, i.e. run the merge on GPU;
--is-clip: add this flag only when the model being converted is a CLIP model.
cd /root/finetune/work_dirs/assistTuner
conda activate xtuner-env
export MKL_SERVICE_FORCE_INTEL=1
export MKL_THREADING_LAYER=GNU
xtuner convert merge /root/finetune/models/internlm2_5-7b-chat ./hf ./merged --max-shard-size 2GB
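Before wiring up the WebUI, you can smoke-test the merged model directly with transformers. This is a minimal sketch, assuming a GPU with enough memory for fp16 inference; the chat helper comes from InternLM2.5's trust_remote_code modeling file:

# Smoke-test the merged model with a single-turn chat.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "/root/finetune/work_dirs/assistTuner/merged"
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.float16, trust_remote_code=True
).cuda().eval()

response, _history = model.chat(tokenizer, "请介绍一下你自己")
print(response)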
WebUI Chat with the Model
cd ~/Tutorial/tools/L1_XTuner_code
# Edit line 18 of the xtuner_streamlit_demo.py script directly
- model_name_or_path = "Shanghai_AI_Laboratory/internlm2_5-7b-chat"
+ model_name_or_path = "/root/finetune/work_dirs/assistTuner/merged"
conda activate xtuner-env
pip install streamlit==1.31.0
streamlit run /root/Tutorial/tools/L1_XTuner_code/xtuner_streamlit_demo.py
# After launching, make sure the port forwarding is active; if the tunnel has dropped, re-establish it
ssh -CNg -L 8501:127.0.0.1:8501 root@ssh.intern-ai.org.cn -p *****
Finally, open http://127.0.0.1:8501 in your browser to chat with the model.