Continuing from my previous post: "A simple walkthrough of fine-tuning Qwen2.5 with the Llama Factory CLI on Ubuntu" (CSDN blog).
If you need to fine-tune a larger model such as Qwen2.5-32B, two RTX 3090s are probably not enough, so here I use a server with four A6000s (48 GB each). But if you follow the previous post and simply run:
llamafactory-cli train examples/train_qlora/qwen_lora.yaml
it still fails with:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 270.00 MiB. GPU 0 has a total capacity of 47.54 GiB of which 244.94 MiB is free. Including non-PyTorch memory, this process has 47.28 GiB memory in use.
Several people have already written up solutions:
"LLaMA-Factory reports OOM (out of memory) when training a Qwen-14B model on a multi-GPU 4090 server, solved" (CSDN blog)
"LLaMA-Factory multi-node, multi-GPU training" (CSDN blog)
And the Llama Factory author has commented on it as well: CUDA out of memory · Issue #3816 · hiyouga/LLaMA-Factory · GitHub
The GitHub issue is rather terse, though. Specifically, the fix is to add the following to the method section of the yaml file: deepspeed: examples/deepspeed/ds_z3_config.json
This turns on DeepSpeed ZeRO stage 3, which shards the model parameters, gradients, and optimizer states across the GPUs instead of replicating the full model on every card (the contents of that stock config are shown after my yaml below).
My full yaml file looks like this:
### model
model_name_or_path: /home/ProjectsQuYuNew/Qwen2.5-32B-Instruct
trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
deepspeed: examples/deepspeed/ds_z3_config.json
### dataset
dataset: identity_tpri
template: qwen
cutoff_len: 2048
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: saves/qwen2.5-32b/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true
### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 20.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
### eval
# val_size: 0.1
# per_device_eval_batch_size: 1
# eval_strategy: steps
# eval_steps: 500
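For reference, the ds_z3_config.json that ships with LLaMA-Factory under examples/deepspeed/ enables ZeRO stage 3 and leaves most fields as "auto" so the HuggingFace Trainer fills them in from the training arguments. The (slightly abbreviated) sketch below is what my copy looks like; check the file in your own checkout, since it may differ between versions:
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "bf16": { "enabled": "auto" },
  "fp16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
  }
}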
Then run the training command above again; it errors out with:
ImportError: DeepSpeed is not available => install it using `pip3 install deepspeed` or build it from source
This one is self-explanatory: just install deepspeed:
pip install deepspeed
But it errors out yet again:
AssertionError: no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3
A fix is described here: "no_sync context manager is incompatible with gradient partitioning logic of ZeRO stage 3" (CSDN blog)
It is actually quite simple: uninstall deepspeed and install a specific version instead:
pip uninstall -y deepspeed
pip install deepspeed==0.15.4
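(Side note: on newer LLaMA-Factory versions you can also force the torchrun launcher explicitly via the FORCE_TORCHRUN environment variable if a DeepSpeed run does not pick up all GPUs, e.g.:
FORCE_TORCHRUN=1 llamafactory-cli train examples/train_qlora/qwen_lora.yaml
I did not need it on this server, but it is worth trying if distributed training is not started automatically.)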
After that, fine-tuning runs normally:
100%|█████████████████████| 40/40 [1:10:54<00:00, 106.08s/it]
P.S. Judging from the loss it doesn't seem to have trained very well, but at least it runs now.
Separately, I found another problem with Llama Factory on the A6000 server. If you run directly (see the previous post for this yaml):
llamafactory-cli chat examples/inference/qwen2_lora.yaml
the model outputs pure nonsense (garbled text), even without the fine-tuned adapter attached. The workaround is to use only one GPU for inference:
CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat examples/inference/qwen2_lora.yaml
But then the next problem appears: on a single GPU, the base model plus the LoRA adapter may still run out of memory. The solution is to merge the LoRA adapter into the base model:
llamafactory-cli export examples/merge_lora/llama3_lora_sft.yaml
The yaml file for the merge looks like this:
### Note: DO NOT use quantized model or quantization_bit when merging lora adapters
### model
model_name_or_path: /home/admin90601/ProjectsQuYuNew/Qwen2.5-32B-Instruct
adapter_name_or_path: saves/qwen2.5-32b/lora/sft
template: qwen
finetuning_type: lora
trust_remote_code: true
### export
export_dir: models/qwen2.5-32b_lora_sft
export_size: 2
export_device: auto
export_legacy_format: false
Then run again:
CUDA_VISIBLE_DEVICES=0 llamafactory-cli chat examples/inference/qwen2_lora.yaml
and now it loads fine. Heh, I don't know of a proper fix for this bug yet; the workaround above is viable for a 32B model, but for a 72B model it won't save you.
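For the merged weights to actually be loaded, the inference yaml has to point at the merged checkpoint rather than at the base model plus adapter. A minimal sketch under that assumption, using the export_dir from the merge config above (field names follow LLaMA-Factory's stock examples/inference configs; adjust the path to your setup):
### model: merged checkpoint produced by the export step above
model_name_or_path: models/qwen2.5-32b_lora_sft
template: qwen
trust_remote_code: true
infer_backend: huggingface  # choices: [huggingface, vllm]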
Update: one more bug to record. In the merge yaml above, if export_device is set to auto, the fine-tuning has no visible effect on my server; it has to be set to cpu. A rather nasty pitfall!