Welcome to follow my CSDN: https://spike.blog.csdn.net/
Article URL: https://spike.blog.csdn.net/article/details/142882496
Disclaimer: This article is based on personal knowledge and publicly available materials and is intended for academic exchange only. Discussion is welcome; reposting is not permitted.
SWIFT, the Scalable lightWeight Infrastructure for FineTuning, is an efficient and lightweight framework for model fine-tuning and inference. It supports training, inference, evaluation, and deployment of large language models (LLMs) and multimodal large language models (MLLMs). SWIFT can be applied directly in research and production environments, providing a complete workflow from model training and evaluation through to application.
GitHub: modelscope/ms-swift
1. Dataset
Test OCR dataset:
- Processed (Parquet format): https://modelscope.cn/datasets/AI-ModelScope/LaTeX_OCR
- Original: https://github.com/LinXueyuanStdio/Data-for-LaTeX_OCR
Dataset cache (MODELSCOPE_CACHE) location: modelscope_models/AI-ModelScope/LaTeX_OCR
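For reference, the dataset can also be pulled manually with the ModelScope SDK. Below is a minimal sketch, assuming the modelscope package is installed; the subset name synthetic_handwrite is an assumption based on the cache directory layout shown later:

# Minimal sketch: load the LaTeX_OCR dataset via the ModelScope SDK.
# The subset name "synthetic_handwrite" is assumed from the cache layout below.
from modelscope.msdatasets import MsDataset

ds = MsDataset.load('AI-ModelScope/LaTeX_OCR', subset_name='synthetic_handwrite', split='train')
print(next(iter(ds)))  # one record with an image and its LaTeX source text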
Test images:
[your path]/llm/vision_test_data/latex-print.png
[your path]/llm/vision_test_data/latex-fullhand.png
Test the OCR recognition capability of qwen2-vl-7b-instruct:
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen2-vl-7b-instruct
<<< <image>使用OCR识别图像中的Latex公式
Input an image path or URL <<< [your path]/llm/vision_test_data/latex-print.png
ds^2 = (1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}} \{dr^2 + r^2 d\theta^2 + r^2 sin^2\theta d\phi^2 \} - \frac{dt^2}{(1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}}}.
Original image (latex-print.png):
Recognition result (printed):
ds^2 = (1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}} \{dr^2 + r^2 d\theta^2 + r^2 sin^2\theta d\phi^2 \} - \frac{dt^2}{(1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}}}
Original image (latex-fullhand.png):
Recognition result (handwritten):
ds^2 = (1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}} \{d\delta^2 + r^2 d\theta^2 + n^2 s/n^2 d\phi^2 \} - \frac{dt^2}{(1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}}}.
The preprocess_func() of the latex-ocr-print dataset is as follows:
def _preprocess_latex_ocr_dataset(dataset: DATASET_TYPE) -> DATASET_TYPE:
    from datasets import Image
    prompt = 'Using LaTeX to perform OCR on the image.'

    def _process(d):
        return {'query': prompt, 'response': d['text']}

    kwargs = {}
    if not isinstance(dataset, HfIterableDataset):
        kwargs['load_from_cache_file'] = dataset_enable_cache
    return dataset.map(_process, **kwargs).rename_column('image', 'images')
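For intuition, here is a minimal sketch of what this mapping produces, using a toy dataset with the same column names; the image path and formula are placeholders, not real dataset records:

# Minimal sketch: apply the same query/response mapping to a toy dataset
# with LaTeX_OCR-style columns ("image", "text"); values are placeholders.
from datasets import Dataset

toy = Dataset.from_dict({
    'image': ['[your path]/llm/vision_test_data/latex-print.png'],
    'text': [r'ds^2 = (1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}} \{ \ldots \}'],
})
prompt = 'Using LaTeX to perform OCR on the image.'
toy = toy.map(lambda d: {'query': prompt, 'response': d['text']}).rename_column('image', 'images')
print(toy[0])  # contains the images path, the original text, plus query and response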
The dataset downloaded via ModelScope is located under modelscope_models/hub/datasets and is stored in Arrow format, which is not compatible with the default format:
├── [4.0K] AI-ModelScope___la_te_x_ocr
│ └── [4.0K] synthetic_handwrite-eb02dd1cc52afa40
│ └── [4.0K] 0.0.0
│ ├── [4.0K] master
│ │ ├── [752K] cache-8f28bc5f38ad58b9-fa2020342a21.arrow
│ │ ├── [6.3M] cache-a7c7e67013e13072-fa2020342a21.arrow
│ │ ├── [606M] cache-c67a1e1eba314afd-fa2020342a21.arrow
│ │ ├── [7.9K] cache-e9fb6f7ceeaa8304-fa2020342a21.arrow
│ │ ├── [1.2K] dataset_info.json
│ │ ├── [ 59M] la_te_x_ocr-test.arrow
│ │ ├── [474M] la_te_x_ocr-train.arrow
│ │ └── [ 59M] la_te_x_ocr-validation.arrow
│ ├── [ 0] master.incomplete_info.lock
│ └── [ 0] master_builder.lock
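These cached Arrow files can be inspected directly with the datasets library; a minimal sketch (the hash suffix in the directory name is machine-specific and will differ from the one shown above):

# Minimal sketch: open one cached Arrow shard directly.
# The directory hash (synthetic_handwrite-eb02dd1cc52afa40) is machine-specific.
from datasets import Dataset

arrow_path = ('modelscope_models/hub/datasets/AI-ModelScope___la_te_x_ocr/'
              'synthetic_handwrite-eb02dd1cc52afa40/0.0.0/master/la_te_x_ocr-train.arrow')
ds = Dataset.from_file(arrow_path)
print(ds)            # number of rows and column names
print(ds[0].keys())  # e.g. the image and text fields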
2. Supervised Fine-Tuning Training
For Supervised Fine-Tuning (SFT), the parameter descriptions can be listed with:
python [your path]/llm/ms-swift/swift/cli/sft.py --help
During the run, the dataset is automatically downloaded to MODELSCOPE_CACHE and converted into the Arrow format supported by SWIFT, so the default dataset files cannot be used directly:
MAX_STEPS=2000 SIZE_FACTOR=8 MAX_PIXELS=602112 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8 nohup swift sft \
--model_type qwen2-vl-7b-instruct \
--model_id_or_path qwen/Qwen2-VL-7B-Instruct \
--sft_type lora \
--num_train_epochs 2 \
--batch_size 4 \
--eval_steps 1000 \
--save_steps 1000 \
--dataset latex-ocr-handwrite \
> nohup.latex-ocr-handwrite.out &
tail -f nohup.latex-ocr-handwrite.out
To use a custom dataset format, refer to "Swift - Custom Dataset"; the data must be converted into standard JSON or JSONL format, as sketched below.
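A minimal sketch of building such a JSONL file for this OCR task, reusing the query/response/images fields produced by the preprocessing function shown earlier (the output file name and image path are placeholders; see the linked custom-dataset documentation for how to reference the file):

# Minimal sketch: write a custom multimodal dataset as JSONL, one sample per line,
# using the query/response/images fields shown in the preprocessing above.
import json

samples = [
    {'query': 'Using LaTeX to perform OCR on the image.',
     'response': r'ds^2 = \ldots',
     'images': ['[your path]/llm/vision_test_data/latex-print.png']},
]
with open('my_latex_ocr.jsonl', 'w', encoding='utf-8') as f:
    for s in samples:
        f.write(json.dumps(s, ensure_ascii=False) + '\n')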
When training completes, the log shows a cumulative total of 11808 training steps:
[INFO:swift] Saving model checkpoint to [your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808
Train: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11808/11808 [6:16:15<00:00, 1.91s/it]
[INFO:swift] last_model_checkpoint: [your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808
[INFO:swift] best_model_checkpoint: [your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11000
[INFO:swift] images_dir: [your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638/images
[INFO:swift] End time of running main: 2024-10-12 03:17:31.020443
{'eval_loss': 0.12784964, 'eval_acc': 0.96368307, 'eval_runtime': 44.673, 'eval_samples_per_second': 21.355, 'eval_steps_per_second': 5.35, 'epoch': 2.0, 'global_step/max_steps': '11808/11808', 'percentage': '100.00%', 'elapsed_time': '6h 16m 14s', 'remaining_time': '0s'}
{'train_runtime': 22574.9994, 'train_samples_per_second': 8.369, 'train_steps_per_second': 0.523, 'train_loss': 0.14006881, 'epoch': 2.0, 'global_step/max_steps': '11808/11808', 'percentage': '100.00%', 'elapsed_time': '6h 16m 15s', 'remaining_time': '0s'}
The output is as follows, where the images directory stores the plots drawn during training:
[your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638
├── [4.0K] checkpoint-11000
├── [4.0K] checkpoint-11808
├── [4.0K] images
├── [1.1M] logging.jsonl
├── [4.0K] runs
├── [ 11K] sft_args.json
└── [4.8K] training_args.json
Read the training logs with TensorBoard:
# http://127.0.0.1:6006/
tensorboard --logdir=[your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638/runs/ --host=0.0.0.0 --port=6006
Training loss (Smooth = 0.9):
Learning rate:
Validation loss (eval_steps = 1000):
GPU memory usage (batch size = 4):
Additionally, to plot the loss curve from the TensorBoard data with Matplotlib, with smoothing set to 0.9, refer to the following:
import os
from typing import Dict, List, Tuple
import matplotlib.pyplot as plt
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
Item = Dict[str, float]
TB_COLOR, TB_COLOR_SMOOTH = '#FFE2D9', '#FF7043'
def read_tensorboard_file(fpath: str) -> Dict[str, List[Item]]:
    if not os.path.isfile(fpath):
        raise FileNotFoundError(f'fpath: {fpath}')
    ea = EventAccumulator(fpath)
    ea.Reload()
    res: Dict[str, List[Item]] = {}
    tags = ea.Tags()['scalars']
    print(f"[Info] tags: {tags}")
    for tag in tags:
        values = ea.Scalars(tag)
        r: List[Item] = []
        for v in values:
            r.append({'step': v.step, 'value': v.value})
        res[tag] = r
    return res

def tensorboard_smoothing(values: List[float], smooth: float = 0.9) -> List[float]:
    norm_factor = 0
    x = 0
    res: List[float] = []
    for i in range(len(values)):
        x = x * smooth + values[i]  # Exponential decay
        norm_factor *= smooth
        norm_factor += 1
        res.append(x / norm_factor)
    return res

def plot_images(images_dir: str,
                tb_dir: str,
                smooth_key: List[str],
                smooth_val: float = 0.9,
                figsize: Tuple[int, int] = (8, 5),
                dpi: int = 100) -> None:
    """Using tensorboard's data content to plot images"""
    os.makedirs(images_dir, exist_ok=True)
    fname = [fname for fname in os.listdir(tb_dir) if os.path.isfile(os.path.join(tb_dir, fname))][0]
    tb_path = os.path.join(tb_dir, fname)
    data = read_tensorboard_file(tb_path)
    for k in data.keys():
        _data = data[k]
        steps = [d['step'] for d in _data]
        values = [d['value'] for d in _data]
        if len(values) == 0:
            continue
        _, ax = plt.subplots(1, 1, squeeze=True, figsize=figsize, dpi=dpi)
        ax.set_title(k)
        if len(values) == 1:
            ax.scatter(steps, values, color=TB_COLOR_SMOOTH)
        elif k in smooth_key:
            ax.plot(steps, values, color=TB_COLOR)
            values_s = tensorboard_smoothing(values, smooth_val)
            ax.plot(steps, values_s, color=TB_COLOR_SMOOTH)
        else:
            ax.plot(steps, values, color=TB_COLOR_SMOOTH)
        # fpath = os.path.join(images_dir, k.replace('/', '_'))
        # plt.savefig(fpath, dpi=dpi, bbox_inches='tight')
        # plt.close()
        plt.show()
        plt.close()
        break
ckpt_dir="[your path]/llm/ms-swift/output"
images_dir = os.path.join(ckpt_dir, 'images')
tb_dir = "[your path]/run_cuda/output/qwen2-vl-7b-instruct/v0-20241011-205638/runs/"
plot_images(images_dir, tb_dir, ['train/loss'], 0.9)
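Besides TensorBoard, the logging.jsonl file in the output directory records the training metrics. A minimal sketch of plotting the training loss from it, assuming each line is a JSON dict containing a loss value and a global_step/max_steps field like the metric dicts printed at the end of training:

# Minimal sketch (assumed logging.jsonl format): plot training loss without TensorBoard.
import json
import matplotlib.pyplot as plt

log_path = '[your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638/logging.jsonl'
steps, losses = [], []
with open(log_path, 'r', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        # skip lines that do not carry a training-loss entry
        if 'loss' in record and 'global_step/max_steps' in record:
            steps.append(int(record['global_step/max_steps'].split('/')[0]))
            losses.append(record['loss'])

plt.plot(steps, losses)
plt.xlabel('step')
plt.ylabel('train loss')
plt.show()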
3. Merging the LoRA Model
After training completes, the output LoRA checkpoint is as follows:
(rag) output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808# tree -L 1 -h .
.
├── [5.0K] README.md
├── [ 712] adapter_config.json
├── [ 39M] adapter_model.safetensors
├── [ 67] additional_config.json
├── [ 383] configuration.json
├── [ 219] generation_config.json
├── [ 77M] optimizer.pt
├── [ 14K] rng_state.pth
├── [1.0K] scheduler.pt
├── [ 11K] sft_args.json
├── [608K] trainer_state.json
└── [7.2K] training_args.bin
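The adapter_config.json typically records the LoRA hyper-parameters (rank, alpha, target modules); a minimal sketch for inspecting it:

# Minimal sketch: print the LoRA adapter configuration of the checkpoint.
import json

ckpt = '[your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808'
with open(f'{ckpt}/adapter_config.json', 'r', encoding='utf-8') as f:
    print(json.dumps(json.load(f), indent=2))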
Merge the LoRA weights into the base model and evaluate the model at the same time:
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir [your path]/run_cuda/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808/ \
--load_dataset_config true \
--merge_lora true
# Evaluate the model directly
Run inference with the merged model:
# [your path]/run_cuda/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808-merged
# CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen2-vl-7b-instruct
CUDA_VISIBLE_DEVICES=0 swift infer --ckpt_dir [your path]/run_cuda/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808-merged
Test the difference in output:
<<< <image>使用OCR识别图像中的Latex公式
Input an image path or URL <<< [your path]/llm/vision_test_data/latex-fullhand.png
d s ^ { 2 } = ( 1 - \frac { q c o s \theta } { r } ) ^ { \frac { 2 } { 1 + \kappa ^ { 2 } } } \{ d r ^ { 2 } + r ^ { 2 } d \theta ^ { 2 } + r ^ { 2 } s i n ^ { 2 } \theta d \varphi ^ { 2 } \} - \frac { d t ^ { 2 } } { ( 1 - \frac { q c o s \theta } { r } ) ^ { \frac { 2 } { 1 + \kappa ^ { 2 } } } } .
# Previous format
# ds^2 = (1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}} \{d\delta^2 + r^2 d\theta^2 + n^2 s/n^2 d\phi^2 \} - \frac{dt^2}{(1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}}}.
Note: the output format differs significantly from before; the model has learned the new OCR output format. The previous output contained no spaces, while the new output inserts spaces between tokens, consistent with the fine-tuning data.
Samples from the fine-tuning training data have the same format as the LoRA output, confirming that training succeeded:
d s ^ { 2 } = ( 1 - { \frac { q c o s \theta } { r } } ) ^ { \frac { 2 } { 1 + \alpha ^ { 2 } } } \lbrace d r ^ { 2 } + r ^ { 2 } d \theta ^ { 2 } + r ^ { 2 } s i n ^ { 2 } \theta d \varphi ^ { 2 } \rbrace - { \frac { d t ^ { 2 } } { ( 1 - { \frac { q c o s \theta } { r } } ) ^ { \frac { 2 } { 1 + \alpha ^ { 2 } } } } } \, .
\widetilde \gamma _ { \mathrm { h o p f } } \simeq \sum _ { n > 0 } \widetilde { G } _ { n } { \frac { ( - a ) ^ { n } } { 2 ^ { 2 n - 1 } } }
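In addition, because merging writes out a full set of model weights, the merged checkpoint can usually also be loaded outside SWIFT. A minimal sketch with transformers (assumptions: a transformers version with Qwen2-VL support, e.g. >= 4.45, and the qwen-vl-utils package are installed; paths are placeholders):

# Minimal sketch: run the merged checkpoint directly with transformers.
# Assumes transformers >= 4.45 and qwen-vl-utils are installed.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

ckpt = '[your path]/run_cuda/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808-merged'
model = Qwen2VLForConditionalGeneration.from_pretrained(ckpt, torch_dtype='auto', device_map='auto')
processor = AutoProcessor.from_pretrained(ckpt)

messages = [{'role': 'user', 'content': [
    {'type': 'image', 'image': '[your path]/llm/vision_test_data/latex-fullhand.png'},
    {'type': 'text', 'text': 'Using LaTeX to perform OCR on the image.'},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos, padding=True, return_tensors='pt').to(model.device)
out_ids = model.generate(**inputs, max_new_tokens=256)
out_ids = [o[len(i):] for i, o in zip(inputs.input_ids, out_ids)]  # strip the prompt tokens
print(processor.batch_decode(out_ids, skip_special_tokens=True)[0])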
4. Call Logic of the --dataset Training Parameter
The dataset declaration is located in swift/llm/utils/dataset.py, for reference:
latex_ocr_print = 'latex-ocr-print'

register_dataset(
    DatasetName.latex_ocr_print,    # dataset_name
    'AI-ModelScope/LaTeX_OCR',      # dataset_id_or_path
    ['full'],                       # subsets
    _preprocess_latex_ocr_dataset,  # preprocess_func
    get_dataset_from_repo,          # get_function
    split=['validation', 'test'],   # There are some problems in the training dataset.
    hf_dataset_id='linxy/LaTeX_OCR',
    tags=['chat', 'ocr', 'multi-modal', 'vision'])
The register_dataset function registers dataset_info into DATASET_MAPPING:
dataset_info = {
    'dataset_id_or_path': dataset_id_or_path,
    'subsets': subsets,
    'preprocess_func': preprocess_func,
    'split': split,
    'hf_dataset_id': hf_dataset_id,
    'is_local': is_local,
    **kwargs
}
DATASET_MAPPING[dataset_name] = dataset_info
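Once registered, the entry can be looked up by its dataset name; a minimal sketch, assuming DATASET_MAPPING and DatasetName are importable from swift.llm.utils.dataset, where they are defined:

# Minimal sketch: query the registered dataset info by name.
from swift.llm.utils.dataset import DATASET_MAPPING, DatasetName

info = DATASET_MAPPING[DatasetName.latex_ocr_print]
print(info['dataset_id_or_path'])  # 'AI-ModelScope/LaTeX_OCR'
print(info['hf_dataset_id'])       # 'linxy/LaTeX_OCR'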
The args.dataset parameter is consumed in the _get_train_val_dataset function:
sft_main = get_sft_main(SftArguments, llm_sft)

def llm_sft(args: SftArguments) -> Dict[str, Any]:
    # ...
    train_dataset, val_dataset = prepare_dataset(args, template, msg)  # call

def prepare_dataset(args, template: Template, msg: Optional[Dict[str, Any]] = None):
    # ...
    train_dataset, val_dataset = _get_train_val_dataset(args)  # call

def _get_train_val_dataset(args: SftArguments) -> Tuple[HfDataset, Optional[HfDataset]]:
    # ...
    train_dataset, val_dataset = get_dataset(
        args.dataset,
        args.dataset_test_ratio,
        args.dataset_seed,
        check_dataset_strategy=args.check_dataset_strategy,
        model_name=args.model_name,
        model_author=args.model_author,
        streaming=args.streaming,
        streaming_val_size=args.streaming_val_size,
        streaming_buffer_size=args.streaming_buffer_size)
That is, the call chain is swift/llm/sft.py#llm_sft() -> prepare_dataset() -> _get_train_val_dataset() -> get_dataset(), where get_dataset() is defined in swift/llm/utils/dataset.py:
def get_dataset(
        dataset_name_list: Union[List[str], str],
        dataset_test_ratio: float = 0.,
        dataset_seed: Union[int, RandomState] = 42,
        check_dataset_strategy: Literal['none', 'discard', 'error', 'warning'] = 'none',
        *,
        # for self-cognition
        model_name: Union[Tuple[str, str], List[str], None] = None,
        model_author: Union[Tuple[str, str], List[str], None] = None,
        **kwargs) -> Tuple[DATASET_TYPE, Optional[DATASET_TYPE]]:
    """Returns train_dataset and val_dataset"""
    # ...
    if isinstance(dataset_name_list, str):
        dataset_name_list = [dataset_name_list]
    # ...
    # dataset_id_or_path -> dataset_name
    dataset_name_list = _dataset_id_to_name(dataset_name_list)
The _dataset_id_to_name() call then triggers the following chain:
- calls register_dataset_info()
- calls register_local_dataset()
- calls register_dataset()
- calls get_local_dataset()
- calls load_dataset_from_local()
- handles .jsonl, .json, and .csv files
- or calls preprocess_func()

That is:
if dataset_path.endswith('.csv'):
    dataset = HfDataset.from_csv(dataset_path, na_filter=False)
elif dataset_path.endswith('.jsonl') or dataset_path.endswith('.json'):
    dataset = HfDataset.from_json(dataset_path)
else:
    raise ValueError('The custom dataset only supports CSV, JSONL or JSON format.')
dataset = preprocess_func(dataset)