Welcome to follow my CSDN: https://spike.blog.csdn.net/
Article URL: https://spike.blog.csdn.net/article/details/142882496
Disclaimer: This article is based on personal knowledge and publicly available materials and is intended for academic exchange only. Discussion is welcome; reposting is not permitted.
SWIFT, the Scalable lightWeight Infrastructure for FineTuning, is an efficient and lightweight framework for model fine-tuning and inference. It supports training, inference, evaluation, and deployment of large language models (LLMs) and multimodal large language models (MLLMs). SWIFT can be applied directly in research and production environments, providing a complete workflow from model training and evaluation through to application.
GitHub: modelscope/ms-swift
1. Dataset
Test OCR dataset:
- Processed (Parquet format): https://modelscope.cn/datasets/AI-ModelScope/LaTeX_OCR
- Original: https://github.com/LinXueyuanStdio/Data-for-LaTeX_OCR
Dataset cache (MODELSCOPE_CACHE) location: modelscope_models/AI-ModelScope/LaTeX_OCR
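For reference, the dataset can also be pulled manually with the ModelScope SDK. Below is a minimal sketch, assuming the modelscope package is installed; the subset name synthetic_handwrite is an assumption based on the cache directory layout shown later:

# Minimal sketch: load the LaTeX_OCR dataset via the ModelScope SDK.
# The subset name "synthetic_handwrite" is assumed from the cache layout below.
from modelscope.msdatasets import MsDataset

ds = MsDataset.load('AI-ModelScope/LaTeX_OCR', subset_name='synthetic_handwrite', split='train')
print(next(iter(ds)))  # one record with an image and its LaTeX source text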
Test images:
[your path]/llm/vision_test_data/latex-print.png
[your path]/llm/vision_test_data/latex-fullhand.png
Test the OCR recognition capability of qwen2-vl-7b-instruct:
CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen2-vl-7b-instruct
<<< <image>使用OCR识别图像中的Latex公式
Input an image path or URL <<< [your path]/llm/vision_test_data/latex-print.png
ds^2 = (1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}} \{dr^2 + r^2 d\theta^2 + r^2 sin^2\theta d\phi^2 \} - \frac{dt^2}{(1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}}}.
Original image (latex-print.png):
Recognition result (printed):
ds^2 = (1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}} \{dr^2 + r^2 d\theta^2 + r^2 sin^2\theta d\phi^2 \} - \frac{dt^2}{(1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}}}
Original image (latex-fullhand.png):
Recognition result (handwritten):
ds^2 = (1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}} \{d\delta^2 + r^2 d\theta^2 + n^2 s/n^2 d\phi^2 \} - \frac{dt^2}{(1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}}}.
The preprocess_func() of the latex-ocr-print dataset is as follows:
def _preprocess_latex_ocr_dataset(dataset: DATASET_TYPE) -> DATASET_TYPE:
    from datasets import Image
    prompt = 'Using LaTeX to perform OCR on the image.'

    def _process(d):
        return {'query': prompt, 'response': d['text']}

    kwargs = {}
    if not isinstance(dataset, HfIterableDataset):
        kwargs['load_from_cache_file'] = dataset_enable_cache
    return dataset.map(_process, **kwargs).rename_column('image', 'images')
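For intuition, here is a minimal sketch of what this mapping produces, using a toy dataset with the same column names; the image path and formula are placeholders, not real dataset records:

# Minimal sketch: apply the same query/response mapping to a toy dataset
# with LaTeX_OCR-style columns ("image", "text"); values are placeholders.
from datasets import Dataset

toy = Dataset.from_dict({
    'image': ['[your path]/llm/vision_test_data/latex-print.png'],
    'text': [r'ds^2 = (1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}} \{ \ldots \}'],
})
prompt = 'Using LaTeX to perform OCR on the image.'
toy = toy.map(lambda d: {'query': prompt, 'response': d['text']}).rename_column('image', 'images')
print(toy[0])  # contains the images path, the original text, plus query and response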
The dataset downloaded via ModelScope is located under modelscope_models/hub/datasets and is stored in Arrow format, which is not compatible with the default format:
├── [4.0K] AI-ModelScope___la_te_x_ocr
│ └── [4.0K] synthetic_handwrite-eb02dd1cc52afa40
│ └── [4.0K] 0.0.0
│ ├── [4.0K] master
│ │ ├── [752K] cache-8f28bc5f38ad58b9-fa2020342a21.arrow
│ │ ├── [6.3M] cache-a7c7e67013e13072-fa2020342a21.arrow
│ │ ├── [606M] cache-c67a1e1eba314afd-fa2020342a21.arrow
│ │ ├── [7.9K] cache-e9fb6f7ceeaa8304-fa2020342a21.arrow
│ │ ├── [1.2K] dataset_info.json
│ │ ├── [ 59M] la_te_x_ocr-test.arrow
│ │ ├── [474M] la_te_x_ocr-train.arrow
│ │ └── [ 59M] la_te_x_ocr-validation.arrow
│ ├── [ 0] master.incomplete_info.lock
│ └── [ 0] master_builder.lock
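These cached Arrow files can be inspected directly with the datasets library; a minimal sketch (the hash suffix in the directory name is machine-specific and will differ from the one shown above):

# Minimal sketch: open one cached Arrow shard directly.
# The directory hash (synthetic_handwrite-eb02dd1cc52afa40) is machine-specific.
from datasets import Dataset

arrow_path = ('modelscope_models/hub/datasets/AI-ModelScope___la_te_x_ocr/'
              'synthetic_handwrite-eb02dd1cc52afa40/0.0.0/master/la_te_x_ocr-train.arrow')
ds = Dataset.from_file(arrow_path)
print(ds)            # number of rows and column names
print(ds[0].keys())  # e.g. the image and text fields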
2. Supervised Fine-Tuning Training
For Supervised Fine-Tuning (SFT), the parameter descriptions can be listed with:
python [your path]/llm/ms-swift/swift/cli/sft.py --help
During the run, the dataset is automatically downloaded to MODELSCOPE_CACHE and converted into the Arrow format supported by SWIFT, so the default dataset files cannot be used directly:
MAX_STEPS=2000 SIZE_FACTOR=8 MAX_PIXELS=602112 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7,8 nohup swift sft \
--model_type qwen2-vl-7b-instruct \
--model_id_or_path qwen/Qwen2-VL-7B-Instruct \
--sft_type lora \
--num_train_epochs 2 \
--batch_size 4 \
--eval_steps 1000 \
--save_steps 1000 \
--dataset latex-ocr-handwrite \
> nohup.latex-ocr-handwrite.out &
tail -f nohup.latex-ocr-handwrite.out
To use a custom dataset format, refer to "Swift - Custom Dataset"; the data must be converted into standard JSON or JSONL format, as sketched below.
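A minimal sketch of building such a JSONL file for this OCR task, reusing the query/response/images fields produced by the preprocessing function shown earlier (the output file name and image path are placeholders; see the linked custom-dataset documentation for how to reference the file):

# Minimal sketch: write a custom multimodal dataset as JSONL, one sample per line,
# using the query/response/images fields shown in the preprocessing above.
import json

samples = [
    {'query': 'Using LaTeX to perform OCR on the image.',
     'response': r'ds^2 = \ldots',
     'images': ['[your path]/llm/vision_test_data/latex-print.png']},
]
with open('my_latex_ocr.jsonl', 'w', encoding='utf-8') as f:
    for s in samples:
        f.write(json.dumps(s, ensure_ascii=False) + '\n')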
When training completes, the log shows a cumulative total of 11808 training steps:
[INFO:swift] Saving model checkpoint to [your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808
Train: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11808/11808 [6:16:15<00:00, 1.91s/it]
[INFO:swift] last_model_checkpoint: [your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808
[INFO:swift] best_model_checkpoint: [your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11000
[INFO:swift] images_dir: [your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638/images
[INFO:swift] End time of running main: 2024-10-12 03:17:31.020443
{'eval_loss': 0.12784964, 'eval_acc': 0.96368307, 'eval_runtime': 44.673, 'eval_samples_per_second': 21.355, 'eval_steps_per_second': 5.35, 'epoch': 2.0, 'global_step/max_steps': '11808/11808', 'percentage': '100.00%', 'elapsed_time': '6h 16m 14s', 'remaining_time': '0s'}
{'train_runtime': 22574.9994, 'train_samples_per_second': 8.369, 'train_steps_per_second': 0.523, 'train_loss': 0.14006881, 'epoch': 2.0, 'global_step/max_steps': '11808/11808', 'percentage': '100.00%', 'elapsed_time': '6h 16m 15s', 'remaining_time': '0s'}
The output is as follows, where the images directory stores the plots drawn during training:
[your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638
├── [4.0K] checkpoint-11000
├── [4.0K] checkpoint-11808
├── [4.0K] images
├── [1.1M] logging.jsonl
├── [4.0K] runs
├── [ 11K] sft_args.json
└── [4.8K] training_args.json
Read the training logs with TensorBoard:
# http://127.0.0.1:6006/
tensorboard --logdir=[your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638/runs/ --host=0.0.0.0 --port=6006
Training loss (Smooth = 0.9):
Learning rate:
Validation loss (eval_steps = 1000):
GPU memory usage (batch size = 4):
Additionally, to plot the loss curve from the TensorBoard data with Matplotlib, with smoothing set to 0.9, refer to the following:
import os
from typing import Dict, List, Tuple
import matplotlib.pyplot as plt
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
Item = Dict[str, float]
TB_COLOR, TB_COLOR_SMOOTH = '#FFE2D9', '#FF7043'
def read_tensorboard_file(fpath: str) -> Dict[str, List[Item]]:
    if not os.path.isfile(fpath):
        raise FileNotFoundError(f'fpath: {fpath}')
    ea = EventAccumulator(fpath)
    ea.Reload()
    res: Dict[str, List[Item]] = {}
    tags = ea.Tags()['scalars']
    print(f"[Info] tags: {tags}")
    for tag in tags:
        values = ea.Scalars(tag)
        r: List[Item] = []
        for v in values:
            r.append({'step': v.step, 'value': v.value})
        res[tag] = r
    return res

def tensorboard_smoothing(values: List[float], smooth: float = 0.9) -> List[float]:
    norm_factor = 0
    x = 0
    res: List[float] = []
    for i in range(len(values)):
        x = x * smooth + values[i]  # Exponential decay
        norm_factor *= smooth
        norm_factor += 1
        res.append(x / norm_factor)
    return res

def plot_images(images_dir: str,
                tb_dir: str,
                smooth_key: List[str],
                smooth_val: float = 0.9,
                figsize: Tuple[int, int] = (8, 5),
                dpi: int = 100) -> None:
    """Using tensorboard's data content to plot images"""
    os.makedirs(images_dir, exist_ok=True)
    fname = [fname for fname in os.listdir(tb_dir) if os.path.isfile(os.path.join(tb_dir, fname))][0]
    tb_path = os.path.join(tb_dir, fname)
    data = read_tensorboard_file(tb_path)
    for k in data.keys():
        _data = data[k]
        steps = [d['step'] for d in _data]
        values = [d['value'] for d in _data]
        if len(values) == 0:
            continue
        _, ax = plt.subplots(1, 1, squeeze=True, figsize=figsize, dpi=dpi)
        ax.set_title(k)
        if len(values) == 1:
            ax.scatter(steps, values, color=TB_COLOR_SMOOTH)
        elif k in smooth_key:
            ax.plot(steps, values, color=TB_COLOR)
            values_s = tensorboard_smoothing(values, smooth_val)
            ax.plot(steps, values_s, color=TB_COLOR_SMOOTH)
        else:
            ax.plot(steps, values, color=TB_COLOR_SMOOTH)
        # fpath = os.path.join(images_dir, k.replace('/', '_'))
        # plt.savefig(fpath, dpi=dpi, bbox_inches='tight')
        # plt.close()
        plt.show()
        plt.close()
        break
ckpt_dir="[your path]/llm/ms-swift/output"
images_dir = os.path.join(ckpt_dir, 'images')
tb_dir = "[your path]/run_cuda/output/qwen2-vl-7b-instruct/v0-20241011-205638/runs/"
plot_images(images_dir, tb_dir, ['train/loss'], 0.9)
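Besides TensorBoard, the logging.jsonl file in the output directory records the training metrics. A minimal sketch of plotting the training loss from it, assuming each line is a JSON dict containing a loss value and a global_step/max_steps field like the metric dicts printed at the end of training:

# Minimal sketch (assumed logging.jsonl format): plot training loss without TensorBoard.
import json
import matplotlib.pyplot as plt

log_path = '[your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638/logging.jsonl'
steps, losses = [], []
with open(log_path, 'r', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        # skip lines that do not carry a training-loss entry
        if 'loss' in record and 'global_step/max_steps' in record:
            steps.append(int(record['global_step/max_steps'].split('/')[0]))
            losses.append(record['loss'])

plt.plot(steps, losses)
plt.xlabel('step')
plt.ylabel('train loss')
plt.show()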
3. Merging the LoRA Model
After training completes, the output LoRA checkpoint is as follows:
(rag) output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808# tree -L 1 -h .
.
├── [5.0K] README.md
├── [ 712] adapter_config.json
├── [ 39M] adapter_model.safetensors
├── [ 67] additional_config.json
├── [ 383] configuration.json
├── [ 219] generation_config.json
├── [ 77M] optimizer.pt
├── [ 14K] rng_state.pth
├── [1.0K] scheduler.pt
├── [ 11K] sft_args.json
├── [608K] trainer_state.json
└── [7.2K] training_args.bin
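The adapter_config.json typically records the LoRA hyper-parameters (rank, alpha, target modules); a minimal sketch for inspecting it:

# Minimal sketch: print the LoRA adapter configuration of the checkpoint.
import json

ckpt = '[your folder]/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808'
with open(f'{ckpt}/adapter_config.json', 'r', encoding='utf-8') as f:
    print(json.dumps(json.load(f), indent=2))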
Merge the LoRA weights into the base model and evaluate the model at the same time:
CUDA_VISIBLE_DEVICES=0 swift infer \
--ckpt_dir [your path]/run_cuda/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808/ \
--load_dataset_config true \
--merge_lora true
# Evaluate the model directly
Run inference with the merged model:
# [your path]/run_cuda/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808-merged
# CUDA_VISIBLE_DEVICES=0 swift infer --model_type qwen2-vl-7b-instruct
CUDA_VISIBLE_DEVICES=0 swift infer --ckpt_dir [your path]/run_cuda/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808-merged
Test the difference in output:
<<< <image>使用OCR识别图像中的Latex公式
Input an image path or URL <<< [your path]/llm/vision_test_data/latex-fullhand.png
d s ^ { 2 } = ( 1 - \frac { q c o s \theta } { r } ) ^ { \frac { 2 } { 1 + \kappa ^ { 2 } } } \{ d r ^ { 2 } + r ^ { 2 } d \theta ^ { 2 } + r ^ { 2 } s i n ^ { 2 } \theta d \varphi ^ { 2 } \} - \frac { d t ^ { 2 } } { ( 1 - \frac { q c o s \theta } { r } ) ^ { \frac { 2 } { 1 + \kappa ^ { 2 } } } } .
# Previous format
# ds^2 = (1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}} \{d\delta^2 + r^2 d\theta^2 + n^2 s/n^2 d\phi^2 \} - \frac{dt^2}{(1 - \frac{qcos\theta}{r})^{\frac{2}{1 + \alpha^2}}}.
Note: the output format differs significantly from before; the model has learned the new OCR output format. The previous output contained no spaces, while the new output inserts spaces between tokens, consistent with the fine-tuning data.
Samples from the fine-tuning training data have the same format as the LoRA output, confirming that training succeeded:
d s ^ { 2 } = ( 1 - { \frac { q c o s \theta } { r } } ) ^ { \frac { 2 } { 1 + \alpha ^ { 2 } } } \lbrace d r ^ { 2 } + r ^ { 2 } d \theta ^ { 2 } + r ^ { 2 } s i n ^ { 2 } \theta d \varphi ^ { 2 } \rbrace - { \frac { d t ^ { 2 } } { ( 1 - { \frac { q c o s \theta } { r } } ) ^ { \frac { 2 } { 1 + \alpha ^ { 2 } } } } } \, .
\widetilde \gamma _ { \mathrm { h o p f } } \simeq \sum _ { n > 0 } \widetilde { G } _ { n } { \frac { ( - a ) ^ { n } } { 2 ^ { 2 n - 1 } } }
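In addition, because merging writes out a full set of model weights, the merged checkpoint can usually also be loaded outside SWIFT. A minimal sketch with transformers (assumptions: a transformers version with Qwen2-VL support, e.g. >= 4.45, and the qwen-vl-utils package are installed; paths are placeholders):

# Minimal sketch: run the merged checkpoint directly with transformers.
# Assumes transformers >= 4.45 and qwen-vl-utils are installed.
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

ckpt = '[your path]/run_cuda/output/qwen2-vl-7b-instruct/v0-20241011-205638/checkpoint-11808-merged'
model = Qwen2VLForConditionalGeneration.from_pretrained(ckpt, torch_dtype='auto', device_map='auto')
processor = AutoProcessor.from_pretrained(ckpt)

messages = [{'role': 'user', 'content': [
    {'type': 'image', 'image': '[your path]/llm/vision_test_data/latex-fullhand.png'},
    {'type': 'text', 'text': 'Using LaTeX to perform OCR on the image.'},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos, padding=True, return_tensors='pt').to(model.device)
out_ids = model.generate(**inputs, max_new_tokens=256)
out_ids = [o[len(i):] for i, o in zip(inputs.input_ids, out_ids)]  # strip the prompt tokens
print(processor.batch_decode(out_ids, skip_special_tokens=True)[0])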
4. Call Logic of the --dataset Training Parameter
The dataset declaration is located in swift/llm/utils/dataset.py, for reference:
latex_ocr_print = 'latex-ocr-print'

register_dataset(
    DatasetName.latex_ocr_print,    # dataset_name
    'AI-ModelScope/LaTeX_OCR',      # dataset_id_or_path
    ['full'],                       # subsets
    _preprocess_latex_ocr_dataset,  # preprocess_func
    get_dataset_from_repo,          # get_function
    split=['validation', 'test'],   # There are some problems in the training dataset.
    hf_dataset_id='linxy/LaTeX_OCR',
    tags=['chat', 'ocr', 'multi-modal', 'vision'])
The register_dataset function registers dataset_info into DATASET_MAPPING:
dataset_info = {
    'dataset_id_or_path': dataset_id_or_path,
    'subsets': subsets,
    'preprocess_func': preprocess_func,
    'split': split,
    'hf_dataset_id': hf_dataset_id,
    'is_local': is_local,
    **kwargs
}
DATASET_MAPPING[dataset_name] = dataset_info
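Once registered, the entry can be looked up by its dataset name; a minimal sketch, assuming DATASET_MAPPING and DatasetName are importable from swift.llm.utils.dataset, where they are defined:

# Minimal sketch: query the registered dataset info by name.
from swift.llm.utils.dataset import DATASET_MAPPING, DatasetName

info = DATASET_MAPPING[DatasetName.latex_ocr_print]
print(info['dataset_id_or_path'])  # 'AI-ModelScope/LaTeX_OCR'
print(info['hf_dataset_id'])       # 'linxy/LaTeX_OCR'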
The args.dataset parameter is consumed in the _get_train_val_dataset function:
sft_main = get_sft_main(SftArguments, llm_sft)

def llm_sft(args: SftArguments) -> Dict[str, Any]:
    # ...
    train_dataset, val_dataset = prepare_dataset(args, template, msg)  # call

def prepare_dataset(args, template: Template, msg: Optional[Dict[str, Any]] = None):
    # ...
    train_dataset, val_dataset = _get_train_val_dataset(args)  # call

def _get_train_val_dataset(args: SftArguments) -> Tuple[HfDataset, Optional[HfDataset]]:
    # ...
    train_dataset, val_dataset = get_dataset(
        args.dataset,
        args.dataset_test_ratio,
        args.dataset_seed,
        check_dataset_strategy=args.check_dataset_strategy,
        model_name=args.model_name,
        model_author=args.model_author,
        streaming=args.streaming,
        streaming_val_size=args.streaming_val_size,
        streaming_buffer_size=args.streaming_buffer_size)
That is, the call chain is swift/llm/sft.py#llm_sft() -> prepare_dataset() -> _get_train_val_dataset() -> get_dataset(), where get_dataset() is defined in swift/llm/utils/dataset.py:
def get_dataset(
        dataset_name_list: Union[List[str], str],
        dataset_test_ratio: float = 0.,
        dataset_seed: Union[int, RandomState] = 42,
        check_dataset_strategy: Literal['none', 'discard', 'error', 'warning'] = 'none',
        *,
        # for self-cognition
        model_name: Union[Tuple[str, str], List[str], None] = None,
        model_author: Union[Tuple[str, str], List[str], None] = None,
        **kwargs) -> Tuple[DATASET_TYPE, Optional[DATASET_TYPE]]:
    """Returns train_dataset and val_dataset"""
    # ...
    if isinstance(dataset_name_list, str):
        dataset_name_list = [dataset_name_list]
    # ...
    # dataset_id_or_path -> dataset_name
    dataset_name_list = _dataset_id_to_name(dataset_name_list)
The _dataset_id_to_name() call then triggers the following chain:
- calls register_dataset_info()
- calls register_local_dataset()
- calls register_dataset()
- calls get_local_dataset()
- calls load_dataset_from_local()
- handles .jsonl, .json, and .csv files
- or calls preprocess_func()

That is:
if dataset_path.endswith('.csv'):
    dataset = HfDataset.from_csv(dataset_path, na_filter=False)
elif dataset_path.endswith('.jsonl') or dataset_path.endswith('.json'):
    dataset = HfDataset.from_json(dataset_path)
else:
    raise ValueError('The custom dataset only supports CSV, JSONL or JSON format.')
dataset = preprocess_func(dataset)