1. Hard-won advice on reducing GPU memory usage
Medical images are often very large, so training a model on them can be difficult, but there are now many ways to cut GPU memory usage.
For reasons I have not fully pinned down, using the transformers Trainer genuinely reduces GPU memory consumption; even without DeepSpeed, memory usage goes down.
- Don't reinvent the wheel
- I have used LoRA before and even implemented it myself. I strongly advise against writing your own LoRA: designing it takes a lot of time, verifying that the LoRA is actually effective takes a lot of time, and merging the weights back takes a lot of time. Use an existing, well-tested LoRA implementation whenever you can (see the sketch below).
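If you go that route, a library such as PEFT already handles LoRA injection and weight merging. A minimal sketch, assuming a model object already exists; the target_modules names here are placeholders, not the real layer names of this project's model:

# Sketch only: wrapping an existing model with PEFT's LoRA instead of a hand-rolled one.
# "q_proj"/"v_proj" are placeholder module names; substitute the projection layers of your model.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # placeholder: layers to adapt
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # confirm that only the LoRA weights are trainable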
I recommend integrating the model and the training data through transformers: you only need to write a Dataset and a collate_fn, and at most override the Trainer's compute_loss, and the model trains without further plumbing. This is the most efficient and effective route.
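As a rough sketch of those three pieces (this roughly mirrors the Dataset, collate_fn, and MyTrainer used in section 2.2, but the column names and the way the model is called are my own placeholders, not the project's actual code):

# Sketch: the three pieces you typically write yourself when using the transformers Trainer.
# Column names ("image_path", "report") are assumptions for illustration only.
import pandas as pd
from torch.utils.data import Dataset
from transformers import Trainer

class XlsxReportDataset(Dataset):
    def __init__(self, xlsx_file):
        self.df = pd.read_excel(xlsx_file)

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        return {"image_path": row["image_path"], "report": row["report"]}

def collate_fn(batch):
    # Gather the samples into a batch dict; padding/tensorization is model-specific.
    return {
        "image_paths": [b["image_path"] for b in batch],
        "reports": [b["report"] for b in batch],
    }

class MyTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # How the model is called is model-specific; here I assume it returns (loss, outputs).
        loss, outputs = model(**inputs)
        return (loss, outputs) if return_outputs else loss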
2. DeepSpeed is quick and convenient
Using DeepSpeed is the shortest workflow.
2.1 If you get compilation warnings, load the required modules
module avail
module load compiler/gcc/7.3.1
module load cuda/7/11.8
DeepSpeed's compilation step actually rebuilds some GPU-side operations so that the CPU can execute them, while still conforming to the CUDA compute stack and interoperating with the GPU, so both the GCC version and the CUDA version used for compilation have to meet the version requirements.
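A quick sanity check of which toolchain DeepSpeed's JIT build will actually see (my own sketch, not part of the training script): the CUDA toolkit PyTorch was built against should match the CUDA_HOME used when ops such as cpu_adam are compiled.

# Sanity check: print the CUDA version torch was built with and the CUDA_HOME
# that torch's extension builder (used by DeepSpeed's JIT ops) will pick up.
import torch
from torch.utils import cpp_extension

print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)
print("CUDA_HOME for JIT builds:", cpp_extension.CUDA_HOME)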
2.2 Writing the Trainer Python script
I recommend the transformers Trainer: with it, many fields in the DeepSpeed JSON file can simply be set to "auto", and it is also easy to point the Trainer at the JSON config file.
Also note that the launcher may require you to add an args parser with a local_rank argument for global rank management.
Specify the ds_config.json file in TrainingArguments:
import argparse

from transformers import TrainingArguments

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1,
                        help="Local rank. Necessary for using the torch.distributed.launch utility.")
    args = parser.parse_args()
    return args

args = parse_args()
training_args = TrainingArguments(
    output_dir='./checkpoint/Eff_R2GenCMN_base',
    num_train_epochs=1000,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=10,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./checkpoint/Eff_R2GenCMN_base/output_logs',
    logging_steps=10,
    save_strategy='steps',        # save a checkpoint every save_steps steps
    save_steps=100,               # save every 100 steps
    save_total_limit=5,           # keep at most 5 checkpoints
    report_to="none",
    fp16=True,                    # enable mixed-precision training
    deepspeed='./ds_config.json', # hand the DeepSpeed config file to the Trainer
)
tokenizer = Tokenizer()
model = R2GenCMN(args, tokenizer)
dataset_train = Dataset(xlsx_file="./dataset/train_dataset.xlsx")
dataset_test = Dataset(xlsx_file="./dataset/test_dataset.xlsx")

trainer = MyTrainer(
    model=model,                  # the model to train
    args=training_args,           # the training arguments defined above
    train_dataset=dataset_train,  # training dataset
    eval_dataset=dataset_test,    # evaluation dataset
    data_collator=collate_fn,
    # a compute_metrics function can be added here to compute evaluation metrics
)
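The one step not shown above is starting the run; with the Trainer this is a single call, and because TrainingArguments.deepspeed points at ds_config.json, the Trainer initializes the DeepSpeed engine itself:

trainer.train()  # sets up DeepSpeed from ds_config.json and runs the training loop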
2.3 Writing the ds_config file
The point of a separate ds_config file is to keep the Python script concise, make changing parameters convenient, and reduce how much you have to keep in your head.
The ds_config.json file is usually generic and reusable; if the batch fields are written as "auto", DeepSpeed fills the batch size in for you based on the TrainingArguments and the available GPUs.
The config below only sets up ZeRO stage 2:
{
    "bfloat16": {
        "enabled": false
    },
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "steps_per_print": 1e5
}
Or use ZeRO stage 3, which additionally partitions and offloads the parameters themselves, trading some speed for even lower per-GPU memory:
{
    "bfloat16": {
        "enabled": false
    },
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 1e5,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
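If you are unsure whether stage 2 (optimizer state and gradient partitioning with CPU offload) will fit, or whether you need stage 3, DeepSpeed ships helpers that estimate the memory required for the model states. A sketch, assuming model is the R2GenCMN instance built in section 2.2:

# Rough memory estimate for ZeRO-2 vs ZeRO-3 model states (weights, gradients, optimizer).
# Assumes `model` has already been constructed as in section 2.2.
from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_live
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

estimate_zero2_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)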
2.4 Running the program
Finally, just launch the script with deepspeed.
The warnings in the output below do not affect training; they only come from DeepSpeed recompiling its ops.
deepspeed train.py
[2024-04-02 12:04:43,112] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-02 12:05:48,493] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-04-02 12:05:48,493] [INFO] [runner.py:555:main] cmd = /public/home/v-yumy/anaconda3/envs/llava2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None transformer_train.py
[2024-04-02 12:05:51,627] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-02 12:05:55,944] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2024-04-02 12:05:55,944] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2024-04-02 12:05:55,944] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2024-04-02 12:05:55,944] [INFO] [launch.py:163:main] dist_world_size=1
[2024-04-02 12:05:55,944] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-04-02 12:06:29,136] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-02 12:06:31,519] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-04-02 12:06:31,519] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-04-02 12:06:31,519] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.742 seconds.
Prefix dict has been built successfully.
EfficientNet: replace first conv
EncoderDecoder's Transformer is base
EncoderDecoder is base
Visual features: no pretraining
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /public/home/v-yumy/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Emitting ninja build file /public/home/v-yumy/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.7046074867248535 seconds
Rank: 0 partition count [1] and sizes[(42770360, False)]
{'loss': 6.7285, 'learning_rate': 1.6730270909663467e-05, 'epoch': 0.02}
{'loss': 6.0535, 'learning_rate': 2.3254658315702903e-05, 'epoch': 0.05}
{'loss': 5.598, 'learning_rate': 2.6809450068309278e-05, 'epoch': 0.07}
{'loss': 5.2824, 'learning_rate': 2.9266416338062584e-05, 'epoch': 0.1}
{'loss': 5.0738, 'learning_rate': 3.114597855245884e-05, 'epoch': 0.12}
{'loss': 4.8191, 'learning_rate': 3.266853634404809e-05, 'epoch': 0.15}
{'loss': 4.5336, 'learning_rate': 3.3948300828875964e-05, 'epoch': 0.17}