1. Hard-won advice on reducing GPU memory usage
Medical images are often very large, so training a model on them can be difficult, but there are now many ways to cut GPU memory usage.
For reasons I have not fully pinned down, using the transformers Trainer genuinely reduces GPU memory consumption; even without DeepSpeed, memory usage goes down.
- Don't reinvent the wheel
- I have used LoRA before and even implemented it myself. I strongly advise against writing your own LoRA: designing it takes a lot of time, verifying that the LoRA is actually effective takes a lot of time, and merging the weights back takes a lot of time. Use an existing, well-tested LoRA implementation whenever you can (see the sketch below).
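If you go that route, a library such as PEFT already handles LoRA injection and weight merging. A minimal sketch, assuming a model object already exists; the target_modules names here are placeholders, not the real layer names of this project's model:

# Sketch only: wrapping an existing model with PEFT's LoRA instead of a hand-rolled one.
# "q_proj"/"v_proj" are placeholder module names; substitute the projection layers of your model.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,                         # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # placeholder: layers to adapt
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()         # confirm that only the LoRA weights are trainable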
I recommend integrating the model and the training data through transformers: you only need to write a Dataset and a collate_fn, and at most override the Trainer's compute_loss, and the model trains without further plumbing. This is the most efficient and effective route.
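As a rough sketch of those three pieces (this roughly mirrors the Dataset, collate_fn, and MyTrainer used in section 2.2, but the column names and the way the model is called are my own placeholders, not the project's actual code):

# Sketch: the three pieces you typically write yourself when using the transformers Trainer.
# Column names ("image_path", "report") are assumptions for illustration only.
import pandas as pd
from torch.utils.data import Dataset
from transformers import Trainer

class XlsxReportDataset(Dataset):
    def __init__(self, xlsx_file):
        self.df = pd.read_excel(xlsx_file)

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        return {"image_path": row["image_path"], "report": row["report"]}

def collate_fn(batch):
    # Gather the samples into a batch dict; padding/tensorization is model-specific.
    return {
        "image_paths": [b["image_path"] for b in batch],
        "reports": [b["report"] for b in batch],
    }

class MyTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # How the model is called is model-specific; here I assume it returns (loss, outputs).
        loss, outputs = model(**inputs)
        return (loss, outputs) if return_outputs else loss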
2. DeepSpeed is quick and convenient
Using DeepSpeed is the shortest workflow.
2.1 If you get compilation warnings, load the required modules
module avail
module load compiler/gcc/7.3.1
module load cuda/7/11.8
DeepSpeed's compilation step actually rebuilds some GPU-side operations so that the CPU can execute them, while still conforming to the CUDA compute stack and interoperating with the GPU, so both the GCC version and the CUDA version used for compilation have to meet the version requirements.
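A quick sanity check of which toolchain DeepSpeed's JIT build will actually see (my own sketch, not part of the training script): the CUDA toolkit PyTorch was built against should match the CUDA_HOME used when ops such as cpu_adam are compiled.

# Sanity check: print the CUDA version torch was built with and the CUDA_HOME
# that torch's extension builder (used by DeepSpeed's JIT ops) will pick up.
import torch
from torch.utils import cpp_extension

print("torch:", torch.__version__)
print("torch built with CUDA:", torch.version.cuda)
print("CUDA_HOME for JIT builds:", cpp_extension.CUDA_HOME)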
2.2 Writing the Trainer Python script
I recommend the transformers Trainer: with it, many fields in the DeepSpeed JSON file can simply be set to "auto", and it is also easy to point the Trainer at the JSON config file.
Also note that the launcher may require you to add an args parser with a local_rank argument for global rank management.
Specify the ds_config.json file in TrainingArguments:
import argparse

from transformers import TrainingArguments

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=-1,
                        help="Local rank. Necessary for using the torch.distributed.launch utility.")
    args = parser.parse_args()
    return args

args = parse_args()
training_args = TrainingArguments(
    output_dir='./checkpoint/Eff_R2GenCMN_base',
    num_train_epochs=1000,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=10,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./checkpoint/Eff_R2GenCMN_base/output_logs',
    logging_steps=10,
    save_strategy='steps',        # save a checkpoint every save_steps steps
    save_steps=100,               # save every 100 steps
    save_total_limit=5,           # keep at most 5 checkpoints
    report_to="none",
    fp16=True,                    # enable mixed-precision training
    deepspeed='./ds_config.json', # hand the DeepSpeed config file to the Trainer
)
tokenizer = Tokenizer()
model = R2GenCMN(args, tokenizer)
dataset_train = Dataset(xlsx_file="./dataset/train_dataset.xlsx")
dataset_test = Dataset(xlsx_file="./dataset/test_dataset.xlsx")

trainer = MyTrainer(
    model=model,                  # the model to train
    args=training_args,           # the training arguments defined above
    train_dataset=dataset_train,  # training dataset
    eval_dataset=dataset_test,    # evaluation dataset
    data_collator=collate_fn,
    # a compute_metrics function can be added here to compute evaluation metrics
)
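The one step not shown above is starting the run; with the Trainer this is a single call, and because TrainingArguments.deepspeed points at ds_config.json, the Trainer initializes the DeepSpeed engine itself:

trainer.train()  # sets up DeepSpeed from ds_config.json and runs the training loop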
2.3 Writing the ds_config file
The point of a separate ds_config file is to keep the Python script concise, make changing parameters convenient, and reduce how much you have to keep in your head.
The ds_config.json file is usually generic and reusable; if the batch fields are written as "auto", DeepSpeed fills the batch size in for you based on the TrainingArguments and the available GPUs.
The config below only sets up ZeRO stage 2:
{
    "bfloat16": {
        "enabled": false
    },
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 2e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 2e8,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "steps_per_print": 1e5
}
Or use ZeRO stage 3, which additionally partitions and offloads the parameters themselves, trading some speed for even lower per-GPU memory:
{
    "bfloat16": {
        "enabled": false
    },
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_fp16_weights_on_model_save": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 1e5,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
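If you are unsure whether stage 2 (optimizer state and gradient partitioning with CPU offload) will fit, or whether you need stage 3, DeepSpeed ships helpers that estimate the memory required for the model states. A sketch, assuming model is the R2GenCMN instance built in section 2.2:

# Rough memory estimate for ZeRO-2 vs ZeRO-3 model states (weights, gradients, optimizer).
# Assumes `model` has already been constructed as in section 2.2.
from deepspeed.runtime.zero.stage_1_and_2 import estimate_zero2_model_states_mem_needs_all_live
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live

estimate_zero2_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)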
2.4 Running the program
Finally, just launch the script with deepspeed.
The warnings in the output below do not affect training; they only come from DeepSpeed recompiling its ops.
deepspeed train.py
[2024-04-02 12:04:43,112] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-02 12:05:48,493] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-04-02 12:05:48,493] [INFO] [runner.py:555:main] cmd = /public/home/v-yumy/anaconda3/envs/llava2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None transformer_train.py
[2024-04-02 12:05:51,627] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-02 12:05:55,944] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}
[2024-04-02 12:05:55,944] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0
[2024-04-02 12:05:55,944] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2024-04-02 12:05:55,944] [INFO] [launch.py:163:main] dist_world_size=1
[2024-04-02 12:05:55,944] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0
[2024-04-02 12:06:29,136] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-02 12:06:31,519] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-04-02 12:06:31,519] [INFO] [comm.py:594:init_distributed] cdb=None
[2024-04-02 12:06:31,519] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.742 seconds.
Prefix dict has been built successfully.
EfficientNet: replace first conv
EncoderDecoder's Transformer is base
EncoderDecoder is base
Visual features: no pretraining
[WARNING] cpu_adam cuda is missing or is incompatible with installed torch, only cpu ops can be compiled!
Using /public/home/v-yumy/.cache/torch_extensions/py310_cu117 as PyTorch extensions root...
Emitting ninja build file /public/home/v-yumy/.cache/torch_extensions/py310_cu117/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.7046074867248535 seconds
Rank: 0 partition count [1] and sizes[(42770360, False)]
{'loss': 6.7285, 'learning_rate': 1.6730270909663467e-05, 'epoch': 0.02}
{'loss': 6.0535, 'learning_rate': 2.3254658315702903e-05, 'epoch': 0.05}
{'loss': 5.598, 'learning_rate': 2.6809450068309278e-05, 'epoch': 0.07}
{'loss': 5.2824, 'learning_rate': 2.9266416338062584e-05, 'epoch': 0.1}
{'loss': 5.0738, 'learning_rate': 3.114597855245884e-05, 'epoch': 0.12}
{'loss': 4.8191, 'learning_rate': 3.266853634404809e-05, 'epoch': 0.15}
{'loss': 4.5336, 'learning_rate': 3.3948300828875964e-05, 'epoch': 0.17}