Chinese Version
This article walks through the DeepSpeed configuration file in detail, using a 4x RTX 3090 setup as the working example. It explains what each key parameter means and provides remedies for out-of-memory (OOM) errors.
DeepSpeed Configuration Files Explained: From Basics to Practice
DeepSpeed is an important tool for accelerating large-scale distributed training, and its flexible configuration file is the key to efficient training. In this post we take a close look at the structure and key parameters of the DeepSpeed configuration file and, using a 4x RTX 3090 training setup as the concrete scenario, discuss how to tune the configuration and resolve OOM errors.
1. Structure of the Configuration File
A DeepSpeed configuration file is usually written in JSON and contains the following core parts:
- bf16/fp16 settings: whether mixed-precision training is enabled.
- ZeRO optimization settings: control the memory-optimization strategy.
- Training-related parameters: for example, batch size and gradient accumulation steps.
Below is a typical example configuration file:
```json
{
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": false,
    "reduce_bucket_size": 5e5,
    "sub_group_size": 5e5
  },
  "gradient_accumulation_steps": 4,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_clipping": 1.0
}
```
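As a rough sketch of how such a file is consumed, the JSON path can be handed to `deepspeed.initialize`; the model, optimizer, and file name below are placeholders, and the script is assumed to be started with the `deepspeed` or `accelerate` launcher:

```python
# Minimal sketch: feeding a JSON config like the one above to DeepSpeed.
# The model and optimizer are stand-ins; a real script would build its own.
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)                       # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config="configs/ds_config.json",  # assumed path to a JSON file like the one above
)

# Training steps then go through the returned engine:
#   loss = ...; model_engine.backward(loss); model_engine.step()
```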
2. Key Parameter Breakdown
bf16.enabled
- Meaning: enables BF16 mixed-precision training.
- Effect: noticeably reduces GPU memory usage and speeds up training.
zero_optimization.stage
- Meaning: selects the ZeRO optimization stage.
  - Stage 1: partitions the optimizer states across GPUs.
  - Stage 2: additionally partitions the gradients.
  - Stage 3: additionally partitions the model parameters themselves.
- Recommendation: for 4x RTX 3090, prefer Stage 2, and move to Stage 3 only when the model does not fit with Stage 2.
overlap_comm
- Meaning: overlaps gradient communication with backward computation, reducing communication overhead.
- Recommendation: always enable it in multi-GPU setups.
contiguous_gradients
- Meaning: whether gradients are copied into a contiguous buffer in memory.
- Pro: reduces memory fragmentation and makes gradient reductions more efficient.
- Con: the extra buffer increases GPU memory usage.
- Recommendation: set it to `false` if you are short on memory.
reduce_bucket_size
- Meaning: the size of the buckets used to reduce/all-reduce gradients in one communication call.
- Unit: number of elements (not bytes).
- Default: 5e8 in the DeepSpeed documentation.
- Tuning:
  - If memory is tight, reduce it to 1e5 or 5e5.
  - If communication is clearly the bottleneck, increase it.
sub_group_size
- Meaning: the tile size (in elements) in which parameters are processed by the optimizer; it mainly matters for ZeRO Stage 3 and optimizer offloading.
- Default: 1e9.
- Tuning:
  - Small models: 5e5 or lower.
  - Large models: tune against available memory, typically 1e6 to 1e7.
gradient_accumulation_steps
- Meaning: the number of micro-steps over which gradients are accumulated, reaching a larger effective batch size without increasing per-step memory pressure.
- Recommendation: increase it gradually (e.g., from 4 to 8), keeping in mind that the effective batch size changes with it.
train_micro_batch_size_per_gpu
- Meaning: the micro batch size per GPU.
- Recommendation: reduce it when memory is tight, e.g., from 4 down to 1.
gradient_clipping
- Meaning: clips the gradient norm to prevent exploding gradients.
- Recommended value: 1.0.
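The batch-related parameters above jointly determine the effective global batch size (micro batch size per GPU x accumulation steps x number of GPUs); a quick sketch of the arithmetic for the 4-GPU example config:

```python
# Effective (global) batch size implied by the example config on 4 GPUs.
train_micro_batch_size_per_gpu = 1
gradient_accumulation_steps = 4
num_gpus = 4

effective_batch_size = (
    train_micro_batch_size_per_gpu * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)  # 16
```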
3. Optimization Tips for 4x RTX 3090
- Fixes for running out of GPU memory (a combined example follows at the end of this section):
  - Reduce `reduce_bucket_size` and `sub_group_size`: `"reduce_bucket_size": 1e5, "sub_group_size": 5e5`
  - Lower `train_micro_batch_size_per_gpu`: `"train_micro_batch_size_per_gpu": 1`
  - Increase `gradient_accumulation_steps`: `"gradient_accumulation_steps": 8`
  - Disable `contiguous_gradients`: `"contiguous_gradients": false`
- Check the NCCL environment variables:
  Make sure the following variables are set correctly so that communication problems do not surface as memory or timeout errors.
  ```bash
  export NCCL_BLOCKING_WAIT=1
  export NCCL_ASYNC_ERROR_HANDLING=1
  export NCCL_TIMEOUT=10800
  ```
- Enable CPU offloading (if necessary):
  When GPU memory is severely constrained, part of the optimizer state can be offloaded to the CPU. The block below goes inside `zero_optimization`:
  ```json
  "offload_optimizer": {
    "device": "cpu",
    "pin_memory": true
  }
  ```
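Putting these pieces together, a combined memory-saving configuration for this setup might look like the sketch below, written as a Python dict so it can also be passed directly via the `config` argument of `deepspeed.initialize` instead of a JSON file; the values are illustrative starting points, not tuned results.

```python
# Sketch: one config that combines the memory-saving tweaks above (4x 3090).
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": False,   # trade a little speed for memory
        "reduce_bucket_size": 1e5,       # smaller buckets -> lower peak memory
        "sub_group_size": 5e5,
        "offload_optimizer": {           # note: nested inside zero_optimization
            "device": "cpu",
            "pin_memory": True,
        },
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "gradient_clipping": 1.0,
}
```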
4. Analyzing Results and Monitoring Logs
During training, the following setting produces detailed timing and resource information:
```json
"wall_clock_breakdown": true
```
Combine it with DeepSpeed's logs to track memory usage, communication efficiency, and other key metrics.
With a sensibly tuned DeepSpeed configuration, matched to the available hardware and the task at hand, training efficiency improves noticeably and memory pressure drops.
English Version
This article explains DeepSpeed configuration files, focusing on practical usage with a 4x RTX 3090 setup. It includes a breakdown of key parameters such as `contiguous_gradients`, `reduce_bucket_size`, and `sub_group_size`, as well as solutions for handling out-of-memory (OOM) errors.
DeepSpeed Configuration Files: A Comprehensive Guide
DeepSpeed offers advanced optimization features like ZeRO (Zero Redundancy Optimizer) to enable efficient large-scale model training. This post will delve into configuring DeepSpeed for optimal performance, with examples and tips tailored to a 4x NVIDIA 3090 GPU setup.
1. Key Parameters in a DeepSpeed Configuration File
Below is an example configuration file for ZeRO Stage 2 optimization, designed for fine-tuning large models:
```json
{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": false,
    "reduce_bucket_size": 5e5,
    "sub_group_size": 5e5
  },
  "gradient_accumulation_steps": 4,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_clipping": 1.0
}
```
Let’s break down the parameters:
(1) zero_optimization.stage
- Defines the ZeRO optimization stage:
  - Stage 2: Partitions optimizer states and gradients across GPUs, reducing memory usage.
  - Stage 3: Additionally partitions the model parameters and, combined with offloading, gives the most aggressive memory savings.
(2) overlap_comm
- Overlaps gradient reduction (communication) with backward computation, improving efficiency during distributed training.
- Recommendation: enable it in multi-GPU setups.
(3) contiguous_gradients
- Default: `true` in recent DeepSpeed releases.
- When `true`, gradients are copied into a single contiguous buffer as they are produced.
  - Benefit: Less memory fragmentation and faster gradient reductions.
  - Drawback: The extra buffer increases memory usage.
  - Recommendation: Set to `false` if facing OOM issues.
(4) reduce_bucket_size
- Defines the size (in elements) of the gradient buckets used for reduce/all-reduce operations.
  - Smaller values (e.g., `5e5`) reduce memory pressure but may slightly slow down training.
  - Larger values improve speed but require more memory.
(5) sub_group_size
- Controls the tile size (in elements) in which parameters are processed during optimizer steps; it mainly matters for ZeRO Stage 3 and offloading.
  - Default: A large value (`1e9`), so most models are handled as a single group.
  - Recommendation: Reduce to `5e5` or lower for better memory efficiency.
(6) gradient_accumulation_steps
- Number of micro-steps over which gradients are accumulated before an optimizer step.
- Higher values effectively increase the batch size without increasing per-GPU memory load.
(7) train_micro_batch_size_per_gpu
- Batch size per GPU per step.
- Recommendation: Start with a small value (e.g., `1`) and scale up gradually.
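Before tuning any of this for OOM errors, it can help to estimate up front how much memory the ZeRO model states will need. DeepSpeed ships memory estimators for this; a small sketch, assuming `transformers` and `deepspeed` are installed and the gated Gemma checkpoint is accessible:

```python
# Sketch: estimate memory needs of ZeRO-3 model states for google/gemma-2-2b
# on one node with 4 GPUs, without launching distributed training.
from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage3 import (
    estimate_zero3_model_states_mem_needs_all_live,
)

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")
estimate_zero3_model_states_mem_needs_all_live(
    model, num_gpus_per_node=4, num_nodes=1
)
```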
2. Handling Out-of-Memory (OOM) Errors
Training large models like Google Gemma-2-2B on GPUs with limited memory (24 GB, such as NVIDIA 3090) often results in OOM errors. Here are optimization strategies:
(1) Reduce train_micro_batch_size_per_gpu
- Start with `1` and only increase if memory allows.
(2) Lower reduce_bucket_size and sub_group_size
- Decrease both to `1e5` or `5e4`. This reduces the memory footprint during gradient reduction at the cost of slightly increased communication overhead.
(3) Enable offload_optimizer or offload_param (for ZeRO Stage 3)
- Offload optimizer states or parameters to CPU if memory remains insufficient.
- Example configuration for optimizer offloading:
```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```
(4) Use Gradient Checkpointing
- Recomputes intermediate activations during backpropagation instead of storing them, saving memory. In the DeepSpeed JSON config this is controlled by the `activation_checkpointing` block:
```json
"activation_checkpointing": {
  "partition_activations": true,
  "contiguous_memory_optimization": false
}
```
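When fine-tuning a Hugging Face model, the same effect can also be switched on from the model side; a small sketch, assuming access to the gated Gemma checkpoint:

```python
# Sketch: enable activation (gradient) checkpointing on a Hugging Face model.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")
model.gradient_checkpointing_enable()  # recompute activations in the backward pass
```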
(5) Mixed Precision Training (bf16 or fp16)
- Use `bf16` for better memory efficiency with minimal precision loss.
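If in doubt whether the hardware supports `bf16`, PyTorch offers a quick check; a small sketch (the RTX 3090 is an Ampere card and does support it):

```python
# Sketch: pick bf16 when the GPU supports it, otherwise fall back to fp16.
import torch

if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    precision = "bf16"
else:
    precision = "fp16"
print(precision)
```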
(6) Increase gradient_accumulation_steps
- Accumulate gradients over more steps so that each per-GPU micro-batch can stay small while the effective batch size is preserved.
(7) Reduce max_seq_length
- Shorten sequence length (e.g., 512 or 768 tokens) to decrease memory usage.
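Sequence length is typically capped at tokenization time; a small sketch matching the `--max_seq_length 768` used in the launch command below, assuming access to the Gemma tokenizer:

```python
# Sketch: cap sequences at 768 tokens during tokenization.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
batch = tokenizer(
    ["example training text"],
    truncation=True,
    max_length=768,          # shorter sequences -> smaller activation memory
    padding="max_length",
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([1, 768])
```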
3. Practical Example: Fine-Tuning on 4x NVIDIA 3090 GPUs
The following `accelerate launch` command illustrates how to combine the above settings for fine-tuning a large model (in a real invocation, the path to the training script goes between the launcher flags and the script arguments such as `--model_name_or_path`):
```bash
accelerate launch \
  --mixed_precision bf16 \
  --num_machines 1 \
  --num_processes 4 \
  --machine_rank 0 \
  --main_process_ip 127.0.0.1 \
  --main_process_port 29400 \
  --use_deepspeed \
  --deepspeed_config_file configs/ds_config.json \
  --model_name_or_path google/gemma-2-2b \
  --tokenizer_name google/gemma-2-2b \
  --max_seq_length 768 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 4 \
  --learning_rate 5e-6 \
  --num_train_epochs 1 \
  --output_dir output/sft_gemma2
```
4. Debugging Tips
- Enable Detailed Logs: Set `"wall_clock_breakdown": true` in the config file to identify bottlenecks.
- NCCL Tuning: Add environment variables to handle communication errors:
```bash
export NCCL_BLOCKING_WAIT=1
export NCCL_ASYNC_ERROR_HANDLING=1
```
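- GPU Memory Counters: In addition to DeepSpeed's own logs, PyTorch's memory statistics help spot which rank is closest to the 24 GB limit; a small sketch to call after a training step:

```python
# Sketch: report this process's peak GPU memory, e.g. after an optimizer step.
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    peak_alloc = torch.cuda.max_memory_allocated(device) / 1024**3
    peak_reserved = torch.cuda.max_memory_reserved(device) / 1024**3
    print(f"GPU {device}: peak allocated {peak_alloc:.2f} GiB, "
          f"peak reserved {peak_reserved:.2f} GiB")
```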
Conclusion
DeepSpeed's configuration is highly flexible, but tuning requires balancing memory efficiency and computational speed. By adjusting parameters like `reduce_bucket_size` and `gradient_accumulation_steps`, and by leveraging ZeRO's offloading capabilities, you can effectively train large models even on memory-constrained GPUs like the NVIDIA 3090.
Postscript
Written in Shanghai at 22:08 on November 27, 2024, with the help of the GPT-4o model.