Chinese Version
This article walks through the DeepSpeed configuration file in detail, using a 4x RTX 3090 setup as the working example. It explains what each key parameter means and provides remedies for out-of-memory (OOM) errors.
DeepSpeed Configuration Files Explained: From Basics to Practice
DeepSpeed is an important tool for accelerating large-scale distributed training, and its flexible configuration file is the key to efficient training. In this post we take a close look at the structure and key parameters of the DeepSpeed configuration file and, using a 4x RTX 3090 training setup as the concrete scenario, discuss how to tune the configuration and resolve OOM errors.
1. Structure of the Configuration File
A DeepSpeed configuration file is usually written in JSON and contains the following core parts:
- bf16/fp16 settings: whether mixed-precision training is enabled.
- ZeRO optimization settings: control the memory-optimization strategy.
- Training-related parameters: for example, batch size and gradient accumulation steps.
Below is a typical example configuration file:
```json
{
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": false,
    "reduce_bucket_size": 5e5,
    "sub_group_size": 5e5
  },
  "gradient_accumulation_steps": 4,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_clipping": 1.0
}
```
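As a rough sketch of how such a file is consumed, the JSON path can be handed to `deepspeed.initialize`; the model, optimizer, and file name below are placeholders, and the script is assumed to be started with the `deepspeed` or `accelerate` launcher:

```python
# Minimal sketch: feeding a JSON config like the one above to DeepSpeed.
# The model and optimizer are stand-ins; a real script would build its own.
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)                       # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-6)

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config="configs/ds_config.json",  # assumed path to a JSON file like the one above
)

# Training steps then go through the returned engine:
#   loss = ...; model_engine.backward(loss); model_engine.step()
```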
2. Key Parameter Breakdown
bf16.enabled
- Meaning: enables BF16 mixed-precision training.
- Effect: noticeably reduces GPU memory usage and speeds up training.
zero_optimization.stage
- Meaning: selects the ZeRO optimization stage.
  - Stage 1: partitions the optimizer states across GPUs.
  - Stage 2: additionally partitions the gradients.
  - Stage 3: additionally partitions the model parameters themselves.
- Recommendation: for 4x RTX 3090, prefer Stage 2, and move to Stage 3 only when the model does not fit with Stage 2.
overlap_comm
- Meaning: overlaps gradient communication with backward computation, reducing communication overhead.
- Recommendation: always enable it in multi-GPU setups.
contiguous_gradients
- Meaning: whether gradients are copied into a contiguous buffer in memory.
- Pro: reduces memory fragmentation and makes gradient reductions more efficient.
- Con: the extra buffer increases GPU memory usage.
- Recommendation: set it to `false` if you are short on memory.
reduce_bucket_size
- Meaning: the size of the buckets used to reduce/all-reduce gradients in one communication call.
- Unit: number of elements (not bytes).
- Default: 5e8 in the DeepSpeed documentation.
- Tuning:
  - If memory is tight, reduce it to 1e5 or 5e5.
  - If communication is clearly the bottleneck, increase it.
sub_group_size
- Meaning: the tile size (in elements) in which parameters are processed by the optimizer; it mainly matters for ZeRO Stage 3 and optimizer offloading.
- Default: 1e9.
- Tuning:
  - Small models: 5e5 or lower.
  - Large models: tune against available memory, typically 1e6 to 1e7.
gradient_accumulation_steps
- Meaning: the number of micro-steps over which gradients are accumulated, reaching a larger effective batch size without increasing per-step memory pressure.
- Recommendation: increase it gradually (e.g., from 4 to 8), keeping in mind that the effective batch size changes with it.
train_micro_batch_size_per_gpu
- Meaning: the micro batch size per GPU.
- Recommendation: reduce it when memory is tight, e.g., from 4 down to 1.
gradient_clipping
- Meaning: clips the gradient norm to prevent exploding gradients.
- Recommended value: 1.0.
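The batch-related parameters above jointly determine the effective global batch size (micro batch size per GPU x accumulation steps x number of GPUs); a quick sketch of the arithmetic for the 4-GPU example config:

```python
# Effective (global) batch size implied by the example config on 4 GPUs.
train_micro_batch_size_per_gpu = 1
gradient_accumulation_steps = 4
num_gpus = 4

effective_batch_size = (
    train_micro_batch_size_per_gpu * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)  # 16
```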
3. Optimization Tips for 4x RTX 3090
- Fixes for running out of GPU memory (a combined example follows at the end of this section):
  - Reduce `reduce_bucket_size` and `sub_group_size`: `"reduce_bucket_size": 1e5, "sub_group_size": 5e5`
  - Lower `train_micro_batch_size_per_gpu`: `"train_micro_batch_size_per_gpu": 1`
  - Increase `gradient_accumulation_steps`: `"gradient_accumulation_steps": 8`
  - Disable `contiguous_gradients`: `"contiguous_gradients": false`
- Check the NCCL environment variables:
  Make sure the following variables are set correctly so that communication problems do not surface as memory or timeout errors.
  ```bash
  export NCCL_BLOCKING_WAIT=1
  export NCCL_ASYNC_ERROR_HANDLING=1
  export NCCL_TIMEOUT=10800
  ```
- Enable CPU offloading (if necessary):
  When GPU memory is severely constrained, part of the optimizer state can be offloaded to the CPU. The block below goes inside `zero_optimization`:
  ```json
  "offload_optimizer": {
    "device": "cpu",
    "pin_memory": true
  }
  ```
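Putting these pieces together, a combined memory-saving configuration for this setup might look like the sketch below, written as a Python dict so it can also be passed directly via the `config` argument of `deepspeed.initialize` instead of a JSON file; the values are illustrative starting points, not tuned results.

```python
# Sketch: one config that combines the memory-saving tweaks above (4x 3090).
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": False,   # trade a little speed for memory
        "reduce_bucket_size": 1e5,       # smaller buckets -> lower peak memory
        "sub_group_size": 5e5,
        "offload_optimizer": {           # note: nested inside zero_optimization
            "device": "cpu",
            "pin_memory": True,
        },
    },
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "gradient_clipping": 1.0,
}
```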
4. Analyzing Results and Monitoring Logs
During training, the following setting produces detailed timing and resource information:
```json
"wall_clock_breakdown": true
```
Combine it with DeepSpeed's logs to track memory usage, communication efficiency, and other key metrics.
With a sensibly tuned DeepSpeed configuration, matched to the available hardware and the task at hand, training efficiency improves noticeably and memory pressure drops.
English Version
This article explains DeepSpeed configuration files, focusing on practical usage with a 4x RTX 3090 setup. It includes a breakdown of key parameters such as `contiguous_gradients`, `reduce_bucket_size`, and `sub_group_size`, as well as solutions for handling out-of-memory (OOM) errors.
DeepSpeed Configuration Files: A Comprehensive Guide
DeepSpeed offers advanced optimization features like ZeRO (Zero Redundancy Optimizer) to enable efficient large-scale model training. This post will delve into configuring DeepSpeed for optimal performance, with examples and tips tailored to a 4x NVIDIA 3090 GPU setup.
1. Key Parameters in a DeepSpeed Configuration File
Below is an example configuration file for ZeRO Stage 2 optimization, designed for fine-tuning large models:
```json
{
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": false,
    "reduce_bucket_size": 5e5,
    "sub_group_size": 5e5
  },
  "gradient_accumulation_steps": 4,
  "train_micro_batch_size_per_gpu": 1,
  "gradient_clipping": 1.0
}
```
Let’s break down the parameters:
(1) zero_optimization.stage
- Defines the ZeRO optimization stage:
  - Stage 2: Partitions optimizer states and gradients across GPUs, reducing memory usage.
  - Stage 3: Additionally partitions the model parameters and, combined with offloading, gives the most aggressive memory savings.
(2) overlap_comm
- Overlaps gradient reduction (communication) with backward computation, improving efficiency during distributed training.
- Recommendation: enable it in multi-GPU setups.
(3) contiguous_gradients
- Default: `true` in recent DeepSpeed releases.
- When `true`, gradients are copied into a single contiguous buffer as they are produced.
  - Benefit: Less memory fragmentation and faster gradient reductions.
  - Drawback: The extra buffer increases memory usage.
  - Recommendation: Set to `false` if facing OOM issues.
(4) reduce_bucket_size
- Defines the size (in elements) of the gradient buckets used for reduce/all-reduce operations.
  - Smaller values (e.g., `5e5`) reduce memory pressure but may slightly slow down training.
  - Larger values improve speed but require more memory.
(5) sub_group_size
- Controls the tile size (in elements) in which parameters are processed during optimizer steps; it mainly matters for ZeRO Stage 3 and offloading.
  - Default: A large value (`1e9`), so most models are handled as a single group.
  - Recommendation: Reduce to `5e5` or lower for better memory efficiency.
(6) gradient_accumulation_steps
- Number of micro-steps over which gradients are accumulated before an optimizer step.
- Higher values effectively increase the batch size without increasing per-GPU memory load.
(7) train_micro_batch_size_per_gpu
- Batch size per GPU per step.
- Recommendation: Start with a small value (e.g., `1`) and scale up gradually.
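Before tuning any of this for OOM errors, it can help to estimate up front how much memory the ZeRO model states will need. DeepSpeed ships memory estimators for this; a small sketch, assuming `transformers` and `deepspeed` are installed and the gated Gemma checkpoint is accessible:

```python
# Sketch: estimate memory needs of ZeRO-3 model states for google/gemma-2-2b
# on one node with 4 GPUs, without launching distributed training.
from transformers import AutoModelForCausalLM
from deepspeed.runtime.zero.stage3 import (
    estimate_zero3_model_states_mem_needs_all_live,
)

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")
estimate_zero3_model_states_mem_needs_all_live(
    model, num_gpus_per_node=4, num_nodes=1
)
```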
2. Handling Out-of-Memory (OOM) Errors
Training large models like Google Gemma-2-2B on GPUs with limited memory (24 GB, such as NVIDIA 3090) often results in OOM errors. Here are optimization strategies:
(1) Reduce train_micro_batch_size_per_gpu
- Start with `1` and only increase if memory allows.
(2) Lower reduce_bucket_size and sub_group_size
- Decrease both to `1e5` or `5e4`. This reduces the memory footprint during gradient reduction at the cost of slightly increased communication overhead.
(3) Enable offload_optimizer or offload_param (for ZeRO Stage 3)
- Offload optimizer states or parameters to CPU if memory remains insufficient.
- Example configuration for optimizer offloading:
```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    }
  }
}
```
(4) Use Gradient Checkpointing
- Recomputes intermediate activations during backpropagation instead of storing them, saving memory. In the DeepSpeed JSON config this is controlled by the `activation_checkpointing` block:
```json
"activation_checkpointing": {
  "partition_activations": true,
  "contiguous_memory_optimization": false
}
```
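When fine-tuning a Hugging Face model, the same effect can also be switched on from the model side; a small sketch, assuming access to the gated Gemma checkpoint:

```python
# Sketch: enable activation (gradient) checkpointing on a Hugging Face model.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")
model.gradient_checkpointing_enable()  # recompute activations in the backward pass
```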
(5) Mixed Precision Training (bf16 or fp16)
- Use `bf16` for better memory efficiency with minimal precision loss.
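If in doubt whether the hardware supports `bf16`, PyTorch offers a quick check; a small sketch (the RTX 3090 is an Ampere card and does support it):

```python
# Sketch: pick bf16 when the GPU supports it, otherwise fall back to fp16.
import torch

if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    precision = "bf16"
else:
    precision = "fp16"
print(precision)
```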
(6) Increase gradient_accumulation_steps
- Accumulate gradients over more steps so that each per-GPU micro-batch can stay small while the effective batch size is preserved.
(7) Reduce max_seq_length
- Shorten sequence length (e.g., 512 or 768 tokens) to decrease memory usage.
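Sequence length is typically capped at tokenization time; a small sketch matching the `--max_seq_length 768` used in the launch command below, assuming access to the Gemma tokenizer:

```python
# Sketch: cap sequences at 768 tokens during tokenization.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")
batch = tokenizer(
    ["example training text"],
    truncation=True,
    max_length=768,          # shorter sequences -> smaller activation memory
    padding="max_length",
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([1, 768])
```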
3. Practical Example: Fine-Tuning on 4x NVIDIA 3090 GPUs
The following `accelerate launch` command illustrates how to combine the above settings for fine-tuning a large model (in a real invocation, the path to the training script goes between the launcher flags and the script arguments such as `--model_name_or_path`):
```bash
accelerate launch \
  --mixed_precision bf16 \
  --num_machines 1 \
  --num_processes 4 \
  --machine_rank 0 \
  --main_process_ip 127.0.0.1 \
  --main_process_port 29400 \
  --use_deepspeed \
  --deepspeed_config_file configs/ds_config.json \
  --model_name_or_path google/gemma-2-2b \
  --tokenizer_name google/gemma-2-2b \
  --max_seq_length 768 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 4 \
  --learning_rate 5e-6 \
  --num_train_epochs 1 \
  --output_dir output/sft_gemma2
```
4. Debugging Tips
- Enable Detailed Logs: Set `"wall_clock_breakdown": true` in the config file to identify bottlenecks.
- NCCL Tuning: Add environment variables to handle communication errors:
```bash
export NCCL_BLOCKING_WAIT=1
export NCCL_ASYNC_ERROR_HANDLING=1
```
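- GPU Memory Counters: In addition to DeepSpeed's own logs, PyTorch's memory statistics help spot which rank is closest to the 24 GB limit; a small sketch to call after a training step:

```python
# Sketch: report this process's peak GPU memory, e.g. after an optimizer step.
import torch

if torch.cuda.is_available():
    device = torch.cuda.current_device()
    peak_alloc = torch.cuda.max_memory_allocated(device) / 1024**3
    peak_reserved = torch.cuda.max_memory_reserved(device) / 1024**3
    print(f"GPU {device}: peak allocated {peak_alloc:.2f} GiB, "
          f"peak reserved {peak_reserved:.2f} GiB")
```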
Conclusion
DeepSpeed's configuration is highly flexible, but tuning requires balancing memory efficiency and computational speed. By adjusting parameters like `reduce_bucket_size` and `gradient_accumulation_steps`, and by leveraging ZeRO's offloading capabilities, you can effectively train large models even on memory-constrained GPUs like the NVIDIA 3090.
Postscript
Written in Shanghai at 22:08 on November 27, 2024, with the help of the GPT-4o model.