perf-cpu-offloading
Validate and use CPU offloading in Megatron Bridge, including layer-level activation offloading and fractional optimizer state offloading with HybridDeviceOptimizer.
SKILL.md preview
---
name: perf-cpu-offloading
description: Validate and use CPU offloading in Megatron Bridge, including layer-level activation offloading and fractional optimizer state offloading with HybridDeviceOptimizer.
when_to_use: Enabling CPU offload to reduce GPU memory, or investigating a commit that changed CPU offloading config and caused OOM or a crash; 'cpu_offloading', 'optimizer_cpu_offload', 'optimizer_offload_fraction', 'HybridDeviceOptimizer', 'move optimizer to CPU'.
---

# CPU Offloading

## References

- Stable docs: @docs/training/cpu-offloading.md
- Structured metadata: @skills/perf-cpu-offloading/card.yaml

## What It Is

Two independent mechanisms move data from GPU memory to CPU memory:

| Mechanism | Config namespace | What gets offloaded | PP restriction |
|---|---|---|---|
| Activation offloading | `model.cpu_offloading*` | Activations (and optionally weights) per transformer layer | PP must be 1 |
| Optimizer offloading | `optimizer.optimizer_cpu_offload` | Adam optimizer states (momentum + variance) via `HybridDeviceOptimizer` | None |

## Quick Decision

| Situation | Recommendation |
|---|---|
| Large MoE model (30B+), needs PP > 1 | Optimizer offloading; activation offloading is ruled out because it requires PP=1 |
| Small/medium model, PP=1 fits, activation memory dominates | Activation offloading |
| Want tunable memory-speed tradeoff | Optimizer offloading with fractional `optimizer_offload_fraction` |
| Throughput is top priority | Don't enable offloading; it always adds overhead |
| CUDA graphs are needed | Optimizer offloading only; activation offloading is incompatible with CUDA graphs |
| Memory pressure is moderate | Optimizer offload at 25–50% fraction for best efficiency |

## Enablement

### Optimizer CPU offloading (recommended for large models)

```python
cfg.optimizer.optimizer_cpu_offload = True          # master switch
cfg.optimizer.optimizer_offload_fraction = 1.0      # fraction of optimizer states kept on CPU
cfg.optimizer.overlap_cpu_optimizer_d2h_h2d = True  # overlap GPU<->CPU transfers with compute
```

CLI overrides:

```bash
optimizer.optimizer_cpu_offload=True \
optimizer.optimizer_offload_fraction=0.5 \
optimizer.overlap_cpu_optimizer_d2h_h2d=True
```

### Activation CPU offloading (small/medium models only)

```python
cfg.model.cpu_offloading = True                  # master switch
cfg.model.cpu_offloading_num_layers = 16         # number of transformer layers to offload
cfg.model.cpu_offloading_activations = True
cfg.model.cpu_offloading_weights = False
cfg.model.pipeline_model_parallel_size = 1       # required: PP must be 1
cfg.model.recompute_granularity = None           # required: no activation recompute
cfg.model.cuda_graph_impl = "none"               # required: no CUDA graphs
```

## Config Parameter Reference

### Optimizer offloading

| Parameter | Default | Description |
|-----------|---------|-------------|
| `optimizer_cpu_offload` | `False` | Master switch |
| `optimizer_offload_fraction` | `0.0` | Fraction of optimizer states on CPU (0.0–1.0) |
| `overlap_cpu_optimizer_d2h_h2d` | `False` | Overlap GPU↔CPU transfers with compute |
| `use_torch_optimizer_for_cpu_offload` | `False` | Use `torch.optim` instead of fused optimizer for CPU portion |

### Activation offloading

| Parameter | Default | Description |
|-----------|---------|-------------|
| `cpu_offloading` | `False` | Master switch |
| `cpu_offloading_num_layers` | `0` | Number of transformer layers to offload (0 to num_layers-1) |
| `cpu_offloading_activations` | `True` | Offload activations |
| `cpu_offloading_weights` | `False` | Offload weights |
| `cpu_offloading_double_buffering` | `False` | Double-buffer across layers while reloading |

## Compatibility And Constraints

### Activation offloading

- `pipeline_model_parallel_size` must be 1
- `recompute_granularity` must be `None`
- Cannot combine with `fine_grained_activation_offloading`
- Cannot combine with CUDA graphs
- `cpu_offloading_num_layers` must be in `[0, num_layers-1)`

### Optimizer offloading

- Requires `use_distributed_optimizer = True` (default in most recipes)
- No PP, recompute, or CUDA graph restrictions
- `optimizer_offload_fraction` must be in `[0.0, 1.0]`

### Practical: large MoE models

Activation offloading is blocked for Qwen3-30B-A3B and similar large MoE models. The PP=1 c …
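
The rules listed under Compatibility And Constraints can be sketched as a standalone pre-flight check. This is only an illustration under stated assumptions: `OffloadSettings` and `check_offload_settings` are hypothetical names, not part of Megatron Bridge, and the sketch treats any `cuda_graph_impl` value other than `"none"` as "CUDA graphs enabled". The field names mirror the config keys above; the library's own validation may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class OffloadSettings:
    """Hypothetical flat view of the settings named in this skill (not a library class)."""
    num_layers: int
    pipeline_model_parallel_size: int = 1
    recompute_granularity: Optional[str] = None
    cuda_graph_impl: str = "none"  # assumption: "none" means CUDA graphs are off
    fine_grained_activation_offloading: bool = False
    cpu_offloading: bool = False
    cpu_offloading_num_layers: int = 0
    optimizer_cpu_offload: bool = False
    optimizer_offload_fraction: float = 0.0
    use_distributed_optimizer: bool = True


def check_offload_settings(s: OffloadSettings) -> List[str]:
    """Return one message per violated constraint; an empty list means no violations."""
    errors: List[str] = []
    if s.cpu_offloading:
        if s.pipeline_model_parallel_size != 1:
            errors.append("activation offloading requires pipeline_model_parallel_size == 1")
        if s.recompute_granularity is not None:
            errors.append("activation offloading requires recompute_granularity to be None")
        if s.fine_grained_activation_offloading:
            errors.append("cannot combine with fine_grained_activation_offloading")
        if s.cuda_graph_impl != "none":
            errors.append("activation offloading cannot be combined with CUDA graphs")
        # Range taken as written in the constraint list: [0, num_layers-1).
        if not (0 <= s.cpu_offloading_num_layers < s.num_layers - 1):
            errors.append("cpu_offloading_num_layers must be in [0, num_layers-1)")
    if s.optimizer_cpu_offload:
        if not s.use_distributed_optimizer:
            errors.append("optimizer offloading requires use_distributed_optimizer = True")
        if not (0.0 <= s.optimizer_offload_fraction <= 1.0):
            errors.append("optimizer_offload_fraction must be in [0.0, 1.0]")
    return errors


if __name__ == "__main__":
    # Activation offloading with PP=2 violates the PP constraint.
    settings = OffloadSettings(
        num_layers=32,
        pipeline_model_parallel_size=2,
        cpu_offloading=True,
        cpu_offloading_num_layers=16,
    )
    for message in check_offload_settings(settings):
        print(message)
```

Returning a list of messages instead of raising on the first failure makes it easier to see every conflicting setting at once, which can help when bisecting a config change that introduced an OOM or a crash.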