perf-activation-recompute

Verified, Curated, NVIDIA
Validate and use selective and full activation recompute in Megatron Bridge to reduce GPU memory usage at the cost of extra compute.

---
name: perf-activation-recompute
description: Validate and use selective and full activation recompute in Megatron Bridge to reduce GPU memory usage at the cost of extra compute.
when_to_use: Reducing GPU memory via activation recompute, or investigating a commit that changed recompute settings and caused OOM or a regression; 'recompute_granularity', 'recompute_num_layers', 'recompute_modules', 'recompute_method', 'selective recompute', 'full recompute', 'activation memory OOM'.
---

# Activation Recompute

Stable docs: @docs/training/activation-recomputation.md
Card: @skills/perf-activation-recompute/card.yaml

## What It Is

Activation recompute trades GPU compute for memory by discarding intermediate
activations during the forward pass and recomputing them during backward.
Megatron Bridge supports two granularities:

| Granularity | What you specify | What gets recomputed | Memory savings | Compute cost |
|---|---|---|---|---|
| `selective` | `recompute_modules` list (e.g. `core_attn`, `mlp`) | specific submodules within each layer | moderate (module-dependent) | low to high |
| `full` | `recompute_num_layers` + `recompute_method` | entire transformer layers (N layers) | strongest | highest |

Note: MCore names these "selective" (submodule-level) vs "full" (layer-level).
"Full" means recomputing full layers, not the full model — you still choose
how many layers via `recompute_num_layers`.
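
The trade-off can be illustrated with a toy example that involves no Megatron code at all. This is a minimal sketch, assuming a chain of scalar `tanh(w * x)` layers; every function name below is hypothetical and chosen only for illustration. One path stores each layer output for backward, the other keeps only the block input and rebuilds the outputs with an extra forward pass.

```python
import math


def forward_store_all(x, weights):
    """Forward pass that keeps every layer output for use in backward."""
    saved = []
    for w in weights:
        x = math.tanh(w * x)
        saved.append(x)  # one stored activation per layer
    return x, saved


def backward(saved, weights, grad_out=1.0):
    """Gradient of the output w.r.t. the input, using stored layer outputs."""
    grad = grad_out
    for y, w in zip(reversed(saved), reversed(weights)):
        grad *= (1.0 - y * y) * w  # d tanh(w*x)/dx = (1 - y^2) * w
    return grad


def forward_checkpointed(x, weights):
    """Forward pass that discards intermediates and keeps only the input."""
    x0 = x
    for w in weights:
        x = math.tanh(w * x)
    return x, [x0]  # a single stored value for the whole block


def backward_recompute(checkpoint, weights, grad_out=1.0):
    """Rebuild the discarded activations, then run the usual backward."""
    _, saved = forward_store_all(checkpoint[0], weights)  # the extra compute
    return backward(saved, weights, grad_out)


weights = [0.5, -1.2, 0.8, 1.5]
out_a, saved = forward_store_all(0.3, weights)
out_b, ckpt = forward_checkpointed(0.3, weights)
grad_a = backward(saved, weights)
grad_b = backward_recompute(ckpt, weights)
```

Both paths produce the same output and gradient; the checkpointed path holds one value instead of four between forward and backward, and pays for it with a second forward pass.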

## Quick Decision

1. **Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` first.** Most
   borderline OOMs are caused by memory fragmentation, not capacity, and
   expandable segments address fragmentation at no compute cost. See
   @skills/perf-memory-tuning/SKILL.md.
2. Start with `recompute_granularity=selective`, `recompute_modules=[core_attn]`
   (often already the default in recipes).
3. Add `layernorm` to `recompute_modules`. The extra compute is nearly free,
   but the memory saved is also negligible, so this only helps in extremely
   borderline cases.
4. Add `mlp` as a last resort. It saves ~3 GB but costs ~16% GPU utilization
   on large dense models (Llama3 70B).
5. Use `recompute_granularity=full` only when selective recompute still does
   not fit.

CPU offloading (`cpu_offloading=True`) is an alternative that avoids recompute
cost entirely, but it is **incompatible with PP > 1**.
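
The ordering above can be written down as data. The following is a minimal sketch, assuming the modules are added cumulatively and that the caller supplies some way to test whether a configuration fits in memory; `RECOMPUTE_LADDER`, `first_fitting`, and the `fits` callback are hypothetical names, and `recompute_num_layers=4` is only a placeholder value.

```python
import os

# Step 1: must be set before the process first initializes CUDA.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

# Steps 2-5, ordered from cheapest to most expensive in extra compute.
# Assumption: each selective step adds to the previous module list.
RECOMPUTE_LADDER = [
    {"recompute_granularity": "selective",
     "recompute_modules": ["core_attn"]},
    {"recompute_granularity": "selective",
     "recompute_modules": ["core_attn", "layernorm"]},
    {"recompute_granularity": "selective",
     "recompute_modules": ["core_attn", "layernorm", "mlp"]},
    {"recompute_granularity": "full",
     "recompute_method": "uniform",
     "recompute_num_layers": 4},  # placeholder layer count
]


def first_fitting(ladder, fits):
    """Return the first set of overrides for which fits(overrides) is true.

    `fits` is a caller-supplied callable (hypothetical), for example one that
    launches a short run and reports whether it completed without OOM.
    """
    for overrides in ladder:
        if fits(overrides):
            return overrides
    return None
```

Each returned dictionary maps directly onto `cfg.model` attributes, so applying it is a loop of `setattr(cfg.model, key, value)` calls.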

## Enablement

### Selective recompute (default for most recipes)

```python
cfg.model.recompute_granularity = "selective"
cfg.model.recompute_modules = ["core_attn"]
```

### Selective recompute with additional modules

```python
cfg.model.recompute_granularity = "selective"
cfg.model.recompute_modules = ["core_attn", "layernorm"]  # or ["mlp"] or ["mlp", "core_attn"]
```

### Full-layer recompute

```python
cfg.model.recompute_granularity = "full"
cfg.model.recompute_method = "uniform"
cfg.model.recompute_num_layers = 4
```

### Available recompute_modules

| Module | What it recomputes | Compute cost | Memory savings |
|---|---|---|---|
| `core_attn` | attention softmax/dropout/QKV dot product | low (Flash Attention already recomputes internally) | moderate |
| `layernorm` | layer normalization | negligible (~0%) | negligible |
| `mlp` | full FFN block | high (~16% on Llama3 70B, FFN hidden size 28672) | ~3 GB |
| `moe` | MoE expert dispatch | varies | varies |
| `moe_act` | MoE activation functions | low | small |
| `shared_experts` | shared expert layers | moderate | moderate |
| `mla_up_proj` | Multi-head Latent Attention (MLA) up projection | moderate | moderate |
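
A small pre-flight check can catch typos in module names and the two missing-field cases listed under Compatibility and Constraints before a job is launched. This is a minimal sketch, assuming the table above is the complete set of module names; `check_recompute` and `KNOWN_MODULES` are hypothetical helpers, not part of Megatron Bridge.

```python
# Assumption: module names are taken from the table above.
KNOWN_MODULES = {
    "core_attn", "layernorm", "mlp", "moe",
    "moe_act", "shared_experts", "mla_up_proj",
}


def check_recompute(granularity, modules=None, method=None, num_layers=None):
    """Return a list of problems; an empty list means the settings are consistent."""
    problems = []
    if granularity == "selective":
        if not modules:
            problems.append("selective requires a non-empty recompute_modules list")
        unknown = sorted(set(modules or []) - KNOWN_MODULES)
        if unknown:
            problems.append(f"unknown recompute_modules: {unknown}")
    elif granularity == "full":
        if method is None:
            problems.append("full requires recompute_method")
        if num_layers is None:
            problems.append("full requires recompute_num_layers")
    elif granularity is not None:
        problems.append(f"unexpected recompute_granularity: {granularity!r}")
    return problems
```

For example, `check_recompute("selective", ["core_attn", "layernorm"])` returns an empty list, while `check_recompute("full")` reports both missing fields.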

### Performance harness CLI

```bash
python scripts/performance/run_performance_workload.py \
  --recompute_granularity selective \
  --recompute_modules core_attn layernorm \
  ...
```

## Compatibility and Constraints

- `recompute_granularity=selective` requires a non-empty `recompute_modules` list
- `recompute_granularity=full` requires `recompute_method` and `recompute_num_layers`
- **Layer-level recompute (`recompute_granularity="full"` +
  `recompute_num_layers`) is incompatible with TE-scoped CUDA graphs.**
  MCore calls this "full" granularity — the name r

…