perf-cuda-graphs


---
name: perf-cuda-graphs
description: Validate and use CUDA graph capture in Megatron Bridge, including local full-iteration graphs and Transformer Engine scoped graphs for attention, MLP, and MoE modules.
when_to_use: Reducing host-driver overhead via CUDA graphs, or tracing a crash or regression to a CUDA graph config change; 'cuda_graph_impl', 'full iteration graph', 'TE scoped graph', 'graphed callables', 'CUDA graph capture'.
---

# CUDA Graphs

Stable docs: @docs/training/cuda-graphs.md
Card: @skills/perf-cuda-graphs/card.yaml

## What It Is

CUDA graphs capture GPU operations once and replay them with minimal
host-driver overhead. Bridge supports two implementations:

| `cuda_graph_impl` | Mechanism | Scope support |
|---|---|---|
| `"local"` | MCore `FullCudaGraphWrapper` wrapping entire fwd+bwd | `full_iteration` |
| `"transformer_engine"` | TE `make_graphed_callables()` per layer | `attn`, `mlp`, `moe`, `moe_router`, `moe_preprocess`, `mamba` |
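
The table can be read as a lookup from implementation to allowed scopes. The sketch below encodes it that way; `SUPPORTED_SCOPES` and `unsupported_scopes` are hypothetical names used for illustration and are not part of Bridge or MCore.

```python
# Illustrative lookup built from the table above; not Bridge's own validation code.
SUPPORTED_SCOPES = {
    "local": {"full_iteration"},
    "transformer_engine": {
        "attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba",
    },
}


def unsupported_scopes(impl: str, scopes: list) -> list:
    """Return the scopes that the table does not list for the given implementation."""
    if impl not in SUPPORTED_SCOPES:
        raise ValueError(f"unknown cuda_graph_impl: {impl!r}")
    return [s for s in scopes if s not in SUPPORTED_SCOPES[impl]]


print(unsupported_scopes("transformer_engine", ["attn", "mlp"]))  # []
print(unsupported_scopes("local", ["attn"]))                      # ['attn']
```

An empty result means every requested scope appears in the table row for that implementation.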

## Quick Decision

Start with TE-scoped graphs for most training workloads, then verify replay
timing against eager on the same dispatcher, layout, and container:

- dense models: `attn`, then optionally `mlp`
- dropless MoE: `attn`, `moe_router`, `moe_preprocess`
- VLMs: the same scope as dropless MoE, but only after the real-data path is stable

Use `local` + `full_iteration` only when you specifically want full-iteration
capture and can satisfy the tighter constraints.

For recompute-heavy workloads:

- TE-scoped graphs pair naturally with selective recompute
- full recompute usually pushes you toward `local` full-iteration graphs or away
  from graphs entirely
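
As a rough sketch, the guidance above can be written as a small helper that returns a first configuration to benchmark. The workload labels, the `recompute` argument, and the function name are assumptions made for this example; the returned values are starting points to time against eager, not guaranteed wins.

```python
# Starting points taken from the lists above; the workload labels are hypothetical.
TE_STARTING_SCOPES = {
    "dense": ["attn"],  # optionally extend with "mlp" afterwards
    "dropless_moe": ["attn", "moe_router", "moe_preprocess"],
    "vlm": ["attn", "moe_router", "moe_preprocess"],  # once the real-data path is stable
}


def suggest_first_config(workload: str, recompute: str = "selective") -> dict:
    """Return a first configuration to benchmark against eager."""
    if recompute == "full":
        # Full recompute usually points to local full-iteration graphs,
        # or to running without graphs at all.
        return {"cuda_graph_impl": "local", "cuda_graph_scope": ["full_iteration"]}
    return {
        "cuda_graph_impl": "transformer_engine",
        "cuda_graph_scope": list(TE_STARTING_SCOPES[workload]),
    }


print(suggest_first_config("dropless_moe"))
```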

Related docs:

- @docs/training/cuda-graphs.md
- @docs/training/activation-recomputation.md

## Enablement

### Local full-iteration graph

```python
cfg.model.cuda_graph_impl = "local"
cfg.model.cuda_graph_scope = ["full_iteration"]  # only valid with the local impl
cfg.model.cuda_graph_warmup_steps = 3
cfg.model.use_te_rng_tracker = True  # required whenever graphs are enabled
cfg.rng.te_rng_tracker = True
cfg.rerun_state_machine.check_for_nan_in_loss = False  # required by full_iteration scope
cfg.ddp.check_for_nan_in_grad = False
```

### TE scoped graph (dense model)

```python
cfg.model.cuda_graph_impl = "transformer_engine"
cfg.model.cuda_graph_scope = ["attn"]           # or ["attn", "mlp"]
cfg.model.cuda_graph_warmup_steps = 3
cfg.model.use_te_rng_tracker = True
cfg.rng.te_rng_tracker = True
```

### TE scoped graph (MoE model)

```python
cfg.model.cuda_graph_impl = "transformer_engine"
cfg.model.cuda_graph_scope = ["attn", "moe_router", "moe_preprocess"]
cfg.model.cuda_graph_warmup_steps = 3
cfg.model.use_te_rng_tracker = True
cfg.rng.te_rng_tracker = True
```

### Performance harness CLI

```bash
uv run python scripts/performance/run_script.py \
  -m qwen \
  -mr qwen3_30b_a3b \
  --task pretrain \
  -g h100 \
  -c bf16 \
  -ng 16 \
  --cuda_graph_impl transformer_engine \
  --cuda_graph_scope attn,moe_router,moe_preprocess \
  ...
```

Valid CLI values live in `scripts/performance/argument_parser.py`:

- `VALID_CUDA_GRAPH_IMPLS`: `["none", "local", "transformer_engine"]`
- `VALID_CUDA_GRAPH_SCOPES`: `["full_iteration", "attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba"]`

The performance harness uses a comma-separated `--cuda_graph_scope` value and
auto-enables `model.use_te_rng_tracker` plus `rng.te_rng_tracker` when
`--cuda_graph_impl` is not `none`.
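
To make the mapping concrete, here is a minimal sketch of how the two CLI strings could translate into config overrides. It is not the harness's actual code: `overrides_from_cli` and the dotted override keys are hypothetical, and only the two value lists are taken from the text above.

```python
VALID_CUDA_GRAPH_IMPLS = ["none", "local", "transformer_engine"]
VALID_CUDA_GRAPH_SCOPES = [
    "full_iteration", "attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba",
]


def overrides_from_cli(cuda_graph_impl: str, cuda_graph_scope: str) -> dict:
    """Hypothetical helper: map the two CLI strings to config overrides."""
    if cuda_graph_impl not in VALID_CUDA_GRAPH_IMPLS:
        raise ValueError(f"invalid --cuda_graph_impl: {cuda_graph_impl!r}")
    # The CLI passes scopes as one comma-separated string.
    scopes = [s.strip() for s in cuda_graph_scope.split(",") if s.strip()]
    invalid = [s for s in scopes if s not in VALID_CUDA_GRAPH_SCOPES]
    if invalid:
        raise ValueError(f"invalid --cuda_graph_scope entries: {invalid}")
    overrides = {
        "model.cuda_graph_impl": cuda_graph_impl,
        "model.cuda_graph_scope": scopes,
    }
    if cuda_graph_impl != "none":
        # Mirrors the auto-enable behaviour described above.
        overrides["model.use_te_rng_tracker"] = True
        overrides["rng.te_rng_tracker"] = True
    return overrides


print(overrides_from_cli("transformer_engine", "attn,moe_router,moe_preprocess"))
```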

### Required constraints

- `use_te_rng_tracker = True` (enforced in `gpt_provider.py`)
- `full_iteration` scope only with `cuda_graph_impl = "local"`
- `full_iteration` scope requires `check_for_nan_in_loss = False`
- Do not combine `moe` scope and `moe_router` scope
- Tensor shapes must be static (fixed seq_length, fixed micro_batch_size)
- MoE token-dropless routing limits graphable scope to dense modules
- With `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, set
  `NCCL_GRAPH_REGISTER=0` (MCore enforces this for the local impl on
  architectures below sm_100; the TE impl asserts it unconditionally)
- CPU offloading is incompatible with CUDA graphs
- `moe_preprocess` scope require
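
Several of the constraints listed so far can be checked before launch. The following is a hedged sketch, not Bridge's or MCore's own validation: the function name and its arguments are hypothetical, it skips the static-shape rule and any constraint not fully stated above, and it flags the allocator/NCCL combination regardless of GPU architecture, which is stricter than the MCore behaviour described.

```python
def check_cuda_graph_config(
    impl: str,
    scopes: list,
    *,
    use_te_rng_tracker: bool,
    check_for_nan_in_loss: bool,
    cpu_offloading: bool = False,
    env: dict = None,
) -> list:
    """Return human-readable problems; an empty list means no listed rule was violated."""
    env = {} if env is None else env
    problems = []
    if impl == "none":
        return problems
    if not use_te_rng_tracker:
        problems.append("use_te_rng_tracker must be True")
    if "full_iteration" in scopes:
        if impl != "local":
            problems.append('full_iteration requires cuda_graph_impl = "local"')
        if check_for_nan_in_loss:
            problems.append("full_iteration requires check_for_nan_in_loss = False")
    if "moe" in scopes and "moe_router" in scopes:
        problems.append("do not combine moe and moe_router scopes")
    if cpu_offloading:
        problems.append("CPU offloading is incompatible with CUDA graphs")
    alloc_conf = env.get("PYTORCH_CUDA_ALLOC_CONF", "")
    if "expandable_segments:True" in alloc_conf and env.get("NCCL_GRAPH_REGISTER") != "0":
        # Stricter than described above: flagged on every architecture.
        problems.append("set NCCL_GRAPH_REGISTER=0 with expandable_segments:True")
    return problems


print(check_cuda_graph_config(
    "transformer_engine", ["attn"],
    use_te_rng_tracker=True, check_for_nan_in_loss=True,
))  # []
```

In practice `env` would typically be `dict(os.environ)`; passing it explicitly keeps the check easy to test.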

…