perf-cuda-graphs
Validate and use CUDA graph capture in Megatron Bridge, including local full-iteration graphs and Transformer Engine scoped graphs for attention, MLP, and MoE modules.
SKILL.md preview
---
name: perf-cuda-graphs
description: Validate and use CUDA graph capture in Megatron Bridge, including local full-iteration graphs and Transformer Engine scoped graphs for attention, MLP, and MoE modules.
when_to_use: Reducing host-driver overhead via CUDA graphs, or tracing a crash or regression to a CUDA graph config change; 'cuda_graph_impl', 'full iteration graph', 'TE scoped graph', 'graphed callables', 'CUDA graph capture'.
---

# CUDA Graphs

Stable docs: @docs/training/cuda-graphs.md
Card: @skills/perf-cuda-graphs/card.yaml

## What It Is

CUDA graphs capture GPU operations once and replay them with minimal host-driver overhead. Bridge supports two implementations:

| `cuda_graph_impl` | Mechanism | Scope support |
|---|---|---|
| `"local"` | MCore `FullCudaGraphWrapper` wrapping entire fwd+bwd | `full_iteration` |
| `"transformer_engine"` | TE `make_graphed_callables()` per layer | `attn`, `mlp`, `moe`, `moe_router`, `moe_preprocess`, `mamba` |

## Quick Decision

Start with TE-scoped graphs for most training workloads, then verify replay timing against eager on the same dispatcher, layout, and container:

- dense models: `attn`, then optionally `mlp`
- dropless MoE: `attn moe_router moe_preprocess`
- VLMs: the same dropless-MoE scope, but only after the real-data path is stable

Use `local` + `full_iteration` only when you specifically want full-iteration capture and can satisfy the tighter constraints.

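
The starting scopes listed above can be written down as a small lookup. This is only a sketch: `suggested_te_scope` and the `model_kind` labels are hypothetical names invented for illustration, not part of Megatron Bridge, and the result is a starting point to benchmark against eager rather than a guaranteed-best setting.

```python
from __future__ import annotations


def suggested_te_scope(model_kind: str, include_mlp: bool = False) -> list[str]:
    """Return a starting `cuda_graph_scope` for TE-scoped graphs.

    `model_kind` is a hypothetical label: "dense", "dropless_moe", or "vlm".
    """
    if model_kind == "dense":
        # Dense models: start with attention, optionally add the MLP.
        return ["attn", "mlp"] if include_mlp else ["attn"]
    if model_kind in ("dropless_moe", "vlm"):
        # VLMs reuse the dropless-MoE scope, but only once the
        # real-data path is stable.
        return ["attn", "moe_router", "moe_preprocess"]
    raise ValueError(f"Unknown model kind: {model_kind!r}")


if __name__ == "__main__":
    print(suggested_te_scope("dense", include_mlp=True))
    print(suggested_te_scope("dropless_moe"))
```

The returned list can be assigned to `cfg.model.cuda_graph_scope` together with `cfg.model.cuda_graph_impl = "transformer_engine"`.
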
For recompute-heavy workloads:

- TE-scoped graphs pair naturally with selective recompute
- full recompute usually pushes you toward `local` full-iteration graphs or away from graphs entirely

Related docs:

- @docs/training/cuda-graphs.md
- @docs/training/activation-recomputation.md

## Enablement

### Local full-iteration graph

```python
cfg.model.cuda_graph_impl = "local"
cfg.model.cuda_graph_scope = ["full_iteration"]
cfg.model.cuda_graph_warmup_steps = 3
cfg.model.use_te_rng_tracker = True
cfg.rng.te_rng_tracker = True
cfg.rerun_state_machine.check_for_nan_in_loss = False
cfg.ddp.check_for_nan_in_grad = False
```

### TE scoped graph (dense model)

```python
cfg.model.cuda_graph_impl = "transformer_engine"
cfg.model.cuda_graph_scope = ["attn"]  # or ["attn", "mlp"]
cfg.model.cuda_graph_warmup_steps = 3
cfg.model.use_te_rng_tracker = True
cfg.rng.te_rng_tracker = True
```

### TE scoped graph (MoE model)

```python
cfg.model.cuda_graph_impl = "transformer_engine"
cfg.model.cuda_graph_scope = ["attn", "moe_router", "moe_preprocess"]
cfg.model.cuda_graph_warmup_steps = 3
cfg.model.use_te_rng_tracker = True
cfg.rng.te_rng_tracker = True
```

### Performance harness CLI

```bash
uv run python scripts/performance/run_script.py \
  -m qwen \
  -mr qwen3_30b_a3b \
  --task pretrain \
  -g h100 \
  -c bf16 \
  -ng 16 \
  --cuda_graph_impl transformer_engine \
  --cuda_graph_scope attn,moe_router,moe_preprocess \
  ...
```

Valid CLI values live in `scripts/performance/argument_parser.py`:

- `VALID_CUDA_GRAPH_IMPLS`: `["none", "local", "transformer_engine"]`
- `VALID_CUDA_GRAPH_SCOPES`: `["full_iteration", "attn", "mlp", "moe", "moe_router", "moe_preprocess", "mamba"]`

The performance harness uses a comma-separated `--cuda_graph_scope` value and auto-enables `model.use_te_rng_tracker` plus `rng.te_rng_tracker` when `--cuda_graph_impl` is not `none`.

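
The harness behavior described above (comma-separated scope parsing, validation against the two lists, and auto-enabling the TE RNG tracker) can be sketched roughly as follows. The helper names `parse_cuda_graph_scope` and `cuda_graph_overrides` and the flat override-dict layout are assumptions made for illustration; this is not the actual code in `scripts/performance/argument_parser.py`.

```python
from __future__ import annotations

# Values copied from the lists quoted above.
VALID_CUDA_GRAPH_IMPLS = ["none", "local", "transformer_engine"]
VALID_CUDA_GRAPH_SCOPES = [
    "full_iteration", "attn", "mlp", "moe",
    "moe_router", "moe_preprocess", "mamba",
]


def parse_cuda_graph_scope(value: str) -> list[str]:
    """Split a comma-separated scope string and reject unknown scopes."""
    scopes = [s.strip() for s in value.split(",") if s.strip()]
    unknown = [s for s in scopes if s not in VALID_CUDA_GRAPH_SCOPES]
    if unknown:
        raise ValueError(f"Unknown CUDA graph scope(s): {unknown}")
    return scopes


def cuda_graph_overrides(impl: str, scope: str = "") -> dict[str, object]:
    """Build hypothetical config overrides for a given impl and scope string."""
    if impl not in VALID_CUDA_GRAPH_IMPLS:
        raise ValueError(f"Unknown cuda_graph_impl: {impl!r}")
    overrides: dict[str, object] = {"model.cuda_graph_impl": impl}
    if impl != "none":
        overrides["model.cuda_graph_scope"] = parse_cuda_graph_scope(scope)
        # Both RNG-tracker flags are switched on whenever graphs are enabled.
        overrides["model.use_te_rng_tracker"] = True
        overrides["rng.te_rng_tracker"] = True
    return overrides


if __name__ == "__main__":
    print(cuda_graph_overrides("transformer_engine", "attn,moe_router,moe_preprocess"))
```

Failing fast on an unknown scope, rather than silently dropping it, keeps a typo such as `moe-router` from quietly disabling part of the intended capture.
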
### Required constraints

- `use_te_rng_tracker = True` (enforced in `gpt_provider.py`)
- `full_iteration` scope only with `cuda_graph_impl = "local"`
- `full_iteration` scope requires `check_for_nan_in_loss = False`
- Do not combine `moe` scope and `moe_router` scope
- Tensor shapes must be static (fixed seq_length, fixed micro_batch_size)
- MoE token-dropless routing limits graphable scope to dense modules
- With `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`, set `NCCL_GRAPH_REGISTER=0` (MCore enforces for local impl on arch < sm_100; TE impl asserts unconditionally)
- CPU offloading is incompatible with CUDA graphs
- `moe_preprocess` scope require …