perf-expert-parallel-overlap
Validate and use MoE expert-parallel communication overlap in Megatron-Bridge, including overlap_moe_expert_parallel_comm, delay_wgrad_compute, and flex dispatcher backends such as DeepEP and HybridEP.
SKILL.md preview
---
name: perf-expert-parallel-overlap
description: Validate and use MoE expert-parallel communication overlap in Megatron-Bridge, including overlap_moe_expert_parallel_comm, delay_wgrad_compute, and flex dispatcher backends such as DeepEP and HybridEP.
when_to_use: Enabling EP overlap to hide dispatch/combine latency, or tracing a throughput regression to an EP overlap config change; 'overlap_moe_expert_parallel_comm', 'delay_wgrad_compute', 'flex dispatcher', 'DeepEP overlap', 'HybridEP overlap'.
---

# MoE Expert-Parallel Overlap Skill

Stable docs: @docs/training/communication-overlap.md
Card: @skills/perf-expert-parallel-overlap/card.yaml

## References

- Stable docs: @docs/training/communication-overlap.md
- Structured metadata: @skills/perf-expert-parallel-overlap/card.yaml

## What It Is

Expert-parallel (EP) overlap hides the cost of the token dispatch/combine all-to-all communication by running it concurrently with expert FFN compute. Optionally, delayed expert weight-gradient computation (`delay_wgrad_compute`) adds further overlap by deferring wgrad so that it runs concurrently with the next layer's forward.

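To make the intuition concrete, here is a minimal sketch of an idealized timing model. It is not part of Megatron-Bridge: the function names and the millisecond figures are hypothetical, and it assumes perfect overlap, so the result is an upper bound rather than a prediction.

```python
def layer_time_ms(compute_ms: float, comm_ms: float, overlap: bool) -> float:
    """Idealized per-layer time (hypothetical model, not a measurement).

    Without overlap, all-to-all communication and expert compute run back to
    back. With perfect overlap, the shorter of the two is fully hidden.
    """
    if overlap:
        return max(compute_ms, comm_ms)
    return compute_ms + comm_ms


def ideal_speedup(compute_ms: float, comm_ms: float) -> float:
    """Upper bound on the speedup that overlap could give under this model."""
    serial = layer_time_ms(compute_ms, comm_ms, overlap=False)
    overlapped = layer_time_ms(compute_ms, comm_ms, overlap=True)
    return serial / overlapped


# Hypothetical profiles: one communication-heavy, one communication-light.
comm_heavy = ideal_speedup(compute_ms=6.0, comm_ms=4.0)
comm_light = ideal_speedup(compute_ms=9.5, comm_ms=0.5)

print(f"comm-heavy upper bound: {comm_heavy:.2f}x")
print(f"comm-light upper bound: {comm_light:.2f}x")
```

Under this model the bound depends only on how large the communication share is relative to compute. That is why profiling the dispatch/combine time first is a reasonable way to judge whether overlap is worth enabling.
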
Bridge supports two dispatcher paths:

| Dispatcher | Backend | When to use |
|---|---|---|
| `alltoall` | Standard MoE all-to-all | Default, broadest compatibility |
| `flex` | DeepEP or HybridEP | Higher overlap on Ampere/Hopper/Blackwell |

## Quick Decision

Use EP overlap when:

- the model is MoE with `EP > 1`
- expert dispatch/combine communication is a meaningful part of step time
- you have memory headroom and are tuning for throughput

Prefer:

- `alltoall` dispatcher for the first rollout (broader compatibility)
- `flex` + DeepEP/HybridEP when running on supported GPUs and seeking additional gains

Avoid EP overlap when:

- full activation recompute is enabled
- `moe_shared_expert_overlap` is enabled
- the run is still being brought up for correctness
- PyTorch < 2.6.0

Expected outcome:

- if all-to-all dispatch is a clear profile bottleneck, overlap can produce a modest to meaningful speedup
- if the run is tiny, communication-light, or dominated by another wall, the gain may be negligible

## Enablement

### alltoall dispatcher

```python
cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False
cfg.model.expert_model_parallel_size = 8
cfg.model.num_moe_experts = 64
cfg.model.moe_token_dispatcher_type = "alltoall"
cfg.model.bf16 = True
cfg.model.fp16 = False
```

### flex dispatcher (DeepEP or HybridEP)

```python
from megatron.bridge.training.flex_dispatcher_backend import apply_flex_dispatcher_backend

cfg.comm_overlap.overlap_moe_expert_parallel_comm = True
cfg.comm_overlap.delay_wgrad_compute = True
cfg.model.moe_shared_expert_overlap = False
apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="deepep")
# or: apply_flex_dispatcher_backend(cfg.model, moe_flex_dispatcher_backend="hybridep")
```

## Compatibility And Constraints

- `expert_model_parallel_size > 1`
- `num_moe_experts > 1`
- `moe_token_dispatcher_type` must be `"alltoall"` or `"flex"`
- `moe_shared_expert_overlap = False`
- Base precision is BF16 or FP16
- PyTorch `>= 2.6.0`
- If `PP > 1`, `virtual_pipeline_model_parallel_size` must be set
- `recompute_granularity != "full"`, `recompute_method = None`, `recompute_num_layers = None`
- `mtp_num_layers` must be `None` or `1`
- `delay_wgrad_compute` requires `overlap_moe_expert_parallel_comm` as a prerequisite
- `delay_wgrad_compute` with `overlap_grad_reduce` requires TE >= 2.7.0
- `delay_wgrad_compute` with `gradient_accumulation_fusion` requires TE >= 2.7.0
- CUDA graph `attn` scope + `delay_wgrad_compute` requires TE >= 2.12.0, `gradient_accumulation_fusion = True`, and no attention bias
- DeepEP: Ampere, Hopper, B200, B300 GPUs only
- HybridEP: Ampere, Hopper, B200, B300, GB200/GB300 with NVL72

## Minimal Working Config

```python
cfg.comm_overlap.overlap_moe_expert_parallel_c …