perf-moe-dispatcher-selection
Choose the right MoE token dispatcher (`alltoall`, DeepEP, or HybridEP) for the hardware, EP degree, and optimization stage. Summarizes patterns from DSV3, Qwen3, Qwen3-Next, and VLM bring-up work.
SKILL.md preview
---
name: perf-moe-dispatcher-selection
description: Choose the right MoE token dispatcher (`alltoall`, DeepEP, or HybridEP) for the hardware, EP degree, and optimization stage. Summarizes patterns from DSV3, Qwen3, Qwen3-Next, and VLM bring-up work.
when_to_use: Choosing a MoE token dispatcher, or tracing a MoE regression or crash to a dispatcher config change; 'which dispatcher', 'alltoall vs DeepEP', 'HybridEP', 'MoE dispatcher', 'flex backend', 'EP dispatcher selection'.
---

# MoE Dispatcher Selection Guide

Stable docs: @docs/training/moe-optimization.md
Card: @skills/perf-moe-dispatcher-selection/card.yaml

## Quick Decision

### By hardware

| Hardware | First choice | Why |
|---|---|---|
| H100 | DeepEP, if the runtime package is installed | Strong default for cross-node EP on Hopper |
| B200 | DeepEP, if the runtime package is installed | Good first choice unless a platform-specific HybridEP path is available |
| GB200 / GB300 NVL72 | HybridEP, if the runtime package is installed | Best fit for NVLink-domain-aware dispatch and lower memory pressure |
| Unknown or first bring-up | `alltoall` | Easiest path for correctness and debugging |

### By EP degree

| EP size | Guidance |
|---|---|
| Small EP | Dispatcher choice is usually second-order; start with `alltoall` or DeepEP |
| Medium EP | DeepEP often becomes worthwhile |
| Large EP | HybridEP is usually the best target on NVL72 systems |

## Model-Family Patterns

| Workload | Common best path | Notes |
|---|---|---|
| DSV3 at large scale | HybridEP on GB200 or GB300, DeepEP on H100 | Dispatcher choice matters more as EP and PP both grow |
| Qwen3 235B | DeepEP on H100, HybridEP on GB200 | HybridEP usually wins on GB200 and often uses less memory |
| Qwen3 30B | DeepEP | Smaller models still benefit, but the absolute gap is smaller |
| Qwen3-Next | Close race in BF16, HybridEP stronger in FP8 or memory-tight runs | Good reminder to test, not assume |
| MoE VLMs | Start simple, then test HybridEP on GB200-class systems | Vision workloads are sensitive to both memory and host overhead |

## Rounded Evidence Summary

### Backend availability gate

Do not interpret a dispatcher timing until the container has proven that the selected backend package is available. `--moe_flex_dispatcher_backend None` selects the standard `alltoall` dispatcher, while `deepep` and `hybridep` select `moe_token_dispatcher_type="flex"` and then require their corresponding runtime packages at model construction time. If DeepEP or HybridEP is missing, record the import failure as an environment limitation and treat `alltoall` as the only measured correctness fallback for that run.

### Qwen3 30B A3B on H100

A short 2026-05-17 H100 smoke run used Qwen3 30B A3B BF16, 16 GPUs, EP=16, the recipe's Transformer Engine CUDA graph scopes (`moe_router`, `moe_preprocess`), and `model.moe_permute_fusion=false` due to a Triton JIT compatibility issue in the run container. The `alltoall` fallback completed five steps with 45.65 s mean step time after warmup, 132.9 mean TFLOP/s/GPU after warmup, final loss 11.44050, and 61.351 GB peak max allocated memory. DeepEP and HybridEP selected the requested flex backend in the dumped configs but failed before the first iteration because the packages were not installed. This confirms the availability gate; it is not a throughput ranking for flex dispatchers on H100.

### DSV3 on GB200 or GB300

The broad trend is more important than any single row in the tracker:

- plain `alltoall` is usually the conservative baseline
- DeepEP improves that baseline once EP communication becomes visible
- HybridEP adds another step up on NVL72 systems, especially after CUDA graphs, routing improvements, and CPU-side cleanup are already in place

In practice, the stack often moves from roughly "low-teens MFU" territory with an untuned baseline into "high-teens to low-20s MFU" territory after the full dispatcher and kernel stack is tuned.

### Qwen3 235B on GB200

For Qwen3 235B, th …