perf-moe-hardware-configs
General↓ 0 installsUpdated 19d ago
Representative MoE training playbooks by hardware platform and model family. Summarizes rounded throughput bands, parallelism patterns, and common tuning stacks.
SKILL.md preview
--- name: perf-moe-hardware-configs description: Representative MoE training playbooks by hardware platform and model family. Summarizes rounded throughput bands, parallelism patterns, and common tuning stacks. when_to_use: Hardware-specific MoE playbooks or throughput estimates; 'MoE on H100', 'GB200 config', 'expected throughput', 'MoE hardware playbook', 'parallelism for B200'. --- # MoE Hardware Configuration Reference Stable docs: @docs/training/moe-optimization.md Card: @skills/perf-moe-hardware-configs/card.yaml ## Quick Platform Playbook | Platform | Typical MoE strategy | What usually matters most | |---|---|---| | H100 | DeepEP + stronger PP + moderate TP | communication overlap and PP efficiency | | B200 | DeepEP + MXFP8 + careful PP layout | container quality and tuned comm settings | | GB200 | HybridEP + partial CUDA graphs + CPU cleanup | host overhead, topology-aware dispatch, memory headroom | | GB300 | HybridEP + newer FP8 and kernel stack | same GB200 playbook, usually with a higher ceiling | ## Rounded Performance Bands These are intentionally rounded so the document stays durable as the tracker moves. Treat them as planning ranges, not exact promises. | Workload family | Hardware | Typical band | Representative shape | |---|---|---|---| | DSV3, large-scale | H100 | low-to-mid hundreds TFLOPS/GPU, high-teens MFU | TP2, EP64, PP8, DeepEP | | DSV3, large-scale | B200 | high-hundreds TFLOPS/GPU, mid-teens MFU | TP1, EP32, PP8, DeepEP | | DSV3, large-scale | GB200 | around 1K TFLOPS/GPU, low-20s MFU | TP1, EP64, PP4, HybridEP | | DSV3, large-scale | GB300 | above the GB200 band, often mid-20s MFU | TP1, EP64, PP4, HybridEP | | Qwen3 235B | H100 | low-300s TFLOPS/GPU, around 30% MFU | TP2, EP32, PP8, DeepEP | | Qwen3 235B | GB200 | high-hundreds TFLOPS/GPU in tuned runs | TP1 or TP2, EP32-64, PP4, HybridEP | | Qwen3 30B | H100 | low-200s TFLOPS/GPU | TP1, EP8, PP1, DeepEP | | Qwen3-Next 80B | GB200 | low-300s TFLOPS/GPU in BF16-class runs | TP1, EP32, PP2, HybridEP | ## Representative Config Families ### DSV3 on H100 ```text Dispatcher: DeepEP TP=2 EP=64 PP=8 VPP=4 Routing: force balance Recompute: light-to-moderate selective recompute Priority: overlap communication and keep PP efficient ``` ### DSV3 on B200 ```text Dispatcher: DeepEP TP=1 EP=32 PP=8 VPP=2 or similar Precision: MXFP8-class Recompute: selective recompute around MLA up-projection and MLP-side modules Priority: container quality, PP layout, and DeepEP SMS tuning ``` ### DSV3 on GB200 or GB300 ```text Dispatcher: HybridEP TP=1 EP=64 PP=4 VPP=4 Precision: MXFP8-class CUDA Graph: attn + moe_router + moe_preprocess Priority: HybridEP, CPU optimization, and graph-friendly static shapes ``` ### Qwen3 235B on H100 ```text Dispatcher: DeepEP TP=2 EP=32 PP=8 VPP=4 Recompute: norm and activation-side selective recompute Priority: communication overlap and router-path cleanup ``` ### Qwen3 235B on GB200 ```text Dispatcher: HybridEP TP=1 or 2 EP=32 to 64 PP=4 CUDA Graph: attn + moe_router + moe_preprocess Recompute: moe_act, mlp, or norm depending on memory pressure Priority: balance throughput against memory headroom ``` ### Qwen3-Next 80B on GB200 ```text Dispatcher: HybridEP TP=1 EP=32 PP=2 VPP around 4 CUDA Graph: attn + moe_router + moe_preprocess Priority: pipeline layout and grouped GEMM quality ``` ## Cross-Cutting Patterns ### PP layout - `E` = embedding - `t` = transformer - `m` = MTP - `L` = loss - `|` = stage boundary The biggest platform difference is usually not just the dispatcher. It is the combination of dispatcher, PP shape, and whether VPP keeps each stage balanced. ### Recompute strategy | Memory pressure | Starting point | |---|---| | low | none or a very narrow selective set | | moderate | `moe_act`, `mlp`, `norm`, or similar selective modules | | high | model-specific up-projection plus selective MoE and MLP modules | | extreme or long-context | full recompute only if the selective path still do …