perf-parallelism-strategies
Operational guide for choosing and combining parallelism strategies in Megatron Bridge, including sizing rules, hardware topology mapping, and combined parallelism configuration.
SKILL.md preview
---
name: perf-parallelism-strategies
description: Operational guide for choosing and combining parallelism strategies in Megatron Bridge, including sizing rules, hardware topology mapping, and combined parallelism configuration.
when_to_use: Choosing or sizing TP/DP/PP/CP/EP degrees, or tracing an OOM or regression to a parallelism config change; 'how to parallelize', 'tensor parallel', 'pipeline parallel', 'parallelism config', 'which parallelism for X GPUs'.
---

# Parallelism Strategy Selection Skill

For stable background on each parallelism type, see:

- @docs/parallelisms.md
- @skills/perf-parallelism-strategies/card.yaml

## Decision by Model Size

### Dense models

| Model size | GPUs | Recommended starting point |
|---|---|---|
| < 1B | 1-8 | DP only |
| 1-10B | 8-16 | TP=2-4 + DP |
| 10-70B | 16-64 | TP=4-8 + PP=2-4 + DP |
| 70-175B | 64-256 | TP=8 + PP=4-8 + DP |
| 175-500B | 256-1024 | TP=8 + PP=8-16 + CP=2 + DP |

### MoE models

MoE parallelism differs from dense models. Because only a fraction of parameters are active per token, TP can often stay at 1 or 2: the active parameter shard already fits on a single GPU. EP is the primary scaling dimension, with PP handling cross-node layer distribution.

| Model (total / active) | TP | PP | EP | Notes |
|---|---|---|---|---|
| OLMoE 7B / 1B | 1 | 1 | 8 | EP only, fits single node |
| Moonlight 16B / 3B | 2 | 1 | 8 | small TP for shared layers |
| DeepSeek-V2 236B / 21B | 1 | 4 | 32 | no TP at all |
| GLM-4.5 Air 106B / 12B | 1 | 4 | 8 | no TP at all |
| Qwen3 30B-A3B | 4 | 2 | 4 | |
| GLM-4.5 355B / 32B | 2 | 8 | 16 | |
| Qwen3 235B-A22B | 4 | 16 | 8 | CP=2 for pretrain |
| DeepSeek-V3 671B / 37B | 2 | 16 | 64 | TP=2, not 8 |
| Kimi-K2 1T | 2 | 16 | 32 | |

Key patterns:

- TP is sized by **active** params, not total params. A 671B MoE with 37B active needs far less TP than a 70B dense model.
- EP scales with expert count. Common: EP = num_experts or num_experts / experts_per_gpu.
- PP handles depth. Large MoE models use PP=8-16 across nodes.
- ETP (expert tensor parallelism) is rarely used. Llama 4 is an exception (ETP=4).

These are starting points, not hard rules. Always profile the first iteration to verify memory and communication.

## Decision by Hardware Topology

Single node with NVLink:

```python
cfg.model.tensor_model_parallel_size = 8
```

Multiple nodes with InfiniBand:

```python
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = N
```

Limited network (Ethernet):

```python
cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = M
```

The stable rule is: keep TP within a single NVLink domain. Use PP or DP for cross-node scaling. TP across nodes is almost always a performance loss.

## Decision by Sequence Length

| Sequence length | Recommendation |
|---|---|
| < 2K | standard TP + PP + DP |
| 2K-8K | add SP (`sequence_parallel=True`) |
| 8K-32K | add CP=2 |
| 32K+ | add CP=4-8, consider `a2a+p2p` for large CP |

## Combined Parallelism Enablement

3D parallelism (TP + PP + DP):

```python
cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = 4
cfg.model.sequence_parallel = True
```

4D parallelism (TP + PP + CP + DP):

```python
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = 8
cfg.model.context_parallel_size = 2
cfg.model.sequence_parallel = True
```

MoE with EP + PP (e.g. DeepSeek-V2 236B on 128 GPUs):

```python
cfg.model.tensor_model_parallel_size = 1
cfg.model.pipeline_model_parallel_size = 4
cfg.model.expert_model_parallel_size = 32
cfg.model.sequence_parallel = False
```

MoE with small TP + PP + EP (e.g. DeepSeek-V3 671B on 256 GPUs):

```python
cfg.model.tensor_model_parallel_size = 2
cfg.model.pipeline_model_parallel_size = 16
cfg.model.expert_model_parallel_size = 64
cfg.model.sequence_parallel = True
```

DP size is always implicit:

```
data_parallel_size = world_size / (TP * PP * CP)  # dense path
expert_da …