perf-parallelism-strategies


---
name: perf-parallelism-strategies
description: Operational guide for choosing and combining parallelism strategies in Megatron Bridge, including sizing rules, hardware topology mapping, and combined parallelism configuration.
when_to_use: Choosing or sizing TP/DP/PP/CP/EP degrees, or tracing an OOM or regression to a parallelism config change; 'how to parallelize', 'tensor parallel', 'pipeline parallel', 'parallelism config', 'which parallelism for X GPUs'.
---

# Parallelism Strategy Selection Skill

For stable background on each parallelism type, see:

- @docs/parallelisms.md
- @skills/perf-parallelism-strategies/card.yaml

## Decision by Model Size

### Dense models

| Model size | GPUs | Recommended starting point |
|---|---|---|
| < 1B | 1-8 | DP only |
| 1-10B | 8-16 | TP=2-4 + DP |
| 10-70B | 16-64 | TP=4-8 + PP=2-4 + DP |
| 70-175B | 64-256 | TP=8 + PP=4-8 + DP |
| 175-500B | 256-1024 | TP=8 + PP=8-16 + CP=2 + DP |
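
The table can be encoded as a small lookup, for example to pick a starting point in a launch script. This is only a sketch: `DENSE_TABLE` and `dense_starting_point` are hypothetical names, not part of Megatron Bridge, and a size that sits exactly on a row boundary is assigned to the smaller bucket by assumption.

```python
# Hypothetical helper: rows copied from the dense-model table above.
# Each entry is (inclusive upper bound in billions of params, starting point).
DENSE_TABLE = [
    (1, "DP only"),
    (10, "TP=2-4 + DP"),
    (70, "TP=4-8 + PP=2-4 + DP"),
    (175, "TP=8 + PP=4-8 + DP"),
    (500, "TP=8 + PP=8-16 + CP=2 + DP"),
]


def dense_starting_point(params_billion: float) -> str:
    """Return the table's starting recommendation for a dense model size."""
    if params_billion <= 0:
        raise ValueError("model size must be positive")
    for upper_bound, recommendation in DENSE_TABLE:
        if params_billion <= upper_bound:
            return recommendation
    raise ValueError("model size is outside the range covered by the table")


print(dense_starting_point(8))    # falls in the 1-10B row
print(dense_starting_point(120))  # falls in the 70-175B row
```

The result is a starting point to profile, not a final configuration.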

### MoE models

MoE parallelism differs from dense-model parallelism. Because only a
fraction of parameters are active per token, TP can often stay at 1 or 2:
the active parameter shard already fits on a single GPU. EP is the primary
scaling dimension, with PP handling cross-node layer distribution.

| Model (total / active) | TP | PP | EP | Notes |
|---|---|---|---|---|
| OLMoE 7B / 1B | 1 | 1 | 8 | EP only, fits single node |
| Moonlight 16B / 3B | 2 | 1 | 8 | small TP for shared layers |
| DeepSeek-V2 236B / 21B | 1 | 4 | 32 | no TP at all |
| GLM-4.5 Air 106B / 12B | 1 | 4 | 8 | no TP at all |
| Qwen3 30B-A3B | 4 | 2 | 4 | |
| GLM-4.5 355B / 32B | 2 | 8 | 16 | |
| Qwen3 235B-A22B | 4 | 16 | 8 | CP=2 for pretrain |
| DeepSeek-V3 671B / 37B | 2 | 16 | 64 | TP=2, not 8 |
| Kimi-K2 1T | 2 | 16 | 32 | |

Key patterns:

- TP is sized by **active** params, not total params. A 671B MoE with
  37B active needs far less TP than a 70B dense model.
- EP scales with expert count. Common choices: EP = num_experts (one
  expert per GPU) or EP = num_experts / experts_per_gpu.
- PP handles depth. Large MoE models use PP=8-16 across nodes.
- ETP (expert tensor parallelism) is rarely used. Llama 4 is an
  exception (ETP=4).
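
The EP sizing pattern in the list above reduces to one division. A minimal sketch, assuming experts are spread evenly across EP ranks; `expert_parallel_size` is a hypothetical helper, not a Megatron Bridge function, and the expert counts are illustrative.

```python
def expert_parallel_size(num_experts: int, experts_per_gpu: int = 1) -> int:
    """EP degree such that each EP rank holds `experts_per_gpu` experts."""
    if experts_per_gpu < 1 or num_experts < 1:
        raise ValueError("num_experts and experts_per_gpu must be positive")
    if num_experts % experts_per_gpu != 0:
        raise ValueError("num_experts must be a multiple of experts_per_gpu")
    return num_experts // experts_per_gpu


# Illustrative numbers only: 64 experts, 8 experts per GPU.
print(expert_parallel_size(64, experts_per_gpu=8))  # 8
print(expert_parallel_size(64))                     # 64, one expert per GPU
```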

These are starting points, not hard rules. Always profile the first
iteration to verify memory and communication.

## Decision by Hardware Topology

Single node with NVLink:

```python
cfg.model.tensor_model_parallel_size = 8
```

Multiple nodes with InfiniBand:

```python
cfg.model.tensor_model_parallel_size = 8      # TP stays inside one NVLink node
cfg.model.pipeline_model_parallel_size = N    # N is a placeholder for the PP degree across nodes
```

Limited network (Ethernet):

```python
cfg.model.tensor_model_parallel_size = 4      # TP stays inside one node
cfg.model.pipeline_model_parallel_size = M    # M is a placeholder for the PP degree across nodes
```

The stable rule is: keep TP within a single NVLink domain. Use PP or DP
for cross-node scaling. TP across nodes is almost always a performance
loss.
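
The rule can be checked before launch. The sketch below assumes 8 GPUs per NVLink domain and consecutive rank placement, so that TP groups are aligned blocks of `tp` ranks; `tp_stays_in_nvlink_domain` is a hypothetical helper, not a Megatron Bridge API.

```python
def tp_stays_in_nvlink_domain(tp: int, gpus_per_domain: int = 8) -> bool:
    """True if every TP group fits inside a single NVLink domain.

    Assumes ranks are placed consecutively, so a TP group is an aligned block
    of `tp` ranks; it stays in one domain only if `tp` divides the domain size.
    """
    if tp < 1 or gpus_per_domain < 1:
        raise ValueError("tp and gpus_per_domain must be positive")
    return tp <= gpus_per_domain and gpus_per_domain % tp == 0


print(tp_stays_in_nvlink_domain(8))    # True: one TP group per 8-GPU node
print(tp_stays_in_nvlink_domain(16))   # False: the group would span two nodes
```

Systems with larger NVLink domains can pass a different `gpus_per_domain`.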

## Decision by Sequence Length

| Sequence length | Recommendation |
|---|---|
| < 2K | standard TP + PP + DP |
| 2K-8K | add SP (`sequence_parallel=True`) |
| 8K-32K | add CP=2 |
| 32K+ | add CP=4-8, consider `cp_comm_type="a2a+p2p"` for large CP |
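
As a rough illustration, the table maps to a threshold function. This sketch assumes K means 1024 tokens and assigns a boundary length to the longer-sequence row; `sequence_length_recommendation` is a hypothetical name, not a Megatron Bridge function.

```python
K = 1024  # assumption: "K" in the table means 1024 tokens


def sequence_length_recommendation(seq_len: int) -> str:
    """Return the table's recommendation for a given sequence length."""
    if seq_len < 1:
        raise ValueError("seq_len must be positive")
    if seq_len < 2 * K:
        return "standard TP + PP + DP"
    if seq_len < 8 * K:
        return "add SP (sequence_parallel=True)"
    if seq_len < 32 * K:
        return "add CP=2"
    return "add CP=4-8, consider a2a+p2p for large CP"


print(sequence_length_recommendation(4096))   # SP row
print(sequence_length_recommendation(65536))  # 32K+ row
```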

## Combined Parallelism Enablement

3D parallelism (TP + PP + DP):

```python
cfg.model.tensor_model_parallel_size = 4
cfg.model.pipeline_model_parallel_size = 4
cfg.model.sequence_parallel = True
```

4D parallelism (TP + PP + CP + DP):

```python
cfg.model.tensor_model_parallel_size = 8
cfg.model.pipeline_model_parallel_size = 8
cfg.model.context_parallel_size = 2
cfg.model.sequence_parallel = True
```
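
For a given world size, the DP degree implied by these settings is whatever remains after TP, PP, and CP are factored out. A small sketch; `implied_data_parallel_size` is a hypothetical helper, and the 64-GPU and 512-GPU world sizes are illustrative assumptions.

```python
def implied_data_parallel_size(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """DP degree left over after TP, PP, and CP on the dense path."""
    model_parallel = tp * pp * cp
    if model_parallel < 1 or world_size < 1:
        raise ValueError("all sizes must be positive")
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by TP*PP*CP={model_parallel}"
        )
    return world_size // model_parallel


# 3D example above (TP=4, PP=4) on an assumed 64 GPUs.
print(implied_data_parallel_size(64, tp=4, pp=4))         # 4
# 4D example above (TP=8, PP=8, CP=2) on an assumed 512 GPUs.
print(implied_data_parallel_size(512, tp=8, pp=8, cp=2))  # 4
```

A world size that is not a multiple of TP * PP * CP raises an error here instead of silently rounding.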

MoE with EP + PP (e.g. DeepSeek-V2 236B on 128 GPUs):

```python
cfg.model.tensor_model_parallel_size = 1
cfg.model.pipeline_model_parallel_size = 4
cfg.model.expert_model_parallel_size = 32
cfg.model.sequence_parallel = False  # SP only applies when TP > 1
```
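
A quick feasibility check for this kind of layout: each pipeline stage needs its own set of EP ranks, so the world size must be a multiple of ETP * EP * PP. The sketch below rests on that assumption about how the expert rank grid is factored; `expert_dp_for_layout` is a hypothetical helper, not a Megatron Bridge function, and ETP=1 is assumed.

```python
def expert_dp_for_layout(world_size: int, pp: int, ep: int, etp: int = 1) -> int:
    """Expert data-parallel degree left after ETP, EP, and PP.

    Assumption: the expert layers factor the world size as
    ETP * EP * PP * expert-DP.
    """
    expert_model_parallel = etp * ep * pp
    if expert_model_parallel < 1 or world_size < 1:
        raise ValueError("all sizes must be positive")
    if world_size % expert_model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by ETP*EP*PP={expert_model_parallel}"
        )
    return world_size // expert_model_parallel


# DeepSeek-V2 example above: 128 GPUs, PP=4, EP=32, ETP assumed to be 1.
print(expert_dp_for_layout(128, pp=4, ep=32))  # 1
```

In this example the dense layers see DP = 128 / (1 * 4) = 32, and EP=32 uses all of those ranks, leaving an expert-DP degree of 1.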

MoE with small TP + PP + EP (e.g. DeepSeek-V3 671B on 1024 GPUs, the minimum for PP=16 with EP=64):

```python
cfg.model.tensor_model_parallel_size = 2
cfg.model.pipeline_model_parallel_size = 16
cfg.model.expert_model_parallel_size = 64
cfg.model.sequence_parallel = True
```

DP size is always implicit:

```
data_parallel_size = world_size / (TP * PP * CP)        # dense path
expert_da

…