
perf-hierarchical-context-parallel

General · 0 installs · Updated 19d ago
Verified · Curated · NVIDIA


---
name: perf-hierarchical-context-parallel
description: Operational guide for enabling hierarchical context parallelism in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.
when_to_use: Scaling context parallelism beyond KV heads, or investigating a commit that changed CP config and caused OOM or a regression; 'hierarchical_context_parallel_sizes', 'a2a+p2p', 'hierarchical CP', 'CP beyond KV heads', 'multi-level CP'.
---

# Hierarchical Context Parallel Skill

This skill covers hierarchical context parallelism: nested context-parallel process
groups used by `cp_comm_type="a2a+p2p"` and configured with
`hierarchical_context_parallel_sizes`.

For what hierarchical CP is, when to use it, and the decision tree
(`a2a+p2p` vs pure `a2a` vs `p2p`), see:

- @docs/training/hierarchical-context-parallel.md
- @skills/perf-hierarchical-context-parallel/card.yaml

## Enablement

Minimal Bridge override:

```python
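cfg.model.context_parallel_size = 4
cfg.model.cp_comm_type = "a2a+p2p"
# [a2a group size, p2p group size]; the product must equal context_parallel_size.
cfg.model.hierarchical_context_parallel_sizes = [2, 2]
# Use the MPU path: the decentralized-PG path builds its collection with hcp=None.
cfg.dist.use_decentralized_pg = False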
```

Required constraints:

- `prod(hierarchical_context_parallel_sizes) == context_parallel_size`
- `seq_length % (2 * context_parallel_size) == 0`
- Transformer Engine `>= 1.12.0`
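
The constraints above can be checked before a job is launched. The following is a minimal sketch, not part of Megatron-Bridge: `check_hierarchical_cp` is a hypothetical helper, and it assumes the Transformer Engine version is passed in as a plain version string.

```python
import re
from math import prod


def check_hierarchical_cp(cp_size, hcp_sizes, seq_length, te_version):
    """Return a list of human-readable constraint violations (empty if none).

    Hypothetical preflight helper; the names and messages are illustrative.
    """
    errors = []

    # prod(hierarchical_context_parallel_sizes) == context_parallel_size
    if prod(hcp_sizes) != cp_size:
        errors.append(
            f"prod({hcp_sizes}) = {prod(hcp_sizes)} != context_parallel_size = {cp_size}"
        )

    # seq_length % (2 * context_parallel_size) == 0
    if seq_length % (2 * cp_size) != 0:
        errors.append(
            f"seq_length = {seq_length} is not divisible by 2 * {cp_size}"
        )

    # Transformer Engine >= 1.12.0; compare the first three numeric fields,
    # padding with zeros so that "1.12" is treated as "1.12.0".
    numbers = [int(n) for n in re.findall(r"\d+", te_version)]
    te = tuple((numbers + [0, 0, 0])[:3])
    if te < (1, 12, 0):
        errors.append(f"Transformer Engine {te_version} is older than 1.12.0")

    return errors


if __name__ == "__main__":
    # Matches the minimal override: CP=4 split as [2, 2].
    print(check_hierarchical_cp(4, [2, 2], 8192, "1.12.0"))
```

An empty list means all three constraints hold; each violated constraint adds one message.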

## Code Anchors

Upstream config and validation:

```45:54:3rdparty/Megatron-LM/megatron/core/model_parallel_config.py
context_parallel_size: int = 1
"""Splits network input along sequence dimension across GPU ranks."""

hierarchical_context_parallel_sizes: Optional[list[int]] = None
"""Degrees of the hierarchical context parallelism. Users should provide a list to specify 
   the sizes for different levels. Taking the a2a+p2p cp comm type as example, it contains
   groups of two levels, so the first value of the list indicates the group size of the a2a
   communication type, and the second value indicates the group size of the p2p communication
   type.
"""
```

```428:433:3rdparty/Megatron-LM/megatron/training/arguments.py
if args.hierarchical_context_parallel_sizes:
    from numpy import prod
    assert args.context_parallel_size == prod(args.hierarchical_context_parallel_sizes)
if "a2a+p2p" in args.cp_comm_type:
    assert args.hierarchical_context_parallel_sizes is not None, \
    "--hierarchical-context-parallel-sizes must be set when a2a+p2p is used in cp comm"
```

Bridge MPU path:

```613:648:src/megatron/bridge/training/initialize.py
parallel_state.initialize_model_parallel(
    ...
    context_parallel_size=model_config.context_parallel_size,
    hierarchical_context_parallel_sizes=model_config.hierarchical_context_parallel_sizes,
    ...
)
...
return ProcessGroupCollection.use_mpu_process_groups()
```

Bridge decentralized-PG path:

```503:524:src/megatron/bridge/training/initialize.py
pg_collection = ProcessGroupCollection(
    ...
    cp=cp_pg,
    tp_cp=tp_cp_pg,
    hcp=None,
    ep=ep_pg,
    ...
)
```
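
The decentralized-PG anchor above builds its collection with `hcp=None`, which suggests why the minimal override sets `use_decentralized_pg = False`. A sketch of a guard that fails fast on the unsupported combination follows. `assert_hcp_supported` is a hypothetical name, not a Bridge API, and the config object is assumed to expose the same attributes as the override in the Enablement section.

```python
from types import SimpleNamespace


def assert_hcp_supported(cfg) -> None:
    """Raise if a2a+p2p is requested together with the decentralized-PG path.

    Hypothetical guard; assumes cfg.model.cp_comm_type is None, a str,
    or a list of str, and cfg.dist.use_decentralized_pg is a bool.
    """
    comm = cfg.model.cp_comm_type
    # `in` is substring match for a str and membership for a list,
    # so one expression covers both the uniform and per-layer forms.
    uses_hcp = comm is not None and "a2a+p2p" in comm
    if uses_hcp and cfg.dist.use_decentralized_pg:
        raise ValueError(
            "cp_comm_type contains 'a2a+p2p' but use_decentralized_pg=True; "
            "that path builds its process groups with hcp=None"
        )


if __name__ == "__main__":
    cfg = SimpleNamespace(
        model=SimpleNamespace(cp_comm_type="a2a+p2p"),
        dist=SimpleNamespace(use_decentralized_pg=False),
    )
    assert_hcp_supported(cfg)  # passes silently on the MPU path
```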

## Implementation Map

### Config definition

`hierarchical_context_parallel_sizes` is declared in `ModelParallelConfig`:

```python
# 3rdparty/Megatron-LM/megatron/core/model_parallel_config.py
hierarchical_context_parallel_sizes: Optional[list[int]] = None
# For a2a+p2p, first value = a2a group size, second value = p2p group size.
# Product must equal context_parallel_size.
```

`cp_comm_type` is declared in `TransformerConfig`:

```python
# 3rdparty/Megatron-LM/megatron/core/transformer/transformer_config.py
cp_comm_type: Optional[Union[str, List[str]]] = None
# Can be per-layer (List[str]) or uniform (str).
# Values: "p2p", "all_gather", "a2a", "a2a+p2p"
```
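
Because `cp_comm_type` may be a single string or a per-layer list, code that inspects it has to handle both shapes. The sketch below uses hypothetical helper names; the rule that a list must have one entry per layer is an assumption made for illustration.

```python
from typing import List, Optional, Union

CpCommType = Optional[Union[str, List[str]]]


def per_layer_cp_comm_types(cp_comm_type: CpCommType, num_layers: int) -> List[Optional[str]]:
    """Expand a uniform value to one entry per layer; pass a list through."""
    if cp_comm_type is None or isinstance(cp_comm_type, str):
        return [cp_comm_type] * num_layers
    # Assumption for this sketch: a per-layer list has exactly num_layers entries.
    if len(cp_comm_type) != num_layers:
        raise ValueError(
            f"expected {num_layers} cp_comm_type entries, got {len(cp_comm_type)}"
        )
    return list(cp_comm_type)


def needs_hcp_sizes(cp_comm_type: CpCommType) -> bool:
    """True if any layer uses a2a+p2p, in either the str or list form."""
    if cp_comm_type is None:
        return False
    # Substring test for a str, membership test for a list.
    return "a2a+p2p" in cp_comm_type


if __name__ == "__main__":
    print(per_layer_cp_comm_types("p2p", 3))
    print(needs_hcp_sizes(["p2p", "a2a+p2p"]))
```

The `in` test mirrors the check quoted from `arguments.py`, which works for both forms for the same reason.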

### Validation (MCore)

`TransformerConfig.__post_init__` enforces that `cp_comm_type="a2a+p2p"` is paired with `hierarchical_context_parallel_sizes` and that the product of those sizes equals `context_parallel_size`.

### Process group creation

When HCP sizes are provided, `parallel_state.initialize_model_parallel` calls
`create_hierarchical_groups` to build the hierarchical CP sub-groups. Bridge
currently gets those groups only through the MPU-backed `ProcessGroupCollection`;
the decentralized-PG path passes `hcp=None`.
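
To make the nesting concrete, the sketch below splits one CP group's ranks into per-level sub-groups. It is an illustration, not the Megatron implementation: it assumes the first level (a2a) takes adjacent ranks and each later level strides over the earlier ones. The authoritative layout is whatever `create_hierarchical_groups` produces, so confirm against that function.

```python
from math import prod


def hierarchical_groups(ranks, sizes):
    """Return, for each level, the list of rank sub-groups.

    Illustrative only. Level 0 groups adjacent ranks; level i groups ranks
    that are prod(sizes[:i]) apart.
    """
    if prod(sizes) != len(ranks):
        raise ValueError("product of sizes must equal the number of ranks")

    levels = []
    stride = 1  # distance between consecutive members of a group at this level
    for size in sizes:
        block = stride * size  # ranks covered by one set of interleaved groups
        groups = []
        for start in range(0, len(ranks), block):
            for offset in range(stride):
                groups.append(
                    [ranks[start + offset + k * stride] for k in range(size)]
                )
        levels.append(groups)
        stride = block
    return levels


if __name__ == "__main__":
    a2a_groups, p2p_groups = hierarchical_groups(list(range(4)), [2, 2])
    print(a2a_groups)  # [[0, 1], [2, 3]]
    print(p2p_groups)  # [[0, 2], [1, 3]]
```

Under this assumed layout with `[2, 2]`, each rank belongs to one a2a group of two adjacent ranks and one p2p group of two ranks that are two apart.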

### TE integ