resiliency


---
name: resiliency
description: Resiliency features in Megatron Bridge including fault tolerance, straggler detection, in-process restart, preemption, and re-run state machine.
when_to_use: Enabling resiliency features, or investigating a commit that caused training hangs, straggler detection failures, or broken restarts; 'fault tolerance', 'straggler detection', 'hang detection', 'automatic restart', 'in-process restart', 'preemption', 'nvidia-resiliency-ext'.
---

# Resiliency

Stable docs: @docs/training/resiliency.md, @docs/training/checkpointing.md
Card: @skills/resiliency/card.yaml

## Enablement

### Fault tolerance (Slurm only)

#### Option 1: NeMo Run plugin (recommended)

```python
from megatron.bridge.recipes.run_plugins import FaultTolerancePlugin
import nemo_run as run

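# Placeholder: point run.Script at your training script.
# `executor` (used below) is the Slurm executor you have already configured.
task = run.Script(...)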
run_plugins = [
    FaultTolerancePlugin(
        enable_ft_package=True,
        calc_ft_timeouts=True,
        num_in_job_restarts=3,
        num_job_retries_on_failure=2,
        initial_rank_heartbeat_timeout=1800,
        rank_heartbeat_timeout=300,
    )
]
run.run(task, plugins=run_plugins, executor=executor)
```

| Plugin parameter | Default | Description |
|---|---|---|
| `num_in_job_restarts` | 3 | Max restarts within same job |
| `num_job_retries_on_failure` | 2 | Max new job launches on failure |
| `initial_rank_heartbeat_timeout` | 1800 | First heartbeat timeout (seconds) |
| `rank_heartbeat_timeout` | 300 | Subsequent heartbeat timeout (seconds) |
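
The two heartbeat timeouts in the table can be read as a two-phase check: a long timeout applies until a rank sends its first heartbeat, and a shorter one applies afterwards. The following is a minimal sketch of that idea only. `HeartbeatTimeouts` and `is_rank_hung` are hypothetical names for illustration and are not part of Megatron Bridge or nvidia-resiliency-ext, whose actual monitoring logic may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatTimeouts:
    # Hypothetical container; values mirror the plugin defaults above (seconds).
    initial_rank_heartbeat_timeout: float = 1800.0
    rank_heartbeat_timeout: float = 300.0


def is_rank_hung(
    now: float,
    started_at: float,
    last_heartbeat_at: Optional[float],
    timeouts: HeartbeatTimeouts,
) -> bool:
    """Return True if the rank has been silent longer than the applicable timeout."""
    if last_heartbeat_at is None:
        # No heartbeat received yet: the longer initial timeout applies.
        return now - started_at > timeouts.initial_rank_heartbeat_timeout
    # After the first heartbeat, the shorter steady-state timeout applies.
    return now - last_heartbeat_at > timeouts.rank_heartbeat_timeout


timeouts = HeartbeatTimeouts()
# 1000 s after start with no heartbeat is still within the 1800 s initial budget.
startup_ok = not is_rank_hung(1000.0, 0.0, None, timeouts)
# 400 s of silence after a heartbeat exceeds the 300 s steady-state budget.
steady_state_hung = is_rank_hung(2300.0, 0.0, 1900.0, timeouts)
```

Under this reading, the initial timeout is set much higher than the steady-state one so that slow startup is not mistaken for a hang.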

#### Option 2: Direct config + ft_launcher

```python
from megatron.bridge.training.config import FaultToleranceConfig

# `cfg` is your existing training config object.
cfg.ft = FaultToleranceConfig(
    enable_ft_package=True,
    calc_ft_timeouts=True,
    simulate_fault=False,
    simulated_fault_type="random",
)
```

Launch with `ft_launcher` (not `torchrun`):

```bash
export GROUP_RANK=0  # only needed when launching outside Slurm
ft_launcher \
    --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    --nnodes=${NUM_NODES} --nproc-per-node=${NUM_GPUS_PER_NODE} \
    --ft-rank_section_timeouts=setup:600,step:180,checkpointing:420 \
    --ft-rank_out_of_section_timeout=300 \
    your_training_script.py
```

| Config parameter | Default | Description |
|---|---|---|
| `enable_ft_package` | False | Enable fault tolerance |
| `calc_ft_timeouts` | False | Auto-compute optimal timeouts |
| `simulate_fault` | False | Enable fault simulation for testing |
| `simulated_fault_type` | `"random"` | `"rank_hung"`, `"rank_killed"`, or `"random"` |
| `simulated_fault_rank` | None | Specific rank to fault (random if None) |
| `simulated_fault_base_delay` | 0 | Base delay before simulating fault |

Section-based monitoring applies a separate timeout to each section (setup,
training step, checkpointing) and to time spent outside any section. When
`calc_ft_timeouts=True`, the computed timeouts are saved to `ft_state.json`
and reused by subsequent runs.
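
To make the section model concrete, here is a small sketch that parses a timeout spec in the same `name:seconds` form used by the launch command above and picks the timeout that applies to a rank's current section. The helpers `parse_section_timeouts` and `timeout_for` are hypothetical and exist only for illustration. They do not reproduce how `ft_launcher` parses its arguments or tracks sections.

```python
from typing import Dict, Optional


def parse_section_timeouts(spec: str) -> Dict[str, float]:
    """Parse 'name:seconds,name:seconds' into a {name: seconds} dict."""
    timeouts: Dict[str, float] = {}
    for item in spec.split(","):
        name, _, seconds = item.partition(":")
        timeouts[name.strip()] = float(seconds)
    return timeouts


def timeout_for(
    section: Optional[str],
    section_timeouts: Dict[str, float],
    out_of_section_timeout: float,
) -> float:
    """Return the timeout for the given section; None means outside any section."""
    if section is None:
        return out_of_section_timeout
    return section_timeouts[section]


# Values taken from the ft_launcher example above.
section_timeouts = parse_section_timeouts("setup:600,step:180,checkpointing:420")
step_timeout = timeout_for("step", section_timeouts, out_of_section_timeout=300.0)
idle_timeout = timeout_for(None, section_timeouts, out_of_section_timeout=300.0)
```

Keeping one timeout per section lets a slow but healthy phase such as checkpointing have a generous budget without loosening hang detection for ordinary training steps.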

### NVRx straggler detection

```python
from megatron.bridge.training.config import NVRxStragglerDetectionConfig

cfg.nvrx_straggler = NVRxStragglerDetectionConfig(
    enabled=True,
    report_time_interval=300.0,
    calc_relative_gpu_perf=True,
    calc_individual_gpu_perf=True,
    num_gpu_perf_scores_to_print=5,
    gpu_relative_perf_threshold=0.7,
    gpu_individual_perf_threshold=0.7,
    stop_if_detected=False,
    enable_logging=True,
)
```

| Parameter | Default | Description |
|---|---|---|
| `enabled` | False | Enable straggler detection |
| `report_time_interval` | 300.0 | Seconds between straggler checks |
| `calc_relative_gpu_perf` | True | Compare ranks against each other |
| `calc_individual_gpu_perf` | True | Track per-rank degradation over time |
| `gpu_relative_perf_threshold` | 0.7 | Threshold for relative performance (0-1) |
| `gpu_individual_perf_threshold` | 0.7 | Threshold for individual performance (0-1) |
| `stop_if_detected` | False | Terminate training on straggler |
| `num_gpu_perf_scores_to_print` | 5 | Number of best/worst scores to print |
| `profiling_interval` | 1 | Profiling interval for detector |
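
The two thresholds can be illustrated with a simplified scoring scheme. The sketch below assumes a relative score of fastest-rank time divided by this rank's time, and an individual score of a rank's best past time divided by its latest time, so both fall in the 0-1 range and lower means slower. These formulas and the function names are assumptions for illustration. The real NVRx detector computes its scores internally and may use different measurements.

```python
from typing import Dict, List


def relative_scores(step_times: Dict[int, float]) -> Dict[int, float]:
    """Score each rank as fastest_time / rank_time; the fastest rank gets 1.0."""
    fastest = min(step_times.values())
    return {rank: fastest / t for rank, t in step_times.items()}


def individual_score(history: List[float]) -> float:
    """Score a rank's latest step time against its own best recorded time."""
    return min(history) / history[-1]


def find_stragglers(scores: Dict[int, float], threshold: float = 0.7) -> List[int]:
    """Return the ranks whose score falls below the threshold."""
    return sorted(rank for rank, score in scores.items() if score < threshold)


# Hypothetical per-rank step times in seconds.
step_times = {0: 1.0, 1: 1.25, 2: 2.0, 3: 1.0}
scores = relative_scores(step_times)
stragglers = find_stragglers(scores, threshold=0.7)

# Rank 2's own history: it used to take 1.0 s per step and now takes 2.0 s.
rank2_individual = individual_score([1.0, 1.0, 2.0])
```

With the default threshold of 0.7, rank 2 is flagged on both measures here. Rank 1 is 25% slower than the fastest rank but still scores 0.8, so it is not reported.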

…