---
name: resiliency
description: Resiliency features in Megatron Bridge including fault tolerance, straggler detection, in-process restart, preemption, and re-run state machine.
when_to_use: Enabling resiliency features, or investigating a commit that caused training hangs, straggler detection failures, or broken restarts; 'fault tolerance', 'straggler detection', 'hang detection', 'automatic restart', 'in-process restart', 'preemption', 'nvidia-resiliency-ext'.
---
# Resiliency
Stable docs: @docs/training/resiliency.md, @docs/training/checkpointing.md
Card: @skills/resiliency/card.yaml
## Enablement
### Fault tolerance (Slurm only)
#### Option 1: NeMo Run plugin (recommended)
```python
from megatron.bridge.recipes.run_plugins import FaultTolerancePlugin
import nemo_run as run

task = run.Script(...)
run_plugins = [
    FaultTolerancePlugin(
        enable_ft_package=True,
        calc_ft_timeouts=True,
        num_in_job_restarts=3,
        num_job_retries_on_failure=2,
        initial_rank_heartbeat_timeout=1800,  # seconds
        rank_heartbeat_timeout=300,  # seconds
    )
]
# `executor` must already be defined (a Slurm executor, since fault tolerance is Slurm only).
run.run(task, plugins=run_plugins, executor=executor)
```
| Plugin parameter | Default | Description |
|---|---|---|
| `num_in_job_restarts` | 3 | Max restarts within same job |
| `num_job_retries_on_failure` | 2 | Max new job launches on failure |
| `initial_rank_heartbeat_timeout` | 1800 | First heartbeat timeout (seconds) |
| `rank_heartbeat_timeout` | 300 | Subsequent heartbeat timeout (seconds) |
#### Option 2: Direct config + ft_launcher
```python
from megatron.bridge.training.config import FaultToleranceConfig

cfg.ft = FaultToleranceConfig(
    enable_ft_package=True,
    calc_ft_timeouts=True,
    simulate_fault=False,
    simulated_fault_type="random",
)
```
Launch with `ft_launcher` (not `torchrun`):
```bash
export GROUP_RANK=0 # required for non-Slurm
ft_launcher \
    --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    --nnodes=${NUM_NODES} --nproc-per-node=${NUM_GPUS_PER_NODE} \
    --ft-rank_section_timeouts=setup:600,step:180,checkpointing:420 \
    --ft-rank_out_of_section_timeout=300 \
    your_training_script.py
```
| Config parameter | Default | Description |
|---|---|---|
| `enable_ft_package` | False | Enable fault tolerance |
| `calc_ft_timeouts` | False | Auto-compute optimal timeouts |
| `simulate_fault` | False | Enable fault simulation for testing |
| `simulated_fault_type` | `"random"` | `"rank_hung"`, `"rank_killed"`, or `"random"` |
| `simulated_fault_rank` | None | Specific rank to fault (random if None) |
| `simulated_fault_base_delay` | 0 | Base delay before simulating fault |
Section-based timeout monitoring covers setup, training steps, checkpointing,
and out-of-section time independently. Timeouts are saved to `ft_state.json`
for subsequent runs when `calc_ft_timeouts=True`.
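The per-section timeouts passed to `ft_launcher` above are a comma-separated list of `name:seconds` pairs. As a minimal sketch, the flag value can be assembled from a dictionary so the numbers live in one place. The helper name `format_section_timeouts` is hypothetical and is not part of Megatron Bridge or `nvidia-resiliency-ext`; only the flag name and the section names come from the launch command above.

```python
def format_section_timeouts(timeouts: dict[str, int]) -> str:
    """Render section timeouts as an ft_launcher flag (hypothetical helper).

    `timeouts` maps a section name (e.g. "setup") to a timeout in seconds.
    """
    for name, seconds in timeouts.items():
        if seconds <= 0:
            raise ValueError(f"timeout for section {name!r} must be positive")
    # Dicts preserve insertion order, so pairs appear in the order given.
    pairs = ",".join(f"{name}:{seconds}" for name, seconds in timeouts.items())
    return f"--ft-rank_section_timeouts={pairs}"


# Same values as the launch command above.
section_timeouts = {"setup": 600, "step": 180, "checkpointing": 420}
flag = format_section_timeouts(section_timeouts)
print(flag)  # --ft-rank_section_timeouts=setup:600,step:180,checkpointing:420
```

The resulting string can be spliced into the `ft_launcher` argument list by whatever script generates the job submission.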
### NVRx straggler detection
```python
from megatron.bridge.training.config import NVRxStragglerDetectionConfig

cfg.nvrx_straggler = NVRxStragglerDetectionConfig(
    enabled=True,
    report_time_interval=300.0,
    calc_relative_gpu_perf=True,
    calc_individual_gpu_perf=True,
    num_gpu_perf_scores_to_print=5,
    gpu_relative_perf_threshold=0.7,
    gpu_individual_perf_threshold=0.7,
    stop_if_detected=False,
    enable_logging=True,
)
```
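To build intuition for `gpu_relative_perf_threshold`, the sketch below scores each rank against the fastest rank and flags any rank whose score falls below the threshold. This is an illustration only, not the NVRx implementation: the function names, the use of per-step wall-clock times as input, and the `fastest / time` scoring formula are all assumptions made for the example.

```python
def relative_scores(step_times: dict[int, float]) -> dict[int, float]:
    """Score each rank in (0, 1]; the fastest rank gets 1.0 (assumed formula)."""
    if not step_times or any(t <= 0 for t in step_times.values()):
        raise ValueError("step_times must be non-empty with positive times")
    fastest = min(step_times.values())
    return {rank: fastest / t for rank, t in step_times.items()}


def find_stragglers(step_times: dict[int, float], threshold: float = 0.7) -> list[int]:
    """Return the ranks whose relative score is below `threshold`."""
    scores = relative_scores(step_times)
    return sorted(rank for rank, score in scores.items() if score < threshold)


# Hypothetical per-rank step times in seconds; rank 2 is noticeably slower.
step_times = {0: 1.00, 1: 1.05, 2: 1.60, 3: 0.98}
stragglers = find_stragglers(step_times, threshold=0.7)
print(stragglers)  # [2]
```

Under this reading, raising the threshold toward 1.0 flags smaller slowdowns, while lowering it tolerates more variation between ranks.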
| Parameter | Default | Description |
|---|---|---|
| `enabled` | False | Enable straggler detection |
| `report_time_interval` | 300.0 | Seconds between straggler checks |
| `calc_relative_gpu_perf` | True | Compare ranks against each other |
| `calc_individual_gpu_perf` | True | Track per-rank degradation over time |
| `gpu_relative_perf_threshold` | 0.7 | Threshold for relative performance (0-1) |
| `gpu_individual_perf_threshold` | 0.7 | Threshold for individual performance (0-1) |
| `stop_if_detected` | False | Terminate training on straggler |
| `num_gpu_perf_scores_to_print` | 5 | Number of best/worst scores to print |
| `profiling_interval` | 1 | Profiling interval for detector |
…