run-on-slurm

General↓ 0 installsUpdated 64d ago
VerifiedCuratedNVIDIA
How to launch distributed Megatron-LM training jobs on a SLURM cluster. Covers a minimal sbatch skeleton, environment-variable setup for torch.distributed.run, CUDA_DEVICE_MAX_CONNECTIONS rules across hardware and parallelism modes, container conventions, monitoring, and per-rank failure diagnosis.
SKILL.md preview

---
name: run-on-slurm
description: How to launch distributed Megatron-LM training jobs on a SLURM cluster. Covers a minimal sbatch skeleton, environment-variable setup for torch.distributed.run, CUDA_DEVICE_MAX_CONNECTIONS rules across hardware and parallelism modes, container conventions, monitoring, and per-rank failure diagnosis.
when_to_use: Submitting a SLURM job; writing or debugging an sbatch script; configuring multi-node distributed training; setting MASTER_ADDR / MASTER_PORT / WORLD_SIZE; diagnosing a SLURM job failure; 'how do I run on the cluster', 'sbatch', 'multi-node training'.
---

# Run Megatron-LM on SLURM

## Prerequisites

- A SLURM cluster login with submission rights to a GPU partition.
- Megatron-LM checked out on a filesystem visible to all nodes in the allocation (NFS, Lustre, or similar). All nodes must reach the same paths for code, data, checkpoints, and output.
- `uv` installed; run `uv sync --extra training --extra dev` (or `--extra lts`) on the worktree once before submission so the `.venv` is materialized and visible to every node.

## Minimal sbatch script

Save as `run_megatron.slurm` in the worktree:

```bash
#!/bin/bash
#SBATCH --job-name=megatron
#SBATCH --account=<SLURM_ACCOUNT>
#SBATCH --partition=<SLURM_PARTITION>
#SBATCH --nodes=<NODES>
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=<GPUS_PER_NODE>
#SBATCH --time=<HH:MM:SS>
#SBATCH --output=logs/%x-%j.out
#SBATCH --error=logs/%x-%j.err

set -euo pipefail
cd <MEGATRON_WORKTREE>

export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
export MASTER_PORT=${MASTER_PORT:-29500}
export NNODES=${SLURM_NNODES}
export GPUS_PER_NODE=<GPUS_PER_NODE>
export WORLD_SIZE=$((NNODES * GPUS_PER_NODE))

# Set CUDA_DEVICE_MAX_CONNECTIONS only when your configuration requires it
# (see the section below). Example for pre-Blackwell with TP>1 or CP>1
# (non-FSDP):
#   export CUDA_DEVICE_MAX_CONNECTIONS=1

srun --ntasks=${NNODES} --ntasks-per-node=1 bash -c '
  # NODE_RANK comes from SLURM_NODEID with one task per node.
  NODE_RANK=${SLURM_NODEID}
  uv run python -m torch.distributed.run \
    --nnodes='"${NNODES}"' \
    --nproc-per-node='"${GPUS_PER_NODE}"' \
    --node-rank=${NODE_RANK} \
    --master-addr='"${MASTER_ADDR}"' \
    --master-port='"${MASTER_PORT}"' \
    pretrain_gpt.py \
      <MEGATRON_ARGS>
'
```

Submit:

```bash
mkdir -p logs && JOB_ID=$(sbatch --parsable run_megatron.slurm)
echo "Submitted ${JOB_ID}"
```

## Multi-node rules

- Submit from the worktree you intend to run, or `cd` to it in the script. All nodes must reach the same path on a shared filesystem (NFS, Lustre, or similar) — node-local paths will not be visible to peer ranks.
- Use one `torchrun` worker group across all nodes; do not start independent single-node jobs.
- `--nproc-per-node` should equal the number of visible GPUs per node.
- Write checkpoints, tensorboard data, and structured logs to shared storage.

## CUDA_DEVICE_MAX_CONNECTIONS

The right value depends on your hardware and parallelism mode. Do not export it unconditionally:

- **Pre-Blackwell (Hopper, Ampere) with TP>1 or CP>1, non-FSDP:** set to `1`. The relevant code path asserts on this — you will get an assertion error if it is not `1`, not a silent deadlock.
- **Blackwell:** not required; setting it has no effect.
- **Torch-FSDP2 or Megatron-FSDP:** must NOT be `1`. Leave the env var unset, or set it to a value greater than `1`.
- **`overlap_moe_expert_parallel_comm` enabled:** set to `32`.

Set it explicitly in the sbatch script when your configuration calls for it.

## Containers

Many sites run Megatron-LM inside a container (enroot/pyxis on some clusters, singularity on others). If you do, the uv-managed `.venv` must live on a path that is visible from inside the container, and the container image must provide the CUDA / NCCL / torch versions the repo expects (see `docker/.ngc_version.dev` and `.ngc_version.lts`). The skeleton above stays the same; wrap the `srun` invocat

…