nemo-rl-e2e-testing

---
name: nemo-rl-e2e-testing
description: External NeMo-RL end-to-end validation workflow for Megatron-Bridge model/provider changes, including downstream compatibility checks, external RL lifecycle behavior, Megatron policy setup, HF import/export, checkpoint/resume, non-colocated vLLM refit, delta weight transfer, optional LoRA/generation variants, and questions such as "does this model work in NeMo-RL", "run NeMo-RL e2e", or "external RL loop validation". Covers running NeMo-RL Megatron policy jobs from a Bridge checkout, choosing GRPO/SFT/checkpoint/non-colocated refit variants, setting PYTHONPATH so NeMo-RL imports the local Bridge tree, and reporting pass/fail evidence.
when_to_use: Adding or changing a Megatron-Bridge model/provider and needing downstream NeMo-RL compatibility validation; checking non-vanilla Bridge provider paths; testing PEFT/LoRA, checkpoint behavior, non-colocated vLLM refit, or explicitly requested advanced variants through NeMo-RL; 'does this model work in NeMo-RL', 'run NeMo-RL e2e', 'external RL loop validation'.
---

# NeMo-RL E2E Testing

Validate a Megatron-Bridge model or training API change through NeMo-RL's Megatron backend. This catches integration issues that Bridge-only tests miss: NeMo-RL-owned rollout scheduling, reward handling, policy/reference setup, HF import/export through Bridge, optimizer setup, checkpoint ownership, and policy-to-generation weight transfer.

Use this as an external compatibility smoke test after the focused Bridge tests for the model/provider change pass.

This is not a replacement for Bridge model parity tests. A NeMo-RL GRPO or SFT run proves that Bridge can survive an external RL training loop; architecture correctness still comes from Bridge import/export, logits, roundtrip, and model-specific inference tests.

## Scope

Think in coverage levels. Start with Level 0 and add only the levels justified by the change.

| Level | Required when | What it proves |
|---|---|---|
| 0: Megatron policy GRPO smoke | Any new provider or provider config change that claims NeMo-RL compatibility | NeMo-RL can import the local Bridge provider, build a Megatron policy, initialize optimizer/scheduler state, run rollout/ref/logprob wiring, and finish a short GRPO job |
| 1: LoRA/checkpoint variant | Checkpointing, HF export, optimizer state, resume behavior, or a NeMo-RL-supported PEFT path changed | NeMo-RL can save through its checkpoint schedule, resume without losing training state, and, when PEFT is enabled in that NeMo-RL checkout, apply Bridge LoRA hooks |
| 2: Non-colocated vLLM refit | HF export, weight mapping, policy-to-generation refit, delta compression, packed transfer, or vLLM update behavior changed | Bridge-exported weights can be transferred from the Megatron policy worker into separate vLLM generation workers |
| 3: Optional Megatron generation backend | Only when the NeMo-RL checkout still supports `policy.generation.backend=megatron` and the change explicitly targets that path | NeMo-RL can use Megatron for both policy and generation rather than only vLLM generation |
| 4: Parallelism stress | TP/PP/CP/EP, sequence parallel, MoE dispatch, pipeline stage layout, or distributed optimizer behavior changed | Provider settings remain correct under non-trivial Megatron parallel state |
| 5: Architecture-specific e2e | VLM, audio, MoE, MTP/draft models, FP8/QAT/ModelOpt, quantized weights, or custom layers are involved | The architecture-specific runtime path is exercised, not just a text-only dense GRPO smoke |
| 6: Learning signal | Optimizer, scheduler, loss, reward, PEFT trainability, gradient flow, or training stability changed | Metrics move in the expected direction over a short run and do not silently produce zero/NaN/unstable updates |
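
The table can be read as a lookup from "what changed" to "which levels to run". The sketch below is one hypothetical way to encode that lookup; the `select_levels` function and its tag names are illustrative assumptions, not part of NeMo-RL or Megatron-Bridge, and the mapping is a simplification of the table (for example, an HF export change is listed under both Level 1 and Level 2).

```bash
#!/usr/bin/env bash
# Hypothetical helper: map change-area tags to coverage levels from the table.
# Tag names are illustrative only; adjust them to your own change categories.
select_levels() {
  local levels="0"   # Level 0 is always the starting point
  local tag
  for tag in "$@"; do
    case "$tag" in
      checkpoint|resume|lora|peft)       levels="$levels 1" ;;
      hf-export)                         levels="$levels 1 2" ;;
      weight-mapping|refit)              levels="$levels 2" ;;
      megatron-generation)               levels="$levels 3" ;;
      tp|pp|cp|ep|moe-dispatch)          levels="$levels 4" ;;
      vlm|audio|moe|fp8|quantized)       levels="$levels 5" ;;
      optimizer|scheduler|loss|reward)   levels="$levels 6" ;;
      *) echo "unknown change tag: $tag" >&2; return 1 ;;
    esac
  done
  # De-duplicate, sort numerically, and join on one line.
  printf '%s\n' $levels | sort -nu | paste -sd ' ' -
}
```

For example, `select_levels lora refit` selects Levels 0, 1, and 2, while calling it with no tags selects only Level 0.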

The default Level 0 target is NeMo-RL's maintained Megatron GRPO functional test:

```bash
uv run bash tests/functional/grpo_megatron.sh
```
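
Because this workflow ends in reported pass/fail evidence, it can help to capture the full log and exit status of each run. The wrapper below is a minimal sketch under stated assumptions: `run_with_evidence` is a hypothetical helper, and `BRIDGE_SRC` is a hypothetical variable that should point at whichever directory of your local Bridge checkout needs to be on `PYTHONPATH` (confirm the exact subdirectory in your tree).

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run a command, keep its full log, print a one-line verdict.
run_with_evidence() {
  local log="$1"; shift
  local status=0
  # Capture stdout and stderr; record the exit code instead of aborting on failure.
  "$@" >"$log" 2>&1 || status=$?
  if [ "$status" -eq 0 ]; then
    echo "PASS: $* (log: $log)"
  else
    echo "FAIL ($status): $* (log: $log)"
  fi
  return "$status"
}

# Example usage (BRIDGE_SRC is an assumption; set it for your checkout):
#   export PYTHONPATH="$BRIDGE_SRC${PYTHONPATH:+:$PYTHONPATH}"
#   run_with_evidence grpo_megatron.log uv run bash tests/functional/grpo_megatron.sh
```

The wrapper returns the wrapped command's exit status, so it can be chained in a larger script, and the log file stays available as evidence whether the run passes or fails.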

The Level 0 target is intentionally small. It exercises NeMo-RL's external RL loop without making

…