bump-dependency

General · 0 installs · Updated 19d ago
Verified · Curated · NVIDIA


---
name: bump-dependency
description: Bump a pinned dependency (TransformerEngine, Megatron-LM, NRX, etc.), regenerate the lockfile, open a PR, and drive it to green by attaching a watchdog to the "CICD NeMo" workflow and quarantining failing functional tests as flaky until the run is green.
when_to_use: Bumping a dependency pin in `pyproject.toml` or `uv.lock` and shepherding the PR to green. 'bump TE', 'bump transformer-engine', 'update TE pin', 'bump submodule', 'update lock file', 'bump dependency PR', 'watch CI for a bump', 'quarantine flaky tests after bump', 'run all tests for this bump'.
---

# Bump Dependency

End-to-end workflow for shipping a dependency bump in Megatron Bridge.
Optimised for the case where TE, MCore, or another GPU-heavy pin moves
forward — which often surfaces flakes that have to be quarantined before
the PR can land.

The pipeline is always: **edit → relock → push → /ok to test → watchdog →
quarantine on red → re-trigger → repeat until green**.

## When to reach for this skill

- Bumping a git-source pin in `pyproject.toml` `override-dependencies`
  (e.g. `transformer-engine @ git+...@<ref>`).
- Bumping the `3rdparty/Megatron-LM` submodule.
- Any change that touches `uv.lock` and needs the full L0 + L1 matrix to
  prove out before merge.

For pure dep additions/removals without a CI loop, the
`build-and-dependency` skill is enough.

## Required context

Read first, then follow the steps below:

- @CONTRIBUTING.md — PR title/label policy, DCO sign-off
- @skills/build-and-dependency/SKILL.md — `uv lock` mechanics, container choice
- @skills/cicd/SKILL.md — how `copy-pr-bot` and `/ok to test` work
- @skills/testing/SKILL.md — `active/` vs `flaky/` directory layout, `git mv` quarantine recipe

## Step 1 — Worktree and edit

Create a worktree off `main` per @CLAUDE.md. Then, **before any `uv lock`**:

```bash
git submodule update --init 3rdparty/Megatron-LM
```

The submodule must be initialised in the worktree or `uv lock` errors
with "not a Python project" on the MCore path.
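Before relocking, it can help to confirm the submodule really is initialised. The sketch below classifies one line of `git submodule status` output by its leading character: git prints `-` for an uninitialised submodule, `+` when the checked-out commit differs from the recorded one, and `U` for merge conflicts. The helper name `submodule_state` is hypothetical and not part of this repository.

```bash
# Hypothetical helper: classify one line of `git submodule status` output.
# The first character of the line encodes the submodule's state.
submodule_state() {
    case "$1" in
        "") echo "unknown" ;;        # no output, e.g. path is not a submodule
        -*) echo "uninitialised" ;;  # run `git submodule update --init` first
        +*) echo "out-of-sync" ;;    # checkout differs from the recorded commit
        U*) echo "conflict" ;;       # submodule has merge conflicts
        *)  echo "ok" ;;             # leading space: matches the recorded commit
    esac
}

# Usage inside the worktree (not executed here):
#   submodule_state "$(git submodule status 3rdparty/Megatron-LM)"
```

Anything other than `ok` is a reason to rerun the `git submodule update --init` command above before `uv lock`.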

Edit the pin. For TE the canonical knob is the override line in
`pyproject.toml`:

```toml
override-dependencies = [
    ...
    "transformer-engine @ git+https://github.com/NVIDIA/TransformerEngine.git@<new-ref>",
    ...
]
```

Use a **branch name** (`release_v2.15`) only when you want to track a
moving tip; use a full SHA for reproducibility. TE branches use
`release_vX.Y` (underscore), not `release/vX.Y`. Verify with
`git ls-remote https://github.com/NVIDIA/TransformerEngine.git`.
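To pin a reproducible SHA while still starting from a release branch, one option is to resolve the branch tip from the `git ls-remote` output. This is a minimal sketch: `is_full_sha` and `sha_for_branch` are hypothetical helper names, and the SHA check assumes 40-character SHA-1 object names.

```bash
# Hypothetical helper: succeed when the argument looks like a full
# 40-character hex SHA (assumes SHA-1 object names).
is_full_sha() {
    printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{40}$'
}

# Hypothetical helper: read `git ls-remote` output on stdin and print the
# SHA of one branch. Each line has the form "<sha><TAB>refs/heads/<branch>".
sha_for_branch() {
    awk -v ref="refs/heads/$1" '$2 == ref { print $1 }'
}

# Usage (needs network access, so not executed here):
#   git ls-remote https://github.com/NVIDIA/TransformerEngine.git \
#       | sha_for_branch release_v2.15
```

An empty result from `sha_for_branch` means the branch name does not exist on the remote, which is the usual symptom of writing `release/vX.Y` instead of `release_vX.Y`.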

## Step 2 — Regenerate the lockfile

Run `uv lock` inside the project container per
@skills/build-and-dependency/SKILL.md "Regenerating uv.lock". Then
confirm only the intended packages moved:

```bash
git diff --stat pyproject.toml uv.lock
```

If the diff carries transitive movements you can't explain, stop and
investigate before pushing. One exception: `override-dependencies`
carries floating CVE floors, so unrelated packages moving forward by a
patch version is expected. Accept those; don't revert them.
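`git diff --stat` only reports line counts, so a per-package summary can make unexplained movements easier to spot. The sketch below reads a unified diff of `uv.lock` on stdin and prints each version change. It assumes every package entry lists `name = "..."` before `version = "..."`, and that the name line falls within the diff context. `lock_version_changes` is a hypothetical helper. A git-source pin can move without a version change, so the raw diff remains the source of truth.

```bash
# Hypothetical helper: summarise version changes from a unified diff of
# uv.lock read on stdin. Assumes `name = "..."` precedes `version = "..."`
# in each package entry.
lock_version_changes() {
    awk '
        # Remember the most recent package name (context or changed line).
        /^[ +-]name = "/  { split($0, a, "\""); name = a[2] }
        # A removed version line carries the old version.
        /^-version = "/   { split($0, a, "\""); old[name] = a[2] }
        # An added version line carries the new version; report the pair.
        /^[+]version = "/ { split($0, a, "\""); print name ": " old[name] " -> " a[2] }
    '
}

# Usage (not executed here):
#   git diff uv.lock | lock_version_changes
```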

## Step 3 — Commit and push

Follow @CONTRIBUTING.md and @skills/cicd/SKILL.md "Commit and PR
Workflow" for DCO sign-off, commit signing, and PR title format. For a bump:

```bash
git add pyproject.toml uv.lock
git commit -S -s -m "[build] chore: bump <package> to <ref>"
git push -u origin <branch-name>
```

A signed commit (`-S`) lets `copy-pr-bot` trigger CI without manual
`/ok to test` for the first push — but you'll still post `/ok to test`
on every subsequent SHA in this loop (Step 5).
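Before pushing, a quick local check can confirm that both flags took effect. This is a sketch, not the project's own tooling: `has_signoff` is a hypothetical helper that looks for a DCO trailer in a commit message read on stdin, and the signature check relies on git's `%G?` format placeholder, which prints `N` when a commit has no signature.

```bash
# Hypothetical helper: succeed when a commit message read on stdin
# contains a "Signed-off-by: Name <email>" trailer line.
has_signoff() {
    grep -Eq '^Signed-off-by: .+ <.+>$'
}

# Usage (not executed here):
#   git log -1 --format=%B | has_signoff || echo "missing sign-off (-s)"
#   git log -1 --format='%G?'   # N means the commit carries no signature
```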

## Step 4 — Open the PR

Title and labels per @CONTRIBUTING.md. Two bump-specific requirements:

- Apply `needs-more-tests` — **mandatory** for a bump; expands the matrix
  from L0 to L0+L1.
- For a high-blast-radius bump (TE, MCore submodule, anything that
  touches CUDA kernels), also apply `full-test-suite` to pull L2 into
  the PR run. L2 covers VL models, checkpoint conversion, and heavy
  quantization, which otherwise run only on schedule.

The PR