bump-base-image
Bump the NVIDIA PyTorch base image (`nvcr.io/nvidia/pytorch:<YY.MM>-py3`) used by Megatron-LM CI. Covers the two pin sites (GitHub CI in `docker/.ngc_version.dev` and GitLab CI in `.gitlab/stages/01.build.yml`), the post-bump CI loop (re-run functional tests, refresh golden values, mark broken tests), and the gotchas that bit PRs #4611 and #4688.
SKILL.md preview
---
name: bump-base-image
description: Bump the NVIDIA PyTorch base image (`nvcr.io/nvidia/pytorch:<YY.MM>-py3`) used by Megatron-LM CI. Covers the two pin sites (GitHub CI in `docker/.ngc_version.dev` and GitLab CI in `.gitlab/stages/01.build.yml`), the post-bump CI loop (re-run functional tests, refresh golden values, mark broken tests), and the gotchas that bit PRs #4611 and #4688.
when_to_use: User wants to upgrade the PyTorch container (e.g. "bump base image to 26.04"); CI is failing after a previous bump because the GitLab pin was missed; functional tests are failing with `lm loss` / `num-zeros` / `iteration-time` drift right after a container bump; a functional test hangs, times out, or OOMs after a bump; the user mentions `.ngc_version.dev`, `nvcr.io/nvidia/pytorch`, "container base image", or "Update Docker image version".
---

# Bump the PyTorch base image

End-to-end workflow for moving Megatron-LM's CI to a newer `nvcr.io/nvidia/pytorch:<YY.MM>-py3` container. The most common failure mode is forgetting that **GitHub CI and GitLab CI have separate pins**: a bump that only touches the former lands green, then breaks GitLab CI on `main` and forces an immediate follow-up PR. Always update both in the same PR.

## Inputs to gather from the user

1. **Target tag**, e.g. `26.04-py3`. NVIDIA NGC PyTorch containers are released as `nvcr.io/nvidia/pytorch:YY.MM-py3`.
2. **Scope**: usually `dev` only. The `lts` pin (`docker/.ngc_version.lts`, plus the `IMAGE_TYPE: lts` rows in GitLab) is bumped on a different cadence; only touch it if the user explicitly asks.
3. **Workflow run ID** (optional but typical): after the first CI run, the user will provide a GitHub Actions run ID for golden-value refresh.
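Before editing either pin, it can help to sanity-check the target tag gathered from the user. The following is a minimal sketch, assuming the tag always has the `YY.MM-py3` shape described above; `is_valid_ngc_tag` and `image_ref` are hypothetical helper names for illustration, not part of the Megatron-LM repository.

```bash
# Hypothetical helpers (not part of Megatron-LM): validate a target tag and
# build the full image reference from it.

# Succeeds only if the tag looks like YY.MM-py3 with MM in 01..12.
is_valid_ngc_tag() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{2}\.(0[1-9]|1[0-2])-py3$'
}

# Prints the full image reference for a valid tag; returns 1 otherwise.
image_ref() {
  if ! is_valid_ngc_tag "$1"; then
    echo "unexpected tag format: $1" >&2
    return 1
  fi
  printf 'nvcr.io/nvidia/pytorch:%s\n' "$1"
}
```

For example, `image_ref 26.04-py3` prints `nvcr.io/nvidia/pytorch:26.04-py3`, which is the value that belongs in both pin sites, while a tag such as `26.04` (missing the `-py3` suffix) is rejected.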
## Workflow

```
- [ ] Step 1: Update the GitHub CI pin (docker/.ngc_version.dev)
- [ ] Step 2: Update the GitLab CI pin (.gitlab/stages/01.build.yml)
- [ ] Step 3: Open the PR with the `Run functional tests` label
- [ ] Step 4: Re-run failing tests via `/ok to test <commit-sha>`
- [ ] Step 5: For golden-value drift → refresh with the `update-golden-values` skill
- [ ] Step 6: For hangs / real regressions → mark tests `mr-broken` and file tracking issues
- [ ] Step 7: Verify both pins are in sync before merging
```

### Step 1: GitHub CI pin

`docker/.ngc_version.dev` is a single-line file consumed by `docker/Dockerfile.ci.dev` (via `FROM_IMAGE_NAME=$(cat docker/.ngc_version.dev)`). Overwrite it:

```bash
echo 'nvcr.io/nvidia/pytorch:<YY.MM>-py3' > docker/.ngc_version.dev
```

The file has no trailing newline historically; preserving or adding one is fine, because the build args read the value via `$(cat ...)` and command substitution strips trailing newlines. Do **not** touch `docker/.ngc_version.lts` unless bumping LTS too.

### Step 2: GitLab CI pin

GitLab CI does **not** read `docker/.ngc_version.dev`. It hardcodes `BASE_IMAGE` in a `parallel: matrix:` block. Update the two `IMAGE_TYPE: dev` rows (one per platform):

```yaml
# .gitlab/stages/01.build.yml, under test:pre_build_image -> parallel.matrix
- IMAGE: CI_MCORE_DEV_IMAGE
  FILE: Dockerfile.ci.dev
  IMAGE_TYPE: dev
  BASE_IMAGE: nvcr.io/nvidia/pytorch:<YY.MM>-py3  # amd64 row
  PLATFORM: amd64
- IMAGE: CI_MCORE_DEV_IMAGE
  FILE: Dockerfile.ci.dev
  IMAGE_TYPE: dev
  BASE_IMAGE: nvcr.io/nvidia/pytorch:<YY.MM>-py3  # arm64 row
  PLATFORM: arm64
```

Leave the `IMAGE_TYPE: lts` rows alone. Quick sanity check before commit:

```bash
rg -n '^\s*BASE_IMAGE: nvcr\.io/nvidia/pytorch:' .gitlab/stages/01.build.yml
# expect: lts pin × 2 unchanged, dev pin × 2 == new tag
```

### Step 3: Open the PR

- Title convention: `chore: Update Docker image version to <YY.MM>-py3` (see #4611).
- **Apply the `Run functional tests` label** before the first push.
  This unlocks the full functional matrix on the PR; without it the bump only runs the standard GH PR checks and you'll miss the drift.
- Push as draft first if you're still iterating; the bot will auto-draft otherwise.

### Step 4: Re-running CI on a new commit

For PRs from fork …