---
name: nightly-sync
description: Domain knowledge for the nightly main-to-dev sync workflow. Covers merge strategy, CI architecture, failure investigation, and known issues.
when_to_use: Working on the nightly sync PR; investigating a nightly sync failure; resolving merge conflicts between main and dev; 'nightly sync failed', 'main-to-dev merge', 'sync bot'.
---

# Nightly Sync: Main to Dev

This skill is read by the automated sync bot during the nightly-sync-main-to-dev workflow. It contains all domain knowledge for merging main into dev, resolving conflicts, iterating on CI, and shipping the PR.

---

## Phase 1: Create the Sync Branch and Merge

### Branch Setup

1. Create branch `$BRANCH` from `origin/dev`
2. Merge: `git merge origin/main -X theirs --no-edit`
3. If conflicts remain (e.g. add/add), resolve by favoring main

### Preserving Dev-Only Additions

Do NOT blanket-override all shared files with main's version. Dev has features not yet in main (new classes, new modules, new tests). The merge preserves both sides' non-conflicting additions; only intervene where there is an actual conflict.

### Squash-Merge Chain Detection

Dev often develops features as a chain of PRs (PR1 → PR2 → PR3) where each builds on the last. When PR1 is squash-merged to main, git sees main's squashed version and dev's original commits as unrelated changes. `-X theirs` will pick main's PR1 code and silently discard PR2/PR3's improvements on dev.

After the merge, check for this pattern:

1. For each file where `-X theirs` resolved a conflict, run `git log --oneline origin/dev -- <file>` to see if dev has commits that came AFTER the code main is bringing in.
2. If dev has follow-up commits (bug fixes, refactors, extensions), **favor dev's version** for those sections.
3. If the conflict is just main bringing in a clean copy of what dev already has (no follow-ups), main's version is fine.

Practical check: run `git diff origin/dev -- <file>` on conflicted files.
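
The two git commands above can be scripted per file. The following is only a minimal sketch: the helper names `review_commands`, `removed_dev_lines`, and `review_file` are hypothetical and not part of the workflow, and the caller is assumed to supply the list of conflicted paths.

```python
import subprocess
from typing import List


def review_commands(path: str, dev_ref: str = "origin/dev") -> List[List[str]]:
    """Build the two git commands from the text for one conflicted file."""
    return [
        ["git", "log", "--oneline", dev_ref, "--", path],  # dev's history for the file
        ["git", "diff", dev_ref, "--", path],  # dev's version vs. the merged working tree
    ]


def removed_dev_lines(diff_text: str) -> List[str]:
    """Return lines that exist on dev but are missing from the merge result.

    In `git diff origin/dev -- <file>` output these are the '-' lines inside hunks.
    """
    removed = []
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False  # new file header: skip its '---' and '+++' lines
        elif line.startswith("@@"):
            in_hunk = True
        elif in_hunk and line.startswith("-"):
            removed.append(line[1:])
    return removed


def review_file(path: str) -> List[str]:
    """Run both commands (needs a git checkout) and summarize what to inspect."""
    log_cmd, diff_cmd = review_commands(path)
    log = subprocess.run(log_cmd, capture_output=True, text=True, check=True).stdout
    diff = subprocess.run(diff_cmd, capture_output=True, text=True, check=True).stdout
    removed = removed_dev_lines(diff)
    print(f"== {path}: {len(log.splitlines())} dev commits, {len(removed)} dev lines dropped")
    return removed
```

A non-empty result from `review_file` does not mean dev's version is the right one. It only flags files where the log is worth reading to see whether dev has follow-up commits.
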
If dev's code was removed or reverted, investigate whether dev's version is the more evolved one.

Real examples from PR #4291:

- `emerging_optimizers.py`: Main's version was MORE complete: it squash-merged dev's PRs plus added more. `-X theirs` was correct.
- `distrib_optimizer.py`: Main overwrote dev's `GroupedQuantizedTensor` support. Had to restore `_is_distopt_quantized_param` and the expanded `_expand_quantized_param_shard_for_cast` loop while keeping main's NVFP4 additions. This required a surgical merge combining sections from both.

Key insight: squash-merge chains can go in EITHER direction. Sometimes main is ahead (it squash-merged dev's work + more), sometimes dev is ahead (it has follow-up PRs). Always diff both ways before deciding which version to favor.

### Files to Override from Main

These files have known semantic conflicts where dev's versions reference args or APIs that main removed or renamed. Take main's version with `git checkout origin/main -- <file>`:

- `megatron/training/training.py`: references dev-only args
- `megatron/training/initialize.py`: references dev-only args
- `megatron/training/utils.py`: references dev-only args
- `megatron/training/datasets/data_samplers.py`: references dev-only args
- `megatron/core/optimizer/layer_wise_optimizer.py`: constructor signature

**Caveat for ALL overrides:** After taking main's version of any file, you MUST run the API Mismatch Detection procedure (see below) on that file. Taking main's caller code while keeping dev's callee implementations is the #1 source of sync bugs.

**IMPORTANT: Do NOT take main's `pyproject.toml`, `uv.lock`, or `docker/Dockerfile.ci.dev`.** These three files are a tightly coupled triple: the Dockerfile's `uv sync` command must match the dependency groups in `pyproject.toml`, and `uv.lock` must be consistent with both. Main's versions are missing dev-only dependencies (e.g. `fast-hadamard-transform`, correct TransformerEngine revision) and the `--group no_pypi_ …