
update-golden-values

General · 0 installs · Updated 19d ago
Verified · Curated · NVIDIA

Refresh golden values from a GitHub Actions workflow run (failing-only or all jobs), score the change with average normalized relative differences, and produce a PR-ready summary. Use when the user asks to update goldens for a CI run, refresh golden values from a workflow ID, or generate a golden-value diff summary for a PR description.

SKILL.md preview

---
name: update-golden-values
description: Refresh golden values from a GitHub Actions workflow run (failing-only or all jobs), score the change with average normalized relative differences, and produce a PR-ready summary. Use when the user asks to update goldens for a CI run, refresh golden values from a workflow ID, or generate a golden-value diff summary for a PR description.
when_to_use: User provides a GitHub Actions workflow run ID and asks to refresh golden values; user asks to update goldens for "failing tests only" or "all tests"; user asks for a per-metric relative-difference summary of the golden-value diff; user wants a PR description blurb after running download_golden_values.py.
---

# Update golden values + relative-diff summary

End-to-end workflow for refreshing golden values from a GitHub Actions workflow run, scoring the update with a per-metric average normalized relative difference, and writing a PR-ready summary.

The skill orchestrates two scripts that already live in the repo:

- `tests/test_utils/python_scripts/download_golden_values.py` — pulls artifacts from a workflow run and overwrites `tests/functional_tests/test_cases/**/golden_values_*.json`.
- `tests/test_utils/python_scripts/compare_golden_values_kl.py` — diffs the working-tree goldens against `git HEAD` and reports per-metric `avg_rel_diff = mean((old − new) / old)`. (Filename keeps the legacy `_kl` suffix; the script no longer computes KL divergence.)
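
As a rough illustration of the metric named above, the following sketch computes `mean((old − new) / old)` over paired values. It is not taken from `compare_golden_values_kl.py`: the function name `avg_rel_diff`, the sample numbers, and the choice to skip pairs whose old value is zero are all assumptions made for this example.

```python
from statistics import mean


def avg_rel_diff(old, new):
    """Mean of (old - new) / old over paired values.

    Assumption: pairs whose old value is 0 are skipped to avoid division
    by zero; the real script may handle that case differently.
    """
    if len(old) != len(new):
        raise ValueError("old and new must have the same length")
    terms = [(o - n) / o for o, n in zip(old, new) if o != 0]
    return mean(terms) if terms else 0.0


if __name__ == "__main__":
    # Made-up values for one metric across three steps.
    old_loss = [10.0, 8.0, 4.0]
    new_loss = [9.0, 8.0, 5.0]
    print(avg_rel_diff(old_loss, new_loss))
```

With this sign convention, a positive result means the new values are on average lower than the old ones, and a negative result means they are higher.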

## Inputs to gather from the user

1. **GitHub Actions workflow run ID** (e.g. `25341543542`). It's the numeric ID in the run URL.
2. **Source**: should be `github` for this workflow. (`gitlab` is supported by the download script but uses a different env path.)
3. **Scope** — accept one of:
   - `only-failing` → run with `--only-failing` (download from failing/cancelled jobs only). Use this for "fix the broken tests" workflows.
   - `all` → run without `--only-failing` (download from every job that produced golden values). Use this when the user wants a full refresh.

   If the user doesn't specify, ask. Don't silently default.
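
The input checks above could be sketched as follows. This is a hypothetical helper, not part of the repo: `parse_run_id` and `require_scope` are names invented here, and the only external fact relied on is that GitHub run URLs contain `/actions/runs/<id>`.

```python
import re

VALID_SCOPES = ("only-failing", "all")


def parse_run_id(text):
    """Return the numeric run ID from a bare ID or a run URL."""
    text = text.strip()
    if re.fullmatch(r"[0-9]+", text):
        return text
    match = re.search(r"/actions/runs/([0-9]+)", text)
    if match is None:
        raise ValueError(f"no workflow run ID found in {text!r}")
    return match.group(1)


def require_scope(scope):
    """Reject a missing or unknown scope instead of silently defaulting."""
    if scope not in VALID_SCOPES:
        raise ValueError("scope must be 'only-failing' or 'all'; ask the user")
    return scope
```

Raising on a missing scope, rather than picking one, mirrors the "ask, don't default" rule.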

## Workflow

```
- [ ] Step 1: Set up env (token + venv with deps)
- [ ] Step 2: Reset prior golden-value edits
- [ ] Step 3: Download goldens (scope = only-failing | all)
- [ ] Step 4: Run relative-diff comparison + capture CSV
- [ ] Step 5: Produce summary blurb
```

### Step 1 — Environment

The download script needs `GITHUB_TOKEN`. If the user has the `gh` CLI authenticated, obtain the token with `gh auth token`. Do NOT persist the token in a long-lived shell or a shell profile, and never commit it.

```bash
# token: `export` keeps it for the rest of this shell session, so use a
# throwaway shell or run `unset GITHUB_TOKEN` when finished
export GITHUB_TOKEN="$(gh auth token)"

# python deps (the script imports click, gitlab, requests)
python3 -m venv /tmp/gv_venv
/tmp/gv_venv/bin/pip install --quiet click python-gitlab requests
```

Reuse `/tmp/gv_venv` if it already exists. The comparison script only depends on `click` (also in the venv).
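
One way to keep the token out of the parent shell entirely is to pass it only to the child process. The sketch below assumes the `gh` CLI is installed and authenticated. `child_env_with_token` and `run_with_token` are hypothetical helper names, not part of the repo.

```python
import os
import subprocess


def child_env_with_token(token, base_env=None):
    """Return a copy of the environment with GITHUB_TOKEN set.

    The caller's environment (os.environ by default) is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["GITHUB_TOKEN"] = token
    return env


def run_with_token(cmd):
    """Run cmd with a token from `gh auth token`, visible only to the child."""
    token = subprocess.run(
        ["gh", "auth", "token"], check=True, capture_output=True, text=True
    ).stdout.strip()
    return subprocess.run(cmd, env=child_env_with_token(token), check=True)
```

Because the token lives only in the child's environment, nothing needs to be unset afterwards.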

### Step 2 — Reset prior edits (only if user re-runs)

If the working tree already contains golden-value modifications from an earlier run that should be discarded before re-downloading, restore the tracked files and delete the untracked ones:

```bash
git checkout -- tests/functional_tests/test_cases/
git ls-files --others --exclude-standard tests/functional_tests/test_cases/ \
  | while IFS= read -r f; do rm -f "$f"; done
```

Skip this step when the user explicitly wants to layer a new download on top of an in-progress branch.

### Step 3 — Download

Build the command from the user-provided scope:

```bash
# scope = only-failing (typical for "fix broken tests"; still confirm with the user)
/tmp/gv_venv/bin/python tests/test_utils/python_scripts/download_golden_values.py \
  --source github --pipeline-id <WORKFLOW_RUN_ID> --only-failing

# scope = all (full refresh; omit the flag)
/tmp/gv_venv/bin/python tests/test_utils/python_scripts/download_golden_values.py \
  --source github --pipeline-id <WORKFLOW_RUN_ID>
```
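
The scope-to-flag mapping above can also be expressed as a small command builder. This is a sketch only: `build_download_cmd` is a hypothetical name, and the paths and flags are copied from the commands shown in this step.

```python
VENV_PYTHON = "/tmp/gv_venv/bin/python"
DOWNLOAD_SCRIPT = "tests/test_utils/python_scripts/download_golden_values.py"


def build_download_cmd(run_id, scope):
    """Build the argv list for the download script from run ID and scope."""
    if scope not in ("only-failing", "all"):
        raise ValueError("scope must be 'only-failing' or 'all'")
    cmd = [
        VENV_PYTHON,
        DOWNLOAD_SCRIPT,
        "--source", "github",
        "--pipeline-id", str(run_id),
    ]
    # 'all' is expressed by omitting the flag, as in the commands above.
    if scope == "only-failing":
        cmd.append("--only-failing")
    return cmd
```

Returning a list rather than a shell string lets the result be passed straight to `subprocess.run` without quoting concerns.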

When `--only-failing` is set, the GitHub path filters at `_fetch_and_filter_artifacts` on `matched_job["conclusion"] == "success"`, so only failing/cancelled jo