build-repo-context

Name: build-repo-context
Author: UKGovernmentBEIS/inspect_evals

$ npx mdskill add UKGovernmentBEIS/inspect_evals/build-repo-context

Extracts institutional knowledge from GitHub history into a shared context document.

Helps agents understand repo conventions and avoid common mistakes.
Integrates with GitHub CLI to fetch PRs, issues, and review comments.
Decides scope by checking existing document headers for date and PR ranges.
Outputs an updated markdown file containing distilled knowledge and patterns.

SKILL.md

.github/skills/build-repo-context

---
name: build-repo-context
description: Crawl repository PRs, issues, and review comments to distill institutional knowledge into a shared knowledge base. Run periodically by "context agents" to maintain agent_artefacts/repo_context/REPO_CONTEXT.md. Trigger only on specific request.
---

# Build Repo Context

Crawl GitHub history (PRs, issues, review comments) and distill institutional knowledge into `agent_artefacts/repo_context/REPO_CONTEXT.md`. This document helps worker agents understand repo conventions, common mistakes, and known tech debt before making changes.

## Workflow

### 1. Setup

1. Create `agent_artefacts/repo_context/` if it doesn't exist
2. Read existing `agent_artefacts/repo_context/REPO_CONTEXT.md` if present (will be updated, not replaced)

### 2. Identify What's New

Use the header of `REPO_CONTEXT.md` to determine what to process. The header contains the last-updated date and PR range (e.g., `PRs processed: #965-#1050`).

- **First run** (no `REPO_CONTEXT.md`): Fetch the most recent 50 merged PRs + all open issues
- **Incremental runs**: Fetch PRs merged since the last-updated date, skipping any whose number already falls within the PR range in the header, and fetch issues updated since the last-updated date
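Deciding between a first run and an incremental run could be sketched as below. Only the `PRs processed: #965-#1050` line format comes from the example above; the `Last updated: YYYY-MM-DD` line format and the names `Watermark`, `parse_watermark`, and `is_first_run` are assumptions made for illustration.

```python
import re
from typing import NamedTuple, Optional


class Watermark(NamedTuple):
    last_updated: Optional[str]  # ISO date string, e.g. "2025-01-31"
    highest_pr: Optional[int]    # upper end of the processed PR range


def parse_watermark(header_text: str) -> Watermark:
    """Extract the crawl watermark from the top of REPO_CONTEXT.md."""
    # Assumed (hypothetical) header line: "Last updated: YYYY-MM-DD".
    date_match = re.search(r"Last updated:\s*(\d{4}-\d{2}-\d{2})", header_text)
    # Header line as shown in the skill text: "PRs processed: #965-#1050".
    range_match = re.search(r"PRs processed:\s*#(\d+)\s*-\s*#(\d+)", header_text)
    return Watermark(
        last_updated=date_match.group(1) if date_match else None,
        highest_pr=int(range_match.group(2)) if range_match else None,
    )


def is_first_run(mark: Watermark) -> bool:
    # With no parsable watermark, fall back to the first-run scope.
    return mark.last_updated is None and mark.highest_pr is None
```

If the real header uses different labels, only the two regular expressions need to change.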

Use the `gh` CLI to list candidates:

```bash
# First run: recent merged PRs
gh pr list --state merged --limit 50 --json number,title,labels,additions,deletions,reviewDecision,mergedAt

# Incremental: PRs merged since last crawl
gh pr list --state merged --search "merged:>YYYY-MM-DD" --limit 50 --json number,title,labels,additions,deletions,reviewDecision,mergedAt

# Open issues
gh issue list --state open --limit 100 --json number,title,labels,createdAt,updatedAt
```
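For agents that drive the crawl from Python rather than a shell, the PR listing commands above could be wrapped roughly as follows. This is a minimal sketch, not part of the skill: the helper names `pr_list_args` and `list_merged_prs` are hypothetical, and `list_merged_prs` assumes `gh` is installed, authenticated, and run inside a checkout of the repository.

```python
import json
import subprocess
from typing import Optional

# JSON fields requested by the listing commands above.
PR_FIELDS = "number,title,labels,additions,deletions,reviewDecision,mergedAt"


def pr_list_args(since: Optional[str] = None, limit: int = 50) -> list[str]:
    """Build the `gh pr list` argument vector for a first or incremental run."""
    args = ["gh", "pr", "list", "--state", "merged", "--limit", str(limit)]
    if since is not None:
        # Same search qualifier as the incremental command above.
        args += ["--search", f"merged:>{since}"]
    return args + ["--json", PR_FIELDS]


def list_merged_prs(since: Optional[str] = None) -> list[dict]:
    """Run gh and parse its JSON output into a list of PR records."""
    result = subprocess.run(
        pr_list_args(since), check=True, capture_output=True, text=True
    )
    return json.loads(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems with the `merged:>` qualifier.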

### 3. Triage

Make a fast pass over PR titles and metadata. **Skip** these categories (they rarely contain design insights):

- Dependency bumps (titles matching `bump`, `update dependencies`, `renovate`, `dependabot`)
- Changelog-only updates (titles matching `changelog`, `scriv`)
- Bot-generated PRs with no review comments
- PRs with fewer than 5 lines changed and no review comments

**Prioritize** PRs that have:

- Review comments (especially multiple rounds — that's where design discussion lives)
- Changes touching shared utilities (`src/inspect_evals/utils/`, `CONTRIBUTING.md`, `BEST_PRACTICES.md`, `AGENTS.md`)

**Cap at 50 PRs per run** to keep execution time reasonable.
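The triage rules above might be expressed as a small filter like the one below. It is a sketch under stated assumptions: `review_comment_count` and `is_bot` are hypothetical fields that the caller would have to compute (the listing commands in this skill do not return them), title matching is a plain case-insensitive substring search, and the shared-utilities priority is left out for brevity.

```python
import re

# Title keywords from the skip list: dependency bumps and changelog-only updates.
SKIP_TITLE = re.compile(
    r"bump|update dependencies|renovate|dependabot|changelog|scriv",
    re.IGNORECASE,
)
MAX_PRS_PER_RUN = 50


def should_skip(pr: dict) -> bool:
    """Apply the skip rules to one PR record."""
    if SKIP_TITLE.search(pr["title"]):
        return True
    has_reviews = pr.get("review_comment_count", 0) > 0  # hypothetical field
    if pr.get("is_bot", False) and not has_reviews:  # hypothetical field
        return True
    lines_changed = pr.get("additions", 0) + pr.get("deletions", 0)
    return lines_changed < 5 and not has_reviews


def triage(prs: list[dict]) -> list[dict]:
    """Drop skippable PRs, put the most-reviewed first, and cap the result."""
    kept = [pr for pr in prs if not should_skip(pr)]
    kept.sort(key=lambda pr: pr.get("review_comment_count", 0), reverse=True)
    return kept[:MAX_PRS_PER_RUN]
```

Sorting by review comment count is only one way to approximate "multiple rounds of review"; an agent may prefer to judge priority by reading the metadata directly.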

### 4. Extract

For each selected PR, fetch:

```bash
# PR body and metadata
gh pr view <N> --json body,title,labels,files,reviewDecision,comments,reviews

# Review comments (inline code review feedback)
gh api repos/{owner}/{repo}/pulls/<N>/comments --paginate

# Issue comments (general discussion)
gh api repos/{owner}/{repo}/issues/<N>/comments --paginate
```

For open issues, fetch body and comments similarly.

**Link traversal**: If a comment references another PR/issue (e.g., "see #123" or "fixed in #456"), follow the reference recursively, up to 3 hops in total. To prevent loops, do not revisit a PR/issue that is already in the chain.
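One way to bound the traversal is a breadth-first walk that records hop distance and never expands an item twice. The sketch below assumes a hypothetical `fetch_text` callback that returns the body and comments of a PR or issue as one string. It also uses a single visited set for the whole crawl, which is slightly stricter than checking only the current chain.

```python
import re
from collections import deque
from typing import Callable

MAX_HOPS = 3
REFERENCE = re.compile(r"#(\d+)")  # matches "see #123", "fixed in #456", ...


def crawl_references(start: int, fetch_text: Callable[[int], str]) -> dict[int, int]:
    """Return {item number: hop distance} for everything reachable within MAX_HOPS."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == MAX_HOPS:
            continue  # hop budget spent; do not expand this item further
        for match in REFERENCE.finditer(fetch_text(current)):
            target = int(match.group(1))
            if target not in distance:  # already seen: skip to avoid loops
                distance[target] = distance[current] + 1
                queue.append(target)
    return distance
```

In practice `fetch_text` would wrap the `gh` calls shown above; injecting it as a parameter keeps the traversal logic testable without network access.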

### 5. Distill

This is the core intellectual work. For each PR/issue, extract **actionable insights** in these categories:

- **Design decisions**: What architectural choice was made and why? What alternatives were rejected?
- **Reviewer corrections**: What mistakes did reviewers catch? These reveal common pitfalls.
- **Established conventions**: What patterns were deliberately chosen that future contributors should follow?
- **Tech debt acknowledged**: What shortcuts were taken intentionally? What should NOT be "fixed" without discussion?
- **Common agent mistakes**: If review comments mention agent-generated code issues, capture the pattern.

**Quality requirements for each insight**:

- Must cite source PR/issue number (e.g., "Per PR #973...")
- Must be actionable ("Do X" / "Don't do Y"), not descriptive ("PR #123 added X")
- Must add nuance beyond what CONTRIBUTING.md and BEST_PRACTICES.md already state
- Must be relevant to future contributors, not just historically interesting
- Must be broadly applicable beyond a single issue or evaluation. If the context is excessively narrow, leave it out.
- Must reflect team convention, not a single maintainer's code style or proposal. If in doubt, leave it out.

**Skip**:

- Bot comments (dependabot, renovate, CI status checks)
- Feature announcements without design implications
- Trivial PRs (typo fixes, version bumps) unless they reveal a convention
- Duplicate insights already captured in REPO_CONTEXT.md

### 6. Merge Into REPO_CONTEXT.md

Integrate new insights into the existing document structure. **Do not just append** — place each insight in the appropriate section and deduplicate:

- If a new insight updates or supersedes an existing one, replace it
- If a section is getting too long, distill further (combine related insights)
- Update the header metadata (last updated date, PR watermark)
- Keep total document size between 500 and 1000 lines (distill aggressively if over)

**Each insight appears in exactly one section** — do not repeat the same rule across multiple sections with different framing (see step 7).
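The size target can be checked mechanically after each merge. This is a trivial sketch; the function name `size_status` and its return labels are illustrative, and a document below the lower bound (for example after a first run) is not necessarily a problem.

```python
def size_status(markdown: str, low: int = 500, high: int = 1000) -> str:
    """Classify a document against the 500-1000 line target."""
    line_count = len(markdown.splitlines())
    if line_count > high:
        return "over"   # distill further before committing
    if line_count < low:
        return "under"  # acceptable for young documents
    return "ok"
```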

### 7. Deduplicate & Consolidate

After merging, review the full document for **cross-section duplication**. This is critical — incremental runs naturally introduce duplication because the same convention surfaces in multiple PR reviews (e.g., "use `@pytest.mark.docker`" might appear as a reviewer correction, an established convention, AND a testing recipe).

**Process**:

1. For each insight, search the entire document for overlapping content. Look for insights that cover the same topic even if phrased differently.
2. Keep each insight in **exactly one location** — the most specific section that fits. Prefer this priority:
   - "Rules & Conventions" for mandatory practices ("always do X", "never do Y")
   - "Testing Recipes" for detailed how-to patterns (mock setup, test structure)
   - "Known Tech Debt" for acknowledged issues that should not be fixed without discussion
   - "CI/Tooling" for build/CI/tooling specifics
   - "Open Issues" for bugs and design direction
3. Remove the duplicate occurrences, keeping the most complete/specific version.
4. Combine related insights that are split across bullets into a single, richer bullet.

**Common duplication patterns to watch for**:

- The same pytest marker rule appearing in both "Rules" and "Testing Recipes"
- Reviewer corrections that duplicate established conventions (merge into the convention)
- Agent mistakes that are just the inverse of an established convention (keep only the convention)
- API usage patterns appearing in both rules and recipes (keep the rule brief, detail in recipes)
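A rough first pass at spotting cross-section duplication is to map key terms to the headings under which they appear. The sketch below is illustrative only: the function names are hypothetical, it treats any line starting with `#` as a heading (so `#` comments inside fenced code would be misread), and a term appearing in two sections is a prompt for review rather than proof of duplication.

```python
import re
from collections import defaultdict


def sections_by_term(markdown: str, terms: list[str]) -> dict[str, list[str]]:
    """Map each term to the headings of the sections that mention it."""
    hits: dict[str, list[str]] = defaultdict(list)
    heading = "(preamble)"
    for line in markdown.splitlines():
        match = re.match(r"#{1,6}\s+(.*)", line)
        if match:
            heading = match.group(1).strip()
            continue
        for term in terms:
            if term in line and heading not in hits[term]:
                hits[term].append(heading)
    return dict(hits)


def duplicated_terms(markdown: str, terms: list[str]) -> dict[str, list[str]]:
    """Keep only the terms that show up under more than one heading."""
    found = sections_by_term(markdown, terms)
    return {term: heads for term, heads in found.items() if len(heads) > 1}
```

Calling `duplicated_terms(text, ["@pytest.mark", "get_model", "sample ID"])` would list the sections to compare by hand; deciding which copy to keep still requires judgment.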

## Bounding Rules

| Rule | Limit |
| ---------------------------- | ------------------------------------------- |
| First run scope | Most recent 50 merged PRs + all open issues |
| Incremental run scope | New items since last crawl |
| Max PRs per run | 50 |
| Link traversal depth | 3 hops |
| Target REPO_CONTEXT.md size | 500-1000 lines |
| Max issues per run | 100 |

## Insight Quality Guidelines

These are critical — the value of REPO_CONTEXT.md depends on insight quality:

1. **Every insight must cite its source** PR or issue number. It is acceptable to cite multiple sources for the same insight.
2. **Insights must be actionable**: "Do X" / "Don't do Y", not "PR #123 added X"
3. **Don't duplicate existing docs**: Only add nuance that CONTRIBUTING.md and BEST_PRACTICES.md miss
4. **Skip noise**: Bot comments, feature announcements without design implications, trivial PRs
5. **Focus on**: Reviewer corrections, design trade-offs, rejected alternatives, acknowledged tech debt, common agent mistakes
6. **Be specific**: "Use `hf_dataset()` wrapper instead of raw `load_dataset()` for HuggingFace datasets (PR #842)" is better than "Use the right dataset loading function"
7. **Date-stamp volatile insights**: If an insight might become stale (e.g., "Currently X is broken"), include the date so agents can verify
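The citation requirement in guideline 1 lends itself to an automated check. The sketch below assumes each insight is written as a single-line Markdown bullet and that a citation is any `#<number>` token; both are assumptions about the document layout, and `uncited_insights` is a hypothetical helper name.

```python
import re

CITATION = re.compile(r"#\d+")      # e.g. "#842"
BULLET = re.compile(r"^\s*[-*]\s+")  # Markdown bullet at any indent


def uncited_insights(markdown: str) -> list[str]:
    """Return bullet lines that contain no PR/issue reference."""
    return [
        line.strip()
        for line in markdown.splitlines()
        if BULLET.match(line) and not CITATION.search(line)
    ]
```

An empty result means every bullet carries at least one reference; it does not confirm that the cited PR actually supports the insight.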

## Expected Output

After running this workflow:

```text
agent_artefacts/repo_context/
└── REPO_CONTEXT.md # Distilled institutional knowledge (committed)
```

## Verification Checklist

After each run, verify:

1. `REPO_CONTEXT.md` exists and has well-structured content
2. Insights cite source PR/issue numbers
3. Insights are actionable, not merely descriptive
4. **No duplicate insights across sections** — search for key terms (e.g., `sample ID`, `get_model`, `@pytest.mark`) and confirm each appears in exactly one place
5. Document stays under ~1000 lines
6. Header metadata (date, PR range) is updated
7. Incremental runs don't reprocess already-crawled PRs
