code-quality-fix-all

Name: code-quality-fix-all
Author: UKGovernmentBEIS/inspect_evals

$npx mdskill add UKGovernmentBEIS/inspect_evals/code-quality-fix-all

Fix code quality issues from reviews with validation and testing.

Resolves problems found in code quality review artifacts.
Depends on code-quality-review-all skill and agent_artefacts.
Selects targets via user questions and complexity filters.
Delivers corrected code with step-by-step validation checks.

SKILL.md

.github/skills/code-quality-fix-allView on GitHub ↗

---
name: code-quality-fix-all
description: Fix code quality issues identified in a code quality review stored in agent_artefacts/code_quality/<topic>/. Systematically addresses issues found by the code-quality-review-all skill for ANY code quality topic, with validation and testing at each step. Use when user asks to fix issues from a code quality review, or asks to fix issues from agent_artefacts/code_quality/<topic>.
---

# Code Quality Fix All

Fix code quality issues identified in a code quality review. This skill systematically addresses issues found by the `code-quality-review-all` skill for ANY code quality topic, with validation and testing at each step.

## Expected Arguments

When invoked, this skill expects the path to a code quality topic as an argument (e.g., `agent_artefacts/code_quality/private_api_imports`).

If not provided, the skill will ask the user for the topic path. Within the topic path, there are several files:

- README.md - contains description of the issue and examples of how to fix it
- results.json - contains list of all identified issues
- SUMMARY.md - contains summary of the identified issues

**Filters and options** are specified interactively after the skill starts by using the `AskUserQuestion` tool to present options unless specified otherwise in arguments.

- Which issue types to target:
  - all
  - specific types
  - Fix complexity level (easy only, medium and below, or all)
- Which evaluations to fix (all, specific ones, evaluations with small number of issues)
- Maximum number of issues to fix in this run

## Workflow

### Phase 1: Understanding the Topic and Planning

1. **Read topic documentation**
   - Read the README.md to understand:
     - What code quality issue this topic addresses
     - Why it matters (stability, maintainability, etc.)
     - How to detect the issue
     - How to fix the issue (fix patterns, examples)
   - Read results.json to get all identified issues
   - Identify which issues are in scope based on arguments

2. **Analyze and categorize issues**
   - Analyze fix complexity based on:
     - issue_description
     - suggested_fix from results.json
     - Fix examples in README.md
   - Classify as:
     - **Easy**: Single-line changes, clear fix pattern in README
     - **Medium**: Multi-line changes, well-documented fix approach
     - **Hard**: No clear fix pattern, requires research or copying code
   - Group issues by evaluation and issue type
   - Generate statistics for presenting to user

3. **Ask user for filtering preferences**
   - Use `AskUserQuestion` tool to ask:
     - Which evaluations to fix? (all / specific ones / most affected)
     - Which issue types to target? (all / specific types)
     - Fix complexity level? (easy only / easy+medium / all)
     - Max issues per run? (all / limit to specific number)
   - Apply filters based on user responses
   - Present filtered plan with:
     - Number of issues to fix
     - Breakdown by evaluation and issue type
     - Complexity distribution
     - Ask for final confirmation to proceed

4. **Validate understanding of fixes**
   - For each unique issue type in scope:
     - Check if README.md documents how to fix it
     - Look for "Good Examples" and "Bad Examples" sections
     - Check "suggested_fix" field in results.json
   - If fix approach is unclear for any issue type:
     - Research the correct approach
     - Update `<topic>/README.md` with findings
     - Ask user for guidance if still uncertain

### Phase 2: Pre-Fix Validation

For each issue to be fixed:

1. **Read and understand context**
   - Read the entire file containing the issue (not just the line)
   - Understand how the problematic code is used
   - Look for related issues in the same file
   - Check for patterns that might affect the fix (e.g., multiple occurrences)
   - Identify any cascading changes needed (related imports, type hints, etc.)

2. **Validate the suggested fix**
   - Review the "suggested_fix" from results.json
   - Check against fix patterns in README.md
   - Verify the fix won't break functionality
   - For complex fixes:
     - Check if dependencies/alternatives actually exist
     - Validate that replacement code follows same patterns
     - Consider edge cases

3. **Estimate change scope**
   - Count how many lines will change for this fix
   - Identify if cascading changes are needed
   - Determine if multiple files need updating
   - **If changes exceed 100 lines for a single issue:**
     - Alert user with:
       - Issue details
       - Why the change is large
       - What will change
     - Get explicit approval before proceeding

### Phase 3: Applying Fixes

1. **Apply fixes systematically**

- Create a new branch to apply fixes to, with a name like agent/<short_description_of_issue>
- Process one evaluation at a time
- Within each evaluation, group by issue type
- For each fix:
  - Use Edit tool to apply the change
  - Follow the suggested_fix guidance
  - Apply fix patterns from README.md
  - Handle related issues in same file together
  - Add comments if the fix requires it (e.g., copied code attribution)
- Track what was fixed

1. **Verify changes compile/parse**
   - After fixing each file, validate:
     - File is syntactically valid (Python can parse it)
     - No obvious import errors introduced
     - Code follows repository patterns
   - If validation fails:
     - Investigate the issue
     - Attempt to fix validation error
     - Rollback change if cannot be resolved

2. **Track progress**
   - Maintain list of:
     - Issues successfully fixed (file, line, issue type)
     - Issues that couldn't be fixed (with reasons)
     - Evaluations that have been modified
     - Files that were changed

### Phase 4: Testing and Validation

1. **Run linting**
   - Run repository's linter on modified files (ruff, flake8, mypy, etc.)
   - Check for:
     - Import errors
     - Type checking errors
     - Style violations introduced
   - Fix any linting issues that result from changes
   - If linting issues can't be fixed, document them

2. **Run unit tests**
   - Identify test files for each modified evaluation
   - Run unit tests for affected evaluations using pytest:

     **Basic test commands:**

     ```bash
     # Install relevant packages in the event of import failure
     uv sync --extra test

     # Run tests for a specific evaluation
     uv run pytest tests/<evaluation_name>/

     # Run a specific test file
     uv run pytest tests/test_file.py

     # Run a specific test
     uv run pytest tests/test_file.py::TestClass::test_method

     # Run slow tests (excluded by default)
     uv run pytest --runslow tests/

     # Skip dataset download tests
     uv run pytest -m 'not dataset_download' tests/

     # Run only slow tests
     uv run pytest -m slow tests/
     ```

     **Test markers to be aware of:**
     - `@pytest.mark.slow` - Tests taking >10 seconds
     - `@pytest.mark.dataset_download` - Tests that download datasets
     - `@pytest.mark.docker` - Tests using Docker
     - `@pytest.mark.huggingface` - HuggingFace-related tests

     - Focus on tests for the specific evaluation
     - Look for test failures or errors
   - **IMPORTANT**: Do NOT run full evaluations (they take too long) unless user explicitly requests it

3. **Handle test failures**
   - For each test failure:
     - Read test output carefully
     - Determine if failure is caused by the fix
     - Check if it's a pre-existing failure
   - If caused by fix:
     - Try to adjust the fix to make tests pass
     - If cannot be resolved, rollback the change
     - Document the issue for user review
   - If pre-existing:
     - Note it but don't block on it
     - Inform user

### Phase 5: Re-Review and Handle Remaining Issues

1. **Update results.json with fix status**
   - For each issue that was fixed, add `"fix_status"` field after `"suggested_fix"`:

     ```json
     {
      ...
       "suggested_fix": "...",
       "fix_status": "fixed - please review"
     }
     ```

   - For issues that couldn't be fixed, add explanation:

     ```json
     "fix_status": "not fixed - reason: ..."
     ```

   - **IMPORTANT**: Do NOT remove any entries from results.json - only add/update "fix_status"
   - The code-quality-review-all skill owns results.json and is responsible for removing entries

2. **Re-run code quality review**
   - **IMPORTANT**: Use Task tool to spawn subagent running code-quality-review-all skill
   - Pass the same topic path
   - This will update results.json with current state
   - Compare results before and after to identify:
     - Issues that are now resolved (no longer appear)
     - New issues that may have been introduced
     - Issues that still remain despite fix attempts

3. **Fix remaining issues if in scope**
   - For each new or remaining in-scope issue:
     - Investigate why previous fix didn't work
     - Attempt alternative fix approach
     - Update "fix_status" with attempt results
   - Repeat this process until no more in-scope issues can be fixed

4. **Update topic's README.md**
   - Add any knowledge that you have discovered that will be useful in detecting or fixing topic-related issues in the future
   - Do not remove examples of bad code or patterns that were fixed - they will be useful in future reviews and fixes of future evaluations.

5. **Update SUMMARY.md**
   - Add a "Recent Fixes" section with:
     - Date of fix run
     - Number of issues fixed
     - Which evaluations were updated
   - Keep historical data (don't remove past information)
   - Update recommendations to reflect remaining work

6. **Run markdown linters**
   - Use `uv run pre-commit run markdownlint-fix` to fix markdown linting issues

### Phase 6: Create PR Description and Present Results

1. **Create/Update PR description (cumulative)**
   - Read existing `PR_DESCRIPTION.md` if it exists (from previous runs)
   - **Cumulative tracking**: PR description represents ALL changes from branch base, not just this run
   - If PR_DESCRIPTION.md exists:
     - Parse existing content to extract previous runs' data
     - Append information from this run
     - Update cumulative statistics
   - If PR_DESCRIPTION.md doesn't exist (first run):
     - Create new file
   - Format for GitHub/GitLab pull request with:
     - **Summary**: Brief overview of the code quality topic and total fixes (2-3 sentences)
     - **Overall Changes** (cumulative from all runs):
       - Total issues fixed across all runs by type
       - Total evaluations affected
       - Total files modified
     - **Fix Sessions**: List each run session with:
       - Date/time of run
       - Issues fixed in that session
       - Complexity level targeted (easy/medium/all)
     - **Fixed Issues** (cumulative): Table or list with all file paths and issue types from all runs
     - **Testing** (from latest run):
       - Which tests were run
       - Pass/fail status
       - Any test issues encountered
     - **Remaining Issues** (current state):
       - Count of issues still open
       - Brief note on what remains
     - **Review Notes** (cumulative):
       - Any complications or special considerations from any run
       - Areas that need extra attention during review
   - Use proper markdown formatting for PR readability
   - **IMPORTANT**: Do NOT commit PR_DESCRIPTION.md - it's only for creating the PR
   - Example structure:

     ```markdown
     ## Summary
     Fix private API imports code quality issues across evaluations.

     ## Overall Changes
     - Total issues fixed: 25
     - Evaluations affected: 8

     ## Fix Sessions

     ### Session 1: 2026-01-18 10:30 (Easy issues)
     - Fixed 10 easy issues
     - Targeted: Easy complexity, All evaluations

     ### Session 2: 2026-01-18 14:15 (Medium issues)
     - Fixed 15 medium issues
     - Targeted: Medium complexity, Specific evaluations

     ## Fixed Issues
     [Table of all fixed issues from all sessions]

     ## Testing
     [Latest test results]

     ## Remaining Issues
     6 issues remain (4 hard, 2 require investigation)

     ## Review Notes
     - Session 1: All tests passed
     - Session 2: One test required adjustment in fortress/scorer.py
     ```

2. **Present results to user**
   - Show high-level summary for **this run**:
     - X issues fixed in this session
     - Y issues remain
     - Z tests passed
   - Show **cumulative progress** from PR_DESCRIPTION.md:
     - Total issues fixed across all runs
     - Number of fix sessions completed
   - Show before/after statistics from SUMMARY.md
   - List modified files from this run
   - Display content of PR_DESCRIPTION.md for user review
   - **Do NOT automatically commit** - let user review changes

3. **Offer next steps**
   - **Create commit and PR**: Offer to:
     - Commit all changes (source files, results.json, SUMMARY.md)
     - Create pull request with description from PR_DESCRIPTION.md
     - Note: PR_DESCRIPTION.md itself is NOT committed (it's just for PR description)
     - The PR description includes **cumulative changes from all fix sessions** on this branch
   - **Run more fixes**: If issues remain, suggest running skill again with different filters
     - Running again will **append** to PR_DESCRIPTION.md, creating cumulative tracking
     - This allows iterative fixing: easy issues first, then medium, then hard
   - **Manual review needed**: List any issues that require manual attention

## Important Guidelines

### Safety First

- **Never batch all fixes blindly** - Validate each fix type before applying en masse
- **Always read before editing** - Understand context before changing code
- **Verify fixes don't break functionality** - Run tests incrementally
- **Be conservative** - Skip fixes you're uncertain about rather than risk breaking code
- **Get approval for large changes** - Alert user when fixes exceed 100 lines
- **Have rollback strategy** - Be able to revert if fixes cause problems

### Context is Critical

- **Understand the quality issue** - Read README.md thoroughly
- **Understand why code was written that way** - There might be good reasons
- **Look for patterns** - Similar issues often need similar fixes
- **Check related code** - Fixes might require updating nearby code
- **Read existing comments** - Developers might have documented why they used certain patterns

### Validation at Every Step

- **Verify fix patterns from README** - Don't guess how to fix
- **Check suggested_fix in results.json** - Use provided guidance
- **Validate changes compile** - Ensure code parses after changes
- **Run linters** - Catch style and import issues
- **Run tests** - Detect regressions immediately
- **Re-run review** - Verify fixes actually resolve issues

### Communication

- **Show plan before executing** - Let user see what will be fixed
- **Alert for large changes** - Get approval for fixes >10 lines
- **Report uncertainties** - Flag issues where fix approach is unclear
- **Show progress** - Keep user informed during fixes
- **Explain failures** - Document why certain issues couldn't be fixed
- **Provide detailed reports** - Create comprehensive fix reports

### What NOT to Do

- **Don't assume you know how to fix** - Always consult README.md and results.json
- **Don't remove entries from results.json** - Only add/update "fix_status" field
- **Don't replace PR_DESCRIPTION.md** - Append to it to maintain cumulative history across runs
- **Don't run full evaluations** - Only run unit tests (evaluations are slow)
- **Don't commit automatically** - Let user review changes first
- **Don't commit PR_DESCRIPTION.md** - It's only for creating the PR
- **Don't fix issues you can't validate** - Skip rather than risk breaking
- **Don't ignore test failures** - Investigate or rollback
- **Don't make unrelated changes** - Only fix the specific quality issues
- **Don't assume all issues of same type are identical** - Context matters

More from UKGovernmentBEIS/inspect_evals

Skill	Description
build-repo-context	Crawl repository PRs, issues, and review comments to distill institutional knowledge into a shared knowledge base. Run periodically by "context agents" to maintain agent_artefacts/repo_context/REPO_CONTEXT.md. Trigger only on specific request.
check-trajectories-workflow	Use Inspect Scout to analyze agent trajectories from evaluation log files. Runs default and custom scanners to detect external failures, formatting issues, reward hacking, and ethical refusals. Use when user asks to check/analyze agent trajectories. Trigger when the user asks you to run the "Check Agent Trajectories" workflow.
ci-maintenance-workflow	CI and GitHub Actions maintenance workflows — fix a failing test from a CI URL, fix a failing smoke test, add @pytest.mark.slow markers to slow tests, or review a PR against agent-checkable standards. Use when user asks to fix a failing test, fix a smoke test, mark slow tests, or review a PR. Trigger when the user asks you to run the "Write a PR For A Failing Test", "Fix A Failing Smoke Test", "Mark Slow Tests", or "Review PR According to Agent-Checkable Standards" workflow.
code-quality-review-all	Review all evaluations in the repository against a single code quality standard. Checks ALL evals against ONE standard for periodic quality reviews. Use when user asks to review/audit/check all evaluations for a specific topic or standard. Do NOT use for reviewing a single eval (use eval-quality-workflow instead) or for test coverage (use ensure-test-coverage instead).
create-eval	Redirect to the inspect-evals-template for creating new evaluations. New evals are no longer created in this repository — they live in standalone repos. Use when user asks to create/implement/build a new evaluation.
ensure-test-coverage	Ensure test coverage for a single evaluation - both reviewing existing tests and creating missing ones. Analyzes testable components, checks tests against repository conventions, reports coverage gaps, and creates or improves tests. Use when user asks to check/review/create/add/ensure tests for an eval. Use whenever you are asked to review an evaluation that contains tests, or whenever you need to write a suite of tests. Do NOT use for fixing a specific failing CI test (use ci-maintenance-workflow instead).
eval-quality-workflow	Fix or review a single evaluation against all EVALUATION_CHECKLIST.md standards. Use "fix" mode to refactor an eval into compliance, or "review" mode to assess compliance without making changes. Use when user asks to fix, review, or check an evaluation's quality. Trigger when the user asks you to run the "Fix An Evaluation" or "Review An Evaluation" workflow. Do NOT use for reviewing ALL evals against a single code quality standard (use code-quality-review-all instead).
eval-report-workflow	Create an evaluation report for a README by selecting models, estimating costs, running evaluations, and formatting results tables. Use when user asks to make/create/generate an evaluation report. Trigger when the user asks you to run the "Make An Evaluation Report" workflow.
eval-validity-review	Review a single evaluation's validity — whether its claims hold up, whether its name is accurate, whether samples can be both succeeded and failed at, and whether scoring measures ground truth. Use when user asks to check validity of an eval, or as part of the Master Checklist workflow. Do NOT use for code quality or test coverage (use eval-quality-workflow or ensure-test-coverage instead).
generate-asset-actions	Generate asset-actions.yaml from ASSETS.yaml by classifying assets into priority tiers. Use when the user asks to regenerate, update, or refresh the asset actions.