run-judges
$
npx mdskill add closedloop-ai/claude-plugins/run-judgesRun parallel judges to validate plans, code, or PRDs.
- Evaluates implementation plans, code artifacts, or PRDs automatically.
- Depends on specialized judge agents configured by artifact type.
- Selects evaluation batches based on plan, code, or PRD categories.
- Outputs structured JSON files with aggregated CaseScore results.
SKILL.md
.github/skills/run-judgesView on GitHub ↗
---
name: run-judges
description: Orchestrate parallel judge agent execution, aggregate CaseScore results, write plan-judges.json, code-judges.json, or prd-judges.json, and validate output. Supports evaluating implementation plans (16 judges), code artifacts (11 judges), or PRD artifacts (4 judges) via --artifact-type parameter.
context: fork
---
# Run Judges Skill
## Purpose
Execute specialized judge agents in parallel to evaluate implementation plan quality (16 judges), code quality (11 judges), or PRD quality (4 judges). Aggregates results into `$CLOSEDLOOP_WORKDIR/plan-judges.json` (plan), `$CLOSEDLOOP_WORKDIR/code-judges.json` (code), or `$CLOSEDLOOP_WORKDIR/prd-judges.json` (prd) with validated output format.
## Parameters
**--workdir**: Path to the working directory containing judge artifacts (optional)
- Resolved in order: `--workdir` argument → `$CLOSEDLOOP_WORKDIR` environment variable → `.closedloop-ai/judges` (default, relative to current working directory)
- The directory is created automatically if it does not exist
- All output files (`plan-judges.json`, `code-judges.json`, `prd-judges.json`, `judge-input.json`, `perf.jsonl`, etc.) are written to this resolved directory
**--artifact-type**: Artifact category to evaluate (plan | code | prd), default: plan
- **plan** (default): Evaluate implementation plan with 16 judges, 4 batches, output to plan-judges.json
- **code**: Evaluate implemented code with 11 judges, 3 batches, output to code-judges.json
- **prd**: Evaluate PRD document with 4 judges, single parallel batch, output to prd-judges.json
## Judge Input Contract (`judge-input.json`)
The judge input contract is maintained in:
`skills/run-judges/references/judge-input-contract.md` (resolve to an absolute path at runtime via `Glob`)
This keeps orchestration flow readable while preserving a single source of truth for contract fields and semantics.
## Task Context
You are orchestrating quality evaluation for a ClosedLoop artifact (implementation plan, code, or PRD). Your responsibilities:
**For plan artifacts (default):**
1. Launch context-manager-for-judges agent to prepare compressed plan context
2. Build `judge-input.json` with plan task/context mapping
3. Launch all 16 judge agents in parallel batches
4. Aggregate their CaseScore outputs into a valid EvaluationReport
5. Write the report to `$CLOSEDLOOP_WORKDIR/plan-judges.json`
6. Validate output structure and completeness
**For code artifacts (--artifact-type code):**
1. Launch context-manager-for-judges agent to prepare compressed context
2. Build `judge-input.json` with code task/context mapping
3. Launch 11 judge agents in parallel batches
4. Aggregate their CaseScore outputs into a valid EvaluationReport
5. Write the report to `$CLOSEDLOOP_WORKDIR/code-judges.json`
6. Validate output structure and completeness
**For PRD artifacts (--artifact-type prd):**
1. Check `$CLOSEDLOOP_WORKDIR/prd.md` exists (graceful exit if missing)
2. Build `judge-input.json` with evaluation_type: "prd" and primary_artifact pointing to prd.md
3. Launch all 4 PRD judges in a single parallel batch
4. Aggregate all 4 CaseScores into a valid EvaluationReport
5. Write the report to `$CLOSEDLOOP_WORKDIR/prd-judges.json`
6. Validate output structure and completeness
**Success criteria:**
- All judges executed (or error CaseScores generated for failures)
- Valid JSON written to appropriate output file
- Validation script passes with zero errors
---
## Threshold Overrides
The run-judges skill supports per-artifact-type threshold customization via JSON configuration files. This allows you to adjust evaluation strictness for different artifact types (e.g., applying a lower threshold for test-judge when evaluating code vs plan).
### Configuration Schema
Threshold overrides are defined in a JSON file with the following structure:
```json
{
"overrides": {
"artifact_type:judge_name": <threshold_float>
}
}
```
Where:
- **Key format**: `"artifact_type:judge_name"` (e.g., `"code:test-judge"`, `"plan:technical-accuracy-judge"`)
- **Value**: Threshold as a float in range `[0.0, 1.0]`
**Example configuration:**
```json
{
"overrides": {
"code:test-judge": 0.75,
"plan:technical-accuracy-judge": 0.85
}
}
```
### Loading Precedence
The skill checks the following locations in order, using the first valid configuration found:
1. **Run-specific overrides** (highest precedence):
- Path: `$CLOSEDLOOP_WORKDIR/.closedloop-ai/settings/threshold-overrides.json`
- Use case: Override thresholds for a specific ClosedLoop run
2. **Repo-level defaults** (fallback):
- Path: `<project-root>/.closedloop-ai/settings/threshold-overrides.json`
- Use case: Set project-wide threshold defaults
3. **Hardcoded defaults** (graceful degradation):
- If no configuration file exists at any location, use built-in defaults
- No error is raised for missing configuration files
### Default Overrides
The following default overrides apply when evaluating code artifacts:
| Judge | Code Threshold | Plan Threshold | Rationale |
|-------|----------------|----------------|-----------|
| `test-judge` | 0.75 | 0.8 | Code may have tests written separately from implementation, lower threshold accounts for incremental test development |
All other judges use the same threshold (typically 0.8) across artifact types.
### Validation and Error Handling
When loading threshold overrides, the skill applies the following validation rules:
**Schema Validation:**
- Configuration must contain an `"overrides"` key
- Each key must match the pattern `artifact_type:judge_name`
- Each value must be a float in range `[0.0, 1.0]`
- Keys must reference valid artifact types (`plan`, `code`, `prd`) and judge names
**Error Behavior:**
- **Malformed JSON**: Log warning and continue with hardcoded defaults
```
Warning: Invalid threshold-overrides.json, skipping overrides: {error}
```
- **Invalid schema**: Log warning and continue with hardcoded defaults
- **File not found**: Silently use defaults (no warning logged)
**Error recovery ensures the skill always completes judge execution**, even if threshold configuration is incorrect.
### Integration with Judge Execution
When executing judges:
1. **Before launching judge batches**: Load threshold overrides from the precedence chain
2. **Merge with defaults**: Loaded overrides take precedence over hardcoded defaults
3. **Apply per-judge**: Each judge receives its artifact-type-specific threshold via the evaluation context
4. **CaseScore validation**: Thresholds are used to determine `final_status` (pass/fail) based on metric scores
**When artifact type is code**:
- Load threshold overrides before executing judge batches
- Apply code-specific thresholds to each judge's evaluation criteria
- Merge loaded overrides with defaults (loaded values take precedence)
---
## Performance Instrumentation (Mandatory)
You MUST emit a `pipeline_step` event to `$CLOSEDLOOP_WORKDIR/perf.jsonl` at the **end** of each phase below. This keeps perf telemetry in the canonical schema and adds nested metadata for judge/sub-agent work.
**Context:** `CLOSEDLOOP_WORKDIR`, `CLOSEDLOOP_RUN_ID`, and `CLOSEDLOOP_ITERATION` are set by the run-loop. `CLOSEDLOOP_PARENT_STEP` and `CLOSEDLOOP_PARENT_STEP_NAME` are set as env vars on the `claude` invocation by run-loop; they are inherited by all Bash tool calls — no sourcing needed.
Use `sub_step` as numeric phase order and optional `sub_step_name` to capture the judge/sub-agent name when applicable (for batch-level phases where many judges run, use the batch label).
**Sub-step numbering:**
| Artifact | sub_step | sub_step_name |
|----------|----------|-----------------|
| plan | 0 | context_manager |
| plan | 1–4 | batch_1 … batch_4 |
| plan | 5 | aggregate |
| plan | 6 | validate |
| code | 0 | context_manager |
| code | 1–3 | batch_1 … batch_3 |
| code | 4 | aggregate |
| code | 5 | validate |
| prd | 0 | context_prep (skipped — prd mode does not use context-manager-for-judges) |
| prd | 1 | prd_judges |
| prd | 2 | aggregate |
| prd | 3 | validate |
**Start of phase (run Bash once at the beginning of each phase):** Set the two sub-step variables at the top for the current phase, then run the block. It writes start time to a temp file so the end-of-phase Bash can compute duration. `CLOSEDLOOP_PARENT_STEP` and `CLOSEDLOOP_PARENT_STEP_NAME` are already in the environment (set by run-loop on the `claude` invocation).
```bash
# Set these two values for the current phase:
SUB_STEP_NUM=0
SUB_STEP_LABEL="context_manager" # context_manager | batch_1 … | aggregate | validate
mkdir -p "$CLOSEDLOOP_WORKDIR/.closedloop-ai"
{
echo "SUB_STEP=${SUB_STEP_NUM}"
echo "SUB_STEP_NAME=${SUB_STEP_LABEL}"
echo "PARENT_STEP=${CLOSEDLOOP_PARENT_STEP:-0}"
echo "PARENT_STEP_NAME=${CLOSEDLOOP_PARENT_STEP_NAME:-unknown}"
echo "STARTED_AT=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
echo "START_EPOCH=$(date +%s)"
} > "$CLOSEDLOOP_WORKDIR/.closedloop-ai/perf-substep-start.env"
```
**End of phase (run Bash once at the end of each phase, after the phase work is done):** Read start time, compute duration, append one line to `perf.jsonl`, then remove the temp file.
```bash
source "$CLOSEDLOOP_WORKDIR/.closedloop-ai/perf-substep-start.env"
END_EPOCH=$(date +%s)
ENDED_AT=$(date -u +%Y-%m-%dT%H:%M:%SZ)
DURATION=$((END_EPOCH - START_EPOCH))
jq -n -c \
--arg event "pipeline_step" \
--arg run_id "${CLOSEDLOOP_RUN_ID:-unknown}" \
--argjson iteration "${CLOSEDLOOP_ITERATION:-0}" \
--argjson step "$PARENT_STEP" \
--arg step_name "$PARENT_STEP_NAME" \
--argjson sub_step "$SUB_STEP" \
--arg sub_step_name "$SUB_STEP_NAME" \
--arg started_at "$STARTED_AT" \
--arg ended_at "$ENDED_AT" \
--argjson duration_s "$DURATION" \
--argjson exit_code 0 \
--argjson skipped false \
'{event:$event,run_id:$run_id,iteration:$iteration,step:$step,step_name:$step_name,sub_step:$sub_step,sub_step_name:$sub_step_name,started_at:$started_at,ended_at:$ended_at,duration_s:$duration_s,exit_code:$exit_code,skipped:$skipped}' >> "$CLOSEDLOOP_WORKDIR/perf.jsonl"
rm -f "$CLOSEDLOOP_WORKDIR/.closedloop-ai/perf-substep-start.env"
```
**Order of operations per phase:** Run the "start of phase" Bash first (set `SUB_STEP_NUM` and `SUB_STEP_LABEL` at the top, then run the block), then perform the phase work, then run the "end of phase" Bash.
---
## Execution Workflow
### Working Directory Resolution
**Before any other step**, resolve the working directory and export it as `CLOSEDLOOP_WORKDIR`:
```bash
# Resolve working directory (precedence: --workdir arg > env var > default)
if [ -n "$ARG_WORKDIR" ]; then
WORKDIR="$ARG_WORKDIR"
elif [ -n "$CLOSEDLOOP_WORKDIR" ]; then
WORKDIR="$CLOSEDLOOP_WORKDIR"
else
WORKDIR="$(pwd)/.closedloop-ai/judges"
fi
mkdir -p "$WORKDIR"
export CLOSEDLOOP_WORKDIR="$WORKDIR"
```
Where `$ARG_WORKDIR` is the value passed via `--workdir` in the invocation prompt. All subsequent references to `$CLOSEDLOOP_WORKDIR` use this resolved value.
---
### Agents Snapshot (Pre-Step)
**Before any judge execution**, ensure a snapshot of judge agent definitions exists in `$CLOSEDLOOP_WORKDIR/agents-snapshot/`. This preserves the exact agent versions used for each evaluation run.
**Action:** Run the snapshot script via Bash:
```bash
bash "${CLAUDE_PLUGIN_ROOT}/skills/run-judges/scripts/ensure_agents_snapshot.sh" "$CLOSEDLOOP_WORKDIR"
```
The script is idempotent — it skips if `manifest.json` already exists.
**Error handling:** If the script fails or is not found, log a warning and continue — snapshot failure must not block judge execution.
---
### Step 0: Mandatory Contract Pre-Read
Before any prerequisite checks or judge launches:
1. Resolve the contract file path using `Glob` with:
- `**/skills/run-judges/references/judge-input-contract.md`
2. Read the resolved `judge-input-contract.md` file in full.
3. Apply the contract requirements when constructing `$CLOSEDLOOP_WORKDIR/judge-input.json`.
4. If the file is missing, ambiguous (multiple matches), or unreadable, fail fast with a clear error (do not proceed with judge execution).
### Prerequisites Check
**Performance:** At the start of this phase run the "start of phase" Bash with `SUB_STEP_NUM=0` and `SUB_STEP_LABEL=context_manager` for both plan and code modes. At the end of the phase run the "end of phase" Bash.
**Before starting, verify required inputs exist:**
**For plan artifacts (default):**
```bash
# Validate input files exist
if [ ! -f "$CLOSEDLOOP_WORKDIR/prd.md" ]; then
echo "WARNING: $CLOSEDLOOP_WORKDIR/prd.md not found. Skipping judges."
exit 0 # Graceful skip - do not fail workflow
fi
if [ ! -f "$CLOSEDLOOP_WORKDIR/plan.json" ]; then
echo "WARNING: $CLOSEDLOOP_WORKDIR/plan.json not found. Skipping judges."
exit 0
fi
```
**Investigation log resolution (plan mode):**
After validating `prd.md` and `plan.json`, resolve supporting context for plan judges:
1. **Use existing file first**
- If `$CLOSEDLOOP_WORKDIR/investigation-log.md` exists, use it as-is.
2. **Check `@code:pre-explorer` availability before invoking**
- Perform an explicit capability probe for `@code:pre-explorer` in the active Claude/plugin environment.
- Treat "unknown agent", "agent not found", or plugin resolution errors as **pre-explorer unavailable**.
- Recommended probe pattern:
- Attempt a minimal `Task()` call targeting `@code:pre-explorer`.
- If the platform rejects the agent type before execution, classify as unavailable and continue to internal fallback.
3. **If available, invoke pre-explorer**
- Launch `@code:pre-explorer` with `WORKDIR=$CLOSEDLOOP_WORKDIR` to generate missing pre-exploration artifacts.
- Re-check for `$CLOSEDLOOP_WORKDIR/investigation-log.md` after completion.
4. **If unavailable or invocation failed, run internal fallback**
- Generate `investigation-log.md` with a lightweight local-only investigation.
- Keep it fast and deterministic (no external web research).
- Internal fallback should:
- Read `prd.md` and extract top entities/actions as search seeds.
- Run targeted `Glob`/`Grep` against the local repository for likely implementation files.
- Record top relevant files and short rationale under `Files Discovered` / `Key Findings`.
- Add requirement-to-code evidence links under `Requirements Mapping`.
- Use the canonical sections:
- `## Search Strategy`
- `## Files Discovered`
- `## Key Findings`
- `## Requirements Mapping`
- `## Uncertainties`
5. **Never block plan context preparation on investigation context**
- If log generation still fails, emit a warning and continue.
6. **Prepare plan-context.json via context-manager-for-judges**
- Launch `@judges:context-manager-for-judges` with `artifact_type=plan`.
- Verify `$CLOSEDLOOP_WORKDIR/plan-context.json` exists.
- If missing after invocation, log warning and activate **compatibility mode** for this run:
- Compatibility mode allows one emergency fallback to raw `plan.json` + `prd.md`.
- Use compatibility mode only when context generation fails.
7. **Plan-mode source-of-truth policy**
- Normal mode: `plan-context.json` is primary and required.
- Compatibility mode: `plan.json` + `prd.md` may be used for this run only.
8. **Build plan-mode `judge-input.json`**
- Set `evaluation_type` = `plan`.
- Set `task` to plan quality evaluation objective (16-plan-judge workflow).
- Set `primary_artifact` to `plan-context.json` in normal mode.
- In compatibility mode, set primary to `plan.json` and include `prd.md` as supporting.
- Include `investigation-log.md` as supporting artifact when available.
- Set `source_of_truth` ordering from primary to secondary artifacts.
**For code artifacts (--artifact-type code):**
```bash
# Resolve investigation context for code judges (best effort)
if [ ! -f "$CLOSEDLOOP_WORKDIR/investigation-log.md" ]; then
echo "INFO: investigation-log.md missing. Attempting best-effort generation via @code:pre-explorer..."
# Launch @code:pre-explorer with WORKDIR=$CLOSEDLOOP_WORKDIR
# If unavailable/fails, continue with warning (non-blocking for code judges)
fi
# Launch context-manager-for-judges agent to prepare compressed context
# This agent reads code artifacts (git diff, changed-files.json, etc.)
# and produces code-context.json with token-budgeted compression
# investigation-log.md is optional secondary context for code judging
if [ ! -f "$CLOSEDLOOP_WORKDIR/investigation-log.md" ]; then
echo "WARNING: investigation-log.md unavailable. Continuing code judges with code-context.json only."
fi
# Verify code-context.json exists after context manager completes
if [ ! -f "$CLOSEDLOOP_WORKDIR/code-context.json" ]; then
echo "ERROR: Context preparation failed - code-context.json not found"
# Abort with error CaseScore for all judges
# Generate error report with final_status=3, justification="Context preparation failed"
exit 1
fi
# Build code-mode judge-input.json
# - evaluation_type: "code"
# - task: code quality evaluation objective (11-code-judge workflow)
# - primary_artifact: code-context.json
# - supporting_artifacts: investigation-log.md (optional), plus any other run artifacts
# - source_of_truth: ["code_context", ...]
```
**For PRD artifacts (--artifact-type prd):**
PRD mode does NOT use context-manager-for-judges. Context preparation is lightweight: verify the PRD document exists, then build judge-input.json directly from it.
```bash
# PRD mode context prep: check prd.md exists
if [ ! -f "$CLOSEDLOOP_WORKDIR/prd.md" ]; then
echo "WARNING: $CLOSEDLOOP_WORKDIR/prd.md not found. Skipping PRD judges."
exit 0 # Graceful exit — do not fail parent workflow
fi
# Build prd-mode judge-input.json
# - evaluation_type: "prd"
# - task: PRD quality evaluation objective (prd-auditor + 3 critics)
# - primary_artifact: $CLOSEDLOOP_WORKDIR/prd.md
# - supporting_artifacts: [] (none required)
# - source_of_truth: ["prd"]
```
**PRD context prep notes:**
- Missing `prd.md` results in a WARNING and graceful exit (code 0), not an error
- No context manager is launched; `judge-input.json` is built directly with `primary_artifact` pointing to `$CLOSEDLOOP_WORKDIR/prd.md`
- Performance: emit sub_step=0 (context_prep, skipped=true) perf event immediately, then proceed to sub_step=1 (prd_judges)
**If required files are missing:**
- Plan mode: Exit gracefully with code 0 (do not fail parent workflow)
- Code mode: Exit with error if context preparation fails
- PRD mode: Exit gracefully with code 0 if prd.md is not found
## Artifact Type Configuration
The run-judges skill supports three artifact types with different judge configurations:
### Plan Artifacts (Default)
- **Judges**: 16 total
- **Batches**: 4 sequential batches (max 4 concurrent per batch)
- **Output**: `plan-judges.json`
- **Report ID**: `{RUN_ID}-plan-judges`
- **Validation**: `--category plan` (16 judges expected)
### Code Artifacts (--artifact-type code)
- **Judges**: 11 total (excludes goal-alignment-judge, verbosity-judge)
- **Batches**: 3 sequential batches (max 4 concurrent per batch)
- **Output**: `code-judges.json`
- **Report ID**: `{RUN_ID}-code-judges`
- **Validation**: `--category code` (11 judges expected)
**Code Judge Batches:**
**Batch 1: Core Principles (4 judges)**
- `judges:dry-judge`
- `judges:ssot-judge`
- `judges:kiss-judge`
- `judges:code-organization-judge`
**Batch 2: Best Practices + SOLID Principles (4 judges)**
- `judges:custom-best-practices-judge`
- `judges:readability-judge`
- `judges:solid-isp-dip-judge`
- `judges:solid-liskov-substitution-judge`
**Batch 3: Technical Quality + Testing (3 judges)**
- `judges:solid-open-closed-judge`
- `judges:technical-accuracy-judge`
- `judges:test-judge`
### PRD Artifacts (--artifact-type prd)
- **Judges**: 4 total (parallel batch)
- **Execution**: single parallel batch
- **Output**: `prd-judges.json`
- **Report ID**: `{RUN_ID}-prd-judges`
- **Validation**: `--category prd` (4 judges expected)
- **Canonical input**: `$CLOSEDLOOP_WORKDIR/prd.md`
**PRD Execution:**
**Batch 1: All PRD Judges (sub_step=1)**
- `judges:prd-auditor` — structural completeness audit of the PRD
- `judges:prd-dependency-judge` — evaluates dependency clarity and completeness
- `judges:prd-testability-judge` — evaluates requirement testability
- `judges:prd-scope-judge` — evaluates scope definition and boundary clarity
---
### Step 1: Launch Judge Agents in Parallel
**Performance:** For each batch/phase, run "start of phase" Bash before launching the batch and "end of phase" Bash after the batch completes. Plan: batch_1=sub_step 1, batch_2=sub_step 2, batch_3=sub_step 3, batch_4=sub_step 4. Code: batch_1=sub_step 1, batch_2=sub_step 2, batch_3=sub_step 3. PRD: prd_judges=sub_step 1.
**Constraint:** The Task tool supports maximum 4 concurrent agents per batch.
**Action:** Launch judges in sequential batches based on artifact type.
<judge_batches>
### Plan Artifact Judge Batches (16 judges, 4 batches)
**Batch 1: Core Principles (DRY/SSOT/KISS + Organization)**
| Agent Type | Evaluates |
|------------|-----------|
| `judges:dry-judge` | Don't Repeat Yourself violations |
| `judges:ssot-judge` | Single Source of Truth violations |
| `judges:kiss-judge` | Keep It Simple violations |
| `judges:code-organization-judge` | File and folder structure organization |
**Batch 2: Best Practices + Response Quality**
| Agent Type | Evaluates |
|------------|-----------|
| `judges:custom-best-practices-judge` | Adherence to custom best practices documents |
| `judges:goal-alignment-judge` | Alignment with stated health goals |
| `judges:readability-judge` | Plan readability, clarity, structure, template adherence |
| `judges:verbosity-judge` | Verbosity calibration to problem complexity |
**Batch 3: SOLID Principles**
| Agent Type | Evaluates |
|------------|-----------|
| `judges:solid-isp-dip-judge` | Interface Segregation & Dependency Inversion Principles |
| `judges:solid-liskov-substitution-judge` | Liskov Substitution Principle adherence |
| `judges:solid-open-closed-judge` | Open/Closed Principle adherence |
| `judges:technical-accuracy-judge` | Technical accuracy (API usage, algorithms) |
**Batch 4: Plan Grounding + Testing**
| Agent Type | Evaluates |
|------------|-----------|
| `judges:test-judge` | Test coverage, assertions, structure, best practices |
| `judges:brownfield-accuracy-judge` | Reuse vs reimplementation, integration-point accuracy, scope accuracy against investigation findings |
| `judges:codebase-grounding-judge` | File-path/module-reference accuracy and existing-code awareness grounded in investigation findings |
| `judges:convention-adherence-judge` | Alignment with established naming, structural, and tooling conventions in the codebase |
### PRD Artifact Judge Batch (4 judges, single parallel batch)
**Batch 1: All PRD Judges (sub_step=1)**
| Agent Type | Evaluates |
|------------|-----------|
| `judges:prd-auditor` | Structural completeness, section coverage, clarity |
| `judges:prd-dependency-judge` | Dependency clarity and completeness |
| `judges:prd-testability-judge` | Requirement testability and measurability |
| `judges:prd-scope-judge` | Scope definition and boundary clarity |
</judge_batches>
<prompt_template>
### Preamble Injection
**Before invoking each judge, prepend the common and artifact-specific preambles:**
1. **Locate preamble files**:
- `skills/artifact-type-tailored-context/preambles/common_input_preamble.md`
- `skills/artifact-type-tailored-context/preambles/{artifact_type}_preamble.md`
- Use Glob tool to find: `**/artifact-type-tailored-context/preambles/*.md`
- Validate both files exist (fail with error CaseScore if either is missing)
2. **Read preamble content**:
- Read `common_input_preamble.md`
- Read `{artifact_type}_preamble.md`
- Validate combined preamble size is reasonable for judge context (target: < 8000 characters)
3. **Concatenate**:
- `common_input_preamble + "\n\n---\n\n" + artifact_preamble + "\n\n---\n\n" + judge_prompt`
- `common_input_preamble.md` is the only runtime source of judge input-loading contract text; judge-specific agent files should not duplicate that contract.
4. **Pass to judge**: Use concatenated prompt as judge's full prompt
**If either preamble file is missing:**
- Generate error CaseScore with `final_status=3`, `justification="Preamble file not found: {path}"`
- Continue with other judges
### Prompt Templates
**For plan artifacts:**
```
WORKDIR=$CLOSEDLOOP_WORKDIR. Read $CLOSEDLOOP_WORKDIR/judge-input.json first.
Evaluate according to `task` and `source_of_truth` ordering.
Treat the envelope's `primary_artifact` as authoritative.
If `fallback_mode.active=true`, use fallback artifacts specified in the envelope.
```
**For code artifacts:**
```
WORKDIR=$CLOSEDLOOP_WORKDIR. Read $CLOSEDLOOP_WORKDIR/judge-input.json first.
Evaluate according to `task` and `source_of_truth` ordering.
Treat the envelope's `primary_artifact` as authoritative.
Apply your {judge_name} criteria to assess code quality.
```
**For PRD artifacts:**
```
WORKDIR=$CLOSEDLOOP_WORKDIR. Read $CLOSEDLOOP_WORKDIR/judge-input.json first.
Evaluate according to `task` and `source_of_truth` ordering.
Treat the envelope's `primary_artifact` ($CLOSEDLOOP_WORKDIR/prd.md) as the authoritative PRD document.
Apply your {judge_name} criteria to assess PRD quality.
```
</prompt_template>
---
### Expected Output Format
<expected_output>
Each judge returns a **CaseScore** JSON object:
```json
{
"type": "case_score",
"case_id": "dry-judge",
"final_status": 1,
"metrics": [
{
"metric_name": "dry_score",
"threshold": 0.8,
"score": 0.85,
"justification": "Plan follows DRY principles..."
}
]
}
```
**Status Code Semantics:**
| Code | Meaning | When to Use |
|------|---------|-------------|
| `1` | Pass | Score meets or exceeds threshold |
| `2` | Fail | Score below threshold |
| `3` | Error | Judge execution failed |
</expected_output>
---
### Error Handling Protocol
<error_handling>
**CRITICAL REQUIREMENT:** If a judge Task call fails, you MUST construct an error CaseScore.
**Error CaseScore Template:**
```json
{
"type": "case_score",
"case_id": "{judge-name}",
"final_status": 3,
"metrics": [
{
"metric_name": "{metric}_score",
"threshold": 0.8,
"score": 0.0,
"justification": "Judge execution failed: {error message}"
}
]
}
```
**Continue-on-failure semantics:**
- Even if ALL judges fail, you MUST aggregate error CaseScores
- Always produce a complete report with 16 CaseScore entries (plan), 11 CaseScore entries (code), or 4 CaseScore entries (prd)
- Never abort the workflow due to judge failures
</error_handling>
---
### Step 2: Aggregate Results into EvaluationReport
**Performance:** Run "start of phase" with sub_step 5 (plan), 4 (code), or 2 (prd), sub_step_name=aggregate. Emit 'end of phase' after the aggregation step regardless of file write outcome.
**Task:** Collect all CaseScore outputs and structure them into an `EvaluationReport`.
<output_structure>
**Output file logic:**
```python
if artifact_type == 'code':
report_filename = 'code-judges.json'
report_id = f'{RUN_ID}-code-judges'
elif artifact_type == 'prd':
report_filename = 'prd-judges.json'
report_id = f'{RUN_ID}-prd-judges'
else:
report_filename = 'plan-judges.json'
report_id = f'{RUN_ID}-plan-judges'
output_path = $CLOSEDLOOP_WORKDIR / report_filename
```
**Plan artifact report structure (plan-judges.json):**
```json
{
"report_id": "{RUN_ID}-plan-judges",
"timestamp": "2024-02-03T15:45:30Z",
"stats": [
{ /* CaseScore from dry-judge */ },
{ /* CaseScore from ssot-judge */ },
{ /* CaseScore from kiss-judge */ },
{ /* CaseScore from code-organization-judge */ },
{ /* CaseScore from custom-best-practices-judge */ },
{ /* CaseScore from goal-alignment-judge */ },
{ /* CaseScore from readability-judge */ },
{ /* CaseScore from verbosity-judge */ },
{ /* CaseScore from solid-isp-dip-judge */ },
{ /* CaseScore from solid-liskov-substitution-judge */ },
{ /* CaseScore from solid-open-closed-judge */ },
{ /* CaseScore from technical-accuracy-judge */ },
{ /* CaseScore from test-judge */ },
{ /* CaseScore from brownfield-accuracy-judge */ },
{ /* CaseScore from codebase-grounding-judge */ },
{ /* CaseScore from convention-adherence-judge */ }
]
}
```
**Code artifact report structure (code-judges.json):**
```json
{
"report_id": "{RUN_ID}-code-judges",
"timestamp": "2024-02-03T15:45:30Z",
"stats": [
{ /* CaseScore from dry-judge */ },
{ /* CaseScore from ssot-judge */ },
{ /* CaseScore from kiss-judge */ },
{ /* CaseScore from code-organization-judge */ },
{ /* CaseScore from custom-best-practices-judge */ },
{ /* CaseScore from readability-judge */ },
{ /* CaseScore from solid-isp-dip-judge */ },
{ /* CaseScore from solid-liskov-substitution-judge */ },
{ /* CaseScore from solid-open-closed-judge */ },
{ /* CaseScore from technical-accuracy-judge */ },
{ /* CaseScore from test-judge */ }
]
}
```
**PRD artifact report structure (prd-judges.json):**
```json
{
"report_id": "{RUN_ID}-prd-judges",
"timestamp": "2024-02-03T15:45:30Z",
"stats": [
{ /* CaseScore from prd-auditor */ },
{ /* CaseScore from prd-dependency-judge */ },
{ /* CaseScore from prd-testability-judge */ },
{ /* CaseScore from prd-scope-judge */ }
]
}
```
**Field requirements:**
| Field | Format | How to Derive |
|-------|--------|---------------|
| `report_id` | `{RUN_ID}-plan-judges`, `{RUN_ID}-code-judges`, or `{RUN_ID}-prd-judges` | Extract RUN_ID from `$CLOSEDLOOP_WORKDIR` directory name, append suffix based on artifact type |
| `timestamp` | ISO 8601 | Generate with `date -u +%Y-%m-%dT%H:%M:%SZ` |
| `stats` | Array[CaseScore] | 16 CaseScore objects for plan, 11 for code, 4 for prd (one per judge) |
</output_structure>
---
### Step 3: Validate Output (MANDATORY)
**Performance:** Run "start of phase" with sub_step 6 (plan), 5 (code), or 3 (prd), sub_step_name=validate. Emit 'end of phase' after each validation attempt regardless of exit code, then apply failure recovery logic.
**CRITICAL:** You MUST run the validation script after writing the judge report. Do not consider the task complete until validation passes.
<validation_workflow>
**Step 3.1: Locate the Validation Script**
The script is in this skill's `scripts/` directory:
```bash
SCRIPT_PATH="scripts/validate_judge_report.py"
```
**Step 3.2: Ensure uv is Installed**
```bash
if ! command -v uv &> /dev/null; then
# Install uv — alternatives: brew install uv, pip install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
fi
```
**Step 3.3: Run Validation**
```bash
# CRITICAL: Run from script's directory so uv can find inline dependencies
cd "$(dirname "$SCRIPT_PATH")"
# Determine category based on artifact type
CATEGORY="plan" # default
if [ "$ARTIFACT_TYPE" = "code" ]; then
CATEGORY="code"
elif [ "$ARTIFACT_TYPE" = "prd" ]; then
CATEGORY="prd"
fi
# Run validation with appropriate category
uv run "$SCRIPT_PATH" --workdir "$CLOSEDLOOP_WORKDIR" --category "$CATEGORY"
```
**Argument requirements:**
- `--workdir` must be the **absolute path** to `$CLOSEDLOOP_WORKDIR`
- `--category` must be `plan` (16 judges), `code` (11 judges), or `prd` (4 judges)
- This is where `plan-judges.json`, `code-judges.json`, or `prd-judges.json` is located
</validation_workflow>
---
### Validation Checks
<validation_checks>
The script validates using strict Pydantic models:
| Check | Requirement |
|-------|-------------|
| **JSON syntax** | Valid JSON format |
| **Required fields** | report_id, timestamp, stats array |
| **Judge coverage** | All expected judges present (16 for plan, 11 for code, 4 for prd) |
| **Status values** | final_status ∈ {1, 2, 3} |
| **Metric completeness** | Each judge has ≥1 metric |
| **Report ID format** | Ends with '-judges' (plan), '-code-judges' (code), or '-prd-judges' (prd) |
**Expected judge case_ids for plan artifacts (16 total):**
```
brownfield-accuracy-judge
code-organization-judge
codebase-grounding-judge
convention-adherence-judge
custom-best-practices-judge
dry-judge
goal-alignment-judge
kiss-judge
readability-judge
solid-isp-dip-judge
solid-liskov-substitution-judge
solid-open-closed-judge
ssot-judge
technical-accuracy-judge
test-judge
verbosity-judge
```
**Expected judge case_ids for code artifacts (11 total):**
```
code-organization-judge
custom-best-practices-judge
dry-judge
kiss-judge
readability-judge
solid-isp-dip-judge
solid-liskov-substitution-judge
solid-open-closed-judge
ssot-judge
technical-accuracy-judge
test-judge
```
**Note:** Code artifacts exclude: goal-alignment-judge, verbosity-judge
**Expected judge case_ids for PRD artifacts (4 total):**
```
prd-auditor
prd-dependency-judge
prd-testability-judge
prd-scope-judge
```
**Note:** All 4 PRD judges run in a single parallel batch.
</validation_checks>
---
### Validation Exit Codes
| Code | Meaning | Action |
|------|---------|--------|
| `0` | Valid | Task complete ✓ |
| `1` | Invalid | Read error, fix report JSON, re-validate |
---
### If Validation Fails
<failure_recovery>
**Follow this sequence:**
1. **Read error message** - Understand what failed
2. **Fix report JSON** - Correct the specific validation error
3. **Re-run validation** - Repeat until exit code 0
4. **Never skip validation** - Do not mark task complete until validation passes
</failure_recovery>
---
## Reference: Pydantic Models
<pydantic_schema>
The validation script uses these strict Pydantic models:
```python
class MetricStatistics(BaseModel):
"""A single metric evaluation result."""
metric_name: str
threshold: Optional[float] = None
score: float
justification: str
class CaseScore(BaseModel):
"""Score for a single judge evaluation."""
type: Optional[str] = "case_score"
case_id: str
final_status: int # 1=pass, 2=fail, 3=error
metrics: List[MetricStatistics]
class EvaluationReport(BaseModel):
"""Top-level report containing all judge evaluations."""
report_id: str
timestamp: str
stats: List[CaseScore]
```
**Model constraints:**
- `ConfigDict(strict=True)` enforces exact type matching
- `final_status` validator rejects values outside {1, 2, 3}
</pydantic_schema>
---
## Success Checklist
<completion_criteria>
Before marking this task complete, verify:
**For all artifact types:**
- [ ] **Agents snapshot** - `agents-snapshot/manifest.json` exists in `$CLOSEDLOOP_WORKDIR` (created if missing, skipped if present)
**For plan artifacts (default):**
- [ ] **Input validation** - prd.md and plan.json exist (or graceful skip)
- [ ] **Context preparation** - context-manager-for-judges launched with `artifact_type=plan`
- [ ] **Plan context validation** - `plan-context.json` exists, or compatibility mode explicitly activated
- [ ] **Judge input contract** - `judge-input.json` exists with required fields
- [ ] **Investigation context resolution** - `investigation-log.md` reused, generated via pre-explorer, or best-effort generated internally
- [ ] **Parallel execution** - All 16 judges launched in 4 batches (max 4 per batch)
- [ ] **Result aggregation** - Valid EvaluationReport with 16 CaseScore entries
- [ ] **File output** - `plan-judges.json` written to `$CLOSEDLOOP_WORKDIR`
- [ ] **Validation passed** - Script exits with code 0 using `--category plan`
**For code artifacts (--artifact-type code):**
- [ ] **Context preparation** - context-manager-for-judges agent launched successfully
- [ ] **Context validation** - code-context.json exists at `$CLOSEDLOOP_WORKDIR`
- [ ] **Judge input contract** - `judge-input.json` exists with required fields
- [ ] **Investigation context resolution** - `investigation-log.md` reused or generated best-effort; missing file does not block code judging
- [ ] **Preamble injection** - common_input_preamble.md + code_preamble.md prepended to all judge prompts
- [ ] **Parallel execution** - All 11 judges launched in 3 batches (max 4 per batch)
- [ ] **Result aggregation** - Valid EvaluationReport with 11 CaseScore entries
- [ ] **File output** - `code-judges.json` written to `$CLOSEDLOOP_WORKDIR`
- [ ] **Report ID format** - report_id ends with '-code-judges'
- [ ] **Validation passed** - Script exits with code 0 using `--category code`
**For PRD artifacts (--artifact-type prd):**
- [ ] **prd.md existence check** - `$CLOSEDLOOP_WORKDIR/prd.md` found, or graceful exit with WARNING (code 0)
- [ ] **No context manager** - context-manager-for-judges is NOT launched for prd mode
- [ ] **Judge input contract** - `judge-input.json` written with `evaluation_type="prd"` and `primary_artifact=$CLOSEDLOOP_WORKDIR/prd.md`
- [ ] **Parallel execution** - All 4 PRD judges launched in a single parallel batch (sub_step=1)
- [ ] **Result aggregation** - Valid EvaluationReport with 4 CaseScore entries (sub_step=2)
- [ ] **File output** - `prd-judges.json` written to `$CLOSEDLOOP_WORKDIR`
- [ ] **Report ID format** - report_id ends with '-prd-judges'
- [ ] **Validation passed** - Script exits with code 0 using `--category prd` (sub_step=3)
</completion_criteria>
---
## Troubleshooting Guide
<troubleshooting>
| Error Message | Root Cause | Solution |
|---------------|------------|----------|
| "Report file does not exist" | File not written to correct location | Verify `$CLOSEDLOOP_WORKDIR` is set; check write path matches artifact type (plan-judges.json, code-judges.json, or prd-judges.json) |
| "Invalid JSON" | Syntax error in output file | Run `python3 -m json.tool "$CLOSEDLOOP_WORKDIR/{plan,code,prd}-judges.json"` to identify syntax error |
| "Missing expected judges" | Incomplete batch execution | Verify all batches launched (4 for plan, 3 for code, 1 for prd); check error CaseScores for failures; plan expects 16 judges, code expects 11, prd expects 4 |
| "final_status must be 1, 2, or 3" | Invalid status code | Use only: 1 (pass), 2 (fail), 3 (error) |
| "report_id should end with '-plan-judges'" | Incorrect ID format for plan | Use pattern: `{RUN_ID}-plan-judges` for plan artifacts |
| "report_id should end with '-code-judges'" | Incorrect ID format for code | Use pattern: `{RUN_ID}-code-judges` for code artifacts |
| "Judge {name} has no metrics" | Empty metrics array | Each CaseScore must have ≥1 MetricStatistics entry |
| "Context preparation failed" | context-manager-for-judges failed | Check context-manager agent output; verify artifact files exist |
| "judge-input.json missing" | Orchestrator did not generate envelope | Build `$CLOSEDLOOP_WORKDIR/judge-input.json` before launching judges |
| "judge-input schema invalid" | Missing required envelope fields | Ensure required fields: `evaluation_type`, `task`, `primary_artifact`, `supporting_artifacts`, `source_of_truth`, `fallback_mode`, `metadata` |
| "plan-context.json not found" | plan context manager did not produce output | Run `@judges:context-manager-for-judges` with `artifact_type=plan`; if still missing, activate one-run compatibility fallback to `plan.json` + `prd.md` |
| "Preamble file not found" | Missing common or artifact preamble .md file | Verify both `skills/artifact-type-tailored-context/preambles/common_input_preamble.md` and `skills/artifact-type-tailored-context/preambles/{artifact_type}_preamble.md` exist |
| "pre-explorer unavailable" | `@code:pre-explorer` not installed/resolvable | Log warning and use internal fallback investigation to create `investigation-log.md` |
| "investigation-log.md missing after fallback" | Both pre-explorer and internal fallback failed | Log warning and continue; do not block context preparation |
| "investigation-log.md missing in code mode" | pre-explorer unavailable or generation failed during code preflight | Log warning and continue with `code-context.json` only (non-blocking) |
| "Invalid --artifact-type value" | Unsupported artifact type | Use only 'plan', 'code', or 'prd' |
| "prd.md not found" | PRD document missing from workdir | Emit WARNING and exit gracefully (code 0); do not fail the parent workflow |
| "report_id should end with '-prd-judges'" | Incorrect ID format for prd | Use pattern: `{RUN_ID}-prd-judges` for PRD artifacts |
</troubleshooting>
---
## Error Handling Requirements
### Invalid Artifact Type
If `--artifact-type` value is not 'plan', 'code', or 'prd':
- Fail immediately with clear error message
- Do not attempt judge execution
- Exit with non-zero status
### Context Manager Timeout (Code Mode)
If context-manager-for-judges agent exceeds 5 minutes:
- Abort judge execution
- Generate error CaseScores for all 11 judges
- Each error CaseScore: `final_status=3`, `justification="Context preparation timeout"`
- Write complete report with all error CaseScores
### Context Manager Timeout (Plan Mode)
If context-manager-for-judges agent exceeds 5 minutes in plan mode:
- Attempt one emergency compatibility fallback to raw `plan.json` + `prd.md`
- If fallback files are unavailable, abort plan judge execution and emit clear error
### Individual Judge Failures
If a single judge Task call fails during execution:
- **Do not abort** the entire workflow
- Generate error CaseScore for that judge only
- Continue with remaining judges in batch and subsequent batches
- Include error CaseScore in final aggregated report
### Plan Mode Execution Flow
When `--artifact-type` is not specified or equals 'plan':
- Execute standard 16-judge plan logic
- Launch 4 batches with existing judge assignments
- Write to `plan-judges.json` (not `code-judges.json`)
- Launch context-manager-for-judges for plan context preparation
- Use `plan-context.json` as primary input; use one-run compatibility fallback only if context preparation fails
- Build and pass `judge-input.json` envelope to judges
- Prepend preambles to judge prompts
- Use default validation with `--category plan`
This is the standard plan mode flow; orchestrators must support context-manager launch, judge-input.json construction, and preamble injection. The compatibility fallback (raw `plan.json` + `prd.md`) activates only when context preparation fails (e.g., context-manager timeout), not for orchestrators that have not been updated.
### PRD Mode Execution Flow
When `--artifact-type prd` is specified:
- Check `$CLOSEDLOOP_WORKDIR/prd.md` exists; emit WARNING and exit gracefully (code 0) if missing
- Do NOT launch context-manager-for-judges
- Build `judge-input.json` with `evaluation_type="prd"` and `primary_artifact=$CLOSEDLOOP_WORKDIR/prd.md`
- Launch all 4 PRD judges in a single parallel batch (sub_step=1)
- Aggregate all 4 CaseScores (sub_step=2) and write to `prd-judges.json`
- Validate with `--category prd` (sub_step=3)
---