read-eval-logs

Name: read-eval-logs
Author: UKGovernmentBEIS/inspect_evals

$npx mdskill add UKGovernmentBEIS/inspect_evals/read-eval-logs

Inspect log files using Python API and CLI commands.

Analyze evaluation logs without pre-written scripts.
Integrates with Inspect evaluation logs and Python API.
Executes commands to list, dump, convert, and view logs.
Delivers results through JSON output and interactive viewers.

SKILL.md

.github/skills/read-eval-logsView on GitHub ↗

---
name: read-eval-logs
description: View and analyse Inspect evaluation log files using the Python API. Trigger whenever you need to look at a .eval file yourself without using pre-written scripts.
---

# Analysing Eval Log Files

This skill covers how to view and analyse Inspect evaluation log files using the Python API and CLI commands.

## Quick Reference

### CLI Commands

```bash
# List all logs in the default log directory (./logs or INSPECT_LOG_DIR)
uv run inspect log list --json

# List logs with specific status
uv run inspect log list --json --status success
uv run inspect log list --json --status error

# List retryable logs (error/cancelled without subsequent success)
uv run inspect log list --json --retryable

# Dump a log file as JSON (works with any format: .eval or .json)
uv run inspect log dump <log_file_path>

# Convert between log formats
uv run inspect log convert source.json --to eval --output-dir log-output
uv run inspect log convert logs/ --to eval --output-dir logs-eval

# Get JSON schema for log files
uv run inspect log schema
```

### Interactive Log Viewer

```bash
# Start the interactive log viewer (updates automatically)
uv run inspect view
```

## Python API

### Key Imports

```python
from inspect_ai.log import (
    # Listing and reading
    list_eval_logs,
    read_eval_log,
    read_eval_log_sample,
    read_eval_log_samples,
    read_eval_log_sample_summaries,
    
    # Writing
    write_eval_log,
    
    # Utilities
    retryable_eval_logs,
    recompute_metrics,
    resolve_sample_attachments,
    
    # Types
    EvalLog,
    EvalLogInfo,
    EvalSample,
    EvalSampleSummary,
)
```

### Listing Logs

```python
# List all logs in default directory
logs = list_eval_logs()

# List logs in a specific directory
logs = list_eval_logs(log_dir="./experiment-logs")

# Filter by format
logs = list_eval_logs(formats=["eval"])  # Only .eval files

# Filter with a custom function (receives header-only EvalLog)
logs = list_eval_logs(filter=lambda log: log.status == "success")

# Non-recursive listing
logs = list_eval_logs(recursive=False)
```

### Reading Logs

```python
# Read full log
log = read_eval_log("path/to/logfile.eval")

# Read header only (fast, excludes samples)
log = read_eval_log("path/to/logfile.eval", header_only=True)

# Read with attachments resolved
log = read_eval_log("path/to/logfile.eval", resolve_attachments=True)
```

### Reading Samples

```python
# Read a single sample by ID and epoch
sample = read_eval_log_sample("path/to/logfile.eval", id=42, epoch=1)

# Read a single sample by UUID
sample = read_eval_log_sample("path/to/logfile.eval", uuid="sample-uuid")

# Stream all samples (memory efficient - one at a time)
for sample in read_eval_log_samples("path/to/logfile.eval"):
    process(sample)

# Stream samples from incomplete logs
for sample in read_eval_log_samples("path/to/logfile.eval", all_samples_required=False):
    process(sample)

# Read sample summaries (fast, includes scoring info)
summaries = read_eval_log_sample_summaries("path/to/logfile.eval")
```

### Filtering Samples

```python
# Read only samples with errors using summaries for filtering
errors = []
for summary in read_eval_log_sample_summaries(log_file):
    if summary.error is not None:
        errors.append(
            read_eval_log_sample(log_file, summary.id, summary.epoch)
        )
```

## EvalLog Structure

The `EvalLog` object contains:

| Field | Type | Description |
| ----- | ---- | ----------- |
| `version` | `int` | File format version (currently 2) |
| `status` | `str` | `"started"`, `"success"`, `"cancelled"`, or `"error"` |
| `eval` | `EvalSpec` | Task, model, creation time, config |
| `plan` | `EvalPlan` | Solvers and generation config |
| `results` | `EvalResults` | Aggregate scores and metrics |
| `stats` | `EvalStats` | Runtime, model usage statistics |
| `error` | `EvalError` | Error info if `status == "error"` |
| `samples` | `list[EvalSample]` | Individual samples (if not header_only) |
| `reductions` | `list[EvalSampleReductions]` | Multi-epoch reductions |
| `location` | `str` | URI where log was read from |

### Always Check Status

```python
log = read_eval_log("path/to/logfile.eval")
if log.status == "success":
    # Safe to analyse results
    for score in log.results.scores:
        print(f"{score.name}: {score.metrics}")
```

## EvalSample Structure

Each sample contains:

| Field | Type | Description |
| ----- | ---- | ----------- |
| `id` | `int \| str` | Unique sample ID |
| `epoch` | `int` | Epoch number |
| `input` | `str \| list[ChatMessage]` | Sample input |
| `target` | `str \| list[str]` | Expected target(s) |
| `messages` | `list[ChatMessage]` | Full conversation history |
| `output` | `ModelOutput` | Model's output |
| `scores` | `dict[str, Score]` | Scores from scorers |
| `metadata` | `dict[str, Any]` | Sample metadata |
| `store` | `dict[str, Any]` | State at end of execution |
| `events` | `list[Event]` | Transcript events |
| `error` | `EvalError` | Error if sample failed |
| `total_time` | `float` | Total sample runtime |
| `model_usage` | `dict[str, ModelUsage]` | Token usage |

## Common Analysis Patterns

### Get Aggregate Metrics

```python
log = read_eval_log(log_file, header_only=True)
if log.results:
    for score in log.results.scores:
        print(f"Scorer: {score.name}")
        for metric_name, metric in score.metrics.items():
            print(f"  {metric_name}: {metric.value}")
```

### Analyse Failed Samples

```python
log = read_eval_log(log_file)
if log.samples:
    failed = [s for s in log.samples if s.error is not None]
    for sample in failed:
        print(f"Sample {sample.id}: {sample.error.message}")
```

### Extract Model Usage

```python
log = read_eval_log(log_file, header_only=True)
for model, usage in log.stats.model_usage.items():
    print(f"{model}: {usage.input_tokens} in, {usage.output_tokens} out")
```

### Compare Multiple Runs

```python
logs = list_eval_logs(filter=lambda l: l.eval.task == "my_task")
for log_info in logs:
    log = read_eval_log(log_info, header_only=True)
    if log.results and log.results.scores:
        accuracy = log.results.scores[0].metrics.get("accuracy")
        print(f"{log.eval.model}: {accuracy.value if accuracy else 'N/A'}")
```

### Find Retryable Logs

```python
all_logs = list_eval_logs()
retryable = retryable_eval_logs(all_logs)
for log_info in retryable:
    print(f"Can retry: {log_info.name}")
```

## Log File Formats

| Type | Description |
| ---- | ----------- |
| `.eval` | Binary format, ~1/8 size of JSON, fast incremental access |
| `.json` | Text format, human-readable, slower for large files |

Both formats are fully supported by the API and can be intermixed.

## Working with Large Logs

For logs too large to fit in memory:

1. **Use `.eval` format** - supports compression and incremental access
2. **Read header only** - `read_eval_log(log_file, header_only=True)`
3. **Stream samples** - `read_eval_log_samples()` yields one at a time
4. **Use summaries** - `read_eval_log_sample_summaries()` for quick overview

## Modifying Logs

```python
# Read, modify, and write back
log = read_eval_log(log_file)
# ... modify log ...
write_eval_log(log)  # Uses log.location automatically

# Write to a new location
write_eval_log(log, location="new_path.eval")

# Recompute metrics after score edits
recompute_metrics(log)
write_eval_log(log)
```

## Environment Variables

| Variable | Description |
| -------- | ----------- |
| `INSPECT_LOG_DIR` | Default log directory (default: `./logs`) |
| `INSPECT_EVAL_LOG_FILE_PATTERN` | Log filename pattern (e.g., `{task}_{model}_{id}`) |

## Related Tools

- **[Inspect Scout](https://meridianlabs-ai.github.io/inspect_scout/)** - Transcript analysis
- **[Inspect Viz](https://meridianlabs-ai.github.io/inspect_viz/)** - Data visualization
- **Log Dataframes** - Extract pandas DataFrames from logs (see `inspect_ai.analysis`)

More from UKGovernmentBEIS/inspect_evals

Skill	Description
build-repo-context	Crawl repository PRs, issues, and review comments to distill institutional knowledge into a shared knowledge base. Run periodically by "context agents" to maintain agent_artefacts/repo_context/REPO_CONTEXT.md. Trigger only on specific request.
check-trajectories-workflow	Use Inspect Scout to analyze agent trajectories from evaluation log files. Runs default and custom scanners to detect external failures, formatting issues, reward hacking, and ethical refusals. Use when user asks to check/analyze agent trajectories. Trigger when the user asks you to run the "Check Agent Trajectories" workflow.
ci-maintenance-workflow	CI and GitHub Actions maintenance workflows — fix a failing test from a CI URL, fix a failing smoke test, add @pytest.mark.slow markers to slow tests, or review a PR against agent-checkable standards. Use when user asks to fix a failing test, fix a smoke test, mark slow tests, or review a PR. Trigger when the user asks you to run the "Write a PR For A Failing Test", "Fix A Failing Smoke Test", "Mark Slow Tests", or "Review PR According to Agent-Checkable Standards" workflow.
code-quality-fix-all	Fix code quality issues identified in a code quality review stored in agent_artefacts/code_quality/<topic>/. Systematically addresses issues found by the code-quality-review-all skill for ANY code quality topic, with validation and testing at each step. Use when user asks to fix issues from a code quality review, or asks to fix issues from agent_artefacts/code_quality/<topic>.
code-quality-review-all	Review all evaluations in the repository against a single code quality standard. Checks ALL evals against ONE standard for periodic quality reviews. Use when user asks to review/audit/check all evaluations for a specific topic or standard. Do NOT use for reviewing a single eval (use eval-quality-workflow instead) or for test coverage (use ensure-test-coverage instead).
create-eval	Redirect to the inspect-evals-template for creating new evaluations. New evals are no longer created in this repository — they live in standalone repos. Use when user asks to create/implement/build a new evaluation.
ensure-test-coverage	Ensure test coverage for a single evaluation - both reviewing existing tests and creating missing ones. Analyzes testable components, checks tests against repository conventions, reports coverage gaps, and creates or improves tests. Use when user asks to check/review/create/add/ensure tests for an eval. Use whenever you are asked to review an evaluation that contains tests, or whenever you need to write a suite of tests. Do NOT use for fixing a specific failing CI test (use ci-maintenance-workflow instead).
eval-quality-workflow	Fix or review a single evaluation against all EVALUATION_CHECKLIST.md standards. Use "fix" mode to refactor an eval into compliance, or "review" mode to assess compliance without making changes. Use when user asks to fix, review, or check an evaluation's quality. Trigger when the user asks you to run the "Fix An Evaluation" or "Review An Evaluation" workflow. Do NOT use for reviewing ALL evals against a single code quality standard (use code-quality-review-all instead).
eval-report-workflow	Create an evaluation report for a README by selecting models, estimating costs, running evaluations, and formatting results tables. Use when user asks to make/create/generate an evaluation report. Trigger when the user asks you to run the "Make An Evaluation Report" workflow.
eval-validity-review	Review a single evaluation's validity — whether its claims hold up, whether its name is accurate, whether samples can be both succeeded and failed at, and whether scoring measures ground truth. Use when user asks to check validity of an eval, or as part of the Master Checklist workflow. Do NOT use for code quality or test coverage (use eval-quality-workflow or ensure-test-coverage instead).