read-eval-logs
$
npx mdskill add UKGovernmentBEIS/inspect_evals/read-eval-logsInspect log files using Python API and CLI commands.
- Analyze evaluation logs without pre-written scripts.
- Integrates with Inspect evaluation logs and Python API.
- Executes commands to list, dump, convert, and view logs.
- Delivers results through JSON output and interactive viewers.
SKILL.md
.github/skills/read-eval-logsView on GitHub ↗
---
name: read-eval-logs
description: View and analyse Inspect evaluation log files using the Python API. Trigger whenever you need to look at a .eval file yourself without using pre-written scripts.
---
# Analysing Eval Log Files
This skill covers how to view and analyse Inspect evaluation log files using the Python API and CLI commands.
## Quick Reference
### CLI Commands
```bash
# List all logs in the default log directory (./logs or INSPECT_LOG_DIR)
uv run inspect log list --json
# List logs with specific status
uv run inspect log list --json --status success
uv run inspect log list --json --status error
# List retryable logs (error/cancelled without subsequent success)
uv run inspect log list --json --retryable
# Dump a log file as JSON (works with any format: .eval or .json)
uv run inspect log dump <log_file_path>
# Convert between log formats
uv run inspect log convert source.json --to eval --output-dir log-output
uv run inspect log convert logs/ --to eval --output-dir logs-eval
# Get JSON schema for log files
uv run inspect log schema
```
### Interactive Log Viewer
```bash
# Start the interactive log viewer (updates automatically)
uv run inspect view
```
## Python API
### Key Imports
```python
from inspect_ai.log import (
# Listing and reading
list_eval_logs,
read_eval_log,
read_eval_log_sample,
read_eval_log_samples,
read_eval_log_sample_summaries,
# Writing
write_eval_log,
# Utilities
retryable_eval_logs,
recompute_metrics,
resolve_sample_attachments,
# Types
EvalLog,
EvalLogInfo,
EvalSample,
EvalSampleSummary,
)
```
### Listing Logs
```python
# List all logs in default directory
logs = list_eval_logs()
# List logs in a specific directory
logs = list_eval_logs(log_dir="./experiment-logs")
# Filter by format
logs = list_eval_logs(formats=["eval"]) # Only .eval files
# Filter with a custom function (receives header-only EvalLog)
logs = list_eval_logs(filter=lambda log: log.status == "success")
# Non-recursive listing
logs = list_eval_logs(recursive=False)
```
### Reading Logs
```python
# Read full log
log = read_eval_log("path/to/logfile.eval")
# Read header only (fast, excludes samples)
log = read_eval_log("path/to/logfile.eval", header_only=True)
# Read with attachments resolved
log = read_eval_log("path/to/logfile.eval", resolve_attachments=True)
```
### Reading Samples
```python
# Read a single sample by ID and epoch
sample = read_eval_log_sample("path/to/logfile.eval", id=42, epoch=1)
# Read a single sample by UUID
sample = read_eval_log_sample("path/to/logfile.eval", uuid="sample-uuid")
# Stream all samples (memory efficient - one at a time)
for sample in read_eval_log_samples("path/to/logfile.eval"):
process(sample)
# Stream samples from incomplete logs
for sample in read_eval_log_samples("path/to/logfile.eval", all_samples_required=False):
process(sample)
# Read sample summaries (fast, includes scoring info)
summaries = read_eval_log_sample_summaries("path/to/logfile.eval")
```
### Filtering Samples
```python
# Read only samples with errors using summaries for filtering
errors = []
for summary in read_eval_log_sample_summaries(log_file):
if summary.error is not None:
errors.append(
read_eval_log_sample(log_file, summary.id, summary.epoch)
)
```
## EvalLog Structure
The `EvalLog` object contains:
| Field | Type | Description |
| ----- | ---- | ----------- |
| `version` | `int` | File format version (currently 2) |
| `status` | `str` | `"started"`, `"success"`, `"cancelled"`, or `"error"` |
| `eval` | `EvalSpec` | Task, model, creation time, config |
| `plan` | `EvalPlan` | Solvers and generation config |
| `results` | `EvalResults` | Aggregate scores and metrics |
| `stats` | `EvalStats` | Runtime, model usage statistics |
| `error` | `EvalError` | Error info if `status == "error"` |
| `samples` | `list[EvalSample]` | Individual samples (if not header_only) |
| `reductions` | `list[EvalSampleReductions]` | Multi-epoch reductions |
| `location` | `str` | URI where log was read from |
### Always Check Status
```python
log = read_eval_log("path/to/logfile.eval")
if log.status == "success":
# Safe to analyse results
for score in log.results.scores:
print(f"{score.name}: {score.metrics}")
```
## EvalSample Structure
Each sample contains:
| Field | Type | Description |
| ----- | ---- | ----------- |
| `id` | `int \| str` | Unique sample ID |
| `epoch` | `int` | Epoch number |
| `input` | `str \| list[ChatMessage]` | Sample input |
| `target` | `str \| list[str]` | Expected target(s) |
| `messages` | `list[ChatMessage]` | Full conversation history |
| `output` | `ModelOutput` | Model's output |
| `scores` | `dict[str, Score]` | Scores from scorers |
| `metadata` | `dict[str, Any]` | Sample metadata |
| `store` | `dict[str, Any]` | State at end of execution |
| `events` | `list[Event]` | Transcript events |
| `error` | `EvalError` | Error if sample failed |
| `total_time` | `float` | Total sample runtime |
| `model_usage` | `dict[str, ModelUsage]` | Token usage |
## Common Analysis Patterns
### Get Aggregate Metrics
```python
log = read_eval_log(log_file, header_only=True)
if log.results:
for score in log.results.scores:
print(f"Scorer: {score.name}")
for metric_name, metric in score.metrics.items():
print(f" {metric_name}: {metric.value}")
```
### Analyse Failed Samples
```python
log = read_eval_log(log_file)
if log.samples:
failed = [s for s in log.samples if s.error is not None]
for sample in failed:
print(f"Sample {sample.id}: {sample.error.message}")
```
### Extract Model Usage
```python
log = read_eval_log(log_file, header_only=True)
for model, usage in log.stats.model_usage.items():
print(f"{model}: {usage.input_tokens} in, {usage.output_tokens} out")
```
### Compare Multiple Runs
```python
logs = list_eval_logs(filter=lambda l: l.eval.task == "my_task")
for log_info in logs:
log = read_eval_log(log_info, header_only=True)
if log.results and log.results.scores:
accuracy = log.results.scores[0].metrics.get("accuracy")
print(f"{log.eval.model}: {accuracy.value if accuracy else 'N/A'}")
```
### Find Retryable Logs
```python
all_logs = list_eval_logs()
retryable = retryable_eval_logs(all_logs)
for log_info in retryable:
print(f"Can retry: {log_info.name}")
```
## Log File Formats
| Type | Description |
| ---- | ----------- |
| `.eval` | Binary format, ~1/8 size of JSON, fast incremental access |
| `.json` | Text format, human-readable, slower for large files |
Both formats are fully supported by the API and can be intermixed.
## Working with Large Logs
For logs too large to fit in memory:
1. **Use `.eval` format** - supports compression and incremental access
2. **Read header only** - `read_eval_log(log_file, header_only=True)`
3. **Stream samples** - `read_eval_log_samples()` yields one at a time
4. **Use summaries** - `read_eval_log_sample_summaries()` for quick overview
## Modifying Logs
```python
# Read, modify, and write back
log = read_eval_log(log_file)
# ... modify log ...
write_eval_log(log) # Uses log.location automatically
# Write to a new location
write_eval_log(log, location="new_path.eval")
# Recompute metrics after score edits
recompute_metrics(log)
write_eval_log(log)
```
## Environment Variables
| Variable | Description |
| -------- | ----------- |
| `INSPECT_LOG_DIR` | Default log directory (default: `./logs`) |
| `INSPECT_EVAL_LOG_FILE_PATTERN` | Log filename pattern (e.g., `{task}_{model}_{id}`) |
## Related Tools
- **[Inspect Scout](https://meridianlabs-ai.github.io/inspect_scout/)** - Transcript analysis
- **[Inspect Viz](https://meridianlabs-ai.github.io/inspect_viz/)** - Data visualization
- **Log Dataframes** - Extract pandas DataFrames from logs (see `inspect_ai.analysis`)