e2e-scenario-testing

Name: e2e-scenario-testing
Author: obra/dotfiles

$npx mdskill add obra/dotfiles/e2e-scenario-testing

Verify that a *running* application does what it claims, by driving its real interface the way a user would. The unit of work is a **scenario card**: a short markdown test written for an agent to execute — not a Playwright/expect script. Cards are high-level enough that a small UI shuffle doesn't invalidate them, but precise enough that two agents running the same card reach the same verdict.

SKILL.md

.github/skills/e2e-scenario-testingView on GitHub ↗

---
name: e2e-scenario-testing
description: Use when verifying a running application end-to-end through its real interface — a web UI, a CLI, or a TUI — by writing and executing agent-run "scenario cards" against a freshly built instance with falsifiable assertions. Trigger on "test it end to end", "prove the UI actually works", "write/run a scenario", or after a change touches a user-facing surface that unit tests can't fully cover. Not for unit tests, pure code review, or API-only checks.
---

# End-to-end scenario testing

Verify that a *running* application does what it claims, by driving its real
interface the way a user would. The unit of work is a **scenario card**: a
short markdown test written for an agent to execute — not a Playwright/expect
script. Cards are high-level enough that a small UI shuffle doesn't invalidate
them, but precise enough that two agents running the same card reach the same
verdict.

A green unit test proves the wiring in isolation. A scenario proves the wiring
*as assembled and rendered*. They catch different bugs — write the card even
when the unit tests pass.

## When to use this

- A feature touches a user-facing surface (button, palette command, status
  indicator, keybinding, rendered message) and you want proof it works live.
- The user asks to "test it end to end" / "prove the UI works" / "run a scenario."
- You changed a layer (projection, capability gate, renderer) whose effect is
  only observable in the assembled UI.

Don't use it for logic with no UI surface (unit-test that), or when a
production gate makes the live path unreachable (see *Over-specification* below).

## The card format

One card = one `.md` file. Keep these sections; collapse any to one line when
the scenario is simple. Don't pad.

```markdown
# <area>-<behavior>: one-line title

**What this covers**: the feature + the specific commits/IDs it exercises.
If something else breaks this, it should be caught here.

## Pre-state
What must be true before starting: a freshly built instance running, auth/creds
in place, a clean workdir. Give the exact commands to reach it.

## Steps
Numbered actions described by **intent**, each with the concrete command or
tool call and a real UI label (prefer labels the user sees over brittle
selectors like `#nav > li:nth-child(3)`).

## Expected
For each step, what you should observe — and the **falsification condition**:
"if you see X instead, the test fails." Silence is not success.

## Cleanup
Idempotent teardown so reruns are hermetic. Never touch state you didn't create.

## Sharp edges
Footguns, timing/ordering caveats, nondeterminism noted while recording.
```

## Running a card

1. **Build fresh from the code under test.** The single most common mistake is
   testing a stale binary. Rebuild every layer your change touches (server,
   client, embedded assets) and confirm the running instance is the new one,
   not a process someone left up yesterday.
2. **Isolate.** Run in a hermetic workdir. If the app holds a host-level
   singleton (a lock, a fixed port, a shared state dir), point the test
   instance at its own copy — e.g. override `$HOME`/state-dir/port — so it can
   neither collide with nor pollute (nor be polluted by) a real instance.
   Symlink shared read-only inputs (creds, tokens); keep mutable state separate.
3. **Drive the surface** (recipes below).
4. **Assert against the authoritative record, not just the pixels.** The UI can
   lie or lag; the on-disk state / log / database is ground truth. Cross-check
   the rendered claim against it when an assertion is ambiguous.
5. **Capture evidence** — a screenshot, the captured pane, the on-disk artifact.
6. **Clean up** — shut down what you spawned, remove scratch dirs, leave
   pre-existing instances running and untouched.

## Driving a web UI (browser)

Use a Chrome/CDP browser tool. After authenticated navigation, drive the page
through `eval` against the app's own JS entry points rather than synthesizing
clicks where possible — it's more robust to layout change.

- **Optimistic-vs-settled** assertions: fire the action but *don't await it*,
  take a synchronous DOM snapshot (the pending placeholder is there *now*),
  then await and snapshot again. Without the no-await capture you can't tell
  "rendered then reconciled" from "never rendered."
- Return a **plain string** from `eval` (join your findings with `\n`); some
  bridges stringify a returned object as `[object Object]`.
- Inspect internal state via the app's singleton (`window.<App>?.state`, etc.)
  when the DOM is ambiguous.

## Driving a CLI / TUI (tmux)

Each scenario gets its own named tmux session (cleanup needs a deterministic
name). Fix the size for deterministic capture; prefer the app's plain-text/inline
mode if it has one.

```bash
tmux new-session -d -s <name> -x 200 -y 50 "<cmd> 2>/tmp/<name>-stderr.log"
tmux send-keys -t <name> -l "literal text"   # -l = no key-name parsing (paths, slashes)
tmux send-keys -t <name> Enter
tmux capture-pane -t <name> -p                # -p = plain text; add -e only for styling
```

- Always `-l` for user-typed strings; without it `/foo/bar` parses as escapes.
- Poll `capture-pane` for a state string; grep the **glyph/word**, not the color.
- Redirect stderr to a file — panics and debug probes land there, not the pane.

## Hard-won principles

- **Falsification, always.** Every assertion states what failure looks like. A
  step that can't fail proves nothing. When watching for an outcome, make sure
  your check would fire on the failure path, not just the happy path.
- **Verify the *right* surface.** The same concept often exists at several
  layers (an internal capability vs. the REST projection of it; a model field
  vs. the rendered chip). Confirm your assertion reads the surface that actually
  carries the signal — a "missing" value is often present one layer over.
- **Present but not visible ≠ absent.** Scrollable bodies, virtualized lists,
  and auto-scroll-to-bottom routinely push a real element out of the capture
  window. Before concluding something didn't render, scroll/expand to where it
  should be. Confirm via a sibling read (a status command reading the same
  state) when the visual is hard to capture.
- **Executing the card tests the card.** Expect to find bugs in your own
  scenario — a wrong selector, a wrong layer, an assertion the UI can't show.
  Fix the card as you go; a card that "passes" because its check was vacuous is
  worse than none.
- **Over-specification trap.** A card can describe a path that production gating
  prevents (e.g. a keybind that's a no-op in the current mode). Confirm the gate
  in the source rather than fighting it through the UI; verify the underlying
  behavior with a unit test and note the gate in the card.
- **Cleanup is part of the test.** A half-shutdown fleet makes the next run's
  polling return false positives. Make teardown idempotent and scoped to what
  you created.

## Finishing

Report each assertion as pass/fail with the concrete observation (the rendered
text, the on-disk value), not "looks good." If a card fails, capture the
evidence and either fix the bug or file it; don't soften the verdict.

More from obra/dotfiles

Skill	Description
chronicle	\|
maintaining-documentation	Use when documentation needs creating, checking, or maintaining — docs may have drifted from code, pre-release doc verification, updating docs after finishing code work, adding or enforcing project terminology, deciding where a new doc should live, or a routine re-audit of previously verified docs.
roborev-design-review	Request a design review for a commit and present the results
roborev-design-review-branch	Request a design review for all commits on the current branch and present the results
roborev-fix	Use when the user asks to fix open reviews, invokes /roborev-fix, or provides job IDs; do not use when the user only pastes review findings with no request to discover or close reviews
roborev-refine	Iterative review-fix loop for the current branch — reviews via daemon, fixes inline, re-reviews until passing or max iterations reached
roborev-respond	Add a comment to a roborev code review and close it
roborev-review	Request a code review for a commit and present the results
roborev-review-branch	Request a code review for all commits on the current branch and present the results
syncing-obsidian	Use when reading or writing files in an Obsidian vault that is synced with obsync. Ensures changes are pulled before reading and pushed after writing. Also covers sync status, file history, version restore, and diagnosing sync issues.