monte-carlo-incident-response

Name: monte-carlo-incident-response
Author: monte-carlo-data/mc-agent-toolkit

$ npx mdskill add monte-carlo-data/mc-agent-toolkit/monte-carlo-incident-response

Execute end-to-end incident response for data failures.

Handles active alerts, broken tables, and pipeline failures.
Depends on Monte Carlo skills for triage, root cause, and remediation.
Decides workflow steps based on alert context and user intent.
Delivers a sequenced investigation and fix plan to the user.

SKILL.md

.github/skills/monte-carlo-incident-response

---
name: monte-carlo-incident-response
description: Orchestrate incident response — triage, root cause, remediate, prevent recurrence. USE WHEN active alerts, data broken, stale, pipeline failure, or investigate and fix a data incident.
when_to_use: |
  Invoke when the user has an active data incident to handle: alerts firing, a table looks stale or broken, a pipeline failed, or they want to investigate root cause on a named table.
  Example triggers: "my orders table is stale, figure out why", "I have an unresolved alert on X, help me investigate", "alerts are firing, what should I do?", "investigate the most critical alert".

  Covers the full workflow: triage (classify/prioritize alerts) → root cause analysis (lineage, freshness history, query changes) → remediation → prevent recurrence.

  Do NOT invoke for coverage or "what should I monitor" requests (use proactive-monitoring instead) or for creating a specific monitor on a known table (use monitoring-advisor).
bucket: Agent-routing
version: 1.0.0
---

# Monte Carlo Incident Response Workflow

This workflow orchestrates the full lifecycle of a data incident by sequencing
existing Monte Carlo skills. It does not contain investigation or remediation
logic itself — each step loads the relevant skill's SKILL.md which has the
actual instructions.

## When to activate this workflow

Activate when:

- Context detection routes here (active alerts detected + incident intent)
- User invokes `/mc-incident-response`
- User asks to "respond to an incident", "handle this alert", "triage and fix"
- User describes a data quality problem: "data is broken", "table is stale", "alert firing"

## When NOT to activate this workflow

- User wants to create monitors or check coverage without an active incident: use the `proactive-monitoring` workflow
- User is editing a dbt model: defer to the `prevent` skill (auto-activates via hooks)
- User wants to check table health without an incident context: use `asset-health` directly
- A skill is already active and handling the user's request

---

## Workflow Steps

```
Step 1 (conditional): Triage — when user has multiple/unknown alerts
Step 2: Root Cause Analysis — the core investigation
Step 3: Remediation — fix or escalate
Step 4 (optional): Prevent Recurrence — add monitoring
```

### Determine entry point

Before starting, determine which step to enter based on the user's context:

- **User has no specific alert** ("I have alerts firing", "what's going on?") → Start at **Step 1: Triage**
- **User has a specific alert ID or table** ("alert ABC-123", "stg_payments is stale") → Skip to **Step 2: Root Cause Analysis**
- **User knows the root cause** ("the ETL job failed, help me fix it") → Skip to **Step 3: Remediation**
- **Ambiguous** → Ask: "Do you have a specific alert or table you want to investigate, or should I check your recent alerts first?"
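
The entry-point rules above can be sketched as a small routing function. This is only an illustration: the workflow itself is prompt-driven, and `Step`, `IncidentContext`, and its fields are hypothetical names that do not exist in the toolkit. The sketch also assumes that a known root cause takes precedence over a named alert or table, which the rules above do not state explicitly.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Step(Enum):
    ASK_USER = 0      # ambiguous: ask the clarifying question first
    TRIAGE = 1
    ROOT_CAUSE = 2
    REMEDIATION = 3


@dataclass
class IncidentContext:
    # Hypothetical summary of what the user has said so far.
    alert_id: Optional[str] = None
    table: Optional[str] = None
    known_root_cause: Optional[str] = None
    mentions_alerts: bool = False  # e.g. "I have alerts firing"


def entry_point(ctx: IncidentContext) -> Step:
    """Pick the first workflow step from the user's context."""
    if ctx.known_root_cause:
        return Step.REMEDIATION      # user already knows what broke
    if ctx.alert_id or ctx.table:
        return Step.ROOT_CAUSE       # a specific alert or table was named
    if ctx.mentions_alerts:
        return Step.TRIAGE           # alerts exist but none was singled out
    return Step.ASK_USER             # nothing to go on yet
```

For example, `entry_point(IncidentContext(table="stg_payments"))` selects root cause analysis, while an empty context falls through to the clarifying question.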

---

### Step 1: Triage (conditional)

**Skill:** Read and follow `../automated-triage/SKILL.md`

**Goal:** Fetch recent alerts, score them by confidence and impact, identify which ones need investigation.

**When to run:** Only when the user doesn't already have a specific alert or incident to investigate. This step helps narrow down "I have alerts" into "these specific alerts need attention."

**Scope MCP calls tightly.** On large accounts, broad queries return hundreds of results, overflow the tool-result token limit, spill to disk, and force chunk reads — burning user tokens and exhausting the turn budget. Minimum scoping for tools this workflow touches:

- `get_alerts` → time filter (`created_after`, default last 7 days) + at least one of `warehouse`, `table_names`, `severity`
- `search` → needed to resolve a table name to its MCON (`get_table` requires an MCON). Always pass `limit` (e.g. 5), the table name as `query`, and a filter on `warehouse_uuid` or `database`/`schema`; `warehouse_types` alone is too broad. If multiple matches return:
  1. Auto-pick the match whose `warehouse_display_name` matches the warehouse the user named. Do NOT stop to ask.
  2. Failing that, prefer the match with `is_key_asset: true`.
  3. Ask the user only when neither rule resolves it.
- `get_monitors` → filter by `mcons` or `warehouse_uuid`
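
The tie-breaking order for multiple `search` matches could look roughly like the sketch below. It assumes each match is a plain dict carrying the `warehouse_display_name` and `is_key_asset` fields named above; the `mcon` key, the `pick_match` helper, and the case-insensitive comparison are assumptions for illustration, not the actual MCP response shape.

```python
from typing import Optional


def pick_match(matches: list[dict], user_warehouse: Optional[str]) -> Optional[dict]:
    """Return the single best match, or None when the user must be asked."""
    candidates = matches
    # (1) Narrow to the warehouse the user named, if any match is in it.
    if user_warehouse:
        named = [
            m for m in matches
            if (m.get("warehouse_display_name") or "").lower() == user_warehouse.lower()
        ]
        if named:
            candidates = named
    if len(candidates) == 1:
        return candidates[0]
    # (2) Failing that, prefer a key asset among the remaining candidates.
    key_assets = [m for m in candidates if m.get("is_key_asset") is True]
    if len(key_assets) == 1:
        return key_assets[0]
    # (3) Nothing resolved it: the caller should ask the user.
    return None
```

Returning `None` rather than guessing keeps the "ask the user" case explicit for the caller.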

If scope is missing, ask the user before calling: "Which warehouse?", "How far back — today, this week?", "Any specific severity?".
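
As a rough sketch of the `get_alerts` scoping rule, the helper below refuses to build arguments until at least one narrowing filter is present and defaults the time window to the last 7 days. The `scoped_alert_args` name is hypothetical, and passing `created_after` as an ISO 8601 string is an assumption; check the tool's schema for the real format.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Filters named in the scoping rule; at least one must be set.
NARROWING_FILTERS = ("warehouse", "table_names", "severity")


def scoped_alert_args(filters: dict, now: Optional[datetime] = None) -> dict:
    """Build `get_alerts` arguments, or raise if the query is too broad."""
    if not any(filters.get(name) for name in NARROWING_FILTERS):
        raise ValueError("Scope missing: ask for a warehouse, table, or severity first.")
    now = now or datetime.now(timezone.utc)
    # Drop empty values so they are not sent as filters.
    args = {key: value for key, value in filters.items() if value}
    # Default the time window to the last 7 days when none is given.
    args.setdefault("created_after", (now - timedelta(days=7)).isoformat())
    return args
```

A `ValueError` here corresponds to the point where the agent should stop and ask the scoping questions instead of calling the tool.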

**Transition to Step 2:** Once high-priority alert(s) are identified, tell the user:

> "I've identified [N] high-priority alerts. Let me investigate the root cause of [specific alert/table]. Moving to root cause analysis."

Then proceed to Step 2 with the identified alert context.

---

### Step 2: Root Cause Analysis

**Skill:** Read and follow `../analyze-root-cause/SKILL.md`

**Goal:** Investigate why the issue occurred — trace lineage, check ETL changes, analyze query modifications, profile data.

**This is the core step.** Most workflow entries start here.

**Investigate linearly — do not re-call tools.** Walk through the investigation once: (1) find the table, (2) fetch its alerts and freshness, (3) check lineage, (4) check recent queries/ETL. Call each tool at most once per table. If a tool result is insufficient, move to the next signal rather than re-calling with different params — burning turns on redundant calls exhausts the budget before the root cause is reached.
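
The "at most once per table" rule can be pictured as a guard keyed on the (tool, table) pair. The workflow does no state tracking of its own, so this is purely an illustration of the rule; the `CallOnce` class is hypothetical, and the tool names passed to it are whatever the caller uses.

```python
from typing import Any, Callable


class CallOnce:
    """Refuse a second call to the same tool for the same table."""

    def __init__(self) -> None:
        self._seen: set[tuple[str, str]] = set()

    def call(self, tool: str, table: str, fn: Callable[[], Any]) -> Any:
        key = (tool, table)
        if key in self._seen:
            # A repeat call would burn a turn; move to the next signal instead.
            raise RuntimeError(f"{tool} already called for {table}")
        self._seen.add(key)
        return fn()
```

The same tool may still be called for a different table, since the guard tracks the pair rather than the tool alone.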

**Transition to Step 3:** When the root cause is identified (or the investigation reaches its limit), summarize findings and tell the user:

> "Root cause identified: [summary]. Would you like me to help remediate this, or is the investigation sufficient?"

If the user wants to proceed, move to Step 3. If they say "that's enough", stop.

---

### Step 3: Remediation

**Skill:** Read and follow `../remediation/SKILL.md`

**Goal:** Fix the issue using available tools, or escalate with full context if the fix requires actions outside the agent's capability.

**Transition to Step 4:** After remediation is complete (fix applied or escalation documented), offer prevention:

> "The issue has been [fixed/escalated]. The root cause was [X]. Want me to help add a monitor to detect this type of issue earlier next time?"

If the user says yes, move to Step 4. If no, the workflow is complete.

---

### Step 4: Prevent Recurrence (optional)

**Skill:** Read and follow `../monitoring-advisor/SKILL.md`

When loading monitoring-advisor for this step, frame the request as direct monitor creation — not coverage analysis. The user already knows what they want to monitor (the thing that just broke). Example framing:

> "Based on the incident, I recommend adding a [freshness/volume/validation] monitor on [table]. Let me create the monitor configuration."

**Goal:** Add or update a monitor to catch this class of issue in the future.

**Do not force this step.** It is optional: offer it after remediation, and respect the user's choice if they decline.

---

## Orchestration Rules

- **Users can enter at any step.** The entry point section above determines where to start.
- **Each step loads the actual skill's SKILL.md** via relative path. This workflow does not replicate skill logic — it sequences it.
- **Context carries forward** naturally through the conversation. Alert IDs, table names, and root cause findings from earlier steps are available to later steps without explicit state passing.
- **No state tracking or hooks.** This is purely prompt-driven sequencing.
- **User can exit anytime.** If they say "that's enough" or "stop", respect it immediately.
- **Do not skip back.** The workflow moves forward. If the user wants to re-investigate after remediation, they can start a new workflow or invoke a skill directly.
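
The forward-only and exit-anytime rules above amount to a very small transition function. The sketch below is illustrative only, since the workflow is prompt-driven and keeps no state; `Step` and `next_step` are hypothetical names.

```python
from enum import IntEnum
from typing import Optional


class Step(IntEnum):
    TRIAGE = 1
    ROOT_CAUSE = 2
    REMEDIATION = 3
    PREVENT = 4


def next_step(current: Step, user_continues: bool) -> Optional[Step]:
    """Advance exactly one step, or return None when the workflow ends."""
    if not user_continues or current is Step.PREVENT:
        return None  # user said stop, or the last step is done
    return Step(current + 1)  # only ever moves forward
```

There is deliberately no way to return an earlier step: re-investigation after remediation means starting a new workflow or invoking a skill directly.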
