slack-incident-workflow
$ npx mdskill add automateyournetwork/netclaw/slack-incident-workflow
Coordinates network incident response workflows directly in Slack
- Manages incident declaration, tracking, escalation, and post-mortems in Slack
- Uses Slack API for channel management, status updates, and resource sharing
- Analyzes incident context from channel history, pinned items, and user availability
- Delivers structured incident updates and actionable next steps in real-time
---
name: slack-incident-workflow
description: "Manage network incident response workflows in Slack - incident channels, status updates, escalation, resolution tracking, and post-incident review coordination. Use when declaring a network incident, coordinating outage response in Slack, tracking incident status, or running a post-incident review."
license: Apache-2.0
user-invocable: true
metadata: { "openclaw": { "requires": { "bins": ["python3"] } } }
---
# Slack Incident Workflow
## Slack OAuth Scopes Used
| Scope | Purpose |
|-------|---------|
| `assistant:write` | Act as App Agent in incident threads |
| `chat:write` | Post incident updates |
| `channels:join` | Join incident channels |
| `channels:history` | Read channel context for investigation |
| `groups:history` | Access private incident channels |
| `groups:read` | View private channel info |
| `pins:read` | Reference pinned runbooks/procedures |
| `bookmarks:read` | Access saved incident resources |
| `bookmarks:write` | Save incident artifacts |
| `files:write` | Attach logs, configs, diagrams |
| `reactions:write` | Track incident status via reactions |
| `users:read` | Identify on-call engineers |
| `users.profile:read` | Check engineer availability |
| `dnd:read` | Respect Do Not Disturb before paging |
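
A missing scope usually surfaces only when a call fails mid-incident, so it can help to check the token up front. Slack Web API responses list the token's granted scopes in the `X-OAuth-Scopes` response header; the sketch below compares that comma-separated string against the table above. The helper name `missing_scopes` is hypothetical and not part of the skill.

```python
# Scopes from the table above.
REQUIRED_SCOPES = {
    "assistant:write", "chat:write", "channels:join", "channels:history",
    "groups:history", "groups:read", "pins:read", "bookmarks:read",
    "bookmarks:write", "files:write", "reactions:write", "users:read",
    "users.profile:read", "dnd:read",
}


def missing_scopes(granted_header: str) -> set:
    """Return the required scopes absent from a comma-separated scope string.

    `granted_header` is assumed to be the raw X-OAuth-Scopes header value.
    """
    granted = {s.strip() for s in granted_header.split(",") if s.strip()}
    return REQUIRED_SCOPES - granted
```

An empty result means the token covers every scope the workflow uses.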
## Incident Lifecycle in Slack
### Phase 1: Detection & Declaration
When a critical alert triggers (from the slack-network-alerts skill or a human report), post the declaration:
```
:rotating_light: *INCIDENT DECLARED — Network Outage*
*Severity:* P1 — Service Impacting
*Detected:* 2024-02-21 14:32 UTC
*Reporter:* NetClaw (automated) / @engineer1 (manual)
*Symptoms:*
• R1 unreachable (ping 0%)
• 47 downstream routes lost
• 3 OSPF adjacencies down
• BGP peer to ISP: IDLE
*Impact:*
• Site A has no WAN connectivity
• Estimated affected users: ~200
*Incident Commander:* [awaiting claim — react with :raised_hand: to take IC]
*ServiceNow:* [CR/INC pending]
━━━ *All investigation updates in this thread* ━━━
```
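
The declaration above can be rendered from structured alert data rather than assembled by hand. The following is a minimal sketch, assuming a hypothetical `Incident` dataclass and `render_declaration` helper; neither name comes from the skill itself.

```python
from dataclasses import dataclass, field

DASH = "\u2014"  # dash character used throughout the message templates
RULE = "\u2501" * 3  # heavy horizontal rule used for the thread banner


@dataclass
class Incident:
    title: str
    severity: str
    severity_label: str
    detected_utc: str
    reporter: str
    symptoms: list = field(default_factory=list)
    impact: list = field(default_factory=list)


def render_declaration(inc: Incident) -> str:
    """Build the Phase 1 declaration text in Slack mrkdwn."""
    lines = [
        f":rotating_light: *INCIDENT DECLARED {DASH} {inc.title}*",
        f"*Severity:* {inc.severity} {DASH} {inc.severity_label}",
        f"*Detected:* {inc.detected_utc}",
        f"*Reporter:* {inc.reporter}",
        "*Symptoms:*",
        *[f"• {s}" for s in inc.symptoms],
        "*Impact:*",
        *[f"• {i}" for i in inc.impact],
        f"*Incident Commander:* [awaiting claim {DASH} react with :raised_hand: to take IC]",
        "*ServiceNow:* [CR/INC pending]",
        f"{RULE} *All investigation updates in this thread* {RULE}",
    ]
    return "\n".join(lines)
```

The returned string could be passed as the `text` argument of Slack's `chat.postMessage`; the resulting message timestamp then anchors the incident thread.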
### Phase 2: Triage & Assignment
When an engineer reacts with :raised_hand::
```
:busts_in_silhouette: *Incident Team Formed*
*IC:* @engineer1 (claimed at 14:35 UTC)
*NetClaw:* Automated investigation assistant
*Triage Checklist:*
:white_check_mark: Alert generated and posted
:white_check_mark: Incident declared (P1)
:white_large_square: IC assigned → :white_check_mark: @engineer1
:white_large_square: ServiceNow incident created
:white_large_square: Upstream device checked
:white_large_square: Blast radius confirmed
:white_large_square: Customer communication sent
_NetClaw beginning automated investigation..._
```
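
The IC claim can be driven by Slack's `reaction_added` event, which carries the reaction name, the reacting user, and the channel and timestamp of the target message. The sketch below assumes incident state is kept in a plain dict with `channel`, `message_ts`, and `ic` keys; that layout and the `claim_ic` name are illustrative only.

```python
def claim_ic(incident: dict, event: dict) -> bool:
    """Assign the IC on the first :raised_hand: reaction to the declaration.

    Returns True only when this event actually claimed the role.
    """
    if event.get("type") != "reaction_added":
        return False
    if event.get("reaction") != "raised_hand":
        return False
    item = event.get("item", {})
    if item.get("channel") != incident["channel"]:
        return False
    if item.get("ts") != incident["message_ts"]:
        return False
    if incident.get("ic") is not None:
        return False  # someone already holds IC; later reactions are ignored
    incident["ic"] = event["user"]
    return True
```

First-reaction-wins keeps the assignment deterministic when several engineers react within seconds of each other.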
### Phase 3: Automated Investigation
NetClaw runs diagnostics and posts results in the thread:
```
:mag: *Automated Investigation — Step 1/4*
_Checking upstream device R2 for connectivity to R1..._
```
```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL \
"python3 -u $PYATS_MCP_SCRIPT" pyats_ping_from_network_device \
'{"device_name":"R2","command":"ping 10.1.1.1 repeat 10"}'
```
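
Hand-quoting the JSON argument in a shell string is error-prone once device names or commands vary. As a rough sketch, the same invocation can be assembled as an argument list in Python, assuming `MCP_CALL` and the server script variable are set in the environment as in the command above. `build_mcp_command` is a hypothetical helper, not part of NetClaw.

```python
import json
import os


def build_mcp_command(tool: str, args: dict,
                      server_script_var: str = "PYATS_MCP_SCRIPT") -> list:
    """Return the argv list for one MCP tool call.

    Mirrors: python3 $MCP_CALL "python3 -u $<server script>" <tool> '<json>'
    """
    mcp_call = os.environ["MCP_CALL"]
    server_script = os.environ[server_script_var]
    return [
        "python3",
        mcp_call,
        f"python3 -u {server_script}",  # server launch command, one argument
        tool,
        json.dumps(args, separators=(",", ":")),  # compact JSON, no shell quoting
    ]
```

The list can be handed to `subprocess.run(cmd, capture_output=True, text=True, check=True)`; the child process inherits the parent environment by default, so `PYATS_TESTBED_PATH` is passed through.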
Post the result of each step:
````
:mag: *Investigation Results — Step 1/4*
*Ping from R2 to R1 (10.1.1.1):* 0% success — R1 unreachable from upstream
:mag: *Investigation Results — Step 2/4*
*R2 interface Gi1 (toward R1):* up/up, 0 CRC errors, last input 4 min ago
→ Physical layer looks OK from R2 side
:mag: *Investigation Results — Step 3/4*
*R2 OSPF neighbors:* R1 missing from neighbor table (was FULL)
→ OSPF adjacency lost, DR election may be in progress
:mag: *Investigation Results — Step 4/4*
*R2 logs (last 30 min):*
```
14:31:47: %OSPF-5-ADJCHG: Nbr 1.1.1.1 on Gi1 from FULL to DOWN
14:31:48: %LINEPROTO-5-UPDOWN: Line protocol on Gi1, changed to down
14:32:01: %LINEPROTO-5-UPDOWN: Line protocol on Gi1, changed to up
14:32:15: %OSPF-5-ADJCHG: Nbr 1.1.1.1 on Gi1 from DOWN to INIT
```
*Analysis:* R2 saw Gi1 flap at 14:31. Line protocol came back up but OSPF hasn't re-converged. Likely physical issue on R1 side causing interface bounce.
````
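
The flap analysis in Step 4 can be approximated mechanically. This sketch parses log lines in the abbreviated format shown in the template (not the full vendor syslog format, which differs) and reports whether line protocol on an interface went down and later came back up. The regex and function names are assumptions for illustration.

```python
import re

# Matches the abbreviated lines used in the template, e.g.
# "14:31:48: %LINEPROTO-5-UPDOWN: Line protocol on Gi1, changed to down"
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}): "
    r"%(?P<facility>[A-Z_]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z_]+): "
    r"(?P<text>.*)$"
)


def parse_log(lines):
    """Return one dict per line that matches the expected format."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            events.append(match.groupdict())
    return events


def interface_flapped(events, interface):
    """True if line protocol on `interface` went down and then came back up."""
    seen_down = False
    for event in events:
        if event["mnemonic"] != "UPDOWN" or f"on {interface}," not in event["text"]:
            continue
        if event["text"].endswith("changed to down"):
            seen_down = True
        elif seen_down and event["text"].endswith("changed to up"):
            return True
    return False
```

A detected flap is only a hint for the analysis line; the conclusion about which side caused it still needs the interface counters and the neighbor's logs.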
### Phase 4: Status Updates
Post periodic status updates:
```
:hourglass_flowing_sand: *Status Update — 14:50 UTC (18 min elapsed)*
*Status:* Investigating
*Finding:* R1 appears to have reloaded unexpectedly. R2 sees the link recover but R1 is not responding to OSPF hellos yet. Possible crash or power event.
*Next Step:* Waiting for R1 to complete boot sequence. Checking console access.
*ETA:* Unknown — dependent on R1 recovery
_ServiceNow INC0012345 updated_
```
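
The elapsed time in the status header is easy to get wrong by hand. A small sketch, assuming the detection time is stored as a timezone-aware `datetime`; the `status_header` name is hypothetical.

```python
from datetime import datetime, timezone


def status_header(detected: datetime, now: datetime) -> str:
    """Format the first line of a status update with whole minutes elapsed."""
    elapsed_min = int((now - detected).total_seconds() // 60)
    return (
        f":hourglass_flowing_sand: *Status Update \u2014 {now:%H:%M} UTC "
        f"({elapsed_min} min elapsed)*"
    )
```

For the example incident, a detection time of 14:32 UTC and an update at 14:50 UTC give 18 minutes elapsed, matching the template above.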
### Phase 5: Resolution
```
:white_check_mark: *INCIDENT RESOLVED*
*Duration:* 34 minutes (14:32 — 15:06 UTC)
*Resolution:* R1 experienced a software crash (Traceback in logs). Device auto-reloaded and recovered. All OSPF adjacencies re-established. Full routing restored.
*Post-Resolution Verification:*
• R1 reachable: :white_check_mark: 100% ping success
• OSPF neighbors: :white_check_mark: 3/3 FULL
• BGP peer: :white_check_mark: Established
• Route count: :white_check_mark: 47 routes (matches baseline)
• Connectivity: :white_check_mark: 100% to all targets
*Root Cause:* Software crash — Traceback found in logs indicating bug CSCxx12345. TAC case recommended.
*ServiceNow:* INC0012345 resolved
*GAIT:* Session abc123 closed
```
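
The post-resolution checks amount to comparing current state against a stored baseline. A minimal sketch, assuming both are flat dicts with matching keys; the key names below are illustrative, and only the values come from the example.

```python
def verify_recovery(baseline: dict, current: dict) -> list:
    """Return a human-readable failure line for each metric that differs."""
    failures = []
    for key, expected in baseline.items():
        actual = current.get(key)
        if actual != expected:
            failures.append(f"{key}: expected {expected!r}, got {actual!r}")
    return failures


# Baseline values taken from the resolution example above.
BASELINE = {
    "ping_success_pct": 100,
    "ospf_full_neighbors": 3,
    "bgp_state": "Established",
    "route_count": 47,
}
```

An empty list would justify posting the resolved message; any entry suggests keeping the incident open.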
### Phase 6: Post-Incident Review
```
:clipboard: *Post-Incident Review — Scheduled*
*Incident:* Network Outage — R1 crash
*Date:* 2024-02-22 10:00 UTC
*Channel:* This thread
*Review Artifacts (attached):*
1. :page_facing_up: Timeline of events
2. :page_facing_up: R1 show logging output
3. :page_facing_up: R1 show version (confirms reload reason)
4. :page_facing_up: GAIT audit trail (full session)
5. :page_facing_up: Pre/post health check comparison
*Discussion Topics:*
• Was detection fast enough?
• Was automated investigation helpful?
• What monitoring gaps exist?
• Should R1 be upgraded to a patched version?
• Do we need a redundant path for this link?
```
## Escalation Matrix
```
:arrow_up: *Escalation Guide*
│ Severity │ Notify             │ Escalate After    │ Channel          │
│ P1       │ IC + Manager + NOC │ 15 min            │ #incidents       │
│ P2       │ IC + Team          │ 30 min            │ #netclaw-alerts  │
│ P3       │ Assigned engineer  │ 4 hours           │ #netclaw-alerts  │
│ P4       │ Queue only         │ Next business day │ #netclaw-general │
```
Before escalating, check the engineer's DND status:
- If the engineer has DND active, escalate to the next person in the rotation
- Never suppress a P1 escalation because of DND
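
The matrix and the DND rules can be sketched as a small routing function. This is one possible reading, labeled as an assumption: engineers with DND active are skipped in favor of the next person in the rotation, and for P1 the first person is paged anyway if everyone is in DND. The `dnd_active` flag would come from a DND lookup such as Slack's `dnd.info`; the `Engineer` type and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Minutes before escalating, from the matrix above. P4 is queue-only.
ESCALATE_AFTER_MIN = {"P1": 15, "P2": 30, "P3": 240}


@dataclass
class Engineer:
    name: str
    dnd_active: bool


def should_escalate(severity: str, minutes_elapsed: int) -> bool:
    """True once the incident has been open for the severity's threshold."""
    threshold = ESCALATE_AFTER_MIN.get(severity)
    return threshold is not None and minutes_elapsed >= threshold


def pick_escalation_target(severity: str, rotation: list) -> Optional[Engineer]:
    """Pick the first engineer without DND; never drop a P1 page."""
    for engineer in rotation:
        if not engineer.dnd_active:
            return engineer
    # Assumption: if everyone is in DND, P1 still pages the first in rotation.
    if severity == "P1" and rotation:
        return rotation[0]
    return None
```

Returning `None` for lower severities leaves the decision to the caller, for example retrying after the DND window ends.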
## Reaction-Based Status Tracking
| Reaction | Status | Meaning |
|----------|--------|---------|
| :rotating_light: | Declared | Incident is active |
| :raised_hand: | Claimed | IC has taken ownership |
| :mag: | Investigating | Active investigation |
| :wrench: | Fixing | Fix being applied |
| :hourglass: | Waiting | Waiting on external (vendor, ISP) |
| :white_check_mark: | Resolved | Incident resolved |
| :bookmark: | PIR Scheduled | Post-incident review planned |
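
Given the set of reactions on the declaration message, the current status can be derived from the table. The sketch below assumes the table order reflects lifecycle order and that the furthest-along reaction wins; the source does not state this, so treat it as one reasonable convention. Reaction names are written without colons, as they appear in Slack event payloads.

```python
# (reaction name, status) in the order of the table above.
STATUS_ORDER = [
    ("rotating_light", "Declared"),
    ("raised_hand", "Claimed"),
    ("mag", "Investigating"),
    ("wrench", "Fixing"),
    ("hourglass", "Waiting"),
    ("white_check_mark", "Resolved"),
    ("bookmark", "PIR Scheduled"),
]


def current_status(reactions):
    """Return the furthest-along status present, or None if none apply."""
    status = None
    for name, label in STATUS_ORDER:
        if name in reactions:
            status = label
    return status
```

Unrelated reactions such as `eyes` are ignored, so casual reactions in the thread do not change the tracked status.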
## ServiceNow Integration
Create the ServiceNow incident at Phase 1:
```bash
python3 $MCP_CALL "python3 -u $SERVICENOW_MCP_SCRIPT" create_incident \
'{"short_description":"P1 - R1 unreachable, WAN outage Site A","description":"R1 is unreachable. 47 routes lost, 3 OSPF adjacencies down. Impact: ~200 users at Site A without WAN connectivity.","urgency":"1","impact":"1","category":"Network"}'
```
Update the ServiceNow incident as the response progresses, and close it on resolution.
## GAIT Audit Trail
Record every phase in GAIT:
```bash
python3 $MCP_CALL "python3 -u $GAIT_MCP_SCRIPT" gait_record_turn \
'{"input":{"role":"assistant","content":"INCIDENT P1: R1 unreachable. Phase 1 declared, Phase 2 IC assigned @engineer1, Phase 3 automated investigation shows R1 crash, Phase 4 monitoring recovery, Phase 5 resolved after 34 min. INC0012345 closed.","artifacts":[]}}'
```
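
The JSON argument for `gait_record_turn` can be generated from per-phase notes instead of being typed inline. A minimal sketch that reproduces the payload shape shown above; `gait_turn_payload` and the `phase_notes` mapping are hypothetical names, and the accepted fields are assumed to be exactly those in the example.

```python
import json


def gait_turn_payload(severity, summary, phase_notes, artifacts=None):
    """Serialize one assistant turn summarizing the incident phases.

    `phase_notes` maps phase number to a short note, e.g. {1: "declared"}.
    """
    phases = ", ".join(
        f"Phase {number} {note}" for number, note in sorted(phase_notes.items())
    )
    content = f"INCIDENT {severity}: {summary}. {phases}."
    return json.dumps(
        {"input": {"role": "assistant", "content": content,
                   "artifacts": artifacts or []}}
    )
```

Using `json.dumps` keeps quotes and special characters in the notes correctly escaped, which is easy to get wrong in a single-quoted shell argument.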