pyats-health-check
$
npx mdskill add automateyournetwork/netclaw/pyats-health-checkMonitors and analyzes network device health metrics
- Solves device health monitoring for CPU, memory, interfaces, and NTP
- Uses pyATS tools and device APIs to collect and parse data
- Analyzes thresholds and patterns to detect anomalies or issues
- Returns structured JSON with health status and actionable insights
SKILL.md
.github/skills/pyats-health-checkView on GitHub ↗
---
name: pyats-health-check
description: "Comprehensive network device health monitoring - CPU, memory, interfaces, hardware, NTP, logging, environment, and uptime analysis. Use when running a device health check, monitoring CPU or memory usage, checking interface errors, or validating NTP sync."
license: Apache-2.0
user-invocable: true
metadata:
{ "openclaw": { "requires": { "bins": ["python3"], "env": ["PYATS_TESTBED_PATH"] } } }
netshell:
mcp_tools:
- mcp: pyats-mcp
tools:
- pyats_run_show_command
- pyats_parse_show_command
- pyats_get_device_info
- pyats_list_devices
---
# Device Health Check
## When to Use
- Proactive daily/weekly health monitoring
- Pre-change and post-change validation
- Incident response — first thing you run when alerted
- Capacity planning and trending
- Compliance checks for operational readiness
## Health Check Procedure
Always run health checks in this exact order. Each section builds on the previous one.
### Step 1: Device Identity & Uptime
Run `show version` to establish baseline identity.
```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show version"}'
```
**Extract and report:**
- Hostname, model, serial number
- IOS-XE version and image filename
- Uptime (flag if < 24 hours — indicates recent reload)
- Last reload reason (flag if unexpected: crash, power failure)
- Total/available memory
- License status
**Thresholds:**
- Uptime < 24h → WARNING: Recent reload
- Uptime < 1h → CRITICAL: Very recent reload, check for crash
- Last reload reason contains "crash" or "error" → CRITICAL
### Step 2: CPU Utilization
```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show processes cpu sorted"}'
```
**Thresholds (5-second / 1-minute / 5-minute averages):**
- < 50% → HEALTHY
- 50-75% → WARNING: Elevated CPU
- 75-90% → HIGH: Investigate top processes
- > 90% → CRITICAL: Immediate investigation required
**Top processes to watch:**
- `IP Input` — high traffic volume or routing loops
- `BGP Router` / `BGP I/O` — large BGP table or instability
- `OSPF-1 Hello` — OSPF adjacency issues
- `Crypto IKMP` / `Crypto Engine` — IPsec overhead
- `SNMP ENGINE` — polling storm
- `ARP Input` — ARP storm or L2 loop
### Step 3: Memory Utilization
```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show processes memory sorted"}'
```
Also run:
```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show platform resources"}'
```
**Thresholds:**
- Used < 70% → HEALTHY
- 70-85% → WARNING: Memory pressure
- 85-95% → HIGH: May impact routing table updates
- > 95% → CRITICAL: Risk of process crashes or OOM
**Memory consumers to watch:**
- `BGP Router` — large BGP table (full internet table = ~1M routes)
- `CEF process` — large FIB
- `OSPF Router` — large OSPF LSDB
- `HTTP CORE` — web server / RESTCONF overhead
- `IOSD iomem` — I/O memory for packet buffers
### Step 4: Interface Status
```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ip interface brief"}'
```
Then for each active interface, get detailed counters:
```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show interfaces"}'
```
**Report for each interface:**
- Admin status (up/down) and protocol status (up/down)
- IP address and subnet
- Speed, duplex, MTU
- Input/output rate (bps and pps)
- Error counters: CRC, input errors, output errors, drops, overruns
- Resets counter (flag if incrementing — indicates flapping)
- Last input/output timestamps
**Flags:**
- Interface up/down → WARNING: Check physical or protocol
- CRC errors > 0 → WARNING: Physical layer issue (cabling, optics, duplex mismatch)
- Input errors incrementing → WARNING: Packet corruption
- Output drops > 0 → WARNING: Congestion or QoS issue
- Resets incrementing → CRITICAL: Interface flapping
- Line protocol down on configured interface → CRITICAL
### Step 5: Hardware & Environment
```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show inventory"}'
```
```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show platform"}'
```
**Report:** Module status (ok/fail), serial numbers, PID, transceiver types and DOM readings.
### Step 6: NTP Synchronization
```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show ntp associations"}'
```
```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_run_show_command '{"device_name":"R1","command":"show clock"}'
```
**Flags:**
- No NTP peer synchronized (no `*` in associations) → CRITICAL for logging/forensics
- Clock offset > 100ms → WARNING
- Clock offset > 1s → CRITICAL
- No NTP configured at all → CRITICAL
### Step 7: System Logs
```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_show_logging '{"device_name":"R1"}'
```
**Scan for these patterns:**
- `%SYS-*-RELOAD` — reload events
- `%LINEPROTO-5-UPDOWN` — interface flaps
- `%OSPF-*-ADJCHG` — OSPF adjacency changes
- `%BGP-*-ADJCHANGE` — BGP peer state changes
- `%DUAL-*-NBRCHANGE` — EIGRP neighbor changes
- `%SYS-2-MALLOCFAIL` — memory allocation failure (CRITICAL)
- `%SYS-3-CPUHOG` — process monopolizing CPU (HIGH)
- `%TRACKING-*` — IP SLA or object tracking changes
- `%SEC-*` / `%AUTHMGR-*` — security events
- `%PLATFORM-*-CRASH` — crash events (CRITICAL)
- `Traceback` — software bug (CRITICAL — open TAC case)
### Step 8: Connectivity Validation
Test reachability to critical infrastructure:
```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_ping_from_network_device '{"device_name":"R1","command":"ping 8.8.8.8 repeat 5"}'
```
**Thresholds:**
- 100% success, RTT < 50ms → HEALTHY
- 100% success, RTT > 100ms → WARNING: High latency
- 80-99% success → WARNING: Packet loss
- < 80% success → CRITICAL: Significant packet loss
- 0% success → CRITICAL: No reachability
## Health Report Format
Always produce a summary table:
```
Device: R1 (devnetsandboxiosxec8k.cisco.com)
Model: C8000V | IOS-XE: 17.x.x | Uptime: XXd XXh
┌──────────────────┬──────────┬─────────────────────────┐
│ Check │ Status │ Details │
├──────────────────┼──────────┼─────────────────────────┤
│ CPU (5min avg) │ HEALTHY │ 12% │
│ Memory │ HEALTHY │ 45% used (1.2G/2.6G) │
│ Interfaces │ WARNING │ Gi2 down/down │
│ Hardware │ HEALTHY │ All modules OK │
│ NTP │ HEALTHY │ Synced, offset 2ms │
│ Logs │ WARNING │ 3 OSPF adjacency flaps │
│ Connectivity │ HEALTHY │ 100% to 8.8.8.8, 23ms │
└──────────────────┴──────────┴─────────────────────────┘
Overall: WARNING — 2 items need attention
```
Severity order: CRITICAL > HIGH > WARNING > HEALTHY. Overall status = worst individual status.
## NetBox Cross-Reference (MISSION02 Enhancement)
When NetBox is available ($NETBOX_MCP_SCRIPT is set), cross-reference device state against the source of truth after Steps 1 and 4:
### Interface State Validation
Query NetBox for expected interface states:
```bash
python3 $MCP_CALL "python3 -u $NETBOX_MCP_SCRIPT" netbox_get_objects '{"object_type":"dcim.interfaces","filters":{"device":"R1"},"brief":true}'
```
**Compare NetBox intent vs device reality:**
- NetBox shows interface enabled but device shows down → CRITICAL: Unexpected outage
- NetBox shows interface disabled but device shows up → WARNING: Undocumented activation
- Interface exists on device but not in NetBox → WARNING: Undocumented interface
- Interface in NetBox but not on device → WARNING: NetBox stale data
### IP Address Validation
Query NetBox for expected IP assignments:
```bash
python3 $MCP_CALL "python3 -u $NETBOX_MCP_SCRIPT" netbox_get_objects '{"object_type":"ipam.ip-addresses","filters":{"device":"R1"}}'
```
**Compare:** Flag any IP_DRIFT where the device IP differs from NetBox.
## Fleet-Wide Health (pCall)
To run health checks across ALL devices simultaneously, first list all devices:
```bash
PYATS_TESTBED_PATH=$PYATS_TESTBED_PATH python3 $MCP_CALL "python3 -u $PYATS_MCP_SCRIPT" pyats_list_devices
```
Then run Steps 1-8 on each device concurrently using multiple exec commands. Collect all results and produce a fleet summary:
```
┌──────────┬──────────┬──────┬────────┬──────────┬─────────────┐
│ Device │ CPU │ Mem │ Intf │ NTP │ Overall │
├──────────┼──────────┼──────┼────────┼──────────┼─────────────┤
│ R1 │ HEALTHY │ WARN │ HEALTHY│ HEALTHY │ WARNING │
│ R2 │ HEALTHY │ OK │ CRIT │ HEALTHY │ CRITICAL │
│ SW1 │ HIGH │ OK │ HEALTHY│ CRIT │ CRITICAL │
└──────────┴──────────┴──────┴────────┴──────────┴─────────────┘
```
Sort devices by severity (CRITICAL first) for triage prioritization.
## GAIT Audit Trail
After completing a health check, record the session in GAIT:
```bash
python3 $MCP_CALL "python3 -u $GAIT_MCP_SCRIPT" gait_record_turn '{"input":{"role":"assistant","content":"Health check completed on R1: CPU HEALTHY (12%), Memory WARNING (78%), Interfaces HEALTHY, NTP HEALTHY. Overall: WARNING.","artifacts":[]}}'
```