cluster-documents

Name: cluster-documents
Author: dandye/ai-runbooks

$ npx mdskill add dandye/ai-runbooks/cluster-documents

Group documents by topic using similarity analysis.

Organizes large document collections to find redundancies.
Depends on text normalization and vector embedding generation.
Uses clustering algorithms to group documents by similarity.
Delivers a structured report with optional visualizations.

SKILL.md

.github/skills/cluster-documents

---
name: cluster-documents
description: Automated content similarity and grouping analysis. Groups related documents by topic, purpose, or content similarity.
required_roles:
  scribe: roles/scribe.viewer
personas: [information-architect, data-analyst, researcher]
---

# Document Clustering Skill

Analyze a repository of documents to group them based on content similarity, topic, or purpose. This skill helps organize large collections, identify redundancies, and discover relationships.

## Inputs

- `PATH` - The repository to analyze (e.g., "/repository")
- `SIMILARITY_THRESHOLD` - (Optional) Float from 0.0 to 1.0; the minimum similarity at which documents are grouped together (default: 0.8)
- `VISUALIZATION` - (Optional) Boolean, whether to generate a visual representation (default: false)
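
As a rough illustration of the inputs above, the following sketch models them as a small validated structure. The class name `ClusterInputs` and its field names are hypothetical; the skill itself only defines the three parameters and their defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterInputs:
    """Hypothetical container for the skill's three inputs."""

    path: str                           # PATH (required)
    similarity_threshold: float = 0.8   # SIMILARITY_THRESHOLD default
    visualization: bool = False         # VISUALIZATION default

    def __post_init__(self) -> None:
        # Reject values outside the documented ranges early.
        if not self.path:
            raise ValueError("PATH is required")
        if not 0.0 <= self.similarity_threshold <= 1.0:
            raise ValueError("SIMILARITY_THRESHOLD must be between 0.0 and 1.0")
```

For example, `ClusterInputs(path="/repository")` picks up both defaults, while a threshold of `1.5` raises `ValueError`.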

## Workflow

### Step 1: Text Processing

Ingest documents from `PATH`.
- Normalize text (remove stop words, apply stemming or lemmatization).
- Generate embeddings or TF-IDF vectors for each document.
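
One way this step might look, as a minimal standard-library sketch: it lowercases and tokenizes each document, drops stop words, and builds sparse TF-IDF vectors with a smoothed IDF term. The function names, the tiny stop-word list, and the in-memory `docs` mapping are assumptions for illustration; stemming, lemmatization, and reading files from `PATH` are left out.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on"}


def normalize(text: str) -> list[str]:
    """Lowercase, split into alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs: dict[str, str]) -> dict[str, dict[str, float]]:
    """Return one sparse TF-IDF vector (term -> weight) per document name."""
    tokenized = {name: normalize(text) for name, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq: Counter = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    vectors = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty documents
        vectors[name] = {
            # term frequency * smoothed inverse document frequency
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        }
    return vectors
```

Terms that appear in fewer documents receive a higher weight, which is what lets later steps tell topics apart.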

### Step 2: Clustering Analysis

Apply clustering algorithms (e.g., K-Means, DBSCAN) to the document vectors.
- Group documents whose similarity to one another meets or exceeds `SIMILARITY_THRESHOLD`.
- Identify outliers or unique documents.
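
The sketch below shows one simple way to apply the threshold. It is neither K-Means nor DBSCAN: it links every pair of documents whose cosine similarity reaches the threshold and treats each connected group as a cluster (single-linkage grouping via union-find). The function names and the sparse `term -> weight` vector format are assumptions. Documents that link to nothing are reported as outliers.

```python
import math
from itertools import combinations


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors (term -> weight)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def group_by_threshold(
    vectors: dict[str, dict[str, float]], threshold: float = 0.8
) -> tuple[list[list[str]], list[str]]:
    """Return (clusters, outliers) using pairwise similarity >= threshold."""
    parent = {name: name for name in vectors}

    def find(x: str) -> str:
        # Follow parent links to the group root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the groups of every sufficiently similar pair.
    for a, b in combinations(vectors, 2):
        if cosine(vectors[a], vectors[b]) >= threshold:
            parent[find(a)] = find(b)

    groups: dict[str, list[str]] = {}
    for name in vectors:
        groups.setdefault(find(name), []).append(name)

    clusters = [sorted(members) for members in groups.values() if len(members) > 1]
    outliers = sorted(members[0] for members in groups.values() if len(members) == 1)
    return clusters, outliers
```

Because the comparison is pairwise, this approach is quadratic in the number of documents, so it suits small collections. A library clustering implementation would be the more likely choice at scale.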

### Step 3: Cluster Labeling

Analyze the centroid or representative terms of each cluster to assign a meaningful label (Topic).
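
A minimal sketch of centroid-based labeling, assuming each document is a sparse `term -> weight` vector: average the member vectors and use the highest-weighted terms as the label. The function name `label_cluster` and the `top_n` parameter are hypothetical.

```python
from collections import defaultdict


def label_cluster(member_vectors: list[dict[str, float]], top_n: int = 3) -> str:
    """Label a cluster with the top-weighted terms of its centroid."""
    centroid: dict[str, float] = defaultdict(float)
    for vec in member_vectors:
        for term, weight in vec.items():
            # Averaging the member vectors gives the centroid.
            centroid[term] += weight / len(member_vectors)

    # Sort by descending weight; break ties alphabetically for stable labels.
    ranked = sorted(centroid.items(), key=lambda item: (-item[1], item[0]))
    return ", ".join(term for term, _ in ranked[:top_n])
```

The resulting term list is a starting point; a human or a language model could turn it into a more readable topic name.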

### Step 4: Output Generation

Generate the clustering report.
- If `VISUALIZATION` is true, also produce scatter-plot coordinates or dendrogram data.

## Required Outputs

A `CLUSTERING_REPORT` object containing:
- **Cluster List**: ID, Label, and List of Documents in each cluster.
- **Redundancy Report**: Sets of highly similar documents (potential duplicates).
- **Visualization Data**: (If requested) Coordinates for plotting.
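
The skill does not specify a schema for `CLUSTERING_REPORT`, so the following is only one possible shape. All class and field names here are assumptions chosen to mirror the three items listed above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Cluster:
    id: int
    label: str
    documents: list[str]


@dataclass
class ClusteringReport:
    clusters: list[Cluster]        # Cluster List
    redundancy: list[list[str]]    # Redundancy Report: sets of near-duplicates
    # Visualization Data: plot coordinates, present only when requested.
    visualization: Optional[list[dict]] = None

    def to_json(self) -> str:
        """Serialize the whole report, including nested clusters."""
        return json.dumps(asdict(self), indent=2)
```

A report could then be built and serialized like this:

```python
report = ClusteringReport(
    clusters=[Cluster(0, "backup, restore", ["a.md", "b.md"])],
    redundancy=[["a.md", "b.md"]],
)
print(report.to_json())
```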

## Quick Reference

- **Purpose**: Organize unstructured content and find duplicates.
- **Techniques**: Text Mining, NLP, Vector Space Models.
