databricks-unstructured-pdf-generation

Name: databricks-unstructured-pdf-generation
Author: databricks/databricks-agent-skills

$npx mdskill add databricks/databricks-agent-skills/databricks-unstructured-pdf-generation

Convert HTML to PDF and store reports in Unity Catalog.

Creates test documents, demos, and evaluation datasets.
Integrates with plutoprint, Unity Catalog, and Python scripts.
Executes parallel conversion and uploads based on file timestamps.
Delivers PDFs to storage and generates JSON test questions.

SKILL.md

.github/skills/databricks-unstructured-pdf-generationView on GitHub ↗

---
name: databricks-unstructured-pdf-generation
description: "Generate PDF documents from HTML and upload to Unity Catalog volumes. Use for creating test PDFs, demo documents, reports, or evaluation datasets."
---

# PDF Generation from HTML

Convert HTML content to PDF documents and upload them to Unity Catalog Volumes.

## Workflow

1. Write HTML files to `./raw_data/html/` (write multiple files in parallel for speed)
2. Convert HTML → PDF using `<SKILL_ROOT>/scripts/pdf_generator.py` (parallel conversion)
3. Upload PDFs to Unity Catalog volume using `databricks fs cp`
4. Generate `doc_questions.json` with test questions for each document

> **Path convention:** `<SKILL_ROOT>` below = the directory containing this SKILL.md. Resolve to the absolute install path (e.g. `~/.claude/skills/databricks-unstructured-pdf-generation`). `./raw_data/...` paths are relative to your own project cwd.

## Dependencies

```bash
uv pip install plutoprint
```

## Step 1: Write HTML Files

```bash
mkdir -p ./raw_data/html
```

Write HTML documents to `./raw_data/html/filename.html`. Use subdirectories to organize (structure is preserved).

## Step 2: Convert to PDF

```bash
# Convert entire folder (parallel, 4 workers)
python <SKILL_ROOT>/scripts/pdf_generator.py convert --input ./raw_data/html --output ./raw_data/pdf
```

Skips files where PDF exists and is newer than HTML. Use `--force` to reconvert all.

## Step 3: Upload to Volume

`databricks fs` requires the `dbfs:` scheme prefix even for UC Volume paths. `-r` copies the *contents* of the source directory into the target (the source directory name is not preserved), so files land directly under `raw_data/`.

```bash
databricks fs cp -r --overwrite ./raw_data/pdf dbfs:/Volumes/my_catalog/my_schema/raw_data
```

## Step 4: Generate Test Questions

Create `./raw_data/pdf/pdf_eval_questions.json` with questions for Knowledge Assistant evaluation or MAS:

```json
{
  "api_errors_guide.pdf": {
    "question": "What is the solution for error ERR-4521?",
    "expected_fact": "Call /api/v2/auth/refresh with refresh_token before the 3600s TTL expires"
  },
  "installation_manual.pdf": {
    "question": "What port does the service use by default?",
    "expected_fact": "Port 8443 for HTTPS, configurable via CONFIG_PORT environment variable"
  }
}
```

This JSON can be used to build KA test cases and validate retrieval accuracy.

## Document Content Guidelines

When generating documents for Knowledge Assistant testing or demos:

- **Multi-page documents**: Each PDF should be several pages with substantial content
- **Specific error codes and solutions**: Include product-specific error codes, causes, and resolution steps
- **Technical details**: API endpoints, configuration parameters, version numbers, specific commands
- **Simple CSS**: Keep styling minimal for fast HTML creation and reliable PDF conversion
- **Queryable facts**: Include details a KA must read the document to answer (not general knowledge)

**Good document types:**
- Product user manuals with troubleshooting sections
- API error reference guides (error codes, causes, solutions)
- Installation/configuration guides with specific steps
- Technical specifications with version-specific details

**Example content:** Instead of generic "Connection failed" errors, write:
- "Error ERR-4521: OAuth token expired. Cause: Token TTL exceeded 3600s default. Solution: Call `/api/v2/auth/refresh` with your refresh_token before expiration. See Section 4.2 for token lifecycle management."

## CLI Reference

```
python <SKILL_ROOT>/scripts/pdf_generator.py convert [OPTIONS]

  --input, -i     Input HTML file or folder (required)
  --output, -o    Output folder for PDFs (required)
  --force, -f     Force reconvert (ignore timestamps)
  --workers, -w   Parallel workers (default: 4)
```

## Folder Structure

Subfolder structure is preserved:

```
./raw_data/html/                    ./raw_data/pdf/
├── report.html             →       ├── report.pdf
├── quarterly/                      ├── quarterly/
│   └── q1.html             →       │   └── q1.pdf
└── legal/                          └── legal/
    └── terms.html          →           └── terms.pdf
```

## Troubleshooting

| Issue | Solution |
|-------|----------|
| "plutoprint not installed" | `uv pip install plutoprint` |
| PDF looks wrong | Check HTML/CSS syntax |
| "Volume does not exist" | `databricks volumes create CATALOG SCHEMA VOLUME_NAME MANAGED` (four separate positional args, not `catalog.schema.volume`) |

More from databricks/databricks-agent-skills

Skill	Description
databricks-agent-bricks	Create Agent Bricks: Knowledge Assistants (KA) for document Q&A and Supervisor Agents for multi-agent orchestration (MAS).
databricks-ai-functions	Use Databricks built-in AI Functions (ai_classify, ai_extract, ai_summarize, ai_mask, ai_translate, ai_fix_grammar, ai_gen, ai_analyze_sentiment, ai_similarity, ai_parse_document, ai_query, ai_forecast) to add AI capabilities directly to SQL and PySpark pipelines without managing model endpoints. Also covers document parsing and building custom RAG pipelines (parse → chunk → index → query).
databricks-aibi-dashboards	Create Databricks AI/BI dashboards. Must use when creating, updating, or deploying Lakeview dashboards as Databricks Dashboard have a unique json structure. CRITICAL: You MUST test ALL SQL queries via CLI BEFORE deploying. Follow guidelines strictly.
databricks-apps	Build apps on Databricks Apps platform. Use when asked to create dashboards, data apps, analytics tools, or visualizations. Evaluates data access patterns (analytics vs Lakebase synced tables) before scaffolding. Invoke BEFORE starting implementation.
databricks-apps-python	Builds Databricks applications. Prefers AppKit (TypeScript + React SDK) for new apps; falls back to Python frameworks (Dash, Streamlit, Gradio, Flask, FastAPI, Reflex) when Python is required. Handles OAuth authorization, app resources, SQL warehouse and Lakebase connectivity, model serving, foundation model APIs, and deployment. Use when building web apps, dashboards, ML demos, or REST APIs for Databricks, or when the user mentions AppKit, Streamlit, Dash, Gradio, Flask, FastAPI, Reflex, or Databricks app.
databricks-core	Databricks CLI operations: auth, profiles, data exploration, and bundles. Contains up-to-date guidelines for Databricks-related CLI tasks.
databricks-dabs	Create, configure, validate, deploy, run, and manage DABs — Declarative Automation Bundles (formerly Databricks Asset Bundles) — for Databricks resources including dashboards, jobs, pipelines, alerts, volumes, and apps
databricks-dbsql	>-
databricks-docs	Databricks documentation reference via llms.txt index. Use when other skills do not cover a topic, looking up unfamiliar Databricks features, or needing authoritative docs on APIs, configurations, or platform capabilities.
databricks-execution-compute	>-