doc-reader

$ npx mdskill add HKUDS/Vibe-Trading/doc-reader

Extract text from PDFs, applying OCR to scanned or image-only pages.

  • Reads papers, annual reports, and research reports; text pages extract in milliseconds, OCR pages take about 1-3 seconds each.
  • Integrates with the read_document tool for direct file access.
  • Selects extraction method based on whether pages are text or images.
  • Returns structured JSON with page counts, OCR flags, and full text.

SKILL.md

.github/skills/doc-reader
---
name: doc-reader
description: Read PDF documents (papers, annual reports, research reports), automatically extracting text pages and applying OCR to image/scanned pages. Use the `read_document` tool.
category: tool
---
# PDF Document Reading

## Purpose

Read the full text of PDF documents and automatically handle two page types:
- **Text pages** (most papers and digital reports) → extracted directly in milliseconds
- **Image / scanned pages** (annual report charts, scanned research reports) → OCR recognition with Chinese and English support

Applicable to PDF documents such as papers, annual reports, research reports, announcements, and contracts.

## Usage

**Call the `read_document` tool directly (do not use bash to write a Python script):**

```
read_document(file_path="uploads/paper.pdf")
read_document(file_path="uploads/annual_report.pdf", pages="1-10")
read_document(file_path="uploads/research.pdf", pages="1,3,15-20")
```
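
The `pages` strings above combine single pages and ranges. The sketch below only illustrates how such a string could expand into page numbers; it is not the tool's own parser, which the source does not show. The helper name `expand_pages` is hypothetical, and 1-based, inclusive ranges are an assumption inferred from the examples and the `--- Page 1 ---` markers.

```python
def expand_pages(spec: str) -> list[int]:
    """Expand a page spec such as "1,3,15-20" into sorted, unique page numbers.

    Assumption: pages are 1-based and ranges are inclusive on both ends.
    """
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_s, end_s = part.split("-", 1)
            start, end = int(start_s), int(end_s)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_pages("1,3,15-20"))  # [1, 3, 15, 16, 17, 18, 19, 20]
```

This is for reasoning about which pages a call will cover; reading the PDF itself still goes through `read_document`.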

**Forbidden**: do not run a Python script from bash to read PDFs. Call the tool directly.

## Return Format

```json
{
  "status": "ok",
  "file": "paper.pdf",
  "total_pages": 45,
  "pages_read": 45,
  "ocr_pages": 3,
  "char_count": 52000,
  "truncated": true,
  "text": "--- Page 1 ---\n...\n--- Page 5 [OCR] ---\n..."
}
```

- `ocr_pages`: number of pages recognized via OCR (image / scanned pages)
- `truncated`: `true` when the extracted text exceeded 15000 characters and was cut off
- `[OCR]` indicates that the page content was obtained via image recognition
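
As a rough illustration of working with this return format, the sketch below splits the `text` field on its page markers and records which pages carry the `[OCR]` flag. The marker pattern is inferred from the single example above, so treat it as an assumption; `split_pages` and the sample response are hypothetical.

```python
import re

# Marker shape taken from the example response:
# "--- Page N ---" or "--- Page N [OCR] ---" (assumed to sit on its own line).
PAGE_MARKER = re.compile(r"^--- Page (\d+)( \[OCR\])? ---$", re.MULTILINE)


def split_pages(result: dict) -> list[dict]:
    """Split result["text"] into per-page records with an OCR flag."""
    text = result["text"]
    markers = list(PAGE_MARKER.finditer(text))
    pages = []
    for i, marker in enumerate(markers):
        # A page's body runs until the next marker, or to the end of the text.
        end = markers[i + 1].start() if i + 1 < len(markers) else len(text)
        pages.append({
            "page": int(marker.group(1)),
            "ocr": marker.group(2) is not None,
            "text": text[marker.end():end].strip(),
        })
    return pages


# Hypothetical response, shaped like the example above.
sample = {
    "status": "ok",
    "ocr_pages": 1,
    "truncated": False,
    "text": "--- Page 1 ---\nIntro text\n--- Page 2 [OCR] ---\nScanned text",
}

for page in split_pages(sample):
    print(page["page"], page["ocr"], page["text"])
```

Pages flagged as OCR are the ones worth double-checking, since recognition of charts and tables may be imperfect.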

## Typical Workflows

### Paper Summary
```
1. read_document(file_path="paper.pdf")  → get the full text
2. Analyze the text and extract the abstract, methodology, and conclusion
3. Output the summary
```

### Annual Report Analysis
```
1. read_document(file_path="annual_report.pdf", pages="1-5")  → read the summary first
2. Determine the key sections from the summary
3. read_document(file_path="...", pages="15-25")  → read the financial-data section
4. Extract key metrics
```

### Research Report Review
```
1. read_document(file_path="research.pdf")  → full text
2. Extract the core thesis, target price, and risk factors
```

## Notes

- Text longer than 15000 characters is truncated (`truncated: true`). Read long documents in chunks with the `pages` parameter
- OCR pages are slower (about 1-3 seconds per page), while pure text pages are processed in milliseconds
- OCR for charts and tables inside images may be imperfect, so complex tables should be checked manually
- Only PDF format is supported
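
The chunked-reading advice in the notes above can be planned with a small helper that turns a page count into `pages` strings. This is a minimal sketch: `page_chunks` and the default chunk size of 10 are assumptions, not part of the tool. The right chunk size depends on how dense the pages are.

```python
def page_chunks(total_pages: int, chunk_size: int = 10) -> list[str]:
    """Build `pages` strings ("1-10", "11-20", ...) that cover a document."""
    if total_pages < 1 or chunk_size < 1:
        raise ValueError("total_pages and chunk_size must be positive")
    chunks = []
    for start in range(1, total_pages + 1, chunk_size):
        end = min(start + chunk_size - 1, total_pages)
        # A trailing single page is written as "21" rather than "21-21".
        chunks.append(f"{start}-{end}" if end > start else str(start))
    return chunks


print(page_chunks(45))  # ['1-10', '11-20', '21-30', '31-40', '41-45']
```

The `total_pages` value can be taken from the first response. If a chunk still comes back with `truncated: true`, rerun that range with a smaller chunk size.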
