scrapeninja

Name: scrapeninja
Author: vm0-ai/vm0-skills

$npx mdskill add vm0-ai/vm0-skills/scrapeninja

Execute high-performance web scraping via ScrapeNinja API

Extracts text and data from any URL instantly
Integrates with ScrapeNinja API for backend execution
Selects rendering mode based on site complexity
Returns structured JSON with status and body length

SKILL.md

.github/skills/scrapeninjaView on GitHub ↗

---
name: scrapeninja
description: ScrapeNinja API for web scraping. Use when user mentions "ScrapeNinja",
  "scrape", "web scraping", or data extraction.
---

## Troubleshooting

If requests fail, run `zero doctor check-connector --env-name SCRAPENINJA_TOKEN` or `zero doctor check-connector --url https://scrapeninja.p.rapidapi.com/scrape --method POST`

## How to Use

### 1. Basic Scrape (Non-JS, Fast)

High-performance scraping with Chrome TLS fingerprint, no JavaScript:

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com"
}
```

Then run:

```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json | jq '{status: .info.statusCode, url: .info.finalUrl, bodyLength: (.body | length)}'
```

**With custom headers and retries:**

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "headers": ["Accept-Language: en-US"],
  "retryNum": 3,
  "timeout": 15
}
```

Then run:

```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json
```

### 2. Scrape with JavaScript Rendering

For JavaScript-heavy sites (React, Vue, etc.):

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "waitForSelector": "h1",
  "timeout": 20
}
```

Then run:

```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape-js" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json | jq '{status: .info.statusCode, bodyLength: (.body | length)}'
```

**With screenshot:**

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "screenshot": true
}
```

Then run:

```bash
# Get screenshot URL from response
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape-js" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json | jq -r '.info.screenshot'
```

### 3. Geo-Based Proxy Selection

Use proxies from specific regions:

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "geo": "eu"
}
```

Then run:

```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json | jq .info
```

Available geos: `us`, `eu`, `br` (Brazil), `fr` (France), `de` (Germany), `4g-eu`

### 4. Smart Retries

Retry on specific HTTP status codes or text patterns:

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "retryNum": 3,
  "statusNotExpected": [403, 429, 503],
  "textNotExpected": ["captcha", "Access Denied"]
}
```

Then run:

```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json
```

### 5. Extract Data with Cheerio

Extract structured JSON using Cheerio extractor functions:

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://news.ycombinator.com",
  "extractor": "function(input, cheerio) { let $ = cheerio.load(input); return $(\".titleline > a\").slice(0,5).map((i,el) => ({title: $(el).text(), url: $(el).attr(\"href\")})).get(); }"
}
```

Then run:

```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json | jq '.extractor'
```

### 6. Intercept AJAX Requests

Capture XHR/fetch responses:

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "catchAjaxHeadersUrlMask": "api/data"
}
```

Then run:

```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape-js" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json | jq '.info.catchedAjax'
```

### 7. Block Resources for Speed

Speed up JS rendering by blocking images and media:

Write to `/tmp/scrapeninja_request.json`:

```json
{
  "url": "https://example.com",
  "blockImages": true,
  "blockMedia": true
}
```

Then run:

```bash
curl -s -X POST "https://scrapeninja.p.rapidapi.com/scrape-js" --header "Content-Type: application/json" --header "X-RapidAPI-Key: $SCRAPENINJA_TOKEN" -d @/tmp/scrapeninja_request.json
```

## API Endpoints

| Endpoint | Description |
|----------|-------------|
| `/scrape` | Fast non-JS scraping with Chrome TLS fingerprint |
| `/scrape-js` | Full Chrome browser with JS rendering |
| `/v2/scrape-js` | Enhanced JS rendering for protected sites (APIRoad only) |

## Request Parameters

### Common Parameters (all endpoints)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `url` | string | required | URL to scrape |
| `headers` | string[] | - | Custom HTTP headers |
| `retryNum` | int | 1 | Number of retry attempts |
| `geo` | string | `us` | Proxy geo: us, eu, br, fr, de, 4g-eu |
| `proxy` | string | - | Custom proxy URL (overrides geo) |
| `timeout` | int | 10/16 | Timeout per attempt in seconds |
| `textNotExpected` | string[] | - | Text patterns that trigger retry |
| `statusNotExpected` | int[] | [403, 502] | HTTP status codes that trigger retry |
| `extractor` | string | - | Cheerio extractor function |

### JS Rendering Parameters (`/scrape-js`, `/v2/scrape-js`)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `waitForSelector` | string | - | CSS selector to wait for |
| `postWaitTime` | int | - | Extra wait time after load (1-12s) |
| `screenshot` | bool | true | Take page screenshot |
| `blockImages` | bool | false | Block image loading |
| `blockMedia` | bool | false | Block CSS/fonts loading |
| `catchAjaxHeadersUrlMask` | string | - | URL pattern to intercept AJAX |
| `viewport` | object | 1920x1080 | Custom viewport size |

## Response Format

```json
{
  "info": {
  "statusCode": 200,
  "finalUrl": "https://example.com",
  "headers": ["content-type: text/html"],
  "screenshot": "base64-encoded-png",
  "catchedAjax": {
  "url": "https://example.com/api/data",
  "method": "GET",
  "body": "...",
  "status": 200
  }
  },
  "body": "<html>...</html>",
  "extractor": { "extracted": "data" }
}
```

## Guidelines

1. **Start with `/scrape`**: Use the fast non-JS endpoint first, only switch to `/scrape-js` if needed
2. **Retries**: Set `retryNum` to 2-3 for unreliable sites
3. **Geo Selection**: Use `eu` for European sites, `us` for American sites
4. **Extractors**: Test extractors at https://scrapeninja.net/cheerio-sandbox/
5. **Blocked Sites**: For Cloudflare/Datadome protected sites, use `/v2/scrape-js` via APIRoad
6. **Screenshots**: Set `screenshot: false` to speed up JS rendering
7. **Rate Limits**: Check your plan limits on RapidAPI/APIRoad dashboard

## Tools

- **Playground**: https://scrapeninja.net/scraper-sandbox
- **Cheerio Sandbox**: https://scrapeninja.net/cheerio-sandbox
- **cURL Converter**: https://scrapeninja.net/curl-to-scraper

More from vm0-ai/vm0-skills

Skill	Description
account-reconciliation	Perform account reconciliations comparing general ledger balances against subledgers, bank statements, or external records. Use for bank reconciliation, GL-to-subledger reconciliation, intercompany reconciliation, balance sheet reconciliation, reconciling item analysis, outstanding item aging, or clearing open items.
agentphone	Build AI phone agents with AgentPhone API. Use when the user wants to make phone calls, send/receive SMS, manage phone numbers, create voice agents, set up webhooks, or check usage — anything related to telephony, phone numbers, or voice AI.
ahrefs	Ahrefs SEO API for backlink and keyword analysis. Use when user mentions
amplitude	Amplitude product analytics API. Use when user mentions "Amplitude",
analysis-qa	Quality-check a data analysis before sharing — verify joins, aggregations, denominators, time ranges, and metric definitions. Detect pitfalls like survivorship bias, average-of-averages, join explosion, timezone mismatches, incomplete periods, and selection bias. Includes documentation templates for reproducible analyses.
anthropic-managed-agents	Anthropic Managed Agents API for programmatically creating, running, and streaming AI agents on Anthropic's cloud infrastructure. Use when the user mentions "Managed Agents", "Anthropic agent sessions", or needs to create/run/stream an Anthropic agent with tool use (bash, git, web), attach GitHub repositories, or inject secrets via Vault. Do NOT use for standard Claude Messages API — use the Claude API skill instead.
apify	Apify web scraping platform. Use when user mentions "scrape website",
asana	Asana API for tasks and projects. Use when user mentions "Asana", "asana.com",
atlassian	Atlassian API for Confluence and Jira. Use when user mentions "Confluence
attio	Attio REST API for AI-native CRM operations — manage companies, people, deals, and custom objects, plus notes, tasks, lists, and comments. Use when the user mentions "Attio", "CRM record", "create company", "add person", "list entry", "CRM note", or "CRM task".