bright-data

Name: bright-data
Author: vm0-ai/vm0-skills

$ npx mdskill add vm0-ai/vm0-skills/bright-data

Execute large-scale web scraping and proxy services for data collection.

Gathers profiles, posts, and comments from Twitter/X, Reddit, YouTube, Instagram, TikTok, and LinkedIn.
Uses the Bright Data API to trigger asynchronous or synchronous data collection jobs.
Selects scraping targets from user-specified URLs or platform requirements.
Returns results as JSON snapshots (asynchronous jobs) or directly in the response (small synchronous requests).

SKILL.md

.github/skills/bright-data

---
name: bright-data
description: Bright Data proxy and web scraping API. Use when user mentions "Bright
  Data", "proxy", "web scraping at scale", or data collection.
---

## Troubleshooting

If requests fail, run either `zero doctor check-connector --env-name BRIGHTDATA_TOKEN` or `zero doctor check-connector --url https://api.brightdata.com/datasets/v3/trigger --method POST`.

## Social Media Scraping

Bright Data supports scraping these social media platforms:

| Platform | Profiles | Posts | Comments | Reels/Videos |
|----------|----------|-------|----------|--------------|
| Twitter/X | ✅ | ✅ | - | - |
| Reddit | - | ✅ | ✅ | - |
| YouTube | ✅ | ✅ | ✅ | - |
| Instagram | ✅ | ✅ | ✅ | ✅ |
| TikTok | ✅ | ✅ | ✅ | - |
| LinkedIn | ✅ | ✅ | - | - |

## How to Use

### 1. Trigger Scraping (Asynchronous)

Trigger a data collection job and get a `snapshot_id` for later retrieval.

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://twitter.com/username"},
  {"url": "https://twitter.com/username2"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

**Response:**
```json
{
  "snapshot_id": "s_m4x7enmven8djfqak"
}
```

### 2. Trigger Scraping (Synchronous)

Get results immediately in the response (for small requests).

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://www.reddit.com/r/technology/comments/xxxxx"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

### 3. Monitor Progress

Check the status of a scraping job (replace `<snapshot-id>` with your actual snapshot ID):

```bash
curl -s "https://api.brightdata.com/datasets/v3/progress/<snapshot-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN"
```

**Response:**
```json
{
  "snapshot_id": "s_m4x7enmven8djfqak",
  "dataset_id": "gd_xxxxx",
  "status": "running"
}
```

Status values: `running`, `ready`, `failed`
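
Because a job may stay in `running` for a while, the progress call is usually wrapped in a polling loop. The following is a minimal sketch, not an official client. The function names, `POLL_INTERVAL`, and `MAX_POLLS` are assumptions chosen for illustration, and the `sed` expression assumes the `status` key and value appear on one line.

```bash
# Hypothetical helper: print the "status" field of the progress response.
# jq -r '.status' would be sturdier; sed keeps the sketch dependency-free.
fetch_status() {
  curl -s "https://api.brightdata.com/datasets/v3/progress/$1" \
    -H "Authorization: Bearer $BRIGHTDATA_TOKEN" |
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Poll until the snapshot is ready (returns 0), or until it fails or the
# poll budget runs out (returns 1). The defaults are illustrative only.
wait_for_snapshot() {
  snapshot_id=$1
  polls=0
  while [ "$polls" -lt "${MAX_POLLS:-60}" ]; do
    state=$(fetch_status "$snapshot_id")
    case $state in
      ready)  return 0 ;;
      failed) echo "snapshot $snapshot_id failed" >&2; return 1 ;;
    esac
    polls=$((polls + 1))
    sleep "${POLL_INTERVAL:-5}"
  done
  echo "gave up waiting for snapshot $snapshot_id" >&2
  return 1
}
```

A call such as `wait_for_snapshot "$SNAPSHOT_ID" && echo "ready to download"` blocks until the job finishes or the poll budget is spent.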

### 4. Download Results

Once status is `ready`, download the collected data (replace `<snapshot-id>` with your actual snapshot ID):

```bash
curl -s "https://api.brightdata.com/datasets/v3/snapshot/<snapshot-id>?format=json" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN"
```

### 5. List Snapshots

Get all your snapshots:

```bash
curl -s "https://api.brightdata.com/datasets/v3/snapshots" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" | jq '.[] | {snapshot_id, dataset_id, status}'
```

### 6. Cancel Snapshot

Cancel a running job (replace `<snapshot-id>` with your actual snapshot ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/cancel?snapshot_id=<snapshot-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN"
```

## Platform-Specific Examples

### Twitter/X - Scrape Profile

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://twitter.com/elonmusk"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

**Returns:** `x_id`, `profile_name`, `biography`, `is_verified`, `followers`, `following`, `profile_image_link`

### Twitter/X - Scrape Posts

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://twitter.com/username/status/123456789"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

**Returns:** `post_id`, `text`, `replies`, `likes`, `retweets`, `views`, `hashtags`, `media`

### Reddit - Scrape Subreddit Posts

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://www.reddit.com/r/technology", "sort_by": "hot"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

**Parameters:** `url`, `sort_by` (new/top/hot)

**Returns:** `post_id`, `title`, `description`, `num_comments`, `upvotes`, `date_posted`, `community`

### Reddit - Scrape Comments

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://www.reddit.com/r/technology/comments/xxxxx/post_title"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

**Returns:** `comment_id`, `user_posted`, `comment_text`, `upvotes`, `replies`

### YouTube - Scrape Video Info

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

**Returns:** `title`, `views`, `likes`, `num_comments`, `video_length`, `transcript`, `channel_name`

### YouTube - Search by Keyword

Write to `/tmp/brightdata_request.json`:

```json
[
  {"keyword": "artificial intelligence", "num_of_posts": 50}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

### YouTube - Scrape Comments

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://www.youtube.com/watch?v=xxxxx", "load_replies": 3}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

**Returns:** `comment_text`, `likes`, `replies`, `username`, `date`

### Instagram - Scrape Profile

Write to `/tmp/brightdata_request.json`:

```json
[
  {"url": "https://www.instagram.com/username"}
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```

**Returns:** `followers`, `post_count`, `profile_name`, `is_verified`, `biography`

### Instagram - Scrape Posts

Write to `/tmp/brightdata_request.json`:

```json
[
  {
    "url": "https://www.instagram.com/username",
    "num_of_posts": 20,
    "start_date": "01-01-2024",
    "end_date": "12-31-2024"
  }
]
```

Then run (replace `<dataset-id>` with your actual dataset ID):

```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @/tmp/brightdata_request.json
```
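
Most of the examples above share one pattern: write a JSON array of `{"url": ...}` objects to `/tmp/brightdata_request.json`, then POST it. As a sketch, the file can be generated from a list of URLs. `build_url_request` is a hypothetical helper name, and it assumes the URLs contain no double quotes or backslashes, since it does no JSON escaping.

```bash
# Print a JSON array with one {"url": ...} object per argument.
# Assumption: URLs contain no double quotes or backslashes (no escaping is done).
build_url_request() {
  printf '[\n'
  sep=''
  for url in "$@"; do
    printf '%s  {"url": "%s"}' "$sep" "$url"
    # After the first element, separate entries with a comma and a newline.
    sep=',
'
  done
  printf '\n]\n'
}
```

For example, `build_url_request https://twitter.com/username https://twitter.com/username2 > /tmp/brightdata_request.json` reproduces the request file from step 1. Inputs with extra fields such as `num_of_posts` or `sort_by` still need to be written by hand.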

## Account Management

### Check Account Status

```bash
curl -s "https://api.brightdata.com/status" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN"
```

**Response:**
```json
{
  "status": "active",
  "customer": "hl_xxxxxxxx",
  "can_make_requests": true,
  "ip": "x.x.x.x"
}
```

### Get Active Zones

```bash
curl -s "https://api.brightdata.com/zone/get_active_zones" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN" | jq '.[] | {name, type}'
```

### Get Bandwidth Usage

```bash
curl -s "https://api.brightdata.com/customer/bw" \
  -H "Authorization: Bearer $BRIGHTDATA_TOKEN"
```

## Getting Dataset IDs

To use the scraping features, you need a `dataset_id`:

1. Go to [Bright Data Control Panel](https://brightdata.com/cp/datasets)
2. Create a new Web Scraper dataset or select an existing one
3. Choose the platform (Twitter, Reddit, YouTube, etc.)
4. Copy the `dataset_id` from the dataset settings

Dataset IDs can also be found in the bandwidth usage API response under the `data` field keys (e.g., `v__ds_api_gd_xxxxx` where `gd_xxxxx` is your dataset ID).
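
The key-to-ID mapping described above amounts to stripping the `v__ds_api_` prefix. A small sketch follows. `dataset_id_from_key` is an illustrative name, and the key format is taken only from the example in this section.

```bash
# Print the dataset ID embedded in a bandwidth-usage key such as
# v__ds_api_gd_xxxxx. Keys without that prefix are rejected (returns 1).
dataset_id_from_key() {
  case $1 in
    v__ds_api_*) printf '%s\n' "${1#v__ds_api_}" ;;
    *) return 1 ;;
  esac
}
```

Calling `dataset_id_from_key v__ds_api_gd_xxxxx` prints `gd_xxxxx`.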

## Common Parameters

| Parameter | Description | Example |
|-----------|-------------|---------|
| `url` | Target URL to scrape | `https://twitter.com/user` |
| `keyword` | Search keyword | `"artificial intelligence"` |
| `num_of_posts` | Limit number of results | `50` |
| `start_date` | Start of the date range (MM-DD-YYYY) | `"01-01-2024"` |
| `end_date` | End of the date range (MM-DD-YYYY) | `"12-31-2024"` |
| `sort_by` | Sort order (Reddit) | `new`, `top`, `hot` |
| `format` | Response format | `json`, `csv` |
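
Dates are often stored in ISO form (YYYY-MM-DD), so they need reordering before they go into `start_date` or `end_date`. The sketch below does that with plain parameter expansion. `to_bd_date` is a hypothetical helper name, and it only checks the shape of the input, not whether the date is a real calendar date.

```bash
# Convert YYYY-MM-DD to the MM-DD-YYYY form used by start_date / end_date.
to_bd_date() {
  case $1 in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) ;;
    *) echo "expected YYYY-MM-DD, got: $1" >&2; return 1 ;;
  esac
  year=${1%%-*}   # everything before the first dash
  rest=${1#*-}    # MM-DD, everything after the first dash
  printf '%s-%s\n' "$rest" "$year"
}
```

For instance, `to_bd_date 2024-12-31` prints `12-31-2024`.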

## Rate Limits

- Batch mode: up to 100 concurrent requests
- Maximum input size: 1GB per batch
- Exceeding these limits returns an HTTP `429` error
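
One common way to handle a `429` is to retry with an increasing delay. This is a general pattern rather than something this skill or Bright Data prescribes. In the sketch below, `retry_on_429`, `trigger_job`, `MAX_TRIES`, `BASE_DELAY`, and the response path `/tmp/brightdata_response.json` are all assumptions made for illustration.

```bash
# Run "$@" (a command that prints an HTTP status code) until it prints
# something other than 429, doubling the wait between attempts.
# Prints the final status code; returns 1 if every attempt was rate limited.
retry_on_429() {
  tries=0
  delay=${BASE_DELAY:-2}
  while :; do
    code=$("$@")
    if [ "$code" != "429" ]; then
      printf '%s\n' "$code"
      return 0
    fi
    tries=$((tries + 1))
    if [ "$tries" -ge "${MAX_TRIES:-5}" ]; then
      printf '%s\n' "$code"
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
  done
}

# Hypothetical wrapper: POST the request file for dataset $1, save the
# response body to a file, and print only the HTTP status code.
trigger_job() {
  curl -s -o /tmp/brightdata_response.json -w '%{http_code}' -X POST \
    "https://api.brightdata.com/datasets/v3/trigger?dataset_id=$1" \
    -H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
    -H "Content-Type: application/json" \
    -d @/tmp/brightdata_request.json
}
```

With these in place, `retry_on_429 trigger_job "<dataset-id>"` would retry the trigger call up to `MAX_TRIES` times and leave the last response body in the file.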

## Guidelines

1. **Create datasets first**: Use the Control Panel to create scraper datasets
2. **Use async for large jobs**: Use `/trigger` for discovery and batch operations
3. **Use sync for small jobs**: Use `/scrape` for single URL quick lookups
4. **Check status before download**: Poll `/progress` until status is `ready`
5. **Respect rate limits**: Don't exceed 100 concurrent requests
6. **Date format**: Use MM-DD-YYYY for date parameters
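
Guidelines 2 and 3 can be reduced to a small rule of thumb. The sketch below picks an endpoint from the number of inputs. `endpoint_for` is a hypothetical helper, and the cut-off of a single input is a simplification. Discovery-style inputs such as keyword searches belong on `/trigger` regardless of count.

```bash
# Print the endpoint suggested by the guidelines for a given input count:
# /scrape for a single URL, /trigger for anything larger.
endpoint_for() {
  if [ "$1" -le 1 ]; then
    echo "https://api.brightdata.com/datasets/v3/scrape"
  else
    echo "https://api.brightdata.com/datasets/v3/trigger"
  fi
}
```

The result can be combined with the dataset ID, for example `"$(endpoint_for 1)?dataset_id=<dataset-id>"`.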
