bright-data
$
npx mdskill add vm0-ai/vm0-skills/bright-dataExecute large-scale web scraping and proxy services for data collection.
- Enables gathering profiles, posts, and comments from Twitter, Reddit, YouTube, Instagram, TikTok, and LinkedIn.
- Depends on Bright Data API for triggering asynchronous or synchronous data collection jobs.
- Selects scraping targets based on user-specified URLs or platform requirements.
- Delivers results via JSON snapshots or immediate responses depending on request size.
SKILL.md
.github/skills/bright-dataView on GitHub ↗
---
name: bright-data
description: Bright Data proxy and web scraping API. Use when user mentions "Bright
Data", "proxy", "web scraping at scale", or data collection.
---
## Troubleshooting
If requests fail, run `zero doctor check-connector --env-name BRIGHTDATA_TOKEN` or `zero doctor check-connector --url https://api.brightdata.com/datasets/v3/trigger --method POST`
## Social Media Scraping
Bright Data supports scraping these social media platforms:
| Platform | Profiles | Posts | Comments | Reels/Videos |
|----------|----------|-------|----------|--------------|
| Twitter/X | ✅ | ✅ | - | - |
| Reddit | - | ✅ | ✅ | - |
| YouTube | ✅ | ✅ | ✅ | - |
| Instagram | ✅ | ✅ | ✅ | ✅ |
| TikTok | ✅ | ✅ | ✅ | - |
| LinkedIn | ✅ | ✅ | - | - |
## How to Use
### 1. Trigger Scraping (Asynchronous)
Trigger a data collection job and get a `snapshot_id` for later retrieval.
Write to `/tmp/brightdata_request.json`:
```json
[
{"url": "https://twitter.com/username"},
{"url": "https://twitter.com/username2"}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
**Response:**
```json
{
"snapshot_id": "s_m4x7enmven8djfqak"
}
```
### 2. Trigger Scraping (Synchronous)
Get results immediately in the response (for small requests).
Write to `/tmp/brightdata_request.json`:
```json
[
{"url": "https://www.reddit.com/r/technology/comments/xxxxx"}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
### 3. Monitor Progress
Check the status of a scraping job (replace `<snapshot-id>` with your actual snapshot ID):
```bash
curl -s "https://api.brightdata.com/datasets/v3/progress/<snapshot-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN"
```
**Response:**
```json
{
"snapshot_id": "s_m4x7enmven8djfqak",
"dataset_id": "gd_xxxxx",
"status": "running"
}
```
Status values: `running`, `ready`, `failed`
### 4. Download Results
Once status is `ready`, download the collected data (replace `<snapshot-id>` with your actual snapshot ID):
```bash
curl -s "https://api.brightdata.com/datasets/v3/snapshot/<snapshot-id>?format=json" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN"
```
### 5. List Snapshots
Get all your snapshots:
```bash
curl -s "https://api.brightdata.com/datasets/v3/snapshots" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" | jq '.[] | {snapshot_id, dataset_id, status}'
```
### 6. Cancel Snapshot
Cancel a running job (replace `<snapshot-id>` with your actual snapshot ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/cancel?snapshot_id=<snapshot-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN"
```
## Platform-Specific Examples
### Twitter/X - Scrape Profile
Write to `/tmp/brightdata_request.json`:
```json
[
{"url": "https://twitter.com/elonmusk"}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
**Returns:** `x_id`, `profile_name`, `biography`, `is_verified`, `followers`, `following`, `profile_image_link`
### Twitter/X - Scrape Posts
Write to `/tmp/brightdata_request.json`:
```json
[
{"url": "https://twitter.com/username/status/123456789"}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
**Returns:** `post_id`, `text`, `replies`, `likes`, `retweets`, `views`, `hashtags`, `media`
### Reddit - Scrape Subreddit Posts
Write to `/tmp/brightdata_request.json`:
```json
[
{"url": "https://www.reddit.com/r/technology", "sort_by": "hot"}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
**Parameters:** `url`, `sort_by` (new/top/hot)
**Returns:** `post_id`, `title`, `description`, `num_comments`, `upvotes`, `date_posted`, `community`
### Reddit - Scrape Comments
Write to `/tmp/brightdata_request.json`:
```json
[
{"url": "https://www.reddit.com/r/technology/comments/xxxxx/post_title"}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
**Returns:** `comment_id`, `user_posted`, `comment_text`, `upvotes`, `replies`
### YouTube - Scrape Video Info
Write to `/tmp/brightdata_request.json`:
```json
[
{"url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
**Returns:** `title`, `views`, `likes`, `num_comments`, `video_length`, `transcript`, `channel_name`
### YouTube - Search by Keyword
Write to `/tmp/brightdata_request.json`:
```json
[
{"keyword": "artificial intelligence", "num_of_posts": 50}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
### YouTube - Scrape Comments
Write to `/tmp/brightdata_request.json`:
```json
[
{"url": "https://www.youtube.com/watch?v=xxxxx", "load_replies": 3}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
**Returns:** `comment_text`, `likes`, `replies`, `username`, `date`
### Instagram - Scrape Profile
Write to `/tmp/brightdata_request.json`:
```json
[
{"url": "https://www.instagram.com/username"}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/scrape?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
**Returns:** `followers`, `post_count`, `profile_name`, `is_verified`, `biography`
### Instagram - Scrape Posts
Write to `/tmp/brightdata_request.json`:
```json
[
{
"url": "https://www.instagram.com/username",
"num_of_posts": 20,
"start_date": "01-01-2024",
"end_date": "12-31-2024"
}
]
```
Then run (replace `<dataset-id>` with your actual dataset ID):
```bash
curl -s -X POST "https://api.brightdata.com/datasets/v3/trigger?dataset_id=<dataset-id>" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" \
-H "Content-Type: application/json" \
-d @/tmp/brightdata_request.json
```
## Account Management
### Check Account Status
```bash
curl -s "https://api.brightdata.com/status" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN"
```
**Response:**
```json
{
"status": "active",
"customer": "hl_xxxxxxxx",
"can_make_requests": true,
"ip": "x.x.x.x"
}
```
### Get Active Zones
```bash
curl -s "https://api.brightdata.com/zone/get_active_zones" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN" | jq '.[] | {name, type}'
```
### Get Bandwidth Usage
```bash
curl -s "https://api.brightdata.com/customer/bw" \
-H "Authorization: Bearer $BRIGHTDATA_TOKEN"
```
## Getting Dataset IDs
To use the scraping features, you need a `dataset_id`:
1. Go to [Bright Data Control Panel](https://brightdata.com/cp/datasets)
2. Create a new Web Scraper dataset or select an existing one
3. Choose the platform (Twitter, Reddit, YouTube, etc.)
4. Copy the `dataset_id` from the dataset settings
Dataset IDs can also be found in the bandwidth usage API response under the `data` field keys (e.g., `v__ds_api_gd_xxxxx` where `gd_xxxxx` is your dataset ID).
## Common Parameters
| Parameter | Description | Example |
|-----------|-------------|---------|
| `url` | Target URL to scrape | `https://twitter.com/user` |
| `keyword` | Search keyword | `"artificial intelligence"` |
| `num_of_posts` | Limit number of results | `50` |
| `start_date` | Filter by date (MM-DD-YYYY) | `"01-01-2024"` |
| `end_date` | Filter by date (MM-DD-YYYY) | `"12-31-2024"` |
| `sort_by` | Sort order (Reddit) | `new`, `top`, `hot` |
| `format` | Response format | `json`, `csv` |
## Rate Limits
- Batch mode: up to 100 concurrent requests
- Maximum input size: 1GB per batch
- Exceeding limits returns `429` error
## Guidelines
1. **Create datasets first**: Use the Control Panel to create scraper datasets
2. **Use async for large jobs**: Use `/trigger` for discovery and batch operations
3. **Use sync for small jobs**: Use `/scrape` for single URL quick lookups
4. **Check status before download**: Poll `/progress` until status is `ready`
5. **Respect rate limits**: Don't exceed 100 concurrent requests
6. **Date format**: Use MM-DD-YYYY for date parameters