cudf-analytics
$ npx mdskill add langchain-ai/deepagents/cudf-analytics
Accelerate large-scale data analysis by performing computations directly on NVIDIA GPUs.
- Handles statistical summaries, grouping, and outlier detection on tabular data.
- Integrates with NVIDIA cuDF for GPU-accelerated data manipulation.
- Triggers when tasks require profiling or complex aggregations on datasets.
- Returns results via a pandas-compatible structure, optimized for downstream use.
SKILL.md
.github/skills/cudf-analytics
---
name: cudf-analytics
description: Use for GPU-accelerated data analysis on datasets, CSVs, or tabular data using NVIDIA cuDF. Triggers when tasks involve groupby aggregations, statistical summaries, anomaly detection, or large-scale data profiling.
---
# cuDF Analytics Skill
GPU-accelerated data analysis using NVIDIA RAPIDS cuDF. cuDF provides a pandas-like API that runs on NVIDIA GPUs, which can substantially speed up operations on large datasets.
## When to Use This Skill
Use this skill when:
- Analyzing CSV files, datasets, or tabular data
- Computing statistical summaries (mean, median, std, quartiles)
- Performing groupby aggregations
- Detecting anomalies or outliers in data
- Profiling datasets with millions of rows
- Computing correlation matrices
## Initialization (REQUIRED)
Always start every script with this boilerplate. It verifies that GPU operations actually run (not just that `cudf` imports) and falls back to pandas when they do not.
```python
import pandas as pd

try:
    import cudf

    # Smoke-test: verify GPU compute AND host transfer both work
    _test = cudf.Series([1, 2, 3])
    assert _test.sum() == 6
    assert _test.to_pandas().tolist() == [1, 2, 3]
    GPU = True
except Exception as e:
    print(f"[GPU] cudf unavailable, falling back to pandas: {e}")
    GPU = False


def read_csv(path):
    return cudf.read_csv(path) if GPU else pd.read_csv(path)


def to_pd(df):
    """Convert cuDF DataFrame/Series to pandas. Use this instead of .to_pandas() directly."""
    if not GPU:
        return df
    try:
        return df.to_pandas()
    except Exception as e:
        print(f"[GPU] .to_pandas() failed, using Arrow fallback: {e}")
        return df.to_arrow().to_pandas()
```
## Quick Reference
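cuDF mirrors much of the pandas API. The snippets below assume the initialization boilerplate (`read_csv`, `to_pd`) has already run. Common operations: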
### Read Data
```python
df = read_csv("data.csv")
```
### Statistical Summary
```python
# Use to_pd() when you need pandas output
summary = to_pd(df[["value", "score"]].describe())
# Scalar values work directly with float()
mean_val = float(df["value"].mean())
q1 = float(df["value"].quantile(0.25))
# Correlation
corr = float(df["value"].corr(df["score"]))
```
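The "When to Use This Skill" list mentions correlation matrices, while the snippet above computes only a single pairwise coefficient. The following is a minimal sketch of the matrix form, assuming `DataFrame.corr()` with its default Pearson method. The column names and data are made up for illustration, and the backend selection at the top is a condensed stand-in for the initialization boilerplate so the block runs on its own.

```python
import pandas as pd

try:
    import cudf
    GPU = bool(cudf.Series([1, 2, 3]).sum() == 6)
except Exception:
    GPU = False

xdf = cudf if GPU else pd  # active DataFrame library


def to_pd(obj):
    """Condensed to_pd(): copy to host memory only when running on the GPU."""
    return obj.to_pandas() if GPU else obj


# Illustrative data; the column names are hypothetical.
df = xdf.DataFrame({
    "value": [1.0, 2.0, 3.0, 4.0],
    "score": [2.0, 4.0, 6.0, 8.0],
    "noise": [4.0, 1.0, 3.0, 2.0],
})

# Pairwise Pearson correlations between the selected numeric columns.
corr_matrix = to_pd(df[["value", "score", "noise"]].corr())
print(corr_matrix.round(3))
```

Selecting the numeric columns explicitly before calling `corr()` keeps string columns out of the computation.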
### Groupby Aggregation
```python
result = df.groupby("category").agg({
    "revenue": ["sum", "mean", "count"],
    "quantity": ["sum", "mean"],
})
result_pd = to_pd(result)
```
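Passing lists of aggregations, as above, yields two-level column labels such as `("revenue", "sum")`, which can be awkward to print or export. One possible way to flatten them after converting to pandas is sketched here. The data, the column names, and the `_` separator are illustrative assumptions, and the backend selection is a condensed stand-in for the initialization boilerplate.

```python
import pandas as pd

try:
    import cudf
    GPU = bool(cudf.Series([1, 2, 3]).sum() == 6)
except Exception:
    GPU = False

xdf = cudf if GPU else pd  # active DataFrame library


def to_pd(obj):
    """Condensed to_pd(): copy to host memory only when running on the GPU."""
    return obj.to_pandas() if GPU else obj


# Illustrative data; the column names are hypothetical.
df = xdf.DataFrame({
    "category": ["a", "a", "b"],
    "revenue": [10.0, 20.0, 5.0],
    "quantity": [1, 2, 3],
})

result = df.groupby("category").agg({
    "revenue": ["sum", "mean"],
    "quantity": ["sum"],
})
result_pd = to_pd(result)

# Join the two column levels into single names, e.g. "revenue_sum".
result_pd.columns = ["_".join(col) for col in result_pd.columns]

# Turn the group key back into a column and sort explicitly, since
# group order should not be relied on across backends.
result_pd = result_pd.reset_index().sort_values("category").reset_index(drop=True)
print(result_pd)
```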
### Anomaly Detection (IQR Method)
```python
col = "value"
Q1 = float(df[col].quantile(0.25))
Q3 = float(df[col].quantile(0.75))
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
outliers = to_pd(df[(df[col] < lower) | (df[col] > upper)])
```
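To make the IQR rule concrete, here is a small worked sketch on six made-up values. The helper name `iqr_bounds` and its `k` parameter are hypothetical, introduced only for illustration. The backend selection is a condensed stand-in for the initialization boilerplate.

```python
import pandas as pd

try:
    import cudf
    GPU = bool(cudf.Series([1, 2, 3]).sum() == 6)
except Exception:
    GPU = False

xdf = cudf if GPU else pd  # active DataFrame library


def to_pd(obj):
    """Condensed to_pd(): copy to host memory only when running on the GPU."""
    return obj.to_pandas() if GPU else obj


def iqr_bounds(series, k=1.5):
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = float(series.quantile(0.25))
    q3 = float(series.quantile(0.75))
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Illustrative data: five typical values and one extreme value.
df = xdf.DataFrame({"value": [10.0, 12.0, 11.0, 13.0, 12.0, 100.0]})

lower, upper = iqr_bounds(df["value"])
outliers = to_pd(df[(df["value"] < lower) | (df["value"] > upper)])
print(lower, upper)
print(outliers)
```

With pandas' default linear interpolation the quartiles are 11.25 and 12.75, so the fences are 9.0 and 15.0 and only the value 100.0 is flagged.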
### Anomaly Detection (Z-Score Method)
```python
col = "value"  # same column as in the IQR example
mean = float(df[col].mean())
std = float(df[col].std())
df["z_score"] = (df[col] - mean) / std
anomalies = to_pd(df[df["z_score"].abs() > 3])
```
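The z-score snippet divides by the standard deviation, which is zero for a constant column. A hedged sketch of a guarded variant follows. The function name `zscore_anomalies` and its `threshold` parameter are hypothetical, and the backend selection is a condensed stand-in for the initialization boilerplate.

```python
import pandas as pd

try:
    import cudf
    GPU = bool(cudf.Series([1, 2, 3]).sum() == 6)
except Exception:
    GPU = False

xdf = cudf if GPU else pd  # active DataFrame library


def to_pd(obj):
    """Condensed to_pd(): copy to host memory only when running on the GPU."""
    return obj.to_pandas() if GPU else obj


def zscore_anomalies(df, col, threshold=3.0):
    """Return rows whose |z-score| exceeds threshold, as a pandas DataFrame."""
    mean = float(df[col].mean())
    std = float(df[col].std())
    if not std > 0:  # zero or NaN spread: z-scores are undefined
        return to_pd(df.head(0))
    z = (df[col] - mean) / std
    return to_pd(df[z.abs() > threshold])


# Illustrative data: twenty identical readings plus one extreme value.
df = xdf.DataFrame({"value": [10.0] * 20 + [1000.0]})
anomalies = zscore_anomalies(df, "value")
print(anomalies)

# A constant column yields an empty result instead of a division by zero.
constant = xdf.DataFrame({"value": [5.0] * 10})
print(len(zscore_anomalies(constant, "value")))
```

Note that with the sample standard deviation, a single value among n rows can reach a z-score of at most (n - 1) / sqrt(n), so a threshold of 3 never fires on very small datasets (roughly ten rows or fewer).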
### Filtering and Selection
```python
# Filter rows
filtered = df[df["status"] == "active"]
# Select columns
subset = df[["name", "revenue", "date"]]
# Sort
sorted_df = df.sort_values("revenue", ascending=False)
# Convert to pandas for final output / iteration
result_pd = to_pd(sorted_df)
```
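The operations above can be chained. As a rough sketch, the following combines two filter conditions with `&` (each condition needs its own parentheses), then sorts and keeps the top rows. The data, column names, and thresholds are made up, and the backend selection is a condensed stand-in for the initialization boilerplate.

```python
import pandas as pd

try:
    import cudf
    GPU = bool(cudf.Series([1, 2, 3]).sum() == 6)
except Exception:
    GPU = False

xdf = cudf if GPU else pd  # active DataFrame library


def to_pd(obj):
    """Condensed to_pd(): copy to host memory only when running on the GPU."""
    return obj.to_pandas() if GPU else obj


# Illustrative data; the column names are hypothetical.
df = xdf.DataFrame({
    "name": ["a", "b", "c", "d"],
    "status": ["active", "inactive", "active", "active"],
    "revenue": [100.0, 500.0, 300.0, 50.0],
})

# Combine boolean masks with & (and) or | (or); wrap each condition in parentheses.
mask = (df["status"] == "active") & (df["revenue"] >= 100.0)

# Filter, sort, and trim on the active backend; convert only the small final result.
top = to_pd(df[mask].sort_values("revenue", ascending=False).head(2))
print(top)
```

Converting only the final, reduced frame keeps the heavy filtering and sorting on the GPU when one is available.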
## Data Type Requirements
cuDF works best when column types are explicit rather than left to inference:
- Use `float32` or `float64` for floating-point data
- Use `int32` or `int64` for integer data
- String columns use cuDF's string dtype automatically
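One way to apply these type choices is the `dtype` argument of `read_csv`, which both pandas and cuDF accept as a column-to-type mapping. The sketch below writes a tiny CSV to a temporary file so it runs on its own. The file contents and column names are invented, and the backend selection is a condensed stand-in for the initialization boilerplate.

```python
import os
import tempfile

import pandas as pd

try:
    import cudf
    GPU = bool(cudf.Series([1, 2, 3]).sum() == 6)
except Exception:
    GPU = False

xdf = cudf if GPU else pd  # active DataFrame library

# Illustrative CSV; the column names are hypothetical.
csv_text = "id,value,label\n1,0.5,a\n2,1.5,b\n3,2.5,c\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as fh:
    fh.write(csv_text)
    path = fh.name

try:
    # Pin the numeric columns; the string column is left to inference.
    df = xdf.read_csv(path, dtype={"id": "int32", "value": "float32"})
finally:
    os.remove(path)

dtypes = {name: str(dtype) for name, dtype in df.dtypes.items()}
print(dtypes)
```

Choosing the 32-bit types halves the memory used by those columns relative to the 64-bit defaults, at the cost of range and precision, so it is a trade-off to weigh per column.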
## Output Guidelines
When reporting analysis results:
- Include dataset dimensions (rows x columns)
- Show key statistics in formatted tables
- Highlight notable patterns, trends, or anomalies
- Provide both summary statistics and specific examples
- Note any data quality issues (missing values, outliers)
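As a rough illustration of these guidelines, the sketch below assembles a plain-text report with dimensions, missing-value counts, summary statistics, and a few example rows. The function name `profile_report`, its `max_examples` parameter, and the report layout are all hypothetical. The backend selection is a condensed stand-in for the initialization boilerplate.

```python
import pandas as pd

try:
    import cudf
    GPU = bool(cudf.Series([1, 2, 3]).sum() == 6)
except Exception:
    GPU = False

xdf = cudf if GPU else pd  # active DataFrame library


def to_pd(obj):
    """Condensed to_pd(): copy to host memory only when running on the GPU."""
    return obj.to_pandas() if GPU else obj


def profile_report(df, max_examples=3):
    """Build a short text report following the output guidelines."""
    n_rows, n_cols = df.shape
    lines = [f"Dataset: {n_rows} rows x {n_cols} columns"]

    # Data quality: list only the columns that actually have missing values.
    missing = to_pd(df.isnull().sum())
    for name, count in missing.items():
        if count > 0:
            lines.append(f"Missing values in {name}: {int(count)}")

    # Summary statistics for numeric columns, rounded for readability.
    lines.append("Summary statistics:")
    lines.append(to_pd(df.describe()).round(2).to_string())

    # Specific examples alongside the summary.
    lines.append("Sample rows:")
    lines.append(to_pd(df.head(max_examples)).to_string(index=False))
    return "\n".join(lines)


# Illustrative data with one missing value; column names are hypothetical.
df = xdf.DataFrame({
    "value": [1.0, 2.0, None, 4.0],
    "label": ["a", "b", "c", "d"],
})
report = profile_report(df)
print(report)
```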