bio-machine-learning-model-validation
$
npx mdskill add GPTomics/bioSkills/bio-machine-learning-model-validationValidates machine learning models on biomedical data using nested cross-validation
- Solves overfitting and data leakage in biomarker discovery on omics datasets
- Uses scikit-learn's StratifiedKFold and cross_val_score for nested loops
- Separates hyperparameter tuning from performance evaluation to ensure unbiased results
- Returns reliable performance metrics for classifiers on small biomedical datasets
SKILL.md
.github/skills/bio-machine-learning-model-validationView on GitHub ↗
---
name: bio-machine-learning-model-validation
description: Implements nested cross-validation and stratified splits for unbiased model evaluation on biomedical datasets. Prevents data leakage and overfitting in biomarker discovery. Use when validating classifiers or optimizing hyperparameters on omics data.
tool_type: python
primary_tool: sklearn
---
## Version Compatibility
Reference examples tested with: numpy 1.26+, scikit-learn 1.4+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Cross-Validation for Biomedical Data
**"Properly validate my omics classifier"** → Use nested cross-validation with stratified splits to get unbiased performance estimates while tuning hyperparameters on small biomedical datasets.
- Python: `sklearn.model_selection.cross_val_score()` with `StratifiedKFold` inner/outer loops
## Why Nested CV Matters
Simple train/test splits overestimate performance on small omics datasets. Nested CV provides unbiased estimates by separating hyperparameter tuning from performance evaluation.
## Nested Cross-Validation
**Goal:** Obtain unbiased performance estimates by separating hyperparameter tuning from evaluation.
**Approach:** Use an outer CV loop for scoring and an inner CV loop for grid search, preventing information leakage between tuning and evaluation.
```python
from sklearn.model_selection import cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
import numpy as np
pipe = Pipeline([
('scaler', StandardScaler()),
('clf', RandomForestClassifier(random_state=42))
])
param_grid = {
'clf__n_estimators': [50, 100, 200],
'clf__max_depth': [5, 10, None]
}
# Outer CV: performance estimation (5 folds)
# Inner CV: hyperparameter tuning (3 folds)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
nested_scores = []
for train_idx, test_idx in outer_cv.split(X, y):
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
grid = GridSearchCV(pipe, param_grid, cv=inner_cv, scoring='roc_auc', n_jobs=-1)
grid.fit(X_train, y_train)
score = grid.score(X_test, y_test)
nested_scores.append(score)
print(f'Nested CV AUC: {np.mean(nested_scores):.3f} +/- {np.std(nested_scores):.3f}')
```
## Stratified K-Fold
**Goal:** Evaluate model performance while preserving class proportions in each fold.
**Approach:** Split data into stratified folds and compute cross-validated scores to account for class imbalance.
```python
from sklearn.model_selection import StratifiedKFold, cross_val_score
# Always stratify for class imbalance
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
print(f'CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}')
```
## Repeated Stratified K-Fold
**Goal:** Produce more stable performance estimates by averaging across multiple CV repetitions.
**Approach:** Repeat stratified K-fold splitting with different random seeds and aggregate scores across all iterations.
```python
from sklearn.model_selection import RepeatedStratifiedKFold
# More robust estimate with multiple repeats
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
print(f'Repeated CV AUC: {scores.mean():.3f} +/- {scores.std():.3f}')
```
## Leave-One-Out (Small Datasets)
**Goal:** Maximize training data when sample size is very small (n < 30).
**Approach:** Hold out one sample at a time for testing and train on all remaining samples, then aggregate predictions.
```python
from sklearn.model_selection import LeaveOneOut, cross_val_predict
# Use for very small datasets (n < 30)
loo = LeaveOneOut()
y_pred = cross_val_predict(pipe, X, y, cv=loo, method='predict_proba')[:, 1]
auc = roc_auc_score(y, y_pred)
print(f'LOO AUC: {auc:.3f}')
```
## Group-Aware Splits
**Goal:** Prevent data leakage when samples from the same patient or batch are correlated.
**Approach:** Use group-aware splitting to ensure all samples from a single group stay in the same fold.
```python
from sklearn.model_selection import GroupKFold, LeaveOneGroupOut
# When samples from same patient/batch must stay together
groups = meta['patient_id'].values
group_cv = GroupKFold(n_splits=5)
scores = cross_val_score(pipe, X, y, cv=group_cv, groups=groups, scoring='roc_auc')
```
## CV Strategy Selection
| Dataset Size | Strategy | Notes |
|--------------|----------|-------|
| n > 100 | StratifiedKFold(5) | Standard choice |
| n = 50-100 | StratifiedKFold(10) | More train data per fold |
| n < 30 | LeaveOneOut | Maximum train data |
| Repeated measures | GroupKFold | Keep patients together |
| High variance | RepeatedStratifiedKFold | More stable estimates |
## Avoiding Data Leakage
**Goal:** Ensure feature selection does not use test-fold information, which inflates performance estimates.
**Approach:** Embed feature selection inside a pipeline so it executes independently within each CV fold.
```python
# WRONG: Feature selection before CV
# selected = SelectKBest(k=100).fit_transform(X, y) # Leaks info!
# scores = cross_val_score(clf, selected, y, cv=cv)
# CORRECT: Feature selection inside CV
from sklearn.feature_selection import SelectKBest
pipe = Pipeline([
('scaler', StandardScaler()),
('select', SelectKBest(k=100)), # Done per fold
('clf', RandomForestClassifier())
])
scores = cross_val_score(pipe, X, y, cv=cv, scoring='roc_auc')
```
## Related Skills
- machine-learning/omics-classifiers - Model training
- experimental-design/multiple-testing - Multiple hypothesis correction
- machine-learning/biomarker-discovery - Feature selection within CV