bio-experimental-design-sample-size
$ npx mdskill add GPTomics/bioSkills/bio-experimental-design-sample-size
Calculate required replicates for statistical significance
- Estimates sample sizes needed to detect expected effect sizes in RNA-seq and other omics studies.
- Depends on ssizeRNA, DESeq2, powsimR, and pilot dispersion estimates from biological data.
- Bases recommendations on the target power, the expected fold change, and the variability (dispersion) inputs.
- Delivers integer replicate counts per group so that experiments reach the target statistical power.
SKILL.md
.github/skills/bio-experimental-design-sample-size
---
name: bio-experimental-design-sample-size
description: Estimates required sample sizes for differential expression, ChIP-seq, methylation, and proteomics studies. Use when budgeting experiments, writing grant proposals, or determining minimum replicates needed to achieve statistical significance for expected effect sizes.
tool_type: r
primary_tool: ssizeRNA
---
## Version Compatibility
Reference examples tested with: DESeq2 1.42+
Before using code patterns, verify installed versions match. If versions differ:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
If code fails with an R error such as `could not find function` or `unused argument`, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Sample Size Estimation
**"How many samples do I need for my experiment?"** → Estimate required biological replicates per group for a target power level given expected effect sizes and variability.
- R: `ssizeRNA::ssizeRNA_single()`, `DESeq2` pilot dispersion estimates
- scRNA-seq: `powsimR::simulateDE()`
## RNA-seq Sample Size
```r
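library(ssizeRNA)
# Estimate sample size for RNA-seq
# nGenes = total genes, pi0 = proportion of non-DE genes
# m = pseudo sample size used for the simulated data (not a gene count)
# mu = mean count in the control group, disp = dispersion
# fc = fold change, fdr = target FDR, power = target power
result <- ssizeRNA_single(nGenes = 20000, pi0 = 0.9, m = 200,
                          mu = 10, disp = 0.1, fc = 2,
                          fdr = 0.05, power = 0.8)
result$ssize  # Required n per group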
```
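For intuition about how these inputs interact, a closed-form normal approximation for a single gene under a negative binomial model can be sketched in base R. This is not the method ssizeRNA uses, which targets FDR across many genes. The helper name `nb_samples_per_group` and the use of an unadjusted per-gene significance level `alpha` in place of an FDR target are assumptions made for illustration.

```r
# Rough per-gene approximation (assumption: NB counts, Wald test on log fold change):
#   n per group ~ 2 * (z_{1-alpha/2} + z_{power})^2 * (1/mu + disp) / log(fc)^2
# nb_samples_per_group is a hypothetical helper, not part of ssizeRNA.
nb_samples_per_group <- function(mu, disp, fc, alpha = 0.05, power = 0.8) {
  stopifnot(mu > 0, disp >= 0, fc > 0, fc != 1)
  z <- qnorm(1 - alpha / 2) + qnorm(power)
  n <- 2 * z^2 * (1 / mu + disp) / log(fc)^2
  ceiling(n)
}

# Same mu, disp, and fc as the ssizeRNA example above
n_approx <- nb_samples_per_group(mu = 10, disp = 0.1, fc = 2)
n_approx
```

Because `alpha` here is an unadjusted per-gene level, an FDR-controlled calculation will generally call for more replicates than this approximation suggests.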
## DESeq2-based Estimation
**Goal:** Derive realistic dispersion estimates from pilot RNA-seq data for use in power and sample size calculations.
**Approach:** Run DESeq2 on pilot count data to estimate per-gene dispersions, then extract the median dispersion as a representative variability parameter for power formulas.
```r
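library(DESeq2)
# From pilot data: pilot_counts is a genes x samples integer count matrix,
# colData a data frame with a `condition` column (one row per sample)
dds_pilot <- DESeqDataSetFromMatrix(countData = pilot_counts,
                                    colData = colData,
                                    design = ~ condition)
dds_pilot <- DESeq(dds_pilot)
# Extract gene-wise dispersion estimates (before shrinkage) for power calculation;
# dispersions(dds_pilot) returns the final fitted values instead
gene_disp <- mcols(dds_pilot)$dispGeneEst
median_disp <- median(gene_disp, na.rm = TRUE)
# Use median_disp in power calculations, e.g. as `disp` in ssizeRNA_single()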
```
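To make concrete what the dispersion parameter encodes, the sketch below simulates negative binomial counts with a known dispersion and recovers it with a method-of-moments estimate, using the relation variance = mu + disp * mu^2. It uses only base R and simulated data. It is a sanity check on the concept, not a replacement for DESeq2's estimators, and all variable names and values are illustrative assumptions.

```r
set.seed(42)
true_disp <- 0.1   # assumed dispersion for the simulation
true_mu <- 100     # assumed mean count

# rnbinom() parameterizes dispersion as size = 1 / disp
counts <- rnbinom(20000, mu = true_mu, size = 1 / true_disp)

# Method of moments: var = mu + disp * mu^2  =>  disp = (var - mu) / mu^2
mu_hat <- mean(counts)
disp_hat <- (var(counts) - mu_hat) / mu_hat^2

# Biological coefficient of variation is the square root of the dispersion
bcv_hat <- sqrt(disp_hat)
c(disp_hat = disp_hat, bcv_hat = bcv_hat)
```

With only a handful of pilot samples per gene, a per-gene moment estimate like this is very noisy, which is one reason to summarize across genes (for example with the median) as in the DESeq2 snippet.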
## Single-cell Sample Size
```r
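library(powsimR)
# Estimate for scRNA-seq; accounts for dropout and cell-to-cell variability.
# Argument names follow powsimR >= 1.2; verify with ?estimateParam, ?Setup, ?simulateDE.
# Assumes pilot_sce is a SingleCellExperiment holding raw counts.
params <- estimateParam(countData = SingleCellExperiment::counts(pilot_sce),
                        RNAseq = "singlecell", Protocol = "UMI",
                        Distribution = "NB", Normalisation = "scran",
                        GeneFilter = 0.1, SampleFilter = 3)
setup <- Setup(ngenes = 10000, nsims = 25, p.DE = 0.1, pLFC = 1,
               n1 = 100, n2 = 100, estParamRes = params)
power <- simulateDE(SetupRes = setup, Normalisation = "scran",
                    DEmethod = "limma-trend")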
```
## Sample Size by Assay Type
| Assay | Minimum replicates per group | Replicates per group for small effects |
|-------|-----------------|-------------------|
| Bulk RNA-seq | 3 | 6-12 |
| scRNA-seq | 3 samples, 1000 cells | 6+ samples |
| ATAC-seq | 2 | 4-6 |
| ChIP-seq | 2 | 3-4 |
| Proteomics | 3 | 6-10 |
| Methylation | 4 | 8-12 |
## Budget Optimization
When resources are limited, prioritize:
1. Biological replicates over technical replicates
2. More samples over deeper sequencing (once depth reaches ~20M reads per sample for RNA-seq)
3. Balanced designs (equal n per group)
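
The trade-off in point 2 can be illustrated with a rough single-gene normal approximation for negative binomial counts, in which the required n per group is proportional to (1/mu + disp). The sketch below is base R only. The helper `approx_n`, the baseline mean count, and the assumption that mean counts scale linearly with sequencing depth are all illustrative, and the per-gene `alpha` is not an FDR target.

```r
# Hypothetical helper: approximate replicates per group for one gene
# (assumption: NB counts, per-gene alpha rather than FDR control)
approx_n <- function(mu, disp, fc, alpha = 0.05, power = 0.8) {
  z <- qnorm(1 - alpha / 2) + qnorm(power)
  ceiling(2 * z^2 * (1 / mu + disp) / log(fc)^2)
}

# Assumption: mean count scales linearly with sequencing depth
depth_multiplier <- c(1, 2, 4, 8)
base_mu <- 10      # illustrative mean count at the reference depth
disp <- 0.1        # biological dispersion, unaffected by depth

n_by_depth <- approx_n(mu = base_mu * depth_multiplier, disp = disp, fc = 2)
# Limit as depth grows without bound: only the dispersion term remains
n_floor <- approx_n(mu = Inf, disp = disp, fc = 2)

data.frame(depth_multiplier, n_per_group = n_by_depth, n_floor)
```

Under these assumptions, extra depth lowers the required n only until the 1/mu term becomes small relative to the dispersion. Beyond that point, adding replicates is the only way to gain power.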
## Related Skills
- experimental-design/power-analysis - Power calculations
- experimental-design/batch-design - Optimal batch assignment
- single-cell/preprocessing - scRNA-seq experimental design