bio-de-visualization
$
npx mdskill add GPTomics/bioSkills/bio-de-visualizationGenerate DESeq2 and edgeR differential expression visualizations
- Solves tasks requiring plots of MA, dispersion estimates, counts, BCV, heatmaps, or p-value histograms.
- Depends on R environment with DESeq2, edgeR, ggplot2, limma, and matplotlib libraries installed.
- Decides actions by verifying package versions match required minimums before executing code patterns.
- Delivers results through built-in plotting functions adapted to the user's specific API version.
SKILL.md
.github/skills/bio-de-visualizationView on GitHub ↗
---
name: bio-de-visualization
description: Visualize differential expression results using DESeq2/edgeR built-in functions. Covers plotMA, plotDispEsts, plotCounts, plotBCV, sample distance heatmaps, and p-value histograms. Use when visualizing differential expression results.
tool_type: r
primary_tool: DESeq2
---
## Version Compatibility
Reference examples tested with: DESeq2 1.42+, edgeR 4.0+, ggplot2 3.5+, limma 3.58+, matplotlib 3.8+
Before using code patterns, verify installed versions match. If versions differ:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# DE Visualization
Create visualizations for differential expression analysis using DESeq2 and edgeR built-in plotting functions.
## Scope
This skill covers **DE-specific built-in functions**:
- DESeq2: `plotMA()`, `plotPCA()`, `plotDispEsts()`, `plotCounts()`
- edgeR: `plotMD()`, `plotBCV()`, `plotMDS()`
- Sample distance heatmaps and p-value distributions
**For custom ggplot2/matplotlib implementations** of volcano, MA, and PCA plots, see `data-visualization/specialized-omics-plots`.
## Required Libraries
```r
library(DESeq2)
library(ggplot2)
library(pheatmap)
library(RColorBrewer)
library(ggrepel) # For labeled points
```
## Installation
```r
install.packages(c('ggplot2', 'pheatmap', 'RColorBrewer', 'ggrepel'))
# Optional: Enhanced volcano plots
BiocManager::install('EnhancedVolcano')
```
## MA Plot
**Goal:** Visualize the relationship between mean expression and log fold change to assess DE results.
**Approach:** Plot log fold change against mean normalized counts, highlighting significant genes.
**"Make an MA plot of my DE results"** → Plot mean expression vs. fold change with significant genes colored, using plotMA or ggplot2.
### DESeq2 MA Plot
```r
# Built-in MA plot
plotMA(res, ylim = c(-5, 5), main = 'MA Plot')
# With custom alpha
plotMA(res, alpha = 0.05, ylim = c(-5, 5))
# Highlight specific genes
plotMA(res, ylim = c(-5, 5))
with(subset(res, padj < 0.01 & abs(log2FoldChange) > 2),
points(baseMean, log2FoldChange, col = 'red', pch = 20))
```
### Custom ggplot2 MA Plot
```r
res_df <- as.data.frame(res)
res_df$significant <- res_df$padj < 0.05 & !is.na(res_df$padj)
ggplot(res_df, aes(x = log10(baseMean), y = log2FoldChange, color = significant)) +
geom_point(alpha = 0.5, size = 1) +
scale_color_manual(values = c('grey60', 'red')) +
geom_hline(yintercept = 0, linetype = 'dashed') +
labs(x = 'log10(Mean Expression)', y = 'log2 Fold Change', title = 'MA Plot') +
theme_bw() +
theme(legend.position = 'bottom')
```
### edgeR MA Plot
```r
# Using plotMD (mean-difference plot)
plotMD(qlf, main = 'MD Plot')
abline(h = c(-1, 1), col = 'blue', lty = 2)
```
## Volcano Plot
**Goal:** Display statistical significance against fold change magnitude to identify the most important DE genes.
**Approach:** Plot -log10(p-value) vs. log2 fold change with threshold lines and optional gene labels.
**"Create a volcano plot of differentially expressed genes"** → Scatter plot of fold change vs. significance with colored significance regions and labeled top hits.
### Basic Volcano Plot
```r
res_df <- as.data.frame(res)
res_df$significant <- res_df$padj < 0.05 & abs(res_df$log2FoldChange) > 1
ggplot(res_df, aes(x = log2FoldChange, y = -log10(pvalue), color = significant)) +
geom_point(alpha = 0.5, size = 1) +
scale_color_manual(values = c('grey60', 'red')) +
geom_vline(xintercept = c(-1, 1), linetype = 'dashed', color = 'blue') +
geom_hline(yintercept = -log10(0.05), linetype = 'dashed', color = 'blue') +
labs(x = 'log2 Fold Change', y = '-log10(p-value)', title = 'Volcano Plot') +
theme_bw()
```
### Volcano with Gene Labels
```r
res_df <- as.data.frame(res)
res_df$gene <- rownames(res_df)
res_df$significant <- res_df$padj < 0.05 & abs(res_df$log2FoldChange) > 1
# Label top genes
top_genes <- head(res_df[order(res_df$padj), ], 10)
ggplot(res_df, aes(x = log2FoldChange, y = -log10(pvalue))) +
geom_point(aes(color = significant), alpha = 0.5, size = 1) +
scale_color_manual(values = c('grey60', 'red')) +
geom_text_repel(data = top_genes, aes(label = gene),
size = 3, max.overlaps = 20) +
geom_vline(xintercept = c(-1, 1), linetype = 'dashed') +
geom_hline(yintercept = -log10(0.05), linetype = 'dashed') +
labs(x = 'log2 Fold Change', y = '-log10(p-value)') +
theme_bw()
```
### EnhancedVolcano
```r
library(EnhancedVolcano)
EnhancedVolcano(res,
lab = rownames(res),
x = 'log2FoldChange',
y = 'pvalue',
pCutoff = 0.05,
FCcutoff = 1,
title = 'Differential Expression',
subtitle = 'Treatment vs Control')
```
## PCA Plot
**Goal:** Assess sample clustering and identify batch effects or outliers via dimensionality reduction.
**Approach:** Apply variance-stabilizing transformation then project samples onto principal components, coloring by experimental variables.
**"Show me a PCA plot of my samples"** → Perform PCA on transformed expression data and visualize sample separation by condition and batch.
### DESeq2 PCA
```r
# Variance stabilizing transformation first
vsd <- vst(dds, blind = FALSE)
# Basic PCA
plotPCA(vsd, intgroup = 'condition')
# With more options
plotPCA(vsd, intgroup = c('condition', 'batch'), ntop = 500)
```
### Custom PCA with ggplot2
```r
vsd <- vst(dds, blind = FALSE)
pca_data <- plotPCA(vsd, intgroup = c('condition', 'batch'), returnData = TRUE)
percentVar <- round(100 * attr(pca_data, 'percentVar'))
ggplot(pca_data, aes(x = PC1, y = PC2, color = condition, shape = batch)) +
geom_point(size = 4) +
xlab(paste0('PC1: ', percentVar[1], '% variance')) +
ylab(paste0('PC2: ', percentVar[2], '% variance')) +
ggtitle('PCA Plot') +
theme_bw() +
theme(legend.position = 'right')
```
### edgeR PCA (via limma)
```r
library(limma)
log_cpm <- cpm(y, log = TRUE)
plotMDS(log_cpm, col = as.numeric(group), pch = 16)
legend('topright', legend = levels(group), col = 1:nlevels(group), pch = 16)
```
## Heatmaps
**Goal:** Visualize expression patterns of significant genes across samples to reveal clusters and condition effects.
**Approach:** Z-score normalize VST-transformed counts for significant genes and cluster with pheatmap, annotating by condition.
**"Make a heatmap of the top differentially expressed genes"** → Extract significant genes, z-score normalize, and create a clustered heatmap with sample annotations.
### Top DE Genes Heatmap
```r
library(pheatmap)
# Get top significant genes
sig_genes <- rownames(subset(res, padj < 0.01))
# Get normalized counts
vsd <- vst(dds, blind = FALSE)
mat <- assay(vsd)[sig_genes, ]
# Scale by row (z-score)
mat_scaled <- t(scale(t(mat)))
# Create annotation
annotation_col <- data.frame(
condition = colData(dds)$condition,
row.names = colnames(mat)
)
pheatmap(mat_scaled,
annotation_col = annotation_col,
show_rownames = FALSE,
clustering_distance_rows = 'correlation',
clustering_distance_cols = 'correlation',
color = colorRampPalette(c('blue', 'white', 'red'))(100),
main = 'Top DE Genes')
```
### Sample Distance Heatmap
```r
vsd <- vst(dds, blind = FALSE)
# Calculate sample distances
sampleDists <- dist(t(assay(vsd)))
sampleDistMatrix <- as.matrix(sampleDists)
# Annotation
annotation <- data.frame(
condition = colData(dds)$condition,
row.names = colnames(dds)
)
pheatmap(sampleDistMatrix,
annotation_col = annotation,
annotation_row = annotation,
clustering_distance_rows = sampleDists,
clustering_distance_cols = sampleDists,
color = colorRampPalette(c('white', 'steelblue'))(100),
main = 'Sample Distance Matrix')
```
### Gene Expression Heatmap
```r
# Select genes of interest
genes_of_interest <- c('gene1', 'gene2', 'gene3', 'gene4', 'gene5')
mat <- assay(vsd)[genes_of_interest, ]
pheatmap(mat,
scale = 'row',
annotation_col = annotation_col,
show_rownames = TRUE,
cluster_cols = TRUE,
cluster_rows = TRUE,
main = 'Genes of Interest')
```
## Dispersion Plot
**Goal:** Assess the fit of the dispersion model to verify DE analysis assumptions.
**Approach:** Plot gene-wise, fitted, and final dispersion estimates against mean expression.
### DESeq2
```r
plotDispEsts(dds, main = 'Dispersion Estimates')
```
### edgeR
```r
plotBCV(y, main = 'Biological Coefficient of Variation')
```
## Counts Plot for Individual Genes
**Goal:** Visualize expression of a specific gene across samples and conditions.
**Approach:** Extract per-sample counts for a gene and plot by condition using plotCounts or ggplot2.
### DESeq2
```r
# Plot counts for a specific gene
plotCounts(dds, gene = 'GENE_NAME', intgroup = 'condition')
# With ggplot2
d <- plotCounts(dds, gene = 'GENE_NAME', intgroup = 'condition', returnData = TRUE)
ggplot(d, aes(x = condition, y = count, color = condition)) +
geom_point(position = position_jitter(width = 0.1), size = 3) +
scale_y_log10() +
ggtitle('GENE_NAME Expression') +
theme_bw()
```
### edgeR
```r
# Get CPM for a gene
gene_idx <- which(rownames(y) == 'GENE_NAME')
cpm_gene <- cpm(y)[gene_idx, ]
# Plot
df <- data.frame(cpm = cpm_gene, group = group)
ggplot(df, aes(x = group, y = cpm, color = group)) +
geom_point(position = position_jitter(width = 0.1), size = 3) +
scale_y_log10() +
labs(y = 'CPM', title = 'GENE_NAME Expression') +
theme_bw()
```
## P-value Histogram
**Goal:** Diagnose the quality of the DE analysis by examining the raw p-value distribution.
**Approach:** Histogram of raw p-values; a uniform distribution with a peak near zero indicates a well-calibrated test.
```r
# Check p-value distribution (should be uniform under null with peak near 0)
res_df <- as.data.frame(res)
ggplot(res_df, aes(x = pvalue)) +
geom_histogram(bins = 50, fill = 'steelblue', color = 'white') +
labs(x = 'P-value', y = 'Frequency', title = 'P-value Distribution') +
theme_bw()
```
## Saving Plots
**Goal:** Export publication-quality plots in vector or raster formats.
**Approach:** Use pdf/png devices or ggsave with appropriate resolution and dimensions.
```r
# Save as PDF (vector)
pdf('volcano_plot.pdf', width = 8, height = 6)
# ... plot code ...
dev.off()
# Save as PNG (raster)
png('volcano_plot.png', width = 800, height = 600, res = 150)
# ... plot code ...
dev.off()
# Using ggsave for ggplot objects
p <- ggplot(...) + ...
ggsave('plot.pdf', p, width = 8, height = 6)
ggsave('plot.png', p, width = 8, height = 6, dpi = 300)
```
## Color Palettes
**Goal:** Select appropriate color schemes for heatmaps and categorical data.
**Approach:** Use RColorBrewer palettes -- diverging for expression, sequential for distances, qualitative for groups.
```r
# For heatmaps
library(RColorBrewer)
# Diverging (for expression: blue-white-red)
colorRampPalette(rev(brewer.pal(n = 7, name = 'RdBu')))(100)
# Sequential (for distances)
colorRampPalette(brewer.pal(n = 9, name = 'Blues'))(100)
# For categorical groups
brewer.pal(n = 8, name = 'Set1')
```
## Quick Reference: Common Plots
| Plot | Purpose | Function |
|------|---------|----------|
| MA plot | LFC vs mean expression | `plotMA()`, `plotMD()` |
| Volcano | LFC vs significance | ggplot2, EnhancedVolcano |
| PCA | Sample clustering | `plotPCA()`, `plotMDS()` |
| Heatmap | Gene patterns | `pheatmap()` |
| Dispersion | Model fit | `plotDispEsts()`, `plotBCV()` |
| Counts | Individual genes | `plotCounts()` |
## Diagnostic Interpretation
### P-value Histogram
Check raw p-value distribution before trusting DE results:
| Shape | Meaning | Action |
|-------|---------|--------|
| Uniform + spike near 0 | Correct: null genes uniform, true DE near 0 | Proceed normally |
| Anti-conservative (U-shape, spikes at 0 and 1) | Inflated significance; unmodeled batch or violated assumptions | Check for batch effects, verify model specification |
| Conservative (depleted near 0, spike near 1) | Over-correction; too many covariates or wrong dispersion | Simplify model, check dispersion plot |
| Spike at p = 1 only | Discrete artifact from low-count genes | Pre-filter more aggressively |
### MA Plot Diagnostics
| Pattern | Meaning |
|---------|---------|
| Symmetric cloud centered at LFC = 0 | Correct normalization |
| Cloud shifted up or down | Normalization failure; majority-DE experiment may violate assumptions |
| Funnel shape widening at low expression | Expected — low-count genes have noisier fold changes |
| Discrete horizontal bands | Low-count artifacts; consider stronger pre-filtering |
### Volcano Plot: Shrunken LFCs
Use shrunken LFCs (apeglm/ashr) on the x-axis and un-shrunken p-values on the y-axis. This combination gives stable fold change estimates while preserving the original significance assessment. Without shrinkage, low-count genes with extreme but unreliable fold changes dominate the plot edges.
### Dispersion Plot Diagnostics
| Pattern | Meaning |
|---------|---------|
| Gene-wise points scattered around fitted line | Good model fit |
| Gene-wise points far above fitted line | Possible outlier genes or unmodeled batch effects |
| Fitted line flat (no trend) | Unusual — check if data is over-filtered or has unusual structure |
| Final estimates much lower than gene-wise | Expected — shrinkage toward the fitted trend |
### PCA Diagnostics
| Pattern | Meaning | Action |
|---------|---------|--------|
| Clear separation by condition on PC1/PC2 | Strong biological signal | Proceed |
| Separation by batch, not condition | Batch effect dominates | Include batch in model; do NOT use corrected counts for DE |
| One sample far from its group | Potential outlier or sample swap | Check library QC metrics; consider removing |
| No separation on PC1/PC2 but present on PC3+ | Subtle effects | May still find DE genes; check dispersion estimates |
## Related Skills
- deseq2-basics - Generate DESeq2 results for visualization
- edger-basics - Generate edgeR results for visualization
- de-results - Filter genes before visualization
- data-visualization/specialized-omics-plots - Custom ggplot2 volcano/MA/PCA functions
- data-visualization/heatmaps-clustering - Advanced heatmap customization