bio-temporal-genomics-temporal-clustering

Name: bio-temporal-genomics-temporal-clustering
Author: GPTomics/bioSkills

$ npx mdskill add GPTomics/bioSkills/bio-temporal-genomics-temporal-clustering

Cluster genes by temporal expression shape to reveal coordinated response patterns.

Groups co-regulated genes into shared trajectory modules across time points.
Depends on Mfuzz, TCseq, DEGreport, and fuzzy c-means algorithms.
Requires prior identification of temporally variable genes before clustering.
Outputs categorized gene sets representing distinct temporal expression profiles.

SKILL.md

.github/skills/bio-temporal-genomics-temporal-clustering

---
name: bio-temporal-genomics-temporal-clustering
description: Clusters genes by temporal expression profile shape using Mfuzz soft clustering, TCseq, and DEGreport degPatterns. Groups co-regulated genes into shared trajectory patterns via fuzzy c-means or hierarchical approaches. Use when categorizing temporally dynamic genes into response groups or identifying co-expression modules across time points. Requires temporally variable genes identified first (see differential-expression/timeseries-de).
tool_type: mixed
primary_tool: Mfuzz
---

## Version Compatibility

Reference examples tested with: numpy 1.26+, scanpy 1.10+, scikit-learn 1.4+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Temporal Gene Clustering

**"Group my time-course genes by expression pattern shape"** → Cluster temporally variable genes into co-expression modules by trajectory shape using fuzzy c-means (Mfuzz), hierarchical methods, or DTW-based approaches, revealing coordinated response patterns.
- R: `Mfuzz::mfuzz()` for soft (fuzzy) temporal clustering
- Python: `sklearn.cluster.KMeans` on z-scored time profiles for hard clustering, or `tslearn.clustering.TimeSeriesKMeans` for DTW-based clustering

Groups genes with similar temporal expression dynamics into clusters, revealing shared regulatory programs and coordinated response patterns across time-course experiments.

## Core Workflow

1. Select temporally variable genes (pre-filtered by DE or variance)
2. Standardize expression profiles (z-score across timepoints)
3. Choose clustering method and number of clusters
4. Assign genes to clusters (hard or soft membership)
5. Validate clusters and run functional enrichment per cluster
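
Steps 1 and 2 can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not code from any of the packages below: the helper names `select_variable_genes` and `zscore_rows`, the toy matrix, and the `min_std` cutoff are all hypothetical.

```python
import numpy as np

def select_variable_genes(expr, min_std=0.5):
    """Keep rows (genes) whose sd across timepoints is at least min_std."""
    keep = expr.std(axis=1, ddof=1) >= min_std
    return expr[keep], keep

def zscore_rows(expr):
    """Standardize each gene to mean 0 and sd 1 across timepoints."""
    mean = expr.mean(axis=1, keepdims=True)
    sd = expr.std(axis=1, ddof=1, keepdims=True)
    return (expr - mean) / sd

# Toy matrix (hypothetical values): 4 genes x 5 timepoints
expr = np.array([
    [1.0, 2.0, 4.0, 8.0, 16.0],   # rising, high amplitude
    [5.0, 5.1, 5.0, 4.9, 5.0],    # flat, removed by the variance filter
    [0.1, 0.2, 0.4, 0.8, 1.6],    # rising, low amplitude
    [9.0, 7.0, 5.0, 3.0, 1.0],    # falling
])

filtered, keep = select_variable_genes(expr, min_std=0.5)
z = zscore_rows(filtered)

# After z-scoring, the two rising genes have identical profiles even though
# their raw amplitudes differ tenfold, so they can cluster by shape.
print(np.allclose(z[0], z[1]))
```

The filter runs before standardization on purpose: z-scoring a flat gene would inflate its noise to unit variance and let it join a cluster by chance.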

## Mfuzz (R/Bioconductor)

**Goal:** Group temporally variable genes into co-expression clusters by trajectory shape using fuzzy c-means, revealing shared regulatory programs.

**Approach:** Create an ExpressionSet from the time-series matrix, filter low-variance genes, standardize profiles, estimate the fuzzifier parameter, then run fuzzy c-means to assign soft cluster memberships.

Soft (fuzzy) c-means clustering assigns genes membership scores across all clusters, capturing genes with ambiguous temporal behavior.
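
To show where membership scores come from, the sketch below applies the standard fuzzy c-means membership formula, u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1)), where d_ij is the distance from gene i to centroid j and m is the fuzzifier. It is an illustration in NumPy, not Mfuzz's implementation; the function name `fuzzy_memberships`, the centroids, and the profiles are hypothetical.

```python
import numpy as np

def fuzzy_memberships(profiles, centroids, m=2.0):
    """Fuzzy c-means membership of each profile in each cluster (rows sum to 1)."""
    # Euclidean distance from every profile to every centroid: (n_genes, n_clusters)
    d = np.linalg.norm(profiles[:, None, :] - centroids[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)  # guard against a profile sitting exactly on a centroid
    # ratio[i, j, k] = (d[i, j] / d[i, k]) ** (2 / (m - 1)); sum over k, then invert
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

# Hypothetical centroids: an "up" and a "down" trajectory over 4 timepoints
centroids = np.array([[-1.0, -0.5, 0.5, 1.0],
                      [1.0, 0.5, -0.5, -1.0]])
profiles = np.array([[-1.0, -0.4, 0.4, 1.0],   # close to the "up" centroid
                     [0.0, 0.0, 0.0, 0.0]])    # equidistant from both centroids

u = fuzzy_memberships(profiles, centroids, m=2.0)
u_fuzzier = fuzzy_memberships(profiles, centroids, m=3.0)
print(u.round(3))
print(u_fuzzier.round(3))
```

The first profile gets a membership near 1 in the "up" cluster, the second splits evenly between the two, and raising m pulls all memberships toward equal shares.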

### Setup and Preprocessing

```r
library(Mfuzz)
library(Biobase)

# Rows = genes, columns = timepoints (mean across replicates)
expr_mat <- as.matrix(read.csv('temporal_expression.csv', row.names = 1))

# Create ExpressionSet
eset <- ExpressionSet(assayData = expr_mat)

# filter.std removes genes with near-zero variance across timepoints
# min.std=0.5: removes flat genes; adjust based on data spread
eset <- filter.std(eset, min.std = 0.5)

# Standardize each gene to mean=0, sd=1 across timepoints
eset <- standardise(eset)
```

### Fuzzifier Estimation and Clustering

```r
# mestimate(): data-driven fuzzifier estimate based on gene count and dimensions
# Prevents clusters from being too crisp (m close to 1) or too fuzzy (m >> 2)
m <- mestimate(eset)
cat(sprintf('Estimated fuzzifier: %.2f\n', m))

# c=8: typical starting point for 6-12 timepoints; refine with cluster validity indices
# mfuzz() starts from random centers, so fix the seed for reproducible clusters
set.seed(42)
cl <- mfuzz(eset, c = 8, m = m)

# acore() returns one data frame per cluster (columns NAME, MEM.SHIP) listing the
# genes whose membership in that cluster is at least min.acore
# min.acore=0.5: common cutoff; genes below it in every cluster are treated as ambiguous
core_genes <- acore(eset, cl, min.acore = 0.5)
```

### Visualization

```r
# Temporal profile plot with membership-based color intensity
mfuzz.plot2(eset, cl, mfrow = c(2, 4), time.labels = colnames(expr_mat),
            centre = TRUE, x11 = FALSE)

# Overlap plot: PCA projection of the cluster centers; connecting lines mark cluster
# pairs whose membership overlap exceeds thres
overlap.plot(cl, overlap = overlap(cl), thres = 0.05)
```

### Cluster Number Selection

```r
# Evaluate multiple k values; pick the largest k before the minimum centroid distance drops sharply
# Range 4-20: typical for temporal data; fewer for simple designs, more for dense sampling
# Mfuzz::Dmin() provides a built-in version of this scan
set.seed(42)  # mfuzz() uses random initial centers
k_range <- 4:20
validity_scores <- sapply(k_range, function(k) {
    cl_k <- mfuzz(eset, c = k, m = m)
    # Minimum distance between any two centroids; near-zero values indicate redundant clusters
    dists <- as.matrix(dist(cl_k$centers))
    diag(dists) <- Inf
    min(dists)
})
plot(k_range, validity_scores, type = 'b', xlab = 'Number of clusters', ylab = 'Min centroid distance')
```

## TCseq (R/Bioconductor)

Temporal clustering with fuzzy c-means and k-means on time-course sequencing data.

```r
library(TCseq)

# timeclust with fuzzy c-means
# algo='cm': fuzzy c-means; captures soft membership like Mfuzz
# k=6: number of clusters; test range and evaluate with silhouette
tc <- timeclust(expr_mat, algo = 'cm', k = 6, standardize = TRUE)

# Cluster assignment plot
timeclustplot(tc, value = 'z-score', cols = 3)

# k-means alternative for hard clustering
tc_km <- timeclust(expr_mat, algo = 'km', k = 6, standardize = TRUE)
```

## DEGreport degPatterns (R)

Automatic cluster number selection and publication-ready plots.

```r
library(DEGreport)

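# degPatterns clusters genes hierarchically and cuts the tree to choose the cluster count
# Here expr_mat needs one column per sample (not timepoint means), with
# colnames(expr_mat) matching rownames(sample_info)
# time: name of the metadata column holding timepoints (a factor sets their order)
# col: name of the metadata column used for coloring (e.g., condition)
# minc=15: minimum genes per cluster to retain; drops very small clusters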
patterns <- degPatterns(expr_mat, metadata = sample_info,
                        time = 'timepoint', col = 'condition', minc = 15)

# Access cluster assignments
cluster_df <- patterns$df

# Plot individual clusters
degPlotCluster(patterns$normalized, time = 'timepoint', color = 'condition')
```

## tslearn (Python)

Time-series clustering with Dynamic Time Warping (DTW) distance.

```python
import numpy as np
from tslearn.clustering import TimeSeriesKMeans
from tslearn.metrics import cdist_dtw
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
from sklearn.metrics import silhouette_score

# expr_mat: numpy array of shape (n_genes, n_timepoints)
# tslearn expects (n_series, n_timepoints, n_features); the scaler z-scores each gene
expr_scaled = TimeSeriesScalerMeanVariance().fit_transform(expr_mat[:, :, np.newaxis])

# DTW metric: handles phase-shifted profiles better than Euclidean
# Soft-DTW (metric='softdtw') is a differentiable variant of DTW; it is usually not faster
# n_clusters=8: starting point; evaluate with silhouette
model = TimeSeriesKMeans(n_clusters=8, metric='dtw', max_iter=50, random_state=42)
labels = model.fit_predict(expr_scaled)

# sklearn silhouette_score has no DTW metric; pass a precomputed DTW distance matrix
dist_matrix = cdist_dtw(expr_scaled)
sil = silhouette_score(dist_matrix, labels, metric='precomputed')
```
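
To make the claim about phase-shifted profiles concrete, here is a small pure-Python sketch of unconstrained DTW next to Euclidean distance. It is a textbook dynamic-programming version for illustration, not tslearn's optimized implementation; `dtw_distance`, `euclidean_distance`, and the two toy profiles are hypothetical.

```python
import math

def dtw_distance(a, b):
    """Unconstrained DTW: sqrt of the minimal accumulated squared difference."""
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = step + min(cost[i - 1][j],      # repeat b[j-1]
                                    cost[i][j - 1],      # repeat a[i-1]
                                    cost[i - 1][j - 1])  # advance both
    return math.sqrt(cost[n][m])

def euclidean_distance(a, b):
    """Pointwise Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Same pulse shape, shifted by two timepoints (hypothetical values)
early_peak = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
late_peak = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]

d_euc = euclidean_distance(early_peak, late_peak)
d_dtw = dtw_distance(early_peak, late_peak)
print(d_euc, d_dtw)
```

Euclidean distance treats the two pulses as different, while DTW warps the time axis and finds them identical. The same property means DTW can merge early and late responders into one cluster, which may or may not be what the analysis needs.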

### Cluster Number Selection with Silhouette

```python
# Test k from 3-15; pick k with highest silhouette score
# 3-15 range: fewer than 3 is too coarse; more than 15 rarely adds biological meaning
k_values = list(range(3, 16))
sil_scores = []
for k in k_values:
    model = TimeSeriesKMeans(n_clusters=k, metric='softdtw', max_iter=30, random_state=42)
    labels = model.fit_predict(expr_scaled)
    # Euclidean silhouette as computational shortcut; DTW silhouette is O(n^2 * T^2)
    # The score therefore only approximates cluster quality under the soft-DTW metric
    sil_scores.append(silhouette_score(expr_scaled[:, :, 0], labels, metric='euclidean'))

best_k = k_values[int(np.argmax(sil_scores))]
```
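
For reference, the silhouette width of a gene is s = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. The NumPy sketch below computes the mean silhouette from any square distance matrix (Euclidean or DTW). It is an illustration, not the scikit-learn implementation; `silhouette_from_distances` and the toy data are hypothetical.

```python
import numpy as np

def silhouette_from_distances(dist, labels):
    """Mean silhouette width from a square distance matrix and hard labels."""
    labels = np.asarray(labels)
    clusters = np.unique(labels)
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        same = labels == labels[i]
        same[i] = False                       # exclude the point itself
        if not same.any():                    # singleton cluster: score stays 0
            continue
        a = dist[i, same].mean()              # cohesion: mean distance within own cluster
        b = min(dist[i, labels == c].mean()   # separation: nearest other cluster
                for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = 0.0 if denom == 0 else (b - a) / denom
    return float(scores.mean())

# Toy 1-D example (hypothetical values): two well-separated groups
points = np.array([0.0, 1.0, 10.0, 11.0])
dist = np.abs(points[:, None] - points[None, :])

good = silhouette_from_distances(dist, [0, 0, 1, 1])   # matches the true grouping
bad = silhouette_from_distances(dist, [0, 1, 0, 1])    # mixes the two groups
print(good, bad)
```

Scores near 1 indicate compact, well-separated clusters; scores near or below 0 indicate that genes sit closer to another cluster than to their own.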

## Method Comparison

| Method | Clustering Type | Distance | Best For |
|--------|----------------|----------|----------|
| Mfuzz | Soft (fuzzy c-means) | Euclidean | Standard temporal profiling |
| TCseq | Soft or hard | Euclidean | RNA-seq time courses |
| DEGreport | Hierarchical | Correlation | Automatic k selection |
| tslearn | Hard (k-means) | DTW/soft-DTW | Phase-shifted profiles |

## Tips

- Always standardize (z-score) before clustering; otherwise, highly expressed genes dominate
- Soft clustering (Mfuzz) is preferred when genes may participate in multiple temporal programs
- DTW-based clustering captures time-shifted patterns but is computationally expensive for >5000 genes
- Run functional enrichment (GO/GSEA) per cluster to interpret biological meaning
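- A membership threshold of 0.5 for Mfuzz can leave a large share of genes (roughly 30-50%) unassigned as ambiguous, depending on the fuzzifier and cluster count; lower the threshold if this is too stringent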
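
The effect of the membership threshold can be checked directly on a membership matrix (genes x clusters, such as `cl$membership` from Mfuzz exported to a file). The sketch below is a minimal NumPy illustration; `ambiguous_fraction` and the membership values are hypothetical.

```python
import numpy as np

def ambiguous_fraction(membership, threshold=0.5):
    """Fraction of genes whose highest cluster membership is below threshold."""
    return float((membership.max(axis=1) < threshold).mean())

# Hypothetical membership matrix: 5 genes x 3 clusters, each row sums to 1
membership = np.array([
    [0.90, 0.05, 0.05],
    [0.60, 0.30, 0.10],
    [0.45, 0.40, 0.15],
    [0.34, 0.33, 0.33],
    [0.10, 0.20, 0.70],
])

# Scan a few thresholds to see how many genes each one would set aside
fractions = {t: ambiguous_fraction(membership, t) for t in (0.3, 0.5, 0.7)}
print(fractions)
```

Scanning thresholds this way on real data shows how quickly genes are lost as the cutoff rises, which helps pick a value that balances cluster purity against coverage.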

## Related Skills

- circadian-rhythms - Rhythm-specific clustering by phase
- trajectory-modeling - Continuous trajectory fitting before clustering
- differential-expression/timeseries-de - Upstream temporal DE for gene selection
- pathway-analysis/go-enrichment - Per-cluster functional enrichment
