bio-expression-matrix-counts-ingest

Name: bio-expression-matrix-counts-ingest
Author: GPTomics/bioSkills

$npx mdskill add GPTomics/bioSkills/bio-expression-matrix-counts-ingest

Ingest gene expression matrices from multiple formats

Solves importing quantification results for downstream analysis tasks
Depends on pandas and tximport libraries to process data
Decides processing method based on matrix unit type table
Delivers normalized counts ready for differential expression tools

SKILL.md

.github/skills/bio-expression-matrix-counts-ingestView on GitHub ↗

---
name: bio-expression-matrix-counts-ingest
description: Load gene expression count matrices from various formats including CSV, TSV, featureCounts, Salmon, kallisto, and 10X. Use when importing quantification results for downstream analysis.
tool_type: python
primary_tool: pandas
---

## Version Compatibility

Reference examples tested with: pandas 2.2+, tximport 1.30+, tximeta 1.20+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Count Matrix Ingestion

## Expression Units Decision Table

| Unit | Source | Use for DE | Use for visualization | Cross-sample comparison |
|------|--------|------------|----------------------|------------------------|
| Raw counts | featureCounts, HTSeq, STAR | Yes (DE tools normalize internally) | No | No |
| Estimated counts | Salmon, kallisto (via tximport) | Yes (with offset correction) | No | No |
| TPM | Salmon, kallisto | No | Within-sample gene comparison | Partially (same composition caveat) |
| FPKM/RPKM | Cufflinks, legacy tools | No | Within-sample only | No (composition bias) |
| Normalized counts | DESeq2, edgeR | Via DE tool | Prefer VST/rlog | Yes |

For differential expression, always provide raw integer counts. DE tools (DESeq2, edgeR, limma-voom) model count distributions and handle normalization internally.

## Transcript-Level vs Gene-Level Quantification

Salmon and kallisto quantify at the transcript level. Naive summation of transcript counts to gene level introduces **length bias**: if treatment causes a shift from short to long isoforms, the gene appears upregulated even if the number of mRNA molecules is unchanged. Use tximport in R (or equivalent offset handling in Python) to correct for this.

For detailed tximport workflows, see rna-quantification/tximport-workflow. Key decisions:
- `countsFromAbundance='no'` (default): raw estimated counts with length offsets for DESeq2 (recommended)
- `countsFromAbundance='lengthScaledTPM'`: required for limma-voom, which cannot use offset matrices
- `countsFromAbundance='scaledTPM'`: prevents differential transcript usage from creating spurious DGE

## Basic CSV/TSV Loading

**Goal:** Load a gene expression count matrix from a delimited text file into a pandas DataFrame.

**Approach:** Read CSV/TSV with gene IDs as the row index, handling comment lines if present.

**"Load my count matrix"** → Read a delimited file into a DataFrame with genes as rows and samples as columns.

```python
import pandas as pd

# TSV with gene IDs as first column
counts = pd.read_csv('counts.tsv', sep='\t', index_col=0)

# CSV with header
counts = pd.read_csv('counts.csv', index_col=0)

# Skip comment lines
counts = pd.read_csv('counts.txt', sep='\t', index_col=0, comment='#')
```

## featureCounts Output

**Goal:** Parse featureCounts output into a clean count matrix by stripping metadata columns.

**Approach:** Skip the 6 annotation columns (Chr, Start, End, Strand, Length) and clean BAM path suffixes from column names.

```python
import pandas as pd

# featureCounts format has 6 metadata columns before counts
fc = pd.read_csv('featurecounts.txt', sep='\t', comment='#')
counts = fc.set_index('Geneid').iloc[:, 5:]  # Skip Chr, Start, End, Strand, Length
counts.columns = [c.replace('.bam', '').split('/')[-1] for c in counts.columns]
```

## Salmon Quant Files

**Goal:** Combine per-sample Salmon quantification files into a single count or TPM matrix.

**Approach:** Iterate over quant directories, extract the desired column from each quant.sf, and merge into a DataFrame.

```python
import pandas as pd
from pathlib import Path

def load_salmon_quants(quant_dirs, column='NumReads'):
    '''Load multiple Salmon quant.sf files into a count matrix.'''
    dfs = {}
    for qdir in quant_dirs:
        sample = Path(qdir).name
        sf = pd.read_csv(f'{qdir}/quant.sf', sep='\t', index_col=0)
        dfs[sample] = sf[column]
    return pd.DataFrame(dfs)

# Usage
quant_dirs = ['salmon_out/sample1', 'salmon_out/sample2', 'salmon_out/sample3']
counts = load_salmon_quants(quant_dirs, column='NumReads')
tpm = load_salmon_quants(quant_dirs, column='TPM')
```

## STAR ReadsPerGene Files

**Goal:** Load STAR gene-level count output, selecting the correct strandedness column.

**Approach:** STAR outputs 4 columns per gene: gene ID, unstranded counts, sense-strand counts, antisense-strand counts. Selecting the wrong strand column silently halves the counts.

```python
import pandas as pd
from pathlib import Path

def load_star_genecounts(filepaths, strandedness='reverse'):
    '''Load STAR ReadsPerGene.out.tab files.

    strandedness: 'unstranded' (col 1), 'forward' (col 2), 'reverse' (col 3)
    For Illumina TruSeq stranded libraries, use 'reverse'.
    '''
    col_map = {'unstranded': 1, 'forward': 2, 'reverse': 3}
    col_idx = col_map[strandedness]
    dfs = {}
    for fp in filepaths:
        sample = Path(fp).name.replace('_ReadsPerGene.out.tab', '')
        df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
        dfs[sample] = df.iloc[4:, col_idx - 1]  # Skip first 4 summary rows
    return pd.DataFrame(dfs)
```

Strandedness verification: compare total counts in columns 2 vs 3 across samples. The column with much larger totals corresponds to the correct strand. If roughly equal, the library is unstranded (use column 1).

## HTSeq Count Files

**Goal:** Load per-sample HTSeq count output into a combined matrix.

**Approach:** Read each file (two columns: gene ID and count), skipping the 5 summary lines at the end (prefixed with `__`), then merge.

```python
import pandas as pd
from pathlib import Path

def load_htseq_counts(filepaths):
    '''Load multiple HTSeq count files into a matrix.'''
    dfs = {}
    for fp in filepaths:
        sample = Path(fp).stem.replace('_counts', '')
        df = pd.read_csv(fp, sep='\t', header=None, index_col=0, names=['gene', 'count'])
        df = df[~df.index.str.startswith('__')]  # Remove __no_feature, __ambiguous, etc.
        dfs[sample] = df['count']
    return pd.DataFrame(dfs)
```

## kallisto Abundance Files

**Goal:** Combine per-sample kallisto abundance files into a single count or TPM matrix.

**Approach:** Read each abundance.tsv, extract the target column, and assemble into a DataFrame keyed by sample name.

```python
import pandas as pd
from pathlib import Path

def load_kallisto_quants(abundance_files, column='est_counts'):
    '''Load multiple kallisto abundance.tsv files.'''
    dfs = {}
    for f in abundance_files:
        sample = Path(f).parent.name
        ab = pd.read_csv(f, sep='\t', index_col=0)
        dfs[sample] = ab[column]
    return pd.DataFrame(dfs)

# Usage
files = ['kallisto_out/sample1/abundance.tsv', 'kallisto_out/sample2/abundance.tsv']
counts = load_kallisto_quants(files, column='est_counts')
tpm = load_kallisto_quants(files, column='tpm')
```

## 10X Genomics Sparse Matrix

**Goal:** Load single-cell count data from 10X Genomics format (MTX directory or H5 file).

**Approach:** Use scanpy to read the sparse matrix with gene and barcode metadata into an AnnData object.

```python
import scanpy as sc

# Load 10X directory (contains matrix.mtx, genes.tsv/features.tsv, barcodes.tsv)
adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')

# Load 10X H5 file
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# Convert to dense DataFrame if needed
counts = adata.to_df()
```

## AnnData H5AD Files

**Goal:** Load preprocessed expression data stored in the AnnData H5AD format.

**Approach:** Read the H5AD file with scanpy and access the count matrix, raw layer, or metadata as needed.

```python
import anndata as ad
import scanpy as sc

# Load h5ad
adata = sc.read_h5ad('data.h5ad')

# Access count matrix
counts = adata.to_df()  # Dense DataFrame
sparse_counts = adata.X  # Sparse matrix (if stored sparse)

# Access raw counts if normalized data is in .X
raw_counts = adata.raw.to_adata().to_df()
```

## RDS Files (from R)

**Goal:** Load R-serialized count data into Python.

**Approach:** Use pyreadr to read RDS files directly into a pandas DataFrame.

```python
import pyreadr

# Read RDS file
result = pyreadr.read_r('counts.rds')
counts = result[None]  # Access the data

# For Seurat objects, use anndata2ri or convert in R first
```

## Combine Multiple Files

**Goal:** Merge per-sample count files into a single genes-by-samples matrix.

**Approach:** Glob matching files, read the first data column from each, and concatenate into a DataFrame.

```python
import pandas as pd
from pathlib import Path

def combine_count_files(file_pattern, index_col=0, sep='\t'):
    '''Combine multiple count files into one matrix.'''
    files = sorted(Path('.').glob(file_pattern))
    dfs = {}
    for f in files:
        sample = f.stem.replace('_counts', '')
        dfs[sample] = pd.read_csv(f, sep=sep, index_col=index_col).iloc[:, 0]
    return pd.DataFrame(dfs)

# Usage
counts = combine_count_files('counts/*_counts.tsv')
```

## Filter Low-Count Genes

**Goal:** Remove genes with insufficient expression to reduce noise and multiple testing burden.

**Approach:** Apply expression thresholds appropriate for the downstream DE tool. edgeR requires explicit filtering; DESeq2 handles it internally but benefits from a speed-only pre-filter.

### edgeR: Design-Aware Filtering (Recommended)

```r
library(edgeR)

# filterByExpr uses smallest group size from design matrix
# Requires CPM above threshold in at least n samples (n = smallest group)
keep <- filterByExpr(y, design=model.matrix(~condition, data=metadata))
y <- y[keep, , keep.lib.sizes=FALSE]
```

Forgetting `filterByExpr` in edgeR inflates the multiple testing burden because edgeR's quasi-likelihood pipeline does not have DESeq2's automatic independent filtering.

### DESeq2: Speed Pre-Filter

```r
# Manual pre-filter for speed only -- independent filtering at results() handles statistics
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]
```

### Python

```python
# Group-aware filter: require min_counts in at least min_samples
min_counts, min_samples = 10, 3
expressed = (counts >= min_counts).sum(axis=1) >= min_samples
counts_filtered = counts.loc[expressed]

# For min_samples, use the smallest experimental group size
# e.g., if 3 controls and 5 treated, use min_samples=3
```

## Handle Gene ID Versions

**Goal:** Strip Ensembl version suffixes for consistent ID matching across datasets.

**Approach:** Split on the dot separator and keep only the stable identifier prefix.

```python
# Remove Ensembl version numbers (ENSG00000123456.12 -> ENSG00000123456)
counts.index = counts.index.str.split('.').str[0]

# Or keep as-is for compatibility
```

## Save Count Matrix

**Goal:** Export a count matrix in formats suitable for downstream tools or archival.

**Approach:** Write as TSV, compressed TSV, or AnnData H5AD depending on the target workflow.

```python
# Save as TSV
counts.to_csv('count_matrix.tsv', sep='\t')

# Save as compressed
counts.to_csv('count_matrix.tsv.gz', sep='\t', compression='gzip')

# Save as AnnData
import anndata as ad
adata = ad.AnnData(counts)
adata.write_h5ad('counts.h5ad')
```

## R Loading Equivalents

**Goal:** Load count data in R from common formats including featureCounts and Salmon/kallisto via tximport.

**Approach:** Use base R read functions for alignment-based counts, or tximport with a tx2gene mapping for pseudo-alignment outputs (Salmon/kallisto). tximport handles length-bias correction automatically.

```r
# Basic CSV/TSV
counts <- read.csv('counts.csv', row.names=1)
counts <- read.delim('counts.tsv', row.names=1)

# featureCounts
fc <- read.delim('featurecounts.txt', comment.char='#', row.names=1)
counts <- fc[, 6:ncol(fc)]

# tximport for Salmon/kallisto (RECOMMENDED over manual loading)
# Use tx2gene for gene-level summarization with length-offset correction
library(tximport)
tx2gene <- read.csv('tx2gene.csv')  # columns: TXNAME, GENEID
files <- file.path('salmon_out', samples, 'quant.sf')
names(files) <- samples
txi <- tximport(files, type='salmon', tx2gene=tx2gene)

# For DESeq2: use DESeqDataSetFromTximport (handles offsets natively)
# For limma-voom: use countsFromAbundance='lengthScaledTPM' in tximport
# See rna-quantification/tximport-workflow for full details
```

## Related Skills

- rna-quantification/featurecounts-counting - Generate featureCounts output
- rna-quantification/alignment-free-quant - Generate Salmon/kallisto output
- rna-quantification/tximport-workflow - Import Salmon/kallisto with length-offset correction
- expression-matrix/normalization - Normalize counts for downstream analysis
- expression-matrix/sparse-handling - Memory-efficient storage
- expression-matrix/gene-id-mapping - Convert gene identifiers

More from GPTomics/bioSkills

Skill	Description
bio-admet-prediction	Predicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
bio-alignment-amplicon-clipping	Trim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
bio-alignment-filtering	Filter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
bio-alignment-indexing	Create and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
bio-alignment-io	Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
bio-alignment-msa-parsing	Parse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
bio-alignment-msa-statistics	Calculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
bio-alignment-multiple	Perform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
bio-alignment-pairwise	Perform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
bio-alignment-sorting	Sort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.