bio-expression-matrix-counts-ingest
$
npx mdskill add GPTomics/bioSkills/bio-expression-matrix-counts-ingestIngest gene expression matrices from multiple formats
- Solves importing quantification results for downstream analysis tasks
- Depends on pandas and tximport libraries to process data
- Decides processing method based on matrix unit type table
- Delivers normalized counts ready for differential expression tools
SKILL.md
.github/skills/bio-expression-matrix-counts-ingestView on GitHub ↗
---
name: bio-expression-matrix-counts-ingest
description: Load gene expression count matrices from various formats including CSV, TSV, featureCounts, Salmon, kallisto, and 10X. Use when importing quantification results for downstream analysis.
tool_type: python
primary_tool: pandas
---
## Version Compatibility
Reference examples tested with: pandas 2.2+, tximport 1.30+, tximeta 1.20+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Count Matrix Ingestion
## Expression Units Decision Table
| Unit | Source | Use for DE | Use for visualization | Cross-sample comparison |
|------|--------|------------|----------------------|------------------------|
| Raw counts | featureCounts, HTSeq, STAR | Yes (DE tools normalize internally) | No | No |
| Estimated counts | Salmon, kallisto (via tximport) | Yes (with offset correction) | No | No |
| TPM | Salmon, kallisto | No | Within-sample gene comparison | Partially (same composition caveat) |
| FPKM/RPKM | Cufflinks, legacy tools | No | Within-sample only | No (composition bias) |
| Normalized counts | DESeq2, edgeR | Via DE tool | Prefer VST/rlog | Yes |
For differential expression, always provide raw integer counts. DE tools (DESeq2, edgeR, limma-voom) model count distributions and handle normalization internally.
## Transcript-Level vs Gene-Level Quantification
Salmon and kallisto quantify at the transcript level. Naive summation of transcript counts to gene level introduces **length bias**: if treatment causes a shift from short to long isoforms, the gene appears upregulated even if the number of mRNA molecules is unchanged. Use tximport in R (or equivalent offset handling in Python) to correct for this.
For detailed tximport workflows, see rna-quantification/tximport-workflow. Key decisions:
- `countsFromAbundance='no'` (default): raw estimated counts with length offsets for DESeq2 (recommended)
- `countsFromAbundance='lengthScaledTPM'`: required for limma-voom, which cannot use offset matrices
- `countsFromAbundance='scaledTPM'`: prevents differential transcript usage from creating spurious DGE
## Basic CSV/TSV Loading
**Goal:** Load a gene expression count matrix from a delimited text file into a pandas DataFrame.
**Approach:** Read CSV/TSV with gene IDs as the row index, handling comment lines if present.
**"Load my count matrix"** → Read a delimited file into a DataFrame with genes as rows and samples as columns.
```python
import pandas as pd
# TSV with gene IDs as first column
counts = pd.read_csv('counts.tsv', sep='\t', index_col=0)
# CSV with header
counts = pd.read_csv('counts.csv', index_col=0)
# Skip comment lines
counts = pd.read_csv('counts.txt', sep='\t', index_col=0, comment='#')
```
## featureCounts Output
**Goal:** Parse featureCounts output into a clean count matrix by stripping metadata columns.
**Approach:** Skip the 6 annotation columns (Chr, Start, End, Strand, Length) and clean BAM path suffixes from column names.
```python
import pandas as pd
# featureCounts format has 6 metadata columns before counts
fc = pd.read_csv('featurecounts.txt', sep='\t', comment='#')
counts = fc.set_index('Geneid').iloc[:, 5:] # Skip Chr, Start, End, Strand, Length
counts.columns = [c.replace('.bam', '').split('/')[-1] for c in counts.columns]
```
## Salmon Quant Files
**Goal:** Combine per-sample Salmon quantification files into a single count or TPM matrix.
**Approach:** Iterate over quant directories, extract the desired column from each quant.sf, and merge into a DataFrame.
```python
import pandas as pd
from pathlib import Path
def load_salmon_quants(quant_dirs, column='NumReads'):
'''Load multiple Salmon quant.sf files into a count matrix.'''
dfs = {}
for qdir in quant_dirs:
sample = Path(qdir).name
sf = pd.read_csv(f'{qdir}/quant.sf', sep='\t', index_col=0)
dfs[sample] = sf[column]
return pd.DataFrame(dfs)
# Usage
quant_dirs = ['salmon_out/sample1', 'salmon_out/sample2', 'salmon_out/sample3']
counts = load_salmon_quants(quant_dirs, column='NumReads')
tpm = load_salmon_quants(quant_dirs, column='TPM')
```
## STAR ReadsPerGene Files
**Goal:** Load STAR gene-level count output, selecting the correct strandedness column.
**Approach:** STAR outputs 4 columns per gene: gene ID, unstranded counts, sense-strand counts, antisense-strand counts. Selecting the wrong strand column silently halves the counts.
```python
import pandas as pd
from pathlib import Path
def load_star_genecounts(filepaths, strandedness='reverse'):
'''Load STAR ReadsPerGene.out.tab files.
strandedness: 'unstranded' (col 1), 'forward' (col 2), 'reverse' (col 3)
For Illumina TruSeq stranded libraries, use 'reverse'.
'''
col_map = {'unstranded': 1, 'forward': 2, 'reverse': 3}
col_idx = col_map[strandedness]
dfs = {}
for fp in filepaths:
sample = Path(fp).name.replace('_ReadsPerGene.out.tab', '')
df = pd.read_csv(fp, sep='\t', header=None, index_col=0)
dfs[sample] = df.iloc[4:, col_idx - 1] # Skip first 4 summary rows
return pd.DataFrame(dfs)
```
Strandedness verification: compare total counts in columns 2 vs 3 across samples. The column with much larger totals corresponds to the correct strand. If roughly equal, the library is unstranded (use column 1).
## HTSeq Count Files
**Goal:** Load per-sample HTSeq count output into a combined matrix.
**Approach:** Read each file (two columns: gene ID and count), skipping the 5 summary lines at the end (prefixed with `__`), then merge.
```python
import pandas as pd
from pathlib import Path
def load_htseq_counts(filepaths):
'''Load multiple HTSeq count files into a matrix.'''
dfs = {}
for fp in filepaths:
sample = Path(fp).stem.replace('_counts', '')
df = pd.read_csv(fp, sep='\t', header=None, index_col=0, names=['gene', 'count'])
df = df[~df.index.str.startswith('__')] # Remove __no_feature, __ambiguous, etc.
dfs[sample] = df['count']
return pd.DataFrame(dfs)
```
## kallisto Abundance Files
**Goal:** Combine per-sample kallisto abundance files into a single count or TPM matrix.
**Approach:** Read each abundance.tsv, extract the target column, and assemble into a DataFrame keyed by sample name.
```python
import pandas as pd
from pathlib import Path
def load_kallisto_quants(abundance_files, column='est_counts'):
'''Load multiple kallisto abundance.tsv files.'''
dfs = {}
for f in abundance_files:
sample = Path(f).parent.name
ab = pd.read_csv(f, sep='\t', index_col=0)
dfs[sample] = ab[column]
return pd.DataFrame(dfs)
# Usage
files = ['kallisto_out/sample1/abundance.tsv', 'kallisto_out/sample2/abundance.tsv']
counts = load_kallisto_quants(files, column='est_counts')
tpm = load_kallisto_quants(files, column='tpm')
```
## 10X Genomics Sparse Matrix
**Goal:** Load single-cell count data from 10X Genomics format (MTX directory or H5 file).
**Approach:** Use scanpy to read the sparse matrix with gene and barcode metadata into an AnnData object.
```python
import scanpy as sc
# Load 10X directory (contains matrix.mtx, genes.tsv/features.tsv, barcodes.tsv)
adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')
# Load 10X H5 file
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
# Convert to dense DataFrame if needed
counts = adata.to_df()
```
## AnnData H5AD Files
**Goal:** Load preprocessed expression data stored in the AnnData H5AD format.
**Approach:** Read the H5AD file with scanpy and access the count matrix, raw layer, or metadata as needed.
```python
import anndata as ad
import scanpy as sc
# Load h5ad
adata = sc.read_h5ad('data.h5ad')
# Access count matrix
counts = adata.to_df() # Dense DataFrame
sparse_counts = adata.X # Sparse matrix (if stored sparse)
# Access raw counts if normalized data is in .X
raw_counts = adata.raw.to_adata().to_df()
```
## RDS Files (from R)
**Goal:** Load R-serialized count data into Python.
**Approach:** Use pyreadr to read RDS files directly into a pandas DataFrame.
```python
import pyreadr
# Read RDS file
result = pyreadr.read_r('counts.rds')
counts = result[None] # Access the data
# For Seurat objects, use anndata2ri or convert in R first
```
## Combine Multiple Files
**Goal:** Merge per-sample count files into a single genes-by-samples matrix.
**Approach:** Glob matching files, read the first data column from each, and concatenate into a DataFrame.
```python
import pandas as pd
from pathlib import Path
def combine_count_files(file_pattern, index_col=0, sep='\t'):
'''Combine multiple count files into one matrix.'''
files = sorted(Path('.').glob(file_pattern))
dfs = {}
for f in files:
sample = f.stem.replace('_counts', '')
dfs[sample] = pd.read_csv(f, sep=sep, index_col=index_col).iloc[:, 0]
return pd.DataFrame(dfs)
# Usage
counts = combine_count_files('counts/*_counts.tsv')
```
## Filter Low-Count Genes
**Goal:** Remove genes with insufficient expression to reduce noise and multiple testing burden.
**Approach:** Apply expression thresholds appropriate for the downstream DE tool. edgeR requires explicit filtering; DESeq2 handles it internally but benefits from a speed-only pre-filter.
### edgeR: Design-Aware Filtering (Recommended)
```r
library(edgeR)
# filterByExpr uses smallest group size from design matrix
# Requires CPM above threshold in at least n samples (n = smallest group)
keep <- filterByExpr(y, design=model.matrix(~condition, data=metadata))
y <- y[keep, , keep.lib.sizes=FALSE]
```
Forgetting `filterByExpr` in edgeR inflates the multiple testing burden because edgeR's quasi-likelihood pipeline does not have DESeq2's automatic independent filtering.
### DESeq2: Speed Pre-Filter
```r
# Manual pre-filter for speed only -- independent filtering at results() handles statistics
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]
```
### Python
```python
# Group-aware filter: require min_counts in at least min_samples
min_counts, min_samples = 10, 3
expressed = (counts >= min_counts).sum(axis=1) >= min_samples
counts_filtered = counts.loc[expressed]
# For min_samples, use the smallest experimental group size
# e.g., if 3 controls and 5 treated, use min_samples=3
```
## Handle Gene ID Versions
**Goal:** Strip Ensembl version suffixes for consistent ID matching across datasets.
**Approach:** Split on the dot separator and keep only the stable identifier prefix.
```python
# Remove Ensembl version numbers (ENSG00000123456.12 -> ENSG00000123456)
counts.index = counts.index.str.split('.').str[0]
# Or keep as-is for compatibility
```
## Save Count Matrix
**Goal:** Export a count matrix in formats suitable for downstream tools or archival.
**Approach:** Write as TSV, compressed TSV, or AnnData H5AD depending on the target workflow.
```python
# Save as TSV
counts.to_csv('count_matrix.tsv', sep='\t')
# Save as compressed
counts.to_csv('count_matrix.tsv.gz', sep='\t', compression='gzip')
# Save as AnnData
import anndata as ad
adata = ad.AnnData(counts)
adata.write_h5ad('counts.h5ad')
```
## R Loading Equivalents
**Goal:** Load count data in R from common formats including featureCounts and Salmon/kallisto via tximport.
**Approach:** Use base R read functions for alignment-based counts, or tximport with a tx2gene mapping for pseudo-alignment outputs (Salmon/kallisto). tximport handles length-bias correction automatically.
```r
# Basic CSV/TSV
counts <- read.csv('counts.csv', row.names=1)
counts <- read.delim('counts.tsv', row.names=1)
# featureCounts
fc <- read.delim('featurecounts.txt', comment.char='#', row.names=1)
counts <- fc[, 6:ncol(fc)]
# tximport for Salmon/kallisto (RECOMMENDED over manual loading)
# Use tx2gene for gene-level summarization with length-offset correction
library(tximport)
tx2gene <- read.csv('tx2gene.csv') # columns: TXNAME, GENEID
files <- file.path('salmon_out', samples, 'quant.sf')
names(files) <- samples
txi <- tximport(files, type='salmon', tx2gene=tx2gene)
# For DESeq2: use DESeqDataSetFromTximport (handles offsets natively)
# For limma-voom: use countsFromAbundance='lengthScaledTPM' in tximport
# See rna-quantification/tximport-workflow for full details
```
## Related Skills
- rna-quantification/featurecounts-counting - Generate featureCounts output
- rna-quantification/alignment-free-quant - Generate Salmon/kallisto output
- rna-quantification/tximport-workflow - Import Salmon/kallisto with length-offset correction
- expression-matrix/normalization - Normalize counts for downstream analysis
- expression-matrix/sparse-handling - Memory-efficient storage
- expression-matrix/gene-id-mapping - Convert gene identifiers