bio-expression-matrix-sparse-handling

Name: bio-expression-matrix-sparse-handling
Author: GPTomics/bioSkills

$npx mdskill add GPTomics/bioSkills/bio-expression-matrix-sparse-handling

Efficiently handle sparse expression matrices for bioinformatics data

Solve memory issues with large zero-heavy RNA-seq or single-cell datasets
Uses scipy.sparse, numpy, pandas, and anndata for sparse matrix operations
Analyzes sparsity level to determine if sparse format is appropriate
Converts dense data to sparse formats and provides sparsity metrics

SKILL.md

.github/skills/bio-expression-matrix-sparse-handlingView on GitHub ↗

---
name: bio-expression-matrix-sparse-handling
description: Work with sparse matrices for memory-efficient storage of count data. Use when dealing with single-cell data or large bulk RNA-seq datasets where most values are zero.
tool_type: python
primary_tool: scipy.sparse
---

## Version Compatibility

Reference examples tested with: numpy 1.26+, pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Sparse Matrix Handling

**"Convert counts to sparse matrix"** → Store zero-heavy expression data (especially single-cell) in memory-efficient sparse format.
- Python: `scipy.sparse.csr_matrix(dense_array)`, `anndata.X` stores sparse by default
- Python: `scipy.io.mmread('matrix.mtx')` for Market Matrix format (10x Genomics)

## Check Sparsity

```python
import numpy as np

# Calculate sparsity (proportion of zeros)
def check_sparsity(counts):
    zeros = (counts == 0).sum().sum()
    total = counts.size
    sparsity = zeros / total
    print(f'Sparsity: {sparsity:.1%} ({zeros:,} / {total:,} zeros)')
    return sparsity

# Rule of thumb: use sparse if >50% zeros
```

## Convert Dense to Sparse

```python
import scipy.sparse as sp
import pandas as pd

# From pandas DataFrame
dense_df = pd.read_csv('counts.csv', index_col=0)
sparse_matrix = sp.csr_matrix(dense_df.values)

# Keep row/column names
gene_names = dense_df.index.tolist()
sample_names = dense_df.columns.tolist()

# CSR vs CSC
# CSR (Compressed Sparse Row): efficient row slicing, matrix-vector products
# CSC (Compressed Sparse Column): efficient column slicing
sparse_csr = sp.csr_matrix(dense_df.values)  # Row-oriented
sparse_csc = sp.csc_matrix(dense_df.values)  # Column-oriented
```

## Convert Sparse to Dense

```python
import pandas as pd
import scipy.sparse as sp

# To numpy array
dense_array = sparse_matrix.toarray()

# To pandas DataFrame
dense_df = pd.DataFrame(
    sparse_matrix.toarray(),
    index=gene_names,
    columns=sample_names
)
```

## Memory Comparison

```python
import sys
import scipy.sparse as sp

def compare_memory(dense, sparse):
    dense_mb = dense.nbytes / 1e6
    sparse_mb = (sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes) / 1e6
    ratio = dense_mb / sparse_mb
    print(f'Dense:  {dense_mb:.1f} MB')
    print(f'Sparse: {sparse_mb:.1f} MB')
    print(f'Ratio:  {ratio:.1f}x smaller')
    return ratio

sparse = sp.csr_matrix(counts.values)
compare_memory(counts.values, sparse)
```

## Save/Load Sparse Matrices

```python
import scipy.sparse as sp
import numpy as np

# Save sparse matrix
sp.save_npz('counts_sparse.npz', sparse_matrix)

# Save with metadata
np.savez('counts_with_meta.npz',
    data=sparse_matrix.data,
    indices=sparse_matrix.indices,
    indptr=sparse_matrix.indptr,
    shape=sparse_matrix.shape,
    genes=np.array(gene_names),
    samples=np.array(sample_names))

# Load sparse matrix
sparse_matrix = sp.load_npz('counts_sparse.npz')

# Load with metadata
loaded = np.load('counts_with_meta.npz', allow_pickle=True)
sparse_matrix = sp.csr_matrix(
    (loaded['data'], loaded['indices'], loaded['indptr']),
    shape=tuple(loaded['shape']))
gene_names = loaded['genes'].tolist()
```

## AnnData with Sparse Matrices

```python
import anndata as ad
import scipy.sparse as sp
import pandas as pd

# Create AnnData with sparse matrix
adata = ad.AnnData(
    X=sp.csr_matrix(counts.values),
    obs=pd.DataFrame(index=counts.columns),  # Samples
    var=pd.DataFrame(index=counts.index)     # Genes
)

# Note: AnnData transposes so cells/samples are rows
adata = ad.AnnData(
    X=sp.csr_matrix(counts.T.values),  # Transpose for samples-as-rows
    obs=pd.DataFrame(index=counts.columns),
    var=pd.DataFrame(index=counts.index)
)

# Save (automatically handles sparse)
adata.write_h5ad('counts.h5ad')

# Check if stored sparse
adata = ad.read_h5ad('counts.h5ad')
print(f'Matrix type: {type(adata.X)}')
```

## Sparse Operations

```python
import scipy.sparse as sp
import numpy as np

# Row sums (gene totals)
row_sums = np.array(sparse_matrix.sum(axis=1)).flatten()

# Column sums (sample totals)
col_sums = np.array(sparse_matrix.sum(axis=0)).flatten()

# Filter rows by sum (keep genes with >10 total counts)
keep_mask = row_sums > 10
sparse_filtered = sparse_matrix[keep_mask, :]

# Filter columns (keep samples with >1000 counts)
keep_cols = col_sums > 1000
sparse_filtered = sparse_matrix[:, keep_cols]

# Log transform (add pseudocount)
sparse_log = sparse_matrix.copy()
sparse_log.data = np.log1p(sparse_log.data)
```

## Subsetting Sparse Matrices

```python
# Select specific genes
gene_idx = [gene_names.index(g) for g in ['TP53', 'BRCA1', 'MYC'] if g in gene_names]
subset = sparse_matrix[gene_idx, :]

# Select specific samples
sample_idx = [sample_names.index(s) for s in ['sample1', 'sample2']]
subset = sparse_matrix[:, sample_idx]

# Boolean indexing
expressed = row_sums > 0
sparse_expressed = sparse_matrix[expressed, :]
```

## Normalization on Sparse

**Goal:** Apply CPM normalization and log transformation to a sparse count matrix without converting to dense format.

**Approach:** Compute library sizes from column sums, broadcast scaling factors with sparse multiply for CPM, then transform only the nonzero data array in-place with log1p.

```python
import numpy as np
import scipy.sparse as sp

def normalize_sparse_cpm(sparse_matrix):
    '''CPM normalization for sparse matrix.'''
    lib_sizes = np.array(sparse_matrix.sum(axis=0)).flatten()
    scaling_factors = 1e6 / lib_sizes
    normalized = sparse_matrix.multiply(scaling_factors)  # Broadcasts across columns
    return normalized

def normalize_sparse_log1p(sparse_matrix):
    '''Log1p transform sparse matrix in place.'''
    result = sparse_matrix.copy()
    result.data = np.log1p(result.data)
    return result

cpm = normalize_sparse_cpm(sparse_matrix)
log_cpm = normalize_sparse_log1p(cpm)
```

## 10X Matrix Format

```python
import scipy.io
import pandas as pd

# Read 10X format
matrix = scipy.io.mmread('matrix.mtx').T.tocsr()  # Transpose and convert to CSR
features = pd.read_csv('features.tsv', sep='\t', header=None)
barcodes = pd.read_csv('barcodes.tsv', sep='\t', header=None)

gene_names = features[1].tolist()  # Gene symbols
cell_barcodes = barcodes[0].tolist()

# Write 10X format
scipy.io.mmwrite('output_matrix.mtx', sparse_matrix)
```

## Backed Mode for Large Datasets

**Goal:** Work with H5AD files too large to fit in memory by loading data on demand.

**Approach:** Open the file in backed mode so only accessed rows/columns are loaded from disk.

```python
import scanpy as sc

# Read-only backed mode -- data stays on disk, loaded on demand
adata = sc.read_h5ad('large_dataset.h5ad', backed='r')
print(f'Shape: {adata.shape}, Type: {type(adata.X)}')

# Only the accessed slice is loaded into memory
subset = adata[adata.obs['cell_type'] == 'T_cell', :].to_memory()
```

Backed mode limitations:
- Only the X matrix supports backed access; obs/var are fully loaded
- Very large sparse H5AD files (>35 GB) can still cause memory issues
- Write mode (`backed='r+'`) only supports updates to X, not obs/var
- Many scanpy functions require in-memory data; use `.to_memory()` for subsets

For datasets too large for backed mode, process in chunks:

```python
import anndata as ad
import numpy as np

def process_in_chunks(h5ad_path, chunk_size=10000, func=None):
    '''Process large H5AD file in cell chunks.'''
    adata = sc.read_h5ad(h5ad_path, backed='r')
    n_cells = adata.shape[0]
    results = []
    for start in range(0, n_cells, chunk_size):
        end = min(start + chunk_size, n_cells)
        chunk = adata[start:end].to_memory()
        if func:
            chunk = func(chunk)
        results.append(chunk)
    return ad.concat(results)
```

## Sparsity Changes After Transformation

After log-transformation or normalization, zero values become non-zero (log1p(0) = 0, but CPM normalization moves zeros to non-zero when prior counts are added). This reduces the effective sparsity:

```python
import scipy.sparse as sp
import numpy as np

def check_sparsity_after_transform(sparse_matrix, transform_name='original'):
    nnz = sparse_matrix.nnz
    total = sparse_matrix.shape[0] * sparse_matrix.shape[1]
    sparsity = 1 - (nnz / total)
    print(f'{transform_name}: {sparsity:.1%} sparse ({nnz:,} nonzero / {total:,} total)')
    return sparsity

# After normalization, sparsity decreases
# If <50% sparse after transformation, dense format may be more memory-efficient
check_sparsity_after_transform(adata.X, 'raw')
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
check_sparsity_after_transform(adata.X, 'after log1p')
```

Rule of thumb: if sparsity drops below ~50% after transformation, convert back to dense format. The sparse index overhead outweighs the benefit of skipping non-zeros.

## Sparse-Dense Arithmetic Gotcha

Adding/subtracting a dense numpy array from a scipy sparse matrix can return a `numpy.matrix` (deprecated type), not a regular array. Always convert explicitly:

```python
import numpy as np
import scipy.sparse as sp

# This may return numpy.matrix (deprecated)
result_bad = sparse_matrix + dense_array

# Always convert explicitly
result_good = np.asarray((sparse_matrix + dense_array))
```

## Related Skills

- expression-matrix/counts-ingest - Load count data
- expression-matrix/normalization - Normalization affects sparsity
- single-cell/data-io - Single-cell data loading
- single-cell/preprocessing - Single-cell normalization

More from GPTomics/bioSkills

Skill	Description
bio-admet-prediction	Predicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
bio-alignment-amplicon-clipping	Trim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
bio-alignment-filtering	Filter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
bio-alignment-indexing	Create and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
bio-alignment-io	Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
bio-alignment-msa-parsing	Parse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
bio-alignment-msa-statistics	Calculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
bio-alignment-multiple	Perform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
bio-alignment-pairwise	Perform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
bio-alignment-sorting	Sort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.