bio-rna-quantification-featurecounts-counting
$
npx mdskill add GPTomics/bioSkills/bio-rna-quantification-featurecounts-countingCounts RNA-seq reads per gene using featureCounts from BAM files
- Solves the task of generating gene-level read counts for differential expression analysis
- Relies on Subread featureCounts, STAR/HISAT2 BAM files, and GTF annotations
- Uses GTF annotations to map reads to genes and count overlaps
- Exports a gene-by-sample count matrix as a text file for downstream analysis
SKILL.md
.github/skills/bio-rna-quantification-featurecounts-countingView on GitHub ↗
---
name: bio-rna-quantification-featurecounts-counting
description: Count reads per gene from aligned BAM files using Subread featureCounts. Use when processing BAM files from STAR/HISAT2 to generate gene-level counts for DESeq2/edgeR.
tool_type: cli
primary_tool: featureCounts
---
## Version Compatibility
Reference examples tested with: DESeq2 1.42+, HISAT2 2.2.1+, STAR 2.7.11+, Subread 2.0+, edgeR 4.0+, pandas 2.2+, scanpy 1.10+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# featureCounts Counting
**"Count reads per gene from my BAM files"** → Assign aligned reads to genomic features using a GTF annotation to produce a gene-by-sample count matrix for DE analysis.
- CLI: `featureCounts -a genes.gtf -o counts.txt sample1.bam sample2.bam`
Count reads mapping to genomic features (genes, exons) from BAM files.
## Basic Usage
```bash
# Single sample
featureCounts -a annotation.gtf -o counts.txt aligned.bam
# Multiple samples (recommended - single matrix output)
featureCounts -a annotation.gtf -o counts.txt sample1.bam sample2.bam sample3.bam
# All BAMs in directory
featureCounts -a annotation.gtf -o counts.txt *.bam
```
## Paired-End Data
```bash
# Count fragments, not reads (required for paired-end)
featureCounts -p --countReadPairs -a annotation.gtf -o counts.txt *.bam
# Check proper pairs only
featureCounts -p --countReadPairs -B -C -a annotation.gtf -o counts.txt *.bam
```
**Flags:**
- `-p` - Input is paired-end
- `--countReadPairs` - Count fragments instead of reads
- `-B` - Only count properly paired reads
- `-C` - Don't count chimeric fragments
## Strand-Specific Libraries
```bash
# Unstranded (default)
featureCounts -s 0 -a annotation.gtf -o counts.txt *.bam
# Forward stranded (e.g., dUTP, NSR)
featureCounts -s 1 -a annotation.gtf -o counts.txt *.bam
# Reverse stranded (e.g., Illumina TruSeq, most common)
featureCounts -s 2 -a annotation.gtf -o counts.txt *.bam
```
**Determining strandedness:** Use `infer_experiment.py` from RSeQC or check library prep protocol.
## Feature Types
```bash
# Count at gene level (default)
featureCounts -t exon -g gene_id -a annotation.gtf -o counts.txt *.bam
# Count at transcript level
featureCounts -t exon -g transcript_id -a annotation.gtf -o counts.txt *.bam
# Count CDS only
featureCounts -t CDS -g gene_id -a annotation.gtf -o counts.txt *.bam
```
**Flags:**
- `-t` - Feature type in GTF (default: exon)
- `-g` - Meta-feature attribute (default: gene_id)
## Multi-Mapping Reads
```bash
# Discard multi-mappers (default, recommended for DE)
featureCounts -a annotation.gtf -o counts.txt *.bam
# Count multi-mappers (fractional)
featureCounts -M --fraction -a annotation.gtf -o counts.txt *.bam
# Count multi-mappers (full count to each location)
featureCounts -M -a annotation.gtf -o counts.txt *.bam
```
## Overlapping Features
```bash
# Discard reads overlapping multiple features (default)
featureCounts -a annotation.gtf -o counts.txt *.bam
# Count reads overlapping multiple features
featureCounts -O -a annotation.gtf -o counts.txt *.bam
# Fractional count for overlaps
featureCounts -O --fraction -a annotation.gtf -o counts.txt *.bam
```
## Performance Options
```bash
# Use multiple threads
featureCounts -T 8 -a annotation.gtf -o counts.txt *.bam
# Use less memory (slower)
featureCounts --largeBAM -a annotation.gtf -o counts.txt *.bam
```
## Output Files
featureCounts produces two files:
1. **counts.txt** - Main count matrix
```
Geneid Chr Start End Strand Length sample1.bam sample2.bam
GENE1 chr1 100 500 + 400 1523 1891
GENE2 chr1 1000 2000 - 1000 892 756
```
2. **counts.txt.summary** - Assignment statistics
```
Status sample1.bam sample2.bam
Assigned 1523456 1678234
Unassigned_Unmapped 12345 11234
Unassigned_NoFeatures 234567 245678
```
## Extract Count Matrix
```bash
# Remove first 6 columns (metadata) to get just counts
cut -f1,7- counts.txt | tail -n +2 > count_matrix.txt
```
## Python Processing
```python
import pandas as pd
counts = pd.read_csv('counts.txt', sep='\t', comment='#')
count_matrix = counts.set_index('Geneid').iloc[:, 5:] # Skip metadata columns
count_matrix.columns = [c.replace('.bam', '') for c in count_matrix.columns]
count_matrix.to_csv('count_matrix.csv')
```
## R Processing
```r
counts <- read.table('counts.txt', header=TRUE, row.names=1, skip=1)
count_matrix <- counts[, 6:ncol(counts)] # Skip metadata columns
colnames(count_matrix) <- gsub('.bam', '', colnames(count_matrix))
```
## Common Issues
**Low assignment rate:**
- Check strandedness setting (`-s`)
- Verify GTF matches reference genome version
- Check BAM alignment quality
**Zero counts for known expressed genes:**
- Ensure feature type matches GTF (`-t exon` vs `-t gene`)
- Check gene_id attribute name in GTF
## Related Skills
- alignment-files/sam-bam-basics - Input BAM file handling
- genome-intervals/gtf-gff-handling - GTF annotation files
- differential-expression/deseq2-basics - Downstream analysis with counts
- rna-quantification/count-matrix-qc - QC of count data