bio-bam-statistics

Name: bio-bam-statistics
Author: GPTomics/bioSkills

$npx mdskill add GPTomics/bioSkills/bio-bam-statistics

Generate alignment statistics from BAM files

Assess read counts and coverage quality in sequencing data
Depends on samtools CLI tools and pysam Python library APIs
Selects methods based on specific QC metrics requested by user queries
Outputs numerical reports via command line or programmatic interfaces

SKILL.md

.github/skills/bio-bam-statisticsView on GitHub ↗

---
name: bio-bam-statistics
description: Generate alignment statistics using samtools flagstat, stats, depth, coverage, and mosdepth. Use when assessing alignment quality, calculating coverage, or generating QC reports.
tool_type: cli
primary_tool: samtools
---

## Version Compatibility

Reference examples tested with: pysam 0.22+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# BAM Statistics

**"Get alignment statistics and coverage from my BAM file"** → Generate read counts, mapping rates, per-chromosome statistics, depth profiles, and coverage summaries.
- CLI: `samtools flagstat`, `samtools stats`, `samtools depth`, `samtools coverage` (samtools)
- Python: `pysam.AlignmentFile` with `pileup()` and `get_index_statistics()` (pysam)

Generate alignment statistics using samtools and pysam.

## Quick Summary Commands

| Question | Best tool | Why |
|----------|-----------|-----|
| Quick read counts by FLAG category | `samtools flagstat` | Fast; counts secondary+supp in totals |
| Per-chromosome counts | `samtools idxstats` | Fast (needs index); **counts secondary+supp** |
| Insert size, MAPQ, error, GC | `samtools stats -r ref.fa` | Comprehensive; feeds MultiQC |
| Per-position depth (small region) | `samtools depth` or pysam pileup | Slow on full genome |
| Per-position depth (genome-wide) | **`mosdepth`** | 3-10x faster than `samtools depth` |
| Per-region coverage (BED) | `mosdepth --by regions.bed` | Production default |
| Coverage histogram / cumulative | `mosdepth -t 4 --no-per-base` | Single-pass histogram |
| Breadth at depth thresholds | `mosdepth --thresholds 1,10,30,100` | Standard exome QC |
| Targeted enrichment QC | `picard CollectHsMetrics` | PCT_OFF_BAIT, FOLD_80_BASE_PENALTY, AT/GC dropout |
| Cross-sample contamination | `verifybamid2`, `somalier` | FREEMIX < 0.01 expected |

### What Each Tool Counts (and Doesn't)

| Counting category | flagstat | stats | idxstats |
|-------------------|----------|-------|----------|
| Primary alignments | `in total` minus supp | `raw total sequences` | mapped column |
| Secondary | `secondary` line | filtered out | counted in mapped |
| Supplementary | `supplementary` line | filtered out | counted in mapped |
| Mapping rate denominator | total *including* supp | primary only | mapped+unmapped |

For long-read data where one read produces many supplementary alignments, the senior cross-check:
```
input_read_count = flagstat_total - secondary - supplementary
                 = stats_raw_total_sequences
```
Reports of "the file has 1.2M reads" where the input was actually 800k with 400k supplementary chimeric splits trace to flagstat misinterpretation.

## samtools flagstat

Fast summary of alignment flags.

```bash
samtools flagstat input.bam
```

Output:
```
10000000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
50000 + 0 supplementary
0 + 0 duplicates
9800000 + 0 mapped (98.00% : N/A)
9950000 + 0 paired in sequencing
4975000 + 0 read1
4975000 + 0 read2
9700000 + 0 properly paired (97.49% : N/A)
9750000 + 0 with itself and mate mapped
100000 + 0 singletons (1.01% : N/A)
25000 + 0 with mate mapped to a different chr
10000 + 0 with mate mapped to a different chr (mapQ>=5)
```

### Multi-threaded
```bash
samtools flagstat -@ 4 input.bam
```

### Output to File
```bash
samtools flagstat input.bam > flagstat.txt
```

## samtools idxstats

Per-chromosome read counts (requires index).

```bash
samtools idxstats input.bam
```

Output format: `chrom length mapped unmapped`
```
chr1    248956422    5000000    1000
chr2    242193529    4800000    800
chrM    16569        50000      100
*       0            0          150000
```

### Parse idxstats
```bash
# Total mapped reads
samtools idxstats input.bam | awk '{sum += $3} END {print sum}'

# Mitochondrial percentage
samtools idxstats input.bam | awk '
    /^chrM/ {mt = $3}
    {total += $3}
    END {print mt/total*100 "% mitochondrial"}'
```

## samtools stats

Comprehensive statistics including insert size, base quality, and more.

```bash
samtools stats input.bam > stats.txt
```

### View Summary Numbers
```bash
samtools stats input.bam | grep "^SN"
```

Key summary fields:
- `raw total sequences` - Total reads
- `reads mapped` - Mapped reads
- `reads mapped and paired` - Properly paired
- `insert size average` - Mean insert size
- `insert size standard deviation` - Insert size spread
- `average length` - Mean read length
- `error rate` - Mismatch rate

### Generate Plots (with plot-bamstats)
```bash
samtools stats input.bam > stats.txt
plot-bamstats -p plots/ stats.txt
```

### Stats for Specific Region
```bash
samtools stats input.bam chr1:1000000-2000000 > region_stats.txt
```

## samtools depth

Per-position read depth.

### Basic Depth
```bash
samtools depth input.bam > depth.txt
```

Output: `chrom position depth`

### Depth at Specific Positions
```bash
samtools depth -r chr1:1000-2000 input.bam
```

### Include Zero-Depth Positions
```bash
samtools depth -a input.bam > depth_with_zeros.txt
```

### Maximum Depth Cap (Critical Trap)
```bash
# samtools <1.13: silently caps at 8000 -- pipelines pinned to old versions are silently wrong
# samtools 1.13+: no cap by default
# Always set explicitly to be portable:
samtools depth -d 0 input.bam
```

Pipelines that routinely break the 8000 cap: targeted oncology hotspots (5000-50000x), mitochondrial DNA (small genome, large read share), amplicon viral (ARTIC: 1000-100000x per amplicon), UMI-deduped capture (14000-17000x post-collapse), highly expressed transcripts (rRNA, mt-RNA).

### Overlapping Pair Correction
```bash
# When fragment length < 2 * read_length, R1 and R2 overlap.
# Default samtools depth double-counts overlap; -s deducts:
samtools depth -s input.bam
```
Without `-s`, doubled support inflates somatic VAFs at sites covered by overlapping pairs (especially in fragmented samples: FFPE, cfDNA). `mosdepth` does not double-count overlap; `bcftools mpileup` corrects by default; `samtools mpileup` does not.

### mosdepth (Modern Default)
```bash
mosdepth -t 4 sample input.bam                                                       # genome-wide per-base
mosdepth -t 4 --by exome.bed --thresholds 1,10,20,30,100 --no-per-base sample input.bam   # exome QC
mosdepth -t 4 --quantize 0:1:10:100: sample input.bam                                # CNV-style bands
mosdepth -t 4 -f ref.fa sample input.cram                                            # CRAM with reference
```
`mosdepth` filters dup/secondary/supp by default; configurable via `--flag`. Memory ~ 4 bytes x longest chrom (1 GB for human chr1, 12+ GB for axolotl). Does not honor base quality; use `samtools depth -q INT` if needed.

### Depth from BED Regions
```bash
samtools depth -b regions.bed input.bam
```

### Calculate Mean Depth
```bash
samtools depth input.bam | awk '{sum += $3; n++} END {print sum/n}'
```

## samtools coverage

Per-chromosome or per-region coverage statistics (faster than depth).

```bash
samtools coverage input.bam
```

Output columns:
- `#rname` - Reference name
- `startpos` - Start position
- `endpos` - End position
- `numreads` - Number of reads
- `covbases` - Bases with coverage
- `coverage` - Percentage of bases covered
- `meandepth` - Mean depth
- `meanbaseq` - Mean base quality
- `meanmapq` - Mean mapping quality

### Coverage for Specific Region
```bash
samtools coverage -r chr1:1000000-2000000 input.bam
```

### Coverage from BED
```bash
samtools coverage -b regions.bed input.bam
```

### Histogram Output
```bash
samtools coverage -m input.bam
```

## pysam Python Alternative

### Count Reads
```python
import pysam

with pysam.AlignmentFile('input.bam', 'rb') as bam:
    total = mapped = paired = proper = 0
    for read in bam:
        total += 1
        if not read.is_unmapped:
            mapped += 1
        if read.is_paired:
            paired += 1
        if read.is_proper_pair:
            proper += 1

    print(f'Total: {total}')
    print(f'Mapped: {mapped} ({mapped/total*100:.1f}%)')
    print(f'Properly paired: {proper} ({proper/paired*100:.1f}%)')
```

### Per-Chromosome Counts
```python
import pysam

with pysam.AlignmentFile('input.bam', 'rb') as bam:
    for stat in bam.get_index_statistics():
        print(f'{stat.contig}: {stat.mapped} mapped, {stat.unmapped} unmapped')
```

### Calculate Depth at Position
```python
import pysam

with pysam.AlignmentFile('input.bam', 'rb') as bam:
    for pileup in bam.pileup('chr1', 1000000, 1000001):
        print(f'Position {pileup.pos}: depth {pileup.n}')
```

### Mean Depth in Region
```python
import pysam

def mean_depth(bam_path, chrom, start, end):
    depths = []
    with pysam.AlignmentFile(bam_path, 'rb') as bam:
        for pileup in bam.pileup(chrom, start, end, truncate=True):
            depths.append(pileup.n)

    if depths:
        return sum(depths) / len(depths)
    return 0

depth = mean_depth('input.bam', 'chr1', 1000000, 2000000)
print(f'Mean depth: {depth:.1f}x')
```

### Coverage Statistics

**Goal:** Compute coverage breadth and depth for a genomic region from a BAM file.

**Approach:** Iterate pileup columns in the region, count covered positions and accumulate depth, then derive percentages and means.

**Reference (pysam 0.22+):**
```python
import pysam

def coverage_stats(bam_path, chrom, start, end):
    covered = 0
    total_depth = 0

    with pysam.AlignmentFile(bam_path, 'rb') as bam:
        for pileup in bam.pileup(chrom, start, end, truncate=True):
            covered += 1
            total_depth += pileup.n

    length = end - start
    pct_covered = covered / length * 100
    mean_depth = total_depth / length if length > 0 else 0

    return {
        'length': length,
        'covered_bases': covered,
        'pct_covered': pct_covered,
        'mean_depth': mean_depth
    }

stats = coverage_stats('input.bam', 'chr1', 1000000, 2000000)
print(f'Coverage: {stats["pct_covered"]:.1f}%')
print(f'Mean depth: {stats["mean_depth"]:.1f}x')
```

### Insert Size Distribution

**Goal:** Compute the insert size distribution to assess library preparation quality.

**Approach:** Iterate properly paired read1 records, accumulate template lengths into a Counter, then compute summary statistics.

**Reference (pysam 0.22+):**
```python
import pysam
from collections import Counter

insert_sizes = Counter()

with pysam.AlignmentFile('input.bam', 'rb') as bam:
    for read in bam:
        if read.is_proper_pair and read.is_read1 and read.template_length > 0:
            insert_sizes[read.template_length] += 1

sizes = list(insert_sizes.keys())
mean_insert = sum(s * c for s, c in insert_sizes.items()) / sum(insert_sizes.values())
print(f'Mean insert size: {mean_insert:.0f}')
print(f'Min: {min(sizes)}, Max: {max(sizes)}')
```

## Quick Reference

| Task | Command |
|------|---------|
| Quick counts | `samtools flagstat input.bam` |
| Per-chrom counts | `samtools idxstats input.bam` |
| Full stats | `samtools stats input.bam` |
| Coverage summary | `samtools coverage input.bam` |
| Per-position depth | `samtools depth input.bam` |
| Mean depth | `samtools depth input.bam \| awk '{sum+=$3;n++}END{print sum/n}'` |

## QC Thresholds Are Assay-Specific

A single "mapping rate > 95%" rule rejects valid ATAC, ChIP, RNA-seq, metagenomics, and aDNA samples. The threshold question is "is this rate normal for this assay?" not "is this rate above 95%?"

| Metric | WGS PCR-free | WGS PCR | WES | Targeted panel | Deep panel (UMI) | RNA-seq | scRNA (10x) | ATAC | ChIP | Long-read | aDNA |
|--------|--------------|---------|-----|----------------|------------------|---------|-------------|------|------|-----------|------|
| Mapping rate | >99% | >98% | >95% | >95% | >95% | >90% | >70% | >50% | >60% | >95% | 1-50% |
| Duplicate rate | <5% | 5-15% | 20-50% | 20-50% | 50-90% pre-consensus | (skip) | (use UMI) | 10-30% | 5-30% | n/a | 20-60% |
| Proper pair rate | >95% | >95% | >85% | >80% | >80% | >70% | n/a | >50% | >70% | n/a | >60% |
| Mean MAPQ | bimodal at 0/60 | bimodal | bimodal | bimodal | bimodal | bimodal incl 255 (STAR) | 0/3/255 | 30-55 | 30-55 | 30-50 | 20-40 |
| Mt fraction | 0.1-2% | 0.1-2% | <1% | <0.1% | <0.1% | varies | varies | **<10% target** | <2% | n/a | varies |

Mean MAPQ is misleading; the distribution is bimodal (0 and aligner-max). The fraction at MAPQ >= 30 is more informative:
```bash
samtools view -c -F 2308 -q 30 in.bam   # primary, mapped, MAPQ>=30
samtools view -c -F 2308 in.bam          # primary, mapped (denominator)
# For STAR/STARsolo, use -q 255 instead of -q 30 (255 is the unique-mapping sentinel)
```

## What Flagstat Does Not Reveal

A 99% flagstat mapping rate does NOT mean the data is usable. Common false-positive scenarios:

1. **Adapter readthrough**: short fragments (insert < 2 * read_length) sequence into adapter; aligners soft-clip the adapter portion and flag the read as MAPPED. Detect:
   ```bash
   samtools stats input.bam | grep "bases soft-clipped"   # >5% suggests adapter contamination
   ```
2. **Off-target enrichment** (capture/WES): detect via `picard CollectHsMetrics` PCT_OFF_BAIT or PCT_SELECTED_BASES.
3. **Low-complexity pile-up**: telomere/centromere reads mass at MAPQ-0; counted as mapped but useless. Detect via MAPQ distribution.
4. **Cross-sample contamination**: detect via `verifybamid2` or `somalier` (FREEMIX > 1% degrades somatic calling; > 5% breaks germline calling).
5. **Wrong reference build**: a BAM aligned to GRCh37 viewed against GRCh38 looks fine to flagstat but produces nonsense pileups. Compare `@SQ M5:` from BAM header with `samtools dict ref.fa` -- see alignment-validation.

## Insert Size Caveats

`samtools stats` reports the IS section only for FR-oriented properly paired reads. So:
- Mate-pair libraries (RF orientation): IS section *empty* -- proper-pair flag not set for RF
- ATAC-seq: bimodal/multimodal expected (nucleosome ladder ~50/~180/~340 bp). Unimodal suggests poor transposition.
- RNA-seq: TLEN includes intron span -- mean meaningless
- Bisulfite (PBAT): orientation reversed; samtools may not flag proper pair

## Related Skills

- sam-bam-basics - View alignment files; aligner-aware MAPQ semantics
- alignment-indexing - idxstats requires index; secondary+supp counted
- alignment-validation - Insert size by library, contamination, sample-swap detection
- duplicate-handling - Library-aware duplicate rate expectations
- alignment-filtering - Filter before stats
- sequence-io/sequence-statistics - FASTA/FASTQ statistics

More from GPTomics/bioSkills

Skill	Description
bio-admet-prediction	Predicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
bio-alignment-amplicon-clipping	Trim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
bio-alignment-filtering	Filter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
bio-alignment-indexing	Create and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
bio-alignment-io	Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
bio-alignment-msa-parsing	Parse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
bio-alignment-msa-statistics	Calculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
bio-alignment-multiple	Perform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
bio-alignment-pairwise	Perform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
bio-alignment-sorting	Sort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.