bio-alignment-indexing
$
npx mdskill add GPTomics/bioSkills/bio-alignment-indexingGenerate BAM/CRAM indices for random genomic access
- Enables fast retrieval of specific alignment regions from large files.
- Depends on samtools CLI and pysam Python library APIs.
- Selects BAI or CSI type based on contig length limits.
- Outputs index files for immediate use with random access tools.
SKILL.md
.github/skills/bio-alignment-indexingView on GitHub ↗
---
name: bio-alignment-indexing
description: Create and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
tool_type: cli
primary_tool: samtools
---
## Version Compatibility
Reference examples tested with: pysam 0.22+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Alignment Indexing
Create indices for random access to alignment files using samtools and pysam.
**"Index a BAM file"** → Create a .bai/.csi index enabling random access to genomic regions.
- CLI: `samtools index file.bam`
- Python: `pysam.index('file.bam')`
## Index Types
| Index | Extension | Max contig | Bin shift | When required |
|-------|-----------|-----------|-----------|---------------|
| BAI | `.bai` / `.bam.bai` | 2^29-1 = ~536 Mbp | fixed (16 kb) | Default for human, mouse, fly, fish |
| CSI | `.csi` / `.bam.csi` | 2^(min_shift + depth*3) | configurable via `-m` | **Required** for any contig >536 Mbp |
| CRAI | `.crai` / `.cram.crai` | chunk-based | n/a | CRAM only |
| TBI | `.tbi` | 2^29-1 | fixed | tabix VCF/BED -- same limit as BAI |
### Which Index for Which Genome
| Genome | Largest contig | Index |
|--------|---------------|-------|
| GRCh38 / GRCh37 (human) | 248 Mbp | BAI |
| GRCm39 (mouse) | 195 Mbp | BAI |
| GRCz11 (zebrafish), TAIR10 (Arabidopsis) | 78 Mbp / 30 Mbp | BAI |
| Wheat IWGSC (Triticum aestivum) | ~830 Mbp avg | **CSI** |
| Pine, fir, axolotl, sugar pine | multi-Gbp | **CSI with larger `-m`** |
| Long-read assembly with very large contigs | varies | check `cut -f2 ref.fa.fai \| sort -nr \| head -1` |
For polyploid plants and salamander-scale genomes, increase the bin shift:
```bash
# Default CSI matches BAI bin layout: 2^(14 + 5*3) = 2^29 ≈ 512 Mbp per contig
samtools index -c file.bam
# Larger min_shift for contigs >512 Mbp (wheat, axolotl, sugar pine)
samtools index -c -m 18 file.bam # 2^(18+15) = 2^33 = ~8.5 Gbp per contig
```
**Index file precedence trap:** when both `.bai` and `.csi` exist, samtools uses `.bai`. After re-indexing to CSI for a long contig, delete the old `.bai` or operations fail confusingly.
## samtools index
### Create BAI Index
```bash
samtools index input.bam
# Creates input.bam.bai
```
### Create CSI Index
```bash
samtools index -c input.bam
# Creates input.bam.csi
```
### Specify Output Name
```bash
samtools index input.bam output.bai
```
### Multi-threaded Indexing
```bash
samtools index -@ 4 input.bam
```
### Index CRAM
```bash
samtools index input.cram
# Creates input.cram.crai
```
## Index Requirements
Indexing requires coordinate-sorted files:
```bash
# Check sort order
samtools view -H input.bam | grep "^@HD"
# Should show SO:coordinate
# Sort if needed, then index
samtools sort -o sorted.bam input.bam
samtools index sorted.bam
```
## Using Indices for Region Access
**Goal:** Extract reads overlapping specific genomic coordinates from an indexed BAM.
**Approach:** With the index present, `samtools view` or `pysam.fetch()` can jump directly to the relevant file offset instead of scanning the entire file.
### samtools view with Region
```bash
# Requires index file present
samtools view input.bam chr1:1000000-2000000
```
### Multiple Regions
```bash
samtools view input.bam chr1:1000-2000 chr2:3000-4000
```
### Regions from BED File
```bash
samtools view -L regions.bed input.bam
```
## pysam Python Alternative
### Create Index
```python
import pysam
pysam.index('input.bam')
# Creates input.bam.bai
```
### Create CSI Index
```python
pysam.index('input.bam', 'input.bam.csi', csi=True)
```
### Fetch with Index
```python
with pysam.AlignmentFile('input.bam', 'rb') as bam:
# fetch() requires index
for read in bam.fetch('chr1', 1000000, 2000000):
print(read.query_name)
```
### Check if Indexed
```python
import pysam
from pathlib import Path
def is_indexed(bam_path):
bam_path = Path(bam_path)
return (bam_path.with_suffix('.bam.bai').exists() or
Path(str(bam_path) + '.bai').exists() or
bam_path.with_suffix('.bam.csi').exists())
if not is_indexed('input.bam'):
pysam.index('input.bam')
```
### Fetch Multiple Regions
```python
regions = [('chr1', 1000, 2000), ('chr1', 5000, 6000), ('chr2', 1000, 2000)]
with pysam.AlignmentFile('input.bam', 'rb') as bam:
for chrom, start, end in regions:
count = sum(1 for _ in bam.fetch(chrom, start, end))
print(f'{chrom}:{start}-{end}: {count} reads')
```
### Count Reads in Region
```python
with pysam.AlignmentFile('input.bam', 'rb') as bam:
count = bam.count('chr1', 1000000, 2000000)
print(f'Reads in region: {count}')
```
### Get Reads Covering Position
```python
with pysam.AlignmentFile('input.bam', 'rb') as bam:
for read in bam.fetch('chr1', 1000000, 1000001):
if read.reference_start <= 1000000 < read.reference_end:
print(f'{read.query_name} covers position 1000000')
```
## Index File Locations
samtools looks for indices in two locations:
```
input.bam.bai # Standard location
input.bai # Alternative location
```
For CRAM:
```
input.cram.crai
```
## idxstats - Index Statistics
### Get Per-Chromosome Counts
```bash
samtools idxstats input.bam
```
Output format:
```
chr1 248956422 5000000 0
chr2 242193529 4500000 0
* 0 0 10000
```
Columns: reference name, length, mapped reads, unmapped reads.
### What "mapped" Actually Counts (Caveat)
The mapped column counts **every alignment record with that RNAME, including secondary AND supplementary**. For long-read minimap2 output, where a single read can produce many supplementary chimeric alignments, idxstats overcounts input reads -- typically 1.5-3x.
For unique read counts, use primary-only:
```bash
samtools view -c -F 2304 input.bam chr1 # primary only
```
Cross-check unmapped consistency (a senior sanity check):
```bash
samtools idxstats file.bam | awk '{sum+=$4} END {print sum}' # idxstats unmapped (sum across all rows; PE orphans get a contig RNAME)
samtools view -c -f 4 -F 2304 file.bam # primary unmapped (should match)
```
### Sum Total Mapped Reads
```bash
samtools idxstats input.bam | awk '{sum += $3} END {print sum}'
```
### pysam idxstats
```python
with pysam.AlignmentFile('input.bam', 'rb') as bam:
for stat in bam.get_index_statistics():
print(f'{stat.contig}: {stat.mapped} mapped, {stat.unmapped} unmapped')
```
## FASTA Index (faidx)
Related but different - index reference FASTA for random access:
```bash
samtools faidx reference.fa
# Creates reference.fa.fai
# Fetch region from indexed FASTA
samtools faidx reference.fa chr1:1000-2000
```
### pysam FastaFile
```python
with pysam.FastaFile('reference.fa') as ref:
seq = ref.fetch('chr1', 1000, 2000)
print(seq)
```
## Quick Reference
| Task | samtools | pysam |
|------|----------|-------|
| Create BAI | `samtools index file.bam` | `pysam.index('file.bam')` |
| Create CSI | `samtools index -c file.bam` | `pysam.index('file.bam', csi=True)` |
| Fetch region | `samtools view file.bam chr1:1-1000` | `bam.fetch('chr1', 0, 1000)` |
| Count in region | `samtools view -c file.bam chr1:1-1000` | `bam.count('chr1', 0, 1000)` |
| Index stats | `samtools idxstats file.bam` | `bam.get_index_statistics()` |
| Index FASTA | `samtools faidx ref.fa` | Automatic with FastaFile |
## Index Staleness
If the BAM was modified after indexing, the index points to wrong file offsets and region queries return wrong (or zero) reads. Quick check:
```bash
if [ input.bam -nt input.bam.bai ]; then
echo "Index older than BAM; re-indexing"
samtools index input.bam
fi
```
## Contig-Naming Sanity Check
A leading cause of "my variant calling produced empty VCFs" tickets: querying `chrM` against a BAM that uses `MT` (or `chr1` vs `1`). Always inspect contig conventions before region queries:
```bash
samtools view -H input.bam | grep '^@SQ' | head -3
# Compare with reference dict:
samtools dict ref.fa | head -3
```
UCSC convention uses `chr1`/`chrM`; Ensembl/NCBI uses `1`/`MT`. The two are not interchangeable; tools fail with "contig not found" or silently return zero reads.
## Common Errors
| Error | Cause | Solution |
|-------|-------|----------|
| `random alignment retrieval only works for indexed BAM` | Missing index | Run `samtools index file.bam` |
| `file is not sorted` | Unsorted BAM | Sort first with `samtools sort` |
| `chromosome not found` | Wrong chromosome name | Check names with `samtools view -H` |
| Region query returns zero reads on a known-populated locus | Stale BAI / `chr` vs no-`chr` mismatch | Re-index; verify naming convention |
| BAI silently truncates reads on contigs >536 Mbp | Plant / amphibian / amplified genome | Use CSI: `samtools index -c file.bam` |
## Related Skills
- sam-bam-basics - View and convert alignment files
- alignment-sorting - Sort BAM files (required before indexing)
- alignment-filtering - Filter by regions using index
- bam-statistics - Use idxstats for quick counts
- sequence-io/read-sequences - Index FASTA with SeqIO.index_db()