bio-read-alignment-bwa-alignment

Name: bio-read-alignment-bwa-alignment
Author: GPTomics/bioSkills

$ npx mdskill add GPTomics/bioSkills/bio-read-alignment-bwa-alignment

Align DNA reads to reference genomes using BWA-MEM2.

Processes paired-end and single-end sequencing data for whole-genome analysis.
Depends on bwa-mem2 CLI and requires samtools for downstream sorting.
Executes alignment commands based on input file formats and read group needs.
Outputs SAM or BAM files containing mapped read positions and quality scores.


---
name: bio-read-alignment-bwa-alignment
description: Align DNA short reads to reference genomes using bwa-mem2, the faster successor to BWA-MEM. Use when aligning DNA short reads to a reference genome.
tool_type: cli
primary_tool: bwa-mem2
---

## Version Compatibility

Reference examples tested with: GATK 4.5+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# BWA-MEM2 Alignment

**"Align reads with BWA"** → Map DNA reads to a reference genome using BWA-MEM2, the standard aligner for whole-genome and exome sequencing.
- CLI: `bwa-mem2 mem -t 8 ref.fa R1.fq R2.fq | samtools sort -o aligned.bam`

## Build Index

```bash
# Index reference genome (required once)
bwa-mem2 index reference.fa

# Creates: reference.fa.0123, reference.fa.amb, reference.fa.ann, reference.fa.bwt.2bit.64, reference.fa.pac
```
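
Because the index only needs building once, a script can check for the files listed above before re-running `bwa-mem2 index`. The following is a minimal sketch, not part of bwa-mem2 itself; the function name `bwa_mem2_index_ready` is hypothetical, and it assumes the five suffixes shown above are the complete set for the installed version.

```bash
# Sketch: return 0 only if every expected bwa-mem2 index file exists
# and is non-empty. The suffix list is copied from the section above.
bwa_mem2_index_ready() {
    local ref=$1 suffix
    for suffix in 0123 amb ann bwt.2bit.64 pac; do
        [ -s "${ref}.${suffix}" ] || return 1
    done
    return 0
}

# Hypothetical usage: build the index only when it is missing.
# bwa_mem2_index_ready reference.fa || bwa-mem2 index reference.fa
```

The `-s` test also rejects zero-byte files, which can be left behind when an indexing run is interrupted.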

## Basic Alignment

```bash
# Paired-end reads
bwa-mem2 mem -t 8 reference.fa reads_1.fq.gz reads_2.fq.gz > aligned.sam

# Single-end reads
bwa-mem2 mem -t 8 reference.fa reads.fq.gz > aligned.sam
```
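
When a script has to choose between the paired-end and single-end forms, one option is to derive the mate file name from R1 and test whether it exists. This sketch assumes the `_1.fq.gz` / `_2.fq.gz` naming used in the examples above; `mate_path` is a hypothetical helper, and other naming schemes would need their own patterns.

```bash
# Sketch: print the R2 path for an R1 path that ends in _1.fq.gz.
# Returns 1 when the name does not follow the assumed convention.
mate_path() {
    local r1=$1
    case $r1 in
        *_1.fq.gz) printf '%s\n' "${r1%_1.fq.gz}_2.fq.gz" ;;
        *) return 1 ;;
    esac
}

# Hypothetical usage:
# if r2=$(mate_path "$r1") && [ -e "$r2" ]; then
#     bwa-mem2 mem -t 8 reference.fa "$r1" "$r2" > aligned.sam
# else
#     bwa-mem2 mem -t 8 reference.fa "$r1" > aligned.sam
# fi
```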

## Alignment with Read Groups

```bash
# Add read group information (required for GATK)
bwa-mem2 mem -t 8 \
    -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA\tLB:lib1' \
    reference.fa reads_1.fq.gz reads_2.fq.gz > aligned.sam
```
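
For more than one sample, the read group string can be generated rather than typed by hand. The sketch below uses a hypothetical helper, `make_read_group`. It assumes Illumina data and takes an optional third argument for the read group ID, since `ID` must be unique per read group while `SM` names the sample.

```bash
# Sketch: print an @RG line with literal "\t" separators, the form
# that -R expects in the examples above.
# Arguments: sample name, library name, optional read group ID
# (defaults to the sample name).
make_read_group() {
    local sample=$1 library=$2 unit=${3:-$1}
    printf '@RG\\tID:%s\\tSM:%s\\tPL:ILLUMINA\\tLB:%s\n' \
        "$unit" "$sample" "$library"
}

# Hypothetical usage:
# bwa-mem2 mem -t 8 -R "$(make_read_group sample1 lib1)" \
#     reference.fa reads_1.fq.gz reads_2.fq.gz > aligned.sam
```

Calling `make_read_group sample1 lib1` prints the same string as the hand-written example above.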

## Direct to Sorted BAM

```bash
# Pipe to samtools for sorted BAM output
bwa-mem2 mem -t 8 \
    -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA' \
    reference.fa reads_1.fq.gz reads_2.fq.gz | \
    samtools sort -@ 4 -o aligned.sorted.bam -

# Index the BAM
samtools index aligned.sorted.bam
```

## Mark Duplicates Pipeline

**Goal:** Produce a duplicate-marked, sorted BAM file from raw reads in a single streaming pipeline.

**Approach:** Pipe BWA-MEM2 output through samtools fixmate (to add mate score tags), coordinate sort, and markdup in a single command chain to avoid intermediate files.

```bash
# Full pipeline: align, fixmate, sort, markdup
bwa-mem2 mem -t 8 -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA' \
    reference.fa reads_1.fq.gz reads_2.fq.gz | \
    samtools fixmate -m -@ 4 - - | \
    samtools sort -@ 4 - | \
    samtools markdup -@ 4 - aligned.markdup.bam

samtools index aligned.markdup.bam
```

## Common Options

```bash
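# -t  threads
# -M  mark shorter split hits as secondary (Picard compatible)
# -Y  use soft clipping for supplementary alignments
# -K  process INT input bases in each batch (a fixed value keeps output independent of thread count)
# -R  read group header line
# Comments sit above the command: text after a trailing backslash breaks the line continuation.
bwa-mem2 mem -t 8 -M -Y -K 100000000 \
    -R '@RG\tID:s1\tSM:s1' \
    reference.fa r1.fq r2.fq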
```
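
A comment cannot follow a line-continuation backslash in shell, so one way to keep a note next to each flag is to collect the options in a bash array. This is a sketch of that pattern using the flags listed above. The array name `opts` is arbitrary, and the pattern requires bash rather than plain `sh`.

```bash
# Sketch: one option per line, each with its own comment.
opts=(
    -t 8                     # threads
    -M                       # mark shorter split hits as secondary
    -Y                       # soft-clip supplementary alignments
    -K 100000000             # input bases per batch
    -R '@RG\tID:s1\tSM:s1'   # read group
)

# Hypothetical usage: the quoted expansion passes each element as one argument.
# bwa-mem2 mem "${opts[@]}" reference.fa r1.fq r2.fq > aligned.sam
```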

## Key Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| -t | 1 | Number of threads |
| -k | 19 | Minimum seed length |
| -w | 100 | Band width for extension |
| -r | 1.5 | Re-seeding trigger ratio |
| -c | 500 | Skip seeds with more than INT hits |
| -A | 1 | Match score |
| -B | 4 | Mismatch penalty |
| -O | 6 | Gap open penalty |
| -E | 1 | Gap extension penalty |
| -M | off | Mark shorter split hits as secondary |

## Output Filters

```bash
# Drop unmapped reads (-F 4) and alignments with MAPQ below 20 (-q 20)
bwa-mem2 mem -t 8 reference.fa r1.fq r2.fq | \
    samtools view -@ 4 -b -q 20 -F 4 - | \
    samtools sort -@ 4 -o aligned.filtered.bam -
```
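
The `-F 4` in the filter above is a single SAM flag bit. Excluding several categories at once means OR-ing their bits into one mask. The helper below, `flag_mask`, is a hypothetical sketch covering four commonly excluded flags. The bit values come from the SAM specification.

```bash
# Sketch: combine named SAM flags into one integer for samtools view -F.
flag_mask() {
    local mask=0 name
    for name in "$@"; do
        case $name in
            unmapped)      mask=$(( mask | 0x4 )) ;;
            secondary)     mask=$(( mask | 0x100 )) ;;
            duplicate)     mask=$(( mask | 0x400 )) ;;
            supplementary) mask=$(( mask | 0x800 )) ;;
            *) echo "unknown flag name: $name" >&2; return 1 ;;
        esac
    done
    printf '%d\n' "$mask"
}

# Hypothetical usage: keep primary, mapped alignments only.
# samtools view -b -q 20 -F "$(flag_mask unmapped secondary supplementary)" -
```

Dropping supplementary alignments suits workflows that need primary alignments only. It would discard the split-read evidence that SV callers rely on.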

## Split Read Alignment

```bash
# For SV detection, -Y soft-clips supplementary alignments instead of hard-clipping them,
# so the full read sequence stays in each record
bwa-mem2 mem -t 8 -Y reference.fa r1.fq r2.fq > aligned.sam
```

## Memory Requirements

- Index loading: ~10GB for human genome
- Per thread: ~1-2GB
- Typical human WGS: 30-50GB RAM with 8 threads

## BWA-MEM (Alternative)

```bash
# Build index
bwa index reference.fa

# Paired-end alignment
bwa mem -t 8 reference.fa reads_1.fq.gz reads_2.fq.gz > aligned.sam

# With read groups
bwa mem -t 8 -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA' \
    reference.fa reads_1.fq.gz reads_2.fq.gz > aligned.sam

# Direct to sorted BAM
bwa mem -t 8 -R '@RG\tID:sample1\tSM:sample1\tPL:ILLUMINA' \
    reference.fa reads_1.fq.gz reads_2.fq.gz | \
    samtools sort -@ 4 -o aligned.sorted.bam -
```

## BWA-MEM vs BWA-MEM2

| Feature | BWA-MEM | BWA-MEM2 |
|---------|---------|----------|
| Status | Active | Archived |
| Speed | 1x | 2-3x faster |
| Index format | .bwt | .bwt.2bit.64 |
| Results | Baseline | Nearly identical |
| Memory | ~5GB | ~10GB |
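
Since both tools share the `index` and `mem` subcommands shown above, a script can fall back to `bwa` when `bwa-mem2` is not installed. This is a minimal sketch with a hypothetical function name, `pick_aligner`. Because the index formats differ, the index has to be built with the tool that is selected.

```bash
# Sketch: print the first aligner found on PATH, preferring bwa-mem2.
pick_aligner() {
    local tool
    for tool in bwa-mem2 bwa; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    echo "neither bwa-mem2 nor bwa found on PATH" >&2
    return 1
}

# Hypothetical usage:
# aligner=$(pick_aligner) || exit 1
# "$aligner" index reference.fa
# "$aligner" mem -t 8 reference.fa reads_1.fq.gz reads_2.fq.gz > aligned.sam
```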

## Related Skills

- read-qc/fastp-workflow - Preprocess reads before alignment
- alignment-files/alignment-sorting - Post-alignment processing
- alignment-files/duplicate-handling - Mark duplicates
- variant-calling/variant-calling - Call variants from BAM
