bio-vcf-statistics

Name: bio-vcf-statistics
Author: GPTomics/bioSkills

$npx mdskill add GPTomics/bioSkills/bio-vcf-statistics

Generate variant statistics and quality metrics from VCF files

Analyzes VCF contents for counts, Ti/Tv ratios, and quality distributions
Depends on bcftools CLI tools including stats, gtcheck, and query
Parses section-tagged output lines to extract SN, TSTV, AF, QUAL, and DP
Outputs structured metrics for variant quality evaluation and sample comparison

SKILL.md

.github/skills/bio-vcf-statisticsView on GitHub ↗

---
name: bio-vcf-statistics
description: Generate variant statistics, sample concordance, and quality metrics using bcftools stats and gtcheck. Use when evaluating variant quality, comparing samples, or summarizing VCF contents.
tool_type: cli
primary_tool: bcftools
---

## Version Compatibility

Reference examples tested with: bcftools 1.19+, numpy 1.26+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# VCF Statistics

Generate statistics and quality metrics using bcftools.

## Statistics Tools

| Command | Purpose |
|---------|---------|
| `bcftools stats` | Comprehensive variant statistics |
| `bcftools gtcheck` | Sample concordance and relatedness |
| `bcftools query` | Custom summaries |

## bcftools stats

**Goal:** Generate comprehensive variant statistics including counts, Ti/Tv ratio, and quality distributions.

**Approach:** Run bcftools stats and parse section-tagged output lines (SN, TSTV, AF, QUAL, DP).

**"How many variants are in this VCF?"** → Compute summary counts, substitution types, and quality distributions from variant records.

### Basic Statistics

```bash
bcftools stats input.vcf.gz > stats.txt
```

### View Key Metrics

```bash
bcftools stats input.vcf.gz | grep "^SN"
```

Output sections:
- `SN` - Summary numbers
- `TSTV` - Transitions/transversions
- `SiS` - Singleton stats
- `AF` - Allele frequency distribution
- `QUAL` - Quality distribution
- `IDD` - Indel distribution
- `ST` - Substitution types
- `DP` - Depth distribution

### Summary Numbers (SN)

```bash
bcftools stats input.vcf.gz | grep "^SN" | cut -f3-
```

Reports:
- Number of samples
- Number of records
- Number of SNPs
- Number of indels
- Number of multiallelic sites
- Number of multiallelic SNPs

### Transition/Transversion Ratio

```bash
bcftools stats input.vcf.gz | grep "^TSTV"
```

Transitions (purine-to-purine: A↔G, or pyrimidine-to-pyrimidine: C↔T) are chemically favored over transversions because they preserve the purine/pyrimidine ring structure. CpG deamination (methylated C→T) is the single most common point mutation in vertebrate genomes and is a transition, which further inflates the Ti/Tv ratio. Exomes have higher Ti/Tv than whole genomes because coding regions are enriched for CpG dinucleotides.

Expected Ti/Tv ratio:
- Whole genome: ~2.0-2.1
- Exome: ~2.8-3.3
- Ti/Tv below expected range suggests excess false-positive SNPs (random errors produce Ti/Tv ~0.5)
- Ti/Tv above expected range suggests over-filtering that disproportionately removes transversions

### Per-Sample Statistics

```bash
bcftools stats -s - input.vcf.gz > per_sample.txt
```

### Compare Two VCFs

```bash
bcftools stats input1.vcf.gz input2.vcf.gz > comparison.txt
```

### Region-Specific Stats

```bash
bcftools stats -r chr1:1000000-2000000 input.vcf.gz > region_stats.txt
bcftools stats -R exome.bed input.vcf.gz > exome_stats.txt
```

## Plotting Statistics

**Goal:** Visualize variant statistics as publication-quality plots.

**Approach:** Pipe bcftools stats output to plot-vcfstats to generate PDF and PNG plots.

### Generate Plots

```bash
bcftools stats input.vcf.gz > stats.txt
plot-vcfstats -p output_dir stats.txt
```

Creates:
- `output_dir/summary.pdf`
- Individual PNG files

### Comparison Plots

```bash
bcftools stats file1.vcf.gz file2.vcf.gz > comparison.txt
plot-vcfstats -p comparison_dir comparison.txt
```

## Expected QC Metric Ranges

Interpreting variant statistics requires context-dependent thresholds. A metric that looks normal in WGS may be alarming in WES.

| Metric | WGS Expected | WES Expected | Flag If |
|--------|-------------|-------------|---------|
| Ti/Tv ratio | ~2.0-2.1 | ~2.8-3.3 | WGS: <1.8 or >2.5; WES: <2.5 or >3.5 |
| Het/Hom ratio | 1.5-2.0 | 1.5-2.0 | Outlier relative to cohort |
| Call rate per variant | >95% | >95% | <90% |
| Call rate per sample | >95% | >95% | <90% |
| Singleton rate (WGS) | ~100-200k | Variable | >2x cohort mean |
| Mendelian error rate (trios) | <0.5% | <0.5% | >1% |

Interpretation notes:
- Het/Hom ratio too high suggests contamination (mixed DNA inflates heterozygosity); too low suggests inbreeding, population structure, or excess homozygous reference calls from low coverage
- Excess singletons per sample suggest sample quality issues, contamination, or library artifacts
- Call rate thresholds apply to common variants; rare variants naturally have higher missingness
- These ranges assume human diploid germline calling; somatic, polyploid, or non-model organisms require different expectations

## Stratified Evaluation

A variant caller with 99% overall accuracy may perform at 70% in difficult genomic regions. Always evaluate stratified by region complexity.

```bash
bcftools stats -R easy_regions.bed input.vcf.gz > easy_stats.txt
bcftools stats -R difficult_regions.bed input.vcf.gz > difficult_stats.txt
```

Key stratification categories (using GIAB stratification BED files):
- **Easy/high-confidence regions** - baseline accuracy measurement
- **Homopolymer runs** - systematic indel errors, especially for Illumina and Ion Torrent
- **Tandem repeats / low-complexity** - alignment ambiguity inflates both FP and FN
- **Segmental duplications** - paralogous mapping produces false heterozygous calls
- **High GC (>70%) / Low GC (<25%)** - coverage bias creates systematic missingness
- **MHC / centromeric regions** - extreme polymorphism or repetitiveness defeats standard callers

GIAB stratification BED files are available at github.com/genome-in-a-bottle/genome-stratifications. Comparing Ti/Tv ratio between easy and difficult regions is a quick diagnostic: a drop in Ti/Tv within difficult regions confirms elevated false-positive rates there.

## Population-Scale QC Metrics

For multi-sample VCFs, per-sample and per-site population metrics reveal systematic issues invisible in single-sample analyses.

**Per-sample checks:**
- Variant count should be roughly consistent across samples from the same population (~4.5M SNPs per WGS sample in humans); outliers suggest processing errors or contamination
- Per-sample Het/Hom ratio: outliers suggest contamination (high) or sample swaps between populations (unexpected value)
- Excess singletons in one sample relative to the cohort mean suggests library or sequencing issues

**Per-site checks:**
- Excess heterozygosity (InbreedingCoeff < -0.3 in GATK, or HWE test): suggests genotyping error at sites where reads from paralogous regions pile up
- Hardy-Weinberg equilibrium: sites failing HWE (p < 1e-5) within a homogeneous population may be genotyping artifacts. HWE filtering must be applied within populations, never across mixed-ancestry cohorts where true allele frequency differences violate HWE assumptions

```bash
bcftools query -f '%CHROM\t%POS\t%INFO/InbreedingCoeff\n' input.vcf.gz | \
    awk '$3 < -0.3' > excess_het_sites.txt
```

## bcftools gtcheck

**Goal:** Verify sample identity and detect sample swaps by comparing genotype concordance.

**Approach:** Use bcftools gtcheck to compute pairwise discordance rates; interpret thresholds based on expected relatedness.

```bash
bcftools gtcheck -G 1 input.vcf.gz > relatedness.txt
bcftools gtcheck -g reference.vcf.gz query.vcf.gz
```

Discordance thresholds for interpreting results:
- Same individual (replicates/re-extractions): <0.5% discordance
- First-degree relatives (parent-child, siblings): ~10-15% discordance
- Unrelated individuals: >10% discordance (typically 15-25%)
- For trios: Mendelian error rate should be <0.5%; rates >1% suggest sample swaps or contamination

Output format (DC lines):

```
DC  0  sample1  sample2  0.95  1234  1200
```

Fields: DC tag, index, sample1, sample2, discordance rate, sites compared, discordant sites. Sample swaps are one of the most common errors in genomics studies. Running gtcheck is essential for any multi-sample study and should be performed early in the QC pipeline before downstream analysis.

## Quick Statistics with Query

**Goal:** Compute targeted summary statistics using bcftools query and shell tools.

**Approach:** Extract specific fields with bcftools query/view and aggregate with awk for counts, means, and distributions.

### Count Variants

```bash
bcftools view -H input.vcf.gz | wc -l
```

### Count by Type

```bash
# SNPs
bcftools view -v snps -H input.vcf.gz | wc -l

# Indels
bcftools view -v indels -H input.vcf.gz | wc -l
```

### Count PASS Variants

```bash
bcftools view -f PASS -H input.vcf.gz | wc -l
```

### Quality Distribution

```bash
bcftools query -f '%QUAL\n' input.vcf.gz | \
    awk '{sum+=$1; count++} END {print "Mean QUAL:", sum/count}'
```

### Depth Distribution

```bash
bcftools query -f '%INFO/DP\n' input.vcf.gz | \
    awk '{sum+=$1; count++} END {print "Mean DP:", sum/count}'
```

### Genotype Counts

```bash
# Count heterozygous sites per sample
bcftools query -f '[%GT\t]\n' input.vcf.gz | \
    awk -F'\t' '{for(i=1;i<=NF;i++) if($i=="0/1" || $i=="0|1") het[i]++}
        END {for(i in het) print "Sample", i, "het:", het[i]}'
```

### Allele Frequency Spectrum

```bash
bcftools query -f '%INFO/AF\n' input.vcf.gz | \
    awk '{
        if($1<0.01) rare++
        else if($1<0.05) low++
        else if($1<0.5) common++
        else freq++
    } END {
        print "Rare (<1%):", rare
        print "Low (1-5%):", low
        print "Common (5-50%):", common
        print "Frequent (>50%):", freq
    }'
```

## Sample Statistics

**Goal:** Compute per-sample variant counts, genotype distributions, and missingness rates.

**Approach:** Use bcftools query/view/stats per sample to tabulate sample-level metrics.

### List and Count Samples

```bash
bcftools query -l input.vcf.gz
bcftools query -l input.vcf.gz | wc -l
```

### Per-Sample Variant Counts

```bash
for sample in $(bcftools query -l input.vcf.gz); do
    count=$(bcftools view -s "$sample" -H input.vcf.gz | \
        bcftools view -c 1 -H | wc -l)
    echo "$sample: $count"
done
```

### Missing Genotypes per Sample

```bash
bcftools stats -s - input.vcf.gz | grep "^PSC"
```

## cyvcf2 Statistics

**Goal:** Compute variant statistics programmatically in Python for custom analyses.

**Approach:** Iterate variants with cyvcf2, accumulate counts by type, and compute quality/genotype distributions with numpy.

### Basic Counts

```python
from cyvcf2 import VCF

stats = {'snps': 0, 'indels': 0, 'other': 0}

for variant in VCF('input.vcf.gz'):
    if variant.is_snp:
        stats['snps'] += 1
    elif variant.is_indel:
        stats['indels'] += 1
    else:
        stats['other'] += 1

print(f'SNPs: {stats["snps"]}')
print(f'Indels: {stats["indels"]}')
print(f'Other: {stats["other"]}')
```

### Quality Statistics

```python
from cyvcf2 import VCF
import numpy as np

quals = []
for variant in VCF('input.vcf.gz'):
    if variant.QUAL:
        quals.append(variant.QUAL)

quals = np.array(quals)
print(f'Mean QUAL: {np.mean(quals):.1f}')
print(f'Median QUAL: {np.median(quals):.1f}')
print(f'Min QUAL: {np.min(quals):.1f}')
print(f'Max QUAL: {np.max(quals):.1f}')
```

### Genotype Distribution

```python
from cyvcf2 import VCF

vcf = VCF('input.vcf.gz')
samples = vcf.samples

hom_ref = [0] * len(samples)
het = [0] * len(samples)
hom_alt = [0] * len(samples)
missing = [0] * len(samples)

for variant in vcf:
    for i, gt in enumerate(variant.gt_types):
        if gt == 0:
            hom_ref[i] += 1
        elif gt == 1:
            het[i] += 1
        elif gt == 3:
            hom_alt[i] += 1
        else:
            missing[i] += 1

for i, sample in enumerate(samples):
    print(f'{sample}: HOM_REF={hom_ref[i]}, HET={het[i]}, HOM_ALT={hom_alt[i]}, MISS={missing[i]}')
```

### Transition/Transversion Calculation

```python
from cyvcf2 import VCF

transitions = 0
transversions = 0

ti_pairs = {('A', 'G'), ('G', 'A'), ('C', 'T'), ('T', 'C')}

for variant in VCF('input.vcf.gz'):
    if not variant.is_snp:
        continue
    ref = variant.REF
    alt = variant.ALT[0]
    if (ref, alt) in ti_pairs:
        transitions += 1
    else:
        transversions += 1

ratio = transitions / transversions if transversions > 0 else 0
print(f'Transitions: {transitions}')
print(f'Transversions: {transversions}')
print(f'Ti/Tv ratio: {ratio:.2f}')
```

## Common Workflows

### Quality Control Report

```bash
bcftools stats input.vcf.gz > stats.txt
grep "^SN" stats.txt | cut -f3-
grep "^TSTV" stats.txt | cut -f5
plot-vcfstats -p qc_plots stats.txt
```

### Compare Before/After Filtering

```bash
bcftools stats raw.vcf.gz filtered.vcf.gz > comparison.txt
grep "^SN" comparison.txt | head -20
```

## Quick Reference

| Task | Command |
|------|---------|
| Full stats | `bcftools stats input.vcf.gz` |
| Summary only | `bcftools stats input.vcf.gz \| grep "^SN"` |
| Ti/Tv ratio | `bcftools stats input.vcf.gz \| grep "^TSTV"` |
| Per-sample | `bcftools stats -s - input.vcf.gz` |
| Compare VCFs | `bcftools stats file1.vcf.gz file2.vcf.gz` |
| Sample check | `bcftools gtcheck -G 1 input.vcf.gz` |
| Plot stats | `plot-vcfstats -p dir stats.txt` |

## Common Errors

| Error | Cause | Solution |
|-------|-------|----------|
| `No data` | Empty VCF | Check if VCF has variants |
| `plot-vcfstats not found` | Not installed | Install with bcftools |
| `Cannot open` | Invalid VCF | Check file format |

## Related Skills

- variant-calling/vcf-basics - View and query VCF files
- variant-calling/filtering-best-practices - Evaluate filter impact on QC metrics
- variant-calling/vcf-manipulation - Compare and merge call sets
- variant-calling/variant-calling - Assess calling quality
- variant-calling/joint-calling - Cohort-level genotyping where population metrics apply
- alignment-files/bam-statistics - Upstream alignment QC that affects variant statistics

More from GPTomics/bioSkills

Skill	Description
bio-admet-prediction	Predicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
bio-alignment-amplicon-clipping	Trim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
bio-alignment-filtering	Filter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
bio-alignment-indexing	Create and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
bio-alignment-io	Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
bio-alignment-msa-parsing	Parse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
bio-alignment-msa-statistics	Calculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
bio-alignment-multiple	Perform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
bio-alignment-pairwise	Perform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
bio-alignment-sorting	Sort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.