bio-comparative-genomics-positive-selection

Name: bio-comparative-genomics-positive-selection
Author: GPTomics/bioSkills

$npx mdskill add GPTomics/bioSkills/bio-comparative-genomics-positive-selection

Detect adaptive evolution via dN/dS codon models

Identify positively selected sites in gene families using PAML and HyPhy tools.
Depends on PRANK alignments, BioPython, and statistical significance thresholds.
Compares non-synonymous to synonymous substitution rates across evolutionary branches.
Outputs specific codon positions under adaptive pressure with confidence intervals.

SKILL.md

.github/skills/bio-comparative-genomics-positive-selectionView on GitHub ↗

---
name: bio-comparative-genomics-positive-selection
description: Detect positive selection using dN/dS (omega) tests with PAML codeml and HyPhy. Identify sites and branches under adaptive evolution through codon models and branch-site tests. Use when testing for adaptive evolution in gene families or identifying positively selected sites.
tool_type: mixed
primary_tool: PAML
---

## Version Compatibility

Reference examples tested with: BioPython 1.83+, HyPhy 2.5+, PAML 4.10+, PRANK 170427+, scipy 1.12+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Positive Selection Analysis

**"Test my gene for positive selection"** → Detect adaptive evolution using dN/dS (omega) codon models including site models, branch models, and branch-site tests to identify positively selected sites.
- Python: PAML `codeml` for site and branch-site dN/dS tests
- CLI: `hyphy busted`, `hyphy meme` for HyPhy selection tests

## Critical Prerequisites

### Alignment Quality

PRANK is the recommended aligner for selection studies -- it correctly models insertions rather than forcing excess deletions (which create alignment artifacts that inflate dN/dS). Alternative: MACSE handles frameshifts and pseudogenes natively.

After alignment:
- Verify sequences are in-frame (length divisible by 3, no internal stop codons)
- Remove sequences with frameshifts or >50% gaps
- Use GUIDANCE or HoT to assess per-column alignment confidence
- Avoid block-filtering tools (Gblocks, trimAl) -- these frequently remove informative sites and can worsen downstream inference

### Recombination Screening

Run GARD (HyPhy) BEFORE any selection test. Recombinant sequences cannot be described by a single tree, and ignoring recombination inflates false positive rates for both PAML and HyPhy.

```bash
hyphy gard --alignment codon_alignment.fasta --output gard_results.json
```

If GARD detects breakpoints, partition the alignment at breakpoints and analyze segments independently, or exclude recombinant sequences.

## dN/dS Overview

```python
'''
dN/dS (omega, ω) interpretation:
- ω < 1: Purifying (negative) selection - deleterious mutations removed
- ω = 1: Neutral evolution - no selective pressure
- ω > 1: Positive (diversifying) selection - advantageous mutations favored

Most genes: ω << 1 (strong purifying selection)
Immune genes, reproduction: Often show ω > 1 at specific sites
'''
```

### When dN/dS > 1 Does NOT Indicate Positive Selection

| False Signal | Mechanism | Detection |
|---|---|---|
| GC-biased gene conversion (gBGC) | Fixes AT->GC mutations regardless of fitness; causes ~20% of apparent positive selection in primates | W->S substitution bias, correlation with recombination rate, subtelomeric enrichment |
| Synonymous site selection | Codon usage bias deflates dS, inflating omega | High codon bias (ENC < 35), highly expressed genes |
| Saturation | Multiple substitutions erase signal when dS >> 1 | dS > 3 unreliable; dS > 1 treat with caution |
| Alignment errors | Misaligned codons create artificial nonsynonymous changes | Inspect alignment at BEB-flagged sites; check GUIDANCE scores |
| Non-allelic gene conversion | Concerted evolution between paralogs creates substitution bursts | Check for copy number variation; exclude recent duplicates |
| Small sample size | Stochastic variation in short genes or closely related species | Need minimum ~8 sequences with sufficient tree depth |

### Recommended Analysis Pipeline

1. **Codon-aware alignment**: PRANK (preferred) or MACSE
2. **Recombination screening**: GARD -- partition if breakpoints found
3. **Gene-level screen**: BUSTED -- any positive selection anywhere in the gene?
4. **Branch-level**: aBSREL -- which lineages show selection? (a priori hypothesis preferred)
5. **Site-level**: MEME for episodic, FEL/FUBAR for pervasive selection
6. **Validate**: Multiple methods agreeing at same sites increases confidence
7. **Multiple testing correction**: FDR (Benjamini-Hochberg) when testing across many genes

## PAML Codeml Analysis

**Goal:** Test for positive selection across a gene using codon-based site models.

**Approach:** Prepare a codon alignment in PHYLIP format, create codeml control files for nested models (M7 vs M8 or M1a vs M2a), run codeml, extract log-likelihoods, perform a likelihood ratio test, and identify positively selected sites from the BEB analysis.

```python
'''Run PAML codeml for selection analysis'''

import subprocess
import os
from Bio import SeqIO
from Bio.Seq import Seq


def prepare_codon_alignment(cds_fasta, output_phy):
    '''Prepare codon alignment in PHYLIP format

    Requirements:
    - CDS sequences (in-frame, no stop codons except terminal)
    - Multiple sequence alignment already performed
    - Sequence length divisible by 3
    '''
    records = list(SeqIO.parse(cds_fasta, 'fasta'))

    # Validate codon alignment
    for rec in records:
        if len(rec.seq) % 3 != 0:
            print(f'Warning: {rec.id} length not divisible by 3')

    # Write PHYLIP format
    n_seq = len(records)
    seq_len = len(records[0].seq)

    with open(output_phy, 'w') as f:
        f.write(f' {n_seq} {seq_len}\n')
        for rec in records:
            # PHYLIP names: 10 characters, padded
            name = rec.id[:10].ljust(10)
            f.write(f'{name}{str(rec.seq)}\n')

    return output_phy


def create_codeml_control(alignment_file, tree_file, output_dir, model='M8'):
    '''Create codeml control file

    Site models for detecting positive selection:
    - M0: One ratio (single ω for all sites)
    - M1a: Nearly neutral (ω0 < 1, ω1 = 1)
    - M2a: Positive selection (ω0 < 1, ω1 = 1, ω2 > 1)
    - M7: Beta (ω from beta distribution, 0 < ω < 1)
    - M8: Beta + ω > 1 (allows positive selection)
    - M8a: Beta + ω = 1 (null for M8 comparison)

    Standard comparisons:
    - M7 vs M8: More power but can give false positives from relaxed constraint
    - M8 vs M8a: More stringent, recommended as primary test
    - M1a vs M2a: Conservative, fewer false positives
    '''
    model_params = {
        'M0': {'NSsites': 0, 'model': 0},
        'M1a': {'NSsites': 1, 'model': 0},
        'M2a': {'NSsites': 2, 'model': 0},
        'M7': {'NSsites': 7, 'model': 0},
        'M8': {'NSsites': 8, 'model': 0},
        'M8a': {'NSsites': 8, 'model': 0, 'fix_omega': 1, 'omega': 1},
    }

    params = model_params.get(model, model_params['M8'])

    ctl_content = f'''
      seqfile = {alignment_file}
     treefile = {tree_file}
      outfile = {output_dir}/mlc

        noisy = 9
      verbose = 1
      runmode = 0
      seqtype = 1
    CodonFreq = 2
      * CodonFreq: 2=F3x4 (default), 7=FMutSel (recommended, accounts for mutation-selection balance) *
        model = {params.get('model', 0)}
      NSsites = {params.get('NSsites', 8)}
        icode = 0
    fix_kappa = 0
        kappa = 2
    fix_omega = {params.get('fix_omega', 0)}
        omega = {params.get('omega', 1)}
    '''

    ctl_file = f'{output_dir}/codeml_{model}.ctl'
    with open(ctl_file, 'w') as f:
        f.write(ctl_content)

    return ctl_file


def run_codeml(ctl_file):
    '''Run PAML codeml'''
    cmd = f'codeml {ctl_file}'
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)

    if result.returncode != 0:
        print(f'Codeml error: {result.stderr}')

    return result


def parse_codeml_output(mlc_file):
    '''Parse codeml output for likelihood and parameters'''
    results = {'lnL': None, 'omega': None, 'kappa': None, 'sites': []}

    with open(mlc_file) as f:
        content = f.read()

    # Extract log-likelihood
    for line in content.split('\n'):
        if 'lnL' in line and 'np' in line:
            parts = line.split()
            for i, p in enumerate(parts):
                if p == 'lnL':
                    results['lnL'] = float(parts[i + 2])
                    break

        # Extract omega values
        if 'omega' in line.lower() and '=' in line:
            parts = line.split('=')
            if len(parts) >= 2:
                try:
                    results['omega'] = float(parts[-1].strip().split()[0])
                except ValueError:
                    pass

    # Extract positively selected sites (BEB analysis)
    if 'Bayes Empirical Bayes' in content:
        beb_section = content.split('Bayes Empirical Bayes')[1]
        for line in beb_section.split('\n'):
            parts = line.split()
            if len(parts) >= 5:
                try:
                    site = int(parts[0])
                    aa = parts[1]
                    prob = float(parts[2])
                    # Sites with P > 0.95 considered significant
                    # Sites with P > 0.99 highly significant
                    if prob > 0.95:
                        results['sites'].append({
                            'position': site,
                            'amino_acid': aa,
                            'probability': prob,
                            'significance': '**' if prob > 0.99 else '*'
                        })
                except (ValueError, IndexError):
                    continue

    return results


def likelihood_ratio_test(lnL_null, lnL_alt, df=2, branch_site_test=False):
    '''Perform likelihood ratio test

    For M8 vs M7: df = 2
    For M2a vs M1a: df = 2
    For M8 vs M8a: df = 1
    For branch-site test: df = 1 (use 50:50 chi2(0):chi2(1) mixture;
        critical value 2.71 at 5%, NOT the standard 3.84)

    Significance thresholds (chi-square):
    - df=1: 3.84 (p<0.05), 6.63 (p<0.01)
    - df=2: 5.99 (p<0.05), 9.21 (p<0.01)
    '''
    from scipy import stats

    lrt = 2 * (lnL_alt - lnL_null)

    if branch_site_test:
        # 50:50 mixture of chi2(0) and chi2(1) for branch-site null
        p_value = 0.5 * (1 - stats.chi2.cdf(lrt, 1)) if lrt > 0 else 0.5
    else:
        p_value = 1 - stats.chi2.cdf(lrt, df)

    return {
        'LRT_statistic': lrt,
        'degrees_freedom': df,
        'p_value': p_value,
        'significant': p_value < 0.05
    }
```

## Branch-Site Models

**Goal:** Test for positive selection on a specific lineage (foreground branch) while allowing variable rates across the tree.

**Approach:** Mark the foreground lineage with #1 in the tree file, create branch-site model A control files for both alternative and null hypotheses, run codeml, and perform an LRT with df=1.

```python
def create_branch_site_control(alignment, tree, foreground_branch, output_dir):
    '''Create control file for branch-site test

    Branch-site model A tests for positive selection on
    specific branches (foreground) while allowing variation
    across the tree.

    Foreground branch: Mark with #1 in tree file
    Example: ((A,B),(C #1,D));

    Comparison: Model A vs null (fix_omega = 1)
    '''
    ctl_content = f'''
      seqfile = {alignment}
     treefile = {tree}
      outfile = {output_dir}/branch_site.mlc

      runmode = 0
      seqtype = 1
    CodonFreq = 2
        model = 2
      NSsites = 2
    fix_kappa = 0
        kappa = 2
    fix_omega = 0
        omega = 1
    '''

    ctl_file = f'{output_dir}/branch_site.ctl'
    with open(ctl_file, 'w') as f:
        f.write(ctl_content)

    return ctl_file


def mark_foreground_branch(tree_file, foreground_taxa, output_file):
    '''Mark foreground lineage in Newick tree

    For testing selection on specific lineage:
    - Add #1 after the clade of interest
    - Can mark multiple branches with different tags
    '''
    with open(tree_file) as f:
        tree = f.read().strip()

    # Simple marking - for complex trees use ete3
    for taxon in foreground_taxa:
        tree = tree.replace(taxon, f'{taxon} #1')

    with open(output_file, 'w') as f:
        f.write(tree)

    return output_file
```

## HyPhy Methods

**Goal:** Detect positive selection using HyPhy's suite of methods for gene-wide and site-specific tests.

**Approach:** Follow the recommended pipeline: BUSTED for gene-wide screening, aBSREL for branch identification, MEME/FEL/FUBAR for site-level inference. Each method answers a different question.

### Method Selection Guide

| Method | Question Answered | Significance | Best For |
|---|---|---|---|
| BUSTED | Any positive selection anywhere in gene? | p <= 0.05 | Initial screening |
| aBSREL | Which branches show selection? | p <= 0.05 (a priori) | Lineage-specific hypotheses |
| MEME | Which sites show episodic selection? | p <= 0.1 | Selection varying across branches |
| FEL | Which sites show pervasive selection? | p <= 0.1 | Constant selection across tree |
| FUBAR | Which sites show pervasive selection? | posterior > 0.9 | Large datasets (faster than FEL) |
| RELAX | Is selection relaxed or intensified? | p <= 0.05 (k<1 relaxed, k>1 intensified) | Comparing selection regimes |

Note: MEME is the ONLY method detecting episodic selection at individual sites. FEL/FUBAR assume constant selection pressure across all branches. RELAX tests selection intensity, NOT positive selection.

```python
def run_hyphy_method(alignment_file, tree_file, method, output_json, foreground=None):
    '''Run HyPhy selection analysis

    Methods: busted, meme, fel, fubar, absrel, relax
    foreground: label foreground branches for BUSTED, aBSREL, RELAX
    '''
    cmd = f'hyphy {method} --alignment {alignment_file} --tree {tree_file} --output {output_json}'

    # For branch-aware methods, specify test branches
    if foreground and method in ('busted', 'absrel', 'relax'):
        cmd += f' --branches {foreground}'

    subprocess.run(cmd, shell=True)
    return output_json


def parse_hyphy_json(json_file):
    '''Parse HyPhy JSON output for p-values and selected sites'''
    import json

    with open(json_file) as f:
        results = json.load(f)

    test_results = results.get('test results', {})
    p_value = test_results.get('p-value')

    # Extract per-site results (MEME/FEL/FUBAR)
    site_results = []
    mle = results.get('MLE', {}).get('content', {})
    headers = results.get('MLE', {}).get('headers', [[]])
    if headers and isinstance(headers[0], list):
        col_names = [h[0] if isinstance(h, list) else h for h in headers[0]]
    else:
        col_names = []

    for key, values in mle.get('0', {}).items():
        if isinstance(values, list):
            site_data = dict(zip(col_names, values)) if col_names else {'values': values}
            site_data['site'] = int(key)
            site_results.append(site_data)

    return {'p_value': p_value, 'LRT': test_results.get('LRT'), 'sites': site_results}
```

## Multiple Testing Correction

When running selection tests across many genes:
- Use FDR (Benjamini-Hochberg) rather than Bonferroni (genes in syntenic blocks are non-independent)
- For PAML branch-site tests across gene families: Holm-Bonferroni or FDR
- For HyPhy aBSREL in exploratory mode (testing all branches): power drops dramatically after correction; prefer a priori hypothesis testing
- HyPhy site-level methods (FEL, MEME) already use conservative thresholds (p <= 0.1); additional correction is typically not applied within a single gene

## Related Skills

- comparative-genomics/synteny-analysis - Find syntenic genes for selection tests
- comparative-genomics/ortholog-inference - Identify orthologs for analysis
- comparative-genomics/ancestral-reconstruction - Reconstruct ancestral sequences at selected branches
- alignment/msa-parsing - Parse and manipulate codon alignments
- alignment/multiple-alignment - PRANK codon-aware alignment for selection studies
- phylogenetics/modern-tree-inference - Generate trees for codeml

More from GPTomics/bioSkills

Skill	Description
bio-admet-prediction	Predicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
bio-alignment-amplicon-clipping	Trim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
bio-alignment-filtering	Filter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
bio-alignment-indexing	Create and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
bio-alignment-io	Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
bio-alignment-msa-parsing	Parse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
bio-alignment-msa-statistics	Calculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
bio-alignment-multiple	Perform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
bio-alignment-pairwise	Perform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
bio-alignment-sorting	Sort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.