bio-comparative-genomics-ancestral-reconstruction

Name: bio-comparative-genomics-ancestral-reconstruction
Author: GPTomics/bioSkills

$npx mdskill add GPTomics/bioSkills/bio-comparative-genomics-ancestral-reconstruction

Reconstruct ancient protein sequences using PAML and IQ-TREE.

Enables protein resurrection and evolutionary trajectory analysis.
Depends on PAML codeml and IQ-TREE marginal likelihood methods.
Validates Python package versions before executing reconstruction.
Outputs inferred ancestral sequences at phylogenetic nodes.

SKILL.md

.github/skills/bio-comparative-genomics-ancestral-reconstructionView on GitHub ↗

---
name: bio-comparative-genomics-ancestral-reconstruction
description: Reconstruct ancestral sequences at phylogenetic nodes using PAML and IQ-TREE marginal likelihood methods. Infer ancient protein sequences and trace evolutionary trajectories through sequence history. Use when inferring ancestral states for protein resurrection or tracing evolutionary history.
tool_type: mixed
primary_tool: PAML
---

## Version Compatibility

Reference examples tested with: BioPython 1.83+, IQ-TREE 2.2+, PAML 4.10+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Ancestral Sequence Reconstruction

**"Infer ancestral protein sequences at phylogenetic nodes"** → Reconstruct ancient sequences using marginal or joint likelihood methods on a phylogeny for ancestral protein resurrection or evolutionary trajectory analysis.
- Python: PAML `codeml` with `RateAncestor = 1` for ancestral reconstruction
- CLI: `iqtree2 -s aln -m model --ancestral` for marginal reconstruction

## When ASR is Reliable vs Unreliable

ASR accuracy depends on several factors -- understanding these limits is critical before interpreting results:

| Factor | Good for ASR | Poor for ASR |
|---|---|---|
| Tree depth | Shallow-moderate divergence | Very deep divergence (pre-Cambrian) |
| Branch lengths | Short-moderate, evenly distributed | Long branches (low confidence at those nodes) |
| Taxon sampling | Dense sampling near node of interest | Sparse sampling with long gaps |
| Alignment quality | High-confidence alignment columns | Ambiguously aligned regions |
| Model fit | Model matches data well (use ModelFinder) | Severe model misspecification |
| Sequence type | Proteins (20 states = more signal) | Nucleotides at 3rd codon positions (saturated) |

**Key limitation**: ASR reconstructs the most likely ancestral sequence given the model, but the reconstructed protein may not be functional due to epistatic interactions not captured by site-independent models. Experimentally resurrected proteins should always be tested for activity.

## Joint vs Marginal Reconstruction

- **Marginal**: Integrates over uncertainty at all other nodes; provides posterior probabilities per site. Better for identifying ambiguous positions and designing alternative constructs. Default in IQ-TREE.
- **Joint**: Finds the single most likely set of ancestral states across all nodes simultaneously. No per-site probabilities. Faster but less informative.
- Use marginal reconstruction for protein resurrection studies (need per-site confidence).

## Alignment Recommendations

- Use PRANK for coding sequences -- correctly models insertions (other aligners overestimate deletions)
- Gaps in alignments represent indels; indel reconstruction is a separate problem from substitution reconstruction
- PAML treats gaps as missing data or removes gapped columns (`cleandata=0` keeps them as ambiguity; `cleandata=1` removes all columns with any gap)
- IQ-TREE handles gaps better than PAML for large datasets with extensive indel variation

## PAML Ancestral Reconstruction

**Goal:** Reconstruct ancestral sequences at internal phylogenetic nodes using maximum likelihood.

**Approach:** Create a codeml/baseml control file with RateAncestor=1, run PAML, parse the RST file for ancestral sequences and site-wise posterior probabilities.

```python
'''Ancestral sequence reconstruction with PAML codeml/baseml'''

import subprocess
import re
from Bio import SeqIO
from Bio.Seq import Seq


def create_asr_control(alignment, tree, output_dir, seq_type='protein'):
    '''Create control file for ancestral reconstruction

    RateAncestor = 1: Enable ancestral reconstruction
    Generates RST file with ancestral sequences

    IMPORTANT: Tree MUST be rooted -- ancestral states at internal nodes
    depend on the root position. Use outgroup rooting or midpoint rooting.

    For codons: Use codeml with seqtype = 1
    For amino acids: Use codeml with seqtype = 2
    For nucleotides: Use baseml

    Model selection:
    - Proteins: LG or WAG with +G (rate variation); use ModelFinder for data-driven choice
    - Codons: M0 (single omega) for ASR; site models are for selection testing, not ASR
    - Use the same model for tree inference and ASR for consistency
    '''
    if seq_type == 'protein':
        ctl = f'''
      seqfile = {alignment}
     treefile = {tree}
      outfile = {output_dir}/asr.mlc

      seqtype = 2
        model = 3
    aaRatefile = wag.dat

 RateAncestor = 1
    cleandata = 0
        '''
    else:  # codon
        ctl = f'''
      seqfile = {alignment}
     treefile = {tree}
      outfile = {output_dir}/asr.mlc

      seqtype = 1
    CodonFreq = 2
        model = 0
      NSsites = 0

 RateAncestor = 1
    cleandata = 0
        '''

    ctl_file = f'{output_dir}/asr.ctl'
    with open(ctl_file, 'w') as f:
        f.write(ctl)

    return ctl_file


def parse_rst_file(rst_file):
    '''Parse PAML RST file for ancestral sequences

    RST contains:
    - Tree with node numbers
    - Ancestral sequences at each node
    - Posterior probabilities for each site

    Node numbering: Extant sequences first, then internal nodes
    '''
    ancestors = {}
    current_node = None
    current_seq = []

    with open(rst_file) as f:
        content = f.read()

    # Find ancestral sequence section
    if 'Ancestral reconstruction by' in content:
        sections = content.split('Ancestral reconstruction by')
        for section in sections[1:]:
            lines = section.strip().split('\n')
            for line in lines:
                if line.startswith('node #'):
                    if current_node and current_seq:
                        ancestors[current_node] = ''.join(current_seq)
                    match = re.search(r'node #(\d+)', line)
                    if match:
                        current_node = f'Node_{match.group(1)}'
                        current_seq = []
                elif current_node and line.strip() and not line.startswith(' '):
                    # Sequence line
                    seq_part = ''.join(line.split()[1:]) if len(line.split()) > 1 else ''
                    current_seq.append(seq_part)

    if current_node and current_seq:
        ancestors[current_node] = ''.join(current_seq)

    return ancestors


def extract_marginal_probabilities(rst_file):
    '''Extract site-wise posterior probabilities

    High confidence: P > 0.95 (commonly used threshold)
    Moderate confidence: P > 0.80
    Low confidence: P < 0.80 (consider alternatives)

    Report ambiguous sites for experimental validation
    '''
    site_probs = []

    with open(rst_file) as f:
        in_probs = False
        for line in f:
            if 'Prob of best state' in line:
                in_probs = True
                continue
            if in_probs and line.strip():
                parts = line.split()
                if len(parts) >= 3:
                    try:
                        site = int(parts[0])
                        state = parts[1]
                        prob = float(parts[2])
                        site_probs.append({
                            'site': site,
                            'state': state,
                            'probability': prob,
                            'confidence': 'high' if prob > 0.95 else 'moderate' if prob > 0.8 else 'low'
                        })
                    except ValueError:
                        in_probs = False

    return site_probs
```

## IQ-TREE Ancestral Reconstruction

**Goal:** Perform ancestral reconstruction using IQ-TREE's marginal likelihood method.

**Approach:** Run iqtree2 with --ancestral flag to produce a .state file containing per-node, per-site state probabilities, then parse into ancestral sequences.

```python
def run_iqtree_asr(alignment, tree=None, model='LG+G4', output_prefix='asr'):
    '''Run IQ-TREE for ancestral sequence reconstruction

    IQ-TREE provides:
    - Marginal reconstruction (default)
    - Joint reconstruction (-asr-joint)
    - State file (.state) with probabilities

    Advantages over PAML:
    - Automatic model selection
    - Better handling of gaps
    - Faster for large datasets
    '''
    cmd = f'iqtree2 -s {alignment} -m {model} --ancestral -pre {output_prefix}'

    if tree:
        cmd += f' -te {tree}'

    subprocess.run(cmd, shell=True)

    return f'{output_prefix}.state'


def parse_iqtree_state(state_file):
    '''Parse IQ-TREE .state file

    Format: Node  Site  State  Probability  [other states and probs]
    '''
    ancestors = {}

    with open(state_file) as f:
        next(f)  # Skip header
        for line in f:
            parts = line.strip().split('\t')
            if len(parts) >= 4:
                node = parts[0]
                site = int(parts[1])
                state = parts[2]
                prob = float(parts[3])

                if node not in ancestors:
                    ancestors[node] = {'sequence': [], 'probabilities': []}
                ancestors[node]['sequence'].append(state)
                ancestors[node]['probabilities'].append(prob)

    # Convert to sequences
    for node in ancestors:
        ancestors[node]['sequence'] = ''.join(ancestors[node]['sequence'])

    return ancestors
```

## Alternative State Analysis

```python
def get_alternative_states(site_probs, threshold=0.1):
    '''Identify sites with plausible alternative ancestral states

    Alternative states with P > 0.1 should be considered
    for experimental validation (ancestral protein resurrection)

    These sites may:
    - Affect function differently
    - Represent true ancestral ambiguity
    - Be targets for directed evolution
    '''
    ambiguous_sites = []

    for site_data in site_probs:
        if 'alternatives' in site_data:
            significant_alts = [
                alt for alt in site_data['alternatives']
                if alt['probability'] > threshold
            ]
            if significant_alts:
                ambiguous_sites.append({
                    'site': site_data['site'],
                    'best_state': site_data['state'],
                    'best_prob': site_data['probability'],
                    'alternatives': significant_alts
                })

    return ambiguous_sites


def calculate_sequence_confidence(site_probs):
    '''Calculate overall confidence in ancestral sequence

    Metrics:
    - Mean posterior probability
    - Fraction of high-confidence sites (P > 0.95)
    - Number of ambiguous positions
    '''
    if not site_probs:
        return None

    probs = [s['probability'] for s in site_probs]
    high_conf = sum(1 for p in probs if p > 0.95) / len(probs)
    low_conf = sum(1 for p in probs if p < 0.8)

    return {
        'mean_probability': sum(probs) / len(probs),
        'high_confidence_fraction': high_conf,
        'low_confidence_sites': low_conf,
        'total_sites': len(probs),
        'overall_quality': 'high' if high_conf > 0.9 else 'moderate' if high_conf > 0.7 else 'low'
    }
```

## Ancestral Protein Resurrection

**Goal:** Design protein constructs for experimental resurrection of ancestral proteins.

**Approach:** Use the maximum-likelihood ancestral sequence as the primary construct, then create alternative constructs at ambiguous sites (P < 0.95) for experimental validation.

```python
def design_asr_construct(ancestral_seq, extant_reference, ambiguous_sites):
    '''Design constructs for ancestral protein resurrection

    Strategy:
    1. Use most probable (ML) state at each position as primary construct
    2. Create alternative constructs at ambiguous sites (P < 0.95)
    3. Consider codon optimization for expression host

    Epistasis warning: Site-independent models assume positions evolve
    independently, but intramolecular epistasis means certain residue
    combinations may be incompatible. The ML ancestral sequence may contain
    never-tested combinations of residues, potentially yielding non-functional
    proteins. Test multiple alternative constructs, not just the ML sequence.

    Validation:
    - Test activity of primary + alternative resurrected proteins
    - Compare activity to extant homologs as positive controls
    - Titrate ambiguous positions systematically (especially buried residues)
    '''
    constructs = [{'name': 'ASR_ML', 'sequence': ancestral_seq, 'description': 'Maximum likelihood ancestral'}]

    # Create alternative constructs for ambiguous sites
    for site in ambiguous_sites[:5]:  # Limit to top 5 ambiguous
        alt_seq = list(ancestral_seq)
        best_alt = site['alternatives'][0]
        alt_seq[site['site'] - 1] = best_alt['state']
        constructs.append({
            'name': f"ASR_alt_{site['site']}",
            'sequence': ''.join(alt_seq),
            'description': f"Alternative at position {site['site']}: {best_alt['state']}"
        })

    return constructs
```

## Related Skills

- comparative-genomics/positive-selection - Selection analysis on ancestral branches
- comparative-genomics/ortholog-inference - Identify orthologs for reconstruction
- phylogenetics/modern-tree-inference - Generate rooted trees for ASR
- alignment/multiple-alignment - PRANK alignment for ASR (correctly handles indels)
- alignment/pairwise-alignment - Prepare MSA for reconstruction

More from GPTomics/bioSkills

Skill	Description
bio-admet-prediction	Predicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
bio-alignment-amplicon-clipping	Trim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
bio-alignment-filtering	Filter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
bio-alignment-indexing	Create and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
bio-alignment-io	Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
bio-alignment-msa-parsing	Parse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
bio-alignment-msa-statistics	Calculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
bio-alignment-multiple	Perform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
bio-alignment-pairwise	Perform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
bio-alignment-sorting	Sort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.