bio-phylo-distance-calculations
$
npx mdskill add GPTomics/bioSkills/bio-phylo-distance-calculationsBuilds phylogenetic trees using distance-based methods from sequence alignments
- Solves the task of generating exploratory phylogenies from aligned sequences
- Uses Biopython's Bio.Phylo.TreeConstruction for distance calculations and tree building
- Chooses Neighbor-Joining or UPGMA based on dataset size and user goals
- Returns phylogenetic trees with optional bootstrap support as output
SKILL.md
.github/skills/bio-phylo-distance-calculationsView on GitHub ↗
---
name: bio-phylo-distance-calculations
description: Compute evolutionary distances and build phylogenetic trees using Biopython Bio.Phylo.TreeConstruction. Use when creating distance matrices from alignments, building NJ/UPGMA trees, generating bootstrap consensus, or needing quick exploratory phylogenies before running full ML analysis.
tool_type: python
primary_tool: Bio.Phylo.TreeConstruction
---
## Version Compatibility
Reference examples tested with: BioPython 1.83+, NCBI BLAST+ 2.15+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Distance Calculations and Tree Building
**"Build a phylogenetic tree from my alignment"** → Compute evolutionary distance matrices from sequence alignments and construct neighbor-joining or UPGMA trees with bootstrap support.
- Python: `Bio.Phylo.TreeConstruction.DistanceCalculator()`, `DistanceTreeConstructor()`
Compute distances from alignments and construct phylogenetic trees.
## When to Use Distance Methods vs ML
| Scenario | Recommended Method |
|----------|-------------------|
| Quick exploratory tree before committing to a long ML run | NJ |
| Sanity check on data quality (unexpected groupings?) | NJ |
| Very large datasets where ML is prohibitive | NJ |
| Molecular clock data (ultrametric trees) | UPGMA (rare) |
| Publication-quality trees | **ML (IQ-TREE2/RAxML-NG)** or Bayesian |
| Formal hypothesis testing | **ML or Bayesian** |
NJ trees are fast (O(n^3)) and useful for exploration. For any analysis intended for publication, use ML methods (see modern-tree-inference skill). NJ starting trees are used internally by IQ-TREE (BIONJ) and RAxML-NG.
**UPGMA warning:** UPGMA assumes a molecular clock (equal rates across all lineages). This assumption is almost never met for molecular data. Use NJ instead unless clocklike behavior has been verified.
## Evolutionary Distance Corrections
Raw identity-based distances underestimate true evolutionary distance because they do not account for multiple substitutions at the same site. For divergent sequences, corrected distances are more appropriate:
| Model | Correction | Use When |
|-------|------------|----------|
| Identity | None (raw mismatch proportion) | Closely related sequences; quick exploration |
| Jukes-Cantor | Assumes equal substitution rates | Simple correction for moderate divergence |
| Kimura 2-parameter | Distinguishes transitions from transversions | Better for DNA when Ti/Tv ratio differs from 1 |
Biopython's `DistanceCalculator` models (`identity`, `blastn`, `trans`) provide basic corrections. For more sophisticated evolutionary distance estimation, use ML-based distances from IQ-TREE2 (`.mldist` output file).
## Required Import
```python
from Bio import Phylo, AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.TreeConstruction import DistanceMatrix
from Bio.Phylo.TreeConstruction import ParsimonyScorer, ParsimonyTreeConstructor, NNITreeSearcher
from Bio.Phylo.Consensus import strict_consensus, majority_consensus, bootstrap_trees, bootstrap_consensus
```
## Distance Matrix from Alignment
```python
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator
alignment = AlignIO.read('alignment.fasta', 'fasta')
# Create calculator with distance model
calculator = DistanceCalculator('identity') # Simple identity-based distance
dm = calculator.get_distance(alignment)
print(dm)
# Available models for DNA
calculator = DistanceCalculator('blastn') # BLASTN-style distance
# Available models for protein
calculator = DistanceCalculator('blosum62') # BLOSUM62-based distance
```
## Available Distance Models
| Model | Type | Description |
|-------|------|-------------|
| `identity` | DNA/Protein | 1 - (identical positions / total) |
| `blastn` | DNA | BLASTN scoring distance |
| `trans` | DNA | Transition/transversion weighted |
| `blosum62` | Protein | BLOSUM62 matrix distance |
| `blosum45` | Protein | BLOSUM45 matrix distance |
| `blosum80` | Protein | BLOSUM80 matrix distance |
| `pam250` | Protein | PAM250 matrix distance |
| `pam30` | Protein | PAM30 matrix distance |
## Building Trees with Distance Methods
### Neighbor Joining (NJ)
```python
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
alignment = AlignIO.read('alignment.fasta', 'fasta')
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)
constructor = DistanceTreeConstructor()
nj_tree = constructor.nj(dm)
Phylo.draw_ascii(nj_tree)
```
### UPGMA
```python
constructor = DistanceTreeConstructor()
upgma_tree = constructor.upgma(dm)
Phylo.draw_ascii(upgma_tree)
```
### One-Step Tree Building
```python
# Build tree directly from alignment
constructor = DistanceTreeConstructor(calculator, 'nj')
tree = constructor.build_tree(alignment)
# Or with UPGMA
constructor = DistanceTreeConstructor(calculator, 'upgma')
tree = constructor.build_tree(alignment)
```
## Pairwise Distances Between Taxa
```python
from Bio import Phylo
tree = Phylo.read('tree.nwk', 'newick')
# Distance between two taxa (sum of branch lengths)
taxon1 = tree.find_any(name='Human')
taxon2 = tree.find_any(name='Mouse')
dist = tree.distance(taxon1, taxon2)
print(f'Distance Human-Mouse: {dist:.4f}')
# All pairwise distances
terminals = tree.get_terminals()
for i, t1 in enumerate(terminals):
for t2 in terminals[i+1:]:
d = tree.distance(t1, t2)
print(f'{t1.name}-{t2.name}: {d:.4f}')
```
## Creating Distance Matrix Manually
```python
from Bio.Phylo.TreeConstruction import DistanceMatrix
names = ['A', 'B', 'C', 'D']
# Lower triangular matrix (including diagonal)
matrix = [
[0],
[0.1, 0],
[0.2, 0.15, 0],
[0.3, 0.25, 0.2, 0]
]
dm = DistanceMatrix(names, matrix)
print(dm)
# Build tree from custom matrix
constructor = DistanceTreeConstructor()
tree = constructor.nj(dm)
```
## Parsimony Tree Construction
Parsimony is largely superseded by ML for most molecular phylogenetics. It remains appropriate for morphological cladistics, rare genomic changes (retroelement insertions, gene order), and as a starting point for ML searches. Parsimony is statistically inconsistent in the Felsenstein zone (long branch attraction).
```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import ParsimonyScorer, NNITreeSearcher, ParsimonyTreeConstructor
alignment = AlignIO.read('alignment.fasta', 'fasta')
scorer = ParsimonyScorer()
searcher = NNITreeSearcher(scorer)
# Parsimony needs a starting tree (NJ is standard)
constructor = DistanceTreeConstructor(DistanceCalculator('identity'), 'nj')
starting_tree = constructor.build_tree(alignment)
pars_constructor = ParsimonyTreeConstructor(searcher, starting_tree)
pars_tree = pars_constructor.build_tree(alignment)
print(f'Parsimony score: {scorer.get_score(pars_tree, alignment)}')
Phylo.draw_ascii(pars_tree)
```
## Bootstrap Analysis
```python
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.Consensus import bootstrap_trees, bootstrap_consensus, majority_consensus
alignment = AlignIO.read('alignment.fasta', 'fasta')
calculator = DistanceCalculator('identity')
constructor = DistanceTreeConstructor(calculator, 'nj')
# Generate bootstrap trees
boot_trees = list(bootstrap_trees(alignment, 100, constructor))
print(f'Generated {len(boot_trees)} bootstrap trees')
# Get bootstrap consensus
consensus = bootstrap_consensus(alignment, 100, constructor, majority_consensus)
Phylo.draw_ascii(consensus)
```
## Consensus Tree Methods
```python
from Bio.Phylo.Consensus import strict_consensus, majority_consensus, adam_consensus
trees = list(Phylo.parse('bootstrap.nwk', 'newick'))
# Strict consensus (only clades in ALL trees)
strict = strict_consensus(trees)
# Majority rule consensus (clades in >50% of trees)
majority = majority_consensus(trees, cutoff=0.5)
# Adam consensus
adam = adam_consensus(trees)
Phylo.draw_ascii(majority)
```
## Tree Depths and Total Length
```python
tree = Phylo.read('tree.nwk', 'newick')
# Total branch length
total = tree.total_branch_length()
print(f'Total branch length: {total:.4f}')
# Depths from root to each node
depths = tree.depths()
for clade, depth in depths.items():
if clade.is_terminal():
print(f'{clade.name}: {depth:.4f}')
# Maximum depth (tree height)
tree_height = max(depths.values())
print(f'Tree height: {tree_height:.4f}')
```
## Comparing Tree Distances
```python
tree1 = Phylo.read('tree1.nwk', 'newick')
tree2 = Phylo.read('tree2.nwk', 'newick')
# Compare total branch lengths
len1 = tree1.total_branch_length()
len2 = tree2.total_branch_length()
print(f'Tree 1 total: {len1:.4f}')
print(f'Tree 2 total: {len2:.4f}')
# Compare specific pairwise distances
taxa = ['Human', 'Mouse']
t1 = [tree1.find_any(name=t) for t in taxa]
t2 = [tree2.find_any(name=t) for t in taxa]
d1 = tree1.distance(t1[0], t1[1])
d2 = tree2.distance(t2[0], t2[1])
print(f'Human-Mouse distance: Tree1={d1:.4f}, Tree2={d2:.4f}')
```
## Complete Pipeline: Alignment to Bootstrapped Tree
**Goal:** Build a phylogenetic tree from a sequence alignment with bootstrap support assessment for branch confidence.
**Approach:** Read the alignment, compute an identity-based distance matrix, construct a neighbor-joining tree, then generate a majority-rule bootstrap consensus from 100 replicates.
```python
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Phylo.Consensus import bootstrap_consensus, majority_consensus
alignment = AlignIO.read('sequences.aln', 'clustal')
print(f'Alignment: {len(alignment)} sequences, {alignment.get_alignment_length()} positions')
calculator = DistanceCalculator('identity')
constructor = DistanceTreeConstructor(calculator, 'nj')
# Build simple tree
simple_tree = constructor.build_tree(alignment)
simple_tree.ladderize()
# Build bootstrap consensus (100 replicates)
consensus_tree = bootstrap_consensus(alignment, 100, constructor, majority_consensus)
consensus_tree.ladderize()
Phylo.write(simple_tree, 'nj_tree.nwk', 'newick')
Phylo.write(consensus_tree, 'bootstrap_consensus.nwk', 'newick')
```
## Quick Reference: Distance Models
### DNA Models
| Model | Description |
|-------|-------------|
| `identity` | Simple mismatch counting |
| `blastn` | BLASTN-style scoring |
| `trans` | Weights transitions vs transversions |
### Protein Models
| Model | Description |
|-------|-------------|
| `blosum62` | General proteins |
| `blosum45` | Divergent proteins |
| `blosum80` | Similar proteins |
| `pam250` | Distant homologs |
| `pam30` | Close homologs |
## Related Skills
- tree-io - Save constructed trees to files
- tree-visualization - Draw resulting trees
- tree-manipulation - Root and process built trees
- modern-tree-inference - ML tree inference for publication-quality results
- alignment/alignment-io - Read alignments for tree building
- alignment/msa-statistics - Alignment quality before tree building