bio-comparative-genomics-ortholog-inference
$
npx mdskill add GPTomics/bioSkills/bio-comparative-genomics-ortholog-inferenceInfer orthologs across species for evolutionary analysis
- Identify gene groups and transfer functional annotations between organisms.
- Depends on OrthoFinder CLI, ProteinOrtho API, NCBI BLAST+, and BioPython libraries.
- Selects tree-based reconciliation or graph methods based on species count and accuracy needs.
- Outputs orthogroups with labeled orthologs, paralogs via standard text files.
SKILL.md
.github/skills/bio-comparative-genomics-ortholog-inferenceView on GitHub ↗
---
name: bio-comparative-genomics-ortholog-inference
description: Infer orthologous gene groups across species using OrthoFinder and ProteinOrtho. Identify orthologs, paralogs, and co-orthologs for comparative genomics and functional annotation transfer. Use when identifying gene orthologs across species or building orthogroups for evolutionary analysis.
tool_type: cli
primary_tool: OrthoFinder
---
## Version Compatibility
Reference examples tested with: BioPython 1.83+, BUSCO 5.5+, NCBI BLAST+ 2.15+, OrthoFinder 2.5+, pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Ortholog Inference
**"Find orthologs across my species"** → Identify orthologous gene groups, paralogs, and co-orthologs across multiple species using sequence similarity clustering and gene tree reconciliation.
- CLI: `orthofinder -f proteomes/` for all-vs-all orthogroup inference
## Method Selection
| Method | Approach | Best For | Tradeoff |
|---|---|---|---|
| OrthoFinder | Tree-based (gene tree reconciliation with species tree) | Accuracy, evolutionary analysis, gene duplication events | Slower, needs sufficient species |
| ProteinOrtho | Graph-based (reciprocal best hits + connectivity) | Speed, many genomes, quick surveys | Less accurate for complex gene families |
| OMA/FastOMA | Graph-based (strict pairwise, hierarchical groups) | Precision-critical applications, large-scale (1000+ genomes) | Lowest recall (misses distant orthologs) |
| SonicParanoid2 | Graph-based (ML predictor + protein language model) | Fast + accurate graph-based | Newer, less community testing |
**Tree-based methods** (OrthoFinder) build gene trees and reconcile with the species tree to distinguish speciation (orthology) from duplication (paralogy). More accurate but computationally expensive.
**Graph-based methods** (ProteinOrtho, OMA, SonicParanoid) use sequence similarity with clustering. Faster but can confuse paralogs with orthologs when evolutionary rates vary.
Default recommendation: OrthoFinder for most analyses. ProteinOrtho for quick surveys or 50+ genomes. OMA/FastOMA when precision is paramount.
## Input Quality
Annotation quality directly affects orthology inference. Heterogeneous annotations across species spuriously inflate lineage-specific gene counts, creating false gene family expansions/contractions in downstream CAFE analysis.
- Use consistent annotation pipelines across species when possible
- Verify proteome completeness with BUSCO/Compleasm before running orthology
- Remove isoforms (keep longest per gene) to avoid inflating copy numbers
- Incomplete gene models produce truncated proteins that split true orthogroups
## Orthology Subtypes
- **One-to-one orthologs**: single gene in each species, ideal for phylogenomics
- **One-to-many / many-to-many**: lineage-specific duplications after speciation
- **In-paralogs**: paralogs from duplication AFTER the speciation event of reference
- **Out-paralogs**: paralogs from duplication BEFORE the speciation event
- **Co-orthologs**: in-paralogous genes collectively orthologous to a gene in the outgroup
## OrthoFinder Workflow
**Goal:** Infer orthologous gene groups across multiple species from their proteomes.
**Approach:** Run OrthoFinder on a directory of per-species FASTA files to perform all-vs-all DIAMOND search, gene/species tree inference, and ortholog/paralog classification, then parse the resulting orthogroups and classify by copy number pattern.
```python
'''Ortholog inference with OrthoFinder'''
import subprocess
import pandas as pd
import os
def run_orthofinder(proteome_dir, output_dir=None, threads=4):
'''Run OrthoFinder on directory of proteomes
Input: Directory with one FASTA file per species
File naming: Species name derived from filename
OrthoFinder pipeline:
1. All-vs-all DIAMOND/BLAST search
2. Gene tree inference per orthogroup
3. Species tree inference (STAG/STRIDE)
4. Gene tree rooting and reconciliation
5. Ortholog/paralog classification via DLC model
Key options:
-M msa: Use MSA-based gene trees (more accurate, slower; recommended for <20 species)
-M dendroblast: Distance-based trees (default, faster; sufficient for >20 species)
-S diamond: Fast search (default)
-S blast: More sensitive (use for divergent species or small proteomes)
'''
cmd = f'orthofinder -f {proteome_dir} -t {threads}'
if output_dir:
cmd += f' -o {output_dir}'
# Add -M msa for MSA-based gene trees (more accurate for evolutionary analysis)
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
# Output location
if output_dir:
results_dir = output_dir
else:
# OrthoFinder creates Results_MonDD in proteome_dir
results_dir = None
for d in os.listdir(proteome_dir):
if d.startswith('OrthoFinder/Results_'):
results_dir = os.path.join(proteome_dir, d)
break
return results_dir
def parse_orthogroups(orthogroups_file):
'''Parse OrthoFinder Orthogroups.tsv
Columns: Orthogroup, Species1, Species2, ...
Values: Gene IDs (comma-separated if multiple)
Orthogroup types:
- Single-copy: One gene per species (ideal for phylogenomics)
- Multi-copy: Duplications in some lineages
- Species-specific: Genes unique to one species
'''
df = pd.read_csv(orthogroups_file, sep='\t')
df = df.set_index('Orthogroup')
orthogroups = {}
for og_id, row in df.iterrows():
genes = {}
for species in df.columns:
cell = row[species]
if pd.notna(cell) and cell:
genes[species] = cell.split(', ')
else:
genes[species] = []
orthogroups[og_id] = genes
return orthogroups
def classify_orthogroups(orthogroups, species_list):
'''Classify orthogroups by copy number pattern
Categories:
- single_copy: Exactly one gene per species (best for phylogenomics)
- universal: Present in all species (possibly multicopy)
- partial: Missing from some species
- species_specific: Only in one species
'''
classification = {
'single_copy': [],
'universal': [],
'partial': [],
'species_specific': []
}
for og_id, genes in orthogroups.items():
present_in = [sp for sp in species_list if genes.get(sp)]
copy_counts = [len(genes.get(sp, [])) for sp in species_list]
if len(present_in) == 1:
classification['species_specific'].append(og_id)
elif len(present_in) == len(species_list):
if all(c == 1 for c in copy_counts):
classification['single_copy'].append(og_id)
else:
classification['universal'].append(og_id)
else:
classification['partial'].append(og_id)
return classification
def get_single_copy_orthologs(orthogroups_file):
'''Extract single-copy orthologs for phylogenomics
Single-copy orthologs are ideal because:
- Clear 1:1 relationships
- No paralogy complications
- Suitable for concatenated alignments
'''
df = pd.read_csv(orthogroups_file, sep='\t')
df = df.set_index('Orthogroup')
single_copy = []
for og_id, row in df.iterrows():
is_single = True
for species in df.columns:
cell = row[species]
if pd.isna(cell) or cell == '':
is_single = False
break
if ',' in str(cell):
is_single = False
break
if is_single:
single_copy.append(og_id)
return df.loc[single_copy]
```
## Gene Trees and Reconciliation
```python
def parse_gene_trees(gene_trees_dir):
'''Load gene trees from OrthoFinder
Gene trees show evolutionary history within orthogroups
Duplication/loss events inferred by species tree reconciliation
'''
from Bio import Phylo
import glob
trees = {}
for tree_file in glob.glob(f'{gene_trees_dir}/*.txt'):
og_id = os.path.basename(tree_file).replace('_tree.txt', '')
trees[og_id] = Phylo.read(tree_file, 'newick')
return trees
def identify_paralogs(orthogroup, species):
'''Identify in-paralogs within an orthogroup
In-paralogs: Duplications AFTER speciation (within one lineage)
Out-paralogs: Duplications BEFORE speciation (separate orthogroups)
Multiple genes from same species in an orthogroup = in-paralogs
Distinguishing in- vs out-paralogs requires the species tree context
and depends on which speciation event is being considered.
OrthoFinder resolves this via gene tree reconciliation.
'''
genes = orthogroup.get(species, [])
if len(genes) > 1:
return {
'species': species,
'paralogs': genes,
'count': len(genes)
}
return None
def find_co_orthologs(orthogroups, gene_id, species):
'''Find co-orthologs of a gene
Co-orthologs: Multiple genes in one species that are
all orthologous to a single gene in another species
Result of gene duplication after speciation
'''
for og_id, genes in orthogroups.items():
if gene_id in genes.get(species, []):
co_orthologs = {}
for sp, sp_genes in genes.items():
if sp != species and sp_genes:
co_orthologs[sp] = sp_genes
return {'orthogroup': og_id, 'co_orthologs': co_orthologs}
return None
```
## ProteinOrtho Alternative
**Goal:** Detect orthologs using ProteinOrtho as a faster alternative for many-genome comparisons.
**Approach:** Run ProteinOrtho with DIAMOND backend on multiple proteome FASTA files and parse the output table for orthologous groups with connectivity scores.
```python
def run_proteinortho(proteome_files, output_prefix, threads=4):
'''Run ProteinOrtho for ortholog detection
Faster than OrthoFinder for many genomes
Uses synteny information if available
-p=blastp+: Use DIAMOND (faster)
-conn: Connectivity threshold (default 0.1)
'''
files_str = ' '.join(proteome_files)
cmd = f'proteinortho -cpus={threads} -project={output_prefix} {files_str}'
subprocess.run(cmd, shell=True)
return f'{output_prefix}.proteinortho.tsv'
def parse_proteinortho(ortho_file):
'''Parse ProteinOrtho output
Columns: # Species, Genes, Alg.-Conn., Species1, Species2, ...
'''
df = pd.read_csv(ortho_file, sep='\t')
orthogroups = {}
for i, row in df.iterrows():
og_id = f'OG{i:06d}'
n_species = row['# Species']
conn = row['Alg.-Conn.']
genes = {}
for col in df.columns[3:]:
val = row[col]
if pd.notna(val) and val != '*':
genes[col] = val.split(',')
else:
genes[col] = []
orthogroups[og_id] = {
'genes': genes,
'n_species': n_species,
'connectivity': conn
}
return orthogroups
```
## Functional Annotation Transfer
```python
def transfer_annotation(query_gene, orthologs, annotation_db):
'''Transfer functional annotation via orthology
Confidence hierarchy:
- One-to-one orthologs: Highest confidence; direct functional equivalence
- Co-orthologs: Transfer to all, but note potential sub/neofunctionalization
- In-paralogs (recent duplicates): Transfer with caution; function may have diverged
- Distant orthologs (dS > 2): Lowest confidence; verify with domain conservation
GO evidence codes:
- ISO: Inferred from Sequence Orthology (recommended for 1:1 orthologs)
- IBA: Inferred from Biological Aspect of Ancestor (phylogenetic propagation)
- IEA: Inferred from Electronic Annotation (automated, lower confidence)
Synteny context (see synteny-analysis) increases transfer confidence
for genes in conserved genomic neighborhoods.
'''
annotations = []
for species, genes in orthologs.items():
for gene in genes:
if gene in annotation_db:
ann = annotation_db[gene]
annotations.append({
'source_gene': gene,
'source_species': species,
'annotation': ann,
'evidence': 'ISO' # Sequence orthology
})
return annotations
```
## Completeness Assessment
Before orthology analysis, verify proteome completeness with BUSCO or Compleasm:
```bash
# BUSCO: standard benchmark against OrthoDB single-copy orthologs
busco -i proteome.fasta -m proteins -l <lineage> -o busco_out
# Compleasm: 14x faster alternative using miniprot
compleasm run -a genome.fasta -l <lineage> -o compleasm_out
```
BUSCO categories: Complete (single-copy + duplicated), Fragmented, Missing. Expect >90% complete for well-assembled genomes. High duplication rates may indicate assembly collapse or recent WGD. Choose the most specific available lineage for the clade being compared.
## Related Skills
- comparative-genomics/synteny-analysis - Synteny-based ortholog verification and context
- comparative-genomics/positive-selection - Selection analysis on ortholog alignments
- phylogenetics/modern-tree-inference - Build species trees from single-copy orthologs
- alignment/pairwise-alignment - Align orthogroup sequences
- genome-annotation/annotation-transfer - Transfer annotations via orthology