bio-genome-annotation-annotation-transfer
$ npx mdskill add GPTomics/bioSkills/bio-genome-annotation-annotation-transfer
Transfer gene annotations between genome assemblies using Liftoff and MiniProt
- Solve the problem of annotating new genome assemblies using existing reference annotations
- Uses Liftoff for same-species liftover and MiniProt for cross-species alignment
- Decides between tools based on whether the target is same-species or related species
- Delivers aligned annotations in GFF format for downstream analysis
SKILL.md
.github/skills/bio-genome-annotation-annotation-transfer
---
name: bio-genome-annotation-annotation-transfer
description: Transfer gene annotations between genome assemblies using Liftoff for same-species annotation liftover and MiniProt for cross-species protein-to-genome alignment. Enables rapid annotation of new assemblies using existing reference annotations. Use when annotating a new assembly of a species with an existing reference annotation or mapping annotations across related species.
tool_type: cli
primary_tool: Liftoff
---
## Version Compatibility
Reference examples tested with: BioPython 1.83+, pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Annotation Transfer
**"Transfer annotations from a reference to my new assembly"** → Map gene models from a well-annotated reference genome onto a new assembly using coordinate liftover or protein-to-genome alignment.
- CLI: `liftoff -g reference.gff -o target.gff target.fa ref.fa` (same species; target assembly first, then reference), `miniprot --gff target.mpi proteins.faa` (cross-species; target genome index first, then query proteins)
Transfer gene annotations from a reference genome to a new assembly (same species with Liftoff) or across species (with MiniProt protein-to-genome alignment). Faster and more consistent than de novo prediction when a high-quality reference annotation exists.
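The choice between the two tools can be sketched as a small helper that builds the command line. This is a minimal sketch, not part of either tool: `build_transfer_command`, its parameters, and the output file names are hypothetical. It assumes Liftoff takes the target assembly before the reference genome, and that the MiniProt index was already built with `-d`.

```python
def build_transfer_command(same_species, target, source, reference_gff=None, threads=16):
    """Return an argv list for Liftoff (same species) or miniprot (cross-species).

    target: target assembly FASTA (Liftoff) or target genome index/FASTA (miniprot)
    source: reference genome FASTA (Liftoff) or protein FASTA (miniprot)
    """
    if same_species:
        if reference_gff is None:
            raise ValueError("Liftoff needs the reference annotation (-g)")
        # Positional order: target assembly first, then reference genome
        return [
            "liftoff",
            "-g", reference_gff,
            "-o", "lifted_annotation.gff3",   # hypothetical output name
            "-u", "unmapped_features.txt",    # hypothetical output name
            "-p", str(threads),
            target,
            source,
        ]
    # miniprot writes its GFF to stdout, so the caller must redirect it to a file
    return ["miniprot", "-t", str(threads), "--gff", target, source]
```

The returned list can be passed to `subprocess.run`; for the miniprot branch, open the output file and pass it as `stdout=`.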
## Liftoff (Same-Species Transfer)
Liftoff maps annotations from a reference genome to a target assembly using Minimap2 alignments. Ideal for transferring annotations between different assemblies of the same species.
### Basic Usage
```bash
# Transfer annotations from reference to target
liftoff \
-g reference_annotation.gff3 \
-o lifted_annotation.gff3 \
-u unmapped_features.txt \
-dir liftoff_intermediates \
-p 16 \
target_assembly.fasta \
reference_genome.fasta
```
### Key Options
| Option | Description |
|--------|-------------|
| `-g` | Reference annotation (GFF3 or GTF) |
| `-o` | Output annotation file |
| `-u` | File listing unmapped features |
| `-dir` | Directory for intermediate files |
| `-p` | CPU threads |
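| `-a` | Alignment coverage threshold (default: 0.5; fraction of the reference feature that must align) |
| `-s` | Sequence identity threshold for child features such as exons/CDS (default: 0.5) |
| `-sc` | With `-copies`, minimum sequence identity for a gene to count as an extra copy (default: 1.0; must exceed `-s`) |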
| `-copies` | Look for extra gene copies in target |
| `-exclude_partial` | Exclude partially mapped genes |
| `-chroms` | Chromosome name mapping file (comma-separated: `ref,target`, one pair per line) |
### Strict Parameters for High-Quality Transfer
```bash
# Stricter thresholds for closely related assemblies
# -a 0.95: 95% of the reference feature must align
# -s 0.90: 90% sequence identity required
liftoff \
-g reference.gff3 \
-o lifted.gff3 \
-u unmapped.txt \
-dir liftoff_tmp \
-a 0.95 \
-s 0.90 \
-exclude_partial \
-p 16 \
target.fasta \
reference.fasta
```
### With Chromosome Name Mapping
```bash
# Create chromosome mapping file (comma-separated, one ref,target pair per line)
# ref_chr1,target_scaffold_1
# ref_chr2,target_scaffold_2
liftoff \
-g reference.gff3 \
-o lifted.gff3 \
-chroms chrom_map.txt \
-p 16 \
target.fasta \
reference.fasta
```
### Output
The output GFF3 contains transferred annotations with additional attributes:
| Attribute | Description |
|-----------|-------------|
| `coverage` | Fraction of reference feature aligned |
| `sequence_ID` | Sequence identity of alignment |
| `extra_copy_number` | Copy number if `-copies` used |
| `valid_ORF` | Whether transferred CDS has valid ORF |
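These attributes can be used to flag genes that transferred poorly. The following is a minimal sketch using only the standard library; the function names and the 0.9 cutoffs are illustrative assumptions, not Liftoff defaults. It reads the `coverage` and `sequence_ID` attributes from gene lines and skips features that lack them.

```python
def parse_gff3_attributes(column9):
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def flag_low_quality_genes(gff_lines, min_coverage=0.9, min_identity=0.9):
    """Return IDs of gene features whose coverage or sequence_ID is below the cutoffs."""
    flagged = []
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "gene":
            continue
        attrs = parse_gff3_attributes(cols[8])
        try:
            coverage = float(attrs["coverage"])
            identity = float(attrs["sequence_ID"])
        except (KeyError, ValueError):
            continue  # attributes missing or malformed; skip rather than guess
        if coverage < min_coverage or identity < min_identity:
            flagged.append(attrs.get("ID", "unknown"))
    return flagged
```

A typical call would be `flag_low_quality_genes(open('lifted_annotation.gff3'))`, followed by manual review of the returned gene IDs.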
## LiftOn (Newer Successor)
LiftOn improves on Liftoff by combining Liftoff liftover with MiniProt protein alignment to correct gene models that do not transfer cleanly.
```bash
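# LiftOn combines Liftoff + MiniProt
# Positional arguments follow Liftoff: target assembly first, then reference genome
# Confirm the thread flag for your installed version with `lifton --help`
lifton \
-g reference.gff3 \
-o lifton_annotation.gff3 \
-t 16 \
target.fasta \
reference.fasta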
```
## MiniProt (Cross-Species Protein Alignment)
MiniProt aligns protein sequences to a genome with splicing awareness. Ideal for cross-species annotation transfer using proteins from related species.
### Basic Usage
```bash
# Index target genome
miniprot -t 16 -d target.mpi target_assembly.fasta
# Align proteins to genome
miniprot -t 16 --gff target.mpi reference_proteins.faa > miniprot_alignments.gff
```
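The resulting alignments can be reduced to one best hit per protein before further processing. This is a minimal sketch under the assumption that MiniProt's mRNA lines carry `Rank`, `Identity`, and `Target` attributes; check a few lines of your own output first. `best_hits` and the 0.5 identity cutoff are illustrative choices, not MiniProt settings.

```python
def best_hits(miniprot_gff_lines, min_identity=0.5):
    """Return (protein, seqid, start, end, identity) for rank-1 mRNA alignments."""
    hits = []
    for line in miniprot_gff_lines:
        if line.startswith("#"):
            continue  # skips ##gff-version and ##PAF comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "mRNA":
            continue
        attrs = dict(f.split("=", 1) for f in cols[8].split(";") if "=" in f)
        if attrs.get("Rank") != "1":
            continue  # keep only the top-ranked alignment of each protein
        try:
            identity = float(attrs["Identity"])
        except (KeyError, ValueError):
            continue
        if identity >= min_identity:
            # Target is 'protein_id start end'; keep only the protein ID
            protein = attrs.get("Target", "").split(" ")[0]
            hits.append((protein, cols[0], int(cols[3]), int(cols[4]), identity))
    return hits
```

Keeping lower-ranked hits instead (for example to study paralogs) only requires changing the `Rank` test.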
### Key Options
| Option | Description |
|--------|-------------|
| `-t` | CPU threads |
| `-d` | Build index database |
| `--gff` | Output in GFF3 format |
| `--gtf` | Output in GTF format |
| `-G` | Max intron size (default: 200000) |
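| `-S` | Disable splicing (for intron-less alignments) |
| `--outs` | Output alignments scoring at least this fraction of the best score (default: 0.99); lower it to report more secondary hits such as paralogs |
| `--outc` | Min fraction of the query protein that must align for output (0-1; default: 0.1) |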
| `-k` | K-mer size for indexing |
### Cross-Species Transfer
```bash
# Use proteins from closely related species
# -G: Adjust max intron size based on target species
# Vertebrates: -G 500000; Insects: -G 50000; Fungi: -G 5000
miniprot -t 16 --gff -G 500000 target.mpi related_species_proteins.faa > cross_species.gff
```
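When an annotation of the related species is available, a `-G` value can be derived from its intron lengths instead of a taxon-wide guess. This is a heuristic sketch, not something MiniProt does itself: `intron_lengths`, `suggest_max_intron`, and the 0.99 quantile are assumptions made for illustration. It expects GFF3 exon lines with `Parent=` attributes.

```python
from collections import defaultdict


def intron_lengths(gff_lines):
    """Return intron lengths implied by the exon features of each transcript."""
    exons = defaultdict(list)
    for line in gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue
        for field in cols[8].split(";"):
            if field.startswith("Parent="):
                # An exon may list several parent transcripts, comma-separated
                for parent in field[len("Parent="):].split(","):
                    exons[parent].append((int(cols[3]), int(cols[4])))
    lengths = []
    for spans in exons.values():
        spans.sort()
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            gap = next_start - prev_end - 1
            if gap > 0:
                lengths.append(gap)
    return lengths


def suggest_max_intron(gff_lines, quantile=0.99):
    """Return the intron length at the given quantile, or None if no introns exist."""
    lengths = sorted(intron_lengths(gff_lines))
    if not lengths:
        return None
    index = min(len(lengths) - 1, int(quantile * len(lengths)))
    return lengths[index]
```

Rounding the suggested value up generously is safer than using it exactly, since introns in the target species may be longer than those in the related annotation.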
### Convert MiniProt GFF to Gene Models
```python
import gffutils


def miniprot_gff_to_gene_models(miniprot_gff, output_gff):
    '''Convert MiniProt alignment GFF to standard gene models.

    MiniProt outputs mRNA features with CDS children.
    This adds gene-level parent features for compatibility.
    '''
    db = gffutils.create_db(miniprot_gff, ':memory:', merge_strategy='merge')
    gene_id = 0

    with open(output_gff, 'w') as out:
        out.write('##gff-version 3\n')
        for mrna in db.features_of_type('mRNA'):
            gene_id += 1
            # Synthesize one gene per alignment, spanning the same interval as its mRNA
            gene_line = f'{mrna.seqid}\tMiniProt\tgene\t{mrna.start}\t{mrna.end}\t{mrna.score}\t{mrna.strand}\t.\tID=mpgene_{gene_id}\n'
            mrna_line = f'{mrna.seqid}\tMiniProt\tmRNA\t{mrna.start}\t{mrna.end}\t{mrna.score}\t{mrna.strand}\t.\tID={mrna.id};Parent=mpgene_{gene_id}\n'
            out.write(gene_line)
            out.write(mrna_line)
            # Write the children (e.g. CDS) in coordinate order under the mRNA
            for child in db.children(mrna, order_by='start'):
                child_line = f'{child.seqid}\tMiniProt\t{child.featuretype}\t{child.start}\t{child.end}\t{child.score}\t{child.strand}\t{child.frame}\tParent={mrna.id}\n'
                out.write(child_line)
    return output_gff
```
### Distinguish from Orthology-Based Transfer
MiniProt performs protein-to-genome alignment, which maps protein sequences to genomic coordinates with intron prediction. This is different from orthology-based transfer (see comparative-genomics/ortholog-inference), which identifies evolutionary relationships between gene families without genome alignment.
## Quality Assessment
**Goal:** Evaluate annotation transfer quality by comparing gene/transcript counts and validating that transferred CDSs have intact open reading frames.
**Approach:** Count genes and transcripts in both reference and transferred GFF files to compute a transfer rate, then extract each transferred CDS sequence from the target assembly and check for valid start codon, single stop codon, and correct frame.
```python
from collections import defaultdict

import gffutils
from Bio import SeqIO
from Bio.Seq import Seq


def compare_annotations(reference_gff, transferred_gff):
    '''Compare reference and transferred annotations for QC.'''
    ref_db = gffutils.create_db(reference_gff, ':memory:', merge_strategy='merge')
    tgt_db = gffutils.create_db(transferred_gff, ':memory:', merge_strategy='merge')

    ref_genes = list(ref_db.features_of_type('gene'))
    tgt_genes = list(tgt_db.features_of_type('gene'))
    ref_mrnas = list(ref_db.features_of_type(['mRNA', 'transcript']))
    tgt_mrnas = list(tgt_db.features_of_type(['mRNA', 'transcript']))

    stats = {
        'ref_genes': len(ref_genes),
        'transferred_genes': len(tgt_genes),
        'transfer_rate': len(tgt_genes) / len(ref_genes) if ref_genes else 0,
        'ref_transcripts': len(ref_mrnas),
        'transferred_transcripts': len(tgt_mrnas),
    }

    print('=== Annotation Transfer QC ===')
    print(f'Reference genes: {stats["ref_genes"]}')
    print(f'Transferred genes: {stats["transferred_genes"]}')
    print(f'Transfer rate: {stats["transfer_rate"]:.1%}')
    print(f'Reference transcripts: {stats["ref_transcripts"]}')
    print(f'Transferred transcripts: {stats["transferred_transcripts"]}')

    # Transfer rate > 95% is excellent for same-species liftover
    # Transfer rate > 80% is typical for closely related species
    # Transfer rate < 70% suggests distant species or assembly issues
    if stats['transfer_rate'] > 0.95:
        print('Quality: Excellent (>95% transfer rate)')
    elif stats['transfer_rate'] > 0.80:
        print('Quality: Good (>80% transfer rate)')
    else:
        print('Quality: Low transfer rate - check assembly quality or species distance')
    return stats


def check_transferred_orfs(transferred_gff, target_fasta):
    '''Check how many transferred transcripts have a valid open reading frame.

    CDS segments are joined per parent transcript before translation, because a
    single CDS line is usually only one exon's share of the coding sequence.
    Assumes the CDS includes the stop codon (GFF3 convention) and starts in phase 0.
    '''
    genome = SeqIO.to_dict(SeqIO.parse(target_fasta, 'fasta'))
    db = gffutils.create_db(transferred_gff, ':memory:', merge_strategy='merge')

    # Group CDS segments by parent transcript, in ascending genomic order
    cds_by_transcript = defaultdict(list)
    for cds in db.features_of_type('CDS', order_by='start'):
        for parent in cds.attributes.get('Parent', []):
            cds_by_transcript[parent].append(cds)

    valid, invalid = 0, 0
    for segments in cds_by_transcript.values():
        first = segments[0]
        seq = Seq(''.join(str(genome[c.seqid].seq[c.start - 1:c.end]) for c in segments))
        if first.strand == '-':
            seq = seq.reverse_complement()
        # A length that is not a multiple of 3 cannot be an intact ORF
        if len(seq) % 3 != 0:
            invalid += 1
            continue
        protein = seq.translate()
        if protein.startswith('M') and protein.endswith('*') and protein.count('*') == 1:
            valid += 1
        else:
            invalid += 1

    total = valid + invalid
    print('\n=== ORF Validation ===')
    print(f'Transcripts with CDS: {total}')
    if total:  # avoid division by zero when no CDS features were transferred
        print(f'Valid ORFs: {valid} ({valid/total:.1%})')
        print(f'Invalid ORFs: {invalid} ({invalid/total:.1%})')
    return valid, invalid, total
```
## Troubleshooting
### Many Unmapped Features with Liftoff
- Check assembly contiguity (fragmented assemblies lose features at contig boundaries)
- Relax thresholds below the 0.5 defaults, e.g. `-a 0.4 -s 0.4`
- Verify chromosome naming consistency
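To see whether unmapped features cluster on particular reference sequences, which would point at naming mismatches or missing contigs, the unmapped list can be tallied against the reference annotation. This is a minimal sketch that assumes the `-u` file holds one feature ID per line and that those IDs match the `ID=` attributes in the reference GFF3; `unmapped_by_sequence` is a hypothetical helper name.

```python
from collections import Counter


def unmapped_by_sequence(unmapped_ids, reference_gff_lines):
    """Count unmapped feature IDs per reference sequence (GFF column 1)."""
    wanted = {line.strip() for line in unmapped_ids if line.strip()}
    counts = Counter()
    for line in reference_gff_lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        for field in cols[8].split(";"):
            # Match on the feature's own ID, not on Parent or other attributes
            if field.startswith("ID=") and field[3:] in wanted:
                counts[cols[0]] += 1
    return counts
```

If nearly all unmapped IDs sit on a handful of reference sequences, check those names against the target FASTA headers before relaxing any thresholds.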
### MiniProt Misses Short Genes
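- Lower the output coverage filter with `--outc` (fraction of the query protein that must align; default 0.1) if it was raised, e.g. `--outc 0.05`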
- Check that protein sequences include short ORFs
### Invalid ORFs After Transfer
- Assembly may have variants causing frameshifts
- Try LiftOn which combines Liftoff + MiniProt for correction
- Consider re-predicting genes de novo in problem regions
## Related Skills
- eukaryotic-gene-prediction - De novo prediction alternative
- comparative-genomics/ortholog-inference - Orthology-based functional transfer
- comparative-genomics/synteny-analysis - Synteny context for annotation transfer
- genome-intervals/gtf-gff-handling - Parse and manipulate transferred annotations