bio-genome-annotation-eukaryotic-gene-prediction
$
npx mdskill add GPTomics/bioSkills/bio-genome-annotation-eukaryotic-gene-predictionPredicts eukaryotic genes using RNA-seq and protein evidence with BRAKER3 or GALBA
- Annotates newly assembled eukaryotic genomes or improves existing gene models
- Uses BRAKER3 for RNA-seq and protein evidence, GALBA for protein-only evidence
- Runs Augustus with trained parameters to generate accurate gene structures
- Delivers gene predictions as structured output files for downstream analysis
SKILL.md
.github/skills/bio-genome-annotation-eukaryotic-gene-predictionView on GitHub ↗
---
name: bio-genome-annotation-eukaryotic-gene-prediction
description: Predict protein-coding genes in eukaryotic genomes using BRAKER3 for combined RNA-seq and protein evidence, or GALBA for protein-only evidence. Runs Augustus with trained parameters for accurate gene models. Use when annotating a newly assembled eukaryotic genome or improving existing gene predictions.
tool_type: cli
primary_tool: BRAKER3
---
## Version Compatibility
Reference examples tested with: BUSCO 5.5+, HISAT2 2.2.1+, pandas 2.2+, samtools 1.19+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Eukaryotic Gene Prediction
**"Predict genes in my eukaryotic genome"** → Identify protein-coding gene structures (exons, introns, UTRs) using RNA-seq alignment evidence and/or protein homology to train ab initio predictors.
- CLI: `braker.pl --genome=assembly.fa --bam=rnaseq.bam --prot_seq=proteins.fa` (BRAKER3)
Predict protein-coding genes in eukaryotic genomes using evidence-based methods. BRAKER3 combines RNA-seq and protein homology evidence for the most accurate predictions. GALBA provides an alternative when only protein evidence is available.
## Prerequisites
**CRITICAL:** The input assembly must be softmasked (repeats in lowercase). Run repeat-annotation first to softmask the genome. Unmasked assemblies produce many false positive gene predictions in repetitive regions.
## BRAKER3 (RNA-seq + Protein Evidence)
BRAKER3 is the preferred pipeline when RNA-seq data exists. It combines GeneMark-ETP with Augustus training for high-accuracy gene models.
### Evidence Preparation
```bash
# Align RNA-seq with HISAT2
hisat2-build assembly_softmasked.fasta genome_index
hisat2 -x genome_index -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz \
--dta -p 16 | samtools sort -@ 4 -o rnaseq_sorted.bam
samtools index rnaseq_sorted.bam
# Download OrthoDB proteins for the relevant clade
# Available: Metazoa, Viridiplantae, Fungi, Arthropoda, Vertebrata
wget https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Viridiplantae.fa.gz
gunzip Viridiplantae.fa.gz
```
### Run BRAKER3
```bash
braker.pl \
--genome=assembly_softmasked.fasta \
--bam=rnaseq_sorted.bam \
--prot_seq=Viridiplantae.fa \
--softmasking \
--threads=16 \
--species=my_species \
--workingdir=braker3_out \
--gff3
```
### Key Options
| Option | Description |
|--------|-------------|
| `--genome` | Softmasked genome assembly (FASTA) |
| `--bam` | RNA-seq alignments (BAM, can specify multiple comma-separated) |
| `--prot_seq` | Protein evidence (FASTA) |
| `--softmasking` | Genome is softmasked (required) |
| `--species` | Species name for Augustus training |
| `--threads` | CPU threads |
| `--gff3` | Output in GFF3 format (default is GTF) |
| `--fungus` | Use fungal-specific parameters |
| `--workingdir` | Output directory |
| `--UTR` | Predict UTRs (requires sufficient RNA-seq coverage) |
### Output Files
```
braker3_out/
├── braker.gtf # Gene predictions (GTF)
├── braker.gff3 # Gene predictions (GFF3, if --gff3)
├── braker.codingseq # CDS nucleotide sequences
├── braker.aa # Protein sequences
├── hintsfile.gff # All evidence hints
├── Augustus/ # Trained Augustus parameters
└── GeneMark-ETP/ # GeneMark intermediate files
```
## GALBA (Protein-Only Evidence)
GALBA works best with closely related species proteins. Always prefer BRAKER3 when RNA-seq data is available, as intron evidence from RNA-seq substantially improves splice site accuracy.
```bash
galba.pl \
--genome=assembly_softmasked.fasta \
--prot_seq=closely_related_proteins.fa \
--softmasking \
--threads=16 \
--species=my_species \
--workingdir=galba_out \
--gff3
```
### When to Use GALBA vs BRAKER3
| Scenario | Tool |
|----------|------|
| RNA-seq + proteins available | BRAKER3 |
| Proteins only, closely related species | GALBA |
| Proteins only, distant species | BRAKER3 --prot_seq only (uses ProtHint) |
| No external evidence | Augustus with pre-trained species (last resort) |
## Augustus Standalone
For genomes with pre-trained parameters or when rerunning predictions with different hints.
```bash
# List available pre-trained species
augustus --species=help
# Run with pre-trained species
augustus \
--species=arabidopsis \
--softmasking=1 \
--gff3=on \
--UTR=off \
assembly_softmasked.fasta > augustus_predictions.gff3
# Run with hints (evidence)
augustus \
--species=my_species \
--softmasking=1 \
--gff3=on \
--hintsfile=hints.gff \
--extrinsicCfgFile=extrinsic.cfg \
assembly_softmasked.fasta > augustus_hints.gff3
```
## Evaluation with BUSCO
```bash
# Evaluate predicted proteins
busco -i braker3_out/braker.aa -m proteins -l embryophyta_odb10 -o busco_proteins -c 8
# Compare to genome-mode BUSCO
busco -i assembly_softmasked.fasta -m genome -l embryophyta_odb10 -o busco_genome -c 8
```
### Interpretation
| Metric | Meaning |
|--------|---------|
| Protein BUSCO > Genome BUSCO | Annotation captures most genes |
| Protein BUSCO < Genome BUSCO | Missing gene models, re-examine evidence |
| High duplication | Check for retained duplicates vs artifacts |
## Python: Parse Gene Models
**Goal:** Compute summary statistics for predicted gene models to assess annotation quality and compare against known species benchmarks.
**Approach:** Load the GFF3 into a gffutils database, iterate through gene and transcript features to compute counts, exons per transcript, intron lengths, and single-exon gene fraction.
```python
import gffutils
import pandas as pd
def load_gene_models(gff_file):
'''Load eukaryotic gene models from GFF3/GTF.'''
db = gffutils.create_db(str(gff_file), ':memory:', merge_strategy='merge')
return db
def gene_model_stats(db):
'''Compute statistics for predicted gene models.'''
genes = list(db.features_of_type('gene'))
transcripts = list(db.features_of_type(['mRNA', 'transcript']))
exon_counts, intron_lengths, gene_lengths = [], [], []
for gene in genes:
gene_lengths.append(gene.end - gene.start + 1)
for tx in db.children(gene, featuretype=['mRNA', 'transcript']):
exons = list(db.children(tx, featuretype='exon'))
exon_counts.append(len(exons))
sorted_exons = sorted(exons, key=lambda e: e.start)
for i in range(len(sorted_exons) - 1):
intron_len = sorted_exons[i + 1].start - sorted_exons[i].end - 1
intron_lengths.append(intron_len)
stats = {
'total_genes': len(genes),
'total_transcripts': len(transcripts),
'transcripts_per_gene': len(transcripts) / len(genes) if genes else 0,
'median_exons_per_transcript': pd.Series(exon_counts).median() if exon_counts else 0,
'median_gene_length': pd.Series(gene_lengths).median() if gene_lengths else 0,
'median_intron_length': pd.Series(intron_lengths).median() if intron_lengths else 0,
'single_exon_genes': sum(1 for e in exon_counts if e == 1),
}
return stats
db = load_gene_models('braker3_out/braker.gff3')
stats = gene_model_stats(db)
for key, val in stats.items():
print(f'{key}: {val}')
```
## Troubleshooting
### Low Gene Count
- Verify softmasking (lowercase repeats present in assembly)
- Check RNA-seq alignment rate (>70% expected)
- Ensure OrthoDB proteins match the correct clade
### Many Single-Exon Genes
- May indicate bacterial contamination
- Check if assembly was properly softmasked
- Consider filtering predictions by evidence support
### BRAKER3 Fails
- Ensure all dependencies are in PATH (GeneMark, Augustus, DIAMOND)
- Check that genome headers have no special characters (pipes, spaces)
- Simplify FASTA headers: `sed 's/ .*//' assembly.fasta > clean.fasta`
## Related Skills
- repeat-annotation - PREREQUISITE: softmask repeats before gene prediction
- functional-annotation - Add GO/KEGG/Pfam to predicted proteins
- read-alignment/star-alignment - Alternative RNA-seq aligner for evidence
- genome-assembly/assembly-qc - Verify assembly quality before prediction