bio-genome-annotation-eukaryotic-gene-prediction

Name: bio-genome-annotation-eukaryotic-gene-prediction
Author: GPTomics/bioSkills

$npx mdskill add GPTomics/bioSkills/bio-genome-annotation-eukaryotic-gene-prediction

Predicts eukaryotic genes using RNA-seq and protein evidence with BRAKER3 or GALBA

Annotates newly assembled eukaryotic genomes or improves existing gene models
Uses BRAKER3 for RNA-seq and protein evidence, GALBA for protein-only evidence
Runs Augustus with trained parameters to generate accurate gene structures
Delivers gene predictions as structured output files for downstream analysis

SKILL.md

.github/skills/bio-genome-annotation-eukaryotic-gene-predictionView on GitHub ↗

---
name: bio-genome-annotation-eukaryotic-gene-prediction
description: Predict protein-coding genes in eukaryotic genomes using BRAKER3 for combined RNA-seq and protein evidence, or GALBA for protein-only evidence. Runs Augustus with trained parameters for accurate gene models. Use when annotating a newly assembled eukaryotic genome or improving existing gene predictions.
tool_type: cli
primary_tool: BRAKER3
---

## Version Compatibility

Reference examples tested with: BUSCO 5.5+, HISAT2 2.2.1+, pandas 2.2+, samtools 1.19+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Eukaryotic Gene Prediction

**"Predict genes in my eukaryotic genome"** → Identify protein-coding gene structures (exons, introns, UTRs) using RNA-seq alignment evidence and/or protein homology to train ab initio predictors.
- CLI: `braker.pl --genome=assembly.fa --bam=rnaseq.bam --prot_seq=proteins.fa` (BRAKER3)

Predict protein-coding genes in eukaryotic genomes using evidence-based methods. BRAKER3 combines RNA-seq and protein homology evidence for the most accurate predictions. GALBA provides an alternative when only protein evidence is available.

## Prerequisites

**CRITICAL:** The input assembly must be softmasked (repeats in lowercase). Run repeat-annotation first to softmask the genome. Unmasked assemblies produce many false positive gene predictions in repetitive regions.

## BRAKER3 (RNA-seq + Protein Evidence)

BRAKER3 is the preferred pipeline when RNA-seq data exists. It combines GeneMark-ETP with Augustus training for high-accuracy gene models.

### Evidence Preparation

```bash
# Align RNA-seq with HISAT2
hisat2-build assembly_softmasked.fasta genome_index
hisat2 -x genome_index -1 reads_R1.fastq.gz -2 reads_R2.fastq.gz \
    --dta -p 16 | samtools sort -@ 4 -o rnaseq_sorted.bam
samtools index rnaseq_sorted.bam

# Download OrthoDB proteins for the relevant clade
# Available: Metazoa, Viridiplantae, Fungi, Arthropoda, Vertebrata
wget https://bioinf.uni-greifswald.de/bioinf/partitioned_odb11/Viridiplantae.fa.gz
gunzip Viridiplantae.fa.gz
```

### Run BRAKER3

```bash
braker.pl \
    --genome=assembly_softmasked.fasta \
    --bam=rnaseq_sorted.bam \
    --prot_seq=Viridiplantae.fa \
    --softmasking \
    --threads=16 \
    --species=my_species \
    --workingdir=braker3_out \
    --gff3
```

### Key Options

| Option | Description |
|--------|-------------|
| `--genome` | Softmasked genome assembly (FASTA) |
| `--bam` | RNA-seq alignments (BAM, can specify multiple comma-separated) |
| `--prot_seq` | Protein evidence (FASTA) |
| `--softmasking` | Genome is softmasked (required) |
| `--species` | Species name for Augustus training |
| `--threads` | CPU threads |
| `--gff3` | Output in GFF3 format (default is GTF) |
| `--fungus` | Use fungal-specific parameters |
| `--workingdir` | Output directory |
| `--UTR` | Predict UTRs (requires sufficient RNA-seq coverage) |

### Output Files

```
braker3_out/
├── braker.gtf           # Gene predictions (GTF)
├── braker.gff3          # Gene predictions (GFF3, if --gff3)
├── braker.codingseq     # CDS nucleotide sequences
├── braker.aa            # Protein sequences
├── hintsfile.gff        # All evidence hints
├── Augustus/            # Trained Augustus parameters
└── GeneMark-ETP/        # GeneMark intermediate files
```

## GALBA (Protein-Only Evidence)

GALBA works best with closely related species proteins. Always prefer BRAKER3 when RNA-seq data is available, as intron evidence from RNA-seq substantially improves splice site accuracy.

```bash
galba.pl \
    --genome=assembly_softmasked.fasta \
    --prot_seq=closely_related_proteins.fa \
    --softmasking \
    --threads=16 \
    --species=my_species \
    --workingdir=galba_out \
    --gff3
```

### When to Use GALBA vs BRAKER3

| Scenario | Tool |
|----------|------|
| RNA-seq + proteins available | BRAKER3 |
| Proteins only, closely related species | GALBA |
| Proteins only, distant species | BRAKER3 --prot_seq only (uses ProtHint) |
| No external evidence | Augustus with pre-trained species (last resort) |

## Augustus Standalone

For genomes with pre-trained parameters or when rerunning predictions with different hints.

```bash
# List available pre-trained species
augustus --species=help

# Run with pre-trained species
augustus \
    --species=arabidopsis \
    --softmasking=1 \
    --gff3=on \
    --UTR=off \
    assembly_softmasked.fasta > augustus_predictions.gff3

# Run with hints (evidence)
augustus \
    --species=my_species \
    --softmasking=1 \
    --gff3=on \
    --hintsfile=hints.gff \
    --extrinsicCfgFile=extrinsic.cfg \
    assembly_softmasked.fasta > augustus_hints.gff3
```

## Evaluation with BUSCO

```bash
# Evaluate predicted proteins
busco -i braker3_out/braker.aa -m proteins -l embryophyta_odb10 -o busco_proteins -c 8

# Compare to genome-mode BUSCO
busco -i assembly_softmasked.fasta -m genome -l embryophyta_odb10 -o busco_genome -c 8
```

### Interpretation

| Metric | Meaning |
|--------|---------|
| Protein BUSCO > Genome BUSCO | Annotation captures most genes |
| Protein BUSCO < Genome BUSCO | Missing gene models, re-examine evidence |
| High duplication | Check for retained duplicates vs artifacts |

## Python: Parse Gene Models

**Goal:** Compute summary statistics for predicted gene models to assess annotation quality and compare against known species benchmarks.

**Approach:** Load the GFF3 into a gffutils database, iterate through gene and transcript features to compute counts, exons per transcript, intron lengths, and single-exon gene fraction.

```python
import gffutils
import pandas as pd

def load_gene_models(gff_file):
    '''Load eukaryotic gene models from GFF3/GTF.'''
    db = gffutils.create_db(str(gff_file), ':memory:', merge_strategy='merge')
    return db

def gene_model_stats(db):
    '''Compute statistics for predicted gene models.'''
    genes = list(db.features_of_type('gene'))
    transcripts = list(db.features_of_type(['mRNA', 'transcript']))

    exon_counts, intron_lengths, gene_lengths = [], [], []
    for gene in genes:
        gene_lengths.append(gene.end - gene.start + 1)
        for tx in db.children(gene, featuretype=['mRNA', 'transcript']):
            exons = list(db.children(tx, featuretype='exon'))
            exon_counts.append(len(exons))
            sorted_exons = sorted(exons, key=lambda e: e.start)
            for i in range(len(sorted_exons) - 1):
                intron_len = sorted_exons[i + 1].start - sorted_exons[i].end - 1
                intron_lengths.append(intron_len)

    stats = {
        'total_genes': len(genes),
        'total_transcripts': len(transcripts),
        'transcripts_per_gene': len(transcripts) / len(genes) if genes else 0,
        'median_exons_per_transcript': pd.Series(exon_counts).median() if exon_counts else 0,
        'median_gene_length': pd.Series(gene_lengths).median() if gene_lengths else 0,
        'median_intron_length': pd.Series(intron_lengths).median() if intron_lengths else 0,
        'single_exon_genes': sum(1 for e in exon_counts if e == 1),
    }
    return stats

db = load_gene_models('braker3_out/braker.gff3')
stats = gene_model_stats(db)
for key, val in stats.items():
    print(f'{key}: {val}')
```

## Troubleshooting

### Low Gene Count
- Verify softmasking (lowercase repeats present in assembly)
- Check RNA-seq alignment rate (>70% expected)
- Ensure OrthoDB proteins match the correct clade

### Many Single-Exon Genes
- May indicate bacterial contamination
- Check if assembly was properly softmasked
- Consider filtering predictions by evidence support

### BRAKER3 Fails
- Ensure all dependencies are in PATH (GeneMark, Augustus, DIAMOND)
- Check that genome headers have no special characters (pipes, spaces)
- Simplify FASTA headers: `sed 's/ .*//' assembly.fasta > clean.fasta`

## Related Skills

- repeat-annotation - PREREQUISITE: softmask repeats before gene prediction
- functional-annotation - Add GO/KEGG/Pfam to predicted proteins
- read-alignment/star-alignment - Alternative RNA-seq aligner for evidence
- genome-assembly/assembly-qc - Verify assembly quality before prediction

More from GPTomics/bioSkills

Skill	Description
bio-admet-prediction	Predicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
bio-alignment-amplicon-clipping	Trim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
bio-alignment-filtering	Filter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
bio-alignment-indexing	Create and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
bio-alignment-io	Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
bio-alignment-msa-parsing	Parse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
bio-alignment-msa-statistics	Calculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
bio-alignment-multiple	Perform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
bio-alignment-pairwise	Perform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
bio-alignment-sorting	Sort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.