bio-genome-annotation-prokaryotic-annotation

Name: bio-genome-annotation-prokaryotic-annotation
Author: GPTomics/bioSkills

$ npx mdskill add GPTomics/bioSkills/bio-genome-annotation-prokaryotic-annotation

Annotate prokaryotic genomes with Bakta or Prokka for NCBI submission.

Generates GFF3, GenBank, and FASTA files with NCBI-compatible locus tags.
Depends on Bakta CLI for comprehensive annotation or Prokka for lightweight results.
Selects tool based on user preference for depth versus speed.
Delivers structured genomic data ready for database submission.


---
name: bio-genome-annotation-prokaryotic-annotation
description: Annotate bacterial and archaeal genomes with Bakta for comprehensive structural and functional annotation, or Prokka for lightweight annotation. Generates GFF3, GenBank, and FASTA outputs with NCBI-compatible locus tags. Use when annotating a newly assembled prokaryotic genome or preparing annotations for NCBI submission.
tool_type: cli
primary_tool: Bakta
---

## Version Compatibility

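Reference examples tested with: BUSCO 5.5+. Bakta, Prokka, and gffutils versions are not pinned here, so confirm flags and signatures against the installed releases.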

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
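
The version check described above can be scripted. The following is a minimal sketch, assuming each tool accepts `--version` and prints a dotted version number somewhere in its output. The helper names `parse_version`, `tool_version`, and `meets_minimum` are hypothetical and not part of any of the tools.

```python
import re
import shutil
import subprocess

def parse_version(text):
    '''Return the first dotted version number in text as a tuple of ints, or None.'''
    match = re.search(r'(\d+(?:\.\d+)+)', text)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split('.'))

def tool_version(tool):
    '''Run `<tool> --version` and parse the result; None if the tool is not on PATH.'''
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, '--version'], capture_output=True, text=True)
    # Some tools print their version to stderr rather than stdout, so check both.
    return parse_version(result.stdout + result.stderr)

def meets_minimum(found, minimum):
    '''True when a parsed version tuple is at least the required minimum.'''
    return found is not None and found >= minimum

if __name__ == '__main__':
    for tool in ('bakta', 'prokka', 'busco'):
        print(tool, tool_version(tool))
```

A result of `None` means the tool was not found or its version string could not be parsed; in that case fall back to reading `<tool> --help` by hand.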

# Prokaryotic Genome Annotation

**"Annotate my bacterial genome"** → Predict and functionally annotate coding sequences, rRNAs, tRNAs, and other features in a prokaryotic genome assembly.
- CLI: `bakta --db db/ assembly.fa` (preferred), `prokka --outdir annot assembly.fa` (legacy)

Annotate prokaryotic genomes with Bakta (preferred) or Prokka (legacy). Bakta provides more comprehensive functional annotation through up-to-date databases and NCBI-compatible output formatting.

## Bakta

### Database Setup

```bash
# Download the full database (~30 GB download, larger once unpacked; recommended for comprehensive annotation)
bakta_db download --output /path/to/bakta_db --type full

# Lightweight database (~1.5 GB, faster but less comprehensive)
bakta_db download --output /path/to/bakta_db --type light

# Update existing database
bakta_db update --db /path/to/bakta_db
```

### Basic Annotation

```bash
bakta \
    --db /path/to/bakta_db \
    --output bakta_out \
    --prefix my_genome \
    --locus-tag MYORG \
    --threads 8 \
    assembly.fasta
```

### Key Options

| Option | Description |
|--------|-------------|
| `--db` | Path to Bakta database |
| `--output` | Output directory |
| `--prefix` | Output file prefix |
| `--locus-tag` | NCBI-compatible locus tag prefix |
| `--genus` / `--species` | Organism taxonomy |
| `--strain` | Strain designation |
| `--complete` | Declares that all sequences are complete replicons (chromosome and plasmids) |
| `--gram` | Gram type (+ or -) for signal peptide prediction |
| `--threads` | CPU threads |
| `--min-contig-length` | Minimum contig length to annotate (default: 1; 200 with `--compliant`) |
| `--translation-table` | Genetic code (default: 11 for bacteria) |

### With Organism Metadata

```bash
bakta \
    --db /path/to/bakta_db \
    --output bakta_out \
    --prefix ecoli_k12 \
    --locus-tag ECK12 \
    --genus Escherichia --species coli --strain K-12 \
    --gram - \
    --complete \
    --threads 16 \
    assembly.fasta
```
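
When annotating many assemblies, the same command line can be assembled programmatically. This is a sketch only: `build_bakta_command` is a hypothetical helper, and it uses only the flags listed in the options table above. It returns an argument list suitable for `subprocess.run` and does not run Bakta itself.

```python
import shlex

def build_bakta_command(assembly, db, output, prefix, locus_tag,
                        genus=None, species=None, strain=None,
                        gram=None, complete=False, threads=8):
    '''Assemble a Bakta argument list from the documented options.'''
    cmd = ['bakta', '--db', db, '--output', output,
           '--prefix', prefix, '--locus-tag', locus_tag,
           '--threads', str(threads)]
    # Optional organism metadata is added only when provided.
    for flag, value in (('--genus', genus), ('--species', species),
                        ('--strain', strain), ('--gram', gram)):
        if value is not None:
            cmd += [flag, value]
    if complete:
        cmd.append('--complete')
    cmd.append(assembly)  # the assembly FASTA is the positional argument
    return cmd

cmd = build_bakta_command(
    'assembly.fasta', '/path/to/bakta_db', 'bakta_out', 'ecoli_k12', 'ECK12',
    genus='Escherichia', species='coli', strain='K-12',
    gram='-', complete=True, threads=16)
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with strain names that contain spaces.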

### Output Files

```
bakta_out/
├── my_genome.gff3       # GFF3 annotation (primary output)
├── my_genome.gbff       # GenBank format
├── my_genome.ffn        # Feature nucleotide sequences
├── my_genome.faa        # Protein sequences
├── my_genome.fna        # Annotated genome sequence
├── my_genome.embl       # EMBL format
├── my_genome.tsv        # Tab-separated feature table
├── my_genome.json       # Machine-readable JSON
└── my_genome.txt        # Summary statistics
```
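
Before moving on to downstream steps, it can help to confirm that a run produced the files listed above. The sketch below assumes the extensions shown in the tree and treats an empty file as missing; `missing_outputs` is a hypothetical helper name.

```python
from pathlib import Path

# Extensions taken from the output tree above.
EXPECTED_EXTENSIONS = ('gff3', 'gbff', 'ffn', 'faa', 'fna', 'embl', 'tsv', 'json', 'txt')

def missing_outputs(output_dir, prefix, extensions=EXPECTED_EXTENSIONS):
    '''Return the expected output file names that are absent or empty.'''
    missing = []
    for ext in extensions:
        path = Path(output_dir) / f'{prefix}.{ext}'
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(path.name)
    return missing
```

For example, `missing_outputs('bakta_out', 'my_genome')` returns an empty list when every listed file exists and is non-empty.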

## Prokka (Legacy Alternative)

Prokka is lighter weight and faster but uses older databases. Prefer Bakta for new projects.

```bash
prokka \
    --outdir prokka_out \
    --prefix my_genome \
    --locustag MYORG \
    --genus Escherichia --species coli \
    --cpus 8 \
    --rfam \
    assembly.fasta
```

### Prokka vs Bakta

| Feature | Bakta | Prokka |
|---------|-------|--------|
| Database updates | Active (2024+) | Unmaintained since 2021 |
| Functional annotation | Comprehensive (UniProt, COG, Pfam) | Basic (UniProt) |
| ncRNA detection | Infernal + Rfam 14.x | Infernal + Rfam 12.x |
| NCBI compatibility | INSDC-compliant GenBank/EMBL/GFF3 with `--compliant` | Produces `.sqn`/`.tbl` via tbl2asn |
| Speed | Moderate | Fast |

## Parsing Annotations with Python

**Goal:** Load Bakta/Prokka GFF3 output into a queryable database to extract CDS features and compute annotation quality metrics like coding density.

**Approach:** Create a gffutils in-memory database from the GFF3 file, iterate CDS features to extract locus tags and product names, and calculate coding density as total CDS bp divided by genome length.

```python
import gffutils

def load_annotation(gff_file):
    '''Load GFF3 into a queryable database.'''
    db = gffutils.create_db(gff_file, ':memory:', merge_strategy='merge')
    return db

def extract_cds_features(db):
    '''Extract all CDS features with product annotations.'''
    features = []
    for cds in db.features_of_type('CDS'):
        features.append({
            'id': cds.id,
            'seqid': cds.seqid,
            'start': cds.start,
            'end': cds.end,
            'strand': cds.strand,
            'product': cds.attributes.get('product', ['unknown'])[0],
            'locus_tag': cds.attributes.get('locus_tag', [''])[0]
        })
    return features

def compute_coding_density(db, genome_length):
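    '''Compute the fraction (0-1) of the genome covered by CDS features.

    genome_length is the total assembly length in bp, supplied by the caller.
    Overlapping CDSs are counted once per feature, so overlaps inflate the value.
    Typical prokaryotic coding density: 0.85-0.95.
    Values below 0.80 may indicate pseudogenes or annotation gaps.
    Values above 0.95 may indicate overlapping annotations.
    '''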
    coding_bp = sum(cds.end - cds.start + 1 for cds in db.features_of_type('CDS'))
    return coding_bp / genome_length

db = load_annotation('bakta_out/my_genome.gff3')
cds_features = extract_cds_features(db)
print(f'Total CDSs: {len(cds_features)}')
```
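
The example above does not show where `genome_length` comes from, and summing CDS lengths counts overlapping genes twice. The following standard-library sketch is one way to address both, assuming the GFF3 carries `##sequence-region` pragmas for every contig (an assumption about the file, worth checking on real output). `coding_density_from_gff3` is a hypothetical helper; it merges overlapping CDS intervals before summing and does not handle features that wrap around the origin of a circular replicon.

```python
from collections import defaultdict

def coding_density_from_gff3(lines):
    '''Return (coding_bp, genome_bp, density) from an iterable of GFF3 lines.'''
    seq_lengths = {}
    cds_by_seq = defaultdict(list)
    for line in lines:
        line = line.rstrip('\n')
        if line.startswith('##FASTA'):
            break  # embedded sequence follows; no more features
        if line.startswith('##sequence-region'):
            _, seqid, start, end = line.split()[:4]
            seq_lengths[seqid] = int(end) - int(start) + 1
            continue
        if not line or line.startswith('#'):
            continue
        cols = line.split('\t')
        if len(cols) >= 5 and cols[2] == 'CDS':
            cds_by_seq[cols[0]].append((int(cols[3]), int(cols[4])))

    coding_bp = 0
    for intervals in cds_by_seq.values():
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:  # overlaps the current merged block
                cur_end = max(cur_end, end)
            else:
                coding_bp += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        coding_bp += cur_end - cur_start + 1

    genome_bp = sum(seq_lengths.values())
    density = coding_bp / genome_bp if genome_bp else float('nan')
    return coding_bp, genome_bp, density
```

With a real run, pass an open file handle, for example `coding_density_from_gff3(open('bakta_out/my_genome.gff3'))`. Because overlaps are merged, the result is never higher than the per-feature sum computed by `compute_coding_density`.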

## Annotation QC

### Expected Metrics by Genome Size

| Genome Size | Expected Genes | Coding Density |
|-------------|---------------|----------------|
| 1-2 Mb | 900-2,000 | 85-92% |
| 2-5 Mb | 1,800-5,000 | 85-90% |
| 5-10 Mb | 4,500-9,000 | 82-88% |
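
The table can be turned into a simple range check. The sketch below hard-codes the rows above as rough guidance rather than strict limits; `EXPECTED_RANGES` and `qc_flags` are hypothetical names, and a genome size that falls exactly on a row boundary is matched against the first row that contains it.

```python
# (min_bp, max_bp, min_genes, max_genes, min_density, max_density), copied from the table above
EXPECTED_RANGES = [
    (1_000_000, 2_000_000, 900, 2000, 0.85, 0.92),
    (2_000_000, 5_000_000, 1800, 5000, 0.85, 0.90),
    (5_000_000, 10_000_000, 4500, 9000, 0.82, 0.88),
]

def qc_flags(genome_bp, gene_count, coding_density):
    '''Return warnings for values outside the tabulated ranges (empty list if none).'''
    for min_bp, max_bp, min_genes, max_genes, min_dens, max_dens in EXPECTED_RANGES:
        if min_bp <= genome_bp <= max_bp:
            break
    else:
        return [f'genome size {genome_bp} bp is outside the tabulated 1-10 Mb range']
    flags = []
    if not min_genes <= gene_count <= max_genes:
        flags.append(f'gene count {gene_count} outside {min_genes}-{max_genes}')
    if not min_dens <= coding_density <= max_dens:
        flags.append(
            f'coding density {coding_density:.2f} outside {min_dens:.2f}-{max_dens:.2f}')
    return flags

print(qc_flags(4_600_000, 4300, 0.88))
```

A non-empty result is a prompt to look closer, not proof of a bad annotation: reduced or unusual genomes can legitimately fall outside these ranges.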

### QC Checks

```bash
# Count annotated features
grep -c $'\tCDS\t' bakta_out/my_genome.gff3
grep -c $'\ttRNA\t' bakta_out/my_genome.gff3
grep -c $'\trRNA\t' bakta_out/my_genome.gff3

# Check for hypothetical proteins (ideally <40% of total CDSs)
grep -c 'hypothetical protein' bakta_out/my_genome.tsv
```
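
The same counts can be gathered in Python, which also gives the hypothetical-protein fraction directly instead of a raw line count. This is a sketch that reads the GFF3 as plain text and assumes the product name is stored in a `product` attribute, as in the parsing example earlier; `summarize_features` is a hypothetical helper.

```python
from collections import Counter
from urllib.parse import unquote

def summarize_features(lines):
    '''Return (feature_type_counts, hypothetical_fraction_of_CDS) from GFF3 lines.'''
    counts = Counter()
    hypothetical = 0
    for line in lines:
        if line.startswith('##FASTA'):
            break  # embedded sequence follows; no more features
        if line.startswith('#') or not line.strip():
            continue
        cols = line.rstrip('\n').split('\t')
        if len(cols) < 9:
            continue
        counts[cols[2]] += 1
        if cols[2] == 'CDS':
            attrs = dict(field.split('=', 1)
                         for field in cols[8].split(';') if '=' in field)
            # GFF3 attribute values may be percent-encoded.
            product = unquote(attrs.get('product', ''))
            if 'hypothetical protein' in product.lower():
                hypothetical += 1
    n_cds = counts['CDS']
    fraction = hypothetical / n_cds if n_cds else 0.0
    return counts, fraction
```

Comparing the returned fraction against the 40% guideline above is more informative than the bare `grep -c` count, because it is normalized by the number of CDSs.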

### BUSCO on Predicted Proteins

```bash
busco -i bakta_out/my_genome.faa -m proteins -l bacteria_odb10 -o busco_proteins
```

## Troubleshooting

### Low Gene Count
- Check assembly completeness with BUSCO (genome mode)
- Verify correct translation table (--translation-table 4 for Mycoplasma)
- Inspect minimum contig length filter

### Many Hypothetical Proteins
- Normal for novel organisms (30-50% is common)
- Try running InterProScan on the .faa file for additional annotations
- Consider eggNOG-mapper for orthology-based functional assignment

### NCBI Submission
- Use `--compliant` flag for NCBI-ready output
- Ensure the locus tag prefix follows NCBI format (3-12 uppercase alphanumeric characters, starting with a letter)
- Review .tsv output for annotation warnings
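
The locus tag rule above is easy to check before launching a long annotation run. A minimal sketch, assuming the rule is exactly "3-12 characters, uppercase letters or digits, first character a letter"; `is_valid_locus_tag_prefix` is a hypothetical helper, and the prefix must still be registered with NCBI separately.

```python
import re

# Assumed rule: 3-12 characters, uppercase letters or digits, first character a letter.
LOCUS_TAG_PREFIX = re.compile(r'[A-Z][A-Z0-9]{2,11}')

def is_valid_locus_tag_prefix(prefix):
    '''True when the prefix matches the assumed NCBI locus tag prefix pattern.'''
    return LOCUS_TAG_PREFIX.fullmatch(prefix) is not None

for candidate in ('MYORG', 'ECK12', 'AB', '1ABC', 'my_org'):
    print(candidate, is_valid_locus_tag_prefix(candidate))
```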

## Related Skills

- functional-annotation - Add GO/KEGG/Pfam to predicted proteins
- ncrna-annotation - Detailed ncRNA identification with Infernal
- genome-assembly/assembly-qc - Assess assembly quality before annotation
- genome-intervals/gtf-gff-handling - Parse and manipulate GFF3 output
