bio-genome-annotation-prokaryotic-annotation
$
npx mdskill add GPTomics/bioSkills/bio-genome-annotation-prokaryotic-annotationAnnotate prokaryotic genomes with Bakta or Prokka for NCBI submission.
- Generates GFF3, GenBank, and FASTA files with NCBI-compatible locus tags.
- Depends on Bakta CLI for comprehensive annotation or Prokka for lightweight results.
- Selects tool based on user preference for depth versus speed.
- Delivers structured genomic data ready for database submission.
SKILL.md
.github/skills/bio-genome-annotation-prokaryotic-annotationView on GitHub ↗
---
name: bio-genome-annotation-prokaryotic-annotation
description: Annotate bacterial and archaeal genomes with Bakta for comprehensive structural and functional annotation, or Prokka for lightweight annotation. Generates GFF3, GenBank, and FASTA outputs with NCBI-compatible locus tags. Use when annotating a newly assembled prokaryotic genome or preparing annotations for NCBI submission.
tool_type: cli
primary_tool: Bakta
---
## Version Compatibility
Reference examples tested with: BUSCO 5.5+, scanpy 1.10+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Prokaryotic Genome Annotation
**"Annotate my bacterial genome"** → Predict and functionally annotate coding sequences, rRNAs, tRNAs, and other features in a prokaryotic genome assembly.
- CLI: `bakta --db db/ assembly.fa` (preferred), `prokka --outdir annot assembly.fa` (legacy)
Annotate prokaryotic genomes with Bakta (preferred) or Prokka (legacy). Bakta provides more comprehensive functional annotation through up-to-date databases and NCBI-compatible output formatting.
## Bakta
### Database Setup
```bash
# Download the full database (~30 GB, recommended for comprehensive annotation)
bakta_db download --output /path/to/bakta_db --type full
# Lightweight database (~1.5 GB, faster but less comprehensive)
bakta_db download --output /path/to/bakta_db --type light
# Update existing database
bakta_db update --db /path/to/bakta_db
```
### Basic Annotation
```bash
bakta \
--db /path/to/bakta_db \
--output bakta_out \
--prefix my_genome \
--locus-tag MYORG \
--threads 8 \
assembly.fasta
```
### Key Options
| Option | Description |
|--------|-------------|
| `--db` | Path to Bakta database |
| `--output` | Output directory |
| `--prefix` | Output file prefix |
| `--locus-tag` | NCBI-compatible locus tag prefix |
| `--genus` / `--species` | Organism taxonomy |
| `--strain` | Strain designation |
| `--complete` | Flag for complete genomes (enables oriC/oriV detection) |
| `--gram` | Gram type (+ or -) for signal peptide prediction |
| `--threads` | CPU threads |
| `--min-contig-length` | Minimum contig length to annotate (default: 1) |
| `--translation-table` | Genetic code (default: 11 for bacteria) |
### With Organism Metadata
```bash
bakta \
--db /path/to/bakta_db \
--output bakta_out \
--prefix ecoli_k12 \
--locus-tag ECK12 \
--genus Escherichia --species coli --strain K-12 \
--gram - \
--complete \
--threads 16 \
assembly.fasta
```
### Output Files
```
bakta_out/
├── my_genome.gff3 # GFF3 annotation (primary output)
├── my_genome.gbff # GenBank format
├── my_genome.ffn # Nucleotide CDS sequences
├── my_genome.faa # Protein sequences
├── my_genome.fna # Annotated genome sequence
├── my_genome.embl # EMBL format
├── my_genome.tsv # Tab-separated feature table
├── my_genome.json # Machine-readable JSON
└── my_genome.txt # Summary statistics
```
## Prokka (Legacy Alternative)
Prokka is lighter weight and faster but uses older databases. Prefer Bakta for new projects.
```bash
prokka \
--outdir prokka_out \
--prefix my_genome \
--locustag MYORG \
--genus Escherichia --species coli \
--cpus 8 \
--rfam \
assembly.fasta
```
### Prokka vs Bakta
| Feature | Bakta | Prokka |
|---------|-------|--------|
| Database updates | Active (2024+) | Unmaintained since 2021 |
| Functional annotation | Comprehensive (UniProt, COG, Pfam) | Basic (UniProt) |
| ncRNA detection | Infernal + Rfam 14.x | Infernal + Rfam 12.x |
| NCBI compatibility | Full SQN output | Requires tbl2asn |
| Speed | Moderate | Fast |
## Parsing Annotations with Python
**Goal:** Load Bakta/Prokka GFF3 output into a queryable database to extract CDS features and compute annotation quality metrics like coding density.
**Approach:** Create a gffutils in-memory database from the GFF3 file, iterate CDS features to extract locus tags and product names, and calculate coding density as total CDS bp divided by genome length.
```python
import gffutils
def load_annotation(gff_file):
'''Load GFF3 into a queryable database.'''
db = gffutils.create_db(gff_file, ':memory:', merge_strategy='merge')
return db
def extract_cds_features(db):
'''Extract all CDS features with product annotations.'''
features = []
for cds in db.features_of_type('CDS'):
features.append({
'id': cds.id,
'seqid': cds.seqid,
'start': cds.start,
'end': cds.end,
'strand': cds.strand,
'product': cds.attributes.get('product', ['unknown'])[0],
'locus_tag': cds.attributes.get('locus_tag', [''])[0]
})
return features
def compute_coding_density(db, genome_length):
'''Compute fraction of genome encoding proteins.
Typical prokaryotic coding density: 85-95%.
Values below 80% may indicate pseudogenes or annotation gaps.
Values above 95% may indicate overlapping annotations.
'''
coding_bp = sum(cds.end - cds.start + 1 for cds in db.features_of_type('CDS'))
return coding_bp / genome_length
db = load_annotation('bakta_out/my_genome.gff3')
cds_features = extract_cds_features(db)
print(f'Total CDSs: {len(cds_features)}')
```
## Annotation QC
### Expected Metrics by Genome Size
| Genome Size | Expected Genes | Coding Density |
|-------------|---------------|----------------|
| 1-2 Mb | 900-2,000 | 85-92% |
| 2-5 Mb | 1,800-5,000 | 85-90% |
| 5-10 Mb | 4,500-9,000 | 82-88% |
### QC Checks
```bash
# Count annotated features
grep -c $'\tCDS\t' bakta_out/my_genome.gff3
grep -c $'\ttRNA\t' bakta_out/my_genome.gff3
grep -c $'\trRNA\t' bakta_out/my_genome.gff3
# Check for hypothetical proteins (ideally <40% of total CDSs)
grep -c 'hypothetical protein' bakta_out/my_genome.tsv
```
### BUSCO on Predicted Proteins
```bash
busco -i bakta_out/my_genome.faa -m proteins -l bacteria_odb10 -o busco_proteins
```
## Troubleshooting
### Low Gene Count
- Check assembly completeness with BUSCO (genome mode)
- Verify correct translation table (--translation-table 4 for Mycoplasma)
- Inspect minimum contig length filter
### Many Hypothetical Proteins
- Normal for novel organisms (30-50% is common)
- Try running InterProScan on the .faa file for additional annotations
- Consider eggNOG-mapper for orthology-based functional assignment
### NCBI Submission
- Use `--compliant` flag for NCBI-ready output
- Ensure locus tags follow NCBI format (3-12 uppercase alphanumeric)
- Review .tsv output for annotation warnings
## Related Skills
- functional-annotation - Add GO/KEGG/Pfam to predicted proteins
- ncrna-annotation - Detailed ncRNA identification with Infernal
- genome-assembly/assembly-qc - Assess assembly quality before annotation
- genome-intervals/gtf-gff-handling - Parse and manipulate GFF3 output