bio-genome-annotation-functional-annotation
$
npx mdskill add GPTomics/bioSkills/bio-genome-annotation-functional-annotationAnnotate predicted proteins with functional data
- Assigns GO terms and KEGG orthologs to protein sequences
- Depends on eggNOG-mapper, InterProScan, and Pfam databases
- Decides annotation targets by matching sequence homology patterns
- Delivers structured functional summaries for downstream analysis
SKILL.md
.github/skills/bio-genome-annotation-functional-annotationView on GitHub ↗
---
name: bio-genome-annotation-functional-annotation
description: Assign GO terms, KEGG orthologs, Pfam domains, and EC numbers to predicted proteins using eggNOG-mapper and InterProScan. Produces functional summaries for downstream pathway and enrichment analysis. Use when adding functional annotation to predicted genes or characterizing protein functions in a new genome.
tool_type: cli
primary_tool: eggNOG-mapper
---
## Version Compatibility
Reference examples tested with: pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Functional Annotation
**"Functionally annotate my predicted proteins"** → Assign GO terms, KEGG orthologs, Pfam domains, and EC numbers to predicted protein sequences using orthology-based and domain-scan methods.
- CLI: `emapper.py -i proteins.fa --output annotations` (eggNOG-mapper), `interproscan.sh -i proteins.fa` (InterProScan)
Assign functional annotations (GO terms, KEGG orthologs, Pfam domains, EC numbers) to predicted protein sequences using eggNOG-mapper and InterProScan.
## eggNOG-mapper
### Database Setup
```bash
# Download eggNOG v5.0 database (~44 GB)
# Required for local searches; use --data_dir to specify location
download_eggnog_data.py --data_dir /path/to/eggnog_db -y
# Download DIAMOND database only (~9 GB, faster setup)
download_eggnog_data.py --data_dir /path/to/eggnog_db -y -D
# Download taxon-specific databases (optional, smaller)
download_eggnog_data.py --data_dir /path/to/eggnog_db -y -t 2 # Bacteria
download_eggnog_data.py --data_dir /path/to/eggnog_db -y -t 2759 # Eukaryota
```
### Basic Usage
```bash
emapper.py \
-i predicted_proteins.faa \
--output functional_annot \
--output_dir eggnog_out \
--data_dir /path/to/eggnog_db \
--cpu 16 \
-m diamond
```
### Key Options
| Option | Description |
|--------|-------------|
| `-i` | Input protein FASTA |
| `--output` | Output file prefix |
| `--data_dir` | Path to eggNOG database |
| `-m` | Search mode: diamond (fast), mmseqs (sensitive), hmmer |
| `--cpu` | CPU threads |
| `--tax_scope` | Taxonomic scope (auto, Bacteria, Eukaryota, etc.) |
| `--go_evidence` | GO evidence filter (experimental, non-electronic, all) |
| `--target_orthologs` | Ortholog type (one2one, all) |
| `--seed_ortholog_evalue` | E-value cutoff (default: 0.001) |
| `--seed_ortholog_score` | Min bit score (default: 60) |
| `--override` | Overwrite existing output |
### With Taxonomic Scope
```bash
# Restrict to bacterial orthologs for a prokaryotic genome
emapper.py \
-i proteins.faa \
--output annot \
--output_dir eggnog_out \
--data_dir /path/to/eggnog_db \
--cpu 16 \
-m diamond \
--tax_scope Bacteria \
--go_evidence non-electronic
```
### Output Files
```
eggnog_out/
├── annot.emapper.annotations # Main annotation table
├── annot.emapper.hits # DIAMOND/mmseqs hits
├── annot.emapper.seed_orthologs # Best orthologs
└── annot.emapper.pfam # Pfam domain annotations
```
### Key Output Columns
| Column | Content |
|--------|---------|
| seed_ortholog | Best matching ortholog |
| evalue | E-value of best hit |
| GOs | GO term annotations |
| EC | Enzyme Commission numbers |
| KEGG_ko | KEGG ortholog IDs |
| KEGG_Pathway | KEGG pathway mappings |
| COG_category | COG functional category |
| PFAMs | Pfam domain annotations |
| Description | Functional description |
## InterProScan
InterProScan searches multiple protein signature databases simultaneously.
### Basic Usage
```bash
interproscan.sh \
-i predicted_proteins.faa \
-o interpro_results.tsv \
-f tsv,gff3 \
-cpu 16 \
-goterms \
-pa
```
### Key Options
| Option | Description |
|--------|-------------|
| `-i` | Input protein FASTA |
| `-o` | Output file |
| `-f` | Output formats: tsv, gff3, xml, json |
| `-cpu` | CPU threads |
| `-goterms` | Include GO term mappings |
| `-pa` | Include pathway annotations |
| `-appl` | Specific applications to run (comma-separated) |
| `-dp` | Disable precalculated match lookup |
### Select Specific Databases
```bash
# Run only Pfam, TIGRFAM, and CDD
interproscan.sh \
-i proteins.faa \
-o interpro_results.tsv \
-f tsv,gff3 \
-cpu 16 \
-goterms -pa \
-appl Pfam,TIGRFAM,CDD
```
### Available Applications
| Application | Description |
|-------------|-------------|
| Pfam | Protein families |
| TIGRFAM | Functionally equivalent protein families |
| SUPERFAMILY | Structural domain assignments |
| CDD | Conserved Domain Database |
| PANTHER | Protein classification |
| Gene3D | Structural domain predictions |
| Coils | Coiled-coil predictions |
| MobiDBLite | Disordered regions |
| SignalP | Signal peptides |
| TMHMM | Transmembrane helices |
## Merging eggNOG and InterProScan Results
**Goal:** Combine functional annotations from eggNOG-mapper and InterProScan into a single per-protein table with unified GO terms.
**Approach:** Parse the eggNOG annotation table and InterProScan TSV output separately, aggregate InterProScan hits per protein, merge on protein ID, and deduplicate GO terms from both sources.
```python
import pandas as pd
def parse_eggnog(annotations_file):
'''Parse eggNOG-mapper annotations output.'''
df = pd.read_csv(annotations_file, sep='\t', comment='#',
header=None, skiprows=5)
col_names = [
'query', 'seed_ortholog', 'evalue', 'score', 'eggNOG_OGs',
'max_annot_lvl', 'COG_category', 'Description', 'Preferred_name',
'GOs', 'EC', 'KEGG_ko', 'KEGG_Pathway', 'KEGG_Module',
'KEGG_Reaction', 'KEGG_rclass', 'BRITE', 'KEGG_TC', 'CAZy',
'BiGG_Reaction', 'PFAMs'
]
df.columns = col_names[:len(df.columns)]
return df
def parse_interproscan_tsv(tsv_file):
'''Parse InterProScan TSV output.'''
col_names = [
'protein_id', 'md5', 'length', 'analysis', 'signature_acc',
'signature_desc', 'start', 'stop', 'score', 'status', 'date',
'interpro_acc', 'interpro_desc', 'go_terms', 'pathways'
]
df = pd.read_csv(tsv_file, sep='\t', header=None, names=col_names)
return df
def merge_annotations(eggnog_file, interpro_file):
'''Merge eggNOG and InterProScan annotations per protein.'''
eggnog_df = parse_eggnog(eggnog_file)
interpro_df = parse_interproscan_tsv(interpro_file)
interpro_summary = interpro_df.groupby('protein_id').agg({
'signature_acc': lambda x: ','.join(x.dropna().unique()),
'interpro_acc': lambda x: ','.join(x.dropna().unique()),
'go_terms': lambda x: '|'.join(x.dropna().unique()),
}).reset_index()
interpro_summary.columns = ['query', 'interpro_signatures', 'interpro_ids', 'interpro_go']
merged = eggnog_df.merge(interpro_summary, on='query', how='outer')
merged['all_go'] = merged.apply(
lambda row: combine_go_terms(row.get('GOs', ''), row.get('interpro_go', '')), axis=1
)
return merged
def combine_go_terms(eggnog_go, interpro_go):
'''Combine GO terms from both sources, removing duplicates.'''
terms = set()
for go_str in [eggnog_go, interpro_go]:
if pd.notna(go_str) and go_str != '-':
terms.update(t.strip() for t in str(go_str).replace('|', ',').split(',') if t.strip().startswith('GO:'))
return ','.join(sorted(terms)) if terms else '-'
```
## Annotation Statistics
```python
def annotation_summary(merged_df):
'''Summarize functional annotation coverage.'''
total = len(merged_df)
has_go = (merged_df['all_go'] != '-').sum()
has_kegg = merged_df['KEGG_ko'].notna().sum() if 'KEGG_ko' in merged_df else 0
has_pfam = merged_df['PFAMs'].notna().sum() if 'PFAMs' in merged_df else 0
has_ec = merged_df['EC'].notna().sum() if 'EC' in merged_df else 0
has_desc = (merged_df['Description'] != '-').sum() if 'Description' in merged_df else 0
print(f'Total proteins: {total}')
print(f'With GO terms: {has_go} ({has_go/total:.1%})')
print(f'With KEGG orthologs: {has_kegg} ({has_kegg/total:.1%})')
print(f'With Pfam domains: {has_pfam} ({has_pfam/total:.1%})')
print(f'With EC numbers: {has_ec} ({has_ec/total:.1%})')
print(f'With description: {has_desc} ({has_desc/total:.1%})')
# Annotation coverage target: >60% with at least one functional term
has_any = ((merged_df['all_go'] != '-') | merged_df['PFAMs'].notna() | merged_df['KEGG_ko'].notna()).sum()
print(f'With any annotation: {has_any} ({has_any/total:.1%})')
```
## Troubleshooting
### Low Annotation Rate
- Check protein sequence quality (no fragmented ORFs)
- Try broader taxonomic scope (--tax_scope auto)
- Run both eggNOG-mapper and InterProScan and merge results
### eggNOG Database Errors
- Verify database version matches emapper version
- Re-download with `download_eggnog_data.py --data_dir /path -y`
### InterProScan Memory Issues
- Reduce batch size with `-b` option
- Split input FASTA into smaller chunks
## Related Skills
- prokaryotic-annotation - Bakta includes basic functional annotation
- eukaryotic-gene-prediction - Produces protein sequences for functional annotation
- pathway-analysis/go-enrichment - Enrichment analysis using GO annotations
- pathway-analysis/kegg-pathways - Pathway mapping with KEGG orthologs