bio-genome-annotation-functional-annotation

Name: bio-genome-annotation-functional-annotation
Author: GPTomics/bioSkills

$ npx mdskill add GPTomics/bioSkills/bio-genome-annotation-functional-annotation

Annotate predicted proteins with functional data

Assigns GO terms and KEGG orthologs to protein sequences
Depends on eggNOG-mapper, InterProScan, and Pfam databases
Decides annotation targets by matching sequence homology patterns
Delivers structured functional summaries for downstream analysis

SKILL.md

.github/skills/bio-genome-annotation-functional-annotation

---
name: bio-genome-annotation-functional-annotation
description: Assign GO terms, KEGG orthologs, Pfam domains, and EC numbers to predicted proteins using eggNOG-mapper and InterProScan. Produces functional summaries for downstream pathway and enrichment analysis. Use when adding functional annotation to predicted genes or characterizing protein functions in a new genome.
tool_type: cli
primary_tool: eggNOG-mapper
---

## Version Compatibility

Reference examples tested with: pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
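
The version checks above can also be scripted. The following is a minimal sketch, not part of the original skill: the helper names `package_version` and `cli_version` are hypothetical, and it assumes each CLI tool accepts a `--version` flag as described above.

```python
import shutil
import subprocess
from importlib import metadata


def package_version(name):
    """Return the installed version of a Python package, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def cli_version(executable, flag="--version"):
    """Return the first line a CLI tool prints for its version flag, or None."""
    if shutil.which(executable) is None:
        return None  # tool is not on PATH
    result = subprocess.run([executable, flag], capture_output=True,
                            text=True, check=False)
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    print("pandas:", package_version("pandas"))
    for tool in ("emapper.py", "interproscan.sh"):
        print(f"{tool}:", cli_version(tool))
```

A `None` result means the package or tool was not found, which is worth resolving before running any of the examples below.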

# Functional Annotation

**"Functionally annotate my predicted proteins"** → Assign GO terms, KEGG orthologs, Pfam domains, and EC numbers to predicted protein sequences using orthology-based and domain-scan methods.
- CLI: `emapper.py -i proteins.fa --output annotations` (eggNOG-mapper), `interproscan.sh -i proteins.fa` (InterProScan)

Assign functional annotations (GO terms, KEGG orthologs, Pfam domains, EC numbers) to predicted protein sequences using eggNOG-mapper and InterProScan.

## eggNOG-mapper

### Database Setup

```bash
# Download eggNOG v5.0 database (~44 GB)
# Required for local searches; use --data_dir to specify location
download_eggnog_data.py --data_dir /path/to/eggnog_db -y

# Skip the DIAMOND database (~9 GB) when searching with another mode;
# -D means "do not install the DIAMOND database"
download_eggnog_data.py --data_dir /path/to/eggnog_db -y -D

# Download taxon-specific HMMER databases for -m hmmer (optional, smaller)
# Uses -H with -d TAXID; confirm with download_eggnog_data.py --help
download_eggnog_data.py --data_dir /path/to/eggnog_db -y -H -d 2     # Bacteria
download_eggnog_data.py --data_dir /path/to/eggnog_db -y -H -d 2759  # Eukaryota
```

### Basic Usage

```bash
emapper.py \
    -i predicted_proteins.faa \
    --output functional_annot \
    --output_dir eggnog_out \
    --data_dir /path/to/eggnog_db \
    --cpu 16 \
    -m diamond
```

### Key Options

| Option | Description |
|--------|-------------|
| `-i` | Input protein FASTA |
| `--output` | Output file prefix |
| `--data_dir` | Path to eggNOG database |
| `-m` | Search mode: diamond (fast), mmseqs (sensitive), hmmer |
| `--cpu` | CPU threads |
| `--tax_scope` | Taxonomic scope (auto, Bacteria, Eukaryota, etc.) |
| `--go_evidence` | GO evidence filter (experimental, non-electronic, all) |
| `--target_orthologs` | Ortholog type used for annotation transfer (one2one, many2one, one2many, many2many, all) |
| `--seed_ortholog_evalue` | E-value cutoff (default: 0.001) |
| `--seed_ortholog_score` | Min bit score (default: 60) |
| `--override` | Overwrite existing output |

### With Taxonomic Scope

```bash
# Restrict to bacterial orthologs for a prokaryotic genome
emapper.py \
    -i proteins.faa \
    --output annot \
    --output_dir eggnog_out \
    --data_dir /path/to/eggnog_db \
    --cpu 16 \
    -m diamond \
    --tax_scope Bacteria \
    --go_evidence non-electronic
```
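
When emapper runs are launched from a Python pipeline, the options above can be assembled programmatically. This is a sketch under stated assumptions: `build_emapper_command` is a hypothetical helper, and it only uses flags that appear in the Key Options table; verify them against `emapper.py --help` for the installed version.

```python
import shlex


def build_emapper_command(fasta, prefix, out_dir, data_dir, cpu=16,
                          mode="diamond", tax_scope=None, go_evidence=None,
                          override=False):
    """Assemble an emapper.py argument list from the options in this guide."""
    cmd = [
        "emapper.py",
        "-i", fasta,
        "--output", prefix,
        "--output_dir", out_dir,
        "--data_dir", data_dir,
        "--cpu", str(cpu),
        "-m", mode,
    ]
    # Optional flags are appended only when explicitly requested
    if tax_scope:
        cmd += ["--tax_scope", tax_scope]
    if go_evidence:
        cmd += ["--go_evidence", go_evidence]
    if override:
        cmd.append("--override")
    return cmd


cmd = build_emapper_command("proteins.faa", "annot", "eggnog_out",
                            "/path/to/eggnog_db", tax_scope="Bacteria",
                            go_evidence="non-electronic")
print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to execute
```

Keeping the command as a list avoids shell-quoting problems when paths contain spaces.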

### Output Files

```
eggnog_out/
├── annot.emapper.annotations    # Main annotation table
├── annot.emapper.hits           # DIAMOND/mmseqs hits
├── annot.emapper.seed_orthologs # Best orthologs
└── annot.emapper.pfam           # Pfam domain annotations (only with --pfam_realign)
```

### Key Output Columns

| Column | Content |
|--------|---------|
| seed_ortholog | Best matching ortholog |
| evalue | E-value of best hit |
| GOs | GO term annotations |
| EC | Enzyme Commission numbers |
| KEGG_ko | KEGG ortholog IDs |
| KEGG_Pathway | KEGG pathway mappings |
| COG_category | COG functional category |
| PFAMs | Pfam domain annotations |
| Description | Functional description |
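
Several of these columns hold more than one value per protein. The sketch below assumes multi-valued fields are comma-separated, that missing values are written as `-`, and that KEGG orthologs carry a `ko:` prefix; the helper name `split_field` and the example row are made up for illustration.

```python
def split_field(value, prefix=""):
    """Split a comma-separated annotation field into a list ('-' means empty)."""
    if value is None or value.strip() in ("", "-"):
        return []
    items = [item.strip() for item in value.split(",") if item.strip()]
    # Optionally drop a namespace prefix such as "ko:" from each entry
    return [item[len(prefix):] if prefix and item.startswith(prefix) else item
            for item in items]


# Hypothetical row, shaped like one line of the annotation table
row = {"KEGG_ko": "ko:K00001,ko:K13953", "EC": "1.1.1.1", "GOs": "-"}

print(split_field(row["KEGG_ko"], prefix="ko:"))  # ['K00001', 'K13953']
print(split_field(row["EC"]))                     # ['1.1.1.1']
print(split_field(row["GOs"]))                    # []
```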

## InterProScan

InterProScan searches multiple protein signature databases simultaneously.

### Basic Usage

```bash
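# -b sets the output base name; InterProScan appends .tsv and .gff3.
# -o names one explicit file and is valid only with a single -f format.
interproscan.sh \
    -i predicted_proteins.faa \
    -b interpro_results \
    -f tsv,gff3 \
    -cpu 16 \
    -goterms \
    -pa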
```

### Key Options

| Option | Description |
|--------|-------------|
| `-i` | Input protein FASTA |
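| `-o` | Explicit output file (valid only with a single `-f` format) |
| `-b` | Output file base name; format extensions are appended |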
| `-f` | Output formats: tsv, gff3, xml, json |
| `-cpu` | CPU threads |
| `-goterms` | Include GO term mappings |
| `-pa` | Include pathway annotations |
| `-appl` | Specific applications to run (comma-separated) |
| `-dp` | Disable precalculated match lookup |

### Select Specific Databases

```bash
# Run only Pfam, TIGRFAM, and CDD
# -b is used instead of -o because two output formats are requested
interproscan.sh \
    -i proteins.faa \
    -b interpro_results \
    -f tsv,gff3 \
    -cpu 16 \
    -goterms -pa \
    -appl Pfam,TIGRFAM,CDD
```

### Available Applications

| Application | Description |
|-------------|-------------|
| Pfam | Protein families |
| TIGRFAM | Functionally equivalent protein families |
| SUPERFAMILY | Structural domain assignments |
| CDD | Conserved Domain Database |
| PANTHER | Protein classification |
| Gene3D | Structural domain predictions |
| Coils | Coiled-coil predictions |
| MobiDBLite | Disordered regions |
| SignalP | Signal peptides (licensed separately; not bundled by default) |
| TMHMM | Transmembrane helices (licensed separately; not bundled by default) |

## Merging eggNOG and InterProScan Results

**Goal:** Combine functional annotations from eggNOG-mapper and InterProScan into a single per-protein table with unified GO terms.

**Approach:** Parse the eggNOG annotation table and InterProScan TSV output separately, aggregate InterProScan hits per protein, merge on protein ID, and deduplicate GO terms from both sources.

```python
import pandas as pd

def parse_eggnog(annotations_file):
    '''Parse eggNOG-mapper annotations output.'''
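    # comment='#' already skips the '##' metadata lines, the '#query' header
    # line, and the trailing '##' footer; a fixed skiprows count could drop
    # data rows when the number of metadata lines differs between versions
    df = pd.read_csv(annotations_file, sep='\t', comment='#', header=None)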
    col_names = [
        'query', 'seed_ortholog', 'evalue', 'score', 'eggNOG_OGs',
        'max_annot_lvl', 'COG_category', 'Description', 'Preferred_name',
        'GOs', 'EC', 'KEGG_ko', 'KEGG_Pathway', 'KEGG_Module',
        'KEGG_Reaction', 'KEGG_rclass', 'BRITE', 'KEGG_TC', 'CAZy',
        'BiGG_Reaction', 'PFAMs'
    ]
    df.columns = col_names[:len(df.columns)]
    return df

def parse_interproscan_tsv(tsv_file):
    '''Parse InterProScan TSV output.'''
    col_names = [
        'protein_id', 'md5', 'length', 'analysis', 'signature_acc',
        'signature_desc', 'start', 'stop', 'score', 'status', 'date',
        'interpro_acc', 'interpro_desc', 'go_terms', 'pathways'
    ]
    df = pd.read_csv(tsv_file, sep='\t', header=None, names=col_names)
    return df

def merge_annotations(eggnog_file, interpro_file):
    '''Merge eggNOG and InterProScan annotations per protein.'''
    eggnog_df = parse_eggnog(eggnog_file)
    interpro_df = parse_interproscan_tsv(interpro_file)

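    def join_unique(values, sep):
        '''Join unique non-missing values, ignoring the "-" placeholder.'''
        return sep.join(v for v in values.dropna().astype(str).unique() if v != '-')

    # Collapse the one-row-per-hit InterProScan table to one row per protein
    interpro_summary = interpro_df.groupby('protein_id').agg({
        'signature_acc': lambda x: join_unique(x, ','),
        'interpro_acc': lambda x: join_unique(x, ','),
        'go_terms': lambda x: join_unique(x, '|'),
    }).reset_index()
    interpro_summary.columns = ['query', 'interpro_signatures', 'interpro_ids', 'interpro_go']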

    merged = eggnog_df.merge(interpro_summary, on='query', how='outer')

    merged['all_go'] = merged.apply(
        lambda row: combine_go_terms(row.get('GOs', ''), row.get('interpro_go', '')), axis=1
    )
    return merged

def combine_go_terms(eggnog_go, interpro_go):
    '''Combine GO terms from both sources, removing duplicates.'''
    terms = set()
    for go_str in [eggnog_go, interpro_go]:
        if pd.notna(go_str) and go_str != '-':
            for token in str(go_str).replace('|', ',').split(','):
                # Some InterProScan releases append a source label,
                # e.g. 'GO:0005515(InterPro)'; keep only the GO accession
                term = token.strip().split('(')[0]
                if term.startswith('GO:'):
                    terms.add(term)
    return ','.join(sorted(terms)) if terms else '-'
```
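
Downstream enrichment tools usually expect one protein-to-term pair per line rather than a comma-separated list. The sketch below shows one way to reshape the combined GO column; it assumes rows shaped like the merged table (keys `query` and `all_go`, with `-` for no terms), and the helper name `to_gene2go` is hypothetical.

```python
import csv
import io


def to_gene2go(rows, id_key="query", go_key="all_go"):
    """Yield (protein_id, go_term) pairs, one pair per GO term."""
    for row in rows:
        value = row.get(go_key)
        if not isinstance(value, str):
            continue  # skip missing values such as None or NaN
        for term in value.split(","):
            term = term.strip()
            if term.startswith("GO:"):
                yield row[id_key], term


# Made-up rows; with pandas, merged.to_dict("records") has the same shape
rows = [
    {"query": "prot1", "all_go": "GO:0003677,GO:0005515"},
    {"query": "prot2", "all_go": "-"},
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(to_gene2go(rows))
print(buffer.getvalue(), end="")
```

Proteins without GO terms simply produce no lines, which is the convention most gene-to-GO mapping files follow.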

## Annotation Statistics

```python
import pandas as pd

def annotation_summary(merged_df):
    '''Summarize functional annotation coverage.'''
    total = len(merged_df)
    if total == 0:
        print('Total proteins: 0')
        return

    def has_value(column):
        '''True where a column holds a real annotation (not NaN, empty, or "-").'''
        if column not in merged_df:
            return pd.Series(False, index=merged_df.index)
        values = merged_df[column]
        # eggNOG-mapper writes "-" for missing values, so notna() alone overcounts
        return values.notna() & ~values.astype(str).str.strip().isin(['', '-'])

    has_go = has_value('all_go')
    has_kegg = has_value('KEGG_ko')
    has_pfam = has_value('PFAMs')
    has_ec = has_value('EC')
    has_desc = has_value('Description')

    print(f'Total proteins: {total}')
    print(f'With GO terms: {has_go.sum()} ({has_go.sum()/total:.1%})')
    print(f'With KEGG orthologs: {has_kegg.sum()} ({has_kegg.sum()/total:.1%})')
    print(f'With Pfam domains: {has_pfam.sum()} ({has_pfam.sum()/total:.1%})')
    print(f'With EC numbers: {has_ec.sum()} ({has_ec.sum()/total:.1%})')
    print(f'With description: {has_desc.sum()} ({has_desc.sum()/total:.1%})')

    # Annotation coverage target: >60% with at least one functional term
    has_any = (has_go | has_kegg | has_pfam).sum()
    print(f'With any annotation: {has_any} ({has_any/total:.1%})')
```
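
Coverage can also be broken down by COG functional category. The following sketch assumes a `COG_category` value may contain several single-letter codes (for example `KL`), each of which is counted once per protein; the helper name `cog_category_counts` and the example values are illustrative only.

```python
from collections import Counter


def cog_category_counts(categories):
    """Count proteins per single-letter COG category, skipping '-' and missing."""
    counts = Counter()
    for value in categories:
        if not isinstance(value, str) or value.strip() in ("", "-"):
            continue
        # A value such as "KL" assigns the protein to both K and L
        counts.update(ch for ch in value.strip() if ch.isalpha())
    return counts


# Made-up values; with pandas, pass merged_df['COG_category'] instead
example = ["K", "KL", "-", None, "S"]
counts = cog_category_counts(example)
for letter, n in counts.most_common():
    print(f"{letter}\t{n}")
```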

## Troubleshooting

### Low Annotation Rate
- Check protein sequence quality (no fragmented ORFs)
- Try a broader taxonomic scope (`--tax_scope auto`) if a narrow scope such as `Bacteria` was set
- Run both eggNOG-mapper and InterProScan and merge results
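
A quick screen of the input FASTA can help with the first check. This is a heuristic sketch, not a validated quality filter: the function names, the 50-residue threshold, and the three flags (no leading methionine, internal stop codon, short length) are assumptions chosen for illustration.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def flag_suspect_proteins(records, min_length=50):
    """Return {name: [reasons]} for sequences that look fragmented."""
    flagged = {}
    for name, seq in records:
        body = seq.upper().rstrip("*")  # a trailing stop symbol is normal
        reasons = []
        if not body.startswith("M"):
            reasons.append("no_start_methionine")
        if "*" in body:
            reasons.append("internal_stop")
        if len(body) < min_length:
            reasons.append("short")
        if reasons:
            flagged[name] = reasons
    return flagged


# Made-up sequences standing in for open("predicted_proteins.faa")
demo = io.StringIO(
    ">ok\nM" + "A" * 60 + "*\n"
    ">fragment\nAAAK\n"
    ">readthrough\nMAA*AA" + "K" * 60 + "\n"
)
flagged = flag_suspect_proteins(read_fasta(demo))
print(flagged)
```

A high fraction of flagged sequences points to a gene-prediction problem rather than a database problem.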

### eggNOG Database Errors
- Verify database version matches emapper version
- Re-download with `download_eggnog_data.py --data_dir /path -y`

### InterProScan Memory Issues
- Lower `-cpu` so fewer analyses run concurrently (`-b` sets the output base name, not a batch size)
- Split input FASTA into smaller chunks
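
Splitting the input can be done with a few lines of Python. The sketch below writes a fixed number of records per chunk; `split_fasta`, the `chunk_NNNN.faa` naming, and the default of 1000 records are assumptions, and the demo uses a temporary file in place of a real protein FASTA.

```python
import tempfile
from pathlib import Path


def split_fasta(fasta_path, out_dir, records_per_chunk=1000):
    """Write consecutive groups of FASTA records to numbered chunk files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths, out, count = [], None, 0
    try:
        with open(fasta_path) as handle:
            for line in handle:
                if line.startswith(">"):
                    # Start a new chunk file every records_per_chunk headers
                    if count % records_per_chunk == 0:
                        if out is not None:
                            out.close()
                        path = out_dir / f"chunk_{len(chunk_paths) + 1:04d}.faa"
                        out = open(path, "w")
                        chunk_paths.append(path)
                    count += 1
                if out is not None:
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return chunk_paths


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "proteins.faa"
    src.write_text("".join(f">p{i}\nMKV\n" for i in range(5)))
    chunks = split_fasta(src, Path(tmp) / "chunks", records_per_chunk=2)
    chunk_names = [p.name for p in chunks]
    records_in_last = chunks[-1].read_text().count(">")

print(chunk_names, records_in_last)
```

Each chunk can then be passed to InterProScan separately and the resulting TSV files concatenated, since the TSV output has no header line.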

## Related Skills

- prokaryotic-annotation - Bakta includes basic functional annotation
- eukaryotic-gene-prediction - Produces protein sequences for functional annotation
- pathway-analysis/go-enrichment - Enrichment analysis using GO annotations
- pathway-analysis/kegg-pathways - Pathway mapping with KEGG orthologs
