bio-genome-annotation-ncrna-annotation

Name: bio-genome-annotation-ncrna-annotation
Author: GPTomics/bioSkills

$npx mdskill add GPTomics/bioSkills/bio-genome-annotation-ncrna-annotation

Annotates non-coding RNAs in genome assemblies using Infernal and tRNAscan-SE

Identifies tRNAs, rRNAs, snoRNAs, and regulatory RNAs in genome sequences
Uses Infernal with Rfam covariance models and tRNAscan-SE for tRNA detection
Scans assemblies for RNA families based on sequence and structural similarity
Produces GFF-formatted annotations of detected non-coding RNA features

SKILL.md

.github/skills/bio-genome-annotation-ncrna-annotationView on GitHub ↗

---
name: bio-genome-annotation-ncrna-annotation
description: Identify non-coding RNAs including tRNAs, rRNAs, snoRNAs, and regulatory RNAs using Infernal covariance model searches against Rfam and tRNAscan-SE for tRNA prediction. Use when performing genome-wide ncRNA annotation with assembly input producing GFF output.
tool_type: cli
primary_tool: Infernal
---

## Version Compatibility

Reference examples tested with: pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Non-Coding RNA Annotation

**"Find non-coding RNAs in my genome"** → Scan a genome assembly for rRNAs, tRNAs, snoRNAs, and other ncRNA families using covariance model search and specialized detectors.
- CLI: `cmscan --rfam --tblout hits.tbl Rfam.cm assembly.fa` (Infernal), `tRNAscan-SE -o trnas.txt assembly.fa`

Identify and annotate non-coding RNAs in genome assemblies using Infernal (general ncRNAs via Rfam covariance models) and tRNAscan-SE (specialized tRNA detection).

## Infernal / cmscan

Infernal uses covariance models (CMs) from Rfam to identify ncRNA families by both sequence and secondary structure similarity.

### Rfam Database Setup

```bash
# Download Rfam covariance models (~600 MB compressed)
wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz
gunzip Rfam.cm.gz

# Press the CM database (required for cmscan)
cmpress Rfam.cm

# Download clan information (for overlap resolution)
wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.clanin
```

### Basic cmscan

```bash
# Search genome against all Rfam families
# --cut_ga: Use Rfam gathering threshold (recommended, family-specific)
# --rfam: Run in Rfam mode (skip slow alignment for large models)
# --nohmmonly: Use CM scoring only (more accurate, slower)
# --fmt 2: Output in table format
cmscan \
    --cut_ga \
    --rfam \
    --nohmmonly \
    --tblout ncrna_hits.tbl \
    --fmt 2 \
    --cpu 16 \
    --clanin Rfam.clanin \
    Rfam.cm \
    genome.fasta > ncrna_hits.out
```

### Key Options

| Option | Description |
|--------|-------------|
| `--cut_ga` | Use Rfam gathering threshold (recommended) |
| `--rfam` | Speed optimization for Rfam-scale searches |
| `--nohmmonly` | Force CM-only scoring (more sensitive for structured RNAs) |
| `--clanin` | Rfam clan file for resolving overlapping hits |
| `--tblout` | Tabular output file |
| `--fmt 2` | Extended table format with accessions |
| `--cpu` | CPU threads |
| `-E` | E-value threshold (default: 10; --cut_ga is preferred) |

### Convert cmscan Output to GFF3

```bash
# Using Infernal's built-in conversion
grep -v '^#' ncrna_hits.tbl | \
awk 'BEGIN{OFS="\t"} {
    if ($10 == "+") strand = "+"; else strand = "-";
    if ($10 == "+") {start=$8; end=$9} else {start=$9; end=$8};
    print $4, "Infernal", "ncRNA", start, end, $17, strand, ".", "ID="$2";Name="$3";rfam_acc="$2";evalue="$17
}' > ncrna_infernal.gff3
```

### E-value Guidelines

| Category | Typical E-value | Notes |
|----------|----------------|-------|
| High confidence | < 1e-10 | Unambiguous family assignment |
| Moderate | 1e-10 to 1e-5 | Likely real, verify context |
| Rfam GA threshold | family-specific | Recommended cutoff with --cut_ga |
| Marginal | > 1e-3 | May be pseudogenes or degraded copies |

## tRNAscan-SE

Specialized tRNA detector using covariance models. More sensitive and specific for tRNAs than Infernal alone.

### Basic Usage

```bash
# Bacterial/archaeal mode
tRNAscan-SE -B -o trna_results.txt --gff trna.gff3 genome.fasta

# Eukaryotic mode
tRNAscan-SE -E -o trna_results.txt --gff trna.gff3 genome.fasta

# General mode (auto-detect domain)
tRNAscan-SE -G -o trna_results.txt --gff trna.gff3 genome.fasta
```

### Key Options

| Option | Description |
|--------|-------------|
| `-B` | Bacterial mode |
| `-A` | Archaeal mode |
| `-E` | Eukaryotic mode |
| `-G` | General (mixed/unknown) mode |
| `-o` | Tabular output |
| `--gff` | GFF3 output |
| `--detail` | Detailed output with isotype info |
| `--thread` | CPU threads |
| `-Q` | Covariance model scoring only (slower, more accurate) |

### Domain-Specific Models

| Domain | Flag | Notes |
|--------|------|-------|
| Bacteria | `-B` | Shorter introns, smaller genomes |
| Archaea | `-A` | Split tRNAs, bulge-helix-bulge introns |
| Eukaryota | `-E` | Longer introns, pseudogenes common |
| Organelle | `-O` | Mitochondrial/chloroplast tRNAs |
| Mitochondrial | `-M` | Mammalian mitochondrial-specific |

### Expected tRNA Counts

| Organism Type | Expected tRNAs | Notes |
|---------------|----------------|-------|
| Bacteria | 30-90 | Fewer isotypes, some shared |
| Archaea | 30-60 | Similar to bacteria |
| Yeast | 275-400 | Many isodecoders |
| Nematode | 600-900 | Gene family expansion |
| Mammal | 400-600 | Many pseudogenes |
| Plant | 500-1,000+ | Large gene families |

## barrnap (rRNA Detection)

barrnap predicts rRNA genes using HMM models. Note: barrnap has been unmaintained since 2018. Bakta handles rRNA detection internally for prokaryotes.

```bash
# Detect rRNAs
barrnap --kingdom bac genome.fasta > rrna.gff3

# Kingdoms: bac (bacteria), arc (archaea), euk (eukaryote), mito (mitochondria)
barrnap --kingdom euk --threads 4 genome.fasta > rrna.gff3
```

## Combining ncRNA Annotations

**Goal:** Merge Infernal and tRNAscan-SE results into a unified ncRNA annotation, using the best tool for each RNA type.

**Approach:** Parse Infernal tabular output and classify hits by Rfam family name into broad ncRNA categories, parse tRNAscan-SE GFF for tRNA calls, then combine by dropping Infernal tRNA hits (tRNAscan-SE is more accurate for tRNAs) and keeping Infernal for all other ncRNA types.

```python
import pandas as pd
from collections import defaultdict

def parse_infernal_tbl(tbl_file):
    '''Parse Infernal cmscan tabular output.'''
    hits = []
    with open(tbl_file) as f:
        for line in f:
            if line.startswith('#'):
                continue
            parts = line.split()
            if len(parts) < 18:
                continue
            strand = '+' if parts[9] == '+' else '-'
            start, end = (int(parts[7]), int(parts[8])) if strand == '+' else (int(parts[8]), int(parts[7]))
            hits.append({
                'target': parts[0],
                'rfam_acc': parts[1],
                'rfam_name': parts[2],
                'seqid': parts[3],
                'start': start,
                'end': end,
                'strand': strand,
                'evalue': float(parts[16]),
                'score': float(parts[15]),
                'rna_type': classify_rfam(parts[2]),
            })
    return pd.DataFrame(hits)

def classify_rfam(rfam_name):
    '''Classify Rfam family into broad ncRNA categories.'''
    name_lower = rfam_name.lower()
    if 'rrna' in name_lower or 'ssu' in name_lower or 'lsu' in name_lower:
        return 'rRNA'
    if 'trna' in name_lower:
        return 'tRNA'
    if 'snorna' in name_lower or 'snord' in name_lower or 'snora' in name_lower:
        return 'snoRNA'
    if 'mir' in name_lower:
        return 'miRNA'
    if 'riboswitch' in name_lower or 'thermoregulator' in name_lower:
        return 'riboswitch'
    if 'ires' in name_lower or 'leader' in name_lower:
        return 'cis-reg'
    return 'other_ncRNA'

def parse_trnascan_gff(gff_file):
    '''Parse tRNAscan-SE GFF3 output.'''
    trnas = []
    with open(gff_file) as f:
        for line in f:
            if line.startswith('#'):
                continue
            parts = line.strip().split('\t')
            if len(parts) < 9 or parts[2] != 'tRNA':
                continue
            attrs = dict(item.split('=') for item in parts[8].split(';') if '=' in item)
            trnas.append({
                'seqid': parts[0],
                'start': int(parts[3]),
                'end': int(parts[4]),
                'strand': parts[6],
                'isotype': attrs.get('isotype', 'unknown'),
                'anticodon': attrs.get('anticodon', 'unknown'),
                'score': float(parts[5]) if parts[5] != '.' else 0,
                'rna_type': 'tRNA',
            })
    return pd.DataFrame(trnas)

def combine_ncrna_annotations(infernal_tbl, trnascan_gff, output_gff):
    '''Combine Infernal and tRNAscan-SE results, preferring tRNAscan for tRNAs.'''
    infernal_df = parse_infernal_tbl(infernal_tbl)
    trna_df = parse_trnascan_gff(trnascan_gff)

    # Remove Infernal tRNA hits (tRNAscan-SE is more accurate for tRNAs)
    infernal_no_trna = infernal_df[infernal_df['rna_type'] != 'tRNA']

    summary = defaultdict(int)
    for rna_type in infernal_no_trna['rna_type'].unique():
        summary[rna_type] = len(infernal_no_trna[infernal_no_trna['rna_type'] == rna_type])
    summary['tRNA'] = len(trna_df)

    print('=== ncRNA Summary ===')
    for rna_type, count in sorted(summary.items()):
        print(f'  {rna_type}: {count}')
    print(f'  Total: {sum(summary.values())}')

    return infernal_no_trna, trna_df, summary
```

## Troubleshooting

### cmscan Is Slow
- Use `--rfam` flag for Rfam-mode optimizations
- Use `--FZ 2` to further speed up (less sensitive)
- Split genome into chunks and run in parallel

### Missing Expected ncRNAs
- Verify Rfam.cm is current version
- Check that cmpress was run successfully (look for .i1m, .i1p, .i1f, .i1i files)
- For mitochondrial/chloroplast ncRNAs, search organelle sequence separately

### tRNAscan-SE Finds Too Many Pseudogenes
- Normal for eukaryotic genomes (especially mammals)
- Filter by score: high-confidence tRNAs typically score > 50 bits
- Use `--detail` flag to identify pseudogenes in output

## Related Skills

- prokaryotic-annotation - Bakta includes ncRNA annotation
- eukaryotic-gene-prediction - Gene prediction excludes ncRNAs
- rna-structure/ncrna-search - Targeted ncRNA homology searches

More from GPTomics/bioSkills

Skill	Description
bio-admet-prediction	Predicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
bio-alignment-amplicon-clipping	Trim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
bio-alignment-filtering	Filter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
bio-alignment-indexing	Create and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
bio-alignment-io	Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
bio-alignment-msa-parsing	Parse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
bio-alignment-msa-statistics	Calculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
bio-alignment-multiple	Perform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
bio-alignment-pairwise	Perform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
bio-alignment-sorting	Sort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.