bio-genome-annotation-ncrna-annotation
$
npx mdskill add GPTomics/bioSkills/bio-genome-annotation-ncrna-annotationAnnotates non-coding RNAs in genome assemblies using Infernal and tRNAscan-SE
- Identifies tRNAs, rRNAs, snoRNAs, and regulatory RNAs in genome sequences
- Uses Infernal with Rfam covariance models and tRNAscan-SE for tRNA detection
- Scans assemblies for RNA families based on sequence and structural similarity
- Produces GFF-formatted annotations of detected non-coding RNA features
SKILL.md
.github/skills/bio-genome-annotation-ncrna-annotationView on GitHub ↗
---
name: bio-genome-annotation-ncrna-annotation
description: Identify non-coding RNAs including tRNAs, rRNAs, snoRNAs, and regulatory RNAs using Infernal covariance model searches against Rfam and tRNAscan-SE for tRNA prediction. Use when performing genome-wide ncRNA annotation with assembly input producing GFF output.
tool_type: cli
primary_tool: Infernal
---
## Version Compatibility
Reference examples tested with: pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Non-Coding RNA Annotation
**"Find non-coding RNAs in my genome"** → Scan a genome assembly for rRNAs, tRNAs, snoRNAs, and other ncRNA families using covariance model search and specialized detectors.
- CLI: `cmscan --rfam --tblout hits.tbl Rfam.cm assembly.fa` (Infernal), `tRNAscan-SE -o trnas.txt assembly.fa`
Identify and annotate non-coding RNAs in genome assemblies using Infernal (general ncRNAs via Rfam covariance models) and tRNAscan-SE (specialized tRNA detection).
## Infernal / cmscan
Infernal uses covariance models (CMs) from Rfam to identify ncRNA families by both sequence and secondary structure similarity.
### Rfam Database Setup
```bash
# Download Rfam covariance models (~600 MB compressed)
wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz
gunzip Rfam.cm.gz
# Press the CM database (required for cmscan)
cmpress Rfam.cm
# Download clan information (for overlap resolution)
wget https://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.clanin
```
### Basic cmscan
```bash
# Search genome against all Rfam families
# --cut_ga: Use Rfam gathering threshold (recommended, family-specific)
# --rfam: Run in Rfam mode (skip slow alignment for large models)
# --nohmmonly: Use CM scoring only (more accurate, slower)
# --fmt 2: Output in table format
cmscan \
--cut_ga \
--rfam \
--nohmmonly \
--tblout ncrna_hits.tbl \
--fmt 2 \
--cpu 16 \
--clanin Rfam.clanin \
Rfam.cm \
genome.fasta > ncrna_hits.out
```
### Key Options
| Option | Description |
|--------|-------------|
| `--cut_ga` | Use Rfam gathering threshold (recommended) |
| `--rfam` | Speed optimization for Rfam-scale searches |
| `--nohmmonly` | Force CM-only scoring (more sensitive for structured RNAs) |
| `--clanin` | Rfam clan file for resolving overlapping hits |
| `--tblout` | Tabular output file |
| `--fmt 2` | Extended table format with accessions |
| `--cpu` | CPU threads |
| `-E` | E-value threshold (default: 10; --cut_ga is preferred) |
### Convert cmscan Output to GFF3
```bash
# Using Infernal's built-in conversion
grep -v '^#' ncrna_hits.tbl | \
awk 'BEGIN{OFS="\t"} {
if ($10 == "+") strand = "+"; else strand = "-";
if ($10 == "+") {start=$8; end=$9} else {start=$9; end=$8};
print $4, "Infernal", "ncRNA", start, end, $17, strand, ".", "ID="$2";Name="$3";rfam_acc="$2";evalue="$17
}' > ncrna_infernal.gff3
```
### E-value Guidelines
| Category | Typical E-value | Notes |
|----------|----------------|-------|
| High confidence | < 1e-10 | Unambiguous family assignment |
| Moderate | 1e-10 to 1e-5 | Likely real, verify context |
| Rfam GA threshold | family-specific | Recommended cutoff with --cut_ga |
| Marginal | > 1e-3 | May be pseudogenes or degraded copies |
## tRNAscan-SE
Specialized tRNA detector using covariance models. More sensitive and specific for tRNAs than Infernal alone.
### Basic Usage
```bash
# Bacterial/archaeal mode
tRNAscan-SE -B -o trna_results.txt --gff trna.gff3 genome.fasta
# Eukaryotic mode
tRNAscan-SE -E -o trna_results.txt --gff trna.gff3 genome.fasta
# General mode (auto-detect domain)
tRNAscan-SE -G -o trna_results.txt --gff trna.gff3 genome.fasta
```
### Key Options
| Option | Description |
|--------|-------------|
| `-B` | Bacterial mode |
| `-A` | Archaeal mode |
| `-E` | Eukaryotic mode |
| `-G` | General (mixed/unknown) mode |
| `-o` | Tabular output |
| `--gff` | GFF3 output |
| `--detail` | Detailed output with isotype info |
| `--thread` | CPU threads |
| `-Q` | Covariance model scoring only (slower, more accurate) |
### Domain-Specific Models
| Domain | Flag | Notes |
|--------|------|-------|
| Bacteria | `-B` | Shorter introns, smaller genomes |
| Archaea | `-A` | Split tRNAs, bulge-helix-bulge introns |
| Eukaryota | `-E` | Longer introns, pseudogenes common |
| Organelle | `-O` | Mitochondrial/chloroplast tRNAs |
| Mitochondrial | `-M` | Mammalian mitochondrial-specific |
### Expected tRNA Counts
| Organism Type | Expected tRNAs | Notes |
|---------------|----------------|-------|
| Bacteria | 30-90 | Fewer isotypes, some shared |
| Archaea | 30-60 | Similar to bacteria |
| Yeast | 275-400 | Many isodecoders |
| Nematode | 600-900 | Gene family expansion |
| Mammal | 400-600 | Many pseudogenes |
| Plant | 500-1,000+ | Large gene families |
## barrnap (rRNA Detection)
barrnap predicts rRNA genes using HMM models. Note: barrnap has been unmaintained since 2018. Bakta handles rRNA detection internally for prokaryotes.
```bash
# Detect rRNAs
barrnap --kingdom bac genome.fasta > rrna.gff3
# Kingdoms: bac (bacteria), arc (archaea), euk (eukaryote), mito (mitochondria)
barrnap --kingdom euk --threads 4 genome.fasta > rrna.gff3
```
## Combining ncRNA Annotations
**Goal:** Merge Infernal and tRNAscan-SE results into a unified ncRNA annotation, using the best tool for each RNA type.
**Approach:** Parse Infernal tabular output and classify hits by Rfam family name into broad ncRNA categories, parse tRNAscan-SE GFF for tRNA calls, then combine by dropping Infernal tRNA hits (tRNAscan-SE is more accurate for tRNAs) and keeping Infernal for all other ncRNA types.
```python
import pandas as pd
from collections import defaultdict
def parse_infernal_tbl(tbl_file):
'''Parse Infernal cmscan tabular output.'''
hits = []
with open(tbl_file) as f:
for line in f:
if line.startswith('#'):
continue
parts = line.split()
if len(parts) < 18:
continue
strand = '+' if parts[9] == '+' else '-'
start, end = (int(parts[7]), int(parts[8])) if strand == '+' else (int(parts[8]), int(parts[7]))
hits.append({
'target': parts[0],
'rfam_acc': parts[1],
'rfam_name': parts[2],
'seqid': parts[3],
'start': start,
'end': end,
'strand': strand,
'evalue': float(parts[16]),
'score': float(parts[15]),
'rna_type': classify_rfam(parts[2]),
})
return pd.DataFrame(hits)
def classify_rfam(rfam_name):
'''Classify Rfam family into broad ncRNA categories.'''
name_lower = rfam_name.lower()
if 'rrna' in name_lower or 'ssu' in name_lower or 'lsu' in name_lower:
return 'rRNA'
if 'trna' in name_lower:
return 'tRNA'
if 'snorna' in name_lower or 'snord' in name_lower or 'snora' in name_lower:
return 'snoRNA'
if 'mir' in name_lower:
return 'miRNA'
if 'riboswitch' in name_lower or 'thermoregulator' in name_lower:
return 'riboswitch'
if 'ires' in name_lower or 'leader' in name_lower:
return 'cis-reg'
return 'other_ncRNA'
def parse_trnascan_gff(gff_file):
'''Parse tRNAscan-SE GFF3 output.'''
trnas = []
with open(gff_file) as f:
for line in f:
if line.startswith('#'):
continue
parts = line.strip().split('\t')
if len(parts) < 9 or parts[2] != 'tRNA':
continue
attrs = dict(item.split('=') for item in parts[8].split(';') if '=' in item)
trnas.append({
'seqid': parts[0],
'start': int(parts[3]),
'end': int(parts[4]),
'strand': parts[6],
'isotype': attrs.get('isotype', 'unknown'),
'anticodon': attrs.get('anticodon', 'unknown'),
'score': float(parts[5]) if parts[5] != '.' else 0,
'rna_type': 'tRNA',
})
return pd.DataFrame(trnas)
def combine_ncrna_annotations(infernal_tbl, trnascan_gff, output_gff):
'''Combine Infernal and tRNAscan-SE results, preferring tRNAscan for tRNAs.'''
infernal_df = parse_infernal_tbl(infernal_tbl)
trna_df = parse_trnascan_gff(trnascan_gff)
# Remove Infernal tRNA hits (tRNAscan-SE is more accurate for tRNAs)
infernal_no_trna = infernal_df[infernal_df['rna_type'] != 'tRNA']
summary = defaultdict(int)
for rna_type in infernal_no_trna['rna_type'].unique():
summary[rna_type] = len(infernal_no_trna[infernal_no_trna['rna_type'] == rna_type])
summary['tRNA'] = len(trna_df)
print('=== ncRNA Summary ===')
for rna_type, count in sorted(summary.items()):
print(f' {rna_type}: {count}')
print(f' Total: {sum(summary.values())}')
return infernal_no_trna, trna_df, summary
```
## Troubleshooting
### cmscan Is Slow
- Use `--rfam` flag for Rfam-mode optimizations
- Use `--FZ 2` to further speed up (less sensitive)
- Split genome into chunks and run in parallel
### Missing Expected ncRNAs
- Verify Rfam.cm is current version
- Check that cmpress was run successfully (look for .i1m, .i1p, .i1f, .i1i files)
- For mitochondrial/chloroplast ncRNAs, search organelle sequence separately
### tRNAscan-SE Finds Too Many Pseudogenes
- Normal for eukaryotic genomes (especially mammals)
- Filter by score: high-confidence tRNAs typically score > 50 bits
- Use `--detail` flag to identify pseudogenes in output
## Related Skills
- prokaryotic-annotation - Bakta includes ncRNA annotation
- eukaryotic-gene-prediction - Gene prediction excludes ncRNAs
- rna-structure/ncrna-search - Targeted ncRNA homology searches