bio-genome-annotation-repeat-annotation
$ npx mdskill add GPTomics/bioSkills/bio-genome-annotation-repeat-annotation

Annotate and mask repetitive elements to enable accurate gene prediction.
- Enables de novo repeat library construction and genome-wide masking.
- Depends on RepeatModeler, RepeatMasker, and TEtranscripts tools.
- Applies when repeats must be masked before gene prediction or when TE activity is analyzed.
- Delivers softmasked genome assemblies and TE expression quantification.
---
name: bio-genome-annotation-repeat-annotation
description: Identify and classify repetitive elements and transposable elements using RepeatModeler for de novo repeat library construction and RepeatMasker for genome-wide repeat annotation. Quantify TE expression from RNA-seq with TEtranscripts. Use when masking repeats before gene prediction or analyzing transposable element activity.
tool_type: cli
primary_tool: RepeatMasker
---
## Version Compatibility
Reference examples tested with: DESeq2 1.42+, STAR 2.7.11+, matplotlib 3.8+, pandas 2.2+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags
If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.
# Repeat and Transposable Element Annotation
**"Mask repeats in my genome assembly"** → Build a de novo repeat library and annotate/softmask repetitive elements as a prerequisite for gene prediction.
- CLI: `RepeatModeler -database mydb` (library), `RepeatMasker -lib custom-lib.fa -xsmall assembly.fa` (masking)
Identify, classify, and mask repetitive elements using RepeatModeler (de novo library construction) and RepeatMasker (genome-wide annotation). Softmasked output is a prerequisite for eukaryotic gene prediction.
## RepeatModeler (De Novo Library)
RepeatModeler builds a species-specific repeat library by detecting repetitive elements de novo from the assembly.
### Build Database and Run
```bash
# Build RepeatModeler database
BuildDatabase -name my_genome -engine ncbi assembly.fasta
# Run RepeatModeler (this takes hours to days depending on genome size)
# -LTRStruct enables LTR structural detection (recommended)
RepeatModeler -database my_genome -pa 16 -LTRStruct
```
### Key Options
| Option | Description |
|--------|-------------|
| `-database` | Database name from BuildDatabase |
| `-pa` | Parallel search jobs (newer RepeatModeler releases replace this with `-threads`; confirm with `RepeatModeler -help`) |
| `-LTRStruct` | Enable LTR structural detection pipeline |
| `-engine` | Search engine: ncbi (RMBLAST) or abblast |
### Output
```
my_genome-families.fa # Consensus repeat library
my_genome-families.stk # Stockholm alignments
RM_*/ # Working directory with intermediate files
```
The output `*-families.fa` is the repeat library used by RepeatMasker.
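Before passing the library to RepeatMasker, it can help to see how many families were classified. The sketch below counts consensus sequences per classification. It assumes the usual RepeatModeler header layout, `>rnd-1_family-12#LTR/Gypsy ( ... )`, where the classification follows `#`; the function name `count_families_by_class` and the `Unclassified` label are illustrative, not part of RepeatModeler.

```python
from collections import Counter


def count_families_by_class(fasta_lines):
    """Count consensus sequences per classification in a repeat library.

    Assumption: each header looks like '>name#Class/Family optional text',
    so the classification is the text between '#' and the first whitespace.
    """
    counts = Counter()
    for line in fasta_lines:
        if not line.startswith('>'):
            continue
        fields = line[1:].split()
        name = fields[0] if fields else ''
        # Headers without '#' carry no classification
        cls = name.split('#', 1)[1] if '#' in name else 'Unclassified'
        counts[cls] += 1
    return counts


# In-memory example; for a real library pass an open file handle instead,
# e.g. count_families_by_class(open('my_genome-families.fa'))
example = [
    '>rnd-1_family-0#LTR/Gypsy ( RepeatScout Family Size = 120 )',
    'ACGTACGT',
    '>rnd-1_family-1#Unknown',
    'ACGTACGT',
]
for cls, n in count_families_by_class(example).most_common():
    print(f'{cls}\t{n}')
```

A large share of `Unknown` families is common for non-model species and does not prevent masking; it only limits how well the masked repeats can be broken down by class.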
## RepeatMasker (Genome-Wide Annotation)
### With De Novo Library
```bash
# Use species-specific de novo library (recommended)
RepeatMasker \
-lib my_genome-families.fa \
-pa 16 \
-xsmall \
-gff \
-dir repeatmasker_out \
assembly.fasta
```
### With Dfam/RepBase Library
```bash
# Use Dfam curated library for a known species
RepeatMasker \
-species "Homo sapiens" \
-pa 16 \
-xsmall \
-gff \
-dir repeatmasker_out \
assembly.fasta
```
### Combined Library (De Novo + Known)
```bash
# Combine de novo and curated libraries for best results
cat my_genome-families.fa known_repeats.fa > combined_lib.fa
RepeatMasker \
-lib combined_lib.fa \
-pa 16 \
-xsmall \
-gff \
-dir repeatmasker_out \
assembly.fasta
```
### Key Options
| Option | Description |
|--------|-------------|
| `-lib` | Custom repeat library FASTA |
| `-species` | Species name (uses Dfam database) |
| `-pa` | Parallel batch jobs (with RMBlast each job uses several cores, so total CPU use can exceed this number) |
| `-xsmall` | Softmask output (lowercase repeats, required for gene prediction) |
| `-gff` | Generate GFF output |
| `-dir` | Output directory |
| `-nolow` | Skip low-complexity masking |
| `-noint` | Skip interspersed repeats |
| `-e` | Search engine: crossmatch, ncbi, hmmer, abblast |
| `-s` | Slow/sensitive search |
| `-q` | Quick search (5-10% less sensitive) |
### Output Files
```
repeatmasker_out/
├── assembly.fasta.masked # Masked genome (softmasked with -xsmall; otherwise N's replace repeats)
├── assembly.fasta.out # Detailed repeat annotation table
├── assembly.fasta.tbl # Summary statistics table
├── assembly.fasta.out.gff # GFF annotation of repeats
└── assembly.fasta.cat.gz # Search result alignments
```
### Softmasking for Gene Prediction
The `-xsmall` flag produces softmasked output where repeats are lowercase. This is the required input format for BRAKER3 and most gene prediction tools.
```bash
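# RepeatMasker does not overwrite the input; it writes assembly.fasta.masked
# next to it (or into the -dir directory when -dir is given)
# Optional: keep a backup copy of the original anyway
cp assembly.fasta assembly_original.fasta
RepeatMasker -lib my_genome-families.fa -pa 16 -xsmall assembly.fasta
# With -xsmall, assembly.fasta.masked is the softmasked output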
mv assembly.fasta.masked assembly_softmasked.fasta
```
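A quick way to confirm that the file really is softmasked is to measure the fraction of lowercase bases. The following is a minimal sketch; the function name `softmask_fraction` is illustrative, and it treats every lowercase character on a sequence line (including lowercase `n`) as masked.

```python
def softmask_fraction(fasta_lines):
    """Return (lowercase_bases, total_bases) across all sequence lines."""
    lower = 0
    total = 0
    for line in fasta_lines:
        if line.startswith('>'):
            continue  # skip FASTA headers
        seq = line.strip()
        total += len(seq)
        lower += sum(1 for base in seq if base.islower())
    return lower, total


# In-memory example; for a real genome pass an open file handle instead,
# e.g. softmask_fraction(open('assembly_softmasked.fasta'))
example = ['>scaffold_1', 'ACGTacgtACGT', 'ttttGGGG']
lower, total = softmask_fraction(example)
print(f'{lower} of {total} bases are lowercase ({lower / total:.1%})')
```

A fraction near zero suggests the genome was hardmasked or not masked at all. The expected value depends on the organism's repeat content.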
## TEtranscripts (TE Expression)
Quantify transposable element expression from RNA-seq data using TEtranscripts, which works with DESeq2 for differential TE expression.
```bash
# Requires STAR alignment with multi-mapping reads retained
STAR --runThreadN 16 \
--genomeDir star_index \
--readFilesIn reads_R1.fq.gz reads_R2.fq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--winAnchorMultimapNmax 100 \
--outFilterMultimapNmax 100 \
--outFileNamePrefix sample_
# Run TEtranscripts for differential expression
TEtranscripts \
--treatment sample1.bam sample2.bam sample3.bam \
--control ctrl1.bam ctrl2.bam ctrl3.bam \
--GTF genes.gtf \
--TE te_annotation.gtf \
--mode multi \
--sortByPos
```
### Key TEtranscripts Options
| Option | Description |
|--------|-------------|
| `--treatment` | Treatment BAM files |
| `--control` | Control BAM files |
| `--GTF` | Gene annotation GTF |
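| `--TE` | TE annotation GTF (derived from the RepeatMasker annotation; TEtranscripts expects `gene_id`, `transcript_id`, `family_id`, and `class_id` attributes, so the `-gff` output needs conversion) |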
| `--mode` | multi (recommended) or uniq |
| `--sortByPos` | Input sorted by position |
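| `--stranded` | Strand-specific protocol (`no`, `forward`, `reverse` in current releases; older releases accepted `yes` in place of `forward`, so check `TEtranscripts --help`) |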
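For a species without a prebuilt TE GTF, the annotation for `--TE` can be derived from the RepeatMasker `.out` table. The sketch below is one way to do it under stated assumptions: the attribute layout (`gene_id`, `transcript_id` with a `_dupN` suffix, `family_id`, `class_id`) mirrors the prebuilt TEtranscripts GTFs and should be checked against the TEtranscripts documentation. The function name `out_to_te_gtf` and the choice to drop simple and low-complexity repeats are illustrative.

```python
# Classes dropped from the TE annotation (an illustrative choice, not a rule)
SKIP_CLASSES = {'Simple_repeat', 'Low_complexity'}


def out_to_te_gtf(out_lines):
    """Yield GTF lines for TE copies listed in RepeatMasker .out lines."""
    copies = {}  # running copy number per repeat name
    for line in out_lines:
        parts = line.split()
        # Data rows start with an integer score; this skips headers and blanks
        if len(parts) < 11 or not parts[0].isdigit():
            continue
        seqid, start, end = parts[4], parts[5], parts[6]
        strand = '+' if parts[8] == '+' else '-'  # .out uses 'C' for minus
        name = parts[9]
        cls, _, family = parts[10].partition('/')
        if cls in SKIP_CLASSES:
            continue
        family = family or cls  # e.g. 'LTR' with no family part
        copies[name] = copies.get(name, 0) + 1
        attrs = (f'gene_id "{name}"; transcript_id "{name}_dup{copies[name]}"; '
                 f'family_id "{family}"; class_id "{cls}";')
        # .out coordinates are 1-based inclusive, the same convention as GTF
        yield '\t'.join([seqid, 'RepeatMasker', 'exon', start, end,
                         parts[0], strand, '.', attrs])


# In-memory example; for real data pass open('assembly.fasta.out')
example = [
    '1234 10.5 0.0 1.2 chr1 100 400 (5000) C L1Md_Gf LINE/L1 (10) 300 1 1',
]
for gtf_line in out_to_te_gtf(example):
    print(gtf_line)
```

Sequence names in the GTF must match those in the STAR index and the gene GTF, otherwise TE reads will not be assigned.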
## Python: Repeat Statistics
**Goal:** Parse RepeatMasker output to summarize repeat content by class and visualize the repeat divergence landscape.
**Approach:** Read the RepeatMasker `.out` file into a DataFrame, group by repeat class to compute total bp and genome percentage, then plot a divergence histogram stratified by major TE classes (LINE, SINE, LTR, DNA). The `.out` file reports uncorrected percent divergence from the consensus; a Kimura-corrected landscape needs the alignment output (`-a`) and RepeatMasker's `calcDivergenceFromAlign.pl`.
```python
import pandas as pd


def parse_repeatmasker_out(out_file):
    '''Parse RepeatMasker .out file into a DataFrame.'''
    records = []
    with open(out_file) as f:
        for i, line in enumerate(f):
            # First three lines: two-row column header plus a blank line
            if i < 3:
                continue
            parts = line.split()
            if len(parts) < 15:
                continue
            records.append({
                'score': int(parts[0]),
                'perc_div': float(parts[1]),
                'perc_del': float(parts[2]),
                'perc_ins': float(parts[3]),
                'seqid': parts[4],
                'start': int(parts[5]),
                'end': int(parts[6]),
                # RepeatMasker writes 'C' (complement) for the minus strand
                'strand': '+' if parts[8] == '+' else '-',
                'repeat_name': parts[9],
                'repeat_class': parts[10],
                'length': int(parts[6]) - int(parts[5]) + 1,
            })
    return pd.DataFrame(records)


def repeat_summary(rm_df, genome_size):
    '''Summarize repeat content by class.'''
    class_summary = rm_df.groupby('repeat_class').agg(
        count=('repeat_name', 'count'),
        total_bp=('length', 'sum'),
    ).sort_values('total_bp', ascending=False)
    class_summary['pct_genome'] = class_summary['total_bp'] / genome_size * 100
    # Note: overlapping annotations are counted once per row, so this sum
    # can slightly overestimate the masked fraction
    total_masked = rm_df['length'].sum()
    print('=== Repeat Summary ===')
    print(f'Total masked: {total_masked:,} bp ({total_masked/genome_size:.1%} of genome)')
    print('\nBy class:')
    for cls, row in class_summary.iterrows():
        # iterrows() upcasts mixed int/float rows to float; cast back for display
        print(f'  {cls}: {int(row["count"]):,} elements, {int(row["total_bp"]):,} bp ({row["pct_genome"]:.1f}%)')
    return class_summary


def repeat_landscape(rm_df, output_file='repeat_landscape.png'):
    '''Plot repeat divergence landscape (percent divergence from consensus).'''
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(12, 6))
    major_classes = ['LINE', 'SINE', 'LTR', 'DNA']
    colors = {'LINE': '#1f77b4', 'SINE': '#ff7f0e', 'LTR': '#2ca02c', 'DNA': '#d62728'}
    for cls in major_classes:
        subset = rm_df[rm_df['repeat_class'].str.contains(cls, case=False, na=False)]
        if len(subset) > 0:
            ax.hist(subset['perc_div'], bins=50, range=(0, 50), alpha=0.6, label=cls, color=colors.get(cls))
    ax.set_xlabel('Divergence from consensus (%)')
    ax.set_ylabel('Count')
    ax.set_title('Repeat Landscape')
    ax.legend()
    plt.savefig(output_file, dpi=150, bbox_inches='tight')
    plt.close()
```
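Summing `end - start + 1` over every row counts a base twice when two annotations overlap, which can inflate the total masked percentage. A minimal sketch of an overlap-aware total follows; the function name `merged_repeat_bp` is illustrative, and it assumes the 1-based inclusive coordinates of the `.out` file.

```python
from collections import defaultdict


def merged_repeat_bp(intervals):
    """Total bp covered by repeats, counting overlapping annotations once.

    `intervals` is an iterable of (seqid, start, end) tuples with 1-based
    inclusive coordinates.
    """
    by_seq = defaultdict(list)
    for seqid, start, end in intervals:
        by_seq[seqid].append((start, end))

    total = 0
    for spans in by_seq.values():
        spans.sort()
        cur_start, cur_end = spans[0]
        for start, end in spans[1:]:
            if start <= cur_end:
                # Overlaps the current block: extend it
                cur_end = max(cur_end, end)
            else:
                # Gap: close the current block and start a new one
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
    return total


example = [('chr1', 1, 100), ('chr1', 50, 150), ('chr2', 1, 10)]
print(merged_repeat_bp(example))
```

With the DataFrame returned by `parse_repeatmasker_out`, the input can be built as `rm_df[['seqid', 'start', 'end']].itertuples(index=False, name=None)`.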
## Expected Repeat Content
| Organism | Repeat Content | Notes |
|----------|---------------|-------|
| Bacteria | 1-5% | Mostly IS elements |
| Yeast | 3-5% | Ty elements |
| Drosophila | 15-25% | LTR-rich |
| Zebrafish | 45-55% | DNA transposon-rich |
| Human | 45-50% | LINE/SINE-rich |
| Maize | 80-85% | LTR-rich |
## Troubleshooting
### RepeatModeler Runs Very Slowly
- Normal for large genomes (days for mammalian-size)
- Use `-pa` for parallelization
- Consider EDTA as alternative for plant genomes
### Low Masking Percentage
- May indicate novel repeats not in database
- Always run RepeatModeler before RepeatMasker
- Combine de novo + known libraries
### Gene Prediction Finds Too Many Genes After Masking
- Verify softmasking by counting lowercase bases in the masked file: `grep -v '^>' assembly_softmasked.fasta | tr -cd 'a-z' | wc -c`
- Ensure RepeatMasker was run with `-xsmall` rather than the default hardmasking
## Related Skills
- eukaryotic-gene-prediction - Requires softmasked genome from repeat annotation
- genome-assembly/assembly-qc - Assess assembly quality including repeat content
- differential-expression/deseq2-basics - Differential TE expression analysis