bio-genome-annotation-repeat-annotation

Name: bio-genome-annotation-repeat-annotation
Author: GPTomics/bioSkills

$npx mdskill add GPTomics/bioSkills/bio-genome-annotation-repeat-annotation

Annotate and mask repetitive elements to enable accurate gene prediction.

Enables de novo repeat library construction and genome-wide masking.
Depends on RepeatModeler, RepeatMasker, and TEtranscripts tools.
Executes based on user need for repeat masking before gene prediction.
Delivers softmasked genome assemblies and TE expression quantification.

SKILL.md

.github/skills/bio-genome-annotation-repeat-annotationView on GitHub ↗

---
name: bio-genome-annotation-repeat-annotation
description: Identify and classify repetitive elements and transposable elements using RepeatModeler for de novo repeat library construction and RepeatMasker for genome-wide repeat annotation. Quantify TE expression from RNA-seq with TEtranscripts. Use when masking repeats before gene prediction or analyzing transposable element activity.
tool_type: cli
primary_tool: RepeatMasker
---

## Version Compatibility

Reference examples tested with: DESeq2 1.42+, STAR 2.7.11+, matplotlib 3.8+, pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Repeat and Transposable Element Annotation

**"Mask repeats in my genome assembly"** → Build a de novo repeat library and annotate/softmask repetitive elements as a prerequisite for gene prediction.
- CLI: `RepeatModeler -database mydb` (library), `RepeatMasker -lib custom-lib.fa -xsmall assembly.fa` (masking)

Identify, classify, and mask repetitive elements using RepeatModeler (de novo library construction) and RepeatMasker (genome-wide annotation). Softmasked output is a prerequisite for eukaryotic gene prediction.

## RepeatModeler (De Novo Library)

RepeatModeler builds a species-specific repeat library by detecting repetitive elements de novo from the assembly.

### Build Database and Run

```bash
# Build RepeatModeler database
BuildDatabase -name my_genome -engine ncbi assembly.fasta

# Run RepeatModeler (this takes hours to days depending on genome size)
# -LTRStruct enables LTR structural detection (recommended)
RepeatModeler -database my_genome -pa 16 -LTRStruct
```

### Key Options

| Option | Description |
|--------|-------------|
| `-database` | Database name from BuildDatabase |
| `-pa` | Parallel processes |
| `-LTRStruct` | Enable LTR structural detection pipeline |
| `-engine` | Search engine: ncbi (RMBLAST) or abblast |

### Output

```
my_genome-families.fa     # Consensus repeat library
my_genome-families.stk    # Stockholm alignments
RM_*/                     # Working directory with intermediate files
```

The output `*-families.fa` is the repeat library used by RepeatMasker.

## RepeatMasker (Genome-Wide Annotation)

### With De Novo Library

```bash
# Use species-specific de novo library (recommended)
RepeatMasker \
    -lib my_genome-families.fa \
    -pa 16 \
    -xsmall \
    -gff \
    -dir repeatmasker_out \
    assembly.fasta
```

### With Dfam/RepBase Library

```bash
# Use Dfam curated library for a known species
RepeatMasker \
    -species "Homo sapiens" \
    -pa 16 \
    -xsmall \
    -gff \
    -dir repeatmasker_out \
    assembly.fasta
```

### Combined Library (De Novo + Known)

```bash
# Combine de novo and curated libraries for best results
cat my_genome-families.fa known_repeats.fa > combined_lib.fa

RepeatMasker \
    -lib combined_lib.fa \
    -pa 16 \
    -xsmall \
    -gff \
    -dir repeatmasker_out \
    assembly.fasta
```

### Key Options

| Option | Description |
|--------|-------------|
| `-lib` | Custom repeat library FASTA |
| `-species` | Species name (uses Dfam database) |
| `-pa` | Parallel processes |
| `-xsmall` | Softmask output (lowercase repeats, required for gene prediction) |
| `-gff` | Generate GFF output |
| `-dir` | Output directory |
| `-nolow` | Skip low-complexity masking |
| `-noint` | Skip interspersed repeats |
| `-e` | Search engine: crossmatch, ncbi, hmmer, abblast |
| `-s` | Slow/sensitive search |
| `-q` | Quick search (5-10% less sensitive) |

### Output Files

```
repeatmasker_out/
├── assembly.fasta.masked    # Hardmasked genome (N's replace repeats)
├── assembly.fasta.out       # Detailed repeat annotation table
├── assembly.fasta.tbl       # Summary statistics table
├── assembly.fasta.out.gff   # GFF annotation of repeats
└── assembly.fasta.cat.gz    # Search result alignments
```

### Softmasking for Gene Prediction

The `-xsmall` flag produces softmasked output where repeats are lowercase. This is the required input format for BRAKER3 and most gene prediction tools.

```bash
# The softmasked genome is written in place of the input
# Copy original first
cp assembly.fasta assembly_original.fasta

RepeatMasker -lib my_genome-families.fa -pa 16 -xsmall assembly.fasta

# assembly.fasta.masked is the softmasked output
mv assembly.fasta.masked assembly_softmasked.fasta
```

## TEtranscripts (TE Expression)

Quantify transposable element expression from RNA-seq data using TEtranscripts, which works with DESeq2 for differential TE expression.

```bash
# Requires STAR alignment with multi-mapping reads retained
STAR --runThreadN 16 \
    --genomeDir star_index \
    --readFilesIn reads_R1.fq.gz reads_R2.fq.gz \
    --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate \
    --winAnchorMultimapNmax 100 \
    --outFilterMultimapNmax 100 \
    --outFileNamePrefix sample_

# Run TEtranscripts for differential expression
TEtranscripts \
    --treatment sample1.bam sample2.bam sample3.bam \
    --control ctrl1.bam ctrl2.bam ctrl3.bam \
    --GTF genes.gtf \
    --TE te_annotation.gtf \
    --mode multi \
    --sortByPos
```

### Key TEtranscripts Options

| Option | Description |
|--------|-------------|
| `--treatment` | Treatment BAM files |
| `--control` | Control BAM files |
| `--GTF` | Gene annotation GTF |
| `--TE` | TE annotation GTF (from RepeatMasker) |
| `--mode` | multi (recommended) or uniq |
| `--sortByPos` | Input sorted by position |
| `--stranded` | Strand-specific protocol (yes, no, reverse) |

## Python: Repeat Statistics

**Goal:** Parse RepeatMasker output to summarize repeat content by class and visualize the repeat divergence landscape.

**Approach:** Read the RepeatMasker `.out` file into a DataFrame, group by repeat class to compute total bp and genome percentage, then plot a Kimura divergence histogram stratified by major TE classes (LINE, SINE, LTR, DNA).

```python
import pandas as pd
import re

def parse_repeatmasker_out(out_file):
    '''Parse RepeatMasker .out file into a DataFrame.'''
    records = []
    with open(out_file) as f:
        for i, line in enumerate(f):
            if i < 3:
                continue
            parts = line.split()
            if len(parts) < 15:
                continue
            records.append({
                'score': int(parts[0]),
                'perc_div': float(parts[1]),
                'perc_del': float(parts[2]),
                'perc_ins': float(parts[3]),
                'seqid': parts[4],
                'start': int(parts[5]),
                'end': int(parts[6]),
                'strand': '+' if parts[8] == '+' else '-',
                'repeat_name': parts[9],
                'repeat_class': parts[10],
                'length': int(parts[6]) - int(parts[5]) + 1,
            })
    return pd.DataFrame(records)

def repeat_summary(rm_df, genome_size):
    '''Summarize repeat content by class.'''
    class_summary = rm_df.groupby('repeat_class').agg(
        count=('repeat_name', 'count'),
        total_bp=('length', 'sum'),
    ).sort_values('total_bp', ascending=False)

    class_summary['pct_genome'] = class_summary['total_bp'] / genome_size * 100
    total_masked = rm_df['length'].sum()

    print(f'=== Repeat Summary ===')
    print(f'Total masked: {total_masked:,} bp ({total_masked/genome_size:.1%} of genome)')
    print(f'\nBy class:')
    for cls, row in class_summary.iterrows():
        print(f'  {cls}: {row["count"]:,} elements, {row["total_bp"]:,} bp ({row["pct_genome"]:.1f}%)')

    return class_summary

def repeat_landscape(rm_df, output_file='repeat_landscape.png'):
    '''Plot repeat divergence landscape (Kimura distance).'''
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(12, 6))

    major_classes = ['LINE', 'SINE', 'LTR', 'DNA']
    colors = {'LINE': '#1f77b4', 'SINE': '#ff7f0e', 'LTR': '#2ca02c', 'DNA': '#d62728'}

    for cls in major_classes:
        subset = rm_df[rm_df['repeat_class'].str.contains(cls, case=False, na=False)]
        if len(subset) > 0:
            ax.hist(subset['perc_div'], bins=50, range=(0, 50), alpha=0.6, label=cls, color=colors.get(cls))

    ax.set_xlabel('Kimura Divergence (%)')
    ax.set_ylabel('Count')
    ax.set_title('Repeat Landscape')
    ax.legend()
    plt.savefig(output_file, dpi=150, bbox_inches='tight')
    plt.close()
```

## Expected Repeat Content

| Organism | Repeat Content | Notes |
|----------|---------------|-------|
| Bacteria | 1-5% | Mostly IS elements |
| Yeast | 3-5% | Ty elements |
| Drosophila | 15-25% | LTR-rich |
| Zebrafish | 45-55% | DNA transposon-rich |
| Human | 45-50% | LINE/SINE-rich |
| Maize | 80-85% | LTR-rich |

## Troubleshooting

### RepeatModeler Runs Very Slowly
- Normal for large genomes (days for mammalian-size)
- Use `-pa` for parallelization
- Consider EDTA as alternative for plant genomes

### Low Masking Percentage
- May indicate novel repeats not in database
- Always run RepeatModeler before RepeatMasker
- Combine de novo + known libraries

### Gene Prediction Finds Too Many Genes After Masking
- Verify softmasking with: `grep -v '^>' assembly.fasta | tr -cd 'a-z' | wc -c`
- Ensure using `-xsmall` not default hardmasking

## Related Skills

- eukaryotic-gene-prediction - Requires softmasked genome from repeat annotation
- genome-assembly/assembly-qc - Assess assembly quality including repeat content
- differential-expression/deseq2-basics - Differential TE expression analysis

More from GPTomics/bioSkills

Skill	Description
bio-admet-prediction	Predicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
bio-alignment-amplicon-clipping	Trim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
bio-alignment-filtering	Filter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
bio-alignment-indexing	Create and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
bio-alignment-io	Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
bio-alignment-msa-parsing	Parse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
bio-alignment-msa-statistics	Calculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
bio-alignment-multiple	Perform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
bio-alignment-pairwise	Perform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
bio-alignment-sorting	Sort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.