bio-splicing-pipeline

Name: bio-splicing-pipeline
Author: GPTomics/bioSkills

$npx mdskill add GPTomics/bioSkills/bio-splicing-pipeline

Analyzes alternative splicing from bulk RNA-seq data to identify differential splicing events

Processes raw FASTQ data to detect splicing changes across conditions
Uses STAR, rMATS-turbo, leafcutter, and R packages for QC and analysis
Evaluates splicing patterns with junction QC, isoform switching, and visualization tools
Delivers sashimi plots, differential splicing tables, and NMD/ORF impact summaries

SKILL.md

.github/skills/bio-splicing-pipelineView on GitHub ↗

---
name: bio-splicing-pipeline
description: End-to-end alternative splicing analysis from FASTQ to differential splicing results for short-read bulk RNA-seq. Aligns with STAR 2-pass cohort-style, performs junction QC (RSeQC, MaxEntScan, SpliceAI), runs rMATS-turbo and leafcutter for concordant differential analysis, optionally MAJIQ V3 for complex events / heterogeneous cohorts, isoform-switching with NMD/ORF/domain consequences (IsoformSwitchAnalyzeR v2 + DRIMSeq+DEXSeq+stageR DTU), and sashimi visualizations. Use when performing comprehensive splicing analysis from raw bulk RNA-seq data; for variant-driven splice prediction see splice-variant-prediction; for rare-disease single-patient outlier detection see outlier-splicing-detection; for full-isoform PacBio/ONT analysis see long-read-splicing.
tool_type: mixed
primary_tool: rMATS-turbo
---

## Version Compatibility

Reference examples tested with: STAR 2.7.11+, fastp 0.23+, numpy 1.26+, pandas 2.2+

Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show <package>` then `help(module.function)` to check signatures
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# Alternative Splicing Analysis Pipeline

**"Analyze alternative splicing from my RNA-seq data"** → Orchestrate STAR alignment, PSI quantification (rMATS-turbo/SUPPA2), differential splicing detection, isoform switching analysis (IsoformSwitchAnalyzeR), sashimi plot visualization, and junction QC.

Complete workflow from raw RNA-seq to differential splicing results.

## Pipeline Overview

```
FASTQ → Read QC → STAR 2-pass → Junction QC → rMATS-turbo → Results → Visualization
                                    ↓
                            (Optional) IsoformSwitchAnalyzeR
```

## Step 1: Read Quality Control

```bash
# fastp for adapter trimming and quality filtering
fastp \
    -i sample_R1.fastq.gz \
    -I sample_R2.fastq.gz \
    -o sample_clean_R1.fastq.gz \
    -O sample_clean_R2.fastq.gz \
    --detect_adapter_for_pe \
    --thread 8 \
    -h sample_fastp.html
```

## Step 2: STAR 2-Pass Alignment

```bash
# First pass to detect novel junctions
STAR \
    --runThreadN 8 \
    --genomeDir star_index/ \
    --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
    --readFilesCommand zcat \
    --outFileNamePrefix sample_pass1_ \
    --outSAMtype BAM Unsorted \
    --outSJfilterOverhangMin 8 8 8 8 \
    --alignSJDBoverhangMin 1

# Generate new index with discovered junctions
# (Combine SJ.out.tab files from all samples)
cat *_SJ.out.tab > combined_SJ.out.tab

# Second pass with combined junctions
STAR \
    --runThreadN 8 \
    --genomeDir star_index/ \
    --readFilesIn sample_R1.fastq.gz sample_R2.fastq.gz \
    --readFilesCommand zcat \
    --sjdbFileChrStartEnd combined_SJ.out.tab \
    --outFileNamePrefix sample_ \
    --outSAMtype BAM SortedByCoordinate \
    --outSJfilterOverhangMin 8 8 8 8 \
    --alignSJDBoverhangMin 1 \
    --quantMode GeneCounts
```

## Step 3: Junction QC Checkpoint

```python
import subprocess

def check_junction_saturation(bam_file, bed_file, output_prefix):
    '''
    QC Checkpoint: Verify junction detection saturation.
    Plateau indicates sufficient depth for splicing analysis.
    '''
    subprocess.run([
        'junction_saturation.py',
        '-i', bam_file,
        '-r', bed_file,
        '-o', output_prefix
    ], check=True)

    # Manual check: curves should plateau
    print(f'Check {output_prefix}.junctionSaturation_plot.pdf')
    print('If curves still rising, consider deeper sequencing')
```

## Step 4: Differential Splicing with rMATS-turbo

```bash
# Create sample list files
# condition1_bams.txt: sample1.bam,sample2.bam,sample3.bam
# condition2_bams.txt: sample4.bam,sample5.bam,sample6.bam

rmats.py \
    --b1 condition1_bams.txt \
    --b2 condition2_bams.txt \
    --gtf annotation.gtf \
    -t paired \
    --readLength 150 \
    --nthread 8 \
    --od rmats_output \
    --tmp rmats_tmp
```

## Step 5: Filter Results

```python
import pandas as pd

def filter_differential_splicing(rmats_dir, event_type='SE',
                                  fdr_cutoff=0.05, dpsi_cutoff=0.1, min_reads=10):
    '''
    Filter rMATS results for significant events.

    Thresholds:
    - |deltaPSI| > 0.1 (lenient) or > 0.2 (stringent)
    - FDR < 0.05
    - Junction reads >= 10
    '''
    jc_file = f'{rmats_dir}/{event_type}.MATS.JC.txt'
    df = pd.read_csv(jc_file, sep='\t')

    significant = df[
        (df['FDR'] < fdr_cutoff) &
        (df['IncLevelDifference'].abs() > dpsi_cutoff)
    ].copy()

    print(f'Significant {event_type} events: {len(significant)}')

    # Sort by significance and effect size
    significant['score'] = -significant['FDR'].apply(lambda x: max(x, 1e-300)).apply(
        lambda x: __import__('numpy').log10(x)
    ) * significant['IncLevelDifference'].abs()

    return significant.sort_values('score', ascending=False)
```

## Step 6: Optional Isoform Switching

```r
library(IsoformSwitchAnalyzeR)

# Import Salmon quantification if available
switchList <- importRdata(
    isoformCountMatrix = counts,
    isoformRepExpression = tpm,
    designMatrix = design,
    isoformExonAnnoation = 'annotation.gtf',
    isoformNtFasta = 'transcripts.fa'
)

# Analyze switches
switchList <- isoformSwitchTestDEXSeq(switchList, reduceToSwitchingGenes = TRUE)
```

## Step 7: Sashimi Visualization

```python
import subprocess

def visualize_top_events(rmats_dir, grouping_file, gtf_file, output_dir, n_top=20):
    '''Generate sashimi plots for top differential events.'''
    import pandas as pd
    from pathlib import Path

    Path(output_dir).mkdir(parents=True, exist_ok=True)

    for event_type in ['SE', 'A5SS', 'A3SS', 'MXE', 'RI']:
        jc_file = f'{rmats_dir}/{event_type}.MATS.JC.txt'
        df = pd.read_csv(jc_file, sep='\t')
        sig = df[(df['FDR'] < 0.05) & (df['IncLevelDifference'].abs() > 0.1)]

        for idx, event in sig.head(n_top).iterrows():
            chrom = event['chr']
            start = event.get('upstreamES', event.get('1stExonStart_0base', 0)) - 500
            end = event.get('downstreamEE', event.get('2ndExonEnd', 0)) + 500
            gene = event['geneSymbol']

            subprocess.run([
                'ggsashimi.py',
                '-b', grouping_file,
                '-c', f'{chrom}:{start}-{end}',
                '-o', f'{output_dir}/{event_type}_{gene}',
                '-g', gtf_file,
                '--shrink',
                '--fix-y-scale',
                '-M', '5'
            ], check=True)
```

## Complete Pipeline Script

```bash
#!/bin/bash
set -e

# Configuration
SAMPLES="sample1 sample2 sample3 sample4 sample5 sample6"
CONDITIONS="control control control treatment treatment treatment"
GTF="annotation.gtf"
STAR_INDEX="star_index/"
THREADS=8

# Step 1: QC and trimming
for sample in $SAMPLES; do
    fastp -i ${sample}_R1.fq.gz -I ${sample}_R2.fq.gz \
          -o ${sample}_clean_R1.fq.gz -O ${sample}_clean_R2.fq.gz \
          --thread $THREADS
done

# Step 2: STAR 2-pass alignment
# ... (as above)

# Step 3: Junction QC
for sample in $SAMPLES; do
    junction_saturation.py -i ${sample}.bam -r annotation.bed -o ${sample}_junc
done

# Step 4: rMATS differential splicing
rmats.py --b1 control_bams.txt --b2 treatment_bams.txt \
         --gtf $GTF -t paired --readLength 150 --nthread $THREADS \
         --od rmats_output --tmp rmats_tmp

echo "Pipeline complete. Check rmats_output/ for results."
```

## When NOT to Use This Pipeline (Pipeline Variants)

This pipeline targets **bulk short-read RNA-seq differential splicing between two groups**. For other regimes, use the dedicated skill:

| Question | Use instead |
|----------|-------------|
| "Does this DNA variant alter splicing?" | alternative-splicing/splice-variant-prediction (SpliceAI, Pangolin, MMSplice, ClinGen SVI 2023) |
| "What is aberrant in this single rare-disease patient?" | alternative-splicing/outlier-splicing-detection (FRASER 2.0, OUTRIDER, DROP) |
| "Full-isoform analysis from PacBio Iso-Seq / ONT" | alternative-splicing/long-read-splicing (FLAIR, IsoQuant, Bambu, SQANTI3, rMATS-long) |
| "Single-cell splicing analysis" | alternative-splicing/single-cell-splicing (chemistry-first decision; MARVEL, BRIE2 plate; long-read SC) |
| "Heterogeneous cohort, n>=10 vs n>=10" | This pipeline + MAJIQ V3 HET module (see alternative-splicing/differential-splicing) |
| "Microexon-focused (3-27 nt)" | This pipeline with VAST-TOOLS or MicroExonator; see alternative-splicing/splicing-quantification |

## Related Skills

- alternative-splicing/splicing-quantification - PSI computation, event taxonomy, sign conventions
- alternative-splicing/differential-splicing - Tool selection, MAJIQ V3, Shiba, leafcutter, reconciliation
- alternative-splicing/isoform-switching - DTU + NMD/ORF/domain consequences (IsoformSwitchAnalyzeR v2, stageR)
- alternative-splicing/sashimi-plots - ggsashimi, MAJIQ-VOILA, leafviz visualization
- alternative-splicing/splicing-qc - STAR 2-pass cohort-style, library prep, depth thresholds
- alternative-splicing/single-cell-splicing - 10X chemistry decision; plate-based and long-read SC
- alternative-splicing/splice-variant-prediction - SpliceAI / Pangolin / MMSplice variant interpretation
- alternative-splicing/outlier-splicing-detection - FRASER 2.0 / DROP rare-disease workflow
- alternative-splicing/long-read-splicing - PacBio HiFi / ONT full-isoform analysis
- read-alignment/star-alignment - STAR 2-pass cohort-style configuration
- rna-quantification/alignment-free-quant - Salmon TPM for SUPPA2 and DTU pipelines

More from GPTomics/bioSkills

Skill	Description
bio-admet-prediction	Predicts ADMET properties using ADMETlab 3.0 API or DeepChem models. Estimates bioavailability, CYP inhibition, hERG liability, and 119 toxicity endpoints with uncertainty quantification. Filters for PAINS and other structural alerts. Use when filtering compounds for drug-likeness or prioritizing leads by predicted safety.
bio-alignment-amplicon-clipping	Trim PCR primers from aligned reads in amplicon-panel BAMs using samtools ampliconclip. Use when processing SARS-CoV-2 ARTIC, hereditary cancer panels, ctDNA hot-spot panels, or any amplicon assay where primer-derived bases would falsely confirm reference at primer footprints.
bio-alignment-filtering	Filter alignments by flags, mapping quality, and regions using samtools view and pysam. Use when extracting specific reads, removing low-quality alignments, or subsetting to target regions.
bio-alignment-indexing	Create and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
bio-alignment-io	Read, write, and convert multiple sequence alignment files using Biopython Bio.AlignIO. Supports Clustal, PHYLIP, Stockholm, FASTA, Nexus, and other alignment formats for phylogenetics and conservation analysis. Use when reading, writing, or converting alignment file formats.
bio-alignment-msa-parsing	Parse and analyze multiple sequence alignments using Biopython. Extract sequences, identify conserved regions, analyze gaps, work with annotations, and manipulate alignment data for downstream analysis. Use when parsing or manipulating multiple sequence alignments.
bio-alignment-msa-statistics	Calculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
bio-alignment-multiple	Perform multiple sequence alignment using MAFFT, MUSCLE5, ClustalOmega, or T-Coffee. Guides tool and algorithm selection based on dataset size, sequence divergence, and downstream application. Use when aligning three or more homologous sequences for phylogenetics, conservation analysis, or evolutionary studies.
bio-alignment-pairwise	Perform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
bio-alignment-sorting	Sort alignment files by coordinate or read name using samtools and pysam. Use when preparing BAM files for indexing, variant calling, or paired-end analysis.