bio-workflow-management-cwl-workflows

Name: bio-workflow-management-cwl-workflows
Author: GPTomics/bioSkills

$ npx mdskill add GPTomics/bioSkills/bio-workflow-management-cwl-workflows

Build portable bioinformatics pipelines with CWL standards.

Enables cross-platform sharing and registry contribution.
Integrates cwltool, FastQC, Nextflow, Salmon, Snakemake, and fastp.
Validates tool versions and adapts to actual API signatures.
Delivers executable YAML workflow definitions for analysis.

SKILL.md

.github/skills/bio-workflow-management-cwl-workflows

---
name: bio-workflow-management-cwl-workflows
description: Create portable, standards-based bioinformatics pipelines with Common Workflow Language (CWL). Use when building workflows that need maximum portability across execution platforms, sharing pipelines with collaborators using different systems, or contributing to community workflow registries.
tool_type: cli
primary_tool: cwltool
---

## Version Compatibility

Reference examples tested with: FastQC 0.12+, Nextflow 23.10+, Salmon 1.10+, Snakemake 8.0+, fastp 0.23+

Before using code patterns, verify installed versions match. If versions differ:
- CLI: `<tool> --version` then `<tool> --help` to confirm flags

If code throws ImportError, AttributeError, or TypeError, introspect the installed
package and adapt the example to match the actual API rather than retrying.

# CWL Workflows

**"Write a portable CWL workflow for my analysis"** → Define tools and workflows in YAML using the Common Workflow Language standard for maximum cross-platform portability and sharing through workflow registries.
- CLI: `cwltool` for local execution of CWL documents
- YAML: CWL v1.2 CommandLineTool and Workflow class definitions

## Basic Tool Definition

```yaml
# fastqc.cwl
cwlVersion: v1.2
class: CommandLineTool
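baseCommand: fastqc
# FastQC writes reports next to the input file by default, but CWL only
# collects outputs from the job's working directory, so redirect them there.
arguments: ["--outdir", "."]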

inputs:
  fastq:
    type: File
    inputBinding:
      position: 1

outputs:
  html:
    type: File
    outputBinding:
      glob: "*_fastqc.html"
  zip:
    type: File
    outputBinding:
      glob: "*_fastqc.zip"
```

## Tool with Parameters

```yaml
# bwa_mem.cwl
cwlVersion: v1.2
class: CommandLineTool
baseCommand: [bwa, mem]

requirements:
  DockerRequirement:
    dockerPull: biocontainers/bwa:v0.7.17
  ResourceRequirement:
    coresMin: 8
    ramMin: 16000

inputs:
  threads:
    type: int
    default: 8
    inputBinding:
      prefix: -t
      position: 1
  reference:
    type: File
    secondaryFiles:
      - .amb
      - .ann
      - .bwt
      - .pac
      - .sa
    inputBinding:
      position: 2
  reads_1:
    type: File
    inputBinding:
      position: 3
  reads_2:
    type: File?
    inputBinding:
      position: 4

stdout: aligned.sam

outputs:
  sam:
    type: stdout
```

## Basic Workflow

```yaml
# rnaseq.cwl
cwlVersion: v1.2
class: Workflow

inputs:
  fastq_1: File
  fastq_2: File
  salmon_index: Directory

outputs:
  quant_results:
    type: Directory
    outputSource: salmon/quant_dir

steps:
  fastp:
    run: fastp.cwl
    in:
      reads_1: fastq_1
      reads_2: fastq_2
    out: [trimmed_1, trimmed_2, json_report]

  salmon:
    run: salmon_quant.cwl
    in:
      index: salmon_index
      reads_1: fastp/trimmed_1
      reads_2: fastp/trimmed_2
    out: [quant_dir]
```
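The workflow above runs `fastp.cwl`, which is not shown here. A minimal sketch of what such a tool could look like follows, assuming paired-end input and fixed output file names (`trimmed_R1.fastq.gz`, `trimmed_R2.fastq.gz`, `fastp.json`, and `fastp.html` are illustrative choices, not names from the original skill). The input and output IDs match what the workflow step expects.

```yaml
# fastp.cwl (hypothetical sketch; adapt file names and options to your setup)
cwlVersion: v1.2
class: CommandLineTool
baseCommand: fastp

inputs:
  reads_1:
    type: File
    inputBinding:
      prefix: -i        # read 1 input
  reads_2:
    type: File
    inputBinding:
      prefix: -I        # read 2 input

arguments:
  # Fixed output names (assumed) so the globs below are predictable
  - prefix: -o
    valueFrom: trimmed_R1.fastq.gz
  - prefix: -O
    valueFrom: trimmed_R2.fastq.gz
  - prefix: -j
    valueFrom: fastp.json
  - prefix: -h
    valueFrom: fastp.html

outputs:
  trimmed_1:
    type: File
    outputBinding:
      glob: trimmed_R1.fastq.gz
  trimmed_2:
    type: File
    outputBinding:
      glob: trimmed_R2.fastq.gz
  json_report:
    type: File
    outputBinding:
      glob: fastp.json
```

Confirm the flags against `fastp --help` for the installed version before relying on this sketch.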

## Scatter (Parallel Execution)

```yaml
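cwlVersion: v1.2
class: Workflow

requirements:
  ScatterFeatureRequirement: {}

inputs:
  fastq_files:
    type: File[]
  reference:
    type: File
    # BWA index files must travel with the reference
    secondaryFiles: [.amb, .ann, .bwt, .pac, .sa]

outputs:
  sam_files:
    type: File[]
    outputSource: align/sam

steps:
  align:
    run: bwa_mem.cwl
    # Scatter over the tool's `reads_1` input: one alignment job per FASTQ
    scatter: reads_1
    in:
      reads_1: fastq_files
      reference: reference
    out: [sam]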
```
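A job file for the scatter workflow supplies the array input as a YAML list of `File` objects. The sketch below uses hypothetical paths and lists the BWA index files explicitly under `secondaryFiles`, which works whether or not the workflow input declares them.

```yaml
# scatter_job.yaml (hypothetical file names)
fastq_files:
  - class: File
    path: data/sample1.fq.gz
  - class: File
    path: data/sample2.fq.gz
reference:
  class: File
  path: ref/genome.fa
  # Explicit index files; may be omitted if the workflow input declares
  # matching secondaryFiles patterns and the files sit next to genome.fa
  secondaryFiles:
    - class: File
      path: ref/genome.fa.amb
    - class: File
      path: ref/genome.fa.ann
    - class: File
      path: ref/genome.fa.bwt
    - class: File
      path: ref/genome.fa.pac
    - class: File
      path: ref/genome.fa.sa
```

Each element of `fastq_files` becomes one scattered job, and the step output is collected as an array in the same order.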

## Multi-Scatter

```yaml
requirements:
  ScatterFeatureRequirement: {}

steps:
  align:
    run: bwa_mem.cwl
    # dotproduct pairs element i of each array, so both arrays must have
    # the same length (one R1/R2 pair per job)
    scatter: [reads_1, reads_2]
    scatterMethod: dotproduct
    in:
      reads_1: fastq_1_array
      reads_2: fastq_2_array
      reference: reference
    out: [sam]
```

## Input File (Job)

```yaml
# job.yaml
fastq_1:
  class: File
  path: data/sample1_R1.fq.gz
fastq_2:
  class: File
  path: data/sample1_R2.fq.gz
salmon_index:
  class: Directory
  path: ref/salmon_index
threads: 8
```

## Secondary Files

```yaml
inputs:
  bam:
    type: File
    secondaryFiles:
      - .bai
  reference:
    type: File
    secondaryFiles:
      - pattern: .fai
        required: true
      - pattern: .dict
        required: false
```
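Secondary file patterns are appended to the primary file name by default, so `.bai` on `sample.bam` resolves to `sample.bam.bai`. A leading caret (`^`) strips one extension from the primary name before appending, which matters for indexes that replace the extension. The sketch below illustrates both forms; the choice of which pattern is required is an assumption that depends on how your files were produced.

```yaml
inputs:
  bam:
    type: File
    secondaryFiles:
      - pattern: .bai       # sample.bam -> sample.bam.bai
        required: false
      - pattern: ^.bai      # sample.bam -> sample.bai
        required: false
  reference:
    type: File
    secondaryFiles:
      - pattern: .fai       # genome.fa -> genome.fa.fai
        required: true
      - pattern: ^.dict     # genome.fa -> genome.dict
        required: false
```

Use one caret per extension to strip, for example `^^.dict` for a reference named `genome.fa.gz`.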

## Docker and Singularity

```yaml
requirements:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/salmon:1.10.0--h7e5ed60_0

hints:
  SoftwareRequirement:
    packages:
      salmon:
        version: ["1.10.0"]
```

```bash
# Run with Docker (cwltool's default container engine; no extra flag needed)
cwltool workflow.cwl job.yaml

# Run with Singularity
cwltool --singularity workflow.cwl job.yaml
```

## Resource Requirements

```yaml
requirements:
  ResourceRequirement:
    coresMin: 4
    coresMax: 16
    ramMin: 8000
    ramMax: 32000
    outdirMin: 10000
    tmpdirMin: 10000
```
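In `ResourceRequirement`, the RAM and disk fields (`ramMin`, `outdirMin`, `tmpdirMin`) are expressed in mebibytes. The values the runner actually grants are exposed through the `runtime` object, so a tool can pass its thread count from `$(runtime.cores)` instead of hard-coding it. The sketch below assumes a hypothetical `samtools_sort.cwl` wrapper; the tool name and output file name are illustrative.

```yaml
# samtools_sort.cwl (hypothetical sketch)
cwlVersion: v1.2
class: CommandLineTool
baseCommand: [samtools, sort]

requirements:
  ResourceRequirement:
    coresMin: 4
    ramMin: 8000          # MiB

inputs:
  bam:
    type: File
    inputBinding:
      position: 1

arguments:
  # Use the number of cores the runner allocated to this job
  - prefix: "-@"
    valueFrom: $(runtime.cores)
  - prefix: -o
    valueFrom: sorted.bam

outputs:
  sorted_bam:
    type: File
    outputBinding:
      glob: sorted.bam
```

Tying the thread flag to `runtime.cores` keeps the command consistent with whatever the execution engine schedules.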

## Conditional Steps

```yaml
cwlVersion: v1.2
class: Workflow

requirements:
  InlineJavascriptRequirement: {}

inputs:
  run_qc: boolean
  fastq: File

outputs:
  qc_html:
    type: File?          # null when the step is skipped
    outputSource: fastqc/html

steps:
  fastqc:
    run: fastqc.cwl
    when: $(inputs.run_qc)
    in:
      run_qc: run_qc
      fastq: fastq
    out: [html]
```
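When a downstream consumer needs a value whether or not the conditional step ran, CWL v1.2 provides `pickValue` to choose among several sources. The sketch below falls back to the untrimmed input when trimming is skipped; `trim.cwl`, its `trimmed` output, and the `run_trim` flag are hypothetical names used only for illustration.

```yaml
cwlVersion: v1.2
class: Workflow

requirements:
  # Needed because the output lists more than one source
  MultipleInputFeatureRequirement: {}

inputs:
  run_trim: boolean
  fastq: File

outputs:
  reads_for_alignment:
    type: File
    # Take the trimmed reads if the step ran, otherwise the original input
    outputSource: [trim/trimmed, fastq]
    pickValue: first_non_null

steps:
  trim:
    run: trim.cwl          # hypothetical single-end trimming tool
    when: $(inputs.run_trim)
    in:
      run_trim: run_trim
      fastq: fastq
    out: [trimmed]
```

Because `first_non_null` always finds the workflow input as a fallback, the output type can stay a plain `File` rather than `File?`.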

## Subworkflows

```yaml
# main.cwl
requirements:
  # Required whenever a step runs another Workflow
  SubworkflowFeatureRequirement: {}

steps:
  qc_workflow:
    run: subworkflows/qc.cwl
    in:
      reads_1: fastq_1
      reads_2: fastq_2
    out: [qc_report, trimmed_1, trimmed_2]

  alignment_workflow:
    run: subworkflows/align.cwl
    in:
      reads_1: qc_workflow/trimmed_1
      reads_2: qc_workflow/trimmed_2
    out: [bam]
```

## File Arrays and Directories

```yaml
inputs:
  bam_files:
    type: File[]
  output_dir:
    type: string
    default: "results"

outputs:
  results:
    type: Directory
    outputBinding:
      glob: $(inputs.output_dir)
```
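When a `File[]` input has an `inputBinding` and no `itemSeparator`, each file is placed on the command line as its own argument. The sketch below shows this with a hypothetical `samtools_merge.cwl` wrapper; the tool file and the default output name are assumptions for illustration.

```yaml
# samtools_merge.cwl (hypothetical sketch)
cwlVersion: v1.2
class: CommandLineTool
baseCommand: [samtools, merge]

inputs:
  output_name:
    type: string
    default: merged.bam
    inputBinding:
      position: 1        # output file comes first
  bam_files:
    type: File[]
    inputBinding:
      position: 2        # expands to: in1.bam in2.bam ...

outputs:
  merged_bam:
    type: File
    outputBinding:
      glob: $(inputs.output_name)
```

With two input BAMs this builds a command of the form `samtools merge merged.bam in1.bam in2.bam`, with the staged input paths substituted for the file names.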

## JavaScript Expressions

```yaml
requirements:
  InlineJavascriptRequirement: {}

inputs:
  sample_name: string

outputs:
  output_bam:
    type: File
    outputBinding:
      glob: $(inputs.sample_name + ".sorted.bam")

arguments:
  - prefix: -o
    valueFrom: $(inputs.sample_name).sorted.bam
```

## InitialWorkDirRequirement

```yaml
requirements:
  InitialWorkDirRequirement:
    listing:
      - entry: $(inputs.reference)
        writable: false
      - entryname: config.txt
        entry: |
          threads=$(inputs.threads)
          memory=$(inputs.memory)
```

## Complete RNA-seq Tool

```yaml
# salmon_quant.cwl
cwlVersion: v1.2
class: CommandLineTool
baseCommand: [salmon, quant]

requirements:
  DockerRequirement:
    dockerPull: quay.io/biocontainers/salmon:1.10.0--h7e5ed60_0
  ResourceRequirement:
    coresMin: 8
    ramMin: 16000

inputs:
  index:
    type: Directory
    inputBinding:
      prefix: -i
  reads_1:
    type: File
    inputBinding:
      prefix: "-1"
  reads_2:
    type: File
    inputBinding:
      prefix: "-2"
  lib_type:
    type: string
    default: A
    inputBinding:
      prefix: -l
  threads:
    type: int
    default: 8
    inputBinding:
      prefix: --threads
  output_dir:
    type: string
    default: quant_output
    inputBinding:
      prefix: -o

outputs:
  quant_dir:
    type: Directory
    outputBinding:
      glob: $(inputs.output_dir)
```

## Run Commands

```bash
# Validate CWL file
cwltool --validate workflow.cwl

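# Run workflow (Docker is used by default when a DockerRequirement is present)
cwltool workflow.cwl job.yaml

# Run without containers, using tools installed on the host
cwltool --no-container workflow.cwl job.yaml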

# Run with Singularity
cwltool --singularity workflow.cwl job.yaml

# Run with caching
cwltool --cachedir ./cache workflow.cwl job.yaml

# Run on Toil
toil-cwl-runner workflow.cwl job.yaml
```

## Execution Engines

| Engine | Use Case |
|--------|----------|
| cwltool | Reference implementation, local execution |
| Toil | HPC clusters, cloud (AWS, Google, Azure) |
| Arvados | Enterprise workflow management |
| CWL-Airflow | Airflow integration |

## Related Skills

- workflow-management/wdl-workflows - WDL alternative
- workflow-management/snakemake-workflows - Python-based alternative
- workflow-management/nextflow-pipelines - Groovy-based alternative
