Skip to content
MGP1000

Usage of the Pipeline Modules

The MGP1000 consists of 3 scripts that are used to execute the modules.

  • preprocessing.nf
  • germline.nf
  • somatic.nf

There are two methods for running each module in the pipeline: directly from the current environment or batch submission via a wrapper script (more on this later).

In both scenarios Nextflow performs all the job scheduling/submitting by interfacing with the HPC’s native job executor.

Currently supported executors:


Quickstart Demo ⚡️

A quick demo of running the Preprocessing module and sanity check of the installation:

nextflow run preprocessing.nf \
	--run_id demo \
	--input_format fastq \
	--input_dir tests/ \
	-profile preprocessing

This command will run the Preprocessing module in the foreground using singularity containers and slurm execution manager (by default, lsf also available via --executor).


Preprocessing 🔵

To produce high-quality variant calls, the input data preprocessing must be uniform and deterministic. Additionally, there needs to be consistent collection of standardized QC metrics. The Preprocessing module achieves these endpoints by adhering to the best practices as laid out by the GATK: alignment with BWA mem, post-processing with samtools followed by BQSR. Finally, robust QC metrics are recorded with FastQC (pre-alignment) and Alfred (post-alignment).

Placeholder for image of module DAG

Use-cases 🔍

Below are a series of common use-cases and example code snippets of nextflow run commands that execute the pipeline module.

The most common use-case: alignment of paired-end FASTQs

nextflow run preprocessing.nf \
  -bg \
  --run_id batch1 \
  --input_format fastq \
  --email user@example.com \
  -profile preprocessing

The second most common use-case: alignment to hg38 of previously hg19-aligned BAMs

nextflow run preprocessing.nf \
  -bg \
  --run_id batch2 \
  --input_format bam \
  --email user@example.com \
  -profile preprocessing

The thrid most common use-case: quality-control analysis on previously hg38-aligned BAMs

nextflow run preprocessing.nf \
  -bg \
  --run_id batch3 \
  --input_format bam \
  --qc_only yes \
  --email user@example.com \
  -profile preprocessing

The fourth most common use-case: alignment of lane split paired-end FASTQs

nextflow run preprocessing.nf \
  -bg \
  --run_id batch4 \
  --input_format fastq \
  --lane_split yes \
  --email user@example.com \
  -profile preprocessing

Script Parameters ⚙️

Mandatory Arguements

--run_id

Unique identifier for pipeline run

--input_format

Format of input files

Default: fastq | Available: fastq, bam

-profile

Configuration profile to use, must use preprocessing

Optional Arguments

-bg

Runs the pipeline processes in the background, this ption should be included if deploying pipeline with real data set so processes will not be cut if user disconnects from deployment environment

-resume

Successfully completed tasks are cached so that if the pipeline stops prematurely the previously completed tasks are skipped while maintaining their output

--lane_split

Determines if input FASTQs are lane split per R1/R2

Default: no | Available: yes, no

--input_dir

Directory that holds input FASTQs or BAMs, this should be given as an absolute path

Default: input/

--output_dir

Directory that will hold all output files, this should be given as an absolute path

Default: output/

--email

Email address to send workflow completion/stoppage notification

--seq_protocol

Sequencing protocol of the input, WGS for whole-genome and WES for whole-exome

Default: WGS | Available: WES, WGS

--cpus

Globally set the number of cpus to be allocated

--memory

Globally set the amount of memory to be allocated, written as ##.GB or ##.MB

--queue_size

Set max number of tasks the pipeline will launch

Default: 100

--executor

Set the job executor for the run

Default: slurm | Available: local, lsf, slurm

--help

Prints help message


Germline 🟢

The matched normal sample BAM is used to call germline SNPs and InDels in a highly parallel fashion with DeepVariant. Additionally, the ancestry admixture likelihood is estimated in the context of 23 reference populations from the 1000 Genomes Project using fastNGSadmix.

Placeholder for image of module DAG

Use-cases 🔍

Below are a series of common use-cases and example code snippets of nextflow run commands that execute the pipeline module.

The most common use-case: run full germline analysis (calling + admixture)

nextflow run germline.nf \
  -bg \
  --run_id batch1 \
  --sample_sheet wgs_samples.csv \
  --email user@example.com \
  -profile germline

The most common use-case: run only germline variant calling

nextflow run germline.nf \
  -bg \
  --run_id batch2 \
  --sample_sheet wgs_samples.csv \
  --fastngsadmix off \
  --email user@example.com \
  -profile germline

The third common use-case: run only admixture likelihood estimation

nextflow run germline.nf \
  -bg \
  --run_id batch3 \
  --sample_sheet wgs_samples.csv \
  --deepvariant off \
  --email user@example.com \
  -profile germline

Script Parameters ⚙️

Mandatory Arguements

--run_id

Unique identifier for pipeline run

--sample_sheet

CSV file containing the list of samples where the first column designates the filename of the normal sample, the second column for the filename of the matched tumor sample

-profile

Configuration profile to use, must use germline

Optional Arguments

-bg

Runs the pipeline processes in the background, this ption should be included if deploying pipeline with real data set so processes will not be cut if user disconnects from deployment environment

-resume

Successfully completed tasks are cached so that if the pipeline stops prematurely the previously completed tasks are skipped while maintaining their output

--input_dir

Directory that holds input FASTQs or BAMs, this should be given as an absolute path

Default: input/

--output_dir

Directory that will hold all output files, this should be given as an absolute path

Default: output/

--email

Email address to send workflow completion/stoppage notification

--seq_protocol

Sequencing protocol of the input, WGS for whole-genome and WES for whole-exome

Default: WGS | Available: WGS, WES

--deepvariant

Indicates whether or not to run DeepVariant workflow

Default: on | Available: off, on

--fastngsadmix

Indicates whether or not to run fastNGSadmix workflow

Default: on | Available: off, on

--cpus

Globally set the number of cpus to be allocated

--memory

Globally set the amount of memory to be allocated, written as ##.GB or ##.MB

--queue_size

Set max number of tasks the pipeline will launch

Default: 100

--executor

Set the job executor for the run

Default: slurm | Available: local, lsf, slurm

--help

Prints help message


Somatic ⚫️

The matched tumor/normal BAM pair is used in somatic variant analysis employing a consensus mechanism for determining the final set of somatic events, including Mutect2, Strelka2, and VarScan2 for SNVs; Mutect2, Strelka2, VarScan2, and SvABA for InDels; Battenberg and FACETS for CNVs; Manta, SvABA, DELLY2, and IgCaller for SVs.

Placeholder for image of module DAG

Use-cases 🔍

Below are a series of common use-cases and example code snippets of nextflow run commands that execute the pipeline module.

The most common use-case: run full somatic analysis

nextflow run somatic.nf \
  -bg \
  --run_id batch1 \
  --sample_sheet wgs_samples.csv \
  --email user@example.com \
  -profile somatic

Script Parameters ⚙️

Mandatory Arguements

--run_id

Unique identifier for pipeline run

--sample_sheet

CSV file containing the list of samples where the first column designates the filename of the normal sample, the second column for the filename of the matched tumor sample

-profile

Configuration profile to use, must use somatic

Optional Arguments

-bg

Runs the pipeline processes in the background, this ption should be included if deploying pipeline with real data set so processes will not be cut if user disconnects from deployment environment

-resume

Successfully completed tasks are cached so that if the pipeline stops prematurely the previously completed tasks are skipped while maintaining their output

--input_dir

Directory that holds input FASTQs or BAMs, this should be given as an absolute path

Default: input/

--output_dir

Directory that will hold all output files, this should be given as an absolute path

Default: output/

--email

Email address to send workflow completion/stoppage notification

--mutect_ref_vcf_concatenated

Indicates whether or not the gnomAD allele frequency reference VCF used for MuTect2 processes has been concatenated, this will be done in a process of the pipeline if it has not, this does not need to be done for every separate run after the first

Default: no | Available: yes, no

--battenberg_ref_cached

Indicates whether or not the reference files used for Battenberg have been downloaded/cached locally, this will be done in a process of the pipeline if it has not, this does not need to be done for every separate run after the first

Default: no | Available: yes, no

--cpus

Globally set the number of cpus to be allocated

--memory

Globally set the amount of memory to be allocated, written as ##.GB or ##.MB

--queue_size

Set max number of tasks the pipeline will launch

Default: 100

--executor

Set the job executor for the run

Default: slurm | Available: local, lsf, slurm

--help

Prints help message

Toolbox Arguments

--battenberg

Indicates whether or not to use this tool

Default: on | Available: off, on

--battenberg_min_depth

Manually set the minimum read depth in the normal sample for SNP filtering in BAF calculations, default is for 30x coverage

Default: 10

--battenberg_preset_rho_psi

Wish to manually set the rho/psi for this run? If TRUE, must set both rho and psi

Default: FALSE | Available: TRUE, FALSE

--battenberg_preset_rho

Manually set the value of rho (purity)

Default: NA

--battenberg_preset_psi

Manually set the value of psi (ploidy)

Default: NA

--facets

Indicates whether or not to use this tool

Default: on | Available: off, on

--facets_min_depth

Manually set the minimum read depth in the normal sample for SNP filtering in BAF calculations, default is for 30x coverage

Default: 20

--manta

Indicates whether or not to use this tool

Default: on | Available: off, on

--svaba

Indicates whether or not to use this tool

Default: on | Available: off, on

--delly

Indicates whether or not to use this tool

Default: on | Available: off, on

--delly_strict

Enforce stricter thresholds for calling SVs with DELLY to overcome libraries with extraordinary number of interchromosomal reads

Default: off | Available: on, off

--igcaller

Indicates whether or not to use this tool

Default: on | Available: off, on

--varscan

Indicates whether or not to use this tool

Default: on | Available: off, on

--mutect

Indicates whether or not to use this tool

Default: on | Available: off, on

--strelka

Indicates whether or not to use this tool

Default: on | Available: off, on

--conpair

Indicates whether or not to use this tool

Default: on | Available: off, on

--conpair_min_cov

Manually set the minimum coverage

Default: 10

--fragcounter

Indicates whether or not to use this tool

Default: on | Available: off, on

--telomerecat

Indicates whether or not to use this tool

Default: off | Available: on, off

--telomerehunter

Indicates whether or not to use this tool

Default: off | Available: on, off

--caveman

Indicates whether or not to use this tool

EXPERIMENTAL! Requires many process directories and only calls in WES targets

Default: off | Available: on, off

Batch Submission of Pipeline Runs

In the event the user is required to submit the parent nextflow run process as a batch submission, there is a script bin/nextflow_run_slurm_submitter.sh that will accept user-specified parameters for running any pipeline module by first generating and executing a Slurm sbatch submission script. Below is the full help message and usage example for each step.

bin/nextflow_run_slurm_submitter.sh -h

Use-cases

Below are a series of common use-cases for users to wrap parent run process into a batch submission.

Common use-case for running the Preprocessing module

bin/nextflow_run_slurm_submitter.sh \
    preprocessing.nf \
    batch1 \
    user@example.com \
    3 \
    "module load java/1.8 nextflow/22.0.8 singularity/3.9.8" \
    "--input_format fastq"

Common use-case for running the Germline module

bin/nextflow_run_slurm_submitter.sh \
    germline.nf \
    batch2 \
    user@example.com \
    14 \
    "module load java/1.8 nextflow/22.0.8 singularity/3.9.8" \
    "--sample_sheet normal.csv"

Common use-case for running the Somatic module

bin/nextflow_run_slurm_submitter.sh \
    somatic.nf \
    batch3 \
    user@example.com \
    7 \
    "module load java/1.8 nextflow/22.0.8 singularity/3.9.8" \
    "--sample_sheet tumor_normal.csv"