Usage of the Pipeline Modules
The MGP1000 consists of three Nextflow scripts, one per pipeline module:
preprocessing.nf
germline.nf
somatic.nf
Each module can be run in one of two ways: directly from the current environment, or as a batch submission via a wrapper script (see Batch Submission of Pipeline Runs).
In both cases, Nextflow handles all job scheduling and submission by interfacing with the HPC’s native job executor.
Currently supported executors: slurm (default) and lsf, selected with --executor.
Quickstart Demo ⚡️
A quick demo that runs the Preprocessing module and doubles as a sanity check of the installation:
nextflow run preprocessing.nf \
--run_id demo \
--input_format fastq \
--input_dir tests/ \
-profile preprocessing
This command runs the Preprocessing module in the foreground using singularity containers and the slurm execution manager (the default; lsf is also available via --executor).
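Before launching the demo, it can help to confirm that the required tools are on the PATH. The following is a minimal sketch, not part of the pipeline: the helper name missing_tools is hypothetical, and the tool list (nextflow, singularity) is taken from the commands and containers mentioned in this document.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every listed tool not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Check the tools the demo relies on (assumed list: nextflow, singularity).
missing=$(missing_tools nextflow singularity)
if [ -n "$missing" ]; then
    # Word splitting is intentional: one message line per missing tool.
    printf 'Not found on PATH: %s\n' $missing >&2
else
    echo "All required tools found"
fi
```

If anything is reported missing, load or install it before running the demo command.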
Preprocessing 🔵
To produce high-quality variant calls, the input data preprocessing must be uniform and deterministic. Additionally, there needs to be consistent collection of standardized QC metrics. The Preprocessing module achieves these endpoints by adhering to the best practices as laid out by the GATK: alignment with BWA mem, post-processing with samtools followed by BQSR. Finally, robust QC metrics are recorded with FastQC (pre-alignment) and Alfred (post-alignment).
Placeholder for image of module DAG
Use-cases 🔍
Below are common use-cases with example nextflow run commands that execute the pipeline module.
The most common use-case: alignment of paired-end FASTQs
nextflow run preprocessing.nf \
-bg \
--run_id batch1 \
--input_format fastq \
--email user@example.com \
-profile preprocessing
The second most common use-case: alignment to hg38 of previously hg19-aligned BAMs
nextflow run preprocessing.nf \
-bg \
--run_id batch2 \
--input_format bam \
--email user@example.com \
-profile preprocessing
The third most common use-case: quality-control analysis on previously hg38-aligned BAMs
nextflow run preprocessing.nf \
-bg \
--run_id batch3 \
--input_format bam \
--qc_only yes \
--email user@example.com \
-profile preprocessing
The fourth most common use-case: alignment of lane-split paired-end FASTQs
nextflow run preprocessing.nf \
-bg \
--run_id batch4 \
--input_format fastq \
--lane_split yes \
--email user@example.com \
-profile preprocessing
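When it is unclear whether a directory holds lane-split FASTQs, the filenames can be inspected before choosing a value for --lane_split. This sketch assumes Illumina bcl2fastq-style names that carry a lane token such as _L001_; the naming scheme the pipeline itself expects is not stated here, and the helper name guess_lane_split is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: echo "yes" if any FASTQ in the directory has an
# Illumina-style lane token (_L001_, _L002_, ...) in its name, else "no".
# ASSUMPTION: bcl2fastq-style filenames ending in .fastq.gz.
guess_lane_split() {
    dir=$1
    for f in "$dir"/*_L[0-9][0-9][0-9]_*.fastq.gz; do
        # When the glob matches nothing, $f is the literal pattern and -e fails.
        if [ -e "$f" ]; then
            echo yes
            return 0
        fi
    done
    echo no
}
```

The result could then be passed along, for example as --lane_split "$(guess_lane_split /path/to/input)".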
Script Parameters ⚙️
Mandatory Arguments
--run_id
Unique identifier for pipeline run
--input_format
Format of input files
Default: fastq | Available: fastq, bam
-profile
Configuration profile to use, must use preprocessing
Optional Arguments
-bg
Runs the pipeline processes in the background. Include this option when deploying the pipeline with a real data set so that processes are not cut off if the user disconnects from the deployment environment
-resume
Successfully completed tasks are cached, so if the pipeline stops prematurely the previously completed tasks are skipped on the next run while their output is retained
--lane_split
Determines if input FASTQs are lane split per R1/R2
Default: no | Available: yes, no
--input_dir
Directory that holds input FASTQs or BAMs, this should be given as an absolute path
Default: input/
--output_dir
Directory that will hold all output files, this should be given as an absolute path
Default: output/
--email
Email address to send workflow completion/stoppage notification
--seq_protocol
Sequencing protocol of the input, WGS for whole-genome and WES for whole-exome
Default: WGS | Available: WES, WGS
--cpus
Globally set the number of cpus to be allocated
--memory
Globally set the amount of memory to be allocated, written as ##.GB or ##.MB
--queue_size
Set max number of tasks the pipeline will launch
Default: 100
--executor
Set the job executor for the run
Default: slurm | Available: local, lsf, slurm
--help
Prints help message
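The --memory value must follow the ##.GB or ##.MB pattern described above. As a rough sketch, a small check like the one below could catch a malformed value before a run is launched; the helper name valid_memory is hypothetical, and the pipeline may accept or reject forms beyond what this pattern covers.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the argument is digits, a dot,
# then GB or MB (for example 64.GB or 512.MB).
valid_memory() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+\.(GB|MB)$'
}

# Example: warn about a value that is missing the dot.
if ! valid_memory "64GB"; then
    echo "64GB is not in ##.GB / ##.MB form" >&2
fi
```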
Germline 🟢
The matched normal sample BAM is used to call germline SNPs and InDels in a highly parallel fashion with DeepVariant. Additionally, the ancestry admixture likelihood is estimated in the context of 23 reference populations from the 1000 Genomes Project using fastNGSadmix.
Placeholder for image of module DAG
Use-cases 🔍
Below are common use-cases with example nextflow run commands that execute the pipeline module.
The most common use-case: run full germline analysis (calling + admixture)
nextflow run germline.nf \
-bg \
--run_id batch1 \
--sample_sheet wgs_samples.csv \
--email user@example.com \
-profile germline
The second most common use-case: run only germline variant calling
nextflow run germline.nf \
-bg \
--run_id batch2 \
--sample_sheet wgs_samples.csv \
--fastngsadmix off \
--email user@example.com \
-profile germline
The third most common use-case: run only admixture likelihood estimation
nextflow run germline.nf \
-bg \
--run_id batch3 \
--sample_sheet wgs_samples.csv \
--deepvariant off \
--email user@example.com \
-profile germline
Script Parameters ⚙️
Mandatory Arguments
--run_id
Unique identifier for pipeline run
--sample_sheet
CSV file listing the samples: the first column gives the filename of the normal sample, the second column the filename of the matched tumor sample
-profile
Configuration profile to use, must use germline
Optional Arguments
-bg
Runs the pipeline processes in the background. Include this option when deploying the pipeline with a real data set so that processes are not cut off if the user disconnects from the deployment environment
-resume
Successfully completed tasks are cached, so if the pipeline stops prematurely the previously completed tasks are skipped on the next run while their output is retained
--input_dir
Directory that holds input FASTQs or BAMs, this should be given as an absolute path
Default: input/
--output_dir
Directory that will hold all output files, this should be given as an absolute path
Default: output/
--email
Email address to send workflow completion/stoppage notification
--seq_protocol
Sequencing protocol of the input, WGS for whole-genome and WES for whole-exome
Default: WGS | Available: WGS, WES
--deepvariant
Indicates whether or not to run DeepVariant workflow
Default: on | Available: off, on
--fastngsadmix
Indicates whether or not to run fastNGSadmix workflow
Default: on | Available: off, on
--cpus
Globally set the number of cpus to be allocated
--memory
Globally set the amount of memory to be allocated, written as ##.GB or ##.MB
--queue_size
Set max number of tasks the pipeline will launch
Default: 100
--executor
Set the job executor for the run
Default: slurm | Available: local, lsf, slurm
--help
Prints help message
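Since --input_dir and --output_dir should be given as absolute paths, a relative directory can be resolved first. The following is a small sketch; the helper name abs_dir is hypothetical and the directory must already exist.

```shell
#!/bin/sh
# Hypothetical helper: print the absolute, symlink-resolved path of an
# existing directory; fail (non-zero status, no output) if it does not exist.
abs_dir() {
    ( cd "$1" 2>/dev/null && pwd -P )
}
```

It could be used inline, for example --input_dir "$(abs_dir input/)", so that the value passed to the module no longer depends on the current working directory.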
Somatic ⚫️
The matched tumor/normal BAM pair is used for somatic variant analysis, with a consensus mechanism determining the final set of somatic events. The callers are Mutect2, Strelka2, and VarScan2 for SNVs; Mutect2, Strelka2, VarScan2, and SvABA for InDels; Battenberg and FACETS for CNVs; Manta, SvABA, DELLY2, and IgCaller for SVs.
Placeholder for image of module DAG
Use-cases 🔍
Below are common use-cases with example nextflow run commands that execute the pipeline module.
The most common use-case: run full somatic analysis
nextflow run somatic.nf \
-bg \
--run_id batch1 \
--sample_sheet wgs_samples.csv \
--email user@example.com \
-profile somatic
Script Parameters ⚙️
Mandatory Arguments
--run_id
Unique identifier for pipeline run
--sample_sheet
CSV file listing the samples: the first column gives the filename of the normal sample, the second column the filename of the matched tumor sample
-profile
Configuration profile to use, must use somatic
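To illustrate the two-column layout of the sample sheet, the sketch below writes a small example file and checks that every row has exactly two non-empty fields. The header row normal,tumor, the BAM filenames, and the helper name check_sample_sheet are assumptions made for illustration; consult the pipeline's own documentation for the exact header it expects.

```shell
#!/bin/sh
# Example sheet: normal sample first, matched tumor sample second.
# ASSUMPTION: the header row and all filenames are placeholders.
cat > wgs_samples.csv <<'EOF'
normal,tumor
patientA_normal.bam,patientA_tumor.bam
patientB_normal.bam,patientB_tumor.bam
EOF

# Hypothetical helper: report rows that do not have exactly two non-empty
# comma-separated fields; exit status is non-zero if any such row exists.
check_sample_sheet() {
    awk -F',' '
        NF != 2 || $1 == "" || $2 == "" {
            printf "line %d: %s\n", NR, $0
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

check_sample_sheet wgs_samples.csv && echo "sample sheet looks well-formed"
```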
Optional Arguments
-bg
Runs the pipeline processes in the background. Include this option when deploying the pipeline with a real data set so that processes are not cut off if the user disconnects from the deployment environment
-resume
Successfully completed tasks are cached, so if the pipeline stops prematurely the previously completed tasks are skipped on the next run while their output is retained
--input_dir
Directory that holds input FASTQs or BAMs, this should be given as an absolute path
Default: input/
--output_dir
Directory that will hold all output files, this should be given as an absolute path
Default: output/
--email
Email address to send workflow completion/stoppage notification
--mutect_ref_vcf_concatenated
Indicates whether the gnomAD allele frequency reference VCF used for Mutect2 processes has already been concatenated. If it has not, a pipeline process will do so; this only needs to happen once, not for every separate run
Default: no | Available: yes, no
--battenberg_ref_cached
Indicates whether the reference files used for Battenberg have already been downloaded/cached locally. If they have not, a pipeline process will do so; this only needs to happen once, not for every separate run
Default: no | Available: yes, no
--cpus
Globally set the number of cpus to be allocated
--memory
Globally set the amount of memory to be allocated, written as ##.GB or ##.MB
--queue_size
Set max number of tasks the pipeline will launch
Default: 100
--executor
Set the job executor for the run
Default: slurm | Available: local, lsf, slurm
--help
Prints help message
Toolbox Arguments
--battenberg
Indicates whether or not to use this tool
Default: on | Available: off, on
--battenberg_min_depth
Manually set the minimum read depth in the normal sample for SNP filtering in BAF calculations; the default is for 30x coverage
Default: 10
--battenberg_preset_rho_psi
Whether to manually set rho/psi for this run; if TRUE, both rho and psi must be set
Default: FALSE | Available: TRUE, FALSE
--battenberg_preset_rho
Manually set the value of rho (purity)
Default: NA
--battenberg_preset_psi
Manually set the value of psi (ploidy)
Default: NA
--facets
Indicates whether or not to use this tool
Default: on | Available: off, on
--facets_min_depth
Manually set the minimum read depth in the normal sample for SNP filtering in BAF calculations; the default is for 30x coverage
Default: 20
--manta
Indicates whether or not to use this tool
Default: on | Available: off, on
--svaba
Indicates whether or not to use this tool
Default: on | Available: off, on
--delly
Indicates whether or not to use this tool
Default: on | Available: off, on
--delly_strict
Enforce stricter thresholds for calling SVs with DELLY to compensate for libraries with an extraordinary number of interchromosomal reads
Default: off | Available: on, off
--igcaller
Indicates whether or not to use this tool
Default: on | Available: off, on
--varscan
Indicates whether or not to use this tool
Default: on | Available: off, on
--mutect
Indicates whether or not to use this tool
Default: on | Available: off, on
--strelka
Indicates whether or not to use this tool
Default: on | Available: off, on
--conpair
Indicates whether or not to use this tool
Default: on | Available: off, on
--conpair_min_cov
Manually set the minimum coverage
Default: 10
--fragcounter
Indicates whether or not to use this tool
Default: on | Available: off, on
--telomerecat
Indicates whether or not to use this tool
Default: off | Available: on, off
--telomerehunter
Indicates whether or not to use this tool
Default: off | Available: on, off
--caveman
Indicates whether or not to use this tool
EXPERIMENTAL! Requires many process directories and only calls in WES targets
Default: off | Available: on, off
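The three Battenberg preset options depend on one another: when --battenberg_preset_rho_psi is TRUE, both rho and psi must be supplied. A minimal sketch of that rule is shown below. The helper name check_battenberg_preset and the example values 0.8 and 2.1 are hypothetical, and the pipeline may perform its own validation.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the preset flag, rho, and psi are
# consistent with the rule above. "NA" is treated as "not set".
# Arguments: <TRUE|FALSE> <rho> <psi>
check_battenberg_preset() {
    flag=$1 rho=$2 psi=$3
    if [ "$flag" = "TRUE" ]; then
        # Both values must be provided when presets are enabled.
        [ "$rho" != "NA" ] && [ "$psi" != "NA" ]
    else
        return 0
    fi
}

# Example with placeholder values: presets enabled but psi left unset.
if ! check_battenberg_preset TRUE 0.8 NA; then
    echo "--battenberg_preset_rho_psi TRUE requires both rho and psi" >&2
fi
```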
Batch Submission of Pipeline Runs
If the parent nextflow run process itself must be submitted as a batch job, the script bin/nextflow_run_slurm_submitter.sh accepts user-specified parameters for running any pipeline module; it first generates and then executes a Slurm sbatch submission script. Below is the full help message and a usage example for each module.
bin/nextflow_run_slurm_submitter.sh -h
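For orientation, the sketch below shows the general shape of an sbatch script that wraps a parent nextflow run process. It is not the script generated by bin/nextflow_run_slurm_submitter.sh: the helper name print_sbatch_script, the chosen #SBATCH directives, and the log filename are assumptions for illustration, and the profile name is simply derived from the module filename.

```shell
#!/bin/sh
# Hypothetical helper: print a minimal sbatch script for one module run.
# Arguments: <module.nf> <run_id> <email>
print_sbatch_script() {
    module=$1 run_id=$2 email=$3
    profile=${module%.nf}   # e.g. somatic.nf -> somatic
    cat <<EOF
#!/bin/bash
#SBATCH --job-name=${run_id}
#SBATCH --mail-user=${email}
#SBATCH --mail-type=END,FAIL
#SBATCH --output=${run_id}.%j.log

nextflow run ${module} --run_id ${run_id} --email ${email} -profile ${profile}
EOF
}

# Example: print the script for inspection instead of submitting it.
print_sbatch_script somatic.nf batch3 user@example.com
```

Redirecting the output to a file and passing that file to sbatch would submit it; module-specific options such as --sample_sheet would need to be appended to the nextflow run line.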
Use-cases
Below are common use-cases that wrap the parent run process in a batch submission.
Common use-case for running the Preprocessing module
bin/nextflow_run_slurm_submitter.sh \
preprocessing.nf \
batch1 \
user@example.com \
3 \
"module load java/1.8 nextflow/22.0.8 singularity/3.9.8" \
"--input_format fastq"
Common use-case for running the Germline module
bin/nextflow_run_slurm_submitter.sh \
germline.nf \
batch2 \
user@example.com \
14 \
"module load java/1.8 nextflow/22.0.8 singularity/3.9.8" \
"--sample_sheet normal.csv"
Common use-case for running the Somatic module
bin/nextflow_run_slurm_submitter.sh \
somatic.nf \
batch3 \
user@example.com \
7 \
"module load java/1.8 nextflow/22.0.8 singularity/3.9.8" \
"--sample_sheet tumor_normal.csv"