MGP1000

Stage Input Data

By default, input files are read from the input/ directory for the Preprocessing module and from the input/preprocessedBams directory for the Germline/Somatic modules. Each module in the pipeline also accepts a parameter (--input_dir) that lets the user define a different input directory. Symbolic links in the input directory are followed.

Click on a tab below for details on input needed for each module …

The Preprocessing module currently supports fastq or bam as --input_format and WGS or WES as --seq_protocol. Lane-split FASTQs are also supported (see below for validated naming conventions). They are merged internally as part of the module run, and the input files themselves are not altered.

There are two underlying logical assumptions:

  1. The input files are of a single format
  2. The FASTQs use an ‘R1/R2’ naming convention to designate paired-end read files

Example of input directory with FASTQs:

sample_1_R1.fastq.gz
sample_1_R2.fastq.gz
sample_2_R1.fastq.gz
sample_2_R2.fastq.gz

Example of input directory with BAMs:

sample_1.bam
sample_2.bam
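
The two assumptions above can be checked before a run with a short script. The following is a minimal Python sketch, not part of the pipeline. The function name check_input_names is hypothetical, the .fastq.gz and .bam extensions are taken from the examples above, and flagging a FASTQ whose R1 or R2 mate is missing is an assumption about what paired-end input should look like.

```python
import re
from collections import defaultdict


def check_input_names(filenames):
    """Return a list of human-readable problems found in the given file names.

    Hypothetical helper: only names ending in .fastq.gz or .bam are inspected,
    everything else is ignored.
    """
    problems = []
    fastqs = [name for name in filenames if name.endswith(".fastq.gz")]
    bams = [name for name in filenames if name.endswith(".bam")]

    # Assumption 1: the input files are of a single format.
    if fastqs and bams:
        problems.append(
            f"mixed formats: {len(fastqs)} FASTQ and {len(bams)} BAM file(s)"
        )

    # Assumption 2: FASTQs carry an R1/R2 designator.
    # Group files by their name with the read number masked out, so that
    # sample_1_R1.fastq.gz and sample_1_R2.fastq.gz share one key.
    mates = defaultdict(set)
    for name in fastqs:
        match = re.search(r"_R([12])(?=[_.])", name)
        if match is None:
            problems.append(f"{name}: no R1/R2 designator found")
            continue
        key = name[: match.start()] + "_R?" + name[match.end():]
        mates[key].add(match.group(1))

    # Assumed here, not stated by the pipeline docs: both mates must be present.
    for key, reads in sorted(mates.items()):
        if reads != {"1", "2"}:
            problems.append(f"{key}: only read(s) {sorted(reads)} present")

    return problems
```

To check a real directory, the function could be called as check_input_names(os.listdir("input")) after importing os. An empty list means no problems were found under these assumptions.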

Validated naming conventions for lane split FASTQs:

  • _L00[1-9]_R[12]_[\d\w]+
  • _L00[1-9]_R[12]
  • _00[1-9]_R[12]_[\d\w]+
  • _00[1-9]_R[12]
  • _R[12]_00[1-9]

Not sure if your FASTQs will work? Test their filenames against the regex strings listed above.
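
The same check can also be run locally. Below is a minimal Python sketch that tries each of the listed patterns against a filename. The function name first_lane_split_match is hypothetical. Requiring the pattern to sit immediately before a .fastq.gz extension is an assumption, since the exact way the pipeline applies these patterns is not stated here.

```python
import re

# Patterns copied verbatim from the list of validated naming conventions.
LANE_SPLIT_PATTERNS = [
    r"_L00[1-9]_R[12]_[\d\w]+",
    r"_L00[1-9]_R[12]",
    r"_00[1-9]_R[12]_[\d\w]+",
    r"_00[1-9]_R[12]",
    r"_R[12]_00[1-9]",
]

# Assumed extension, taken from the FASTQ example directory.
FASTQ_SUFFIX = ".fastq.gz"


def first_lane_split_match(filename):
    """Return the first listed pattern that ends the filename stem, else None."""
    if not filename.endswith(FASTQ_SUFFIX):
        return None
    stem = filename[: -len(FASTQ_SUFFIX)]
    for pattern in LANE_SPLIT_PATTERNS:
        # Assumption: the lane/read suffix must be the last part of the stem.
        if re.search(pattern + r"$", stem):
            return pattern
    return None


if __name__ == "__main__":
    for name in ["sample_1_L001_R1_001.fastq.gz", "sample_1_R1.fastq.gz"]:
        print(name, "->", first_lane_split_match(name))
```

A return value of None means only that the name matches none of the lane-split conventions under these assumptions. A plain sample_1_R1.fastq.gz is not lane split but still follows the R1/R2 convention.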