Stage Input Data
By default, all input files are handled out of the input/ and input/preprocessedBams directories for the Preprocessing and Germline/Somatic modules, respectively. However, each module in the pipeline includes an parameter (--input_dir) for the user to define the input directory. Finally, all symbolic links are followed.
Click on a tab below for details on input needed for each module …
The Preprocessing module currently supports fastq or bam as --input_format and WGS or WES as --seq_protocol. This includes lane split FASTQs (See below for validated naming conventions). The files will be merged internally as part of the module run and will not alter any input files.
There are two underlying logical assumptions:
- The input files are of a single format
- The FASTQs use an ‘R1/R2’ naming convention to designate paired-end read files
Example of input directory with FASTQs:
sample_1_R1.fastq.gz
sample_1_R2.fastq.gz
sample_2_R1.fastq.gz
sample_2_R2.fastq.gzExample of input directory with BAMs:
sample_1.bam
sample_2.bamValidated naming conventions for lane split FASTQs:
_L00[1-9]_R[12]_[\d\w]+_L00[1-9]_R[12]_00[1-9]_R[12]_[\d\w]+_00[1-9]_R[12]_R[12]_00[1-9]
Not sure if your FASTQs will work? Test their filename against the regex strings listed above here.
The Germline module currently supports any GRCh38 aligned bam for either WGS or WES as --seq_protocol. Only additional input needed is an index (.bai) for the bam and a sample sheet CSV file.
Example of input directory:
N_sample_1.bam
N_sample_1.bam.bai
N_sample_2.bam
N_sample_2.bam.baiExample of sample sheet CSV file:
normal,tumor
N_sample_1.bam,T_sample_1.bam
N_sample_2.bam,T_sample_2.bamThis CSV file should only include the filenames of the bam that can be found at the input directory path.
Notice: While the
T_sample_1.bamfile is written in the sample sheet, the module only searchs for files in the normal column
The Somatic module currently supports any GRCh38 aligned bam for WGS as --seq_protocol. Only additional input needed is an index (.bai) for the bam and a sample sheet CSV file.
Example of input directory:
N_sample_1.bam T_sample_1.bam
N_sample_1.bam.bai T_sample_1.bam.bai
N_sample_2.bam T_sample_2.bam
N_sample_2.bam.bai T_sample_2.bam.baiExample of sample sheet CSV file:
normal,tumor
N_sample_1.bam,T_sample_1.bam
N_sample_2.bam,T_sample_2.bamThis CSV file should only include the filenames of the bam that can be found at the input directory path.