Provenance
These reference pages provide in-depth provenance for the processes in each module, detailing the input and output file formats, the execution command, and any reference data.
Reference Genome
This initial version of the MGP1000 uses the GRCh38 (i.e. hg38) reference genome.
All modules of the pipeline are currently designed to use only this version. A newer version may eventually be considered if the tool-specific reference files can be regenerated against it.
All reference files are provided with the repository and are stored within the references/hg38 directory.
The core reference file Homo_sapiens_assembly38.fasta was sourced from the Broad Institute's publicly available hg38 reference resource.
This build includes all autosomes, both sex chromosomes, the mitochondrial DNA contig, the EBV DNA contig, all random/unplaced/alt/decoy contigs, and all HLA contigs.
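As a quick way to check which contig classes a given index or header actually contains, the sketch below sorts contig names into the categories listed above. The naming patterns (for example `_random`, `chrUn_`, `_alt`, `_decoy`, `chrEBV`, `HLA-`) are assumptions based on the usual naming in the public Broad hg38 FASTA and are not specified on this page; `classify_contig` is a hypothetical helper, not part of the pipeline.

```python
import re

# Assumed naming conventions of the Broad hg38 FASTA; order matters because
# decoy contigs can also start with "chrUn_".
_CONTIG_PATTERNS = [
    ("autosome", re.compile(r"^chr([1-9]|1[0-9]|2[0-2])$")),
    ("sex", re.compile(r"^chr[XY]$")),
    ("mitochondrial", re.compile(r"^chrM$")),
    ("EBV", re.compile(r"^chrEBV$")),
    ("HLA", re.compile(r"^HLA-")),
    ("decoy", re.compile(r"_decoy$")),
    ("alt", re.compile(r"_alt$")),
    ("random", re.compile(r"_random$")),
    ("unplaced", re.compile(r"^chrUn_")),
]


def classify_contig(name: str) -> str:
    """Return the first matching contig category, or 'other' if none match."""
    for label, pattern in _CONTIG_PATTERNS:
        if pattern.search(name):
            return label
    return "other"
```

Applied to the first column of a `.fai` index, this gives a per-category contig count that can be compared against the description above.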
Preprocessing
Output
The quality control metrics detailed in *.alfred.qc.json.gz can be viewed interactively by uploading the file to the Alfred | GEAR Genomics GUI. Otherwise, a breadth of ## alignment QC metrics is tabulated in *.alfred.qc.summary.txt, which can be easily aggregated per sample for data analysis.
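One possible way to aggregate the per-sample summaries is sketched below. It assumes each `*.alfred.qc.summary.txt` is tab-delimited with a single header row followed by one or more value rows; that layout is an assumption, not something this page confirms, and `aggregate_alfred_summaries` is a hypothetical helper.

```python
import csv
from pathlib import Path


def aggregate_alfred_summaries(paths, out_path):
    """Stack several summary tables into one TSV, tagging rows by source file.

    Assumes tab-delimited input with a header row (unverified layout).
    Returns the number of data rows written.
    """
    rows, columns = [], []
    for path in map(Path, paths):
        with path.open(newline="") as handle:
            for record in csv.DictReader(handle, delimiter="\t"):
                record = {"source_file": path.name, **record}
                # Keep the first-seen column order across all files.
                for key in record:
                    if key not in columns:
                        columns.append(key)
                rows.append(record)
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=columns, delimiter="\t", restval="NA"
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

A typical call would pass `sorted(Path("results").glob("*.alfred.qc.summary.txt"))` as `paths`, where `results` stands in for wherever the preprocessing output was written.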
Tools
- GATK - RevertSam (Picard)
- Biobambam2 - bamtofastq
- Trimmomatic - PE
- FastQC
- BWA - mem
- Samtools - sort
- Samtools - collate
- Samtools - fixmate
- Samtools - markdup
- ABRA2
- GATK - DownsampleSam (Picard)
- GATK - BaseRecalibrator
- GATK - ApplyBQSR
- Alfred - qc
Germline
Output
The output germline VCF contains both SNPs and InDels with DeepVariant's default, award-winning, machine-learning-based filtering. The admixture estimates are found in the *.fastngsadmix.23.qopt file. As described here, the first two rows contain the names of the reference populations and the converged estimates, followed by 100 rows of bootstrap estimates.
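The row layout described above can be read with a few lines of code. The sketch below assumes whitespace-delimited columns, and the nearest-rank percentile interval is an illustrative summary of the bootstrap rows, not something the pipeline itself reports; `parse_qopt` and `bootstrap_interval` are hypothetical helpers.

```python
def parse_qopt(text):
    """Split a .qopt file into population names, converged estimates, bootstraps.

    Assumes whitespace-delimited columns: row 1 = names, row 2 = converged
    estimates, remaining rows = bootstrap estimates.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    names = rows[0]
    converged = dict(zip(names, (float(v) for v in rows[1])))
    bootstraps = [[float(v) for v in row] for row in rows[2:]]
    return names, converged, bootstraps


def bootstrap_interval(names, bootstraps, lower=0.025, upper=0.975):
    """Nearest-rank percentile interval per population (illustrative only)."""
    intervals = {}
    for i, name in enumerate(names):
        values = sorted(row[i] for row in bootstraps)
        last = len(values) - 1
        intervals[name] = (
            values[round(lower * last)],
            values[round(upper * last)],
        )
    return intervals
```

With the real file, `len(bootstraps)` should equal 100 if the layout matches the description.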
Tools
Somatic
Output
The consensus SNVs and InDels are output in mutation tables. These tab-delimited *.hq.union.consensus.somatic.[snv|indel].txt.gz files contain the union of all variants from all callers, with consensus records annotated and caller-specific VCF metrics reported. No filters are applied beyond the defaults recommended by each caller. For each consensus record, the median VAF across all reporting callers is output.
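The median VAF step can be illustrated as follows. This is a minimal sketch: the caller names used as dictionary keys are placeholders, and representing a caller that did not report the variant as `None` is an assumption about how missing values might be handled.

```python
from statistics import median


def consensus_vaf(caller_vafs):
    """Median VAF over the callers that reported the variant.

    caller_vafs maps a caller name to its VAF, or to None when that caller
    did not report the record (assumed representation).
    """
    reported = [vaf for vaf in caller_vafs.values() if vaf is not None]
    if not reported:
        return None
    return median(reported)
```

With an even number of reporting callers, `statistics.median` returns the mean of the two middle values.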
The consensus SVs are output in BEDPE format. This paired-breakpoint *.hq.union.consensus.somatic.sv.bedpe file contains the union of all variants from all callers, with consensus breakpoints merged using a comprehensive coordinate-based merge/join of junctions that lie within 1000bp of each other. The breakpoints from IgCaller that are output in *.igcaller.oncogenic.rearrangements.tsv are integrated using a simple filter of Score >= 5 & 'Reads in normal' <= 4 & 'Count in PoN' <= 5. The DUP and DEL breakpoints from Manta, DELLY2, and SvABA are post-filtered to reduce false positives using duphold, which annotates these records with the change in depth of coverage across the breakpoints. Records are then filtered using DHFFC>0.7 for DEL and DHBFC<1.3 for DUP.
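The filters above can be written as small predicates. This is a sketch under stated assumptions: "filtered" is read as "removed", so DEL records with DHFFC > 0.7 and DUP records with DHBFC < 1.3 are dropped, which matches the thresholds recommended in the duphold documentation; the 1000bp match is treated as inclusive and applied to both ends of a junction, ignoring strand. All function names are hypothetical.

```python
def passes_igcaller_filter(score, reads_in_normal, count_in_pon):
    """Score >= 5 & 'Reads in normal' <= 4 & 'Count in PoN' <= 5."""
    return score >= 5 and reads_in_normal <= 4 and count_in_pon <= 5


def passes_duphold_filter(svtype, dhffc=None, dhbfc=None):
    """Keep a record unless its depth change contradicts the call.

    Assumed reading: DEL with DHFFC > 0.7 and DUP with DHBFC < 1.3 are removed.
    Other SV types, or records lacking the annotation, pass through.
    """
    if svtype == "DEL" and dhffc is not None:
        return dhffc <= 0.7
    if svtype == "DUP" and dhbfc is not None:
        return dhbfc >= 1.3
    return True


def breakpoints_match(a, b, window=1000):
    """True if two junctions (chrom1, pos1, chrom2, pos2) agree within window.

    Simplification: both ends must be on the same chromosomes and within
    `window` bp; strand and orientation are ignored here.
    """
    chrom1_a, pos1_a, chrom2_a, pos2_a = a
    chrom1_b, pos1_b, chrom2_b, pos2_b = b
    return (
        chrom1_a == chrom1_b
        and chrom2_a == chrom2_b
        and abs(pos1_a - pos1_b) <= window
        and abs(pos2_a - pos2_b) <= window
    )
```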
The consensus CNVs are output in BED format. This segmented *.hq.union.consensus.somatic.cnv.bed file contains the union coverage of copy number segments from both callers. Specifically, columns report the total copy number and the major/minor allele copy number per tool at each potential segment.
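To illustrate what "union coverage" with per-tool columns could look like, the sketch below splits one chromosome at every segment boundary reported by either caller and records each tool's copy number on each resulting piece. The input layout, the half-open coordinates, and the tool keys are assumptions for illustration, and `union_segments` is a hypothetical helper rather than the pipeline's own merge.

```python
def union_segments(segments_by_tool):
    """Build union segments for one chromosome.

    segments_by_tool maps a tool name to a list of
    (start, end, total_cn, major_cn, minor_cn) tuples with half-open
    coordinates (assumed layout). Returns one row per potential segment,
    with each tool's (total, major, minor) or None where it has no call.
    """
    breakpoints = sorted(
        {
            pos
            for segments in segments_by_tool.values()
            for seg in segments
            for pos in (seg[0], seg[1])
        }
    )
    rows = []
    for start, end in zip(breakpoints, breakpoints[1:]):
        row = {"start": start, "end": end}
        covered = False
        for tool, segments in segments_by_tool.items():
            hit = next(
                (s for s in segments if s[0] <= start and end <= s[1]), None
            )
            row[tool] = hit[2:] if hit else None
            covered = covered or hit is not None
        # Skip gaps that no tool covers.
        if covered:
            rows.append(row)
    return rows
```

For example, a segment reported by only one tool yields a row where the other tool's entry is `None`, which would correspond to an empty value in that tool's columns.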
Tools
- Conpair
- Battenberg
- FACETS
- Manta
- SvABA
- DELLY2
- IgCaller
- Mutect2
- Strelka2
- VarScan2
- alleleCounter
- Samtools
- BCFtools
- duphold
- bam-readcount
- BEDtools
- BEDOPS
- fragCounter
- gGnome
- CaVEMan
- Telomerecat
- TelomereHunter