MGP1000

Provenance

These reference pages provide in-depth provenance for the processes in each module, detailing the input and output file formats, the execution command, and any reference data.


Reference Genome

For this initial version of the MGP1000, the reference genome is GRCh38 (i.e. hg38). All modules of the pipeline are currently designed to use only this version. Newer versions may eventually be considered if the tool-specific reference files can be regenerated for them. All reference files are provided with the repository and are stored in the references/hg38 directory.

The core reference file Homo_sapiens_assembly38.fasta was sourced from the Broad Institute's publicly available hg38 reference resource. This build includes all autosomes, both sex chromosomes, the mitochondrial DNA contig, the EBV DNA contig, all random/unplaced/alt/decoy contigs, and all HLA contigs.
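The contig categories listed above can be tallied from a FASTA index. The following is a minimal sketch, assuming a `.fai` index exists alongside the FASTA and that contig names follow the usual hg38 conventions (`chr1` to `chr22`, `chrX`, `chrY`, `chrM`, `chrEBV`, `*_random`, `chrUn_*`, `*_alt`, `*_decoy`, `HLA-*`). The function names `classify_contig` and `summarize_fai` are hypothetical and not part of the pipeline.

```python
import re
from collections import Counter


def classify_contig(name: str) -> str:
    """Assign a contig name to one of the categories described above.

    The naming patterns are assumptions based on common hg38 conventions.
    """
    if re.fullmatch(r"chr([1-9]|1[0-9]|2[0-2])", name):
        return "autosome"
    if name in ("chrX", "chrY"):
        return "sex"
    if name == "chrM":
        return "mitochondrial"
    if name == "chrEBV":
        return "EBV"
    if name.startswith("HLA-"):
        return "HLA"
    # Check the decoy suffix before the chrUn_ prefix, since decoy
    # contigs may also start with chrUn_.
    if name.endswith("_decoy"):
        return "decoy"
    if name.endswith("_alt"):
        return "alt"
    if name.endswith("_random"):
        return "random"
    if name.startswith("chrUn_"):
        return "unplaced"
    return "other"


def summarize_fai(fai_lines):
    """Count contigs per category; the contig name is column 1 of a .fai file."""
    return Counter(
        classify_contig(line.split("\t")[0])
        for line in fai_lines
        if line.strip()
    )
```

With an index on disk, `summarize_fai(open(path_to_fai))` would return the per-category counts, where `path_to_fai` is whatever path the index was written to.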


Preprocessing 🔵

Output 🏆

The quality control metrics detailed in *.alfred.qc.json.gz can be viewed interactively by uploading the file to the Alfred | GEAR Genomics GUI. Alternatively, a broad set of alignment QC metrics is tabulated in *.alfred.qc.summary.txt, which can be easily aggregated per sample for data analysis.

Tools 🛠️


Germline 🟢

Output 🏆

The output germline VCF contains both SNPs and InDels, filtered with DeepVariant's default, award-winning machine-learning-based filtering. The admixture estimates are found in *.fastngsadmix.23.qopt. In this file, the first row holds the names of the reference populations, the second row holds the converged estimates, and the following 100 rows hold the bootstrap estimates.
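The row layout described above can be read with a few lines of Python. This is a minimal sketch, assuming the file is whitespace-delimited with one column per reference population. The function names and the population labels in the usage example are hypothetical.

```python
def parse_qopt(text: str):
    """Split a .qopt file into population names, converged estimates,
    and bootstrap rows, following the row layout described above.

    Assumes whitespace-delimited columns, one per reference population.
    """
    rows = [line.split() for line in text.splitlines() if line.strip()]
    populations = rows[0]                       # row 1: population names
    converged = dict(zip(populations, (float(x) for x in rows[1])))  # row 2
    bootstraps = [[float(x) for x in row] for row in rows[2:]]       # rest
    return populations, converged, bootstraps


def bootstrap_interval(populations, bootstraps, lower=0.025, upper=0.975):
    """Rough percentile interval per population from the bootstrap rows."""
    intervals = {}
    for i, pop in enumerate(populations):
        values = sorted(row[i] for row in bootstraps)
        last = len(values) - 1
        intervals[pop] = (values[round(lower * last)], values[round(upper * last)])
    return intervals
```

For example, with two made-up populations and two bootstrap rows:

```python
demo = "PopA PopB\n0.75 0.25\n0.70 0.30\n0.80 0.20\n"
pops, converged, boots = parse_qopt(demo)
intervals = bootstrap_interval(pops, boots)
```

A real file would carry 100 bootstrap rows, so the interval would be considerably more informative than in this toy input.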

Tools 🛠️


Somatic ⚫️

Output 🏆

The consensus SNVs and InDels are output in mutation tables. These tab-delimited *.hq.union.consensus.somatic.[snv|indel].txt.gz files contain the union of all variants from all callers, with consensus records annotated and caller-specific VCF metrics reported. No filters are applied beyond the defaults recommended by each caller. For each consensus record, the median VAF across all reporting callers is output.
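The median VAF calculation can be illustrated as follows. This is a sketch only: the caller labels are placeholders, and representing a caller that did not report the variant as `None` is an assumption rather than the table's actual encoding.

```python
from statistics import median
from typing import Dict, Optional


def median_vaf(caller_vafs: Dict[str, Optional[float]]) -> Optional[float]:
    """Median VAF across the callers that reported the variant.

    Callers that did not report the variant are given as None
    (an assumed encoding) and are excluded from the median.
    """
    reported = [vaf for vaf in caller_vafs.values() if vaf is not None]
    return median(reported) if reported else None
```

For instance, `median_vaf({"callerA": 0.25, "callerB": 0.5, "callerC": None})` takes the median over the two reporting callers only.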

The consensus SVs are output in BEDPE format. This paired-breakpoint *.hq.union.consensus.somatic.sv.bedpe file contains the union of all variants from all callers, with consensus breakpoints merged using a comprehensive coordinate-based merge / join of junctions that lie within 1000bp of each other. The breakpoints from IgCaller that are output in *.igcaller.oncogenic.rearrangements.tsv are integrated using a simple filter of Score >= 5 & 'Reads in normal' <= 4 & 'Count in PoN' <= 5. The DUP and DEL breakpoints from Manta, DELLY2, and SvABA are post-filtered to reduce false positives using duphold, which annotates these records with the change in depth of coverage across the breakpoints. Records are then filtered out if DHFFC>0.7 for DEL or DHBFC<1.3 for DUP.
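The three rules in this paragraph can be written as simple predicates. The sketch below makes several assumptions that the text does not confirm: both ends of a junction must fall within the window, the 1000bp bound is inclusive, strand is ignored, and the duphold thresholds are read as exclusion criteria (DEL records with DHFFC > 0.7 and DUP records with DHBFC < 1.3 are dropped). All function names are hypothetical.

```python
from typing import Optional, Tuple

# A junction is (chrom1, pos1, chrom2, pos2); this layout is illustrative.
Junction = Tuple[str, int, str, int]


def junctions_match(a: Junction, b: Junction, window: int = 1000) -> bool:
    """True if both breakpoints of two junctions lie within `window` bp.

    Assumes the same chromosome pair is required and the bound is inclusive.
    """
    return (
        a[0] == b[0]
        and a[2] == b[2]
        and abs(a[1] - b[1]) <= window
        and abs(a[3] - b[3]) <= window
    )


def keep_igcaller(record: dict) -> bool:
    """Apply the IgCaller filter quoted above to one row of the TSV."""
    return (
        record["Score"] >= 5
        and record["Reads in normal"] <= 4
        and record["Count in PoN"] <= 5
    )


def keep_after_duphold(
    svtype: str, dhffc: Optional[float] = None, dhbfc: Optional[float] = None
) -> bool:
    """Keep a record unless the duphold annotation argues against it.

    Assumed reading: DEL with DHFFC > 0.7 and DUP with DHBFC < 1.3 are removed;
    other SV types and unannotated records pass through unchanged.
    """
    if svtype == "DEL" and dhffc is not None:
        return dhffc <= 0.7
    if svtype == "DUP" and dhbfc is not None:
        return dhbfc >= 1.3
    return True
```

For example, `junctions_match(("chr8", 127_735_000, "chr14", 105_860_000), ("chr8", 127_735_400, "chr14", 105_860_900))` evaluates to `True` because both ends differ by less than 1000bp. The coordinates are illustrative only.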

The consensus CNVs are output in BED format. This segmented *.hq.union.consensus.somatic.cnv.bed file contains the union coverage of copy number segments from both callers. Specifically, its columns report the total copy number and the major/minor allele copy number per tool at each potential segment.
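One way to picture the union coverage of segments from two callers is to split a chromosome at every segment boundary reported by either caller and record each caller's copy number on the resulting pieces. The sketch below illustrates that idea under stated assumptions: segments are half-open `(start, end, total_cn)` tuples on a single chromosome, and they do not overlap within a caller. It is not necessarily how the pipeline builds the BED file, and the function name is hypothetical.

```python
from typing import List, Optional, Tuple

Segment = Tuple[int, int, int]  # (start, end, total copy number), half-open


def union_segments(
    segs_a: List[Segment], segs_b: List[Segment]
) -> List[Tuple[int, int, Optional[int], Optional[int]]]:
    """Split at every boundary from either caller and report both copy numbers.

    Returns (start, end, cn_a, cn_b) tuples; a copy number is None where
    that caller has no segment covering the piece.
    """
    boundaries = sorted({p for s, e, _ in segs_a + segs_b for p in (s, e)})

    def lookup(segs: List[Segment], start: int, end: int) -> Optional[int]:
        # Each piece lies entirely inside at most one segment per caller.
        for s, e, cn in segs:
            if s <= start and end <= e:
                return cn
        return None

    pieces = []
    for start, end in zip(boundaries, boundaries[1:]):
        cn_a = lookup(segs_a, start, end)
        cn_b = lookup(segs_b, start, end)
        if cn_a is None and cn_b is None:
            continue  # gap covered by neither caller
        pieces.append((start, end, cn_a, cn_b))
    return pieces
```

Calling `union_segments([(0, 100, 2), (100, 200, 3)], [(50, 150, 1)])` yields four pieces, with `None` for the second caller outside the 50 to 150 range. The same approach extends to major and minor allele copy numbers by carrying more fields per segment.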

Tools 🛠️