Directory Structure

PROJECT
├── BWA
│   ├── bwa_bam_config.yaml
│   ├── logs
│   ├── SAMPLE1
│   │   ├── SMP-001-N
│   │   └── SMP-001-T
│   └── SMP-002
│       ├── SMP-002-N
│       ├── SMP-002-T1
│       └── SMP-002-T2
├── GATK
│   ├── gatk_bam_config.yaml
│   ├── logs
│   ├── SAMPLE1
│   │   ├── SMP-001-N
│   │   └── SMP-001-T
│   └── SMP-002
│       ├── SMP-002-N
│       ├── SMP-002-T1
│       └── SMP-002-T2
├── BAMQC
│   ├── Coverage
│   │   ├── date_PROJECTNAME_Coverage_summary.tsv
│   │   ├── date_PROJECTNAME_total_bases_covered.tsv
│   │   ├── logs
│   │   ├── SAMPLE1
│   │   │   ├── CallableBases.tsv
│   │   │   ├── SMP-001-N_DepthOfCoverage.*
│   │   │   └── SMP-001-T_DepthOfCoverage.*
│   │   └── SMP-002
│   │       ├── CallableBases.tsv
│   │       ├── SMP-002-N_DepthOfCoverage.*
│   │       ├── SMP-002-T1_DepthOfCoverage.*
│   │       └── SMP-002-T2_DepthOfCoverage.*
│   ├── SequenceMetrics
│   │   ├── date_PROJECTNAME_ContaminationEstimates.tsv
│   │   ├── date_PROJECTNAME_SequenceArtefacts.tsv
│   │   ├── date_PROJECTNAME_CoverageHistogram.tsv
│   │   ├── logs
│   │   ├── SAMPLE1
│   │   │   ├── SMP-001-N_{contamination.table,*metrics}
│   │   │   └── SMP-001-T_{contamination.table,*metrics}
│   │   └── SMP-002
│   │       ├── SMP-002-N_{contamination.table,*metrics}
│   │       ├── SMP-002-T1_{contamination.table,*metrics}
│   │       └── SMP-002-T2_{contamination.table,*metrics}
└── logs
    └── run_DNA_pipeline_timestamp
        ├── pughlab_dna_pipeline__run_bwa
        ├── pughlab_dna_pipeline__run_gatk
        ├── pughlab_dna_pipeline__run_coverage
        └── pughlab_dna_pipeline__run_run_qc
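
The layout above can be walked programmatically. The following is a minimal sketch, assuming that every subdirectory of a tool directory (for example BWA) other than `logs` is a patient directory and that each of its subdirectories is a sample. The function name `list_samples` is hypothetical and is not part of the pipeline.

```python
from pathlib import Path


def list_samples(tool_dir):
    """Map each patient directory under tool_dir to its sorted sample names.

    Assumes the layout tool_dir/<patient>/<sample>, plus a 'logs'
    directory and config files that are not patients.
    """
    samples = {}
    for patient in sorted(Path(tool_dir).iterdir()):
        # Skip plain files (e.g. the YAML config) and the logs directory.
        if not patient.is_dir() or patient.name == "logs":
            continue
        samples[patient.name] = sorted(
            child.name for child in patient.iterdir() if child.is_dir()
        )
    return samples
```

Calling `list_samples("PROJECT/BWA")` on the tree shown here would be expected to return one entry per patient directory, each listing its normal and tumour sample directories.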

Final outputs

preprocessing:

collect_fastqc_metrics.pl
- will output merged metrics from each FASTQ file (total sequences, sequence length, GC content)
bwa.pl
- will output merged (duplicate-marked), sorted, BWA-aligned BAMs
- will output bwa_bam_config.yaml for use with gatk.pl
gatk.pl
- will output indel-realigned, base quality score recalibrated BAMs
- will output gatk_bam_config.yaml for use with downstream scripts

qc:

contest.pl
- will use collect_contest_output.R to collect contamination estimates from all processed samples
- output includes: DATE_projectname_ContEst_output.tsv (combined META and READGROUP estimates)
get_sequencing_metrics.pl
- will use collect_sequencing_metrics.R to collect contamination estimates and summarize sequencing artefacts from all processed samples
- output includes: DATE_projectname_ContaminationEstimates.tsv, DATE_projectname_AlignmentMetrics.tsv, DATE_projectname_CoverageHistogram.tsv and DATE_projectname_SequenceArtefacts.tsv
- optionally: DATE_projectname_WGSMetrics.tsv (WGS only)
get_coverage.pl
- will use collect_coverage_output.R to collect depth of coverage metrics from all processed samples
- output includes:
  - DATE_projectname_Coverage_summary.tsv (summary metrics including mean and median depth, and % of bases with depth above 15)
  - DATE_projectname_Coverage_statistics.tsv (number of bases at each read depth, from 0 to 500)
- will use count_callable_bases.R to collect total callable base details:
  - DATE_projectname_total_bases_covered.tsv
  - DATE_projectname_CallableBases.RData
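
To illustrate how the two coverage tables relate, here is a minimal sketch that derives the summary metrics (mean, median, percent of bases above a depth of 15) from per-depth base counts like those in the statistics table. This is not the code in collect_coverage_output.R. The function name `summarize_depth`, the dict input and the strict "greater than" boundary are all assumptions made for illustration.

```python
def summarize_depth(hist, threshold=15):
    """Summarize a depth histogram given as {depth: number_of_bases}."""
    total = sum(hist.values())
    if total == 0:
        raise ValueError("histogram contains no bases")

    mean = sum(depth * n for depth, n in hist.items()) / total

    # Median (lower median): the smallest depth at which the cumulative
    # base count reaches at least half of all bases.
    cumulative = 0
    median = None
    for depth in sorted(hist):
        cumulative += hist[depth]
        if cumulative * 2 >= total:
            median = depth
            break

    # Assumption: "above" means strictly greater than the threshold.
    above = sum(n for depth, n in hist.items() if depth > threshold)
    return {"mean": mean, "median": median, "pct_above": 100.0 * above / total}
```

For example, `summarize_depth({0: 10, 10: 40, 20: 50})` gives a mean of 14.0, a median of 10 and 50.0% of bases above 15.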

variant_calling:

haplotype_caller.pl
- will output .g.vcf files for each sample
genotype_gvcfs.pl
- will output a multi-sample VCF, filtered either by variant quality score recalibration (VQSR) or by hard thresholds
- will output annotated per-sample VCFs if requested (vcf2maf)
- will run filter_germline_genotypes.pl to extract probable germline variants for each patient
- will use collect_germline_genotypes.R to collect genotypes from all processed samples
- output includes:
  - DATE_projectname_germline_genotypes.tsv (position x sample matrix)
  - DATE_projectname_germline_correlation.tsv (sample x sample matrix)
annotate_germline.pl, mutect.pl, mutect2.pl, varscan.pl, vardict.pl, somaticsniper.pl, strelka.pl and pindel.pl
- will use collect_snv_output.R to collect SNV/INDEL calls from all processed samples
- output includes:
  - DATE_projectname_mutations_for_cbioportal.tsv (SNV and INDEL calls in the format required by cBioPortal)
run_sequenza_with_optimal_gamma.pl
- will use collect_sequenza_output.R to collect optimized CNV calls from all processed samples
- output includes:
  - DATE_projectname_Sequenza_ploidy_purity.tsv (best purity and ploidy estimates)
  - DATE_projectname_Sequenza_cna_gene_matrix.tsv (thresholded CN status; gene x patient matrix; a gene is considered to have a CNA if >20 bases overlap with a discovered segment)
  - DATE_projectname_Sequenza_ratio_gene_matrix.tsv (log2(depth.ratio); gene x patient matrix; a gene is considered to have a CNA if >20 bases overlap with a discovered segment)
  - DATE_projectname_segments_for_gistic.tsv and DATE_projectname_markers_for_gistic.tsv (log2(depth.ratio) for input to GISTIC2.0)
  - DATE_projectname_segments_for_cbioportal.tsv (log2(depth.ratio) formatted for cBioPortal)
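
The ">20 bases overlap" rule used for the gene matrices can be sketched as follows. This is an illustration only, not the logic in collect_sequenza_output.R. It assumes 1-based, inclusive coordinates, and the function names and tuple layouts are hypothetical.

```python
def overlap_bases(gene_start, gene_end, seg_start, seg_end):
    """Number of bases shared by two 1-based, inclusive intervals."""
    return max(0, min(gene_end, seg_end) - max(gene_start, seg_start) + 1)


def gene_has_cna(gene, segments, min_overlap=20):
    """Return True if any segment on the gene's chromosome overlaps it
    by more than min_overlap bases.

    gene is (chrom, start, end); segments is an iterable of
    (chrom, start, end) tuples (assumed layout).
    """
    chrom, start, end = gene
    return any(
        seg_chrom == chrom
        and overlap_bases(start, end, seg_start, seg_end) > min_overlap
        for seg_chrom, seg_start, seg_end in segments
    )
```

With a gene at chr1:100-200, a segment at chr1:181-300 shares exactly 20 bases and does not count, whereas a segment at chr1:180-300 shares 21 bases and does.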
gatk_cnv.pl
- will use collect_gatk_cnv_output.R to collect SCNV calls from all processed samples
- output includes:
  - DATE_projectname_gatk_pga_estimates.tsv
  - DATE_projectname_gatk_ratio_gene_matrix.tsv (log2(depth.ratio); gene x patient matrix; a gene is considered to have a CNA if >20 bases overlap with a discovered segment)
  - DATE_projectname_gatk_cnv_for_gistic.tsv (log2(depth.ratio) for input to GISTIC2.0)
  - DATE_projectname_gatk_cnv_for_cbioportal.tsv (log2(depth.ratio) formatted for cBioPortal)
ascat.pl
- will use collect_ascat_output.R to collect SCNV calls from all processed samples:
  - DATE_projectname_ascat_purity_ploidy.tsv (tumour fraction and ploidy output from all samples)
  - DATE_projectname_ascat_cna_gene_matrix.tsv (thresholded CN status; gene x patient matrix)
  - DATE_projectname_ascat_ratio_gene_matrix.tsv (log2(depth.ratio); gene x patient matrix)
ichor_cna.pl
- will use collect_ichorCNA_output.R to collect tumour fraction and ploidy output from all samples: DATE_projectname_ichorCNA_estimates.tsv
- will use collect_ichorCNA_output.R to collect per-bin copy-number values from all samples: DATE_projectname_perbin_cna_status.tsv
panelcn_mops.pl
- will use collect_panelCN_mops_output.R to collect all output (long format) from all samples: DATE_projectname_panelCN.mops_output.tsv
- will use collect_panelCN_mops_output.R to collect normalized read counts and copy-number values from all samples (wide format): DATE_projectname_panelCN.mops_cna_matrix.tsv, DATE_projectname_panelCN.mops_normalized_readcounts.tsv
delly.pl, manta.pl, novobreak.pl, svict.pl and mavis.pl
- will use collect_mavis_output.R to collect SV calls from all samples
- output includes:
  - DATE_projectname_mavis_output.tsv (concatenated output across samples)
  - DATE_projectname_svs_for_cbioportal.tsv (SVs in the format required by cBioPortal)
msi_sensor.pl
- will use collect_msi_estimates.R to collect MSI output from all samples: DATE_projectname_msi_estimates.tsv

summarize:

ensemble mutation calls: DATE_projectname_ensemble_mutation_data.tsv
visualizations for:
- qc (summary plots and concerns for manual review)
- mutation landscape: SNVs (somatic and germline), SVs (somatic and germline [if available]) and SCNAs
- mutation signatures (single-base substitutions, HRD)
detailed methods (methods.tex)

create_report:

final Report.pdf
