Directory Structure
sprokopec edited this page May 15, 2024 · 6 revisions
PROJECT
├── BWA
│ ├── bwa_bam_config.yaml
│ ├── logs
│ ├── SAMPLE1
│ │ ├── SMP-001-N
│ │ └── SMP-001-T
│ └── SMP-002
│   ├── SMP-002-N
│   ├── SMP-002-T1
│   └── SMP-002-T2
├── GATK
│ ├── gatk_bam_config.yaml
│ ├── logs
│ ├── SAMPLE1
│ │ ├── SMP-001-N
│ │ └── SMP-001-T
│ └── SMP-002
│   ├── SMP-002-N
│   ├── SMP-002-T1
│   └── SMP-002-T2
├── BAMQC
│ ├── Coverage
│ │ ├── date_PROJECTNAME_Coverage_summary.tsv
│ │ ├── date_PROJECTNAME_total_bases_covered.tsv
│ │ ├── logs
│ │ ├── SAMPLE1
│ │ │ ├── CallableBases.tsv
│ │ │ ├── SMP-001-N_DepthOfCoverage.*
│ │ │ └── SMP-001-T_DepthOfCoverage.*
│ │ └── SMP-002
│ │   ├── CallableBases.tsv
│ │   ├── SMP-002-N_DepthOfCoverage.*
│ │   ├── SMP-002-T1_DepthOfCoverage.*
│ │   └── SMP-002-T2_DepthOfCoverage.*
│ ├── SequenceMetrics
│ │ ├── date_PROJECTNAME_ContaminationEstimates.tsv
│ │ ├── date_PROJECTNAME_SequenceArtefacts.tsv
│ │ ├── date_PROJECTNAME_CoverageHistogram.tsv
│ │ ├── logs
│ │ ├── SAMPLE1
│ │ │ ├── SMP-001-N_{contamination.table,*metrics}
│ │ │ └── SMP-001-T_{contamination.table,*metrics}
│ │ └── SMP-002
│ │   ├── SMP-002-N_{contamination.table,*metrics}
│ │   ├── SMP-002-T1_{contamination.table,*metrics}
│ │   └── SMP-002-T2_{contamination.table,*metrics}
└── logs
  └── run_DNA_pipeline_timestamp
    ├── pughlab_dna_pipeline__run_bwa
    ├── pughlab_dna_pipeline__run_gatk
    ├── pughlab_dna_pipeline__run_coverage
    └── pughlab_dna_pipeline__run_run_qc
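
The tree above nests per-sample directories as PROJECT/TOOL/PATIENT/SAMPLE. As a rough illustration only (not part of the pipeline), the Python sketch below enumerates the sample directories that layout implies. The function name `expected_sample_dirs` and the patient-to-sample mapping are hypothetical; the IDs are copied from the tree.

```python
from pathlib import Path


def expected_sample_dirs(project_root, patients, tools=("BWA", "GATK")):
    """List the per-sample directories implied by the layout above.

    `patients` maps a patient ID to its sample IDs (a hypothetical input,
    not a file the pipeline produces).
    """
    root = Path(project_root)
    return [
        root / tool / patient / sample
        for tool in tools
        for patient, samples in patients.items()
        for sample in samples
    ]


# Example mapping using the IDs shown in the tree.
patients = {"SMP-002": ["SMP-002-N", "SMP-002-T1", "SMP-002-T2"]}
dirs = expected_sample_dirs("PROJECT", patients)
for d in dirs:
    print(d.as_posix())
```

Comparing this list against what exists on disk (for example with `Path.is_dir()`) is one way to spot samples that did not finish a given step.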
preprocessing:
- collect_fastqc_metrics.pl
  - will output merged metrics from each fastq file (total sequences, sequence length, GC content)
- bwa.pl
  - will output merged (duplicate-marked), sorted, bwa-aligned BAMs and bwa_bam_config.yaml for use with gatk.pl
- gatk.pl
  - will output indel-realigned, base-quality score recalibrated BAMs and gatk_bam_config.yaml for use with downstream scripts
qc:
- contest.pl
  - will use collect_contest_output.R to collect contamination estimates from all processed samples
  - output includes: DATE_projectname_ContEst_output.tsv (combined META and READGROUP estimates)
- get_sequencing_metrics.pl
  - will use collect_sequencing_metrics.R to collect contamination estimates and summarize sequencing artefacts from all processed samples
  - output includes: DATE_projectname_ContaminationEstimates.tsv, DATE_projectname_AlignmentMetrics.tsv, DATE_projectname_CoverageHistogram.tsv and DATE_projectname_SequenceArtefacts.tsv
  - optionally: DATE_projectname_WGSMetrics.tsv (WGS only)
- get_coverage.pl
  - will use collect_coverage_output.R to collect depth of coverage metrics from all processed samples
  - output includes:
    - DATE_projectname_Coverage_summary.tsv (summary metrics including mean, median, % above 15)
    - DATE_projectname_Coverage_statistics.tsv (N bases with X read depth [from 0 to 500])
  - will use count_callable_bases.R to collect total callable base details:
    - DATE_projectname_total_bases_covered.tsv
    - DATE_projectname_CallableBases.RData
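
The combined QC tables above all follow the DATE_projectname_NAME.tsv naming pattern. The Python sketch below shows one possible way to pick up the most recent copy of such a table and read it. It assumes the DATE prefix sorts chronologically as text (for example YYYY-MM-DD), which this page does not state, and the helper names `latest_table` and `read_tsv` are hypothetical.

```python
import csv
from pathlib import Path


def latest_table(directory, suffix):
    """Return the lexicographically last file ending in `_<suffix>`, or None.

    Assumes the DATE prefix sorts chronologically as plain text.
    """
    matches = sorted(Path(directory).glob(f"*_{suffix}"))
    return matches[-1] if matches else None


def read_tsv(path):
    """Read a tab-separated file into a list of dicts keyed by the header row."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Directory name taken from the tree at the top of this page.
table = latest_table("PROJECT/BAMQC/Coverage", "Coverage_summary.tsv")
rows = read_tsv(table) if table is not None else []
print(f"{len(rows)} rows read from {table}")
```

No column names are assumed here; the rows are keyed by whatever header the table contains.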
variant_calling:
- haplotype_caller.pl
  - will output .g.vcf files for each sample
- genotype_gvcfs.pl
  - will output multi-sample, variant quality score recalibrated (VQSR) VCF OR hard-threshold filtered VCF
  - will output annotated per-sample VCFs if requested (vcf2maf)
  - will run filter_germline_genotypes.pl to extract probable germline variants for each patient
  - will use collect_germline_genotypes.R to collect genotypes from all processed samples
  - output includes:
    - DATE_projectname_germline_genotypes.tsv (position x sample matrix)
    - DATE_projectname_germline_correlation.tsv (sample x sample matrix)
- annotate_germline.pl, mutect.pl, mutect2.pl, varscan.pl, vardict.pl, somaticsniper.pl, strelka.pl and pindel.pl
  - will use collect_snv_output.R to collect SNV/INDEL calls from all processed samples
  - output includes:
    - DATE_projectname_mutations_for_cbioportal.tsv (SNV and INDEL calls in format required by cBioPortal)
- run_sequenza_with_optimal_gamma.pl
  - will use collect_sequenza_output.R to collect optimized CNV calls from all processed samples
  - output includes:
    - DATE_projectname_Sequenza_ploidy_purity.tsv (best purity and ploidy estimates)
    - DATE_projectname_Sequenza_cna_gene_matrix.tsv (thresholded CN status; gene x patient matrix; a gene is considered to have a CNA if >20 bases overlap with a discovered segment)
    - DATE_projectname_Sequenza_ratio_gene_matrix.tsv (log2(depth.ratio); gene x patient matrix; a gene is considered to have a CNA if >20 bases overlap with a discovered segment)
    - DATE_projectname_segments_for_gistic.tsv and DATE_projectname_markers_for_gistic.tsv (log2(depth.ratio) for input to GISTIC2.0)
    - DATE_projectname_segments_for_cbioportal.tsv (log2(depth.ratio) formatted for cBioPortal)
- gatk_cnv.pl
  - will use collect_gatk_cnv_output.R to collect SCNV calls from all processed samples
  - output includes:
    - DATE_projectname_gatk_pga_estimates.tsv
    - DATE_projectname_gatk_ratio_gene_matrix.tsv (log2(depth.ratio); gene x patient matrix; a gene is considered to have a CNA if >20 bases overlap with a discovered segment)
    - DATE_projectname_gatk_cnv_for_gistic.tsv (log2(depth.ratio) for input to GISTIC2.0)
    - DATE_projectname_gatk_cnv_for_cbioportal.tsv (log2(depth.ratio) formatted for cBioPortal)
- ascat.pl
  - will use collect_ascat_output.R to collect SCNV calls from all processed samples:
    - DATE_projectname_ascat_purity_ploidy.tsv (tumour fraction and ploidy output from all samples)
    - DATE_projectname_ascat_cna_gene_matrix.tsv (thresholded CN status; gene x patient matrix)
    - DATE_projectname_ascat_ratio_gene_matrix.tsv (log2(depth.ratio); gene x patient matrix)
- ichor_cna.pl
  - will use collect_ichorCNA_output.R to collect tumour fraction and ploidy output from all samples: DATE_projectname_ichorCNA_estimates.tsv
  - will use collect_ichorCNA_output.R to collect per-bin copy-number values from all samples: DATE_projectname_perbin_cna_status.tsv
- panelcn_mops.pl
  - will use collect_panelCN_mops_output.R to collect all output (long format) from all samples: DATE_projectname_panelCN.mops_output.tsv
  - will use collect_panelCN_mops_output.R to collect normalized read counts and copy-number values from all samples (wide format): DATE_projectname_panelCN.mops_cna_matrix.tsv, DATE_projectname_panelCN.mops_normalized_readcounts.tsv
- delly.pl, manta.pl, novobreak.pl, svict.pl and mavis.pl
  - will use collect_mavis_output.R to collect SV calls from all samples
  - output includes:
    - DATE_projectname_mavis_output.tsv (concatenated output across samples)
    - DATE_projectname_svs_for_cbioportal.tsv (SVs in format required by cBioPortal)
- msi_sensor.pl
  - will use collect_msi_estimates.R to collect MSI output from all samples: DATE_projectname_msi_estimates.tsv
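
Several of the gene matrices above state that a gene is considered to have a CNA if more than 20 bases overlap a discovered segment. The Python sketch below illustrates that rule only; it is not the code used by the R collector scripts. It assumes closed, 1-based coordinates, and the gene names, positions and function names are made up for illustration.

```python
def overlap_bases(gene_start, gene_end, seg_start, seg_end):
    """Number of bases shared by two closed, 1-based intervals (assumed convention)."""
    return max(0, min(gene_end, seg_end) - max(gene_start, seg_start) + 1)


def genes_with_cna(genes, segments, min_overlap=20):
    """Return names of genes overlapping a same-chromosome segment by more than `min_overlap` bases."""
    hits = []
    for name, chrom, start, end in genes:
        for seg_chrom, seg_start, seg_end in segments:
            if chrom == seg_chrom and overlap_bases(start, end, seg_start, seg_end) > min_overlap:
                hits.append(name)
                break  # one qualifying segment is enough
    return hits


# Hypothetical genes (name, chromosome, start, end) and segments (chromosome, start, end).
genes = [("GENE_A", "chr1", 1000, 2000), ("GENE_B", "chr1", 5000, 6000)]
segments = [("chr1", 1900, 4000), ("chr1", 5990, 7000)]
print(genes_with_cna(genes, segments))  # ['GENE_A']
```

GENE_A shares 101 bases with the first segment and is reported; GENE_B shares only 11 bases with the second segment and falls below the threshold.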
summarize:
- ensemble mutation calls: DATE_projectname_ensemble_mutation_data.tsv
- visualizations for:
  - QC (summary plots and concerns for manual review)
  - mutation landscape plots for SNV (somatic and germline), SV (somatic and germline [if available]) and SCNAs
  - mutation signature plots (single-base substitutions, HRD)
- detailed methods (methods.tex)
create_report:
- final Report.pdf