Directory Structure

sprokopec edited this page May 15, 2024 · 6 revisions

PROJECT
├── BWA
│   ├── bwa_bam_config.yaml
│   ├── logs
│   ├── SMP-001
│   │   ├── SMP-001-N
│   │   └── SMP-001-T
│   └── SMP-002
│       ├── SMP-002-N
│       ├── SMP-002-T1
│       └── SMP-002-T2
├── GATK
│   ├── gatk_bam_config.yaml
│   ├── logs
│   ├── SMP-001
│   │   ├── SMP-001-N
│   │   └── SMP-001-T
│   └── SMP-002
│       ├── SMP-002-N
│       ├── SMP-002-T1
│       └── SMP-002-T2
├── BAMQC
│   ├── Coverage
│   │   ├── date_PROJECTNAME_Coverage_summary.tsv
│   │   ├── date_PROJECTNAME_total_bases_covered.tsv
│   │   ├── logs
│   │   ├── SMP-001
│   │   │   ├── CallableBases.tsv
│   │   │   ├── SMP-001-N_DepthOfCoverage.*
│   │   │   └── SMP-001-T_DepthOfCoverage.*
│   │   └── SMP-002
│   │       ├── CallableBases.tsv
│   │       ├── SMP-002-N_DepthOfCoverage.*
│   │       ├── SMP-002-T1_DepthOfCoverage.*
│   │       └── SMP-002-T2_DepthOfCoverage.*
│   ├── SequenceMetrics
│   │   ├── date_PROJECTNAME_ContaminationEstimates.tsv
│   │   ├── date_PROJECTNAME_SequenceArtefacts.tsv
│   │   ├── date_PROJECTNAME_CoverageHistogram.tsv
│   │   ├── logs
│   │   ├── SMP-001
│   │   │   ├── SMP-001-N_{contamination.table,*metrics}
│   │   │   └── SMP-001-T_{contamination.table,*metrics}
│   │   └── SMP-002
│   │       ├── SMP-002-N_{contamination.table,*metrics}
│   │       ├── SMP-002-T1_{contamination.table,*metrics}
│   │       └── SMP-002-T2_{contamination.table,*metrics}
└── logs
    └── run_DNA_pipeline_timestamp
        ├── pughlab_dna_pipeline__run_bwa
        ├── pughlab_dna_pipeline__run_gatk
        ├── pughlab_dna_pipeline__run_coverage
        └── pughlab_dna_pipeline__run_run_qc
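
The patient/sample nesting shown in the tree can be traversed programmatically. The following is a minimal sketch, not part of the pipeline itself: the function name `list_samples` is hypothetical, and it assumes only what the tree shows, namely that each step directory (e.g. BWA) holds one sub-directory per patient plus a `logs` directory and a config file.

```python
from pathlib import Path


def list_samples(project_dir, step="BWA"):
    """Map each patient directory under PROJECT/<step> to its sample directories.

    Hypothetical helper: assumes the layout drawn in the tree above.
    """
    step_dir = Path(project_dir) / step
    samples = {}
    for patient in sorted(step_dir.iterdir()):
        # Skip the YAML config (a file) and the logs directory.
        if not patient.is_dir() or patient.name == "logs":
            continue
        samples[patient.name] = sorted(
            child.name for child in patient.iterdir() if child.is_dir()
        )
    return samples
```

For the tree above, `list_samples("PROJECT")` would map `SMP-002` to `["SMP-002-N", "SMP-002-T1", "SMP-002-T2"]`.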

Final outputs

preprocessing:

  • collect_fastqc_metrics.pl
    • will output merged metrics from each fastq file (total sequences, sequence length, GC content)
  • bwa.pl
    • will output merged (duplicate-marked), sorted, bwa-aligned BAMs and bwa_bam_config.yaml for use with gatk.pl
  • gatk.pl
    • will output indel-realigned, base quality score recalibrated BAMs and gatk_bam_config.yaml for use with downstream scripts

qc:

  • contest.pl
    • will use collect_contest_output.R to collect contamination estimates from all processed samples
    • output includes: DATE_projectname_ContEst_output.tsv (combined META and READGROUP estimates)
  • get_sequencing_metrics.pl
    • will use collect_sequencing_metrics.R to collect contamination estimates and summarize sequencing artefacts from all processed samples
    • output includes: DATE_projectname_ContaminationEstimates.tsv, DATE_projectname_AlignmentMetrics.tsv, DATE_projectname_CoverageHistogram.tsv and DATE_projectname_SequenceArtefacts.tsv
    • optionally: DATE_projectname_WGSMetrics.tsv (WGS only)
  • get_coverage.pl
    • will use collect_coverage_output.R to collect depth of coverage metrics from all processed samples
    • output includes:
      • DATE_projectname_Coverage_summary.tsv (summary metrics including mean, median, % of bases above 15x)
      • DATE_projectname_Coverage_statistics.tsv (N bases with X read depth [from 0 to 500])
    • will use count_callable_bases.R to collect total callable base details:
      • DATE_projectname_total_bases_covered.tsv
      • DATE_projectname_CallableBases.RData
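
To illustrate how the summary metrics relate to the depth histogram (N bases with X read depth), here is a minimal sketch. It is not the logic of collect_coverage_output.R: the function name is hypothetical, the histogram is assumed to be a plain list indexed by depth, "above 15" is taken to mean strictly greater than 15, and the median is the lower median.

```python
def summarize_depth_histogram(counts, threshold=15):
    """Summarize a depth histogram where counts[d] = number of bases at depth d.

    Hypothetical helper; input layout and strict '>' threshold are assumptions.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram contains no bases")

    # Mean depth: depth-weighted count divided by the number of bases.
    mean = sum(depth * n for depth, n in enumerate(counts)) / total

    # Median depth: first depth at which the cumulative count reaches half the bases.
    cumulative = 0
    median = 0
    for depth, n in enumerate(counts):
        cumulative += n
        if cumulative >= total / 2:
            median = depth
            break

    # Percentage of bases with depth strictly above the threshold.
    above = sum(n for depth, n in enumerate(counts) if depth > threshold)
    return {"mean": mean, "median": median, "pct_above": 100 * above / total}
```

For example, a histogram with one base at depth 10 and three bases at depth 20 gives a mean of 17.5, a median of 20 and 75% of bases above 15.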

variant_calling:

  • haplotype_caller.pl
    • will output .g.vcf files for each sample
  • genotype_gvcfs.pl
    • will output multi-sample, variant quality score recalibrated (VQSR) VCF OR hard-threshold filtered VCF
    • will output annotated per-sample VCFs if requested (vcf2maf)
    • will run filter_germline_genotypes.pl to extract probable germline variants for each patient
    • will use collect_germline_genotypes.R to collect genotypes from all processed samples
    • output includes:
      • DATE_projectname_germline_genotypes.tsv (position x sample matrix)
      • DATE_projectname_germline_correlation.tsv (sample x sample matrix)
  • annotate_germline.pl, mutect.pl, mutect2.pl, varscan.pl, vardict.pl, somaticsniper.pl, strelka.pl and pindel.pl
    • will use collect_snv_output.R to collect SNV/INDEL calls from all processed samples
    • output includes:
      • DATE_projectname_mutations_for_cbioportal.tsv (SNV and INDEL calls in format required by cBioPortal)
  • run_sequenza_with_optimal_gamma.pl
    • will use collect_sequenza_output.R to collect optimized CNV calls from all processed samples
    • output includes:
      • DATE_projectname_Sequenza_ploidy_purity.tsv (best purity and ploidy estimates)
      • DATE_projectname_Sequenza_cna_gene_matrix.tsv (thresholded CN status; gene x patient matrix; a gene is considered to have a CNA if >20 bases overlap with a discovered segment)
      • DATE_projectname_Sequenza_ratio_gene_matrix.tsv (log2(depth.ratio); gene x patient matrix; a gene is considered to have a CNA if >20 bases overlap with a discovered segment)
      • DATE_projectname_segments_for_gistic.tsv and DATE_projectname_markers_for_gistic.tsv (log2(depth.ratio) for input to GISTIC2.0)
      • DATE_projectname_segments_for_cbioportal.tsv (log2(depth.ratio) formatted for cBioPortal)
  • gatk_cnv.pl
    • will use collect_gatk_cnv_output.R to collect SCNV calls from all processed samples
    • output includes:
      • DATE_projectname_gatk_pga_estimates.tsv
      • DATE_projectname_gatk_ratio_gene_matrix.tsv (log2(depth.ratio); gene x patient matrix; a gene is considered to have a CNA if >20 bases overlap with a discovered segment)
      • DATE_projectname_gatk_cnv_for_gistic.tsv (log2(depth.ratio) for input to GISTIC2.0)
      • DATE_projectname_gatk_cnv_for_cbioportal.tsv (log2(depth.ratio) formatted for cBioPortal)
  • ascat.pl
    • will use collect_ascat_output.R to collect SCNV calls from all processed samples:
      • DATE_projectname_ascat_purity_ploidy.tsv (tumour fraction and ploidy output from all samples)
      • DATE_projectname_ascat_cna_gene_matrix.tsv (thresholded CN status; gene x patient matrix)
      • DATE_projectname_ascat_ratio_gene_matrix.tsv (log2(depth.ratio); gene x patient matrix)
  • ichor_cna.pl
    • will use collect_ichorCNA_output.R to collect tumour fraction and ploidy output from all samples: DATE_projectname_ichorCNA_estimates.tsv
    • will use collect_ichorCNA_output.R to collect per-bin copy-number values from all samples: DATE_projectname_perbin_cna_status.tsv
  • panelcn_mops.pl
    • will use collect_panelCN_mops_output.R to collect all output (long format) from all samples: DATE_projectname_panelCN.mops_output.tsv
    • will use collect_panelCN_mops_output.R to collect normalized read counts and copy-number values from all samples (wide format): DATE_projectname_panelCN.mops_cna_matrix.tsv, DATE_projectname_panelCN.mops_normalized_readcounts.tsv
  • delly.pl, manta.pl, novobreak.pl, svict.pl and mavis.pl
    • will use collect_mavis_output.R to collect SV calls from all samples
    • output includes:
      • DATE_projectname_mavis_output.tsv (concatenated output across samples)
      • DATE_projectname_svs_for_cbioportal.tsv (SVs in format required by cBioPortal)
  • msi_sensor.pl
    • will use collect_msi_estimates.R to collect MSI output from all samples: DATE_projectname_msi_estimates.tsv
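
The gene matrices above use the rule that a gene is considered to have a CNA if >20 bases overlap with a discovered segment. A minimal sketch of that rule follows. It is not the code used by the collect_*_output.R scripts: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and choosing the segment with the largest overlap when several qualify is an assumption made for illustration.

```python
def overlap_bases(start_a, end_a, start_b, end_b):
    """Number of bases shared by two 1-based, inclusive intervals."""
    return max(0, min(end_a, end_b) - max(start_a, start_b) + 1)


def call_gene_cna(genes, segments, min_overlap=20):
    """Assign each gene the log2 ratio of a segment overlapping it by > min_overlap bases.

    genes:    {name: (chrom, start, end)}
    segments: [(chrom, start, end, log2_ratio), ...]
    Genes without a qualifying segment map to None.
    """
    calls = {}
    for gene, (chrom, start, end) in genes.items():
        best_ratio, best_overlap = None, min_overlap
        for seg_chrom, seg_start, seg_end, log2_ratio in segments:
            if seg_chrom != chrom:
                continue
            shared = overlap_bases(start, end, seg_start, seg_end)
            # Strictly more than the current best (initially the 20-base cutoff).
            if shared > best_overlap:
                best_ratio, best_overlap = log2_ratio, shared
        calls[gene] = best_ratio
    return calls
```

With this sketch, a gene sharing exactly 20 bases with a segment is not called, while one sharing 21 or more is.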

summarize:

  • ensemble mutation calls: DATE_projectname_ensemble_mutation_data.tsv
  • visualizations for:
    • qc (summary plots and concerns for manual review)
    • mutation landscape plots for SNV (somatic and germline), SV (somatic and germline [if available]) and SCNAs
    • mutation signature plots (single-base substitutions, HRD)
  • detailed methods (methods.tex)

create_report:

  • final Report.pdf