Nextflow Data Processing & Configuration

Our pipeline configuration is still in development, and the contents of this document are subject to change.

Summary of Processing

Datatype	Method	Output	Tried yet?
https://nf-co.re/sarek/usage v3.2
WES or WGS	DeepVariant	Germline SNV, INDEL	Yes
WES or WGS	Strelka, Mutect2	Somatic SNV, INDEL	Yes
WES or WGS	TBD	Germline and Somatic Structural Variants	No
WES or WGS	TBD	Germline and Somatic CNV	No
WES or WGS	TBD	Tumor MSI	No
SNV, INDEL variants	TBD	Annotated Variants	No

Datatype	Method	Output	Tried yet?
https://nf-co.re/rnaseq v3.11.2
RNA-Seq	STAR with Salmon in alignment-based mode	Gene expression counts	Yes

More information about the workflows is available here: https://sagebionetworks.jira.com/wiki/spaces/WF/pages/2363359776

bam/cram to fastq conversion

When fastq files are not available, cram/bam files are converted to fastq using this pipeline: https://github.com/qbic-pipelines/bamtofastq (v1.2.0).
If unaligned bam files are available instead of fastq files, we recommend providing u-bam files for direct input to sarek 3.0.
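
The choice described above can be summarised as a small decision function. This is only a sketch: the function name `choose_input_route`, the `is_unaligned` flag, and the route labels are hypothetical and are not pipeline parameters.

```python
from pathlib import PurePosixPath

FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def choose_input_route(filename: str, is_unaligned: bool = False) -> str:
    """Return an illustrative label for how a file would enter processing."""
    name = PurePosixPath(filename).name.lower()
    if name.endswith(FASTQ_SUFFIXES):
        return "sarek"        # fastq files go straight to the pipeline
    if name.endswith(".bam") and is_unaligned:
        return "sarek-ubam"   # u-bam files can be given to sarek directly
    if name.endswith((".bam", ".cram")):
        return "bamtofastq"   # aligned bam/cram files are converted first
    raise ValueError(f"Unsupported input file type: {filename}")
```

Whether a bam file is unaligned cannot be inferred from its name, so the caller has to supply that information.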

WES and WGS Variant Calling (SNV & INDEL)

Germline SNV + INDEL

This involves transformation of WES fastq or cram files to variant call files in VCF format (.vcf files).

As of Jan 2022, the reference genome used is Homo_sapiens/GATK/GRCh38 (https://github.com/nf-core/sarek/blob/2.7.1/conf/igenomes.config#L38-L58)

The processing steps include the following:

Raw fastq files are uploaded to Synapse by the researcher in a folder with the name format experiment_name_rnaseq_fastq_date. No whitespace should be present in the filenames (use _ in place of any whitespace).
All experiment and sample related annotations need to be added on Synapse before processing can start. This is a required step so that a sample sheet can be generated to trigger the processing workflow.
The sample sheet should contain the following information in a comma-separated file (.csv) with at least 3 columns and a header row, as shown below. (More information here)

sample	subject	status	sex	file_1	file_2	lane	parentId	bed_file	output_parent_Id
Synapse specimenID	Synapse individualID	1 (Tumor = 1, Normal = 0)	XX or XY	`syn://synId`	`syn://synId`	Lane information	Synapse ID of parent folder	Synapse ID of BED file (if WES data)	Synapse ID of folder where all processed files will be indexed
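
As an illustration of the layout above, the following sketch writes a sample sheet with those column names using Python's `csv` module. The helper name `write_sarek_samplesheet` and every value in the example row are placeholders, not real Synapse IDs.

```python
import csv
import io

COLUMNS = ["sample", "subject", "status", "sex", "file_1", "file_2",
           "lane", "parentId", "bed_file", "output_parent_Id"]


def write_sarek_samplesheet(rows, handle):
    """Write rows (dicts keyed by COLUMNS) as CSV with a header row."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        missing = [c for c in COLUMNS if c not in row]
        if missing:
            raise ValueError(f"Row is missing columns: {missing}")
        if str(row["status"]) not in ("0", "1"):
            raise ValueError("status must be 1 (tumor) or 0 (normal)")
        writer.writerow(row)


# Placeholder values only; a real sheet uses actual Synapse IDs.
example = {
    "sample": "specimen_A", "subject": "individual_1", "status": 1,
    "sex": "XX", "file_1": "syn://syn00000001", "file_2": "syn://syn00000002",
    "lane": "L001", "parentId": "syn00000003", "bed_file": "syn00000004",
    "output_parent_Id": "syn00000005",
}
buffer = io.StringIO()
write_sarek_samplesheet([example], buffer)
print(buffer.getvalue())
```

Validating the `status` value while writing catches a common mistake early, before the sheet is staged for processing.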

The files are pulled into the Nextflow workflow setup and processed using the following versions of software:

CODE

nf-core/sarek	v2.7.1
Nextflow	v21.10.5
BWA	0.7.17
GATK	v4.1.7.0
FreeBayes	v1.3.2
samtools	v1.9
Strelka	v2.9.10
Manta	v1.6.0
TIDDIT	v2.7.1
AlleleCount	v4.0.2
ASCAT	v2.5.2
Control-FREEC	v11.6
msisensor	v0.5
SnpEff	v4.3t
VEP	v99.2
MultiQC	v1.8
FastQC	v0.11.9
bcftools	v1.9
CNVkit	v0.9.6
htslib	v1.9
QualiMap	v2.2.2-dev
Trim Galore	v0.6.4_dev
vcftools	v0.1.16
R	v4.0.2

Commands used for running JHU samples on DeepVariant:

All files and sample sheets are first staged in S3 buckets linked to Nextflow Tower. The following settings are then used to launch the processing pipeline.

Params:

CODE

input: s3://jhu-biobank-nf-project-tower-bucket/jobs/02-sage-sarek-2.7.1-deepvariant/inputs/sample-sheet.tsv
outdir: s3://jhu-biobank-nf-project-tower-bucket/jobs/02-sage-sarek-2.7.1-deepvariant/outputs/
genome: GRCh38
igenomes_base: s3://sage-igenomes/igenomes
model_type: WES
tools: "deepvariant"
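
Before launching, it can help to sanity-check these parameters and serialize them, for example to a JSON file of the kind Nextflow accepts through `-params-file`. The sketch below is illustrative: `check_params` is a hypothetical helper, and restricting `model_type` to WES or WGS is an assumption based on the data types covered in this document.

```python
import json

params = {
    "input": "s3://jhu-biobank-nf-project-tower-bucket/jobs/02-sage-sarek-2.7.1-deepvariant/inputs/sample-sheet.tsv",
    "outdir": "s3://jhu-biobank-nf-project-tower-bucket/jobs/02-sage-sarek-2.7.1-deepvariant/outputs/",
    "genome": "GRCh38",
    "igenomes_base": "s3://sage-igenomes/igenomes",
    "model_type": "WES",
    "tools": "deepvariant",
}


def check_params(p):
    """Return a list of problems found (illustrative, not exhaustive)."""
    problems = []
    for key in ("input", "outdir", "igenomes_base"):
        if not str(p.get(key, "")).startswith("s3://"):
            problems.append(f"{key} should be an s3:// URI")
    # Assumption: only WES and WGS data are processed here.
    if p.get("model_type") not in ("WES", "WGS"):
        problems.append("model_type should be WES or WGS")
    return problems


problems = check_params(params)
params_json = json.dumps(params, indent=2)
print(problems or params_json)
```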

Config:

CODE

process {
  errorStrategy = 'retry'
  maxRetries = 3

  withLabel:deepvariant {
    container = "google/deepvariant:1.1.0"
    cpus = 24
  }
}
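
With `errorStrategy = 'retry'` and `maxRetries = 3`, a failing task is resubmitted up to three times, so it runs at most four times in total before the failure is reported. The Python sketch below only mimics that behaviour for illustration; `run_with_retries` is a hypothetical helper and not part of Nextflow.

```python
def run_with_retries(task, max_retries=3):
    """Call task(attempt) until it succeeds or retries are exhausted.

    One initial attempt plus up to max_retries resubmissions,
    with the attempt counter starting at 1.
    """
    last_error = None
    for attempt in range(1, max_retries + 2):
        try:
            return task(attempt)
        except Exception as err:  # any failure triggers a resubmission
            last_error = err
    raise RuntimeError(f"Task failed after {max_retries} retries") from last_error


attempts = []


def flaky(attempt):
    """Fail on the first two attempts, then succeed."""
    attempts.append(attempt)
    if attempt < 3:
        raise OSError("transient failure")
    return "done"


result = run_with_retries(flaky)
```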

Pre-run Script:

CODE

export NXF_VER=21.10.5

Profiles:

CODE

aws_tower

Somatic SNV + INDEL using Strelka

Pipeline parameters to run Strelka on a Tower instance are given below:

CODE

{
  "input": "s3://ntap-add5-project-tower-bucket/samplesheets/synstage/sarek_newsamplesheet.csv",
  "wes": true,
  "igenomes_base": "s3://sage-igenomes/igenomes",
  "genome": "GATK.GRCh38",
  "tools": "strelka",
  "intervals": "s3://ntap-add5-project-tower-bucket/reference/xgen-exome-research-panel-probes-hg38.bed",
  "outdir": "s3://ntap-add5-project-tower-bucket/outputs/Reprocess/NFINTXXX/"
}

For processing WGS data:

Set “wes” to false
Remove “intervals”
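
The two changes for WGS data can be applied programmatically to the WES parameters. This is a minimal sketch; the function name `to_wgs_params` is hypothetical, and the parameter values are copied from the Strelka example in this section.

```python
import json

wes_params = {
    "input": "s3://ntap-add5-project-tower-bucket/samplesheets/synstage/sarek_newsamplesheet.csv",
    "wes": True,
    "igenomes_base": "s3://sage-igenomes/igenomes",
    "genome": "GATK.GRCh38",
    "tools": "strelka",
    "intervals": "s3://ntap-add5-project-tower-bucket/reference/xgen-exome-research-panel-probes-hg38.bed",
    "outdir": "s3://ntap-add5-project-tower-bucket/outputs/Reprocess/NFINTXXX/",
}


def to_wgs_params(params):
    """Return a copy of the WES parameters adjusted for WGS data."""
    wgs = dict(params)          # leave the original dictionary untouched
    wgs["wes"] = False          # set "wes" to false
    wgs.pop("intervals", None)  # remove "intervals"
    return wgs


wgs_params = to_wgs_params(wes_params)
print(json.dumps(wgs_params, indent=2))
```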

Annotated Variants

Currently, germline variant calls in VCF format are being processed manually using VEP and vcf2maf.

Pipeline:

CODE

https://github.com/Sage-Bionetworks-Workflows/nf-vcf2maf

Pipeline parameters:

CODE

input: s3://ntap-add5-project-tower-bucket/samplsheets/Reprocess/NFINTXXX/sarek_vcf2maf_samplesheet.csv

Pipeline component version details:

CODE

BCFTOOLS_STATS:
  bcftools: '1.17'
BWAMEM1_MEM:
  bwa: 0.7.17-r1188
  samtools: 1.16.1
CALCULATECONTAMINATION:
  gatk4: 4.4.0.0
CREATE_INTERVALS_BED:
  gawk: 5.1.0
CUSTOM_DUMPSOFTWAREVERSIONS:
  python: 3.11.0
  yaml: '6.0'
FASTP:
  fastp: 0.23.4
FASTQC:
  fastqc: 0.11.9
FILTERMUTECTCALLS:
  gatk4: 4.4.0.0
GATK4_APPLYBQSR:
  gatk4: 4.4.0.0
GATK4_BASERECALIBRATOR:
  gatk4: 4.4.0.0
GATK4_MARKDUPLICATES:
  gatk4: 4.4.0.0
GETPILEUPSUMMARIES:
  gatk4: 4.4.0.0
INDEX_CRAM:
  samtools: '1.17'
INDEX_MARKDUPLICATES:
  samtools: '1.17'
LEARNREADORIENTATIONMODEL:
  gatk4: 4.4.0.0
MOSDEPTH:
  mosdepth: 0.3.3
MUTECT2:
  gatk4: 4.4.0.0
SAMTOOLS_STATS:
  samtools: '1.17'
STRELKA_SINGLE:
  strelka: 2.9.10
TABIX_BGZIPTABIX_INTERVAL_COMBINED:
  tabix: '1.12'
TABIX_BGZIPTABIX_INTERVAL_SPLIT:
  tabix: '1.12'
VCFTOOLS_TSTV_COUNT:
  vcftools: 0.1.16
Workflow:
  Nextflow: 22.10.6
  nf-core/sarek: 3.2.0
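
A version listing like the one above can be scanned for tools that appear with more than one version across modules (for example, samtools is listed as both 1.16.1 and 1.17). The sketch below parses the two-level layout with the standard library only; it is not a general YAML parser, and the function names are hypothetical.

```python
from collections import defaultdict

# Excerpt copied from the listing above.
VERSIONS_TEXT = """\
BWAMEM1_MEM:
  bwa: 0.7.17-r1188
  samtools: 1.16.1
FILTERMUTECTCALLS:
  gatk4: 4.4.0.0
INDEX_CRAM:
  samtools: '1.17'
MUTECT2:
  gatk4: 4.4.0.0
"""


def tool_versions(text):
    """Map each tool to the set of versions reported across modules."""
    versions = defaultdict(set)
    for line in text.splitlines():
        if not line.strip() or not line.startswith(" "):
            continue  # skip blank lines and module header lines
        tool, _, version = line.strip().partition(":")
        versions[tool].add(version.strip().strip("'\""))
    return dict(versions)


def mixed_versions(text):
    """Return only the tools that are listed with more than one version."""
    return {t: sorted(v) for t, v in tool_versions(text).items() if len(v) > 1}


print(mixed_versions(VERSIONS_TEXT))
```

For the excerpt used here, only samtools is reported, with versions 1.16.1 and 1.17.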

RNA Sequencing Data Quantification

Processing RNA-seq files involves transformation of raw data (fastq files) to transcript counts (quant.sf files).

The quantification software of choice is Salmon.

As of Jan 2022, the reference genome used is Homo_sapiens/NCBI/GRCh38.

Processing involves the following steps:

Raw fastq files are uploaded to Synapse by the researcher in a folder with the name format experiment_name_rnaseq_fastq_date. No whitespace should be present in the filenames (use _ in place of any whitespace). While the naming convention is a best-practice recommendation and not a strict rule, the exclusion of whitespace is required.
All experiment and sample related annotations need to be added on Synapse before processing can start. This is a required step so that a sample sheet can be generated to trigger the processing workflow.
The sample sheet should contain the following information in the following format (saved as a .csv file) (More information here) :

sample	single_end	fastq_1	fastq_2	strandedness
Synapse specimenID	0 (1 if single-end)	synID	synID	auto
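
The whitespace requirement in the first step can be checked before files are uploaded. The following sketch flags offending filenames and suggests a replacement; the helper names are hypothetical, and collapsing each run of whitespace into a single underscore is an assumption about the preferred renaming.

```python
import re


def has_whitespace(filename: str) -> bool:
    """Return True if the filename contains any whitespace character."""
    return re.search(r"\s", filename) is not None


def suggest_name(filename: str) -> str:
    """Replace each run of whitespace with a single underscore."""
    return re.sub(r"\s+", "_", filename.strip())


# Example filenames are made up for illustration.
for name in ["sample 1 R1.fastq.gz", "sample_2_R1.fastq.gz"]:
    if has_whitespace(name):
        print(f"rename {name!r} to {suggest_name(name)!r}")
```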

The files are pulled into the Nextflow workflow setup and processed using the following versions of software:

CODE

BEDTOOLS_GENOMECOV:
  bedtools: 2.30.0
CAT_FASTQ:
  cat: '8.30'
CUSTOM_DUMPSOFTWAREVERSIONS:
  python: 3.11.0
  yaml: '6.0'
CUSTOM_GETCHROMSIZES:
  getchromsizes: 1.16.1
DESEQ2_QC_STAR_SALMON:
  bioconductor-deseq2: 1.28.0
  r-base: 4.0.3
DUPRADAR:
  bioconductor-dupradar: 1.28.0
  r-base: 4.2.1
FASTQC:
  fastqc: 0.11.9
FQ_SUBSAMPLE:
  fq: 0.9.1 (2022-02-22)
MULTIQC_CUSTOM_BIOTYPE:
  python: 3.9.5
PICARD_MARKDUPLICATES:
  picard: 3.0.0
PREPROCESS_TRANSCRIPTS_FASTA_GENCODE:
  sed: '4.7'
QUALIMAP_RNASEQ:
  qualimap: 2.2.2-dev
RSEQC_BAMSTAT:
  rseqc: 3.0.1
RSEQC_INFEREXPERIMENT:
  rseqc: 3.0.1
RSEQC_INNERDISTANCE:
  rseqc: 3.0.1
RSEQC_JUNCTIONANNOTATION:
  rseqc: 3.0.1
RSEQC_JUNCTIONSATURATION:
  rseqc: 3.0.1
RSEQC_READDISTRIBUTION:
  rseqc: 3.0.1
RSEQC_READDUPLICATION:
  rseqc: 3.0.1
SALMON_QUANT:
  salmon: 1.10.1
SALMON_SE_GENE:
  bioconductor-summarizedexperiment: 1.24.0
  r-base: 4.1.1
SALMON_TX2GENE:
  python: 3.9.5
SALMON_TXIMPORT:
  bioconductor-tximeta: 1.12.0
  r-base: 4.1.1
SAMPLESHEET_CHECK:
  python: 3.9.5
SAMTOOLS_FLAGSTAT:
  samtools: 1.16.1
SAMTOOLS_IDXSTATS:
  samtools: 1.16.1
SAMTOOLS_INDEX:
  samtools: 1.16.1
SAMTOOLS_SORT:
  samtools: 1.16.1
SAMTOOLS_STATS:
  samtools: 1.16.1
STAR_ALIGN:
  gawk: 5.1.0
  samtools: 1.16.1
  star: 2.7.9a
STRINGTIE_STRINGTIE:
  stringtie: 2.2.1
SUBREAD_FEATURECOUNTS:
  subread: 2.0.1
TRIMGALORE:
  cutadapt: '3.4'
  trimgalore: 0.6.7
UCSC_BEDCLIP:
  ucsc: '377'
UCSC_BEDGRAPHTOBIGWIG:
  ucsc: '377'
Workflow:
  Nextflow: 22.10.6
  nf-core/rnaseq: 3.11.2

Command used to process JHU Biobank samples:

Params:

CODE

input: s3://jhu-biobank-nf-project-tower-bucket/jobs/01-nfcore-rnaseq-3.4/inputs/sample-sheet.csv
outdir: s3://jhu-biobank-nf-project-tower-bucket/jobs/01-nfcore-rnaseq-3.4/outputs/
genome: GRCh38
igenomes_base: s3://sage-igenomes/igenomes

Config:

CODE

process {
  errorStrategy = 'retry'
  maxRetries = 3
}

Pre-run script:

CODE

export NXF_VER=21.10.5

Profile:

CODE

aws_tower
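
For orientation, the Tower launch settings above correspond roughly to a Nextflow command line. The sketch below assembles such a command; the file names `params.yml` and `custom.config` are hypothetical, and whether the `aws_tower` profile is available outside Tower is not stated here, so treat this as an approximation rather than the command that was actually run.

```python
import shlex


def build_command(pipeline, profile, params_file, config_file=None, revision=None):
    """Assemble an approximate `nextflow run` command as a list of arguments."""
    cmd = ["nextflow", "run", pipeline]
    if revision:
        cmd += ["-r", revision]       # pipeline revision, if pinned
    cmd += ["-profile", profile, "-params-file", params_file]
    if config_file:
        cmd += ["-c", config_file]    # extra config, e.g. the retry settings
    return cmd


cmd = build_command("nf-core/rnaseq", "aws_tower", "params.yml", "custom.config")
# The pre-run script sets NXF_VER; here it is shown as an environment prefix.
command_line = "NXF_VER=21.10.5 " + shlex.join(cmd)
print(command_line)
```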