Nextflow Data Processing & Configuration
Our pipeline configuration is still in development, and the contents of this document are subject to change.
Summary of Processing
| Datatype | Method | Output | Tried yet? |
|---|---|---|---|
| WES or WGS | DeepVariant | | Yes |
| WES or WGS | Strelka, Mutect2 | | Yes |
| WES or WGS | TBD | Germline and Somatic Structural Variants | No |
| WES or WGS | TBD | Germline and Somatic CNV | No |
| WES or WGS | TBD | Tumor MSI | No |
| SNV, INDEL variants | TBD | | No |
https://nf-co.re/rnaseq v3.11.2
| Datatype | Method | Output | Tried yet? |
|---|---|---|---|
| RNA-Seq | STAR with Salmon in alignment-based mode | | Yes |
More information about the workflows is available here: https://sagebionetworks.jira.com/wiki/spaces/WF/pages/2363359776
bam/cram to fastq conversion
When fastq files are not available, cram/bam files are converted to fastq using this pipeline: https://github.com/qbic-pipelines/bamtofastq (v1.2.0).
If unaligned bam files are available instead of fastq files, we recommend providing u-bam files for direct input to sarek 3.0.
WES and WGS Variant Calling (SNV & INDEL)
Germline SNV + INDEL
This involves transformation of WES fastq or cram files to variant call files in VCF format (.vcf files).
As of Jan 2022, the reference genome used is Homo_sapiens/GATK/GRCh38 (https://github.com/nf-core/sarek/blob/2.7.1/conf/igenomes.config#L38-L58)
The processing steps include the following:
Raw fastq files are uploaded to Synapse by the researcher in a folder named in the format experiment_name_rnaseq_fastq_date. No whitespace should be present in the filenames (all filenames should use _ in place of whitespace).
All experiment- and sample-related annotations need to be added on Synapse before processing can start. This step is required so that a sample sheet can be generated to trigger the processing workflow.
The sample sheet should be a comma-separated file (.csv) with at least 3 columns and a header row, as shown below. (More information here)
| sample | subject | status | sex | file_1 | file_2 | lane | parentId | bed_file | output_parent_Id |
|---|---|---|---|---|---|---|---|---|---|
| Synapse specimenID | Synapse individualID | 1 (Tumor = 1, Normal = 0) | XX or XY | | | Lane information | Synapse ID of parent folder | Synapse ID of BED file (if WES data) | Synapse ID of folder where all processed files will be indexed |
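The sample sheet layout above can be sketched in Python as follows. This is a minimal illustration, not the script used to generate the production sample sheets: the column names come from the table, while the helper names (`validate_row`, `write_sample_sheet`), the example values, and the use of Synapse IDs in `file_1` and `file_2` are assumptions made for the example.

```python
import csv
import io

# Column order taken from the sample sheet table above.
SAREK_COLUMNS = [
    "sample", "subject", "status", "sex", "file_1", "file_2",
    "lane", "parentId", "bed_file", "output_parent_Id",
]


def validate_row(row):
    """Return a list of problems found in one sample sheet row (empty if none)."""
    problems = []
    missing = [c for c in SAREK_COLUMNS if c not in row]
    if missing:
        problems.append(f"missing columns: {', '.join(missing)}")
    if str(row.get("status")) not in {"0", "1"}:
        problems.append("status must be 1 (tumor) or 0 (normal)")
    if row.get("sex") not in {"XX", "XY"}:
        problems.append("sex must be XX or XY")
    return problems


def write_sample_sheet(rows, handle):
    """Validate every row, then write them as CSV with a header row."""
    for i, row in enumerate(rows, start=1):
        problems = validate_row(row)
        if problems:
            raise ValueError(f"row {i}: {'; '.join(problems)}")
    writer = csv.DictWriter(handle, fieldnames=SAREK_COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical example row; the syn IDs are placeholders, not real entities.
example = {
    "sample": "specimen_001", "subject": "individual_001", "status": 1,
    "sex": "XX", "file_1": "syn00000001", "file_2": "syn00000002",
    "lane": "L001", "parentId": "syn00000003", "bed_file": "syn00000004",
    "output_parent_Id": "syn00000005",
}

buffer = io.StringIO()
write_sample_sheet([example], buffer)
print(buffer.getvalue())
```

Validating before writing means a malformed row is caught before the sheet is staged, rather than after a pipeline launch fails. For WGS data the `bed_file` value could be left as an empty string.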
The files are pulled into the Nextflow workflow and processed using the following software versions:
nf-core/sarek v2.7.1
Nextflow v21.10.5
BWA 0.7.17
GATK v4.1.7.0
FreeBayes v1.3.2
samtools v1.9
Strelka v2.9.10
Manta v1.6.0
TIDDIT v2.7.1
AlleleCount v4.0.2
ASCAT v2.5.2
Control-FREEC v11.6
msisensor v0.5
SnpEff v4.3t
VEP v99.2
MultiQC v1.8
FastQC v0.11.9
bcftools v1.9
CNVkit v0.9.6
htslib v1.9
QualiMap v2.2.2-dev
Trim Galore v0.6.4_dev
vcftools v0.1.16
R v4.0.2
Commands used for running JHU samples on DeepVariant:
All files and sample sheets are first staged in S3 buckets linked to NFTower. The following parameters and configuration are then used to launch the processing pipeline.
Params:
input: s3://jhu-biobank-nf-project-tower-bucket/jobs/02-sage-sarek-2.7.1-deepvariant/inputs/sample-sheet.tsv
outdir: s3://jhu-biobank-nf-project-tower-bucket/jobs/02-sage-sarek-2.7.1-deepvariant/outputs/
genome: GRCh38
igenomes_base: s3://sage-igenomes/igenomes
model_type: WES
tools: "deepvariant"
Config:
process {
    errorStrategy = 'retry'
    maxRetries = 3
    withLabel:deepvariant {
        container = "google/deepvariant:1.1.0"
        cpus = 24
    }
}
Pre-run Script:
export NXF_VER=21.10.5
Profiles:
aws_tower
Somatic SNV + INDEL using Strelka
Pipeline parameters to run Strelka on a Tower instance are given below:
{
  "input": "s3://ntap-add5-project-tower-bucket/samplesheets/synstage/sarek_newsamplesheet.csv",
  "wes": true,
  "igenomes_base": "s3://sage-igenomes/igenomes",
  "genome": "GATK.GRCh38",
  "tools": "strelka",
  "intervals": "s3://ntap-add5-project-tower-bucket/reference/xgen-exome-research-panel-probes-hg38.bed",
  "outdir": "s3://ntap-add5-project-tower-bucket/outputs/Reprocess/NFINTXXX/"
}
For processing WGS data:
Set “wes” to false
Remove “intervals”
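The two WGS adjustments can be applied programmatically to the WES parameters. The sketch below is illustrative only: the parameter values are copied from the Strelka example above, and the function name `to_wgs_params` is a hypothetical helper, not part of the pipeline.

```python
import json

# WES parameters copied from the Strelka example above.
wes_params = {
    "input": "s3://ntap-add5-project-tower-bucket/samplesheets/synstage/sarek_newsamplesheet.csv",
    "wes": True,
    "igenomes_base": "s3://sage-igenomes/igenomes",
    "genome": "GATK.GRCh38",
    "tools": "strelka",
    "intervals": "s3://ntap-add5-project-tower-bucket/reference/xgen-exome-research-panel-probes-hg38.bed",
    "outdir": "s3://ntap-add5-project-tower-bucket/outputs/Reprocess/NFINTXXX/",
}


def to_wgs_params(params):
    """Return a copy of the parameters adjusted for WGS data."""
    wgs = dict(params)          # copy so the WES parameters stay untouched
    wgs["wes"] = False          # set "wes" to false
    wgs.pop("intervals", None)  # remove "intervals" if present
    return wgs


print(json.dumps(to_wgs_params(wes_params), indent=2))
```

Working on a copy keeps the WES parameter set reusable, so both variants can be generated from one source of truth.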
Annotated Variants
Currently, germline variant calls in VCF format are processed manually using VEP and vcf2maf.
Pipeline:
https://github.com/Sage-Bionetworks-Workflows/nf-vcf2maf
Pipeline parameters:
input: s3://ntap-add5-project-tower-bucket/samplsheets/Reprocess/NFINTXXX/sarek_vcf2maf_samplesheet.csv
Pipeline component version details:
BCFTOOLS_STATS:
  bcftools: '1.17'
BWAMEM1_MEM:
  bwa: 0.7.17-r1188
  samtools: 1.16.1
CALCULATECONTAMINATION:
  gatk4: 4.4.0.0
CREATE_INTERVALS_BED:
  gawk: 5.1.0
CUSTOM_DUMPSOFTWAREVERSIONS:
  python: 3.11.0
  yaml: '6.0'
FASTP:
  fastp: 0.23.4
FASTQC:
  fastqc: 0.11.9
FILTERMUTECTCALLS:
  gatk4: 4.4.0.0
GATK4_APPLYBQSR:
  gatk4: 4.4.0.0
GATK4_BASERECALIBRATOR:
  gatk4: 4.4.0.0
GATK4_MARKDUPLICATES:
  gatk4: 4.4.0.0
GETPILEUPSUMMARIES:
  gatk4: 4.4.0.0
INDEX_CRAM:
  samtools: '1.17'
INDEX_MARKDUPLICATES:
  samtools: '1.17'
LEARNREADORIENTATIONMODEL:
  gatk4: 4.4.0.0
MOSDEPTH:
  mosdepth: 0.3.3
MUTECT2:
  gatk4: 4.4.0.0
SAMTOOLS_STATS:
  samtools: '1.17'
STRELKA_SINGLE:
  strelka: 2.9.10
TABIX_BGZIPTABIX_INTERVAL_COMBINED:
  tabix: '1.12'
TABIX_BGZIPTABIX_INTERVAL_SPLIT:
  tabix: '1.12'
VCFTOOLS_TSTV_COUNT:
  vcftools: 0.1.16
Workflow:
  Nextflow: 22.10.6
  nf-core/sarek: 3.2.0
RNA Sequencing Data Quantification
Processing RNA-seq files involves transforming raw data (fastq files) into transcript counts (quant.sf files).
The quantification software of choice is Salmon.
As of Jan 2022, the reference genome used is Homo_sapiens/NCBI/GRCh38.
Processing involves the following steps:
Raw fastq files are uploaded to Synapse by the researcher in a folder named in the format experiment_name_rnaseq_fastq_date. No whitespace should be present in the filenames (all filenames should use _ in place of whitespace). While the naming convention is a best-practice recommendation and not a strict rule, the exclusion of whitespace is required.
All experiment- and sample-related annotations need to be added on Synapse before processing can start. This step is required so that a sample sheet can be generated to trigger the processing workflow.
The sample sheet should contain the following columns and be saved as a .csv file (More information here):
| sample | single_end | fastq_1 | fastq_2 | strandedness |
|---|---|---|---|---|
| Synapse specimenID | 0 (1 if single-end) | synID | synID | auto |
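The whitespace requirement for filenames described above can be checked before upload with a few lines of Python. This is a sketch under assumptions: the helper names and example filenames are hypothetical, and collapsing a run of whitespace into a single underscore is one reasonable reading of the rule rather than a documented convention.

```python
import re


def has_whitespace(filename):
    """Return True if the filename contains any whitespace character."""
    return re.search(r"\s", filename) is not None


def normalize_filename(filename):
    """Strip surrounding whitespace and replace each inner run with one underscore."""
    return re.sub(r"\s+", "_", filename.strip())


# Hypothetical filenames, for illustration only.
filenames = ["sample 01 R1.fastq.gz", "sample_02_R1.fastq.gz"]
for name in filenames:
    if has_whitespace(name):
        print(f"{name!r} -> {normalize_filename(name)!r}")
```

Running the check over a local folder listing before uploading avoids having to rename files on Synapse after annotations have already been attached.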
The files are pulled into the Nextflow workflow and processed using the following software versions:
BEDTOOLS_GENOMECOV:
  bedtools: 2.30.0
CAT_FASTQ:
  cat: '8.30'
CUSTOM_DUMPSOFTWAREVERSIONS:
  python: 3.11.0
  yaml: '6.0'
CUSTOM_GETCHROMSIZES:
  getchromsizes: 1.16.1
DESEQ2_QC_STAR_SALMON:
  bioconductor-deseq2: 1.28.0
  r-base: 4.0.3
DUPRADAR:
  bioconductor-dupradar: 1.28.0
  r-base: 4.2.1
FASTQC:
  fastqc: 0.11.9
FQ_SUBSAMPLE:
  fq: 0.9.1 (2022-02-22)
MULTIQC_CUSTOM_BIOTYPE:
  python: 3.9.5
PICARD_MARKDUPLICATES:
  picard: 3.0.0
PREPROCESS_TRANSCRIPTS_FASTA_GENCODE:
  sed: '4.7'
QUALIMAP_RNASEQ:
  qualimap: 2.2.2-dev
RSEQC_BAMSTAT:
  rseqc: 3.0.1
RSEQC_INFEREXPERIMENT:
  rseqc: 3.0.1
RSEQC_INNERDISTANCE:
  rseqc: 3.0.1
RSEQC_JUNCTIONANNOTATION:
  rseqc: 3.0.1
RSEQC_JUNCTIONSATURATION:
  rseqc: 3.0.1
RSEQC_READDISTRIBUTION:
  rseqc: 3.0.1
RSEQC_READDUPLICATION:
  rseqc: 3.0.1
SALMON_QUANT:
  salmon: 1.10.1
SALMON_SE_GENE:
  bioconductor-summarizedexperiment: 1.24.0
  r-base: 4.1.1
SALMON_TX2GENE:
  python: 3.9.5
SALMON_TXIMPORT:
  bioconductor-tximeta: 1.12.0
  r-base: 4.1.1
SAMPLESHEET_CHECK:
  python: 3.9.5
SAMTOOLS_FLAGSTAT:
  samtools: 1.16.1
SAMTOOLS_IDXSTATS:
  samtools: 1.16.1
SAMTOOLS_INDEX:
  samtools: 1.16.1
SAMTOOLS_SORT:
  samtools: 1.16.1
SAMTOOLS_STATS:
  samtools: 1.16.1
STAR_ALIGN:
  gawk: 5.1.0
  samtools: 1.16.1
  star: 2.7.9a
STRINGTIE_STRINGTIE:
  stringtie: 2.2.1
SUBREAD_FEATURECOUNTS:
  subread: 2.0.1
TRIMGALORE:
  cutadapt: '3.4'
  trimgalore: 0.6.7
UCSC_BEDCLIP:
  ucsc: '377'
UCSC_BEDGRAPHTOBIGWIG:
  ucsc: '377'
Workflow:
  Nextflow: 22.10.6
  nf-core/rnaseq: 3.11.2
Command used to process JHU Biobank samples:
Params:
input: s3://jhu-biobank-nf-project-tower-bucket/jobs/01-nfcore-rnaseq-3.4/inputs/sample-sheet.csv
outdir: s3://jhu-biobank-nf-project-tower-bucket/jobs/01-nfcore-rnaseq-3.4/outputs/
genome: GRCh38
igenomes_base: s3://sage-igenomes/igenomes
Config:
process {
    errorStrategy = 'retry'
    maxRetries = 3
}
Pre-run script:
export NXF_VER=21.10.5
Profile:
aws_tower