12 Triumph: Run pipeline
CURRENTLY NOT MAINTAINED OR SUPPORTED
This pipeline processes raw fastq files from microbiome data using the dada2 R package. At the end, it also produces a summary report of the run.
The pipeline has been tailor-made for the microbiome group at the RIVM, specifically for the Pienter project. It therefore expects the input that is normally generated there and might not be very flexible with sample names and metadata. I will try to explain what is expected from the input where necessary.
12.1 Handbook
12.1.1 Requirements and preparation
See the General Instructions for all pipelines first.
Make sure that your files have the right format: they must have a fastq extension (.fastq, .fq, .fastq.gz or .fq.gz) and specific names. The names should consist only of numbers (optionally preceded by one letter) and should encode the run and the sample. For instance, the file name L002-026_R1.fastq.gz means that you are processing project “L”, run number 2 and sample “026”. The suffix ’_R1’ or ’_R2’ designates forward or reverse reads. The MiSeq sequencer normally adds an extra suffix _0001 to the names of the fastq files (e.g. L002-026_R1_0001.fastq.gz); this suffix does not prevent the pipeline from working.
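For illustration, the contents of an input folder for run 2 of project “L” with two samples (the sample numbers here are made up) could look like this:

ls /mnt/scratch_dir/<my_folder>/<my_last_run>/
L002-025_R1_0001.fastq.gz
L002-025_R2_0001.fastq.gz
L002-026_R1_0001.fastq.gz
L002-026_R2_0001.fastq.gz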
This pipeline needs a metadata file (an Excel file) with a very specific format. For details about how to make the metadata file and what the codes mean, please consult the microbiome team at the RIVM, especially Susana Fuentes. For the use of the pipeline, the main requirements for this Excel file are:
- It should have the file extension .xlsx (not .csv).
- It should have at least a sheet called “amplicon_assay”. The metadata file for the Triumph project includes many other sheets but this is the only one that is necessary. Leaving the sheet unnamed will not work even if the information inside the sheet is correctly formatted.
- The amplicon_assay sheet should have at least three columns with the names: assay_sample, sample_identifier and subject_identifier.
- The assay_sample column should contain the “root” of the sample names that are used to name the fastq files (see previous point). For instance, the assay_sample code for the file L002-026_R1_0001.fastq.gz is L002-026.
- The sample_identifier column should contain the identity of the samples. Specific codes are needed here for the control samples. For instance, every non-control sample is named something like S12345678, where “S” denotes that it is a sample belonging to a patient and the numbers are a unique ID for that sample. In contrast, the control samples don’t have unique IDs; they are always named the same way in every run. The accepted codes for the control samples can be found in Table 1.
- The subject_identifier column contains the same codes as sample_identifier and is used as a back-up: when a code is not found in sample_identifier, it is looked up in the subject_identifier column instead.
Table 1: Accepted codes for control samples.

subject_identifier | subject_description
---|---
ZMCD | Zymo mock DNA
ZMCB | Zymo mock Bacteria
AMCD | ATCC mock DNA
UMD | UMCU low density mock DNA
MSS | Mixed (fecal/NP/OP…) sample control
MSD | Mixed (fecal/NP/OP…) sample control DNA
BD | Blank from DNA extraction
BP | Blank from PCR
LSIC | Library spike-in control
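For illustration, a minimal amplicon_assay sheet (below the ignored top row, see the help text in the next section) could contain a table like the following; the sample and run codes here are made up:

assay_sample | sample_identifier | subject_identifier
---|---|---
L002-026 | S12345678 | S12345678
L002-027 | ZMCD | ZMCD
L002-028 | BP | BP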
12.1.2 Download the pipeline
YOU NEED TO DOWNLOAD THE PIPELINE ONCE OR EVERY TIME YOU WANT TO UPDATE IT
Make sure to have followed the instructions to set up conda before installing any of our pipelines!
Please follow the instructions to download pipelines from the Juno team of the IDS-bioinformatics group. The triumph_microbiome_run pipeline can be found in this link.
12.1.3 Start the analysis
Open the terminal.
Enter the folder of the pipeline:
cd /mnt/scratch_dir/<my_folder>/triumph_dada2_run_pipeline-master/
- The exact name of the downloaded folder might differ slightly depending on the version that was downloaded.
- Run the pipeline:
bash triumph_run_pipeline.sh -i /mnt/scratch_dir/<my_folder>/<my_last_run>/ --metadata /path/to/metadata.xlsx
Please read the section What to expect while running a Juno pipeline
See the section General Troubleshooting for any problems you may encounter.
Note: Do not keep all your data (including results) on the scratch_dir partition. You are allowed to keep a maximum of 400 GB and, with sequencing data, this can fill up quite fast.
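Tip: if you first want to see which steps the pipeline would perform without actually running them, you can add the -n (--dry_run) option described in the help below:

bash triumph_run_pipeline.sh -i /mnt/scratch_dir/<my_folder>/<my_last_run>/ --metadata /path/to/metadata.xlsx -n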
12.1.3.1 Run the pipeline changing default parameters
Many of the parameters of the dada2 package can be changed directly when calling the pipeline. If you want to do that, please refer to the help of the pipeline by using the command:
bash triumph_run_pipeline.sh -h
It should print something like this:
Usage: bash triumph_run_pipeline.sh -i <INPUT_DIR> <parameters>
Input:
-i, --input [DIR] This is the folder containing the input (demultiplexed)
fastq files. This should be a folder for one single
run (sub-folder name will be considered the run name!
so give it a meaningful one). Default is the current folder.
--metadata [xlsx file] (optional) This is an optional parameter in which you
provide an xlsx file with the metadata.
For the exact format of this file, you should
refer to the microbiome group at the RIVM
(Susana Fuentes). In summary, it should
contain a sheet called 'amplicon_assay', with
an empty row at the top (it does not need to
be empty, but that row will be ignored) and
then a table with at least two columns:
assay_sample and sample_identifier. The
assay_sample should have the names coinciding
with the fastq files (so, like the sample
sheet used for demultiplexing) and the
sample_identifier should contain either the
sample code or one of the codes for
recognizing control samples (ask the microbiome group).
Main Output (automatically generated):
out/ Contains detailed intermediate files.
out/log/ Contains all log files.
out/run_reports/ Folder containing the actual reports of the
run (html file) and the summary table of the
run (used to generate that report) as a csv file.
Parameters:
-o, --output Output directory. Default is 'out/'
--pienter-up If this option is chosen, the default options for the
nasopharynx Triumph project will be used, which means
that truncLenF will be overwritten to 200 and truncLenR
will be overwritten to 150.
-h, --help Print this help document.
-sh, --snakemake-help Print the Snakemake help document.
--clean (-y) Removes output. (-y forces 'Yes' on all prompts)
-u, --unlock Removes the lock on the working directory. This happens when
a run ends abruptly and prevents you from doing subsequent
analyses.
-n, --dry_run Shows the process but does not actually run the pipeline.
Dada2 parameters that can be changed in this pipeline (if you need to
directly modify another parameter when you call the pipeline, contact me at
alejandra.hernandez.segura@rivm.nl for more details):
--trunQ Minimum quality for truncating a read (part of dada2
filtering). Default: 2
--truncLenF Truncating length for forward reads (part of dada2
filtering). Default: 220
--truncLenR Truncating length for reverse reads (part of dada2
filtering). Default: 100
--maxN Maximum number of N bases accepted per read (part of dada2
filtering). Default: 0. Note that dada2 requires this
parameter to be 0.
--maxEE Maximum Expected Errors (part of dada2 filtering).
Default: 2
--nomatchID If this flag is present, the read IDs will not be expected to
match. The default is to expect filtered reads to have
matching IDs (part of dada2 filtering). Note that you should
NOT put 'true' or 'false' after the flag. The flag itself
forces the condition to be FALSE.
--nbases Number of bases to use in order to learn the errors (part
of dada2 error learning). Default: 1e9
--minlen Minimum length of the inferred SAVs. Any SAV shorter than
minlen will be discarded (part of dada2 inferring SAVs).
Default: 250
--maxlen Maximum length of the inferred SAVs. Any SAV longer than
maxlen will be discarded (part of dada2 inferring SAVs).
Default: 256
--trainset Path to trainset used to assign taxonomy to a phyloseq object.
Dada2 suggests a database in its tutorial. In the RIVM, we have
a copy stored at: '/mnt/db/triumph_taxonomy_db/silva_nr_v138_train_set.fa.gz'
and this path is the default.
--species_db Path to database used to assign species name to a phyloseq
object. Dada2 suggests a database in its tutorial. In the RIVM,
we have a copy stored at: '/mnt/db/triumph_taxonomy_db/silva_species_assignment_v138.fa.gz'
and this path is the default.
--no-randomize-errors, -nr If this flag is present, there will be no randomization
while learning the errors (part of dada2 learn errors).
The default is to randomize. Note that you should
NOT put 'true' or 'false' after the flag. The flag itself
forces the condition to be FALSE.
--just-concatenate If this flag is present, the pairs will simply be concatenated
(part of dada2 mergePairs). The default is FALSE. Note
that you should NOT put 'true' or 'false' after the flag. The
flag itself forces the condition to be TRUE.
--bimera_method Choose a method to remove bimeras (according to dada2
removeBimeraDenovo).
--minboot The minimum bootstrap confidence for assigning a taxonomic level
(part of dada2 assignTaxonomy). Default: 80
--no-tryrc By default (TRUE), the reverse-complement of each sequence
will be used for classification if it is a better match to the
reference sequences than the forward sequence. Note that you
should NOT put 'true' or 'false' after the flag. The flag itself
forces the condition to be FALSE.
--species-allowed This is the maximum number of species that may be listed (part of
dada2 addSpecies). Default is 3.
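For example, to override the default truncation lengths and the maximum expected errors (the values below are only illustrative, not recommendations), you could call the pipeline like this:

bash triumph_run_pipeline.sh -i /mnt/scratch_dir/<my_folder>/<my_last_run>/ --metadata /path/to/metadata.xlsx --truncLenF 200 --truncLenR 120 --maxEE 3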
12.1.3.2 Running when metadata is not available
You can also run the pipeline without the metadata:
bash triumph_run_pipeline.sh -i /mnt/scratch_dir/<my_folder>/<my_last_run>/
However, the report produced for such a run is not very informative, and that part of the code is not actively maintained, so the report might fail. The pipeline is not intended to be run without metadata; it will still do its job, but not efficiently.
12.1.4 Output
A folder called out/, inside the folder of the pipeline, will be created. If you chose another location for the output, you will find the same results there. This folder will contain all the results and logging files of your analysis. There will be one folder with all the filtered_fastq files, one with the plots and one with the RDS_files generated when using the dada2 package (for instance, seqtab, merged files, etc.). These RDS files can be loaded into R, where you can explore them and work with them further. The last folder is run_reports, which will have the html report with statistics of the run and the “Project_SummaryTable.csv” with all the statistics in table format. Please refer to the dada2 documentation to interpret your results.
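For example, assuming one of the RDS files is called seqtab.rds (the actual file names may differ, so check the RDS_files folder), you could inspect the dimensions of the sequence table directly from the terminal:

Rscript -e 'seqtab <- readRDS("out/RDS_files/seqtab.rds"); dim(seqtab)'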
Note: If you want your output to be stored in a folder with a different name or location, you can use the option -o (‘o’ for output):
bash triumph_run_pipeline.sh -i /mnt/scratch_dir/<my_folder>/<my_last_run>/ -o /mnt/scratch_dir/<my_folder>/<my_results>/
Other very important outputs from the pipeline are the logging files and the audit trail, which contain information about the software versions used, the parameters used, the error messages, etc. They can be important if you want to publish or reproduce the analysis at a later time point, and also for getting help from the bioinformatics team if you run into trouble with the pipeline. Please read about these files here.
12.1.5 Troubleshooting
Please read first the General Troubleshooting section!