12 Triumph: Run pipeline

CURRENTLY NOT MAINTAINED OR SUPPORTED

This pipeline processes raw fastq files from microbiome data using the Dada2 R-package. At the end, it also produces a report summary of the run.

The pipeline has been tailor-made for the microbiome group at the RIVM, especifically for the Pienter project. It therefore expects the input that is normally generated there and might not be very flexible with sample names and metadata. I will try to explain what is expected from the input when necessary.

12.1 Handbook

12.1.1 Requirements and preparation

See the General Instructions for all pipelines first.

  • Make sure that your files have the right format: they must have a fastq extension (.fastq, .fq, .fastq.gz or .fq.gz) and have specific names. The names should be only numbers (they can be preceded by 1 letter). The numbers should give information about the run and the sample. For instance, in the file name L002-026_R1.fastq.gz, it means that you are processing the project “L”, the run # 2 and the sample “026”. Moreover, the suffix ’_R1’ or ’_R2’ designate forward or reverse reads. The MiSeq sequencer normally adds an extra suffix _0001 to the names of the fastq files L002-026_R1_0001.fastq.gz. These suffix is no problem for the pipeline to work.

  • This pipeline needs a metadata (Excel file) should have a very specific format. For details about how to make the metadata file and what the codes mean, please consult with the microbiome team at the RIVM, especially Susana Fuentes. For the use of the pipeline, the main features that need to be added in this Excel file are:

    • It should have the file extension .xlsx file (no .csv).
    • It should have at least a sheet called “amplicon_assay”. The metadata file for the Triumph project includes many other sheets but this is the only one that is necessary. Leaving the sheet unnamed will not work even if the information inside the sheet is correctly formatted.
    • The amplicon_assay sheet should have at least three columns with the names: assay_sample, sample_indentifier and subject_identifier.
    • The assay_sample column should contain the “root” of the sample names that are used to name the fastq files (see previous point). For instance, the assay_sample code for the file L002-026_R1_0001.fastq.gz is L002-026.
    • The sample_identifier column should contain the identity of the samples. Here specific codes are needed for the control samples. For instance, every non-control sample is named something like S12345678 where “S” denotes that it is a sample belonging to a patient and the numbers are a unique ID for that sample. In contrast, the control samples don’t have unique IDs. They are always named the same way in every run. The accepted codes for the control samples can be found in the Table 1.
    • The subject_identifier column contains the same code than sample_identifier and it is used as a back-up. When the codes in sample_identifier are not found, they are looked for instead in the subject_identifier column.
TABLE 12.1: Table 1. Accepted subject_identifier codes for metadata used in the Triumph project.
subject_identifier subject_description
ZMCD Zymo mock DNA
ZMCB Zymo mock Bacteria
AMCD ATCC mock DNA
UMD UMCU low density mock DNA
MSS Mixed (fecal/NP/OP…) sample control
MSD Mixed (fecal/NP/OP…) sample control DNA
BD Blank from DNA extraction
BP Blank from PCR
LSIC Library spike-in control

12.1.2 Download the pipeline

YOU NEED TO DOWNLOAD THE PIPELINE ONCE OR EVERY TIME YOU WANT TO UPDATE IT

Make sure to have followed the instructions to set up conda before installing any of our pipelines!

Please follow the instructions to download pipelines from the Juno team of the IDS-bioinformatics group. The triumph_microbiome_run pipeline can be found in this link.

12.1.3 Start the analysis

  1. Open the terminal.

  2. Enter the folder of the pipeline:

cd /mnt/scratch_dir/<my_folder>/triumph_dada2_run_pipeline-master/
  • Exact name of downloaded folder might be slightly different depending on the version that was downloaded.
  1. Run the pipeline
bash triumph_run_pipeline.sh -i /mnt/scratch_dir/<my_folder>/<my_last_run>/ --metadata /path/to/metadata.xlsx

Please read the section What to expect while running a Juno pipeline

See the section General Troubleshooting for any problems you may encounter.

Note: Do not keep all your data (including results) on the scratch_dir partition. You are allowed to keep 400GB max and with sequencing data, this can get full quite fast.

12.1.3.1 Run the pipeline changing default parameters

Many of the parameters of the dada2 package can be changed/used directly when calling the pipeline. If you want to do that, please refer to the help of the pipeline by using the command:

bash triumph_run_pipeline.sh -h

It should print something like this:

Usage: bash triumph_run_pipeline.sh -i <INPUT_DIR> <parameters>

Input:

  -i, --input [DIR]   This is the folder containing the input (demultiplexed)
                      fastq files. This should be a folder for one single 
                      run (sub-folder name will be considered the run name! 
                      so give it a meaningul one). Default is the current folder.

  --metadata [xlsx file] (optional) This is an optional parameter in which you 
                                    provide an xlsx file with the metadata. 
                                    For the exact format of this file, you should 
                                    refer to the microbiome group at the RIVM  
                                    (Susana Fuentes). In summary, it should 
                                    contain a sheet called 'amplicon_assay', with
                                    an empty row at the top (it does not need to 
                                    be empty, but that row will be ignored) and
                                    then a table with at least two columns: 
                                    assay_sample and sample_identifier. The 
                                    assay_sample should have the names coinciding 
                                    with the fastq files (so, like the sample 
                                    sheet used for demultiplexing) and the 
                                    sample_identifier should contain either the
                                    sample code or one of the codes for 
                                    recognizing control samples (ask the microbiome group).


Main Output (automatically generated):
  out/                              Contains detailed intermediate files.
  out/log/                          Contains all log files.
  out/run_reports/                  Folder containint the actual reports of the
                                    run (html file) and the summary table of the 
                                    run (used to generate that report) as a csv file.

Parameters:
  -o, --output         Output directory. Default is 'out/'

  --pienter-up        If this option is chosen, the default options for the 
                      nasopharynx Triumph project will be used, which means 
                      that truncLenF will be overwritten to 200 and the truncLenR 
                      will be overwritten to 150.

  -h, --help          Print this help document.

  -sh, --snakemake-help Print the Snakemake this help document.

  --clean (-y)        Removes output. (-y forces 'Yes' on all prompts)

  -u, --unlock        Removes the lock on the working directory. This happens when
                      a run ends abruptly and prevents you from doing subsequent
                      analyses.

  -n, --dry_run       Shows the process but does not actually run the pipeline.

Dada2 parameters that can be changed in this pipeline (if you decide to be able
to directly modify another parameter when you call the pipeline, contact me 
alejandra.hernandez.segura@rivm.nl) for more details:

  --trunQ             Minimumn quality for truncating a read (part of dada2 
                      filtering). Default: 2

  --truncLenF         Truncating length for forward reads(part of dada2 
                      filtering). Default: 220

  --truncLenR         Truncating length for reverse reads(part of dada2 
                      filtering). Default: 100

  --maxN              Maximum number of N bases accepted per read (part of dada2 
                      filtering). Default: 0. Note that the dada2 requires this 
                      parameter to be 0.

  --maxEE             Maximum Expected Errors (part of dada2 filtering). 
                      Default: 2

  --nomatchID               If this file is present, the reads will not be expected to
                      match. The default is to expect filtered reads to have 
                      matching IDs (part of dada2 filtering). Note that you should
                      NOT put 'true' or 'false' after the flag. The flag itself 
                      forces the condition to be FALSE

  --nbases            Number of bases to use in order to learn the errors (part 
                      of dada2 error learning). Default: 1e9

  --minlen            Minimum length of the inferred SAVs. Any SAV shorter than 
                      minlen will be discarded (part of dada2 inferring SAVs). 
                      Default: 250

  --maxlen            Maximum length of the inferred SAVs. Any SAV longer than 
                      maxlen will be discarded (part of dada2 inferring SAVs). 
                      Default: 256

  --trainset                Path to trainset used to assign taxonomy to a phyloseq object. 
                      Dada2 suggests a database in its tutorial. In the RIVM, we have
                      a copy stored at: '/mnt/db/triumph_taxonomy_db/silva_nr_v138_train_set.fa.gz'
                      and this path is the default.

  --species_db        Path to database used to assign species name to a phyloseq 
                      object. Dada2 suggests a database in its tutorial. In the RIVM, 
                      we have a copy stored at: '/mnt/db/triumph_taxonomy_db/silva_species_assignment_v138.fa.gz'
                      and this path is the default.

  --no-randomize-errors, -nr If this flag is present, there will not be randomiza-
                      tion while learning the errors (part of dada2 learn errors). 
                      The default is to randomize. Note that you should
                      NOT put 'true' or 'false' after the flag. The flag itself 
                      forces the condition to be FALSE

  --just-concatenate  If this flag is present, the pairs will be simply concatenated.
                      (part of dada2 mergePairs). The default is set to FALSE. Note 
                      that you should NOT put 'true' or 'false' after the flag. The 
                      flag itself forces the condition to be FALSE.

  --bimera_method     Choose a method to remove bimeras (according to dada2 
                      removeBimeraDenovo). 

  --minboot           The minimum bootstrap confidence for assigning a taxonomic level
                      (part of dada2 assignTaxonomy). Default: 80

  --no-tryrc          If TRUE, the reverse-complement of each sequences will be 
                      used for classification if it is a better match to the 
                      reference sequences than the forward sequence. The default is
                      TRUE. Note that you should
                      NOT put 'true' or 'false' after the flag. The flag itself 
                      forces the condition to be FALSE

  --species-allowed   This is the number of species that may be enlisted (part of 
                      dada2 addSpecies). Default is 3.

12.1.3.2 Running when metadata is not available

You can also run the pipeline without the metadata:

bash triumph_run_pipeline.sh -i /mnt/scratch_dir/<my_folder>/<my_last_run>/ 

However, the report produced of such a run is not very informative and the code is not actively maintained. Therefore, the report might fail. The pipeline is not intended to be run without metadata so it will do its job but not efficiently.

12.1.4 Output

A folder called out/, inside the folder of the pipeline, will be created. If you chose another site for the output, then you will find the same results there. This folder will contain all the results and logging files of your analysis. There will be one folder with all the filtered_fastq files, one with the plots and one with the RDS_files generated when using the dada2 package (for instance, seqtab, merged files, etc.). These RDS files can be loaded into R where you can explore them and further work with them. The last folder is the run_reports which will have the html report with statistics of the run and the “Project_SummaryTable.csv” with all the statistics in a table format. Please refer to the dada2 documentation to interpret your results.

Note: If you want your output to be stored in a folder with a different name or location, you can use the option -o (‘o’ from output)

bash triumph_run_pipeline.sh -i /mnt/scratch_dir/<my_folder>/<my_last_run>/ -o /mnt/scratch_dir/<my_folder>/<my_results>/

Another very important output from the pipeline are the logging files and audit trail that contain information of the software versions used, the parameters used, the error messages, etc. They could be important for you if you want to publish or reproduce the analysis at a later timepoint and also to get help from the bioinformatics team if you were to run into trouble with the pipeline. Please read about these files here.

12.1.5 Troubleshooting

Please read first the General Troubleshooting section!

12.1.5.1 Other problems or failing rules

The triumph_microbiome_run pipeline is currently not maintained or supported.