4 Juno-typing

The goal of this pipeline is to perform bacterial typing (7-locus MLST and serotyping). It takes 2 types of files per sample as input:

Two ‘.fastq’ files (paired-end) derived from short-read sequencing. They should already be filtered and trimmed (for instance, with the Juno-assembly pipeline).
An assembly from the same sample in the form of a single ‘.fasta’ file.

Importantly, the Juno-typing pipeline works directly on output generated from the Juno-assembly pipeline.

The Juno-typing pipeline will then perform the following steps:

Select the appropriate 7-locus MLST scheme and, where applicable, a serotyper. The species supported for 7-locus MLST can be found in the database generated by the Center for Genomic Epidemiology at the Technical University of Denmark.
Perform 7-locus MLST using the MLST tool.
If appropriate for the genus/species, the samples will be serotyped. The currently supported species are:
- Salmonella serotyper by using the SeqSero2 tool.
- E. coli serotyper by using the SerotypeFinder tool.
- S. pneumoniae serotyper by using the Seroba tool.
- Shigella serotyper by using the ShigaTyper tool.

Disclaimer!! You can provide a species when calling the pipeline, and it will be used as the species for ALL samples. Alternatively, you can provide a metadata file (explained below) with a different species per sample. The species is used to choose the MLST scheme and the serotyper. If you use the results of Juno-assembly as input and do not provide a metadata file, the results of the species identification step in Juno-assembly will be used! Note that this identification might not always be correct, so you have to check that the serotyper and the MLST scheme were properly chosen.

4.1 Handbook

4.1.1 Requirements and preparation

See the General Instructions for all pipelines first.

This pipeline needs two fastq files (R1 and R2) and one assembly (.fasta) file per sample. The fastq files should have been trimmed and filtered to remove low-quality reads/bases. You could use the Juno-assembly pipeline for that. Moreover, that pipeline also provides the de novo assembly for your samples, and its output folder can be used directly as input for the Juno-typing pipeline. If, however, you prefer to use another tool for assembly and trimming/filtering, make sure that the fastq files and the fasta file of a sample share the same sample name (for instance, sample1_R1_001.fastq.gz, sample1_R2_001.fastq.gz and sample1.fasta). Otherwise, the files may not be recognized as belonging to the same sample. Also, ALL THREE FILES SHOULD BE IN THE SAME FOLDER! If you have multiple samples, they should all be in the same input folder, NOT IN SUBFOLDERS. The only exception is if you use the Juno-assembly pipeline to pre-process your data. In that case, the pipeline will recognize the subfolders containing the fastq files and the fasta files.
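Before starting a run on files that did not come from Juno-assembly, it can help to check that every sample has its three files in one flat folder. The sketch below is only an illustration and not part of the pipeline: the function name `find_incomplete_samples` and the file name pattern are assumptions based on the example names in this handbook, so adjust the pattern if your files are named differently.

```python
import re
from collections import defaultdict
from pathlib import Path

# Assumed naming pattern: <sample>[_S##]_R1|R2[_###].fastq[.gz]
# (derived from the example file names in this handbook, not from the pipeline code).
FASTQ_PATTERN = re.compile(
    r"^(?P<sample>.+?)(?:_S\d+)?_R(?P<read>[12])(?:_\d+)?\.(?:fastq|fq)(?:\.gz)?$"
)


def find_incomplete_samples(input_dir):
    """Return {sample: [missing file kinds]} for samples lacking R1, R2 or a fasta file."""
    found = defaultdict(set)
    for path in Path(input_dir).iterdir():
        if not path.is_file():
            continue  # subfolders are not inspected: the files must sit in one flat folder
        match = FASTQ_PATTERN.match(path.name)
        if match:
            found[match.group("sample")].add("R" + match.group("read"))
        elif path.suffix == ".fasta":
            found[path.stem].add("fasta")
    expected = ("R1", "R2", "fasta")
    return {
        sample: [kind for kind in expected if kind not in kinds]
        for sample, kinds in sorted(found.items())
        if not set(expected) <= kinds
    }
```

Calling `find_incomplete_samples` on your input folder returns an empty dictionary when every sample is complete. Any entry it does return names a sample and the file kinds (R1, R2 or fasta) that could not be matched to it.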

4.1.2 Download the pipeline

YOU NEED TO DOWNLOAD THE PIPELINE ONCE OR EVERY TIME YOU WANT TO UPDATE IT

Make sure to have followed the instructions to set up conda before installing any of our pipelines!

Please follow the instructions to download pipelines from the Juno team of the IDS-bioinformatics group. The Juno-typing pipeline can be found at this link.

4.1.3 Install conda environment

YOU NEED TO REINSTALL THE MASTER ENVIRONMENT EVERY TIME YOU UPDATE THE PIPELINE (every time you download the code)

Open a terminal. (Applications>terminal).
Enter the folder of the pipeline using:

cd /mnt/scratch_dir/<my_folder>/Juno-typing

If you already had a juno_typing environment, you need to delete the old one first by using the command:

conda env remove -n juno_typing

If you have never created a juno_typing environment before, you can skip this step and go directly to the next one.

Create a new environment for running Juno-typing by using the command:

conda env create -f envs/master_env.yaml

This step will take some time (a few minutes).

Note: If this step takes more than 1 hour, please stop the process (using Ctrl + C) and refer to the section General Troubleshooting. The first issue described there (Failure when installing master environment) often solves the problem. If, however, the problem persists, please contact me by email.

4.1.4 Start the analysis. Basics

Open a terminal. (Applications>terminal).
Enter the folder of the pipeline using:

cd /mnt/scratch_dir/<my_folder>/Juno-typing

Activate the juno_typing environment:

conda activate juno_typing

If you run into trouble, please see the troubleshooting section for conda activate.

Run the pipeline

This can be done in three ways. The first one is just providing an input directory with the results of the Juno-assembly pipeline:

python juno_typing.py -i /mnt/scratch_dir/<my_folder>/<results_juno_assembly>/

The second one is providing an input directory as well as a metadata (csv) file. This file should contain at least a column called ‘sample’ (the name of the file without the [_S##]_R1.fastq.gz suffix), a column called ‘genus’ and a column called ‘species’. If a genus and species are provided for a sample, they will override the automatic species identification when choosing the MLST scheme and the serotyper. Example metadata file:

sample	genus	species
sample1	Salmonella	enterica

python juno_typing.py -i my_input_files --metadata path/to/my/metadata.csv
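If you have many samples, the metadata file can be generated instead of typed by hand. The following is a minimal sketch under stated assumptions: the helper names `sample_name` and `write_metadata` are hypothetical, the file is written comma-separated with a header row, and the suffix pattern follows the "[_S##]_R1.fastq.gz" description above with an optional numeric part such as _001.

```python
import csv
import re

# Assumed suffix to strip from an R1 file name: [_S##]_R1[_###].fastq.gz
R1_SUFFIX = re.compile(r"(?:_S\d+)?_R1(?:_\d+)?\.fastq\.gz$")


def sample_name(r1_filename):
    """Strip the [_S##]_R1[_###].fastq.gz suffix from an R1 file name."""
    return R1_SUFFIX.sub("", r1_filename)


def write_metadata(rows, csv_path):
    """Write (r1_filename, genus, species) tuples as a metadata csv file."""
    with open(csv_path, "w", newline="") as handle:
        writer = csv.writer(handle)  # comma-separated output is an assumption
        writer.writerow(["sample", "genus", "species"])
        for r1_filename, genus, species in rows:
            writer.writerow([sample_name(r1_filename), genus, species])
```

For example, `write_metadata([("sample1_S1_R1_0001.fastq.gz", "Salmonella", "enterica")], "metadata.csv")` writes a header line followed by one line for sample1. Check the resulting sample names against your fasta file names before running the pipeline.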

The last way is to tell the pipeline which species the samples belong to. Note that only ONE species can be given for ALL the samples, so it will be assumed that they all belong to the same one. The species should be given as two words (genus + species).

python juno_typing.py -i my_input_files --species salmonella enterica

If you give both a metadata file and a --species, the --species will take precedence and override the metadata file.
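The order in which the species information is used can be summarized in a few lines. This is only an illustration of the precedence described in this handbook (--species first, then the metadata file, then the species identified by Juno-assembly). It is not the pipeline's actual code, and the function name `resolve_species` and its arguments are hypothetical.

```python
def resolve_species(sample, cli_species=None, metadata=None, identified=None):
    """Pick the (genus, species) pair used for one sample.

    cli_species: pair given with --species, applied to ALL samples.
    metadata:    {sample: (genus, species)} read from the metadata file.
    identified:  {sample: (genus, species)} from the Juno-assembly results.
    """
    if cli_species is not None:
        return cli_species  # --species overrides everything else
    if metadata and sample in metadata:
        return metadata[sample]  # the metadata file overrides the identification
    if identified and sample in identified:
        return identified[sample]
    return None  # no species information available for this sample
```

Whichever source ends up being used, check afterwards that the MLST scheme and the serotyper chosen for each sample make sense.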

Note: For the sample in the example metadata file above, the fastq files would probably be named something like sample1_S1_R1_0001.fastq.gz and sample1_S1_R2_0001.fastq.gz, and the fasta file sample1.fasta.

Please read the section What to expect while running a Juno pipeline.

See the section General Troubleshooting for any problems you may encounter.

Note: Do not keep all your data (including results) on the scratch_dir partition. You are allowed to keep 400 GB at most and, with sequencing data, this can fill up quite fast.

4.1.5 Output

A folder called output/ will be created inside the folder of the pipeline. It will contain all the results and logging files of your analysis. There will be one folder per step (mlst7 and serotype). Please refer to the manual of each tool to interpret the results. Each of these folders contains a sub-folder per sample and a csv file collecting the results of all the samples together: mlst7/mlst7_multireport.csv and serotype/serotype_multireport.csv.

Note: If you want your output to be stored in a folder with a different name or location, you can use the option -o (‘o’ from output):
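The two multireport files can be loaded for further processing with a few lines of Python. The sketch below assumes that the reports are comma-separated with a header row. It makes no assumption about the column names, which depend on the tools used. The function name `load_multireports` is hypothetical.

```python
import csv
from pathlib import Path

# Report locations as described in this handbook, relative to the output folder.
MULTIREPORTS = (
    "mlst7/mlst7_multireport.csv",
    "serotype/serotype_multireport.csv",
)


def load_multireports(output_dir="output"):
    """Return {relative path: list of row dicts} for each multireport that exists."""
    reports = {}
    for relative_path in MULTIREPORTS:
        report_file = Path(output_dir) / relative_path
        if not report_file.is_file():
            continue  # skip reports that were not produced
        with open(report_file, newline="") as handle:
            reports[relative_path] = list(csv.DictReader(handle))
    return reports
```

Each row is returned as a dictionary keyed by the header of the report, so `load_multireports("output")["mlst7/mlst7_multireport.csv"]` gives one dictionary per line of the MLST report.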

python juno_typing.py -i /mnt/scratch_dir/<my_folder>/<my_data>/ -o /mnt/scratch_dir/<my_folder>/<my_results>/

Other very important outputs of the pipeline are the logging files and the audit trail, which contain information about the software versions used, the parameters used, the error messages, etc. They can be important if you want to publish or reproduce the analysis at a later time, and also to get help from the bioinformatics team if you run into trouble with the pipeline. Please read about these files here.

4.1.6 Troubleshooting for this pipeline

Please read first the General Troubleshooting section!

4.1.6.1 Other problems or failing rules

The Juno-typing pipeline is still in development, which means that the process can sometimes fail.

Before contacting for help, try these two steps:

Re-run the pipeline and see if the process continues. If it does, keep re-running the pipeline until your analysis is finished or there is no further progress. In this case, send an email after the pipeline is finished so I can troubleshoot the problem.
Download the pipeline again and start from the beginning of this handbook. Sometimes there is an issue that has been resolved in newer versions of the pipeline.

If the pipeline still fails after these two steps, please inform me about the problem. Send an e-mail with the following content:

The log and error files that can be found in the output folder
The path to your input directory
The path to where the pipeline is installed

Note: I cannot help you without this information. If information is missing, there will be a delay in troubleshooting the problem.