6 Juno_cgMLST
The goal of this pipeline is to perform cgMLST on bacterial genomes. As input, it requires an assembly for each sample you want to analyze in the form of a single ‘.fasta’ file. Importantly, this pipeline works directly on output generated from the Juno-assembly pipeline.
Juno-cgMLST uses ChewBBACA to determine the cgMLST profile of the given genomes. In addition, the result table with allele numbers is translated into a table with (SHA-1) hashes, so that results can be shared and compared even if they were generated with different databases. Note: We strongly encourage you to use the hashes instead of the allele numbers for your analysis. This makes your data more reproducible if the database that assigns allele numbers is updated, and makes it easier to share with colleagues.
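To illustrate why hashes stay comparable across databases, here is a minimal sketch that hashes allele sequences with SHA-1. It is not the pipeline's own code: the function name `allele_hash`, the normalisation (stripping whitespace and upper-casing) and the example sequences are assumptions made for illustration only.

```python
import hashlib


def allele_hash(sequence: str) -> str:
    """Return the SHA-1 hex digest of an allele's nucleotide sequence.

    Assumption: whitespace is stripped and the sequence is upper-cased
    before hashing. The pipeline may normalise sequences differently.
    """
    normalized = sequence.strip().upper()
    return hashlib.sha1(normalized.encode("ascii")).hexdigest()


# Two hypothetical databases that number the same alleles differently ...
db_a = {1: "ATGGCTAAATGA", 2: "ATGGCTAAGTGA"}
db_b = {7: "ATGGCTAAATGA", 3: "ATGGCTAAGTGA"}

# ... still produce the same hashes, because a hash depends only on the sequence.
hashes_a = {allele_hash(seq) for seq in db_a.values()}
hashes_b = {allele_hash(seq) for seq in db_b.values()}
print(hashes_a == hashes_b)
```

The allele numbers (1, 2 versus 7, 3) cannot be compared between the two databases, whereas the hashes can.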
6.1 Handbook
6.1.1 Requirements and preparation
See the General Instructions for all pipelines first.
- This pipeline needs one .fasta file (the assembly) per sample as input. ALL THE INPUT FILES SHOULD BE IN THE SAME FOLDER, NOT IN SUBFOLDERS. The only exception is if you use the Juno-assembly pipeline to pre-process your data. In that case, the pipeline will recognize the subfolders where the assemblies should be.
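Before starting a run, you could check the layout of your input folder with a small script like the one below. This is only a sketch and not part of the pipeline: the function name `check_input_dir` is hypothetical, and it does not know about the Juno-assembly exception, so it will also report assemblies inside a Juno-assembly output folder as nested.

```python
from pathlib import Path


def check_input_dir(input_dir):
    """Return (top_level, nested) lists of .fasta files found in input_dir.

    top_level: assemblies directly inside the folder (the expected layout).
    nested:    assemblies hidden in subfolders at any depth.
    """
    root = Path(input_dir)
    top_level = sorted(p for p in root.glob("*.fasta") if p.is_file())
    nested = sorted(
        p for p in root.rglob("*.fasta") if p.is_file() and p.parent != root
    )
    return top_level, nested
```

If `nested` is not empty and your input does not come from Juno-assembly, move those files to the top level of the input folder first.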
6.1.2 Download the pipeline
YOU NEED TO DOWNLOAD THE PIPELINE ONCE OR EVERY TIME YOU WANT TO UPDATE IT
Make sure to have followed the instructions to set up conda before installing any of our pipelines!
Please follow the instructions to download pipelines from the Juno team of the IDS-bioinformatics group. The Juno-cgMLST pipeline can be found in this link.
6.1.3 Install conda environment
YOU NEED TO REINSTALL THE MASTER ENVIRONMENT EVERY TIME YOU UPDATE THE PIPELINE (every time you download the code)
Open a terminal. (Applications>terminal).
Enter the folder of the pipeline using:
cd /mnt/scratch_dir/<my_folder>/Juno_cgmlst
- If you already had a juno_cgmlst environment before, you need to delete the old one by using the command:
conda env remove -n juno_cgmlst
If you have never created a juno_cgmlst environment before, you can skip this step and go to step 4 instead.
- Create a new environment for running Juno_cgmlst by using the command:
conda env create -f envs/master_env.yaml
This step will take some time (a few minutes).
Note: If this step takes more than 1 hour, please kill the process (using Ctrl + C or Ctrl + Z) and refer to the section General Troubleshooting. The first issue written there (Failure when installing master environment) often solves the problem. If, however, the problem persists, please contact me by email.
6.1.4 Start the analysis. Basics
Open a terminal. (Applications>terminal).
Enter the folder of the pipeline using:
cd /mnt/scratch_dir/<my_folder>/Juno_cgmlst
- Activate juno_cgmlst environment
conda activate juno_cgmlst
If you run into trouble, please see the troubleshooting section for conda activate.
- Run the pipeline
This can be done in three ways. The first one is just providing an input directory with the results of the juno_assembly pipeline:
python juno_cgmlst.py -i /mnt/scratch_dir/<my_folder>/<results_juno_assembly>/
The second one is providing an input directory as well as a metadata (csv) file. This file should contain at least one column with the ‘sample’ name (the name of the file without [_S##]_R1.fastq.gz) and a column called ‘genus’. If a genus is provided for a sample, it will override the species identification performed by this pipeline when choosing the scheme for MLST and the serotyper. Example metadata file:
sample | genus | species |
---|---|---|
sample1 | Salmonella | enterica |
Note: The fastq files corresponding to this sample would probably be something like sample1_S1_R1_0001.fastq.gz and sample1_S1_R2_0001.fastq.gz, and the fasta file sample1.fasta.
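The relation between file names, sample names and the metadata file can be sketched as follows. This is an illustration, not the pipeline's own parsing code: the function names and the regular expression are assumptions, and the pattern also accepts an optional numeric suffix such as _0001 because the example file names above contain one.

```python
import csv
import io
import re

# Assumption: read files are named <sample>[_S##]_R1[_0001].fastq.gz (or R2).
FASTQ_SUFFIX = re.compile(r"(_S\d+)?_R[12](_\d+)?\.fastq\.gz$")


def sample_name(fastq_filename):
    """Strip the [_S##]_R1.fastq.gz part to obtain the sample name."""
    return FASTQ_SUFFIX.sub("", fastq_filename)


def read_genus_per_sample(csv_handle):
    """Map each 'sample' to its 'genus'; rows without a genus are skipped."""
    reader = csv.DictReader(csv_handle)
    return {
        row["sample"].strip(): row["genus"].strip()
        for row in reader
        if row.get("genus") and row["genus"].strip()
    }


# Example metadata, written as comma-separated values.
metadata_csv = io.StringIO("sample,genus,species\nsample1,Salmonella,enterica\n")
genus_per_sample = read_genus_per_sample(metadata_csv)
print(sample_name("sample1_S1_R1_0001.fastq.gz"), genus_per_sample)
```

The value in the ‘sample’ column must match the name obtained this way (and the name of the .fasta file without its extension), otherwise the genus cannot be linked to the sample.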
python juno_cgmlst.py -i my_input_files --metadata path/to/my/metadata.csv
The last way is to tell the pipeline which genus the samples have. Note that only ONE genus can be given for ALL the samples, so it will be assumed that they all belong to the same one.
python juno_cgmlst.py -i my_input_files --genus salmonella
If you give both a metadata file and a --genus, the --genus will take precedence and override the metadata file.
Please read the section What to expect while running a Juno pipeline.
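The precedence rule described above can be summarised in a few lines. This is a sketch of the rule only, with a hypothetical function name and arguments; it does not reproduce how the pipeline implements it.

```python
def resolve_genus(sample, cli_genus=None, metadata=None):
    """Pick the genus for one sample.

    cli_genus: value given with --genus (applies to ALL samples), or None.
    metadata:  dict mapping sample name to genus from the csv file, or None.
    Returns None when no genus was supplied for the sample.
    """
    if cli_genus:
        # --genus takes precedence over the metadata file.
        return cli_genus
    if metadata:
        return metadata.get(sample)
    return None


metadata = {"sample1": "Salmonella"}
print(resolve_genus("sample1", cli_genus="salmonella", metadata=metadata))
print(resolve_genus("sample1", metadata=metadata))
```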
See the section General Troubleshooting for any problems you may encounter.
Note for large datasets: If your dataset is large (more than 30 samples), it is necessary to give the pipeline extra time to process the samples. This means that it needs to run with the -w or --time-limit argument. The default is 60 (minutes), but for large datasets it should be increased accordingly. Example:
python juno_cgmlst.py -i my_large_input_dir -o my_results_dir --db_dir my_db_dir --metadata path/to/my/metadata.csv --time-limit 120
Note: Do not keep all your data (including results) on the scratch_dir partition. You are allowed to keep 400GB max and with sequencing data, this can get full quite fast.
6.1.5 Output
A folder called output/, inside the folder of the pipeline, will be created. This folder will contain all the results and logging files of your analysis. Please refer to the manual of ChewBBACA to interpret the results.
Note: If you want your output to be stored in a folder with a different name or location, you can use the option -o (‘o’ from output):
python juno_cgmlst.py -i my_input_files --metadata path/to/my/metadata.csv -o /mnt/scratch_dir/<my_folder>/<my_results>/
Other very important outputs of the pipeline are the logging files and audit trail, which contain information on the software versions used, the parameters used, the error messages, etc. They can be important if you want to publish or reproduce the analysis at a later time point, and also for getting help from the bioinformatics team if you run into trouble with the pipeline. Please read about these files here.
6.1.6 Troubleshooting for this pipeline
Please read first the General Troubleshooting section!
6.1.6.1 Other problems or failing rules
The Juno-cgMLST pipeline is still in development, which means that sometimes the process can fail.
Before asking for help, try these two steps:
Re-run the pipeline and see if the process continues. If it does, please keep re-running the pipeline until your analysis is finished or there is no longer progress. In this case, send an email after the pipeline is finished so I can troubleshoot the problem.
Download the pipeline again and start from the beginning of this handbook. Sometimes there is an issue that has been resolved in newer versions of the pipeline.
If the pipeline still fails after these two steps, please inform me about the problem. Send an e-mail with the following content:
- The log and error files that can be found in the output folder
- The path to your input directory
- The path to where the pipeline is installed
Note: I cannot help you without this information. If information is missing, there will be a delay in troubleshooting the problem.