8 Juno-annotation

This pipeline takes assemblies (.fasta) as input and performs gene annotation. It should be able to annotate bacterial genomes and bacterial plasmids. The pipeline follows these steps:

Filtering contigs with more than 200bp. This step is necessary for downstream tools to run properly and, in general, small contigs are often filtered to reduce the noise of possible contamination. Besides, these small contigs often do not contain much valuable information.
Finding the start of the chromosome ( Circlator ). Although in most cases, the chromosome is already circularized after the assembly, in the cases in which that was not possible, this step might help. The start of the chromosome will be set at the beginning of the dnaA gene. For the plasmids or other smaller contigs, the start of a predicted gene near its center is used.
Annotation using Prokka. Prokka is a fast program for prokaryotic genome annotation. We enriched the databases that Prokka relies on by using the RefSeq database. This step is very slow if run locally (~ 3 hours) but in the cluster it takes around 30 min per sample.

8.1 Handbook

8.1.1 Requirements and preparation

See the General Instructions for all pipelines first.

The pipeline needs to know the “Genus” and “Species” of each input fasta file. There are three ways to do this (see the section Run the pipeline for more details on how to do this and the recognized abbreviations):
- The pipeline can guess this information from the file name provided the appropriate abbreviations are used within the name.
- You can provide a metadata file (.csv) that should contain at least three columns: “File_name”, “Genus” and “Species” (mind the capital letters). The File name is case sensitive, so the names of the files (without the full path) should coincide EXACTLY with your input fasta files. The genus and species should be recognized as an official TaxID. You MUST write the genus and the species in the appropriate column, never together in one column. This option cannot be combined with the first one. This means that if you decide to “guess” the genus and species from the file name, the provided metadata file will be ignored.
- You can provide the genus and species directly when calling the pipeline. The genus and species should be recognized as an official TaxID. YOU CAN ONLY PROVIDE ONE GENUS AND ONE SPECIES and this will be used for all samples. You can combine this option with one of the two above. This means that if you have multiple samples but some of them are not enlisted in your metadata file, they will instead inherit the genus and species from the information provided directly when calling the pipeline. If all the files were enlisted in the metadata, this option will be ignored.

8.1.2 Download the pipeline

YOU NEED TO DOWNLOAD THE PIPELINE ONCE OR EVERY TIME YOU WANT TO UPDATE IT

Make sure to have followed the instructions to set up conda before installing any of our pipelines!

Please follow the instructions to download pipelines from the Juno team of the IDS-bioinformatics group. The Juno-annotation pipeline can be found in this link.

8.1.3 Start the analysis. Basics

Open a terminal. (Applications>terminal).
Enter the folder of the pipeline using:


cd /mnt/scratch_dir/<my_folder>/Juno-annotation

Run the pipeline

If all your samples have the same genus and species, for instance, Klebsiella pneumoniae, you run it like this:

bash start_annotation.sh -i /mnt/scratch_dir/<my_folder>/<my_data>/ --genus Klebsiella --species pneumoniae

Note that the genus and species should be ONE word. Do not put genus and species together!

If your samples do not have all the same genus, you can make a metadata file. This file should be a .csv file with at least these columns and information:

File_name	Genus	Species
sample1_Kpn.fasta	Klebsiella	pneumoniae
sample2Pae2.fasta	Pseudomonas	aeruginosa
sample3Sau_1.fasta	Staphylococcus	aureus

Note that the name of the columns should be EXACTLY the same than in this example, including the underscores and the capital letters.

You would then call the pipeline like this:

bash start_annotation.sh -i /mnt/scratch_dir/<my_folder>/<my_data>/ --metadata path/to/metadata.csv

Alternatively, if your input files have one of the following abbreviations somewhere in the data, the genus and species may be automatically guessed from the file name:

Abbreviation	Genus	Species
Cam	Citrobacter	amalonaticus
Cbr	Citrobacter	braakii
Cfr	Citrobacter	freundii
Cse	Citrobacter	sedlakii
Cwe	Citrobacter	werkmanii
Ebu	Enterobacter	bugandensis
Eca	Enterobacter	cancerogenus
Ecl	Enterobacter	cloacae
Eco	Escherichia	coli
Exi	Enterobacter	cloacae
Kae	Klebsiella	aerogenes
Kox	Klebsiella	oxytoca
Kpn	Klebsiella	pneumoniae
Mmo	Morganella	morganii
Pae	Pseudomonas	aeruginosa
Pmi	Proteus	mirabilis
Rpl	Raoultella	planticola
Sar	Staphylococcus	argenteus
Sau	Staphylococcus	aureus
Sma	Serratia	marcescens

Then you would call the pipeline like this:

bash start_annotation.sh -i /mnt/scratch_dir/<my_folder>/<my_data>/ --make-metadata

Please read the section What to expect while running a Juno pipeline

See the section General Troubleshooting for any problems you may encounter.

Note: Do not keep all your data (including results) on the scratch_dir partition. You are allowed to keep 400GB max and with sequencing data, this can get full quite fast.

8.1.4 Output

A folder called out/, inside the folder of the pipeline, will be created. This folder will contain all the results and logging files of your analysis. There will be one folder per tool (circlator, filtered_contigs and prokka). Please refer to the manuals of every tool to interpret the results. Your main result will be two genebank (.gbk) files per sample that contained your annotated genomes/plasmids. You can find them inside the subfolders out/prokka.

Note: If you want your output to be stored in a folder with a different name or location, you can use the option -o (from output)

bash start_annotation.sh -i /mnt/scratch_dir/<my_folder>/<my_data>/ -o /mnt/scratch_dir/<my_folder>/<my_results>/

Another very important output from the pipeline are the logging files and audit trail that contain information of the software versions used, the parameters used, the error messages, etc. They could be important for you if you want to publish or reproduce the analysis at a later timepoint and also to get help from the bioinformatics team if you were to run into trouble with the pipeline. Please read about these files here.

8.1.5 Troubleshooting for this pipeline

Please read first the General Troubleshooting section!

8.1.5.1 Failure to find metadata file

You could get an error associated to the metadata:

The provided species file <your_metadata_file.csv> does not exist. Please provide an existing file
"If you used the option --make-metadata, please check that all the fasta files contain the .fasta extension and that the file names have the right abbreviations for genus/species

This message means that either you provided the wrong path/name to your metadata file, that this has the wrong extension (not .csv) or that, if you used the option to --make-metadata, your input files do not have the right abbreviations. Please check the metadata file provided or the file names and try again.

8.1.5.2 Other problems or failing rules

The Juno-annotation pipeline is still in development which means that sometimes the process can fail. This pipeline is not very actively maintained but if you run into problems, you can ask Roxanne Wolthuis or Fabian Landman.