10 Low frequency variants in Mycobacterium samples

CURRENTLY NOT MAINTAINED OR SUPPORTED

The main purpose of this pipeline is to call minority variants in Mycobacterium samples. It takes paired-end raw fastq files as input (the name of one file should contain R1 and the name of the other R2). The pipeline performs the following steps:

  1. Quality Control of the raw reads using FastQC

  2. Trimming with Trimmomatic

  3. Quality Control of the trimmed reads using FastQC

  4. Alignment to reference genome using BWA

  5. Preparation to load in LoFreq

  6. Calling SNPs and indels using LoFreq

  7. Annotating the resulting vcf file using vcf-annotator

Although the pipeline was written for Mycobacterium, it can easily be extended to other types of bacteria. Please contact me if you want to use the pipeline for another organism.
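To give an idea of what the steps listed above involve, the sketch below prints one example command per step without running anything. It is not the code of the pipeline: the helper name print_plan, all file names, the trimming settings and the vcf-annotator call are assumptions made for illustration, and the real pipeline may use different parameters. Please check the help of each tool before using any of these commands.

```bash
# Dry-run sketch: prints example commands, it does not run any tool.
# print_plan is a hypothetical helper, not part of the pipeline.
# Arguments: sample name, R1 file, R2 file, reference fasta, reference GenBank file.
print_plan() {
    sample=$1; r1=$2; r2=$3; ref=$4; gbk=$5
    cat <<EOF
fastqc -o qc_raw $r1 $r2
trimmomatic PE $r1 $r2 ${sample}_R1_paired.fastq.gz ${sample}_R1_unpaired.fastq.gz ${sample}_R2_paired.fastq.gz ${sample}_R2_unpaired.fastq.gz SLIDINGWINDOW:4:20 MINLEN:36
fastqc -o qc_trimmed ${sample}_R1_paired.fastq.gz ${sample}_R2_paired.fastq.gz
bwa index $ref
bwa mem $ref ${sample}_R1_paired.fastq.gz ${sample}_R2_paired.fastq.gz | samtools sort -o ${sample}.sorted.bam -
lofreq indelqual --dindel -f $ref -o ${sample}.indelqual.bam ${sample}.sorted.bam
samtools index ${sample}.indelqual.bam
lofreq call --call-indels -f $ref -o ${sample}.vcf ${sample}.indelqual.bam
vcf-annotator ${sample}.vcf $gbk --output ${sample}.annotated.vcf
EOF
}

# Example with made-up file names (assumptions, not files shipped with the pipeline).
print_plan sample1 sample1_R1.fastq.gz sample1_R2.fastq.gz reference.fasta reference.gbk
```

The indel quality step (lofreq indelqual) is shown as one possible meaning of "preparation to load in LoFreq"; this is an assumption about what the pipeline does at that step.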

10.1 Handbook

10.1.1 Requirements and preparation

See the General Instructions for all pipelines first.

  • Make sure that your files have the right format: they must have a fastq extension (.fastq, .fq, .fastq.gz or .fq.gz) and contain the characters R1 (for forward reads) or R2 (for reverse reads) somewhere in the name. As with the folder name, you should avoid unusual characters in your file names: use only letters, numbers, underscores or dashes, and make sure there are no spaces.
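The naming rules above can be checked before starting the pipeline. The following is a minimal sketch, assuming a helper called check_fastq_name (a hypothetical name, not part of the pipeline). It accepts dots as well, because the extension needs them.

```bash
# Hypothetical helper, not part of the pipeline: checks one file name
# against the naming rules described above.
check_fastq_name() {
    name=$(basename "$1")
    # Reject anything that is not a letter, number, underscore, dash or dot
    # (this also rejects spaces).
    case "$name" in
        *[!A-Za-z0-9_.-]*) echo "bad characters: $name"; return 1 ;;
    esac
    # The extension must be one of the four accepted fastq extensions.
    case "$name" in
        *.fastq|*.fq|*.fastq.gz|*.fq.gz) ;;
        *) echo "bad extension: $name"; return 1 ;;
    esac
    # The name must contain R1 or R2 somewhere.
    case "$name" in
        *R1*|*R2*) ;;
        *) echo "missing R1/R2: $name"; return 1 ;;
    esac
    echo "ok: $name"
}

# Example: check every file in your data folder (replace the path with your own).
# for f in /path/to/my_data/*; do check_fastq_name "$f"; done
```

The function only looks at the name; it does not check that every R1 file has a matching R2 file or that the content is valid fastq.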

10.1.2 Download the pipeline

YOU NEED TO DOWNLOAD THE PIPELINE ONLY ONCE, AND AGAIN EVERY TIME YOU WANT TO UPDATE IT

Make sure you have followed the instructions to set up conda before installing any of our pipelines!

Please follow the instructions to download pipelines from the Juno team of the IDS-bioinformatics group. The Myco-lofreq pipeline can be found at this link.

10.1.3 Start the analysis. Basics

  1. Open the terminal. You can go to the Linux menu called “Applications” and open the program “terminal” or the “terminator” one. Both should work.

  2. Enter the folder of the pipeline

cd /mnt/scratch_dir/<my_folder>/Myco_lofreq

  3. Run the pipeline

bash run_myco_lofreq_pipeline.sh -i /mnt/scratch_dir/<my_folder>/<my_data>/

Please read the section What to expect while running a Juno pipeline.

See the section General Troubleshooting for any problems you may encounter.

Note: Do not keep all your data (including results) on the scratch_dir partition. You are allowed to keep a maximum of 400 GB there and, with sequencing data, it fills up quite fast.
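To keep an eye on the space you are using, you can measure the size of your folder from time to time. The sketch below assumes a helper called check_folder_size (a hypothetical name) and a warning threshold of 90% of the limit, which is an arbitrary choice. It only measures one folder with du; how the 400 GB limit is actually enforced on scratch_dir is not covered here.

```bash
# Hypothetical helper: reports the size of a folder in whole GB and warns
# when it gets close to a limit (400 GB by default, taken from the note above).
check_folder_size() {
    dir=$1
    limit_gb=${2:-400}
    # du -sk prints the total size in kilobytes, a tab, and then the path.
    used_kb=$(du -sk "$dir" | cut -f1)
    used_gb=$((used_kb / 1024 / 1024))
    echo "$dir uses ${used_gb} GB of ${limit_gb} GB"
    # Warn (and return a non-zero status) at 90% of the limit or more.
    if [ "$used_gb" -ge $((limit_gb * 9 / 10)) ]; then
        echo "WARNING: close to the limit, move data elsewhere"
        return 1
    fi
}

# Example (replace the path with your own folder):
# check_folder_size /path/to/my_folder
```

On a large folder du can take a while, because it has to visit every file.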

10.1.4 Output

A folder called out/ will be created inside the folder of the pipeline. This folder will contain all the results and logging files of your analysis. There will be one folder per tool (fastqc, trimmomatic, multiqc, bwa_alignment, lofreq, etc.). Please refer to the manual of each tool to interpret the results. There are two important subfolders generated by this pipeline:


Note: If you want your output to be stored in a folder with a different name or location, you can use the option -o (‘o’ for output):

bash run_myco_lofreq_pipeline.sh -i /mnt/scratch_dir/<my_folder>/<my_data>/ -o /mnt/scratch_dir/<my_folder>/<my_results>/
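If you prefer to keep the results of different runs apart, the value given to -o can include the date of the run. This is only a sketch: dated_output_dir is a hypothetical helper and the naming scheme is an assumption, not something the pipeline requires.

```bash
# Hypothetical helper: appends today's date (YYYY-MM-DD) to an output path.
dated_output_dir() {
    base=${1%/}   # drop a trailing slash, if any
    echo "${base}_$(date +%Y-%m-%d)"
}

# Example (replace the paths with your own):
# bash run_myco_lofreq_pipeline.sh -i /path/to/my_data/ -o "$(dated_output_dir /path/to/my_results/)"
```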

Other very important outputs of the pipeline are the logging files and the audit trail, which contain information about the software versions used, the parameters used, the error messages, etc. They could be important if you want to publish or reproduce the analysis at a later time, and also to get help from the bioinformatics team if you run into trouble with the pipeline. Please read about these files here.

10.1.5 Troubleshooting

Please read the General Troubleshooting section first!

10.1.5.1 Other problems or failing rules

The Myco-lofreq pipeline is still in a development state, which means that the process can sometimes fail. The pipeline is no longer maintained, so at the moment we cannot provide support for it.