Preamble


what this is, and who it is for:

This document covers how to do the following:

  1. connect through ssh and set up a bash-based bioinformatics project
  2. collect and check a project's Illumina short-read dataset (QC - FastQC+MultiQC)
  3. improve the overall quality of the sequencing dataset (QC - Trimmomatic)
  4. remove contaminant and/or host sequences from the dataset (QC - kneaddata)
  5. generate a simple but effective community profile of the microbial community in that dataset (Microbial ecology - Kraken2+Bracken)

We provide minimal guidance on how to adapt the different steps of the workflow (different parameters, options, or tools).

This document was originally put together by Ines Calvete (IPLA/CSIC), Nuria Peña Vidal (UCM), and Jamie FitzGerald (Teagasc/UCD) as a roadmap for analysing microbiomes from cheese and the chicken GI tract, respectively. It uses material sourced from well-written reference manuals, online communities, and our own mistakes (see References section). We assume a beginner's understanding of the HPC, ssh, bash, and scripting.

Please note this page is still being written. If something is missing, unclear, or flaming wrong, please point it out!


what this document is not:

This is not a guide to the Why of the methods used here. For that, check the References section at the bottom of the page. In particular, importing data and analysing communities in R will need their own documents.


when something goes wrong:

Mistakes, and trial-and-error, are part of learning any skill. Bioinformatics involves a particularly large and constant amount of trialing, as well as lots of errors (see: how to read an error).

One of the best ways to learn quickly (and hopefully make different, exciting mistakes rather than the same ones) is to keep notes of what you do, just like in a lab:

  • write out your objectives
  • date your notes
  • write out what you try to do
  • note down the outcomes
  • write down what goes wrong
  • include the images if you can

if you need help:

  • in bash: for some program, try entering program --help, program -h, or man program.
  • in R: for a function(), enter ?function to see that function's help page, or try pressing tab after a comma to see other arguments that function() understands.
  • in R: for some object (a dataframe, matrix, list, etc.), you can try using the str(object) function to see its structure, head(object) to see the top 6 lines or rows, class(object) to see what sort of thing it is, and dim(object) to see its dimensions (rows x columns; this will be NULL for a list or vector, as these have no dimension attribute)
  • try searching for your error message online (web search, stackoverflow, biostars, bioconductor, github, huttenhower-biobakery, etc.) - the issue might be simple! If that doesnā€™t work, or youā€™re not sure:
  • Ask someone. This is not easy work, and we all know it. People are glad to help, and to show others what they have found out.

Finally, please remember (when working on the HPC server):

  • when you login to the HPC, you must always change to a node (e.g. ssh compute05) before you do any work

0 - setup your environment

It can be helpful to define any frequently used paths / values / variables at the start of your workflow, based on what you want to do. This keeps your workspace visible, easy to refer to, and allows you to reliably change the value throughout your work (e.g. a new project, a new database, etc.).

organising your workspace

A good way to keep track of all your folders/files/values is to use $environmental_variables - these are temporary shortcuts added to bash's working environment that tell the computer to replace the $variable with the file, folder, or value we have linked it to. For example, entering PLAY=/home/user/Music/playlists would create a new variable, $PLAY: when $PLAY is entered, the computer will spot the $ character, understand that it's a shortcut, and replace that variable ($PLAY) with the value assigned to it (/home/user/Music/playlists) wherever that variable is found.
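For example, as a minimal sketch you can try in your own terminal (the path here is just a placeholder):

# create the variable (note: no spaces around the "=")
PLAY=/home/user/Music/playlists

# use it anywhere a path would normally go
echo $PLAY        # prints /home/user/Music/playlists
ls $PLAY          # the same as typing: ls /home/user/Music/playlists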

We will assign variables to the folders you will use regularly, and for a sample to use for testing things on. Feel free to change these to locations and/or names that make sense to you. Here $CAPITALLETTERS help keep these short words visible, but it's not necessary if you don't like shouting. Variables can use letters, numbers, and underscores (more info on variables).

We actually use these variables all the time! Those created by the user ("user" variables) only last until you close the terminal window, or until a program or for-loop finishes (see $i in the for-loop section below!) - after this, they will be forgotten. Note that this means you have to enter these variables at each new session (there are clever ways of doing this too - see below). A lot of other variables are created by the system ("system" variables) every time you open a new terminal window:

# $PATH tells the computer all the folders that have programs in them (multiple folder paths)
echo $PATH

# $HOME tells the computer where the user's home directory is (folder paths)
echo $HOME

# $TERM tells the computer what the default command-line interface is  (a value called by other programs)
echo $TERM

# $HISTSIZE tells the computer how many commands to remember (a value)
echo $HISTSIZE

# ...and so on - run the env command to see them all
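One of those clever ways of avoiding re-typing your user variables at each new session (a small sketch, assuming you are happy to keep a file of your own in $HOME): write the project variables to a file once, then source that file whenever you log in.

# write the variables to a file once (a hypothetical ~/r0936_vars.sh)
echo 'WRK=/data/Food/analysis/R0936_redcheese
RAW=/data/Food/primary/R0936_redcheese
MAT=$WRK/Materials' > ~/r0936_vars.sh

# then, at the start of each session, load them all in one go
source ~/r0936_vars.sh
echo $MAT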

where does my data go?

Thank you for asking! Remember, on the HPC:

  • raw data goes into the primary data directory (under your project name on the HPC) - here we give it the shortcut variable $RAW. This is your 'primary' data source, so do not edit it
  • output data goes in the analysis folder. We set this shortcut variable as $WRK, and all our output goes inside this dir.


# databases for different tools
DB=/data/databases                                            # HPC only
HGR38_BT=/data/databases/hostremoval/Homo_sapiens/Bowtie2     # HPC only
K2REFDIR=/data/databases/kraken2                              # HPC only
BT_DB=/data/Food/analysis/R0936_redcheese/ref

# our overall data structure
RAW=/data/Food/primary/R0936_redcheese
WRK=/data/Food/analysis/R0936_redcheese

# scripts, and slurms for queueing up jobs
MAT=$WRK/Materials

# outputs etc
QC=$WRK/1__qc
FQC=$QC/fastqc
FILT=$WRK/2__filt
KDOUT=$WRK/3__knead
KROUT=$WRK/4__krak2

#!# set a sample-name to test things on
TEST=Q4T8

what tools do we need?

We will want to have FileZilla (or similar) installed, as well as our ssh client (putty).

Next, we tell the HPC which programs we'll need for this work, which the HPC has parcelled into "modules". If we don't load the modules, our session won't be able to "see" these programs in the $PATH (the list of places the computer looks to find programs). Some programs are always visible (e.g. commands like ls), others are loaded through e.g. conda (e.g. R...), and don't need to be mentioned here.

  • FastQC and MultiQC for quality control
  • Trimmomatic for read trimming and filtering
  • Bowtie2 and samtools for contaminant/host removal
  • Kraken2 and Bracken for metagenomic community reconstruction
  • Kaiju, also for metagenomic community reconstruction (installed later, via conda)
  • GNU-parallel for cases where we can run many jobs in parallel.
# this is a list of all programs explicitly required in this workflow
module load parallel fastqc multiqc trimmomatic bowtie2 samtools/1.10 kraken2/2.1.1 braken

are we ready?

Finally, we make sure that all the locations for the work we're doing exist! When we start, they won't exist yet (but note how the command looks up the full location for the directory): if $QC has not been defined, it will show you the contents of your current directory; if $QC is defined, but hasn't been created yet, it will tell you that the folder doesn't exist (which should be true!).

ls $QC
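If you are not sure whether the variable itself has been set at all, you can also just print it:

echo $QC     # prints the assigned path, or an empty line if $QC has not been defined yet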

In our unified data ($RAW) and work folders ($WRK), we create different folders for each output, using the variables above. It's not a problem if the folder already exists (the -p flag stops mkdir complaining about that).

# make analysis dirs
mkdir -p $BT_DB $RAW $WRK $FILT $QC $FQC $KDOUT $KROUT

# make script dirs etc.
mkdir -p $MAT

# make dirs for qc outputs
mkdir -p $FQC/fhi_redch_raw $FQC/fhi_redch_filt $FQC/fhi_redch_raw_multi $FQC/fhi_redch_filt_multi

1 - get the sequence data (& check)

We copied the data from IPLA-CSIC (or elsewhere!) manually into the $RAW dir, using FileZilla. Sometimes we will have to use scp, SFTP, or rsync instead - a web search should show how to do this, and these programs should all be available on the HPC.

For checking the sequence quality, we'll need the programs FastQC (quality profiling) and MultiQC (unifying FastQC reports):

# make sure you're on a compute node!
module load fastqc multiqc

Next, for our own information, we look at just the fastq.gz files, count the number of files, and run FastQC and MultiQC on them.

# use ls arguments "-lsh" to output it as an informative list
ls -lsh $RAW

# How many files? - use the "|" character to pipe the ls output to wc (wordcount), and count the number of lines (-l or --lines)
ls -lsh $RAW/*fastq.gz | wc -l

# can also easily check a fastq file
ls -lsh $RAW/$TEST*            # file details
zcat $RAW/$TEST* | head -10    # decompress, read and print top 10 lines
less -S $RAW/$TEST*            # read through file, press q to quit

FastQC and MultiQC

FastQC is an amazingly useful tool for profiling the quality, consistency, and content of your sequence data. It takes samples of each file it is shown (compressed or uncompressed), checks the sequences, and returns an HTML & zip file for each FASTQ: the HTML is a report, and the zip is all the data from that report. Next, MultiQC combines all the FastQC outputs into one report, summarising all that info in one place. Always check this!

FastQC will process all fastq(.gz) files that it is passed. MultiQC will process all the FastQC outputs it finds in a directory, combine them, and place the output in the dir specified (multiqc_report.html in -o). Keep in mind that FastQC and MultiQC outputs also contain text files of all this data (see the multiqc_data dir), if you want to use them for other graphics/reports etc.

Check sequence quality, with FastQC, on one sample:

# then run a fastqc for F and R files, output in the dirs we made above
fastqc -t 4 $RAW/${TEST}* -o $FQC/fhi_redch_raw

# make a report that includes both F and R reads for this sample
multiqc $FQC/fhi_redch_raw -o $FQC/fhi_redch_raw_multi

Copy the output multiqc files to your local computer, and double-click to open in your browser. For help reading FastQC/MultiQC, there's an introductory tutorial video at the top of the MultiQC report.

Check sequence quality, with FastQC, on multiple samples:

We need to be responsible with our use of the HPC. Here we write a script in slurm format (see the HPC website for more info on slurm!) to queue all these jobs politely into one FastQC task, and then summarise using MultiQC again.

We also found out that some fastq files are compressed as .gz, and some are compressed as .bz2. This means that we need to match both types of fastq if we want it to work. The best way is to change our command to match both! We show the computer the two different compression types using curly brackets: {gz,bz2}. This allows the computer to accept *fastq.gz and *fastq.bz2.
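A quick way to see what the curly brackets do, and to count how many files of each compression type we actually have (a small optional check):

# brace expansion happens before the command runs
echo sample.fastq.{gz,bz2}              # prints: sample.fastq.gz sample.fastq.bz2

# count each compression type in the raw data
ls $RAW/*fastq.gz 2>/dev/null | wc -l
ls $RAW/*fastq.bz2 2>/dev/null | wc -l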

# write a slurm script first
echo '#!/bin/bash

#SBATCH --job-name=knead_fastq
#SBATCH --output=knead_fastq.txt
#SBATCH --ntasks=15
#SBATCH --time=15:00

IN=$1
OUT=$2

# time just tells us how long this takes, so we know for next time
# -t is the number of threads (tasks) to use
# curly brackets {} allow us to match either gz or bz2 file extensions
time fastqc -t 15 $IN/*fastq.{bz2,gz} -o $OUT
' > $MAT/slurm_fqc.sh

# review:
cat $MAT/slurm_fqc.sh

This slurm script will give a name, output-log, runtime, and number of threads needed to the slurm manager, and use the input and output dirs we give to find fastq.gz/.bz2 files, start FastQC, and place outputs in the $OUT dir. We need to 'give' this slurm script to sbatch:

NB: we have noticed that the {} brackets do not always solve the problem of missing the .bz2 samples. We simply resubmitted those samples directly, as fastqc -t 4 $RAW/*fastq.bz2 -o $FQC/fhi_redch_raw - note that this is a workaround, not a fix for the underlying problem!

# trust slurm
sbatch $MAT/slurm_fqc.sh $RAW $FQC/fhi_redch_raw

# combine outputs
time multiqc $FQC/fhi_redch_raw -o $FQC/fhi_redch_raw_multi

# then copy to local computer (try FileZilla), and open in your browser! 

Again, open the HTML report in your browser to see what the overall quality is like, and how much cleaning you will need to do.

choosing your sample trimming parameters

With information on all the samples, it is possible (but not necessarily easy) to pick appropriate trimming and quality parameters.

  • Trimming of the 5' end removes "artificial" bases at the start of F+R reads : check the FastQC or MultiQC outputs to see how long it takes for Per-Base Sequence Content to become homogeneous/balanced
  • Trimming of the 3' end removes bases where the sequencing quality declines gradually (especially the R reads)
  • removing adapters removes synthetic sequences that were used to construct the sequencing library, but are not biological

Consider: How do the sequences look? Is quality uniform throughout, or does it vary? What other warnings (in red) do FQC+MQC detect?

2 - quality control (& check)

first pass with Trimmomatic

Hopefully, there are not too many problems with the data! Nevertheless, we always do a little cleaning of the sequences before we analyse them. There are several things that can go wrong, and we can make the overall dataset better by removing reads or trimming off:

  • bases with low quality scores (Q score, for illumina: < ~20), meaning that you cannot be sure a base is correct / accurate
  • even if the Q score is good, the start (5') and end (3') of a read are often best removed
  • "adapter", "artefact", or "technical" sequences (the bits of DNA that attach the DNA sample to the machine) can get included by accident
  • "read-through" sequences, where the sequencing gets all the way to the opposite end, and starts reading adapter sequences, or making sequences up
  • other strange problems with operation

We use the program Trimmomatic to do all of this. This java-based program can do many things to check your quality, but we will focus on:

  • removing adapter sequences
  • trimming the start and ends of reads, based on the MultiQC profile we made in section 1.

For quality control, we'll need the programs Trimmomatic (trimming and filtering), FastQC (quality profiling), and MultiQC (unifying FastQC reports):

# make sure you're on a compute node!
module load fastqc multiqc trimmomatic

First, we will use echo '...' > file.fa to make a fasta file of the known adapter sequences, using references available on the Teagasc HPC, as well as the sequences provided by Trimmomatic.

echo '>adapter_seq
CTGTCTCTTATACACATCT
>adapter_mate_seq
AGATGTGTATAAGAGACAG
>Transposase_Adap__for_tagmentation_1
TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
>Transposase_Adap__for_tagmentation_2
GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG
>PCR_primer_index_1
CAAGCAGAAGACGGCATACGAGATNNNNNNNGTCTCGTGGGCTCGG
>PCR_primer_index_2
AATGATACGGCGACCACCGAGATCTACACNNNNNTCGTCGGCAGCGTC
>PCR_primer_index_2_rc
GACGCTGCCGACGANNNNNGTGTAGATCTCGGTGGTCGCCGTATCATT
>PCR_primer_index_1_rc
CCGAGCCCACGAGACNNNNNNNATCTCGTATGCCGTCTTCTGCTTG
>Transposase_Adap__for_tagmentation_2_rc
CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
>Transposase_Adap__for_tagmentation_1_rc
CTGTCTCTTATACACATCTGACGCTGCCGACGA
>adapter_mate_seq_rc
CTGTCTCTTATACACATCT
>adapter_seq_rc
AGATGTGTATAAGAGACAG' > $MAT/fqc_trimmo_ill_ref.fa

# have a look!
less $MAT/fqc_trimmo_ill_ref.fa   # press q to exit

first pass, with Trimmomatic, on one sample:

To clean the $TEST sample, we use these variables:

  • $RAW : this was the input dir, assigned as /data/Food/primary/../././
  • $FILT : this was the output dir, assigned as /data/Food/analysis/../././
  • $TEST : this was the 'test' sample, assigned as Q4T8 above
  • $MAT : this was where we stored scripts, logs, and the fasta above, assigned as ./././Materials
    • note the paths above are not completed without reference to the Teagasc HPC...

These variables help us to tell Trimmomatic the following:

option effect
PE paired-end mode, so examine F and R reads together
$RAW/${TEST}_R1_001.fastq.gz input F reads
$RAW/${TEST}_R2_001.fastq.gz input R reads
$FILT/${TEST}_R1_trimm.fastq.gz output for trimmed F reads
$FILT/${TEST}_R1_trimm_unpaired.fastq.gz output for trimmed F reads that have lost their R mate (see below)
$FILT/${TEST}_R2_trimm.fastq.gz output for trimmed R reads
$FILT/${TEST}_R2_trimm_unpaired.fastq.gz output for trimmed R reads that have lost their F mate (see below)
HEADCROP:20 trim the first 20bp from the 5' end of each read
CROP:130 trim everything after 130bp, i.e. keep only the first 130bp of each read
ILLUMINACLIP:$MAT/fqc_trimmo_ill_ref.fa:2:30:10:5 clip adapters using the reference fasta above (the numbers are: seed mismatches, palindrome clip threshold, simple clip threshold, minimum adapter length)
SLIDINGWINDOW:6:15 scan each read with a 6-base window, and cut once the average quality in the window drops below 15
MINLEN:80 after cropping & trimming, sequences must be at least 80bp long to be kept (otherwise delete them)
-threads 6 use 6 threads (tasks/jobs) in parallel
> $FILT/trimmo_${TEST}.out 2>&1 send Trimmomatic's progress and error messages to a logfile so you can check for problems afterwards

Trimmomatic checks the quality of our sequences, trimming off "bad" sequence bases, and discarding "bad" reads. Most reads will still be paired (have a matching F and R read). In some cases, the F read is discarded (thrown away) but we keep the matching R read because it is good quality (or keep F and discard R). As a result of this, Trimmomatic will sort each sample into 4 files:

  1. paired F reads that have a matching R read: $FILT/${TEST}_R1_trimm.fastq.gz
  2. paired R reads that have a matching F read: $FILT/${TEST}_R2_trimm.fastq.gz
  3. unpaired F reads that don't have a matching R read: $FILT/${TEST}_R1_trimm_unpaired.fastq.gz
  4. unpaired R reads that don't have a matching F read: $FILT/${TEST}_R2_trimm_unpaired.fastq.gz

For now, we will not use the unpaired reads (we ignore them!), as it can lead to complications elsewhere (because of the mix of paired and unpaired reads, which is a problem for some programs). Note also that there are many other options available, to tackle different problems - see the Trimmomatic manual for more.

When it is all put together, it looks like this (the \ character allows us to start a newline without interrupting the command):

# note: options like -threads go before the input files; trimming steps go at the end
trimmomatic PE -threads 6 \
  $RAW/${TEST}_R1_001.fastq.gz $RAW/${TEST}_R2_001.fastq.gz \
  $FILT/${TEST}_R1_trimm.fastq.gz $FILT/${TEST}_R1_trimm_unpaired.fastq.gz \
  $FILT/${TEST}_R2_trimm.fastq.gz $FILT/${TEST}_R2_trimm_unpaired.fastq.gz \
  HEADCROP:20 \
  CROP:130 \
  ILLUMINACLIP:$MAT/fqc_trimmo_ill_ref.fa:2:30:10:5 \
  SLIDINGWINDOW:6:15 MINLEN:80 > $FILT/trimmo_${TEST}.out 2>&1

# when finished, look at the output:
less $FILT/trimmo_${TEST}.out
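To get a feel for how many reads ended up in each of the four output files, we can count them (an optional check; a fastq record is 4 lines per read):

for f in $FILT/${TEST}_R*_trimm*.fastq.gz;
  do echo "$f : $(zcat $f | wc -l | awk '{print $1/4}') reads";
done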

first pass, with Trimmomatic, on multiple samples:

Because these reads are paired-end, we need to set up a slurm script to process each sample name, find the F and R fastq.gz files, process them together, and output them to the correct folders. There are a few ways to get the sample names, but because we will need to process each sample name several times, we are going to use a simple, re-usable method:

  • copy the sample names (not the full file names) to a text file, with one sample name per line
  • to process each sample, we then go through the text file, using each sample name in turn.

If you are very careful and have a lot of time, you can type all the sample names into a text file by hand. You can also do it quickly using a chain of bash commands joined together by the '|' or pipe character:

command output
ls $RAW/*fastq.gz list all filenames in $RAW ending in fastq.gz
sed -e ... simplifies the filename using regex (-e flag; sed & regex are complex but ++useful - see intro here)
s/a/b/g sed: substitute a for b, everywhere it's found (i.e. globally)
.*\/\(.*\)_R._001.* sed - our a to be replaced: find text matching /(...)_R._001, where '.' can be anything, and keep the bit in brackets
\1 sed - our b to replace a: just paste in the bit found in brackets
sort sorts the names alphabetically - this is necessary for uniq
uniq keeps each unique name only once (the F and R files give the same sample name)
> $MAT/samples send the sample names to a file in $MAT called samples
# combine all those different parts!
ls $RAW/*fastq.gz | sed -e 's/.*\/\(.*\)_R._001.*/\1/g' | sort | uniq > $MAT/samples

We can then use cat $MAT/samples or less $MAT/samples to print out all the sample names. We can also feed all these sample names into a for-loop, which takes each row (sample name) and places it inside the for-loop wherever it finds (e.g.) $i ($i is what everyone uses, but it's just a variable like $MAT etc - it could be anything you choose).

First, we make a slurm script for running Trimmomatic:

echo '#!/bin/bash

#SBATCH --job-name=trimmoRaw
#SBATCH --output=trimmoRaw.txt
#SBATCH --ntasks=6
#SBATCH --time=18:00

# these take the terms given after the scriptname, i.e. "... $i $RAW $FILT $MAT"
SAMPLE=$1
IN=$2
OUT=$3
REFDIR=$4

# trimmomatic - use backslash to separate rows
# note: options like -threads go before the input files; trimming steps go at the end
trimmomatic PE -threads 6 \
$IN/${SAMPLE}_R1_001.fastq.gz \
$IN/${SAMPLE}_R2_001.fastq.gz \
$OUT/${SAMPLE}_R1_trimm.fastq.gz \
$OUT/${SAMPLE}_R1_trimm_unpaired.fastq.gz \
$OUT/${SAMPLE}_R2_trimm.fastq.gz \
$OUT/${SAMPLE}_R2_trimm_unpaired.fastq.gz \
HEADCROP:25 \
CROP:125 \
ILLUMINACLIP:$REFDIR/fqc_trimmo_ill_ref.fa:2:30:10:5 \
SLIDINGWINDOW:6:15 \
MINLEN:80 > $OUT/trimmo_${SAMPLE}.out 2>&1' > $MAT/slurm_trimmo.sh

NB: we increased the time from 6min to 18min to avoid the job being terminated; we also had to replace a bad single dash with -- in the #SBATCH options

Then we make a simple for-loop (see this link) that repeats the command sbatch $MAT/slurm_trimmo.sh $i $RAW $FILT $MAT for every possible value of $i (all the sample names). In this way, samples are submitted one after another, and each job uses the 6 threads we specified above (i.e. samples are processed in series rather than all at once).

# this command lists all the sample names
cat $MAT/samples

# we can put that command $(in brackets), and feed it into the for-loop like this:
for i in $(cat $MAT/samples);

  # and it will repeat this command for all the different values of $i, below:
  do sbatch $MAT/slurm_trimmo.sh $i $RAW $FILT $MAT;

# close the for-loop
done
 

Alternatively, there might be a sample we need to re-run. We do not need to use for-loops etc. to pass jobs to slurm: we can simply tell it what job to do, on what sample name:

# replace the $i in the for loop with the sample name (e.g. P2620), and delete the starting "do" and the ending ";"
sbatch $MAT/slurm_trimmo.sh P2620 $RAW $FILT $MAT
sbatch $MAT/slurm_trimmo.sh P2233 $RAW $FILT $MAT
sbatch $MAT/slurm_trimmo.sh P2312 $RAW $FILT $MAT
sbatch $MAT/slurm_trimmo.sh etc... $RAW $FILT $MAT

fastqc and multiqc

Again, we check the Trimmomatic output with FastQC/MultiQC, copying the multiqc_report to our computer, and seeing if there's much improvement.

fastqc -t 4 $FILT/*fastq.gz -o $FQC/fhi_redch_filt

# move unpaired reports away - avoid clogging up multiqc vis
mkdir $FQC/fhi_redch_filt_unpaired
mv $FQC/fhi_redch_filt/*unpaired* $FQC/fhi_redch_filt_unpaired

# look at just the trimmed, paired data (no unpaired reads)
multiqc $FQC/fhi_redch_filt/ -o $FQC/fhi_redch_filt_multi

second pass etc.

It is hard to know when the quality is 'good enough'. In general, we should keep above Q30, but this is not an indicator of other things like adapter contamination. Often, a later step will not work, and it can be necessary to be stricter in our cleaning, or try different approaches etc. Be clear with yourself, and with other people, about what pre-processing steps you are taking.

Sometimes, it might be necessary to use Trimmomatic more than once before the sequence is 'clean'. In general it is easier to program, easier to organise, and easier to reproduce a metagenomics analysis when all sequence cleaning steps are carried out in one go. Again, keep notes of what you have done - this makes remembering what happened much easier.

3 - decontaminate data (& check)

Filtering and trimming in the previous steps should have removed nearly all (🤞) of the fake sequences: low-quality bases that can not be trusted (trimming), as well as artificial sequences introduced by library-preparation and sequencing (filtering).

Next, it is a very good idea to consider what real sequences have been included in our dataset, but do not belong in our analysis. Contaminants could come from:

  • the experimental environment: host DNA, DNA from surfaces or substrates (foods?) in the environment
  • the method used: i.e. sampling methodology, lab equipment
  • the researcher: sneezing, dribbling, rhinotillexis, colds, coughs, sneezes, or deep PhD-related sighs

In some environments (e.g. faeces, sludge, soil), contaminating sequences might only comprise a small proportion of the total sequences, and are unlikely to change the "big picture" of the dataset, but could interfere with analysis of specific microbes, or metabolic functions in ways we cannot predict. In other environments (e.g. microbiomes from biopsy, the aeolian microbiome), contaminants can make up the majority of a sample, altering the "big picture" significantly, and making it difficult to draw meaningful conclusions - and you won't know why.

To address this as best we can, we remove contaminants based on a database of reference sequences appropriate to our study. We collect these sequences (search for the host, known contaminants, artificial sequences, foodstuffs, etc.), compile them into a database (i.e. a big file that's formatted so that it's easy for computers to search through it quickly), and then ask a program (e.g. the DNA-DNA alignment program, bowtie2) to look for, and remove, any sequences that match those contaminants.

For decontamination, we'll need the programs Bowtie2 (database-building & aligning) and samtools (filtering files in sam and bam formats), as well as FastQC & MultiQC for profiling the output sequences.

# make sure you're on a compute node!
# note the samtools version.. for the teagasc HPC at any rate.
module load bowtie2 samtools/1.10 parallel fastqc multiqc

Decontaminate with BowTie2

We need reference genomes relevant to our experiment. For this tutorial, we consider several possible sources of host contamination: human (the researcher, plus a human 'decoy' sequence), cow, sheep, and goat (the cheese hosts), and chicken (the GI-tract host).

We will first obtain all of these sequences from the NCBI (we should/could use the ENA...)

echo '#!/bin/bash

cd $1

# download your own genome (more or less) - 1.2 GB, 12min :: from the NCBI browser https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000001405.40/
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v1/genome/accession/GCF_000001405.40/download?filename=GCF_000001405.40.zip" -H "Accept: application/zip"

# decoy genome - 9 MB
wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence/hs37d5ss.fa.gz

# cow genome - 825 MB, 5min
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v1/genome/accession/GCA_021234555.1/download?filename=GCA_021234555.1.zip" -H "Accept: application/zip"

# sheep genome - 890 MB, 4min
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v1/genome/accession/GCF_016772045.1/download?filename=GCF_016772045.1.zip" -H "Accept: application/zip"

# goat genome - 930 MB, 3min
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v1/genome/accession/GCF_001704415.2/download?filename=GCF_001704415.2.zip" -H "Accept: application/zip"

# chicken genome - 440 MB, 1min
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v1/genome/accession/GCF_016699485.2/download?filename=GCF_016699485.2.zip" -H "Accept: application/zip"' > $MAT/trimmo_genomes_dl.sh

# do check that curl is installed!
# if not, try activating it: "module load curl"
curl -h

# give "run" permission for script to be run, for the user only
chmod u+x $MAT/trimmo_genomes_dl.sh
# make a dir for genomes in our RAW folder
mkdir $RAW/ref
# run the script while pointing at those files
$MAT/trimmo_genomes_dl.sh $RAW/ref 

Our genomes should be inside the downloaded .zip files, under the directory /ncbi_dataset/data/<genbank-accession>/, and will be the largest file.fna in there. Extract them to your $RAW/ref folder, and then compress the .fna to .fna.gz with gzip to save space. Genome data is fairly dense, so this will/might only compress to about 1/3 the raw size.
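A sketch of that extraction step (assuming the default layout of the NCBI datasets zip files - check the contents of your own zips with unzip -l before deleting anything):

cd $RAW/ref

# peek inside one zip first
unzip -l GCF_000001405.40.zip | head

# pull the genome fasta(s) out of every zip, move them up to $RAW/ref, and tidy up
for z in *.zip;
  do unzip -o $z 'ncbi_dataset/data/*/*.fna';
done
mv ncbi_dataset/data/*/*.fna $RAW/ref/
rm -r ncbi_dataset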

NB: check here that you have the genomes relevant to you, and make sure that the filenames above match the filenames used in the following step (e.g., does a file end in .fna or .fa? ...)

parallel gzip {} ::: $RAW/ref/*fna

BowTie2 : make database

We then need to use bowtie2 to build a reference database for all of these genomes in one place. BT_threads below is set to what was available locally - feel free to change it to something appropriate to your setup. We check the small decoy genome first.

# number of processes/threads to use for jobs
BT_threads=10

# the reference genomes (above) were downloaded here:
ls $RAW/ref

# the new bowtie2 databases will go here:
BT_DB=/data/Food/analysis/R0936_redcheese/ref
mkdir $BT_DB

# bowtie2-build [options]* <reference_in> <bt2_base>
time bowtie2-build --threads $BT_threads $RAW/ref/hs37d5ss.fa.gz $BT_DB/decoy_hs37d5ss.bt2 > $BT_DB/decoy_hs37d5ss.bt2.buildlog


## individual builds:
# decoy.9MB.fna.gz; 19 seconds for decoy, 9 threads, 15GB RAM
# chicken.340MB.fna.gz; 10min, 9 threads, 15GB RAM
# cow.853MB.fna.gz; 28m, 9 threads, 15GB RAM

That was quick! It was quick because the amount of data (9MB!) was small. Larger (real) genomes will require much longer to run, and likely require a slurm script.

Although jfg is not certain of the theory behind aligning to an index built from multiple reference genomes (probably it uses the first match, even if there are several paralogous sequences), it should be possible by listing several fasta files as input to bowtie2-build file1.fna.gz,file2.fna.gz,file3.fna.gz... etc (as per the bowtie2 manual). This avoids having to cat all the fasta files together to make one giant fasta.

We need to notify slurm of the requirements:

  • as the human genome takes about a half hour to build, and there are 5-and-a-half genomes, we notify slurm that the job should take 30min * 5.5 ~= 3hr for the genomes used here, with 10 processors and loads of memory. Alternatively, remove the --time line to avoid the job being killed if it runs over.
  • there are suggestions that adding more and more threads to bowtie2 doesn't make a huge difference, and it could mean we wait a long time for our job to be started. So we ask for only 10 threads. We don't use the $BT_threads variable this time, as slurm can't 'see' this variable when it starts.
  • Some desktop-pc tests have shown that 15GB of RAM is not enough for this job (>3GB per genome, x5.5...). The RAM parameter is not specified below as the requirement is unknown.
# build our BT2 database in slurm:
echo '#!/bin/bash

#SBATCH --job-name=bt2DB
#SBATCH --output=bt2DB.txt
#SBATCH --ntasks=10
#SBATCH --time=179:00

INOUT=$1

# feel free to add / remove genomes that are relevant to you

# do check the names before you run! 
time bowtie2-build --large-index --threads 10 \
$INOUT/GCA_021234555.1_ARS-LIC_NZ_Jersey_genomic.fna.gz,\
$INOUT/GCF_000001405.40_GRCh38.p14_genomic.fna.gz,\
$INOUT/GCF_001704415.2_ARS1.2_genomic.fna.gz,\
$INOUT/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_genomic.fna.gz,\
$INOUT/GCF_016772045.1_ARS-UI_Ramb_v2.0_genomic.fna.gz,\
$INOUT/hs37d5ss.fa.gz \
$INOUT/mult_ref #> $INOUT/mult_ref.buildlog # 2>&1

' > $MAT/slurm_bt2db.sh

# and then run, via slurm
sbatch $MAT/slurm_bt2db.sh $BT_DB

This will give us a number of files ending in .bt2 (or .bt2l for a large index) in our output dir. These are the reference files that bowtie2 uses to check for contaminant sequences in the fastq data (if you want, you can now delete the reference fna.gz genomes!).
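A quick look to confirm the index files are where we expect them (the extension may be .bt2 or .bt2l, depending on whether --large-index was used):

ls -lsh $BT_DB/mult_ref*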

BowTie2 : find contaminants

Bowtie2 will align our dataset against these reference indices, and make a note of what reads matched between the input (dataset) and reference (host). For our study, we will be removing any reads that match the reference (hosts) as we are looking at the microbiome specifically. However, in other studies, we might be interested in the host sequences only, and would throw away everything else (all the microbes).

NB: bowtie2 output interpretation

NB: the SAMtools spec

## firstly, align $TEST sequences to ref using bt2.
bowtie2 -p $BT_threads -x $BT_DB/mult_ref -1 $FILT/${TEST}_R1_trimm.fastq.gz -2 $FILT/${TEST}_R2_trimm.fastq.gz -S $KDOUT/${TEST}_bt2_refmapped.sam

## feel free to look at the output
less $KDOUT/${TEST}_bt2_refmapped.sam

We can use Bowtie2 directly for filtering out contaminant reads, but because there are paired (F+R) reads, it's not quite as specific as we'd like. For better control, we can use the samtools program below.

samtools : delete contaminants

Now that Bowtie2 has identified the contaminant sequences, we use samtools to:

  1. convert the sam file to a bam file
  2. filter out any reads assigned to the decontamination database
  3. sort all the remaining reads, so that F+R are in the same order
  4. convert the bam file to a fastq (throwing the extra sam info into the /dev/null bin)
## 1 - view with samtools, and convert to bam (binary)
samtools view -bS $KDOUT/${TEST}_bt2_refmapped.sam > $KDOUT/${TEST}_bt2_refmapped.bam

## 2 - filter the contaminant sequences out
# -f 12 : keep only read pairs where F and R are *both* unmapped (i.e. neither matched the reference)
# -F 256 : exclude secondary alignments
samtools view -b -f 12 -F 256 $KDOUT/${TEST}_bt2_refmapped.bam > $KDOUT/${TEST}_bt2_decontamd.bam

## 3 - ensure reads are in a sensible order
samtools sort -n -m 5G -@ 2 $KDOUT/${TEST}_bt2_decontamd.bam -o $KDOUT/${TEST}_bt2_decontamd_sort.bam > ${TEST}_bam_sort.out

## 4 - stream outputs of non-interest to /dev/null
samtools fastq -@ 8 $KDOUT/${TEST}_bt2_decontamd_sort.bam -1 $KDOUT/${TEST}_bt2decon_R1.fastq.gz -2 $KDOUT/${TEST}_bt2decon_R2.fastq.gz -0 /dev/null -s /dev/null -n

# look at the fastq! 
ls -lsh $KDOUT/${TEST}*

## removing intermediates: SAM and BAM files are huge! delete --WHEN FINISHED--
rm $KDOUT/${TEST}*.{sam,bam}
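As a rough sanity check (optional), we can compare how many read pairs went into and came out of decontamination for the test sample (counting R1 only, as reads are paired; 4 fastq lines = 1 read):

echo "before: $(zcat $FILT/${TEST}_R1_trimm.fastq.gz | wc -l | awk '{print $1/4}') read pairs"
echo "after : $(zcat $KDOUT/${TEST}_bt2decon_R1.fastq.gz | wc -l | awk '{print $1/4}') read pairs"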

decontaminate multiple samples

We can apply the same technique to all of our samples using slurm. The database is already present, so we write slurm scripts for the BowTie2 alignment, samtools filtering, and samtools processing.

Separating out the jobs allows us to troubleshoot issues more easily (hopefully), but complicates batch processing of samples. It is more efficient, and possibly easier to supervise, if each sample is processed in a single job, and errors produce a simple message that we can check. Here, we use if statements to check that the outputs are finished, before we shrink all the intermediate bam and sam files (which get very large).

NB-jfg : the time flag is totally spurious; lots of large intermediate files are being created - we'll need to deal with these as we go; see the deco_bt2.####.txt files for output (they land in the directory you submitted the job from) - problems may be recorded there; also check the name formats in the bt2 align step; there was an issue with streaming applications sending binary to the .out/.log files, so the logfile call was deleted in response; each slurm job needs an individual name tag.

echo '#!/bin/bash

#SBATCH --job-name=deco_bt2
#SBATCH --output=deco_bt2.%j.txt
#SBATCH --ntasks=10

SAMPLE=$1
IN=$2
OUT=$3
REF=$4
THREADS=$5


# load necessary modules
module load bowtie2 samtools/1.10


## stamp for the output
DAT_TIM=$(date +"%T %F")
echo "--------------------------------------------"
echo $DAT_TIM
echo sampl: $SAMPLE
echo input: $IN
echo outpt: $OUT
echo "--------------------------------------------"


echo "--   align   -------------------------------------"
# -x $REF/mult_ref \
time bowtie2 -p $THREADS \
-x $REF/mult_ref \
-1 $IN/${SAMPLE}_R1_trimm.fastq.gz \
-2 $IN/${SAMPLE}_R2_trimm.fastq.gz \
-S $OUT/${SAMPLE}_bt2_refmapped.sam


if [ $? = 0 ] && [ -s $OUT/${SAMPLE}_bt2_refmapped.sam ]
then
        echo "--   change format   -------------------------------------"
        time samtools view -bS $OUT/${SAMPLE}_bt2_refmapped.sam > $OUT/${SAMPLE}_bt2_refmapped.bam

        echo "--   remove matches   ------------------------------------"
        time samtools view -b -f 12 -F 256 $OUT/${SAMPLE}_bt2_refmapped.bam > $OUT/${SAMPLE}_bt2_decontamd.bam

        ## obviate - not enough memory, and not essential for kraken & co.
        # echo "--   re-organise reads   ---------------------------------"
        # time samtools sort -n -m 5G -@ $THREADS $OUT/${SAMPLE}_bt2_decontamd.bam -o $OUT/${SAMPLE}_bt2_decontamd_sort.bam

        echo "--   to fastq, dump other   ------------------------------"
        time samtools fastq -@ $THREADS $OUT/${SAMPLE}_bt2_decontamd_sort.bam -1 $OUT/${SAMPLE}_bt2decon_R1.fastq.gz -2 $OUT/${SAMPLE}_bt2decon_R2.fastq.gz -0 /dev/null -s /dev/null -n

        echo "---   $SAMPLE completed  :)   ----------------------------------"


else
        echo ""
        echo "!!!---   align failed   ::   $SAMPLE   ---!!!"
        echo ""
        echo ""

fi


## shrink intermediates
if [ -s $OUT/${SAMPLE}_bt2decon_R1.fastq.gz ] && [ -s $OUT/${SAMPLE}_bt2decon_R2.fastq.gz ] ;
then 
        echo "--   shrinking *AMs   ----------------------------------" ;
        gzip $OUT/${SAMPLE}_*.{bam,sam}
fi

' > $MAT/slurm_deco_full.sh # SAMPLE  IN  OUT REF THREADS


# send just one sample to the script with "head -1"
for i in $( cat $MAT/samples | head -1 );
    do sbatch $MAT/slurm_deco_full.sh $i $FILT $KDOUT $BT_DB 10 ;
done

Often some samples don't work, e.g. due to our database being too big (bowtie2 error code 11: SEGV). We can make a file named bad_samples that lists the samples that didn't work, and re-submit them with fewer threads. This seems to work fine, but if not, consider checking your error messages (in deco_bt2.####.txt).

## feel free to make up some samples 
echo -e "XXXX\nYYYY\nZZZ\nWWW" > $MAT/bad_samples


# delete the broken pieces of samples that didn't finish
for i in $( cat $MAT/bad_samples );
    do rm $KDOUT/${i}* ;
done


## try fix bad_samples
for i in $( cat $MAT/bad_samples );
    do sbatch $MAT/slurm_deco_full.sh $i $FILT $KDOUT $BT_DB 4 ;
done


# check outputs, check error messages

Alternatively: decontaminate with KneadData

Kneaddata is a quality control pipeline developed by the Huttenhower group (home of HUMAnN*, MetaPhlAn*, LEfSe, etc.) to do all of the above steps: trimmomatic, bowtie2, and samtools, in one go, using the human genome as a default reference. It can be very useful if we're setting things up for the first time, as it will download the required programs and a human database for us. We did not use it here because it is incredibly useful to have some idea of how to use trimmomatic and bowtie2 to refine & identify reads of interest, alongside the fact that our host sequences were not limited to humans and our programs are already installed.

If we wanted to achieve the same things with KneadData, we could provide similar arguments to those above (often, it's the exact same argument - we're just asking KneadData to pass it on to Trimmomatic/Bowtie2). Note that we would still have had to download our reference genomes and build a database from them using bowtie2-build.
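For illustration only, a rough sketch of what such a KneadData call might look like for the test sample - the flag names vary between KneadData versions (e.g. paired input is given as two --input flags in older versions, --input1/--input2 in newer ones), and the output dir here is hypothetical, so check kneaddata --help before using it:

# not run in this workflow - a hypothetical rough equivalent of our Trimmomatic + Bowtie2 + samtools steps
kneaddata \
  --input $RAW/${TEST}_R1_001.fastq.gz \
  --input $RAW/${TEST}_R2_001.fastq.gz \
  --reference-db $BT_DB/mult_ref \
  --output $WRK/2__knead_alt \
  --trimmomatic-options "HEADCROP:20 CROP:130 SLIDINGWINDOW:6:15 MINLEN:80" \
  --threads 6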

4 - Microbiome Community Profiling

Once we have our clean, host-free F+R reads, we assign identity to them as precisely as is reliable. We can try to do this using the nucleotide sequences, to be as exact as we can (Kraken2), and from the available identities of those reads, try to re-assemble the makeup of the microbial community they came from (Bracken). Alternatively, we could translate those nucleotide sequences into all 6 reading frames, and try to reconstruct the microbial community based on broader, more conserved protein-sequence similarities (Kaiju).

However, there are many, many different approaches, and programs, for estimating community composition! Feel free to try alternate methods, but be wary of the small voice which suggests trying everything... Instead, check the literature for a comparison, have a look at a really nice comparison by Walsh et al. from 2018, or see similar articles from 2019 onwards.

identify reads with Kraken2 🦑

Kraken2 was not designed to estimate community profiles! It was designed to identify reads, based on the sequence patterns (k-mer frequencies) in each read. It uses these patterns to find the "best" match in the reference database's phylogenetic tree. If Kraken2 finds multiple "best" matches for a read and cannot decide who the read belongs to, it will check the phylogenetic tree, and instead assign that read to the most recent common ancestor of all those best matches (providing the match to the recent ancestor is a good match). If you have lots of novel taxa in your dataset, or a poor reference database, or sick data, this can lead to a large amount of your reads getting 'stranded' in the branches of your tree, with no good way of telling where they come from.

Note: using Kraken2 and Bracken together is the recommended route by the developers of Kraken2 for community profiling (there is no equivalent step for Kaiju).

Kraken2 databases

If setting up Kraken2, you'll be directed to build a "default" database based on a set of NCBI sequences, using the kraken2-build command. This will generate three files: opts.k2d, hash.k2d, and taxo.k2d. Other files/folders can be deleted, but note that Bracken needs the library and taxonomy folders to build its own database.

We often use the default Kraken2 database, built from the NCBI non-redundant RefSeq database. We could alternatively use the GTDB, which is a brave, fresh attempt at rationalising how taxonomy is applied (a much bigger topic, but a very good database, though its identities don't always correspond to the NCBI ones). Both databases are present on the HPC (see ls /data/databases/{K,k}raken*)

It is possible to build a custom database, based on a different set of sequences using kraken2-build.

We can also use someone else's database, as long as we have access to opts.k2d, hash.k2d, and taxo.k2d. We will try to use the "maxikraken" database from the Loman Lab:

# same folder as the BT2 database
cd $WRK/ref 
mkdir maxikraken2_1903_140GB/ ; cd maxikraken2_1903_140GB/
wget -c https://refdb.s3.climb.ac.uk/maxikraken2_1903_140GB/hash.k2d &
wget https://refdb.s3.climb.ac.uk/maxikraken2_1903_140GB/opts.k2d &
wget https://refdb.s3.climb.ac.uk/maxikraken2_1903_140GB/taxo.k2d &

# for reference
MAX=$WRK/ref/maxikraken2_1903_140GB
# note there is no taxonomy / library dir - can't use this for Bracken

Kraken2 parameters

Parameters have not been explored in this document, but it is worth doing so here. There are a number of handy settings we can use to tune our assignments. Often the defaults are semi-sensible, but we should always be aware of what they assume about our data, as it's easy to forget about later.

In particular, these values (as well as the composition of the database!) affect how and where your assignments are made:

  • --confidence (default 0) : allows us to require more confidence in the assignments. Ranges from 0 to 1, with higher values meaning we need to be more certain before a classification is accepted. This is a slightly newer feature.
  • --minimum-hit-groups (default: 2) : how many 'hit groups' need to agree before an identity is decided
  • --minimum-base-quality (default: 0) : how good a base needs to be in order to be used (for fastq input only).

There are also some format options, the first of which is important for the Bracken step:

  • --use-mpa-style : output data like mpa (metaphlan, another shotgun profiler) does (not compatible with Bracken, so we avoid it here)
  • --report-zero-counts : include this to report all entries in the database, even if nothing was found.
  • --report-minimizer-data : additional info for where and how hits landed on the minimizers which correspond to the different taxa.
  • --unclassified-out : note Kraken2 needs a # character in the filename for splitting F and R unclassified reads (in case we're curious)
  • additional options: enter kraken2 into the terminal to see the full list.

For our purposes, we are most interested in the --report that Kraken2 creates - this is what we'll use to create our community profile with Bracken. However, the other parts (the output, the unclassified reads) are still very informative - have a look using less. Later, we'll simply send these to /dev/null.

First we try our $TEST sample, but feel free to change hit groups, base quality, and confidence, to see how that changes assignments (it's very fast!). Note that we include --report-zero-counts here so we can use the entire taxonomic database to build a taxonomic table later - but feel free to remove this if it is getting in the way of reading the table. We also avoid using --use-mpa-style as it isn't compatible with Bracken.

# module load kraken2/2.1.1   # in case not already loaded!

# set how many jobs we plan on using
KR_threads=5

# test
time kraken2 --db $K2REFDIR \
$KDOUT/${TEST}_bt2decon_R1.fastq.gz \
$KDOUT/${TEST}_bt2decon_R2.fastq.gz \
--paired \
--confidence 0.15 \
--minimum-hit-groups 3 \
--minimum-base-quality 10 \
--report-zero-counts \
--gzip-compressed \
--threads $KR_threads \
--report $KROUT/${TEST}_test_kraken_report \
--unclassified-out $KROUT/${TEST}_test_kraken_unclass# \
--output $KROUT/${TEST}_test_kraken_output

# changing --cnf from 0.15 to 0.5: 3.6% drop in assignment (7.3-10.7%)
# changing --mhg from 2 to 5: no difference in assignment (10.7%)
# changing --qual: not checked

If that works, scale it for slurm! For Bowtie2, we sent all samples to sbatch at the same time, to all be processed at the same time (i.e. in parallel). Kraken2 is very fast, but can also have a very large database (depending on what type/version you are using), which can cause memory problems. Instead of doing it in parallel, we write a script that will process each sample, one after the other (in serial). This is almost as fast with Kraken2, but avoids huge memory requirements.

We also use a small if statement, to avoid re-running samples unnecessarily, structured as if [ $SAMPLE.output exists ]; then skip re-running ; else run Kraken2 on $SAMPLE ; fi. More info on how this works can be found here

echo '#!/bin/bash

#SBATCH --job-name=krak2_loop
#SBATCH --output=krak2_loop.%j.txt
#SBATCH --ntasks=10
#SBATCH --time=180:00

LIST=$1
IN=$2
OUT=$3
DB=$4
THREADS=$5

module load kraken2/2.1.1


# explicit about our parameters
CONF=0.15
MINH=3
MINQ=10


#one big loop / aw yeh aw yehh
for i in $( cat $LIST );
do
  if [ -s $OUT/${i}*_kraken_report ]
  then
    echo "## krak2 output for $i already in $OUT - stopping"
  else
    echo "## firing krak2 on $i ; conf $CONF: min-hits $MINH; min qual: $MINQ "

    time kraken2 --db $DB \
    $IN/${i}_bt2decon_R1.fastq.gz \
    $IN/${i}_bt2decon_R2.fastq.gz \
    --paired \
    --confidence $CONF \
    --minimum-hit-groups $MINH \
    --minimum-base-quality $MINQ \
    --threads $THREADS \
    --gzip-compressed \
    --report $OUT/${i}_kraken_report \
    --output /dev/null
    # note: we only keep the --report; the per-read output and unclassified reads are not written anywhere

  fi

done' > $MAT/slurm_krak2_loop.sh

Because the for-loop is contained in the slurm script above, instead of a for-loop we pass sbatch a list of samples (stored in $MAT/samples), the dir for the filtered input data ($KDOUT), the output folder for Kraken2 ($KROUT), the location of the Kraken2 DB we use ($K2REFDIR), and the number of processors ($KR_threads).

# pass all to slurm
sbatch $MAT/slurm_krak2_loop.sh $MAT/samples $KDOUT $KROUT $K2REFDIR $KR_threads

# progress is in the slurm txt file
cat krak2_loop.*.txt

# have a look at the outputs once finished
less -S $KROUT/${TEST}_kraken_report

estimate abundances with Bracken 🌿

Kraken2 tried to assign an identity to each read, as best it could. If we knew that we had 1M reads in our $TEST sample, we could count how many reads are for (e.g.) Escherichia coli, and decide that the relative abundance of E. coli in the sample is simply n_reads_E_coli / n_total_sample_reads. This might work in some cases, but it would only count reads that were assigned exactly to E. coli, and would miss all the E. coli reads that could only be assigned to (e.g.) genus Escherichia, or class Gammaproteobacteria, because there were too many similar matches. At the end, you would have a community table based only on unique sequence matches, ignoring all other data, which probably isn't what you want.
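As a toy example of that naive calculation (made-up numbers, just to show the arithmetic):

# 2,000 reads assigned exactly to E. coli, out of 1,000,000 total reads in the sample
echo "scale=4; (2000 / 1000000) * 100" | bc     # = .2000 (% relative abundance)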

Bracken addresses this problem by using Bayesian re-estimation to estimate, using the distribution of all reads in the reference phylogenetic tree, what the "real" abundances were for all (e.g.) species in the original sample. This:

  1. uses all the reads in the sample
  2. tends to give a more accurate representation of what the community looks like, at (e.g.) the species-level
  3. as above - it's the recommended route by the developers

Here, because Bracken is fast and relatively lightweight, slurm would be unnecessarily complicating things. We will make:

  1. a database - this step will be done for us on the HPC, but it's shown below anyway.
  2. a report for each sample processed by Kraken2
  3. combine these reports into one document - this is our community table!

Bioinformaticians use three weird tricks to make their microbial dreams come true:

  • -t = threshold, def = 10 - the number of reads below which a taxo-branch on the Kraken2 tree will be dropped from further assignment of abundances
  • -l = level, def = S - among the options K, P, C, O, F, G, S. We must set the taxonomic level we want to pool our reads to: usually we are interested in taxa at the species level, but this can be changed using the -l parameter when we run Bracken, to get a better idea of composition at different taxonomic groupings
  • -r = read length - taken as the starting read length; refer to your FastQC output for this!

make a Bracken database from Kraken2 database

There are several databases on the HPC for Bracken, but if there is not one for our read-length (130bp?), we will have to make one!

## build data base first
BR_kmer=35    # this is the default kmer length of the Kraken2 DB on the HPC
BR_leng=130   # length post-trimm

## make a local install of bracken via github for building DB
mkdir ~/bin ; cd ~/bin &&
wget https://github.com/jenniferlu717/Bracken/archive/refs/heads/master.zip ; unzip master.zip && rm master.zip
cd ~/bin/Bracken-master
chmod +x install_bracken.sh ; ./install_bracken.sh

echo '#!/bin/bash

#SBATCH --job-name=brack_db
#SBATCH --ntasks=20
#SBATCH --output=brack_db.%j.txt

#SBATCH --time=240:00

BRDB_DIR=$1
BRDB_KMER=$2
BRDB_LENG=$3
BRDB_THREADS=$4

## not tested
time ~/bin/Bracken-master/bracken-build -d $BRDB_DIR -k $BRDB_KMER -l $BRDB_LENG -t $BRDB_THREADS

' > $MAT/slurm_brak_db.sh # dir kmer length threads

# run on the existing, default kraken database
sbatch $MAT/slurm_brak_db.sh /data/databases/kraken2 $BR_kmer $BR_leng 20

run Bracken

We then use that database, and process our Kraken2 outputs. This step is very fast, as we simply run bracken on the kraken_report file. Try it first on our test file!

# note: the module name is misspelled on the HPC ("braken")
module load braken

# crucial parameters! feel free to adapt -t and -l; -r is probably more useful if your read length differs from the DB being used
BR_r=130
BR_l=S
BR_t=10   # counts! not threads

# run bracken
bracken -d $K2REFDIR/ -i $KROUT/${TEST}_kraken_report -o $KROUT/${TEST}.bracken -r $BR_r -l $BR_l -t $BR_t

# note the header for the different columns:
head -1 $KROUT/${TEST}.bracken

less -S $KROUT/${TEST}.bracken

Then try it on all samples. It wonā€™t take much more time than doing one, but this time we combine all the samples to one large table (combine_bracken_outputs.py, kreport2mpa.py):

for i in $(cat $MAT/samples);
  do bracken -d $K2REFDIR/ -i $KROUT/${i}_kraken_report -o $KROUT/${i}.bracken -r $BR_r -l $BR_l -t $BR_t ;
done                                                                                        

## combine reports to one monolith.tsv
combine_bracken_outputs.py --files $KROUT/*.bracken -o $KROUT/krakenStnd_abundances.tsv

output taxonomic hierarchy

Kraken2+Bracken will only give us the abundances at the level (-l) requested (we asked Bracken for Species, above). If interested in showing (e.g.) the Phylum level, we could redo this Bracken step, or could redo it for all levels, and have multiple versions of the community, at different taxonomic ranks. It wouldn't take much, but it sounds complicated.

Instead, here:

  1. we install KrakenTools, a small folder of helper scripts by Bracken's author
  2. we convert just one sample to MPA format (from ye olde metaphlan)
  3. keep only the lines informative to the species-level
  4. replace the "|" characters with tabs, so that you can open it in excel etc.
step effect
kreport2mpa.py change the Kraken2 report to Metaphlan (mpa) format for compatibility
grep -h '|s_' find lines with info down to Species level, as they contain '|s_'...
cut -f 1 chop the line at tabs, and keep the first field (the taxonomy string, dropping the count)
sort | uniq sort the taxonomy strings by name, and keep each unique string only once
sed 's/|/\t/g' swap "|" for "\t" (tabs), so each rank gets its own column

NB: because we used the --report-zero-counts option in Kraken2 for the test report, this will have all the species-level taxa in the database, and we can use it with any of our samples (this taxonomic table will do for all our samples). As such, we only need to do this step once per project. If you lack this test report, simply run Kraken2 on any sample, using the --report-zero-counts flag.

Note also that some lifeforms, e.g. viruses, don't have the same number or type of taxonomic ranks as e.g. us, which can be problematic for the taxonomic table, as the columns will suddenly mean different things between types of life - this is simplified by sorting the data later, but it can still be an issue.

# have a look at the Kraken2 output - note that no single line has all the info on D/P/C/O/F/G/S...
less -S $KROUT/${TEST}_test_kraken_report

# download KrakenTools to your ~
mkdir -p $HOME/bin ; cd $HOME/bin ; git clone https://github.com/jenniferlu717/KrakenTools.git ; cd $HOME

# KrakenTools: convert to mpa stylee
~/bin/KrakenTools/kreport2mpa.py -r $KROUT/${TEST}_test_kraken_report -o $KROUT/${TEST}_kraken_mpa

# pipe mpa file to reformat
grep -h '|s_' $KROUT/${TEST}_kraken_mpa | cut -f 1 | sort | uniq | sed 's/|/\t/g' > $KROUT/krakenStnd_taxonomy.tsv

# a last look
less -S $KROUT/krakenStnd_taxonomy.tsv

check Kraken2+Bracken outputs

Finally! Most of the data management from here will take place in R, but we can check our outputs here & now before moving over:

# sample abundances
less $KROUT/krakenStnd_abundances.tsv

# taxonomic hierarchy:
less $KROUT/krakenStnd_taxonomy.tsv

identify reads with Kaiju 🐲

It is nice when things work, but this is not always the case. Here we use Kaiju, a program similar to Kraken2 in that it uses sequence patterns (k-mers) and not just alignments, but Kaiju uses amino acid (AA) patterns rather than nucleotides. Our steps, and many of the concepts, are similar to Kraken2, although note that there is no Bracken-like step with Kaiju; we will, however, employ a similar reads threshold of -c 10, so that only assignments detected 10 or more times are kept. Note also that (depending on database and sample) Kaiju (1060 min) takes a lot longer to run than Kraken2 (28 min), as it needs to translate all sequences into 6 different reading frames.

parameter function
-e allowed mismatches ( def = 3 )
-m match length ( def = 11 )
-s match score ( def = 65 )
-z number of threads ( def = 1 )

First, we try a test example:


## setup kaiju  ==========================================
    
conda create -n kaijamie
conda activate kaijamie
conda install -c bioconda kaiju

KaDB=/data/databases/kaiju/nr     # nr-euk very big..
Ka_THREADS=10
KaOUT=$WRK/4__kaiju
mkdir $KaOUT
# ulimit -s unlimited    # nasa mem trick


## choose!
#TEST=Q4T8
#TEST=_N5_Nuria_S296_L001


## test kaiju  ==========================================
    
time kaiju \
-v \
-z 5 \
-e 3 \
-m 11 \
-s 65 \
-t $KaDB/nodes.dmp \
-f $KaDB/kaiju_db_nr.fmi \
-i $KDOUT/${TEST}_bt2decon_R1.fastq.gz \
-j $KDOUT/${TEST}_bt2decon_R2.fastq.gz \
-o $KaOUT/${TEST}_kaiju

less -S $KaOUT/${TEST}_kaiju

# check rates of assignment (classified reads / total reads)
KAI_C=$(grep -cE '^C' $KaOUT/${TEST}_kaiju )
KAI_U=$(grep -cE '^U' $KaOUT/${TEST}_kaiju )
echo "scale=3; $KAI_C / ($KAI_C + $KAI_U)" | bc

And then, we create a slurm script to process the samples in serial, to avoid crashing the airplane with our memory demands. Note we also use if-statements to check whether or not the sample has been run before - this can be helpful to avoid needlessly duplicating work. We also count reads assigned vs. total reads.


## slurm-loop  kaiju   =======================================

echo '#!/bin/bash

#SBATCH --job-name=kaij_loop
#SBATCH --output=kaij_loop.%j.txt
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --time=150:00

LIST=$1
IN=$2
OUT=$3
DB=$4
THREADS=$5


## one big slurm to for-loop over kaiju, thereby avoiding mem issues

# operate over some list of sample names
for i in $( cat $LIST );
do
  if [ -s $OUT/${i}*_kaiju ]
  then
    echo "## kaiju output for $i already in $OUT - stopping"
  else
    echo "## firing kaiju on $i"

    time kaiju \
    -v \
    -e 3 \
    -m 11 \
    -s 65 \
    -z $THREADS \
    -t $DB/nodes.dmp \
    -f $DB/kaiju_db_nr.fmi \
    -i $IN/${i}_bt2decon_R1.fastq.gz \
    -j $IN/${i}_bt2decon_R2.fastq.gz \
    -o $OUT/${i}_kaiju

    # count things
    if [ -s $OUT/${i}*_kaiju ]
    then
      KAI_C=$(grep -cE '^C' $OUT/${i}_kaiju )
      KAI_U=$(grep -cE '^U' $OUT/${i}_kaiju ) 
      KAI_TOT=$(echo "$KAI_C + $KAI_U" | bc)
      KAI_PC=$(echo "scale=2; ($KAI_C / $KAI_TOT)*100" | bc)
      echo "## kaiju sample processed: ${i} : total classified: ${KAI_PC}% (total: $KAI_TOT read-pairs)  ------"
    else
      echo "## no output - kaiju for sample ${i} failed"
    fi

  fi
    
done


' > $MAT/slurm_kaij_loop.sh  # **LIST** IN OUT DB THREADS


## fire all in serial
sbatch $MAT/slurm_kaij_loop.sh  $MAT/samples $KDOUT $KaOUT $KaDB $Ka_THREADS ;

Finally for Kaiju, we export all our samples into one combined table using kaiju2table to summarise to -r species level, including only those taxa seen (counted) at least -c 10 times, and including the -l domain,phylum,class,order,family,genus,species taxon levels. We will then process this table in R, as with (but differently to...) Kraken2.

## we can redo the counting of taxonomic assignments outside of the loop too:
for i in $( cat $MAT/samples );
do 
  KAI_C=$(grep -cE '^C' $KaOUT/${i}_kaiju )
  KAI_U=$(grep -cE '^U' $KaOUT/${i}_kaiju ) 
  KAI_TOT=$(echo "$KAI_C + $KAI_U" | bc)
  KAI_PC=$(echo "scale=2; ($KAI_C / $KAI_TOT)*100" | bc)
  echo "## kaiju sample processed: ${i} : total classified: ${KAI_PC}% (total: $KAI_TOT read pairs)  ------"
done


## all taxonomy together, *AT THE SPECIES LEVEL* - some issues are solved by adding the -l arg in kaiju2table
kaiju2table -t $KaDB/nodes.dmp -n $KaDB/names.dmp -r species -c 10 -l domain,phylum,class,order,family,genus,species -o $KaOUT/kaiju_summary_species.tsv $KaOUT/*_kaiju


# step off to R!

check Kaiju outputs

Finally, again! Just as with Kraken2+Bracken, it's a good idea to look at the data, and consider what information it provides.

# sample abundances
less $KaOUT/kaiju_summary_species.tsv

Review / Next Steps

It's worth checking what we have done in this page before moving on - here we check the outputs, and the size of data associated:

## obtained the raw sequences
ls -lsh $RAW
du -sh $RAW


## QC - checked quality
ls -lsh $FQC


## QC - removed artificial sequences and low quality sequences
ls -lsh $FILT
du -sh $FILT


## built a bowtie2 database of possible contaminants for our data
ls -lsh $BT_DB
du -sh $BT_DB


## removed contaminants from our data
ls -lsh $KDOUT
du -sh $KDOUT


## generated microbial profiles for our data using nucleotide seq info (Kraken2/Bracken)
ls -lsh $KROUT

# sample abundances
less $KROUT/krakenStnd_abundances.tsv

# taxonomic hierarchy:
less $KROUT/krakenStnd_taxonomy.tsv


## generated microbial profiles for our data using translated seq info (Kaiju)
ls -lsh $KaOUT

# sample abundances
less $KaOUT/kaiju_summary_species.tsv


## the scripts and parameters etc:
ls -lsh $MAT

towards analysis in R 💻

There are many things that we can do to change the above, but once happy: