This document covers how to do the following:
- ssh in, and set up a bash-based bioinformatics project
- check sequence quality (FastQC+MultiQC)
- trim and filter reads (trimmomatic)
- remove host/contaminant reads (kneaddata)
- profile the microbial community (Kraken2+Bracken)
We provide minimal guidance on how to adapt the different steps to the workflow (different parameters, different options or tools).
This document was originally put together by Ines Calvete (IPLA/CSIC), Nuria Peña Vidal (UCM), and Jamie FitzGerald (Teagasc/UCD) as a roadmap for analysing microbiomes from cheese, and the chicken GI tract, respectively. It uses material sourced from well-written reference manuals, online communities, and our own mistakes (see the References section). We assume a beginner's understanding of the HPC, ssh, bash, and scripting.
Please note this page is still being written. If something is missing, unclear, or flaming wrong, please point it out!
This is not a guide to the Why of the methods used here. For that, check the References section at the bottom of the page. In particular, importing data into, and analysing communities in, R will need their own documents:
- analysing microbial communities in R: this is a big topic - consider the links below (and at the bottom of the page)
Mistakes, and trial-and-error, are part of learning any skill. Bioinformatics involves a particularly large and constant amount of trialing, as well as lots of errors (see: how to read an error).
One of the best ways to learn quickly (and hopefully make different, exciting mistakes rather than the same ones) is to keep notes of what you do, just like in a lab - and to get comfortable looking things up:
- bash: for some e.g. program, try entering program --help, program -h, or man program.
- R: for a function(), enter ?function to see that function's help page, or try pressing tab after a comma to see other options that function() understands.
- R: for some e.g. object (dataframe, matrix, list, etc.) you can try using the str(object) function to see its structure, head(object) to see the top 6 lines or rows, class(object) to see what sort of thing it is, and dim(object) to see its dimensions (rows x columns; this will be NULL if it's a list or vector etc., as these don't carry dimensions in the same way).
- search the error message online (web search, stackoverflow, biostars, bioconductor, github, huttenhower-biobakery, etc.) - the issue might be simple! If that doesn't work, or you're not sure, ask!
Finally, please remember (when working on the HPC server):
- when you login to the HPC, you must always change to a compute node (e.g. ssh compute05) before you do any work
It can be helpful to define any frequently used paths / values / variables at the start of your workflow, based on what you want to do. This keeps your workspace visible, easy to refer to, and allows you to reliably change the value throughout your work (e.g.Ā a new project, a new database, etc.).
A good way to keep track of all your folders/files/values is to use $environmental_variables - these are temporary shortcuts added to bash's working environment that tell the computer to replace the $variable with the file, folder, or value we have linked it to. For example, entering PLAY=/home/user/Music/playlists would create a new variable, $PLAY: when $PLAY is entered, the computer will spot the $ character, understand that it's a shortcut, and replace that variable ($PLAY) with the value assigned to it (/home/user/Music/playlists) wherever that variable is found.
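As a quick illustration, using the example above:
# assign a variable (no spaces around the "="), then refer to it with "$"
PLAY=/home/user/Music/playlists
echo $PLAY     # prints /home/user/Music/playlists
ls $PLAY       # the same as: ls /home/user/Music/playlists
unset PLAY     # forget the variable again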
We will assign variables to the folders you will use
regularly, and for a sample to use for testing things on. Feel
free to change these to locations and/or names that make sense to you.
Here $CAPITALLETTERS help keep these short words visible, but it's not necessary if you don't like shouting. Variables can use letters, numbers, and underscores (more info on variables).
We actually use these variables all the time! Those created by the user ('user' variables) only last until you close the terminal window, or until a program or for-loop finishes (see $i in the for-loop section below!) - after this, they will be forgotten. Note that this means you have to enter these variables at each new session (there are clever ways of doing this too). A lot of other variables are created by the system ('system' variables) every time you open a new terminal window:
# $PATH tells the computer all the folders that have programs in them (multiple folder paths)
echo $PATH
# $HOME tells the computer where the user's home directory is (folder paths)
echo $HOME
# $TERM tells the computer what the default command-line interface is (a value called by other programs)
echo $TERM
# $HISTSIZE tells the computer how many commands to remember (a value)
echo $HISTSIZE
# blah blah blah
Thank you for asking! Remember, on the HPC:
- your primary data directory (under your project name on the hpc) - here we give it the shortcut variable $RAW. This is your 'primary' data source, so do not edit it
- your analysis folder. We set this variable shortcut as $WRK, and all our output goes inside this dir.
# databases for different tools
DB=/data/databases # HPC only
HGR38_BT=/data/databases/hostremoval/Homo_sapiens/Bowtie2 # HPC only
K2REFDIR=/data/databases/kraken2 # HPC only
BT_DB=/data/Food/analysis/R0936_redcheese/ref
# our overall data structure
RAW=/data/Food/primary/R0936_redcheese
WRK=/data/Food/analysis/R0936_redcheese
# scripts, and slurms for queueing up jobs
MAT=$WRK/Materials
# outputs etc
QC=$WRK/1__qc
FQC=$QC/fastqc
FILT=$WRK/2__filt
KDOUT=$WRK/3__knead
KROUT=$WRK/4__krak2
#!# set a sample-name to test things on
TEST=Q4T8
We will want to have FileZilla
(or similar) installed, as well as our ssh
client (putty
).
Next, we tell the HPC which programs we'll need for this work, which the HPC has parcelled into 'modules'. If we don't load the modules, our session won't be able to 'see' these programs in the $PATH (the list of places the computer looks to find programs). Some programs are always visible (e.g. commands like ls), others are loaded through e.g. conda (e.g. R...), and don't need to be mentioned here.
- FastQC and MultiQC for quality control
- Kraken2 for metagenomic community reconstruction
- Kaiju, also for metagenomic community reconstruction
- GNU-parallel for cases where we can run many jobs in parallel.
# this is a list of all programs explicitly required in this workflow
module load parallel fastqc multiqc trimmomatic bowtie2 samtools/1.10 kraken2/2.1.1 braken
Finally, we make sure that all the locations for the work we're doing exist! When we start, they won't exist yet (but note how the command looks up the full location for the directory): if $QC has not been defined, it will show you the contents of your current directory; if $QC is defined, but hasn't been created yet, it will tell you that the folder doesn't exist (which should be true!).
ls $QC
In our unified data ($RAW) and work ($WRK) folders, we create different folders for each output, using the variables above. It's not a problem if the folder already exists.
# make analysis dirs
mkdir $BT_DB $RAW $WRK $FILT $QC $FQC $KDOUT $KROUT
# make script dirs etc.
mkdir $MAT
# make dirs for qc outputs
mkdir $FQC/fhi_redch_raw $FQC/fhi_redch_filt $FQC/fhi_redch_raw_multi $FQC/fhi_redch_filt_multi
We copied the data from IPLA-CSIC (or elsewhere!) manually, into the
$RAW
dir, using FileZilla
. Sometimes we will
have to use scp
, sFTP
, or rsync
instead - a web search should show how to do this, and these programs
should all be available on the HPC.
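As a minimal sketch (the username, hostname, and local path are placeholders - swap in your own), run from your local machine:
# copy raw reads from your local machine up to the $RAW dir on the HPC
scp /path/to/local/reads/*fastq.gz yourname@hpc.example.org:/data/Food/primary/R0936_redcheese/
# rsync does the same job, but can resume interrupted transfers and skips files already copied
rsync -avP /path/to/local/reads/*fastq.gz yourname@hpc.example.org:/data/Food/primary/R0936_redcheese/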
For checking the sequence quality, we'll need the programs FastQC (quality profiling) and MultiQC (unifying FastQC reports):
# make sure you're on a compute node!
module load fastqc multiqc
Next, for our own information, we look at just the
fastq.gz
files, count the number of files, and run
FastQC
and MultiQC
on them.
# use ls arguments "-lsh" to output it as an informative list
ls -lsh $RAW
# How many files? - use "|" character to pipe the ls output to wc (wordcount), and count the n of lines (-l or --lines)
ls -lsh $RAW/*fastq.gz | wc -l
# can also easily check a fastq file
ls -lsh $RAW/$TEST* # file details
zcat $RAW/$TEST* | head -10 # decompress, read and print top 10 lines
less -S $RAW/$TEST* # read through file, press q to quit
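If you want a rough read count for a file (each fastq read takes 4 lines), a quick sketch:
# count reads in the compressed R1 file(s) for the test sample: total lines divided by 4 (use bzcat for .bz2 files)
echo $(( $(zcat $RAW/${TEST}*R1* | wc -l) / 4 ))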
FastQC and MultiQC
FastQC is an amazingly useful tool for profiling the quality, consistency, and content of your sequence data. It takes samples of each file it is shown (compressed or uncompressed), checks the sequences, and returns an HTML & zip file for each FASTQ: the HTML is a report, and the zip is all the data from that report. Next, MultiQC combines all the FastQC outputs into one report, summarising all that info in one place. Always check this!
FastQC will process all fastq(.gz) files that it is passed. MultiQC will process all the FastQC outputs it finds in a directory, combine them, and place output in the dir specified (multiqc_report.html in -o). Keep in mind that FastQC and MultiQC outputs also contain text files of all this data (see the multiqc_data dir), if you want to use them for other graphics/reports etc.
FastQC, on one sample:
# then run a fastqc for F and R files, output in the dirs we made above
fastqc -t 4 $RAW/${TEST}* -o $FQC/fhi_redch_raw
# make a report that includes both F and R reads for this sample
multiqc $FQC/fhi_redch_raw -o $FQC/fhi_redch_raw_multi
Copy the output multiqc files to your local computer, and double-click to open in your browser. For help reading FastQC/MultiQC, there's an introductory tutorial video at the top of the MultiQC report.
FastQC, on multiple samples:
We need to be responsible with our use of the HPC. Here we write a script in slurm format (see the HPC website for more info on slurm!!) to queue all these jobs politely into one FastQC task, and then summarise using MultiQC again.
We also found out that some fastq files are compressed as .gz, and some are compressed as .bz2. This means that we need to specify both types of fastq if we want it to work. The best way is to change our command to match both! We show the computer the two different compression types using the curly brackets: {gz,bz2}. This allows the computer to accept *fastq.gz and *fastq.bz2:
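A quick way to convince yourself of what the curly brackets do, before putting them in a script:
# bash expands the braces into one pattern per listed option
echo $RAW/*fastq.{gz,bz2}
# so one ls call can list both compression types
ls -lsh $RAW/*fastq.{gz,bz2}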
# write a slurm script first
echo '#!/bin/bash
#SBATCH --job-name=knead_fastq
#SBATCH --output=knead_fastq.txt
#SBATCH --ntasks=15
#SBATCH --time=15:00
IN=$1
OUT=$2
# time just tells us how long this takes, so we know for next time
# -t is the number of threads (tasks) to use
# curly brackets {} allow us to match either gz or bz2 file extensions
time fastqc -t 15 $IN/*fastq.{bz2,gz} -o $OUT
' > $MAT/slurm_fqc.sh
# review:
cat $MAT/slurm_fqc.sh
This slurm script will give a name, output-log, runtime, and number of threads to the slurm manager, use the input and output dirs we give it to find the fastq.gz/bz2 files, start FastQC, and place outputs in the $OUT dir. We need to 'give' this slurm script to sbatch:
NB: we have noticed that the {} brackets do not always solve the problem of missing the bz2 samples. We simply resubmitted those samples as fastqc -t 4 $RAW/*fastq.bz2 -o $FQC/fhi_redch_raw - this works around the symptom rather than solving the problem!
# trust slurm
sbatch $MAT/slurm_fqc.sh $RAW $FQC/fhi_redch_raw
# combine outputs
time multiqc $FQC/fhi_redch_raw -o $FQC/fhi_redch_raw_multi
# then copy to local computer (try FileZilla), and open in your browser!
Again, open the HTML report in your browser to see what the overall quality is like, and how much cleaning you will need to do.
With information on all the samples, it is possible (but not necessarily easy) to pick appropriate trimming and quality parameters.
- trimming the 5' end removes 'artificial' bases at the start of F+R reads: check the FastQC or MultiQC outputs to see how long it takes for Per-Base Sequence Content to become homogeneous/balanced
- trimming the 3' end removes bases where the sequencing quality declines gradually (especially in the R reads)
Consider: How do the sequences look? Is quality uniform throughout, or does it vary? What other warnings (in red) do FQC+MQC detect?
Trimmomatic
Hopefully, there are not too many problems with the data! Nevertheless, we always do a little cleaning of the sequences before we analyse them. There are several things that can go wrong, and we can make the overall dataset better by removing reads, or trimming off:
- bases at the start (5') and end (3') of a read, which are often best removed
We use the program Trimmomatic to do all of this. This java-based program can do many things to check your quality, but we will focus on trimming and filtering guided by the MultiQC profile we made in section 1.
For quality control, we'll need the programs Trimmomatic (trimming and filtering), FastQC (quality profiling), and MultiQC (unifying FastQC reports):
# make sure you're on a compute node!
module load fastqc multiqc trimmomatic
First, we will use echo '...' > file.fa to make a fasta file of the known adapter sequences, using references available on the Teagasc HPC, as well as the sequences provided by Trimmomatic.
echo '>adapter_seq
CTGTCTCTTATACACATCT
>adapter_mate_seq
AGATGTGTATAAGAGACAG
>Transposase_Adap__for_tagmentation_1
TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
>Transposase_Adap__for_tagmentation_2
GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG
>PCR_primer_index_1
CAAGCAGAAGACGGCATACGAGATNNNNNNNGTCTCGTGGGCTCGG
>PCR_primer_index_2
AATGATACGGCGACCACCGAGATCTACACNNNNNTCGTCGGCAGCGTC
>PCR_primer_index_2_rc
GACGCTGCCGACGANNNNNGTGTAGATCTCGGTGGTCGCCGTATCATT
>PCR_primer_index_1_rc
CCGAGCCCACGAGACNNNNNNNATCTCGTATGCCGTCTTCTGCTTG
>Transposase_Adap__for_tagmentation_2_rc
CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
>Transposase_Adap__for_tagmentation_1_rc
CTGTCTCTTATACACATCTGACGCTGCCGACGA
>adapter_mate_seq_rc
CTGTCTCTTATACACATCT
>adapter_seq_rc
AGATGTGTATAAGAGACAG' > $MAT/fqc_trimmo_ill_ref.fa
# have a look!
less $MAT/fqc_trimmo_ill_ref.fa # press q to exit
Trimmomatic, on one sample:
To clean the $TEST sample, we use these variables:
- $RAW: the input dir, assigned as /data/Food/primary/../././
- $FILT: the output dir, assigned as /data/Food/analysis/../././
- $TEST: the 'test' sample, assigned as XXXXXXX
- $MAT: where we stored scripts, logs, and the fasta above, assigned as ./././Materials
These variables help us to tell Trimmomatic the following:
option | effect |
---|---|
PE | paired-end mode, so examine them together |
$RAW/${TEST}_R1_001.fastq.gz | input F reads |
$RAW/${TEST}_R2_001.fastq.gz | input R reads |
$FILT/${TEST}_R1_trimm.fastq.gz | output for trimmed F reads |
$FILT/${TEST}_R1_trimm_unpaired.fastq.gz | output for trimmed F reads that have lost their R mate (see below) |
$FILT/${TEST}_R2_trimm.fastq.gz | output for trimmed R reads |
$FILT/${TEST}_R2_trimm_unpaired.fastq.gz | output for trimmed R reads that have lost their F mate (see below) |
HEADCROP:20 | trim the first 20bp from the 5' end |
CROP:130 | trim everything after 130bp from the 3' end |
ILLUMINACLIP:$MAT/fqc_trimmo_ill_ref.fa:2:30:10:5 | using the reference fasta above, trim adapter/artificial sequences from reads (check parameters on trimming) |
SLIDINGWINDOW:6:15 | trim once the average quality in a 6-base window drops below 15 (check parameters on trimming) |
MINLEN:80 | after cropping & trimming, sequences must be at least 80bp long to be kept (otherwise delete them) |
-threads 6 | use 6 threads (tasks/jobs) in parallel |
> $FILT/trimmo_${TEST}.out | use > to send error or system messages to a logfile so you can check for errors afterwards |
Trimmomatic checks the quality of our sequences, trimming off 'bad' sequence bases and discarding 'bad' reads. Most reads will still be paired (have a matching F and R read). In some cases, the F read is discarded (thrown away) but we keep the matching R read because it is good quality (or we keep F and discard R). As a result of this, Trimmomatic will sort each sample into 4 files:
$FILT/${TEST}_R1_trimm.fastq.gz
$FILT/${TEST}_R2_trimm.fastq.gz
$FILT/${TEST}_R1_trimm_unpaired.fastq.gz
$FILT/${TEST}_R2_trimm_unpaired.fastq.gz
For now, we will not use the unpaired reads (we ignore them!), as it can lead to complications elsewhere (because of the mix of paired and unpaired reads, which is a problem for some programs). Note also that there are many other options available, to tackle different problems - see the Trimmomatic manual for more.
When it is all put together, it looks like this (the \
character allows us to start a newline without interrupting the
command):
trimmomatic PE \
$RAW/${TEST}_R1_001.fastq.gz $RAW/${TEST}_R2_001.fastq.gz \
$FILT/${TEST}_R1_trimm.fastq.gz $FILT/${TEST}_R1_trimm_unpaired.fastq.gz \
$FILT/${TEST}_R2_trimm.fastq.gz $FILT/${TEST}_R2_trimm_unpaired.fastq.gz \
HEADCROP:20 \
CROP:130 \
ILLUMINACLIP:$MAT/fqc_trimmo_ill_ref.fa:2:30:10:5 \
SLIDINGWINDOW:6:15 MINLEN:80 \
-threads 6 > $FILT/trimmo_${TEST}.out
# when finished, look at the output:
less $FILT/trimmo_${TEST}.out
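Optionally, a quick sketch to see how many reads landed in each of the four output files (paired vs unpaired):
# read counts per trimmed output file: lines / 4
for f in $FILT/${TEST}_R?_trimm*.fastq.gz;
do echo "$f : $(( $(zcat $f | wc -l) / 4 )) reads";
done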
Trimmomatic, on multiple samples:
Because these reads are paired-end, we need to set up a slurm script to process each sample name, find the F and R fastq.gz files, process them together, and output them to the correct folders. There are a few ways to get the sample names, but because we will need to process each sample name several times, we are going to use a simple, re-usable method:
If you are very careful and have a lot of time, you can type all the sample names into a text file by hand. You can also do it quickly using a chain of bash commands joined together by the '|' or pipe character:
command | output |
---|---|
ls $RAW/*fastq.gz | list all filenames in $RAW ending in fastq.gz |
sed -e ... | simplifies the filename using regex (-e flag; sed & regex are complex but ++useful - see intro here) |
s/a/b/g | sed: substitute a for b, everywhere it's found (i.e. globally) |
.*\/\(.*\)_R._001.* | sed - our a to be replaced: find text matching /(...)_R._001, where "." can be anything, and keep the bit in brackets |
\1 | sed - our b to replace a: just paste in the bit found in brackets |
sort | sorts filenames alphabetically - this is necessary for uniq |
uniq | gives each unique name |
> $MAT/samples | send filenames to a file in $MAT called samples |
# combine all those different parts!
ls $RAW/*fastq.gz | sed -e 's/.*\/\(.*\)_R._001.*/\1/g' | sort | uniq > $MAT/samples
We can then use cat $MAT/samples or less $MAT/samples to print out all the sample names. We can also feed all these sample names into a for-loop, which takes each row (sample name) and places it inside the for-loop wherever it finds (e.g.) $i ($i is what everyone uses, but it's just a variable like $MAT etc. - it could be anything you choose).
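A harmless way to see this in action before submitting any real jobs:
# print each sample name in turn - $i takes each value (row) from the samples file
for i in $(cat $MAT/samples);
do echo "sample: $i";
done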
First, we make a slurm
script for running
Trimmomatic
:
echo '#!/bin/bash
#SBATCH --job-name=trimmoRaw
#SBATCH --output=trimmoRaw.txt
#SBATCH --ntasks=6
#SBATCH --time=18:00
# these take the terms given after the scriptname, i.e. "... $i $RAW $FILT $MAT"
SAMPLE=$1
IN=$2
OUT=$3
REFDIR=$4
# trimmomatic - use backslash to separate rows
trimmomatic PE \
$IN/${SAMPLE}_R1.fastq.gz \
$IN/${SAMPLE}_R2.fastq.gz \
$OUT/${SAMPLE}_R1_trimm.fastq.gz \
$OUT/${SAMPLE}_R1_trimm_unpaired.fastq.gz \
$OUT/${SAMPLE}_R2_trimm.fastq.gz \
$OUT/${SAMPLE}_R2_trimm_unpaired.fastq.gz \
HEADCROP:25 \
CROP:125 \
ILLUMINACLIP:$REFDIR/fqc_trimmo_ill_ref.fa:2:30:10:5 \
SLIDINGWINDOW:6:15 \
MINLEN:80 \
-threads 6 > $OUT/trimmo_${SAMPLE}.out 2>&1' > $MAT/slurm_trimmo.sh
NB: increased time from 6min to 18min to avoid the job being terminated; had to replace a bad - with --
Then we make a simple for-loop (see this link) that repeats the command sbatch $MAT/slurm_trimmo.sh $i $RAW $FILT $MAT for every possible value of $i (all the sample names). In this way, each job uses the 6 threads we specified above to process one sample after another (i.e. samples are processed in serial rather than in parallel).
# this command lists all the sample names
cat $MAT/samples
# we can put that command $(in brackets), and feed it into the for-loop like this:
for i in $(cat $MAT/samples);
# and it will repeat this command for all the different values of $i, below:
do sbatch $MAT/slurm_trimmo.sh $i $RAW $FILT $MAT;
# close the for-loop
done
Alternatively, there might be a sample we need to re-run. We do not need to use for-loops etc. to pass jobs to slurm: we can simply tell it what job to do, on what sample name:
# replace the $i in the for loop with the sample name (e.g. P2620), and delete the starting "do" and the end ";"
sbatch $MAT/slurm_trimmo.sh P2620 $RAW $FILT $MAT
sbatch $MAT/slurm_trimmo.sh P2233 $RAW $FILT $MAT
sbatch $MAT/slurm_trimmo.sh P2312 $RAW $FILT $MAT
sbatch $MAT/slurm_trimmo.sh etc... $RAW $FILT $MAT
Again, we check the Trimmomatic output with FastQC/MultiQC, copying the multiqc_report to our computer, and seeing if there's much improvement.
fastqc -t 4 $FILT/*fastq.gz -o $FQC/fhi_redch_filt
# move unpaired reports away - avoid clogging up multiqc vis
mkdir $FQC/fhi_redch_filt_unpaired
mv $FQC/fhi_redch_filt/*unpaired* $FQC/fhi_redch_filt_unpaired
# look at just the trimmed, paired data (no unpaired reads)
multiqc $FQC/fhi_redch_filt/ -o $FQC/fhi_redch_filt_multi
It is hard to know when the quality is 'good enough'. In general, we should keep above Q30, but this is not an indicator of other things like adapter contamination. Often, a later step will not work, and it can be necessary to be stricter in our cleaning, or try different approaches etc. Be clear with yourself, and with other people, about what pre-processing steps you are taking.
Sometimes, it might be necessary to use Trimmomatic more than once before the sequence is 'clean'. In general it is easier to program, easier to organise, and easier to reproduce a metagenomics analysis when all sequence cleaning steps are carried out in one go. Again, keep notes of what you have done - this makes remembering what happened much easier.
Filtering and trimming in the previous steps should have removed nearly all of the fake sequences: low-quality bases that cannot be trusted (trimming), as well as artificial sequences introduced by library preparation and sequencing (filtering).
Next, it is a very good idea to consider what real sequences have been included in our dataset but do not belong in our analysis. Contaminants could come from the host, the lab or kit reagents, foodstuffs, or other sources in the sampling environment.
In some environments (e.g. faeces, sludge, soil), contaminating sequences might only comprise a small proportion of the total sequences, and are unlikely to change the 'big picture' of the dataset, but could interfere with analysis of specific microbes, or metabolic functions, in ways we cannot predict. In other environments (e.g. microbiomes from biopsy, the aeolian microbiome), contaminants can make up the majority of a sample, altering the 'big picture' significantly, and making it difficult to draw meaningful conclusions - and you won't know why.
To address this as best we can, we remove contaminants based on a database of reference sequences appropriate to our study. We collect these sequences (search for the host, known contaminants, artificial sequences, foodstuffs, etc.), compile them into a database (i.e. a big file that's formatted so that it's easy for computers to search through it quickly), and then ask a program (e.g. the DNA-DNA alignment program, bowtie2) to look for, and remove, any sequences that match those contaminants.
For decontamination, we'll need the programs BowTie2 (database-building & aligning) and samtools (filtering files in sam and bam formats), as well as FastQC & MultiQC for profiling the sequences output.
# make sure you're on a compute node!
# note the samtools version.. for the teagasc HPC at any rate.
module load bowtie2 samtools/1.10 parallel fastqc multiqc
BowTie2
We need reference genomes relevant to our experiment. For this tutorial, we consider several possible sources of host contamination:
- GRCh38.p14, GenBank GCF_000001405.40 - contamination from the preparation, extraction, & sampling
- hs37d5ss ('11) - a bundle of synthetic, virus-associated, and other miscellaneous sequences that don't map well from/to the human genome (see more here)
- GCA_021234555.1 ('21, contig/scaffold N50s: 50M/104M) - from the source milk
- GCA_016772045.1 ('21, contig/scaffold N50s: 43M/101M) - from the source milk
- GCA_001704415.2 ('16, contig/scaffold N50s: 26M/87M) - from the source milk
- GCA_016699485.1 ('21, contig/scaffold N50s: 19M/91M) - host for the chicken gut microbiome
We will first obtain all of these sequences from the NCBI (should/could use ENA...)
echo '#!/bin/bash
cd $1
# download your own genome (more or less) - 1.2 GB, 12min :: from the NCBI browser https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_000001405.40/
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v1/genome/accession/GCF_000001405.40/download?filename=GCF_000001405.40.zip" -H "Accept: application/zip"
# old decoy genome was 9 MB, now (Feb 2025) using newer, smaller decoy genome! (smaller because human material better integrated into GRCh38...)
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/786/075/GCA_000786075.2_hs38d1/GCA_000786075.2_hs38d1_genomic.fna.gz
# cow genome - 825 MB, 5min
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v1/genome/accession/GCA_021234555.1/download?filename=GCA_021234555.1.zip" -H "Accept: application/zip"
# sheep genome - 890 MB, 4min
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v1/genome/accession/GCF_016772045.1/download?filename=GCF_016772045.1.zip" -H "Accept: application/zip"
# goat genome - 930 MB, 3min
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v1/genome/accession/GCF_001704415.2/download?filename=GCF_001704415.2.zip" -H "Accept: application/zip"
# chicken genome - 440 MB, 1min
curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v1/genome/accession/GCF_016699485.2/download?filename=GCF_016699485.2.zip" -H "Accept: application/zip"' > $MAT/trimmo_genomes_dl.sh
# do check that curl is installed!
# if not, try activating it: "module load curl"
curl -h
# give "run" permission for script to be run, for the user only
chmod u+x $MAT/trimmo_genomes_dl.sh
# make a dir for genomes in our RAW folder
mkdir $RAW/ref
# run the script while pointing at those files
$MAT/trimmo_genomes_dl.sh $RAW/ref
Our genomes should be inside the downloaded .zip files, under the directory /ncbi_dataset/data/<genbank-accession>/, and will be the largest .fna file in there. Extract them to your $RAW/ref folder, and then compress the .fna to .fna.gz with gzip to save space. Genome data is fairly dense, so this will/might only compress to about 1/3 of the raw size.
NB: check here that you have the genomes relevant to you, and make sure that the filenames above match the filenames used in the following step (e.g., does a file end in .fna or .fa?...)
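As a rough sketch of the extraction step for one download (the accession and filename here are examples - adjust to the genomes you actually fetched):
cd $RAW/ref
# unzip one NCBI datasets download; the genome is the largest .fna under ncbi_dataset/data/<accession>/
unzip GCF_000001405.40.zip
mv ncbi_dataset/data/GCF_000001405.40/*.fna $RAW/ref/
# tidy up the leftover zip structure once the .fna files are safely in place
rm -r ncbi_dataset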
parallel gzip {} ::: $RAW/ref/*fna
BowTie2: make database
We then need to use bowtie2 to build a reference database for all of these genomes in one place. BT_threads is set to the number of threads available locally - feel free to increase it to something appropriate to your setup. We check the decoy genome first.
# number of processes/threads to use for jobs
BT_threads=10
# the reference genomes (above) were downloaded here:
ls $RAW/ref
# the new bowtie2 databases will go here:
BT_DB=/data/Food/analysis/R0936_redcheese/ref
mkdir $BT_DB
# bowtie2-build [options]* <reference_in> <bt2_base>
time bowtie2-build --threads $BT_threads $RAW/ref/hs37d5ss.fa.gz $BT_DB/decoy_hs37d5ss.bt2 > $BT_DB/decoy_hs37d5ss.bt2.buildlog
## individual builds:
# decoy.9MB.fna.gz; 19 seconds for decoy, 9 threads, 15GB RAM
# chicken.340MB.fna.gz; 10min, 9 threads, 15GB RAM
# cow.853MB.fna.gz; 28m, 9 threads, 15GB RAM
That was quick! It was quick because the amount of data
(9MB
!) was small. Larger (real) genomes will require much
longer to run, and likely require a slurm
script.
Although jfg
is not certain of the theory behind
aligning to an index built from multiple reference genomes
(probably it uses the first match, even if there are several paralogous
sequences), it should be possible by listing several fasta
files as input to
bowtie2-build file1.fna.gz,file2.fna.gz,file3.fna.gz...
etc
(as per the bowtie2
manual). This avoids having to cat
all the
fasta
files together to make one giant
fasta
.
We need to notify slurm of the requirements:
- time: we tell slurm that the job should take 30min * 5.5 ~= 3hr for the genomes used here, with 10 processors and loads of memory. Alternatively, remove the --time line to avoid the job being killed if it runs over.
- threads: giving more threads to bowtie2 doesn't make a huge difference, and asking for many could mean we wait a long time for our job to be started, so we ask for only 10. We don't use the $BT_threads variable this time, as slurm can't 'see' this variable when it starts.
# build our BT2 database in slurm:
echo '#!/bin/bash
#SBATCH --job-name=bt2DB
#SBATCH --output=bt2DB.txt
#SBATCH --ntasks=10
#SBATCH --time=179:00
INOUT=$1
# feel free to add / remove genomes that are relevant to you
# do check the names before you run!
time bowtie2-build --large-index --threads 10 \
$INOUT/GCA_021234555.1_ARS-LIC_NZ_Jersey_genomic.fna.gz,\
$INOUT/GCF_000001405.40_GRCh38.p14_genomic.fna.gz,\
$INOUT/GCF_001704415.2_ARS1.2_genomic.fna.gz,\
$INOUT/GCF_016699485.2_bGalGal1.mat.broiler.GRCg7b_genomic.fna.gz,\
$INOUT/GCF_016772045.1_ARS-UI_Ramb_v2.0_genomic.fna.gz,\
$INOUT/hs37d5ss.fa.gz \
$INOUT/mult_ref #> $INOUT/mult_ref.buildlog # 2>&1
' > $MAT/slurm_bt2db.sh
# and then run (submit the script to slurm)
sbatch $MAT/slurm_bt2db.sh $BT_DB
This will give us a number of files ending in .bt2
in
our output dir. These are the reference files that bowtie2
uses to check for contaminant sequences in the fastq
data
(if you want, you can now delete the reference fna.gz
genomes!).
BowTie2: find contaminants
Bowtie2 will align our dataset against these reference indices, and make a note of which reads matched between the input (dataset) and reference (host). For our study, we will be removing any reads that match the reference (hosts), as we are looking at the microbiome specifically. However, in other studies, we might be interested in the host sequences only, and would throw away everything else (all the microbes).
NB: bowtie2 output interpretation
NB: the SAM tools spec
## firstly, align $TEST sequences to ref using bt2.
bowtie2 -p $BT_threads -x $BT_DB/mult_ref -1 $FILT/${TEST}_R1_trimm.fastq.gz -2 $FILT/${TEST}_R2_trimm.fastq.gz -S $KDOUT/${TEST}_bt2_refmapped.sam
## feel free to look at the output
less $KDOUT/${TEST}_bt2_refmapped.sam
We can use BowTie2 directly for filtering out contaminant reads, but because there are paired (F+R) reads, it's not quite as specific as we'd like. For better control, we can use the samtools program below.
samtools: delete contaminants
Now that Bowtie2 has identified the contaminant sequences, we use samtools to:
- convert the sam file to a bam file
- filter out the reads that mapped to the reference
- ensure F+R reads are in the same order
- convert the bam file to a fastq (throwing the extra sam info into the /dev/null bin)
## 1 - view with samtools, and convert to bam (binary)
samtools view -bS $KDOUT/${TEST}_bt2_refmapped.sam > $KDOUT/${TEST}_bt2_refmapped.bam
## 2 - filter the contaminant sequences out
# F 256 : exclude those mapping to ref
# f 12 : F and R *both* unmapped
samtools view -b -f 12 -F 256 $KDOUT/${TEST}_bt2_refmapped.bam > $KDOUT/${TEST}_bt2_decontamd.bam
## 3 = ensure reads are in a sensible order
samtools sort -n -m 5G -@ 2 $KDOUT/${TEST}_bt2_decontamd.bam -o $KDOUT/${TEST}_bt2_decontamd_sort.bam > ${TEST}_bam_sort.out
## 4 - stream outputs of non-interest to /dev/null
samtools fastq -@ 8 $KDOUT/${TEST}_bt2_decontamd_sort.bam -1 $KDOUT/${TEST}_bt2decon_R1.fastq.gz -2 $KDOUT/${TEST}_bt2decon_R2.fastq.gz -0 /dev/null -s /dev/null -n
# look at the fastq!
ls -lsh $KDOUT/${TEST}*
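# (optional suggestion) samtools flagstat summarises how many read pairs mapped to the host references,
# i.e. roughly how much will be removed - run it before deleting the sam/bam below
samtools flagstat $KDOUT/${TEST}_bt2_refmapped.bam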
## removing intermediates: SAM and BAM files are huge! delete --WHEN FINISHED--
rm $KDOUT/${TEST}*.{sam,bam}
We can apply the same technique to all of our samples using slurm. The database is already present, so we write slurm scripts for the BowTie2 alignment, samtools filtering, and samtools processing. Separating out the jobs would allow us to troubleshoot issues more easily (hopefully), but complicates batch processing of samples. It is more efficient, and possibly easier to supervise, if all the steps for each sample are processed together in one job, and errors produce a simple message that we can check. Here, we use if statements to check that the outputs are finished before we delete all the intermediate bam and sam files (which get very large).
**NB-jfg** : `time` flag is totally spurious ; lots of large intermediate files being created - we'll need to delete these as we go; **see `bt_deco.txt` for output**, which should be in the `~` directory... Problems may be recorded there ; also check name formats in the bt2 align step; issue with streaming applications sending binary to .out/.log files - logfile call deleted in response ; need slurm individual name tag
echo '#!/bin/bash
#SBATCH --job-name=deco_bt2.%j
#SBATCH --output=deco_bt2.%j.txt
#SBATCH --ntasks=10
SAMPLE=$1
IN=$2
OUT=$3
REF=$4
THREADS=$5
# load necess
module load bowtie2 samtools/1.10
## stamp for the output
DAT_TIM=$(date +"%T %F")
echo "--------------------------------------------"
echo $DAT_TIM
echo sampl: $SAMPLE
echo input: $IN
echo outpt: $OUT
echo "--------------------------------------------"
echo "-- align -------------------------------------"
# -x $REF/mult_ref \
time bowtie2 -p $THREADS \
-x $REF/mult_ref \
-1 $IN/${SAMPLE}_R1_trimm.fastq.gz \
-2 $IN/${SAMPLE}_R2_trimm.fastq.gz \
-S $OUT/${SAMPLE}_bt2_refmapped.sam
if [ $? = 0 ] && [ -s $OUT/${SAMPLE}_bt2_refmapped.sam ]
then
echo "-- change format -------------------------------------"
time samtools view -bS $OUT/${SAMPLE}_bt2_refmapped.sam > $OUT/${SAMPLE}_bt2_refmapped.bam
echo "-- remove matches ------------------------------------"
time samtools view -b -f 12 -F 256 $OUT/${SAMPLE}_bt2_refmapped.bam > $OUT/${SAMPLE}_bt2_decontamd.bam
## obviate - not enough memory, and not essential for kraken & co.
# echo "-- re-organise reads ---------------------------------"
# time samtools sort -n -m 5G -@ $THREADS $OUT/${SAMPLE}_bt2_decontamd.bam -o $OUT/${SAMPLE}_bt2_decontamd_sort.bam
echo "-- to fastq, dump other ------------------------------"
time samtools fastq -@ $THREADS $OUT/${SAMPLE}_bt2_decontamd.bam -1 $OUT/${SAMPLE}_bt2decon_R1.fastq.gz -2 $OUT/${SAMPLE}_bt2decon_R2.fastq.gz -0 /dev/null -s /dev/null -n
echo "--- $SAMPLE completed :) ----------------------------------"
else
echo ""
echo "!!!--- align failed :: $SAMPLE ---!!!"
echo ""
echo ""
fi
## shrink intermediates
if [ -s $OUT/${SAMPLE}_bt2decon_R1.fastq.gz ] && [ -s $OUT/${SAMPLE}_bt2decon_R2.fastq.gz ] ;
then
echo "-- shrinking *AMs ----------------------------------" ;
gzip $OUT/${SAMPLE}_*.{bam,sam} &
fi
' > $MAT/slurm_deco_full.sh # SAMPLE IN OUT REF THREADS
# send just one sample to the script with "head -1"
for i in $( cat $MAT/samples | head -1 );
do sbatch $MAT/slurm_deco_full.sh $i $FILT $KDOUT $BT_DB 10 ;
done
Often some samples don't work, e.g. due to our database being too big (bowtie2 error code 11: SEGV). We can make a file named bad_samples that lists the samples that didn't work, and re-submit them with fewer threads. This seems to work fine, but if not, consider checking your error messages (in deco_bt2.####.txt).
## feel free to make up some samples
echo -e "XXXX\nYYYY\nZZZ\nWWW" > $MAT/bad_samples
# delete the broken pieces of samples that didn't finish
for i in $( cat $MAT/bad_samples );
do rm $KDOUT/${i}* ;
done
## try fix bad_samples
for i in $( cat $MAT/bad_samples );
do sbatch $MAT/slurm_deco_full.sh $i $FILT $KDOUT $BT_DB 4 ;
done
# check outputs, check error messages
KneadData
Kneaddata is a quality control pipeline developed by the Huttenhower group (home of HUMAnN*, MetaPhlan*, LEfSe, etc.) to do all of the above steps: trimmomatic, bowtie2, and samtools, in one go, using the human genome as a default reference. It can be very useful if we're setting things up for the first time, as it will download the required programs and a human database for us. We did not use it here because it is incredibly useful to have some idea of how to use trimmomatic and bowtie2 to refine & identify reads of interest, alongside the fact that our host sequences were not limited to humans and our programs are already installed.
If we wanted to achieve the same things with KneadData, we could provide similar arguments to those above (often, it's the exact same argument - we're just asking KneadData to tell Trimmomatic/BowTie2). Note that we would still have had to download our reference genomes and build a database from them using bowtie2-build.
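For the curious, a very rough sketch of what a KneadData call might look like. Treat this as an outline only: flag names differ between KneadData versions (older releases take --input twice, newer ones use --input1/--input2), and the output path here is just an example - check kneaddata --help before trying anything.
# hypothetical example only - not run in this workflow; adjust flags to your kneaddata version
kneaddata --input1 $RAW/${TEST}_R1_001.fastq.gz --input2 $RAW/${TEST}_R2_001.fastq.gz \
  --reference-db $BT_DB/mult_ref \
  --output $WRK/5__kneaddata \
  --trimmomatic-options "HEADCROP:20 CROP:130 SLIDINGWINDOW:6:15 MINLEN:80" \
  --threads 6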
Once we have our clean, host-free F+R reads, we assign identity to them as precisely as is reliable. We can try to do this using the nucleotide sequences, to be as exact as we can (Kraken2), and from the available identities of those reads, try to re-assemble the makeup of the microbial community they came from (Bracken). Alternatively, we could translate those nucleotide sequences into all 6 reading frames, and try to reconstruct the microbial community based on broader, more conserved protein-sequence similarities.
However, there are many, many different approaches, and programs, for estimating community composition! Feel free to try alternate methods, but be wary of the small voice which suggests trying everythingā¦ Instead, check the literature for a comparison, have a look at a really nice comparison by Walsh et al. from 2018, or see similar articles from 2019 onwards.
Kraken2
Kraken2 was not designed to estimate community profiles! It was designed to identify reads, based on the sequence patterns (k-mer frequencies) in each read. It uses these patterns to find the 'best' match in the reference database's phylogenetic tree. If Kraken2 finds multiple 'best' matches for a read and cannot decide who the read belongs to, it will check the phylogenetic tree, and instead assign that read to the most recent common ancestor of all those best matches (providing the match to the recent ancestor is a good match). If you have lots of novel taxa in your dataset, or a poor reference database, or sick data, this can lead to a large amount of your reads getting 'stranded' in the branches of your tree, with no good way of telling where they come from.
Note: using Kraken2
and Bracken
together is the recommended route
by the developers of Kraken2 for community profiling (there is
no equivalent step for Kaiju
).
Kraken2 databases
If setting up Kraken2, you'll be directed to build a 'default' database based on a set of NCBI sequences, using the kraken2-build command. This will generate three files: opts.k2d, hash.k2d, and taxo.k2d. Other files/folders can be deleted, but note that Bracken needs the library and taxonomy folders to build its own database.
We often use the default Kraken2 database, built from the NCBI non-redundant refseq database. We could alternatively use the GTDB, which is a brave, fresh attempt at rationalising how taxonomy is applied (a much bigger topic, but a very good database, though its identities don't always correspond to the NCBI ones). Both databases are present on the HPC (see ls /data/databases/{K,k}raken*)
It is possible to build a custom database, based on a different set of sequences, using kraken2-build. We can also use someone else's database, as long as you have access to opts.k2d, hash.k2d, and taxo.k2d. We will try to use the 'maxikraken' database from the Loman Lab:
# same folder as the BT2 database
cd $WRK/ref
mkdir maxikraken2_1903_140GB/ ; cd maxikraken2_1903_140GB/
wget -c https://refdb.s3.climb.ac.uk/maxikraken2_1903_140GB/hash.k2d &
wget https://refdb.s3.climb.ac.uk/maxikraken2_1903_140GB/opts.k2d &
wget https://refdb.s3.climb.ac.uk/maxikraken2_1903_140GB/taxo.k2d &
# for reference
MAX=$WRK/ref/maxikraken2_1903_140GB
# note there is no taxonomy / library dir - can't use this for Bracken
Kraken2 parameters
Parameters have not been explored in this document, but it is worth doing so here. There are a number of handy settings we can use to tune our assignments. Often the defaults are semi-sensible, but we should always be aware of what they assume for our data, as it's easy to forget about later.
In particular, these values (as well as the composition of the database!) affect how and where your assignments are made:
- --confidence (default 0): allows us to be more confident in the assignments. 0-1, with higher values requiring more certainty that a classification is correct. This is a slightly newer feature, and the default is 0.0.
- --minimum-hit-groups (default: 2): how many 'hits' need to agree before an identity is decided
- --minimum-base-quality (default: 0): how good a base needs to be in order to be used (for fastq input only).
There are also some format options, some of which matter for the Bracken step:
- --use-mpa-style: output data like mpa (metaphlan, another shotgun profiler) does (note: not compatible with Bracken - see below)
- --report-zero-counts: include this to report all entries in the database, even if nothing was found.
- --report-minimizer-data: additional info for where and how hits landed on the minimizers which correspond to the different taxa.
- --unclassified-out: note Kraken2 needs a # character in the filename for splitting F and R unclassified reads (in case we're curious)
Enter kraken2 into the terminal to see the full list.
For our purposes, we are most interested in the --report that Kraken2 creates - this is what we'll use to create our community profile with Bracken. However, the other parts (the output, the unclassified reads) are still very informative - have a look using less. Later, we'll simply send these to /dev/null.
First we try our $TEST sample, but feel free to change hit groups, base quality, and confidence, to see how that changes assignments (it's very fast!). Note that we include --report-zero-counts here so we can use the entire taxonomic database to build a taxonomic table later - but feel free to remove this if it is getting in the way of reading the table. We also avoid using --use-mpa-style, as it isn't compatible with Bracken.
# module load kraken2/2.1.1 # in case not already loaded!
# set how many jobs we plan on using
KR_threads=5
# test
time kraken2 --db $K2REFDIR \
$KDOUT/${TEST}_bt2decon_R1.fastq.gz \
$KDOUT/${TEST}_bt2decon_R2.fastq.gz \
--paired \
--confidence 0.15 \
--minimum-hit-groups 3 \
--minimum-base-quality 10 \
--report-zero-counts \
--gzip-compressed \
--threads $KR_threads \
--report $KROUT/${TEST}_test_kraken_report \
--unclassified-out $KROUT/${TEST}_test_kraken_unclass# \
--output $KROUT/${TEST}_test_kraken_output
# changing --cnf from 0.15 to 0.5: 3.6% drop in assignment (7.3-10.7%)
# changing --mhg from 2 to 5: no difference in assignment (10.7%)
# changing --qual: not checked
If that works, scale it for slurm! For BowTie2, we sent all samples to sbatch at the same time, to all be processed at the same time (i.e. in parallel). Kraken2 is very fast, but can also have a very large database (depending on what type/version you are using), which can cause problems with having enough memory. Instead of doing it in parallel, we write a script that will process each sample, one after the other (in serial). This is almost as fast with Kraken2, but avoids huge memory requirements.
We also use a small if statement, to avoid re-running samples unnecessarily, structured as if [ $SAMPLE.output exists ]; then skip re-running ; else run Kraken2 on $SAMPLE ; fi.
More info on how this works can be found here
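As a standalone sketch of that pattern (using the $TEST report as the file to check):
# -s is true if the file exists and is not empty
if [ -s $KROUT/${TEST}_test_kraken_report ];
then echo "report for $TEST already exists - skipping";
else echo "no report for $TEST yet - this is where kraken2 would run";
fi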
echo '#!/bin/bash
#SBATCH --job-name=krak2_loop
#SBATCH --output=krak2_loop.%j.txt
#SBATCH --ntasks=10
#SBATCH --time=180:00
LIST=$1
IN=$2
OUT=$3
DB=$4
THREADS=$5
module load kraken2/2.1.1
# explicit about our parameters
CONF=0.15
MINH=3
MINQ=10
#one big loop / aw yeh aw yehh
for i in $( cat $LIST );
do
if [ -s $OUT/${i}*_kraken_report ]
then
echo "## krak2 output for $i already in $OUT - stopping"
else
echo "## firing krak2 on $i ; conf $CONF: min-hits $MINH; min qual: $MINQ "
time kraken2 --db $DB \
$IN/${i}_bt2decon_R1.fastq.gz \
$IN/${i}_bt2decon_R2.fastq.gz \
--paired \
--confidence $CONF \
--minimum-hit-groups $MINH \
--minimum-base-quality $MINQ \
--threads $THREADS \
--gzip-compressed \
--report $OUT/${i}_kraken_report \
--output /dev/null
fi
done' > $MAT/slurm_krak2_loop.sh
Because the for-loop
is contained in the
slurm
script above, instead of a for-loop
we
pass sbatch
a list of samples (stored in
$MAT/samples
), the dir for the filtered input data
($KDOUT
), the output folder for Kraken2
($KROUT
), the location of the Kraken2 DB we use
($K2REFDIR
), and the number of processors
($KR_threads
).
# pass all to slurm
sbatch $MAT/slurm_krak2_loop.sh $MAT/samples $KDOUT $KROUT $K2REFDIR $KR_threads
# progress is in the slurm txt file
cat krak2_loop.*.txt
# have a look at the outputs once finished
less -S $KROUT/${TEST}_kraken_report
Bracken
Kraken2 tried to assign an identity to each read, as best as possible. If we knew that we had 1M reads in our $TEST sample, we could count how many reads are for (e.g.) Escherichia coli, and decide that the relative abundance of E. coli in the sample is simply n_reads_E_coli / n_total_sample_reads. This might work in some cases, but it would only count reads that were assigned exactly to E. coli, and would miss all the E. coli reads that could only be assigned to (e.g.) the genus Escherichia, or the class Gammaproteobacteria, because there were too many similar matches. At the end, you would have a community table based only on unique sequence matches, ignoring all other data, which probably isn't what you want.
Bracken addresses this problem by using Bayesian re-estimation to estimate, using the distribution of all reads in the reference phylogenetic tree, what the 'real' abundances were for all (e.g.) species in the original sample.
Here, because Bracken is fast and relatively lightweight, slurm would be unnecessarily complicating things. We will make a Bracken database from the Kraken2 database, and then run Bracken on our Kraken2 reports.
Bioinformaticians use three weird tricks to make their microbial dreams come true:
- -t = threshold, def = 10 - the number of reads below which a taxo-branch on the kraken2 tree will be dropped from further assignment of abundances
- -l = Level, def = S - among the options K, P, C, O, F, G, S. We must set the taxonomic level we want to pool our reads to: usually we are interested in taxa at the species level, but this can be changed using the -l parameter when we run Bracken to get a better idea of composition at different taxonomic groupings
- -r = read length - taken as the starting read length; refer to your FastQC output for this!
Bracken database from Kraken2 database
There are several databases on the HPC for Bracken, but if there is not one for our read-length (130bp?), we will have to make one!
## build database first
BR_kmer=35 # this is the default kmer length of the Kraken2 DB on the HPC
BR_leng=130 # length post-trimm
## make a local install of bracken via github for building DB
mkdir ~/bin ; cd ~/bin &&
wget https://github.com/jenniferlu717/Bracken/archive/refs/heads/master.zip ; unzip master.zip && rm master.zip
cd ~/bin/Bracken-master
chmod +x install_bracken.sh ; ./install_bracken.sh
echo '#!/bin/bash
#SBATCH --job-name=brack_db
#SBATCH --ntasks=35
#SBATCH --output=brack_db.%j.txt
#SBATCH --time=240:00
BRDB_DIR=$1
BRDB_KMER=$2
BRDB_LENG=$3
BRDB_THREADS=$4
## not tested
time ~/bin/Bracken-master/bracken-build -d $BRDB_DIR -k $BRDB_KMER -l $BRDB_LENG -t $BRDB_THREADS
' > $MAT/slurm_brak_db.sh # dir kmer length threads
# run on the existing, default kraken database
sbatch $MAT/slurm_brak_db.sh /data/databases/kraken2 $BR_kmer $BR_leng 20
Bracken
We then use that database, and process our Kraken2
outputs. This step is very fast, as we simply run bracken on the
kraken_report
file. Try it first on our test file!
# note name is wrong
module load braken
# crucial parameters! feel free to adapt t and l; r is probably more useful if your read length differs from the DB being used
BR_r=130
BR_l=S
BR_t=10 # counts! not threads
# run bracken
bracken -d $K2REFDIR/ -i $KROUT/${TEST}_kraken_report -o $KROUT/${TEST}.bracken -r $BR_r -l $BR_l -t $BR_t
# note the header for the different columns:
head -1 $KROUT/${TEST}.bracken
less -S $KROUT/${TEST}.bracken
Then try it on all samples. It won't take much more time than doing one, but this time we combine all the samples into one large table (combine_bracken_outputs.py, kreport2mpa.py):
for i in $(cat $MAT/samples);
do bracken -d $K2REFDIR/ -i $KROUT/${i}_kraken_report -o $KROUT/${i}.bracken -r $BR_r -l $BR_l -t $BR_t ;
done
## combine reports to one monolith.tsv
combine_bracken_outputs.py --files $KROUT/*.bracken -o $KROUT/krakenStnd_abundances.tsv
Kraken2+Bracken will only give us the abundances at the level (-l) requested (we asked Bracken for Species, above). If interested in showing (e.g.) the Phylum level, we could redo this Bracken step, or could redo it for all levels, and have multiple versions of the community at different taxonomic ranks. It wouldn't take much, but it sounds complicated.
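If you did want every rank, a small sketch (untested here) - one Bracken run per level on the same kraken_report, assuming $BR_r and $BR_t are set as above:
# loop over taxonomic levels, writing one bracken table per rank for the test sample
for LVL in P C O F G S;
do bracken -d $K2REFDIR/ -i $KROUT/${TEST}_kraken_report -o $KROUT/${TEST}_${LVL}.bracken -r $BR_r -l $LVL -t $BR_t ;
done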
Instead, here:
- we use KrakenTools, a small folder of helper scripts by Bracken's author
- we convert the test report to MPA format (from ye olde metaphlan)
- we replace the "|" characters with tabs, so that you can open it in excel etc.
step | effect |
---|---|
kreport2mpa.py | change Kraken2 output for Metaphlan compatibility |
grep -h '|s_' | find lines with info to Species level, as they contain '|s_'... |
cut -f 1 | chop the line at tabs, and return the first field (the taxonomy string) |
sort ... | sort the taxonomic table by name; could optionally call uniq on it also but not necessary really |
sed 's/|/\t/g' | swap "|" for "\t" |
NB: because we used the
--report-zero-counts
option in Kraken2
for the
test report, this will have all the
species-level taxa in the database, and we can use it with any of our
samples (this taxonomic table will do for all our samples). As such, we
only need to do this step once per project. If you lack this
test report, simply run Kraken2
on any sample, using the
--report-zero-counts
flag.
Note also that some lifeforms, i.e. viruses, don't have the same number or type of taxonomic ranks as e.g. us, which can be problematic for the taxonomic table, as the columns will suddenly mean different things between types of life - this is simplified by sorting the data later, but it can still be an issue. (If --report-zero-counts was omitted, just repeat a single Kraken2 step above and convert the output.)
# have a look at the Kraken2 output - note no one line has all the info on D/P/C/O/F/G/S...
less -S $KROUT/${TEST}_test_kraken_report
# download KrakenTools to your ~
mkdir $HOME/bin ; cd $HOME/bin ; git clone https://github.com/jenniferlu717/KrakenTools.git ; cd $HOME
# KrakenTools: convert to mpa stylee
~/bin/KrakenTools/kreport2mpa.py -r $KROUT/${TEST}_test_kraken_report -o $KROUT/${TEST}_kraken_mpa
# pipe mpa file to reformat
grep -h '|s_' $KROUT/${TEST}_kraken_mpa | cut -f 1 | sort | uniq | sed 's/|/\t/g' > $KROUT/krakenStnd_taxonomy.tsv
# a last look
less -S $KROUT/krakenStnd_taxonomy.tsv
Kraken2+Bracken outputs
Finally! Most of the data management from here will take place in R, but we can check our outputs here & now before moving over:
# sample abundances
less $KROUT/krakenStnd_abundances.tsv
# taxonomic hierarchy:
less $KROUT/krakenStnd_taxonomy.tsv
Kaiju
It is nice when things work, but this is not always the case. Here we use Kaiju, a program similar to Kraken2 in that it uses sequence patterns (kmers) and not just alignments, but Kaiju uses amino acid (AA) patterns rather than nucleotides. Our steps, and many of the concepts, are similar to Kraken2, although note that there is no Bracken-like step with Kaiju; we will, however, employ a similar reads threshold of -c 10, so that only assignments detected 10 or more times are kept. Note also that (depending on database and sample) Kaiju (1060 min) takes a lot longer to run than Kraken2 (28min), as it needs to translate all sequences into 6 different reading frames.
parameter | function |
---|---|
-e | allowed mismatches ( def = 3 ) |
-m | match length ( def = 11 ) |
-s | match score ( def = 65 ) |
-z | number of threads ( def = 1 ) |
First, we try a test example:
## setup kaiju ==========================================
conda create -n kaijamie
conda install -c bioconda kaiju
conda activate kaijamie
KaDB=/data/databases/kaiju/nr # nr-euk very big..
Ka_THREADS=10
KaOUT=$WRK/4__kaiju
mkdir $KaOUT
# ulimit -s unlimited # nasa mem trick
## choose!
#TEST=Q4T8
#TEST=_N5_Nuria_S296_L001
## test kaiju ==========================================
time kaiju \
-v \
-z 5 \
-e 3 \
-m 11 \
-s 65 \
-t $KaDB/nodes.dmp \
-f $KaDB/kaiju_db_nr.fmi \
-i $KDOUT/${TEST}_bt2decon_R1.fastq.gz \
-j $KDOUT/${TEST}_bt2decon_R2.fastq.gz \
-o $KaOUT/${TEST}_kaiju
less -S $KaOUT/${TEST}_kaiju
# check rates of assignment
KAI_C=$(grep -cE '^C' $KaOUT/${TEST}_kaiju )
KAI_U=$(grep -cE '^U' $KaOUT/${TEST}_kaiju )
echo "scale=3; $KAI_C / $KAI_U" | bc
And then, we create a slurm script to process the samples in serial, to avoid crashing the airplane due to our memory demands. Note we also use if-statements to check whether or not the sample has been run before - this can be helpful to avoid needlessly duplicating work. We also count up our assigned reads vs. total reads.
## slurm-loop kaiju =======================================
echo '#!/bin/bash
#SBATCH --job-name=kaij_loop
#SBATCH --output=kaij_loop.%j.txt
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --time=150:00
LIST=$1
IN=$2
OUT=$3
DB=$4
THREADS=$5
## one big slurm to for-loop over kaiju, thereby avoiding mem issues
# operate over some list of sample names
for i in $( cat $LIST );
do
if [ -s $OUT/${i}*_kaiju ]
then
echo "## kaiju output for $i already in $OUT - stopping"
else
echo "## firing kaiju on $i"
time kaiju \
-v \
-e 3 \
-m 11 \
-s 65 \
-z $THREADS \
-t $DB/nodes.dmp \
-f $DB/kaiju_db_nr.fmi \
-i $IN/${i}_bt2decon_R1.fastq.gz \
-j $IN/${i}_bt2decon_R2.fastq.gz \
-o $OUT/${i}_kaiju
# count things
if [ -s $OUT/${i}*_kaiju ]
then
KAI_C=$(grep -cE '^C' $OUT/${i}_kaiju )
KAI_U=$(grep -cE '^U' $OUT/${i}_kaiju )
KAI_TOT=$(echo "$KAI_C + $KAI_U" | bc)
KAI_PC=$(echo "scale=2; ($KAI_C / $KAI_TOT)*100" | bc)
echo "## kaiju sample processed: ${i} : total classified: ${KAI_PC}% (total: $KAI_TOT read-pairs) ------"
else
echo "## no output - kaiju for sample ${i} failed"
fi
fi
done
' > $MAT/slurm_kaij_loop.sh # **LIST** IN OUT DB THREADS
## fire all in serial
sbatch $MAT/slurm_kaij_loop.sh $MAT/samples $KDOUT $KaOUT $KaDB $Ka_THREADS ;
Finally for Kaiju, we export all our samples into one combined table using kaiju2table, summarising to -r species level, including only those taxa seen (counted) more than -c 10 times, and including the -l domain,phylum,class,order,family,genus,species taxon levels. We will then process this table in R, as with (but differently to...) Kraken2.
## we can redo the counting of taxonomic assignments outside of the loop too:
for i in $( cat $MAT/samples );
do
KAI_C=$(grep -cE '^C' $KaOUT/${i}_kaiju )
KAI_U=$(grep -cE '^U' $KaOUT/${i}_kaiju )
KAI_TOT=$(echo "$KAI_C + $KAI_U" | bc)
KAI_PC=$(echo "scale=2; ($KAI_C / $KAI_TOT)*100" | bc)
echo "## kaiju sample processed: ${i} : total classified: ${KAI_PC}% (total: $KAI_TOT read pairs) ------"
done
## all taxonomy together, *AT THE SPECIES LEVEL* - some issue solved by adding -l arg in kaiju2Table
kaiju2table -t $KaDB/nodes.dmp -n $KaDB/names.dmp -r species -c 10 -l domain,phylum,class,order,family,genus,species -o $KaOUT/kaiju_summary_species.tsv $KaOUT/*_kaiju
# step off to R!
Kaiju outputs
Finally, again! Just as with Kraken2+Bracken, it's a good idea to look at the data, and consider what information it provides.
# sample abundances
less $KaOUT/kaiju_summary_species.tsv
It's worth checking what we have done in this page before moving on - here we check the outputs, and the size of the data associated:
## obtained the raw sequences
ls -lsh $RAW
du -sh $RAW
## QC - checked quality
ls -lsh $FQC
## QC - removed artificial sequences and low quality sequences
ls -lsh $FILT
du -sh $FILT
## built a bowtie2 database of possible contaminants for our data
ls -lsh $BT_DB
du -sh $BT_DB
## removed contaminants from our data
ls -lsh $KDOUT
du -sh $KDOUT
## generated microbial profiles for our data using nucleotide seq info (Kraken2/Bracken)
ls -lsh $KROUT
# sample abundances
less $KROUT/krakenStnd_abundances.tsv
# taxonomic hierarchy:
less $KROUT/krakenStnd_taxonomy.tsv
## generated microbial profiles for our data using translated seq info (Kaiju)
ls -lsh $KaOUT
# sample abundances
less $KaOUT/kaiju_summary_species.tsv
## the scripts and parameters etc:
ls -lsh $MAT
R
There are many things that we can do to change the above, but once happy, it's time to move our outputs over to R!
AstroBioMike - Bioinformatics for beginners - extremely good and thorough guide.
The Kraken Suite - Sept 2022 Nature paper from the developers of Kraken, KrakenUniq, & Bracken etc.
microbiome code browser - looks v good also - Dan Knights & co., 2016