v0.4, Jun 2026 - jfgThis is a quick(er), simplified(ier) guide to processing large shotgun short-read sequencing datasets. Please forward bugs/suggestions/typos here.
`0.` setup and configure - installing programmes and databases
`1.` download (and check)
`2.` trimm (and check)
`3.` decontaminate (and check)
`4.` identify (and check)
NB: the guide is not fully tested, not completed for using
SLURM, and needs more explanation for step#4.
First, start a screen session. As well as other fancy
stuff, screen keeps your stuff running, even if you get
disconnected / crash your computer etc. You can keep coming back to this
window and continue your work.
A type of nick-name called a local variable will help us
keep things tidy. First, “assign” a short name like
basepath to the folder location (a.k.a. path)
where you want to do your work - now, we can use that short name to
refer to the long sequence of folder names anywhere in our script, by
adding $ to the variable: $basepath.
If our folder path ever changes, we only need to update the one place where we define the variable - the nickname stays the same. Much easier to read, copy, remember, and type.
# should match your ID on HPC/server/garrison
self=cleverusername
# pick a fancyname
proj=fancy__games_dot_com
# use $self from above to define your working directory
basepath=/home/$self
# check them - if the $var doesnt exist they'll print empty
echo $self $proj $basepath $thisdoesntexist
# beware of undefined (empty) vars, and vars assigned to typos/nonsense/outdated data
# computer doesn't know the difference. yet.First, save each folder name as a local variable, then
pass those local variables to mkdir to
make those folders real.
# using basepath from above
raw=$basepath/raws/${proj}__raw
join=$basepath/raws/${proj}__join
wrk=$basepath/work/$proj
# for storing metadata etc.
mat=$wrk/Materials
# assumed location, update as necess
db=$basepath/db
qc=$wrk/1__qc
filt=$wrk/2__filt
host=$wrk/3__deacon
krak=$wrk/4__krak2
## make folders from vars
mkdir -p $raw $join $wrk $qc $filt $host $krak $mat
mkdir -p $qc/${proj}_join $qc/${proj}_filt $qc/${proj}_host $qc/${proj}_join_multi $qc/${proj}_filt_multi $qc/${proj}_host_multimamba (i.e. a quicker version of conda) is
your new best friend.
# folder for your programmes - adapt as best suits you
b_path=~/bin
mkdir -p $b_path ; cd $b_path
# first, get mamba (miniforge) and run through installation
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh
# install tools needed into fresh environments using mamba. or dont.
mamba create -n mgx -c bioconda -c conda-forge trimmomatic fastqc multiqc deacon -y # 346MB
mamba create -n k2 -c bioconda bracken kraken2 krakentools parallel -y # >300MB
# while here, get KrakenTools for later
git clone https://github.com/jenniferlu717/KrakenTools.gitWe can also write our own (simple!) programmes - this one compares samples in a folder against a sample list, and reports back any missing samples:
catcher (){
local DirCheck=$1
# -23 : suppress files present in DirCheck / common to both - show just missing samples from samplelist
comm -23 \
<( sort $mat/${proj}__samples.txt ) \
<( ls $DirCheck | sed -E 's/_(S[0-9]*).*/_\1/g' | sort -u)
}
# e.g. check samples downloaded (should return nothing!)
catcher $join
# e.g. check samples filtered (might return something...)
catcher $filtMost programmes have resources they need to get - time to get the ones you need (only need to get them once).
# filtering & trimming - trimmomatic standard reference sequences to be removed
echo '>Ampli_Tru_Seq_adapter__MOST_IMPORTANT_FEEEL_ME
CTGTCTCTTATACACATCT
>Transposase_Adap__for_tagmentation_1
TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
>Transposase_Adap__for_tagmentation_2
GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG
>PCR_primer_index_1
CAAGCAGAAGACGGCATACGAGATNNNNNNNGTCTCGTGGGCTCGG
>PCR_primer_index_2
AATGATACGGCGACCACCGAGATCTACACNNNNNTCGTCGGCAGCGTC
>TruSeq_single_index_LT_CD_HT__1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>TruSeq_single_index_LT_CD_HT__2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
>PCR-Free_Prep__Tagm__additional_seq
ATGTGTATAAGAGACA
>Ampli_Tru_Seq_adapter__MOST_IMPORTANT_FEEEL_ME__mate_RC
AGATGTGTATAAGAGACAG
>Transposase_Adap__for_tagmentation_1.RC:
CTGTCTCTTATACACATCTGACGCTGCCGACGA
>Transposase_Adap__for_tagmentation_2.RC:
CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
>PCR_primer_index_1_RC:
CCGAGCCCACGAGACNNNNNNNATCTCGTATGCCGTCTTCTGCTTG
>PCR_primer_index_2_RC:
GACGCTGCCGACGANNNNNGTGTAGATCTCGGTGGTCGCCGTATCATT
>TruSeq_single_index_LT_CD_HT__1_RC:
TGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
>TruSeq_single_index_LT_CD_HT__2_RC:
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PCR-Free_Prep__Tagm__additional_seq_RC:
TGTCTCTTATACACAT
>polG_just_from_PCF_concerns
GGGGGGGGGGGGGGGGGGGGGGGG
>polA_just_from_PCF_concerns
AAAAAAAAAAAAAAAAAAAAAAAA'> $mat/fqc_trimmo_ill_ref.fa
## more commonly, oprogrammes need databases: recall we defined a $db var for this.
lk $db
# decontamination - Download validated 3GB human pangenome index (version 0.13.0 or later)
deacon index fetch panhuman-1 -o $db/deacon_dbKraken2 uses a database of {k-mer} patterns in different
microbes, and compares these patterns against those seen in your reads.
Bracken (a post-hoc for Kraken2) uses a
portion of this same database to estimate community composition.
Building these databases is ideal, but can be a challenge - Ben Langmead’s
research group host pre-made versions of many databases, assuming
that you want to use their pre-decided read-lengths. Here, we go for the
125bp versions with the PlusPF database - but be aware that
it’s +100GB of data, so it will take a while to download.
Ideally, there would only be a single copy on the server - but for now
we will download our own.
## assignment - kraken2 / Bracken
cd $db
wget https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_20260226.tar.gz
## -Xtract Ze Vucking Files (extract, gzipped, verbose, file-destination: )
tar -xzvf $db/kraken2_plusPF__langmead $db/k2_pluspf_20260226.tar.gz
## look
ls -lsh $db/kraken2_plusPF__langmeadNon-mamba: install basespace from illumina,
for getting sequence data: firstly, get an Illumina account here.
## record your own username and password as best you see fit:
> u: _____________
> p: _____________
Then get and activate the bs:
bs=$b_path/bs
## if you don't have it, download the basepspace programme (aka bs) to $bs :
wget "https://launch.basespace.illumina.com/CLI/latest/amd64-linux/bs" -O $bs ; chmod 775 $bs
# follow instructions to complete setup:
$bs auth
# more info if you like: (press Q to exit)
$bs download -hHoo. Nearly there.
NB: In reality, if we’re running these programmes on
UCC’s HPC cluster, we will have to change the format for these commands
to work with a HPC management programme called SLURM. We’ll
cover this later, but for the moment any SLURM work will go
into a drop-down box like the one below - ignore these for
now.
Last thing - because we used mamba to install our
programmes above into the mgx environment, we need to
“activate” the mgx environment before use or the computer
will think the programmes are not installed.
Time to go.
bs +
fastqcBasespace (aka bs) should have been installed &
authenticated above. Now, follow the basespace link sent to you by your
sequencer - it will add the data to your illumina account (which should
also have been set up above).
https://basespace.illumina.com/s/abcdefghijkl # not a real link!
When you click the above it will bring you to a webpage - the project ID for download is in that page’s URL address e.g. :
https://basespace.illumina.com/projects/123456789/about
, where 123456789 is the project ID.
Mad I know.
# save your real project ID here
projectID=123456789
## download FASTQ to the folder we chose/created above - will take a while to finish (screen will protect us)
$bs download project --id $projectID --output $raw --log=$f_path/fdl.log &
## have a lük: red = compressed
ls $raw/*/*gz
# use that to pull a sample list for constant re-use. always worth doublechecking.
ls $raw/*/*gz | sed -r 's/.*\/(.*)_L00._R._001.fastq.gz/\1/g' | sort -u > $mat/${proj}__samples.txt
head -n 10 $mat/${proj}__samples.txt
## tidy up - join lanes (L001, L002) and remove _L00*_R*_001 from name.
time parallel -j 16 "cat $raw/*/{}_L00*_R1_001.fastq.gz > ${join}/{}_R1.fastq.gz ; cat $raw/*/{}_L00*_R2_001.fastq.gz > ${join}/{}_R2.fastq.gz" :::: $mat/${proj}__samples.txtFastQC
checks each sample and generates a report you can read ; MultiQC combines
multiple reports together.
Once FastQC has finished running, collate those
reports:
SLURM version + fiddling
SLURM guidance can be sought here and
across the webs.
Make a new script file:
-and add these lines:
#!/bin/bash
#SBATCH --job-name=fqc20
#SBATCH --output=omm__1__fqc20.%j.slrm
#SBATCH --ntasks=20
#SBATCH --time=0-1:00:00
INPATH=$1
OUTPAT=$2
FQC_PROCS=$3
# note the time above is a guess, lets see
source /home/EXT05188/bin/miniforge3/etc/profile.d/conda.sh
conda activate geno
if [ -z $FQC_PROCS ] | [ $FQC_PROCS gt 20 ] ; then FQC_PROCS=20 ; echo " + + + hogging 20 cores only" ; fi
if [[ -d $INPATH ]] && [[ -d $OUTPAT ]] ;
then
echo " + + + fastqc on ${INPATH##*\/} --- - - -- - - "
fastqc -t $FQC_PROCS ${INPATH}/*fastq.gz -o $OUTPAT # just about works!!
else
echo " < ! > something awry with FQC's fastq.gz path - check paths given : ${INTXT} ; ${OUTPAT} "
fi
# separate R1 and R2, as their profiles can differ..
mkdir -p $OUTPAT/fwd $OUTPAT/rev ${OUTPAT}_multi
mv $OUTPAT/*R1* $OUTPAT/fwd/
mv $OUTPAT/*R2* $OUTPAT/rev/
echo " + + + running multiqc -- -- --- -- - - --- "
multiqc $OUTPAT/fwd -o ${OUTPAT}_multi/multi_fwd
multiqc $OUTPAT/rev -o ${OUTPAT}_multi/multi_revNow to see how that looks. Copy the multiQC.html output
to your computer and have a look:
# XXX.YYY.34.1 is the address for the HPC (replace XXX.YYY with our real prefix) - we use -o ProxyJump to send the request through garrison.ucc.ie
# run this command on the server (to get the correct paths etc.), then copy and paste on your local (UNIX) machine
echo "scp -o ProxyJump=${self}@garrison.ucc.ie ${self}@XXX.YYY.34.1:$qc/${proj}_join_multi/multiqc_report.html ./Desktop/"
# in your browser, open the `multiqc_report.html` that has appeared on your desktop Lots of info. Focus on Per Base Sequence Content,
clicking through a few different samples.
per base sequence quality
Considering the faec & vag samples come
from very different substrates, these look v comparable - probably
because we haven’t remove the host DNA yet. The smooth lines indicate
regions where the overall sampling of communtiy DNA is so balanced that
the lines separate reliably along the difference in DNA GC
content (a characteristic of each sample - usually 40-60%). this is
good! Note how things start and end badly though - we always
trim the beginning and end.
faecal sample
vaginal sample
These are more jagged, more erratic - this indicates different bases
are spiking up and down because there are so few reads, you’re not
getting a balanced sample across all positions. This is particulalrly
evident at the 5' (start) of the read: each read starts
with a synthetic adapter, all of which have the same sequence. As a
result, the bases at those locations are clearly A, T, G,
or C.
Note that a negative control _should_look like this - the less DNA in the negative control the better!
Vaginal-bad
However, this is a vaginal sample that did not perform well (likely low yield). Something to be aware of.
control sample
Finally, this is what the trimmomatic step is combating
- adapter content (synthetic DNA sequences used to tie the sample to the
machine chip). These will interfere at every stage of the analysis if
not removed, so always check it. Note also there are two different
trends here - one where the adapter content starts at 30(!)bp, and one
where it starts at 100bp. Can you guess why?
adapter removal
After clicking through a few samples, you’ll get a feel for where the
‘kinks’/changes/wobbles in the read-quality are (at beginning and end).
Here, I think 20bp (on the 5' / start) and
105bp (on the 3' / end). This means we’re throwing away
more than a third of our data, before we even start. Brave
face.
Additionally, although trimming requirements are equivalent at the moment, it might make sense to process the faecal and vaginal samples separately. We can come back to this as necessary.
Looking at all samples together is informative, but it takes a long time to test & run.
Instead of testing this on all +300 samples, lets choose a practice
subset (faster, easier, lighter workload). As a representative sample,
taking all samples from volunteer 23 (vaginal and faecal, 2
timepoints each) - this means 4 sample IDs (so
4 X (R1 + R2) = 8 fastq files in total).
# grep looks for the pattern "23-V" in $mat/${proj}__samples.txt and returns those lines, we send it to a new file with ">"
grep "23-V" $mat/${proj}__samples.txt > $mat/${proj}__tester.txt
# have a little lük, its just the sample IDs
head $mat/${proj}__tester.txt
# and a test sample (just take the first one)
test=$( head -n 1 $mat/${proj}__tester.txt)trimmomaticRun trimmomatic,
then QC the outputs:
##--< ! > MAKE SURE you check F+MQC to set your parameters below < ! >--
# use ILLUMINACLIP before CROP to maximise space for pattern recognition
cat $mat/${proj}__tester.txt | parallel -j 4 "trimmomatic PE \
$join/{}_R1.fastq.gz \
$join/{}_R2.fastq.gz \
$filt/{}_R1_trimm.fastq.gz \
$filt/{}_R1_trimm_unpaired.fastq.gz \
$filt/{}_R2_trimm.fastq.gz \
$filt/{}_R2_trimm_unpaired.fastq.gz \
ILLUMINACLIP:$mat/fqc_trimmo_ill_ref.fa:2:30:10:5 \
SLIDINGWINDOW:6:15 \
CROP:105 \
HEADCROP:19 \
MINLEN:50 \
-threads 4 > $filt/{}_trim.log 2>&1"
SLURM version of trimmomatic
Write the SLURM script:
In it, place:
#!/bin/bash
#SBATCH --job-name=trimmo4
#SBATCH --output=omm__2__trimmo4.%j.slrm
#SBATCH --time=0-00:30:00
#SBATCH --ntasks=4
sl_samp=$1
sl_indir=$2
sl_outdir=$3
source /home/EXT05188/bin/miniforge3/etc/profile.d/conda.sh
conda activate geno
# sl_samp=$test # choose a sample ID for testing on
# sl_indir=$raw
# sl_outdir=$filt
# echo $sl_samp $sl_indir $sl_outdir ${sl_outdir%\/*}
##--< ! > MAKE SURE you check F+MQC to set your parameters below < ! >--
trimmomatic PE \
$sl_indir/${sl_samp}_R1.fastq.gz \
$sl_indir/${sl_samp}_R2.fastq.gz \
$sl_outdir/${sl_samp}_R1_trimm.fastq.gz \
$sl_outdir/${sl_samp}_R1_trimm_unpaired.fastq.gz \
$sl_outdir/${sl_samp}_R2_trimm.fastq.gz \
$sl_outdir/${sl_samp}_R2_trimm_unpaired.fastq.gz \
ILLUMINACLIP:${sl_outdir%\/*}/Materials/fqc_trimmo_ill_ref.fa:2:30:10:5 \
SLIDINGWINDOW:6:15 \
CROP:105 \
HEADCROP:20 \
MINLEN:50 \
-threads 4
# presume logged to trimmo4.%j.slrmRun the SLURM script:
# CHECK THE CROP VALUES (105, 20, 50), then insert the script above:
nano $mat/${proj}__slurm__trimmo4.sh
## separate processing? note you abandon the neg ctrls here - do you really want to do this?
# grep "faecal" $mat/${proj}__samples.txt > $mat/${proj}__samples-faecal.txt
# grep "vaginal" $mat/${proj}__samples.txt > $mat/${proj}__samples-vaginal.txt
# for f in $( cat $mat/${proj}__tester.txt ) ;
for f in $( cat $mat/${proj}__samples.txt ) ;
do
sbatch $mat/${proj}__slurm__trimmo4.sh $f $join $filt ;
doneNote, if something goes wrong or we need to re-fire some of the
samples, we can use SLURM with the catcher
function we wrote above (see the “programmes etc.” drop-down
box) to call out missing samples, and re-submit them to run:
A crucial step is to check the output from
trimmomatic, looking at the forward and reverse outputs
separately to see how trimming has affected them. This is especially
true when you have more than one sample type (e.g. faecal &
vaginal), and also when you have challenging sample types (vaginal).
Luckily, we can use the same FQC/MQC code that we wrote
above, and simply change the files we want to check:
# remove unpaired stuff
mkdir $filt/unpaired ; mv $filt/*unpaired* $filt/unpaired/
#
fastqc -t 16 $filt/*_trimm.fastq.gz -o $qc/${proj}_filt ; multiqc $qc/${proj}_filt -o $qc/${proj}_filt_multiSLURM FQC-MQC of trimmomatic
outputs.
FQC-MQC output - did
Trimmomatic help?
Prob worth checking how many reads per sample you’re getting back
(can also get this from the Trimmomatic output, but easier
/ more comparable here). Compare it with the raw counts from above!
trimmed reads output
Do the per-base sequence composition plots look stable? Overall plot should be homogeneous-grey…
trimmed samples
But make sure to click into a few sequences:
a trimmed faceal sample
a trimmed vaginal sample
And check for the absence of adapters:
trimmed and filtered adapters
Possibly run catcher over it also:
catcher $qc/${proj}__filt
# bunch of failures due to truncated outputs from trimmomatic - rerun Trimmomatic on these samples only
for rr in $( catcher $qc/${proj}_filt );
do
sbatch $mat/${proj}__slurm__trimmo4.sh $rr $join $filt ;
done
# fqc just those samples?...
catcher $qc/${proj}_filt | parallel -j 6 \
fastqc -t 2 $filt/{}_R*_trimm.fastq.gz -o $qc/${proj}_filt
done
multiqc $qc/${proj}_filt -o $qc/${proj}_filt_multiCheck it again! Keep solving until the problem is resolved.
DeaconDeacon is a
powerful, fast decontamination programme that focuses on using
“minimisers” (rare/unusual DNA patterns related to k-mers) to
detect contaminant (host and bacterial) DNA sequences. Future work
should include developing a better suited database to catch possible
contaminants. Consider also doubling up with an alignment-based method
also (e.g. bowtie2), depending on output - even if only for
evaluation’s sake.
# downloaded the correct (human) index when we installed...
## tester set:
# time cat $mat/${proj}__tester.txt | parallel -j 8 "deacon filter -d \
# ~15mins
time cat $mat/${proj}__samples.txt | parallel -j 8 "deacon filter -d \
$db/deacon_db/panhuman-1.k31w15.idx \
$filt/{}_R1_trimm.fastq.gz \
$filt/{}_R2_trimm.fastq.gz \
-o $host/{}_R1_deco.fastq.gz \
-O $host/{}_R2_deco.fastq.gz \
--threads 4 > $host/{}_hostless.log 2>&1"SLURM version of deacon
Write the SLURM script:
nano $mat/${proj}__slurm__deacon6.sh
#!/bin/bash
#SBATCH --job-name=deacon6
#SBATCH --partition=cpuq-long
#SBATCH --output=omm__3__deacon6.%j.slrm
#SBATCH --ntasks=6
sl_samp=$1
sl_indir=$2
sl_outdir=$3
source /home/EXT05188/bin/miniforge3/etc/profile.d/conda.sh
conda activate geno
deacon filter -d \
$db/deacon_db/panhuman-1.k31w15.idx \
$sl_indir/${sl_samp}_R1_trimm.fastq.gz \
$sl_indir/${sl_samp}_R2_trimm.fastq.gz \
-o $sl_outdir/${sl_samp}_R1_deco.fastq.gz \
-O $sl_outdir/${sl_samp}_R2_deco.fastq.gz \
--threads 6 > $sl_outdir/${sl_samp}_hostless.log 2>&1Run the SLURM script:
Again, it’s crucial that we check what we’ve done. A quick test is likely to show that we keep (nearly) all of the gut stuff, and loose (nearly) all of the vaginal stuff:
# very quick check of outputs
grep -h "Retained" $host/*log
## oof! keep ~95% gut, keep 3-5% vaginal
# need to see how that looks for the sequences - FQC/ MQC
sbatch $mat/${proj}__slurm__fqc20.sh $host $qc/host 20slurm and FQC-MQC aren’t
communicating…
Note we split out the L / R samples for consideration separately:
And bring it back home for a look:
# enter this on your session with the terminal, so that the variables fill it out for you
echo "scp -o ProxyJump=${self}@garrison.ucc.ie ${self}@XXX.YYY.34.1:$qc/${proj}_host_multi/multiqc_report.html ./Desktop/"Generally less than 1M reads per vaginal sample. Sometimes, removing huge amounts of host sequence data can reveal big problems with the underlying microbial data (the part you might care about). In this case, the total adapter content increased a little (1%?), but overall things look decent.
whats left of the adapters, in whats left of the reads
Kraken2 + BrackenPrevious steps have been very clear in their objective. For
#4, need to have a think about what is required or sensible
for your own project - more interested in specifics (genomes) or in
communities (metagenomes)? Genes more than taxa (pathway analysis)?
Below, we presume metagenomics is the main interest,
and that we want something powerful and quick (and good). Like
Deacon, Kraken2
looks for unique sub-sequences (i.e k-mers) of DNA in our
reads, and checks what species are known to have that pattern in their
genomes.
However, Kraken2 is only designed to find a
“best match” for each sequence. Applying statistics to this data
directly is prone to errors, so Bracken estimates what the
total community probably looked like, based on the output that
Kraken2 gives (see here
for more detail). The output will be an abundance feature table of
taxa x samples.
As Kraken2 databases can be very large (>100GB), putting
all of this data into RAM can be a huge overhead - especially if doing
so multiple times. There are options in Kraken2 (use
--memory-mapping) and in linux (see copying
the DB directly to /temp/shm) to work around this. There
are also smaller databases that can be used - see here
and here.
Note: you’ll need to have made (or downloaded)
matching databases for Kraken2 and Bracken, as
well as downloading KrakenTools from github - see the
programmes drop-down box earlier in this document for more
details.
# activate the conda env wiht kraken2 and bracken installed
mamba activate k2
## check variables:
kr_db=$db/kraken2_plusPF_langmead
kr_threads=5
br_r=100 # $br_leng = match the post-host length
br_l=S
br_t=50 # counts of microbes! not threads
## apply for use with --memory-mapping, to avoid OOM
vmtouch -t $kr_db/*.k2d
## Kraken2 -------------------------------------------------------------
## could prb mem-map all samples in on ego, then parallel bracken
## ~~skip the loop altogether~~ can define outputs via glob, mixes samples.
## avoiding these outputs, as we rarely look at them.
# --unclassified-out $krak/${test}_kraken_unclass# \
# --output $krak/${test}_kraken_output
time for test in $(cat $mat/${proj}__samples.txt );
do
echo " + + + Kraken2 on ${test} - conf 0.1, min-hits 5, min-qual 20 (mem-mapp)"
kraken2 --db $kr_db \
$host/${test}_R1_deco.fastq.gz \
$host/${test}_R2_deco.fastq.gz \
--paired \
--threads $kr_threads \
--confidence 0.1 \
--memory-mapping \
--gzip-compressed \
--report-zero-counts \
--minimum-hit-groups 5 \
--minimum-base-quality 20 \
--report $krak/${test}_kraken2_report \
done > $krak/k2__all_memmap.log 2>&1
## Bracken -------------------------------------------------------------
# bracken -d $kr_db/ -i $krak/${test}_kraken2_report -o $krak/${test}.bracken -r $br_r -l $br_l -t $br_t &&
parallel -j $kr_threads bracken -d $kr_db/ -i {} -o {}.bracken -r $br_r -l $br_l -t $br_t ::: $krak/*kraken2_report > $krak/k2_br.log 2>&1
## tidy - scraunch if necessary
parallel -j $kr_threads gzip {} ::: $krak/${test}* SLURM for Kraken2/Bracken - to do (but should
be v similar to the above)
Given that Kraken2 is oh-so-fast (3-10 minutes pre
sample) these jobs could instead be sent to the
cpuq-short partition, a section of the HPC that has much
more RAM but limits jobs to 4 hours:
#SBATCH partition=cpuq-short. Only possible if that
partition is available : check with sinfo.
script contains:
#!/bin/bash
#SBATCH --job-name=kraken5
#SBATCH --output=omm__4__krak2-5.%j.slrm
#SBATCH --ntasks=5
#SBATCH --time=0-0:10:00
## #SBATCH --partition=cpuq-short
samp=$1
inpath=$2
outpat=$3
kr_db=$4
# define here in case we tweak later
kr_threads=5
br_r=100 # $br_leng = match the post-host length
br_l=S
br_t=50 # counts of microbes! not threads
source /home/EXT05188/bin/miniforge3/etc/profile.d/conda.sh
conda activate k2
echo " + + + Kraken2 on ${samp} - conf 0.1, min-hits 5, min-qual 20; Bracken read:${br_r}, count:${br_t}"
kraken2 --db $kr_db \
$inpath/${samp}_R1_deco.fastq.gz \
$inpath/${samp}_R2_deco.fastq.gz \
--paired \
--threads $kr_threads \
--confidence 0.1 \
--gzip-compressed \
--report-zero-counts \
--minimum-hit-groups 5 \
--minimum-base-quality 20 \
--report $outpat/${samp}_kraken2_report \
--unclassified-out $outpat/${samp}_kraken_unclass# \
--output $outpat/${samp}_kraken_output
kr_stat=$?
if [ ! $kr_stat -eq 0 ];
then
echo " < ! > sample ${samp} failed Kraken2 :: $kr_stat"
else
bracken -d $kr_db/ -i $outpat/${samp}_kraken2_report -o $outpat/${samp}.bracken -r $br_r -l $br_l -t $br_t &&
echo " + + + sample ${samp} completed Kraken2/Bracken"
fi
# need paralell in k2 env
parallel -j $kr_threads gzip {} ::: $outpat/${samp}*Send to SLURM:
Turn one-output-per-sample into a single file with all of the outputs (like an excel table: Rows*Columns). We also take this chance to create a taxonomy table for all the entries, which will make exploring the data much easier.
Note these scripts can be found in the KrakenTools
repo from the excellent Jennifer Lu (author of
Bracken and co-author of Kraken2) - see the
installation section above if you havent got them.
## in case you've forgotten, your bin and KrakenTools folder are here:
lk $b_path
lk $b_path/KrakenTools
## combine
$b_path/KrakenTools/combine_kreports.py -r $krak/*_bracken_species -o $krak/${proj}__kraken2_plusPF_abundances.tsv
## for the taxonomy file, we first convert to MPA format (its just useful)
$b_path/KrakenTools/kreport2mpa.py -r $krak/${proj}__kraken2_plusPF_abundances.tsv -o $krak/${proj}__${test}_kraken2_mpa
grep -h '|s_' $krak/${proj}__${test}_kraken2_mpa | cut -f 1 | sort | uniq | sed 's/|/\t/g' > $krak/${proj}__kraken2_plusPF_taxonomy.tsv
## have a lük
less -S $krak/${proj}__kraken2_plusPF_taxonomy.tsvA speedrun through a specific type of short-read shotgun processing is done (this was metagenomics). Analysis and exploration are the next steps, and these typically take place on a local computer. Lets save out our data into a single small(er) file:
mkdir $mat/${proj}_output
cp $krak/${proj}__kraken2_plusPF_abundances.tsv ${KRAK}*/${proj}__kraken2_plusPF_taxonomy.tsv $mat/${proj}_output/
# compress all that to a single gzipped archive file
tar -czvf ${mat}/${proj}_output.tar.gz ${mat}/${proj}_outputGetting outputs back to the desktop can be daunting, but there are a
few ways of doing it (e.g. FileZilla,
MobaXterm) - scp is a built in go-to.
In your terminal, run the following: this will (should) print out the
full command with the correct names (again, using the
local variables above):
# dont forget to fix XXX.YYY to the correct address
echo "scp -oProxyJump=${self}@garrison.ucc.ie ${self}@XXX.YYY.34.1:${mat}/${proj}_output.tar.gz ~/Desktop/"Copy this to plaintext if it helps read it:
<1> ---< 2 >------- ------------< 3 >------------------ ---------------< 4 >---------------------- ---< 5 >---
scp -i .ssh/id_file -oProxyJump=${self}@garrison.ucc.ie ${self}@XXX.YYY.34.1:${mat}/${proj}.tar.gz ~/Desktop/
scp, or “secure copy” - a mix of ssh and
cpBecause 4. has the user+IP+path, and 5. just has where the file is going, we can tell that this command is supposed to be run on your local computer (where the file will be going). Copy and paste the command above into a new terminal window on your local computer (desktop, laptop, etc.):
# run:
echo "scp -i .ssh/id_file -oProxyJump=${self}@garrison.ucc.ie ${self}@XXX.YYY.34.1:${mat}/${proj}_output.tar.gz ~/Desktop/"
## now, copy, paste, and run the command printed out above!
## follow the instructions it gives you, or record the message if something goes wrong.
## NB: you might need to replace $proj with the fancyname you picked for your project at the beginning
## as you might not have assigned that local variable on your local machine
# should be here (or just check your desktop)
~/Desktop/${proj}.tar.gz
# uncompress it again - first go to whereever the file is located :
cd ~/Desktop
tar -xzvf ~/Desktop/${proj}.tar.gz
# if its there, you could probably also double click it...Hopefully this has allll magically worked, and you’re ready to push
this data into R/excel/GraphPad
etc. Easy peasy.