jfg, Jun
2026code onlyThis is a quick(er), simplified(er) guide to processing large shotgun short-read sequencing datasets.
`0.` setup and configure - installing programmes and databases
`1.` download (and check)
`2.` trimm (and check)
`3.` decontaminate (and check)
`4.` identify (and check)
NB: the guide is not fully tested, not completed for using
SLURM, and needs more explanation for step#4.
A type of shortcut called a local variable helps us keep
things tidy. First, assign your directory path (folder) to
a short name, like basepath - now, we can use that shortcut
to refer to the long name anywhere in our script, by adding
$ to the variable: $basepath.
If our folder path ever changes, we only need to update where we define the variable. Much easier to read, copy, remember, and type.
# should match your ID on HPC/server/garrison
self=cleverusername
# pick a fancyname
proj=fancy__games_dot_com
# use $self from above to define your working directory
basepath=/mnt/workspace2/$self
# check them - if the $var doesnt exist they'll print empty
echo $self $proj $basepath $thisdoesntexist
# beware of undefined (empty) vars, and vars assigned to typos/nonsense/outdated data
# computer doesn't know the difference. yet.Create vars for all your folders, then give those vars
to mkdir to make those folders.
# using basepath from above
raw=$basepath/${proj}_raw
wrk=$basepath/$proj
# for storing metadata etc.
mat=$wrk/Materials
# assumed location, update as necess
db=$basepath/db
qc=$wrk/1__qc
filt=$wrk/2__filt
host=$wrk/3__deacon
krak=$wrk/4__krak2
## make folders from vars
mkdir -p $raw $wrk $qc $filt $host $krak $mat
mkdir -p $qc/${proj}_raw $qc/${proj}_filt $qc/${proj}_raw_multi $qc/${proj}_filt_multimamba (i.e. a quicker version of conda) is
your new best friend.
# folder for your programmes - adapt as best suits you
b_path=~/bin
mkdir -p $b_path ; cd $b_path
# first, get mamba (miniforge) and run through installation
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh
# install tools needed into fresh environments using mamba. or dont.
mamba create -n mgx -c bioconda -c conda-forge trimmomatic fastqc multiqc bowtie2 samtools hostile deacon -y # 346MB
mamba create -n k2 -c bioconda bracken kraken2 krakentools -y # >300MBMost programmes have resources they need to get - time to get the ones you need (only need to get them once).
# recall we defined a $db var for databases etc.
lk $db
# filtering & trimming - trimmomatic standard reference sequences to be removed
echo '>Ampli_Tru_Seq_adapter__MOST_IMPORTANT_FEEEL_ME
CTGTCTCTTATACACATCT
>Transposase_Adap__for_tagmentation_1
TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
>Transposase_Adap__for_tagmentation_2
GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG
>PCR_primer_index_1
CAAGCAGAAGACGGCATACGAGATNNNNNNNGTCTCGTGGGCTCGG
>PCR_primer_index_2
AATGATACGGCGACCACCGAGATCTACACNNNNNTCGTCGGCAGCGTC
>TruSeq_single_index_LT_CD_HT__1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>TruSeq_single_index_LT_CD_HT__2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
>PCR-Free_Prep__Tagm__additional_seq
ATGTGTATAAGAGACA
>Ampli_Tru_Seq_adapter__MOST_IMPORTANT_FEEEL_ME__mate_RC
AGATGTGTATAAGAGACAG
>Transposase_Adap__for_tagmentation_1.RC:
CTGTCTCTTATACACATCTGACGCTGCCGACGA
>Transposase_Adap__for_tagmentation_2.RC:
CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
>PCR_primer_index_1_RC:
CCGAGCCCACGAGACNNNNNNNATCTCGTATGCCGTCTTCTGCTTG
>PCR_primer_index_2_RC:
GACGCTGCCGACGANNNNNGTGTAGATCTCGGTGGTCGCCGTATCATT
>TruSeq_single_index_LT_CD_HT__1_RC:
TGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
>TruSeq_single_index_LT_CD_HT__2_RC:
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PCR-Free_Prep__Tagm__additional_seq_RC:
TGTCTCTTATACACAT
>polG_just_from_PCF_concerns
GGGGGGGGGGGGGGGGGGGGGGGG
>polA_just_from_PCF_concerns
AAAAAAAAAAAAAAAAAAAAAAAA'> $mat/fqc_trimmo_ill_ref.fa
# decontamination - Download validated 3GB human pangenome index (version 0.13.0 or later)
deacon index fetch panhuman-1 -o $db/deacon_db
# assignment - kraken2 / Bracken
# get this db!Non-mamba: install basespace from illumina,
for getting sequence data: firstly, get an Illumina account here.
## record your own username and password as best you see fit:
> u: _____________
> p: _____________
Then get and activate the bs:
bs=$b_path/bs
## if you don't have it, download the basepspace programme (aka bs) to $bs :
wget "https://launch.basespace.illumina.com/CLI/latest/amd64-linux/bs" -O $bs ; chmod 775 $bs
# follow instructions to complete setup:
$bs auth
# more info if you like: (press Q to exit)
$bs download -hHoo. Nearly there.
NB: In reality, if we’re running these programmes on
UCC’s HPC cluster, we will have to change the format for these commands
to work with a HPC management programme called SLURM. We’ll
cover this later, but for the moment any SLURM work will go
into a drop-down box like the one below - ignore these for
now.
Basespace (aka bs) should have been installed etc.
above. Follow the basespace link sent by your sequencer - it will add
the data to your illumina account (which should also have been set up
above).
https://basespace.illumina.com/s/abcdefghijkl # not a real link!
When you click the above it will bring you to a webpage - the project ID for download is in that page’s URL address e.g. :
https://basespace.illumina.com/projects/123456789/about
, where 123456789 is the project ID.
Mad I know.
# save your real project ID here
projectID=123456789
## download FASTQ to the folder we chose above - will take a while to finish, and will
$bs download project --id $projectID --output $raw --log=$f_path/fdl.log &
## tidy up the FASTQ files
mv $raw/*/*fastq.gz $raw/ ; rmdir $raw/*
## name wrangle to remove _L00*_R*_001 from name if necessary
for f in $raw/*R1*gz ; do mv $f ${f%%_L00*}_R1.fastq.gz ; done
for f in $raw/*R2*gz ; do mv $f ${f%%_L00*}_R2.fastq.gz ; done
## have a look at your new files! red = compressed
ls $raw/*fastq.gz
# use that to pull a sample list of ruse later on
ls $raw/*gz | sed -r 's/.*\/(.*)_R.*/\1/g' | sort -u > $mat/${proj}__samples.txt
# set a test sample
test=vaginal-68-Visit-1_S11
lk $raw/${test}*Check the sequencing output
## F / M Q C ================================================================
fastqc -t 20 $raw/*fastq.gz -o $qc/${proj}_raw && multiqc $qc/${proj}_raw -o $qc/${proj}_raw_multi
## exclude negative controls if empty
# ls $raw/*gz | sed -r 's/.*\/(.*)_R.*/\1/g' | sort -u | grep -vE 'LIBNEG' > $mat/${proj}__samples.txtSLURM version + fiddling
## < ! > this isn't recal'd for UCC SLURM / non-omm
#!/bin/bash
#SBATCH --job-name=fqc20
#SBATCH --output=fqc20.%j.slrm
#SBATCH --ntasks=20
INTXT=$1
OUT=$2
mkdir -p $OUT/fwd $OUT/rev ${OUT}_multi
if [[ $? = 0 ]] && [[ -s $INTXT ]] ;
then
echo " + + + running fastqc --- - - -- - - "
module load fastqc multiqc
fastqc -t 20 $(cat $INTXT) -o $OUT # just about works!!
else
echo " < ! > something awry with the fastq.gz given to FQC/MQC, or the output <dirs> :
< ! > check <dir> structure and file manifest : $INTXT"
# exit 1
fi
mv $OUT/*R1* $OUT/fwd/
mv $OUT/*R2* $OUT/rev/
echo " + + + running multiqc -- -- --- -- - - --- "
multiqc $OUT/fwd -o ${OUT}_multi/multi_fwd
multiqc $OUT/rev -o ${OUT}_multi/multi_revNow to see how that looks. Copy the multiQC.html output
to your computer and have a look:
# XXX.YYY.34.1 is the address for the HPC (replace XXX.YYY with our real prefix) - we use -o ProxyJump to send the request through garrison.ucc.ie
# run this command on the server (to get the correct paths etc.), then copy and paste on your local (UNIX) machine
echo "scp -o ProxyJump=${self}@garrison.ucc.ie ${self}@XXX.YYY.34.1:$qc/${proj}_raw_multi/multiqc_report.html ./Desktop/"
# in your browser, open the `multiqc_report.html` that has appeared on your desktop Lots of info. Focus on Per Base Sequence Content,
clicking through a few different samples.
per base sequence quality
Considering the faec & vag samples come
from very different substrates, these look v comparable - probably
because we haven’t remove the host DNA yet. The smooth lines indicate
regions where the overall sampling of communtiy DNA is so balanced that
the lines separate reliably along the difference in DNA GC
content (a characteristic of each sample - usually 40-60%). this is
good! Note how things start and end badly though - we always
trim the beginning and end.
faecal sample
vaginal sample
These are more jagged, more erratic - this indicates different bases
are spiking up and down because there are so few reads, you’re not
getting a balanced sample across all positions. This is particulalrly
evident at the 5' (start) of the read: each read starts
with a synthetic adapter, all of which have the same sequence. As a
result, the bases at those locations are clearly A, T, G,
or C.
Note that a negative control _should_look like this - the less DNA in the negative control the better!
Vaginal-bad
However, this is a vaginal sample that did not perform well (likely low yield). Something to be aware of.
control sample
After clicking through a few samples, you’ll get a feel for where the
” kinks ” or changes in the read-quality is (at beginning
and end). Here, I think 20bp (on the 5' / start) and
105bp (on the 3' / end). This means we’re throwing away
more than a third of our data, before we even start. We must endure
this pain.
Additionally, although trimming requirements are equivalent at the moment, it might make sense to process the faecal and vaginal samples separately. We can come back to this as necessary.
trimmomaticRun trimmomatic, then QC the outputs:
##--< ! > MAKE SURE you check F+MQC to set your parameters below < ! >--
# use ILLUMINACLIP before CROP to maximise space for pattern recognition
cat $mat/${proj}__samples.txt | parallel -j 4 "trimmomatic PE \
$raw/{}_R1.fastq.gz \
$raw/{}_R2.fastq.gz \
$filt/{}_R1_trimm.fastq.gz \
$filt/{}_R1_trimm_unpaired.fastq.gz \
$filt/{}_R2_trimm.fastq.gz \
$filt/{}_R2_trimm_unpaired.fastq.gz \
ILLUMINACLIP:$mat/fqc_trimmo_ill_ref.fa:2:30:10:5 \
SLIDINGWINDOW:6:15 \
CROP:105 \
HEADCROP:19 \
MINLEN:50 \
-threads 4 > $filt/{}_trim.log 2>&1"
# run FQC-MQC on the trimmed data
mkdir $filt/unpaired ; mv $filt/*unpaired* $filt/unpaired/
fastqc -t 12 $filt/*_trimm.fastq.gz -o $qc/${proj}_filt ; multiqc $qc/${proj}_filt -o $qc/${proj}_filt_multiSLURM version of trimmomatic
Write the SLURM script:
#!/bin/bash
#SBATCH --job-name=trimmo4
#SBATCH --output=omm__2__trimmo4.%j.slrm
#SBATCH --ntasks=4
sl_samp=$1
sl_indir=$2
sl_outdir=$3
# sl_samp=$test
# sl_indir=$raw
# sl_outdir=$filt
# echo $sl_samp $sl_indir $sl_outdir ${sl_outdir%\/*}
##--< ! > MAKE SURE you check F+MQC to set your parameters below < ! >--
mamba activate geno
trimmomatic PE \
$sl_indir/${sl_samp}_R1.fastq.gz \
$sl_indir/${sl_samp}_R2.fastq.gz \
$sl_outdir/${sl_samp}_R1_trimm.fastq.gz \
$sl_outdir/${sl_samp}_R1_trimm_unpaired.fastq.gz \
$sl_outdir/${sl_samp}_R2_trimm.fastq.gz \
$sl_outdir/${sl_samp}_R2_trimm_unpaired.fastq.gz \
ILLUMINACLIP:${sl_outdir%\/*}/Materials/fqc_trimmo_ill_ref.fa:2:30:10:5 \
SLIDINGWINDOW:6:15 \
CROP:105 \
HEADCROP:20 \
MINLEN:50 \
-threads 4
# presume logged to trimmo4.%j.slrmRun the SLURM script:
# CHECK THE CROP VALUES (105, 20, 50), then insert the script above:
nano $mat/ucc__fbio__slurm__trimmo4.sh
## separate processing? note you abandon the neg ctrls here - do you really want to do this?
# grep "faecal" $mat/${proj}__samples.txt > $mat/${proj}__samples-faecal.txt
# grep "vaginal" $mat/${proj}__samples.txt > $mat/${proj}__samples-vaginal.txt
# for f in $( cat $mat/${proj}__samples.txt ) ;
for f in $( cat $mat/${proj}__samples.txt | head -4 ) ;
do
sbatch $mat/ucc__fbio__slurm__trimmo4.sh $f $raw $filt ;
donethen check what you’ve done…
DeaconFuture work should include developing a better suited database to catch possible contaminants.
# downloaded the correct (human) index when we installed
# Deplete short paired reads
time cat $mat/${proj}__samples.txt | parallel -j 4 "deacon filter -d \
$db/deacon_db/panhuman-1.k31w15.idx \
$filt/{}_R1_trimm.fastq.gz \
$filt/{}_R2_trimm.fastq.gz \
-o $host/{}_R1_deco.fastq.gz \
-O $host/{}_R2_deco.fastq.gz \
--threads 4 > $host/{}_hostless.log 2>&1"SLURM version of deacon
Write the SLURM script into
ucc__fbio__slurm__deacon6.sh
```{ bash slurm_deacon } #!/bin/bash
#SBATCH –job-name=deacon6 #SBATCH –output=omm__3__deacon6.%j.slrm #SBATCH –ntasks=6
sl_samp=$1 sl_indir=$2 sl_outdir=$3
mamba activate geno
time cat \(mat/\){proj}__samples.txt
| parallel -j 4 “deacon filter -d
$db/deacon_db/panhuman-1.k31w15.idx
\(sl_indir/\){sl_samp}_R1_trimm.fastq.gz
\(sl_indir/\){sl_samp}_R2_trimm.fastq.gz
-o \(sl_outdir/\){sl_samp}_R1_deco.fastq.gz
-O \(sl_outdir/\){sl_samp}_R2_deco.fastq.gz
–threads 6 > \(host/\){sl_samp}_hostless.log
2>&1”
Run the `SLURM` script:
``` bash
# for f in $( cat $mat/${proj}__samples.txt ) ;
for f in $( cat $mat/${proj}__samples.txt | head -4 ) ;
do
sbatch $mat/ucc__fbio__slurm__deacon6.sh $f $raw $filt ;
done
then check what you’ve done.
Kraken2 + BrackenPrevious steps have been very clear in their objective. For
#4, need to have a think about what is required or sensible
for your own project - more interested in specifics (genomes) or in
communities (metagenomes)? Genes more than taxa (pathway analysis)?
Below, we presume metagenomics is the main interest, and that we want something powerful and quick (and good).
## Kraken 2 -------------------------------------------------------------
# minimum hit groups and confidence are major
while read samp ;
do time kraken2 --db $lang \
$host/${samp}_hostless/${samp}_R1_trimm.clean_1.fastq.gz \
$host/${samp}_hostless/${samp}_R2_trimm.clean_2.fastq.gz \
--paired \
--threads 14 \
--confidence 0.1 \
--gzip-compressed \
--report-zero-counts \
--minimum-hit-groups 5 \
--minimum-base-quality 20 \
--report $krak/${samp}_kraken2_report \
--unclassified-out $krak/${samp}_kraken_unclass# \
--output $krak/${samp}_kraken_output > $krak/${samp}_krak2.log && echo " + + + sample ${samp} completed task" >> $krak/kraken_okay.log ;
done< $mat/${proj}__samples.txt
kreport2mpa.py -r $krak/${test}_kraken_report -o $krak/${proj}__${test}_kraken_mpa
grep -h '|s_' $krak/${proj}__${test}_kraken_mpa | cut -f 1 | sort | uniq | sed 's/|/\t/g' > $krak/${proj}__krakenStnd_taxonomy.tsv
less -S $krak/${proj}__krakenStnd_taxonomy.tsv
## Bracken -------------------------------------------------------------
BR_r=150 # $BR_leng
BR_l=S
BR_t=50 # counts of microbes! not threads
for i in $( cat $mat/${proj}__samples.txt );
do
bracken -d $db/kraken2_standard_langm/ -i $krak/${i}_kraken2_report -o $krak/${i}.bracken -r $BR_r -l $BR_l -t $BR_t ;
done > $krak/${proj}_krak2_bracken.log
combine_bracken_outputs.py --files $krak/*.bracken -o $krak/${proj}__krakenStnd_abundances.tsv >> $krak/${proj}_krak2_bracken.log
## e x i t s t a g e ====================================================
mkdir $mat/output
cp $krak/${proj}__krakenStnd_abundances.tsv ${KRAK}*/${proj}__krakenStnd_taxonomy.tsv $mat/output/
## quick exports
scp -i .ssh/id_file -oProxyJump=${self}@garrison.ucc.ie ${path}/${proj}.tar.gz ${self}@XXX.YYY.34.1:$basepath/