hello

This is a quick(er), simplified(er) guide to processing large shotgun short-read sequencing datasets.

`0.` setup and configure  -  installing programmes and databases
`1.` download        (and check)
`2.` trimm           (and check)
`3.` decontaminate   (and check)
`4.` identify        (and check)

NB: the guide is not fully tested, not completed for using SLURM, and needs more explanation for step #4.


0. setup & configure

folders etc.

A type of shortcut called a local variable helps us keep things tidy. First, assign your directory path (folder) to a short name, like basepath - now, we can use that shortcut to refer to the long name anywhere in our script, by adding $ to the variable: $basepath.

If our folder path ever changes, we only need to update where we define the variable. Much easier to read, copy, remember, and type.

# should match your ID on HPC/server/garrison
self=cleverusername

# pick a fancyname
proj=fancy__games_dot_com

# use $self from above to define your working directory
basepath=/mnt/workspace2/$self

# check them - if the $var doesnt exist they'll print empty
echo $self $proj $basepath $thisdoesntexist

# beware of undefined (empty) vars, and vars assigned to typos/nonsense/outdated data
# computer doesn't know the difference. yet.


Create vars for all your folders, then give those vars to mkdir to make those folders.

# using basepath from above
raw=$basepath/${proj}_raw
wrk=$basepath/$proj

# for storing metadata etc.
mat=$wrk/Materials

# assumed location, update as necess
db=$basepath/db   

qc=$wrk/1__qc
filt=$wrk/2__filt
host=$wrk/3__deacon
krak=$wrk/4__krak2

## make folders from vars
mkdir -p $raw $wrk $qc $filt $host $krak $mat
mkdir -p $qc/${proj}_raw $qc/${proj}_filt $qc/${proj}_raw_multi $qc/${proj}_filt_multi


programmes etc.

Click here for installation stuff - you might only ever have to do it once.


mamba (i.e. a quicker version of conda) is your new best friend.


# folder for your programmes - adapt as best suits you
b_path=~/bin  
mkdir -p $b_path ; cd $b_path

# first, get mamba (miniforge) and run through installation
wget "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh"
bash Miniforge3-$(uname)-$(uname -m).sh

# install tools needed into fresh environments using mamba. or dont.
mamba create -n mgx -c bioconda -c conda-forge trimmomatic fastqc multiqc bowtie2 samtools hostile deacon -y   # 346MB
mamba create -n k2 -c bioconda bracken kraken2 krakentools -y # >300MB

Most programmes have resources they need to get - time to get the ones you need (only need to get them once).


# recall we defined a $db var for databases etc.
lk $db

# filtering & trimming - trimmomatic standard reference sequences to be removed
echo '>Ampli_Tru_Seq_adapter__MOST_IMPORTANT_FEEEL_ME
CTGTCTCTTATACACATCT
>Transposase_Adap__for_tagmentation_1
TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG
>Transposase_Adap__for_tagmentation_2
GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG
>PCR_primer_index_1
CAAGCAGAAGACGGCATACGAGATNNNNNNNGTCTCGTGGGCTCGG
>PCR_primer_index_2
AATGATACGGCGACCACCGAGATCTACACNNNNNTCGTCGGCAGCGTC
>TruSeq_single_index_LT_CD_HT__1
AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
>TruSeq_single_index_LT_CD_HT__2
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT
>PCR-Free_Prep__Tagm__additional_seq
ATGTGTATAAGAGACA
>Ampli_Tru_Seq_adapter__MOST_IMPORTANT_FEEEL_ME__mate_RC
AGATGTGTATAAGAGACAG
>Transposase_Adap__for_tagmentation_1.RC:
CTGTCTCTTATACACATCTGACGCTGCCGACGA
>Transposase_Adap__for_tagmentation_2.RC:
CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
>PCR_primer_index_1_RC:
CCGAGCCCACGAGACNNNNNNNATCTCGTATGCCGTCTTCTGCTTG
>PCR_primer_index_2_RC:
GACGCTGCCGACGANNNNNGTGTAGATCTCGGTGGTCGCCGTATCATT
>TruSeq_single_index_LT_CD_HT__1_RC:
TGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
>TruSeq_single_index_LT_CD_HT__2_RC:
ACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PCR-Free_Prep__Tagm__additional_seq_RC:
TGTCTCTTATACACAT
>polG_just_from_PCF_concerns
GGGGGGGGGGGGGGGGGGGGGGGG
>polA_just_from_PCF_concerns
AAAAAAAAAAAAAAAAAAAAAAAA'> $mat/fqc_trimmo_ill_ref.fa

# decontamination - Download validated 3GB human pangenome index (version 0.13.0 or later)
deacon index fetch panhuman-1 -o $db/deacon_db

# assignment - kraken2 / Bracken
# get this db!

Non-mamba: install basespace from illumina, for getting sequence data: firstly, get an Illumina account here.

## record your own username and password as best you see fit:
> u: _____________
> p: _____________

Then get and activate the bs:


bs=$b_path/bs

## if you don't have it, download the basepspace programme (aka bs) to $bs :
wget "https://launch.basespace.illumina.com/CLI/latest/amd64-linux/bs" -O $bs ; chmod 775 $bs

# follow instructions to complete setup:
$bs auth
# more info if you like: (press Q to exit)
$bs download -h

Hoo. Nearly there.

NB: In reality, if we’re running these programmes on UCC’s HPC cluster, we will have to change the format for these commands to work with a HPC management programme called SLURM. We’ll cover this later, but for the moment any SLURM work will go into a drop-down box like the one below - ignore these for now.


last crucial step!

Because we used mamba to install our programmes above into the mgx environment, we need to “activate” the mgx environment before use or the computer will think the programmes are not installed.

mamba activate mgx

## to deactivate it afterwards
# mamba deactivate

Time to go.



1. download (& check)

Basespace (aka bs) should have been installed etc. above. Follow the basespace link sent by your sequencer - it will add the data to your illumina account (which should also have been set up above).

https://basespace.illumina.com/s/abcdefghijkl      # not a real link!

When you click the above it will bring you to a webpage - the project ID for download is in that page’s URL address e.g. :

https://basespace.illumina.com/projects/123456789/about

, where 123456789 is the project ID. Mad I know.

# save your real project ID here
projectID=123456789

## download FASTQ to the folder we chose above - will take a while to finish, and will 
$bs download project --id $projectID --output $raw --log=$f_path/fdl.log & 

  
##  tidy up the FASTQ files 
mv $raw/*/*fastq.gz $raw/ ; rmdir $raw/*
## name wrangle to remove _L00*_R*_001 from name if necessary
for f in $raw/*R1*gz ; do mv $f ${f%%_L00*}_R1.fastq.gz ; done
for f in $raw/*R2*gz ; do mv $f ${f%%_L00*}_R2.fastq.gz ; done

  
## have a look at your new files! red = compressed
ls $raw/*fastq.gz
# use that to pull a sample list of ruse later on  
ls $raw/*gz | sed -r 's/.*\/(.*)_R.*/\1/g' | sort -u > $mat/${proj}__samples.txt

# set a test sample
test=vaginal-68-Visit-1_S11
lk $raw/${test}*


Check the sequencing output


## F / M Q C    ================================================================

fastqc -t 20 $raw/*fastq.gz -o $qc/${proj}_raw && multiqc $qc/${proj}_raw -o $qc/${proj}_raw_multi

## exclude negative controls if empty
# ls $raw/*gz | sed -r 's/.*\/(.*)_R.*/\1/g' | sort -u  | grep -vE 'LIBNEG' > $mat/${proj}__samples.txt
SLURM version + fiddling

##   < ! >   this isn't recal'd for UCC SLURM / non-omm

#!/bin/bash

#SBATCH --job-name=fqc20
#SBATCH --output=fqc20.%j.slrm
#SBATCH --ntasks=20

INTXT=$1
OUT=$2


mkdir -p $OUT/fwd $OUT/rev ${OUT}_multi

if [[ $? = 0 ]] && [[ -s $INTXT ]] ;
then
  echo " + + +   running fastqc   --- - - -- - -   "
  module load fastqc multiqc
  fastqc -t 20 $(cat $INTXT) -o $OUT   #  just about works!!
else
  echo " < ! >   something awry with the fastq.gz given to FQC/MQC, or the output <dirs> :
     < ! >   check <dir> structure and file manifest : $INTXT"
  # exit 1
fi

mv $OUT/*R1* $OUT/fwd/
mv $OUT/*R2* $OUT/rev/
echo " + + +   running multiqc   -- -- --- -- - - --- "
multiqc $OUT/fwd -o ${OUT}_multi/multi_fwd
multiqc $OUT/rev -o ${OUT}_multi/multi_rev


Now to see how that looks. Copy the multiQC.html output to your computer and have a look:



# XXX.YYY.34.1 is the address for the HPC (replace XXX.YYY with our real prefix) - we use -o ProxyJump to send the request through garrison.ucc.ie

# run this command on the server (to get the correct paths etc.), then copy and paste on your local (UNIX) machine
echo "scp -o ProxyJump=${self}@garrison.ucc.ie ${self}@XXX.YYY.34.1:$qc/${proj}_raw_multi/multiqc_report.html ./Desktop/"

# in your browser, open the `multiqc_report.html` that has appeared on your desktop 


Lots of info. Focus on Per Base Sequence Content, clicking through a few different samples.

per base sequence quality

per base sequence quality

Click the drop-down for an overview of the MultiQC output.

Considering the faec & vag samples come from very different substrates, these look v comparable - probably because we haven’t remove the host DNA yet. The smooth lines indicate regions where the overall sampling of communtiy DNA is so balanced that the lines separate reliably along the difference in DNA GC content (a characteristic of each sample - usually 40-60%). this is good! Note how things start and end badly though - we always trim the beginning and end.

faecal sample

faecal sample

vaginal sample

vaginal sample


These are more jagged, more erratic - this indicates different bases are spiking up and down because there are so few reads, you’re not getting a balanced sample across all positions. This is particulalrly evident at the 5' (start) of the read: each read starts with a synthetic adapter, all of which have the same sequence. As a result, the bases at those locations are clearly A, T, G, or C.

Note that a negative control _should_look like this - the less DNA in the negative control the better!

Vaginal-bad

Vaginal-bad

However, this is a vaginal sample that did not perform well (likely low yield). Something to be aware of.

control sample

control sample


After clicking through a few samples, you’ll get a feel for where the ” kinks ” or changes in the read-quality is (at beginning and end). Here, I think 20bp (on the 5' / start) and 105bp (on the 3' / end). This means we’re throwing away more than a third of our data, before we even start. We must endure this pain.

Additionally, although trimming requirements are equivalent at the moment, it might make sense to process the faecal and vaginal samples separately. We can come back to this as necessary.


2. trim and filter sequences (& check) - trimmomatic

Run trimmomatic, then QC the outputs:

  
##--< ! >   MAKE SURE you check F+MQC  to set your parameters below     < ! >--

# use ILLUMINACLIP before CROP to maximise space for pattern recognition
cat $mat/${proj}__samples.txt | parallel -j 4 "trimmomatic PE \
  $raw/{}_R1.fastq.gz \
  $raw/{}_R2.fastq.gz \
  $filt/{}_R1_trimm.fastq.gz \
  $filt/{}_R1_trimm_unpaired.fastq.gz \
  $filt/{}_R2_trimm.fastq.gz \
  $filt/{}_R2_trimm_unpaired.fastq.gz \
  ILLUMINACLIP:$mat/fqc_trimmo_ill_ref.fa:2:30:10:5 \
  SLIDINGWINDOW:6:15 \
  CROP:105 \
  HEADCROP:19 \
  MINLEN:50 \
  -threads 4 > $filt/{}_trim.log 2>&1"

# run FQC-MQC on the trimmed data
mkdir $filt/unpaired ; mv $filt/*unpaired* $filt/unpaired/
fastqc -t 12 $filt/*_trimm.fastq.gz -o $qc/${proj}_filt ; multiqc $qc/${proj}_filt -o $qc/${proj}_filt_multi
SLURM version of trimmomatic

Write the SLURM script:

#!/bin/bash

#SBATCH --job-name=trimmo4
#SBATCH --output=omm__2__trimmo4.%j.slrm
#SBATCH --ntasks=4

sl_samp=$1
sl_indir=$2
sl_outdir=$3

# sl_samp=$test
# sl_indir=$raw
# sl_outdir=$filt
# echo $sl_samp $sl_indir $sl_outdir ${sl_outdir%\/*}

##--< ! >   MAKE SURE you check F+MQC  to set your parameters below     < ! >--

mamba activate geno

trimmomatic PE \
  $sl_indir/${sl_samp}_R1.fastq.gz \
  $sl_indir/${sl_samp}_R2.fastq.gz \
  $sl_outdir/${sl_samp}_R1_trimm.fastq.gz \
  $sl_outdir/${sl_samp}_R1_trimm_unpaired.fastq.gz \
  $sl_outdir/${sl_samp}_R2_trimm.fastq.gz \
  $sl_outdir/${sl_samp}_R2_trimm_unpaired.fastq.gz \
  ILLUMINACLIP:${sl_outdir%\/*}/Materials/fqc_trimmo_ill_ref.fa:2:30:10:5 \
  SLIDINGWINDOW:6:15 \
  CROP:105 \
  HEADCROP:20 \
  MINLEN:50 \
  -threads 4
  
# presume logged to trimmo4.%j.slrm

Run the SLURM script:


# CHECK THE CROP VALUES (105, 20, 50), then insert the script above:
nano $mat/ucc__fbio__slurm__trimmo4.sh 

## separate processing? note you abandon the neg ctrls here - do you really want to do this?
# grep "faecal" $mat/${proj}__samples.txt > $mat/${proj}__samples-faecal.txt
# grep "vaginal" $mat/${proj}__samples.txt > $mat/${proj}__samples-vaginal.txt

# for f in $( cat $mat/${proj}__samples.txt ) ; 
for f in $( cat $mat/${proj}__samples.txt | head -4 ) ; 
do 
  sbatch $mat/ucc__fbio__slurm__trimmo4.sh $f $raw $filt ; 
done

then check what you’ve done…


3. remove host sequences - Deacon

Future work should include developing a better suited database to catch possible contaminants.

# downloaded the correct (human) index when we installed

# Deplete short paired reads
time cat $mat/${proj}__samples.txt | parallel -j 4 "deacon filter -d \
  $db/deacon_db/panhuman-1.k31w15.idx \
  $filt/{}_R1_trimm.fastq.gz \
  $filt/{}_R2_trimm.fastq.gz \
  -o $host/{}_R1_deco.fastq.gz \
  -O $host/{}_R2_deco.fastq.gz \
  --threads 4 > $host/{}_hostless.log 2>&1"
SLURM version of deacon

Write the SLURM script into ucc__fbio__slurm__deacon6.sh

```{ bash slurm_deacon } #!/bin/bash

#SBATCH –job-name=deacon6 #SBATCH –output=omm__3__deacon6.%j.slrm #SBATCH –ntasks=6

sl_samp=$1 sl_indir=$2 sl_outdir=$3

mamba activate geno

time cat \(mat/\){proj}__samples.txt | parallel -j 4 “deacon filter -d
$db/deacon_db/panhuman-1.k31w15.idx
\(sl_indir/\){sl_samp}_R1_trimm.fastq.gz
\(sl_indir/\){sl_samp}_R2_trimm.fastq.gz
-o \(sl_outdir/\){sl_samp}_R1_deco.fastq.gz
-O \(sl_outdir/\){sl_samp}_R2_deco.fastq.gz
–threads 6 > \(host/\){sl_samp}_hostless.log 2>&1”



Run the `SLURM` script:
  

``` bash

# for f in $( cat $mat/${proj}__samples.txt ) ; 
for f in $( cat $mat/${proj}__samples.txt | head -4 ) ; 
do 
  sbatch $mat/ucc__fbio__slurm__deacon6.sh $f $raw $filt ; 
done

then check what you’ve done.


4. identify - Kraken2 + Bracken

Previous steps have been very clear in their objective. For #4, need to have a think about what is required or sensible for your own project - more interested in specifics (genomes) or in communities (metagenomes)? Genes more than taxa (pathway analysis)?

Below, we presume metagenomics is the main interest, and that we want something powerful and quick (and good).


##  Kraken 2  -------------------------------------------------------------
      
# minimum hit groups and confidence are major 
while read samp ;
do time kraken2 --db $lang \
    $host/${samp}_hostless/${samp}_R1_trimm.clean_1.fastq.gz \
    $host/${samp}_hostless/${samp}_R2_trimm.clean_2.fastq.gz \
    --paired \
    --threads 14 \
    --confidence 0.1 \
    --gzip-compressed \
    --report-zero-counts \
    --minimum-hit-groups 5 \
    --minimum-base-quality 20 \
    --report $krak/${samp}_kraken2_report \
    --unclassified-out $krak/${samp}_kraken_unclass# \
    --output $krak/${samp}_kraken_output > $krak/${samp}_krak2.log && echo " + + +   sample ${samp} completed task" >> $krak/kraken_okay.log ; 
done< $mat/${proj}__samples.txt

kreport2mpa.py -r $krak/${test}_kraken_report -o $krak/${proj}__${test}_kraken_mpa
grep -h '|s_' $krak/${proj}__${test}_kraken_mpa | cut -f 1 |   sort | uniq | sed 's/|/\t/g' > $krak/${proj}__krakenStnd_taxonomy.tsv
less -S $krak/${proj}__krakenStnd_taxonomy.tsv


##  Bracken  -------------------------------------------------------------

BR_r=150      # $BR_leng
BR_l=S
BR_t=50   # counts of microbes! not threads
for i in $( cat $mat/${proj}__samples.txt );
do 
  bracken -d $db/kraken2_standard_langm/ -i $krak/${i}_kraken2_report -o $krak/${i}.bracken -r $BR_r -l $BR_l -t $BR_t ;
done > $krak/${proj}_krak2_bracken.log                                                               
combine_bracken_outputs.py --files $krak/*.bracken -o $krak/${proj}__krakenStnd_abundances.tsv >> $krak/${proj}_krak2_bracken.log                                                               

##   e x i t    s t a g e   ====================================================

mkdir $mat/output
cp $krak/${proj}__krakenStnd_abundances.tsv ${KRAK}*/${proj}__krakenStnd_taxonomy.tsv $mat/output/

## quick exports 
scp -i .ssh/id_file -oProxyJump=${self}@garrison.ucc.ie ${path}/${proj}.tar.gz ${self}@XXX.YYY.34.1:$basepath/