R
workflow is complete, but page remains under construction đ§
The R
work starts here đŹ, continued from work
assembling the metagenomic
data.
Getting metagenomic data from bash & co.
to
R
can be awkward - this is just a quick run-through of one
way it can be done, creating basic .tsv
files (as enjoyed
by microsoft excel), bringing them into basic R
first, and
then to the great R
library phyloseq
. Feel
free to suggest alternate routes.
Here, we do the following: for Kraken2+Bracken
, and
separately for `Kaiju
:
Bracken
or
Kaiju
into R
:R
in one of two ways: a.
reorganise it for Bracken
b. reorganise it for
Kaiju
.tsv
text files, that can
be used in e.g. excel (if you like excel)phyloseq
,
a R
library specifically for doing microbial
community ecology, designed to keep all the community data in one place,
and make using R
easier for microbiologists, virologists,
mycologists & co.Note: - itâs the same idea for
Kraken2-Bracken
and for Kaiju
, except that
step 2 is different (because Bracken
output is different
from Kaiju
output). Note also that weâre using the outputs
from the first
page.
R
This might not be necessary for you, but here we set up
R
using conda
- this creates a local version
that we have control over. We start off in *bash*
, using conda
to get going.
# R! - requires conda to be installed
conda create -n fyR
conda activate fyR
# get a personal copy of R; -y flag for silent install
# we'll still have to install libraries of interest
conda install -c r -y r
# dir for microbial community
R=$WRK/5__rmicro
mkdir $R
mkdir $R/analysis $R/input $R/output $R/vis $R/documents
# arrange your outputs from Krak & kaiju
cp $KROUT/*tsv $R/input
cp $KaOUT/*tsv $R/input
# check picking up the correct env
which R
# here we go!
R
R
Point to your output files, and import them: here there are 2 files
for Kraken2
, 1 file for Kaiju
.
## Kraken2
# output from kraken2-bracken, combined to one file
krak2_brack_file="../input/rmicro__krakenStnd_abundances.tsv"
# output from single kraken2 report, converted to MPA
krak2_mpa_file="../input/rmicro__krakenStnd_taxonomy.tsv"
## Kaiju
# output from kaiju's kaiju2table output
kaij_file="../input/rmicro__total__kaiju_summary_species.tsv"
There are many, many ways to do things (generally and) in
R
- note that this page avoids using R::tidy
solutions, but feel free to suggest some.
Kraken2+Bracken
data:Point to output file, and import
k2dat <- read.table(file = krak2_brack_file, header=TRUE, quote = "", sep = '\t')
k2dat[1:10, 1:10]
prenames_k2dat <- paste0("k2_", stringr::str_pad(k2dat[,2], width=7, pad=0))
## non-unique names - assume only one dupe of each
prenames_k2dat[ duplicated(k2dat[,2])] <- gsub('k2_', 'k2-A_', prenames_k2dat[ duplicated(k2dat[,2])])
rownames(k2dat) <- prenames_k2dat
# counts
k2_counts <- k2dat[ , grep('_num', colnames(k2dat))]
colnames(k2_counts) <- gsub( ".bracken_num", "", colnames(k2_counts) )
# relab%
k2_relab <- k2dat[ , grep('_frac', colnames(k2dat))]
colnames(k2_relab) <- gsub( ".bracken_frac", "", colnames(k2_relab) )
We need to make taxonomic data from the material we have (we made
files for this using --report-zero-counts
and
kreport2mpa.py
). Note that this takes a little
organisation, as the taxonomic hierarchy is not the same across all
domains of life (because it is imaginary).
## start with data from the abundance table names --------------------------------
tax_a <- data.frame(taxon=k2dat[ ,1], k2_id=k2dat[ ,2], row.names = rownames(k2dat) , stringsAsFactors = FALSE)
tax_a[,1] <- gsub('\\/','', tax_a[,1])
tax_a[,1] <- gsub('ALOs_','ALOs-', tax_a[,1])
# tax_a[,1] <- gsub("'","", tax_a[,1]) # none?
## NOTE use quote="" to avoid nightmare of special characters like '"`/,| etc in names
tax_b <- read.table( krak2_mpa_file, sep='\t', header=FALSE, fill = TRUE, stringsAsFactors = FALSE, quote="")
head(tax_b)
## regularise taxonomic ranks (not all have K:P:C:O:F:G:S etc.) -----------------
ranks <- c("k__", "p__", "c__", "o__" , "f__", "g__", "s__")
tax_b <- t(apply(tax_b, 1, function(aa){ # aa <- tax_b[1,]
sapply(ranks, function(aaa){ # aaa <- "c__"
bbb <- grep(aaa, unlist(aa), value=TRUE)
ifelse( length(bbb) == 0, "unkn.", bbb)
})
}))
dim(tax_b) ; str(tax_b)
## e.g. eukarya have loads of ranks, so many empty sapces to be filled, e.g.
# tax_b[4000:4200,]
## more tax issues due to names: doubled ranks:
prior_index <- (unlist( lapply(1:nrow(tax_b), function(aa){ if( sum( grepl("unkn.", tax_b[aa,])) == 6){aa} }) ) - 1)
# # compare:
# tax_b[ prior_index , ]
# tax_b[ prior_index+1 , ]
# amend, by doubling up (will remove duplicates later)
tax_b[ prior_index , "s__"] <- tax_b[ (prior_index+1) , "s__"]
# row is only useful if has a species: cannot have a species, and hit 6, And be worth keeping
tax_b <- tax_b[ !apply(tax_b, 1, function(aa) sum(grepl("unkn.", unlist(aa)) ) == 6) , ]
colnames(tax_b) <- c("D", "P", "C", "O", "F", "G", "S")
tax_a[ , "taxon"] <- gsub(" ", "_", paste0("s__", tax_a[ , "taxon"]))
head(tax_a) ; head(tax_a[ , "taxon"]) ; dim(tax_a)
head(tax_b) ; head(tax_b[ , "S"]) ; dim(tax_b)
## stitch & name ----------------------------------
tax_c <- merge(tax_b, tax_a, by.x="S", by.y="taxon")[, c(2:7,1,8)]
head(tax_c) ; dim(tax_c)
prenames_tax <- paste0("k2_", stringr::str_pad(tax_c[,"k2_id"], width=7, pad=0))
rownames(tax_c) <- prenames_tax
## check all on the same page -----------------------
all(rownames(k2dat) == rownames(k2_counts) & rownames(k2dat) == rownames(k2_relab))
all(rownames(k2dat) %in% rownames(tax_c))
kraken_ids <- sort(rownames(k2dat))
k2dat <- k2dat [kraken_ids , ]
k2_counts <- k2_counts [kraken_ids , ]
k2_relab <- k2_relab [kraken_ids , ]
tax_c <- tax_c[kraken_ids , ]
## remove if no values (kraken2 set to give all taxa)
pos_abund <- rownames(k2_relab)[rowSums(k2_relab) > 0]
k2_relab <- k2_relab[ pos_abund , ]
k2_counts <- k2_counts[ pos_abund , ]
k2_tax <- tax_c[ pos_abund , ]
Next, add in the metadata we use to organise this study. Environmental variables like pH, SCFA, dates, locations, and other study design data go in here. We create a minimal set based on the data available:
mgdat <- data.frame("sample" = colnames(k2_counts),
"type" = gsub("(.).*", "\\1", colnames(k2_counts), perl = TRUE),
"seqdepth" = colSums(k2_counts) )
# View(mgdat)
Next, check that all the data looks sane and normal.
## check - do all match? -----------------------
dim(k2_counts)
dim(k2_relab)
dim(k2_tax)
dim(mgdat)
# look
k2_counts[1:10, 1:10]
# View(k2_counts)
# hist( log10(k2_counts ))
Kraken2
on itâs own :(
note that this isnât recommended
You might not want to use Bracken
(or more likely, you
might not be able to make a Bracken
database). It is not the best idea,
but you can
Kraken2
reports to one fileS
) and strains (S1
)
from that reportS
pecies levelFirst, we get the species data. Be careful here that you donât
accidentally include the $TEST
file too! This would
duplicate that sample.
## i n b a s h ================================================== ##
# presume you've made a load of kraken2 files:
ls -lsh $KROUT/*kraken_report
# be careful here that you don't accidentally include the $TEST file too! This would duplicate that sample
~/bin/KrakenTools/combine_kreports.py -r $KROUT/*kraken_report -o $KROUT/krakenStnd_kreportCombo.tsv
# table of sorts
grep -E 'perc\stot_all' $KROUT/krakenStnd_kreportCombo.tsv | sed 's/#//' > $KROUT/krakenStnd_kreportCombo_count.tsv
grep -E '\sS1?\s' $KROUT/krakenStnd_kreportCombo.tsv >> $KROUT/krakenStnd_kreportCombo_count.tsv
Put the file where it needs to be (assuming alongside
krak2_brack_file
& krak2_mpa_file
described above). Note the structure of this data using
head()
: there are two columns for each sample:
all
(all reads from this point onwards in the tree), and
lvl
(reads assigned to this specific level in the tree). If
we selected all
, we would count the same values multiple
times at each level. The lvl
counts are the abundances
assigned specifically to that taxonomic point - this is what we will
use.
As there will be several e.g. Streptococcus thermophilus
strains (S1
level), we will gather read abundances for all
those strains to S
pecies level by finding all taxa whoâs
names contain Streptococcus thermophilus (and do this for all
taxa present).
Note that we also make a taxonomy
table based on the
work above, to try match the feature IDs properly.
just_krak <- "../input/krakenStnd_kreportCombo.tsv"
krak2 <- read.table(just_krak, header=TRUE, sep="\t", quote = "")
rownames(krak2) <- paste0("k2_", stringr::str_pad(krak2$taxid, width=7, pad=0))
head(krak2)
# View(krak2)
# choose to keep the abundances at "_lvl", rather than at "_all"
kfeat <- krak2[ , grep("\\d_lvl$", colnames(krak2), value= TRUE, perl=TRUE) ]
head(kfeat)
# fix some names, esp for the next step
tax_x <- data.frame(taxon=krak2[ ,30], k2_id=krak2[ ,29], row.names = rownames(krak2) , stringsAsFactors = FALSE)
tax_x[,1] <- gsub('\\/','', tax_x[,1])
tax_x[,1] <- gsub('ALOs_','ALOs-', tax_x[,1])
tax_x[,1] <- gsub('[\\[\\]]','', tax_x[,1], perl = TRUE)
# make a clean genus and species column
tax_x$G <- gsub("^\\s*(\\w*)\\s.*", "\\1", tax_x$taxon, perl=TRUE)
tax_x$S <- gsub("^\\s*(\\w+)\\s(\\w+)\\s*.*", "\\2", tax_x$taxon, perl=TRUE)
head(tax_x)
# View(tax_x)
# now gather abundances based on perfect matches to each "Genus species" combo
kIDs_per_unique_binomials <- apply( unique(tax_x[,c("G", "S")]), 1, function(aa){ rownames(tax_x)[(tax_x$G == unlist(aa[1])) & (tax_x$S == unlist(aa[2]))] })
k2only_counts <- t(sapply( kIDs_per_unique_binomials, function(aa){ colSums(kfeat[ aa , ]) }))
# make a taxonomy file, based on teh work done above using krak2_mpa_file
k2only_tax2 <- tax_c[kraken_ids , ]
# check dimensions / orientation
dim(k2only_counts)
dim(k2only_tax)
From here, k2only_counts
and k2only_tax
can
be used instead of k2_counts
& k2_tax
made
above (i.e. use them to make phyloseq
etc.), though you
will have to edit the names as necessary. Do also check that the
taxa
/samples
are on the right sides!
Kaiju
data:In case we have outputs for Kaiju
: point to output file,
and then manipulate that data. We extract count and %
data
directly from the kaidat
object, and set
kaiju IDs
as the ânamesâ of the taxa.
Note, had to delete some â#
â characters from file, which
gives errors.
## note the quote option, to stop strange names doing strange things
kaidat <- read.table(file = kaij_file, header = TRUE, sep = "\t", stringsAsFactors = FALSE, quote = "")
# remove 1+ header(s)
kaidat <- kaidat[ !grepl("file", kaidat[,1]) , ]
dim(kaidat)
head(kaidat)
# View(kaidat)
## fix sample names
kaidat$file <- gsub(".*/(.*)_kaiju.*", "\\1", kaidat$file)
kaidat$otu <- gsub( ".*;(.*);", "\\1", kaidat$taxon_name )
kaidat$ID <- paste0( "kai__", stringr::str_pad( kaidat[, "taxon_id"], width = 7, pad = 0) )
length(unique(kaidat$file)) # 20 files
length(unique(kaidat$otu)) # 11.7K OTUs
length(unique(kaidat$ID)) # 11.7K IDs
kaidat$percent <- as.numeric(as.character(kaidat$percent))
# should be very close to nsamples * 100
sum(kaidat$percent, na.rm = TRUE)
head(kaidat) # View(kaidat)
str(kaidat)
## assess total output ----------------------------
kaidf <- as.data.frame(kaidat)
kaidf[ , "reads"] <- as.numeric(kaidf[ , "reads"])
# again, counts and percent
sum( (kaidf[ , "reads"]))
sum( (kaidf[ , "percent"]), na.rm = TRUE) # nsample*100%
## nreads unclassified / sent to bin
kaidf$substrate <- gsub("([A-Z]).*", "\\1", kaidf$file)
# ggplot(kaidf, aes(log10(reads), fill = substrate)) + geom_boxplot() + labs(title = "total N reads per substrate")
## congeal -------------------------------------------
# use the Kaiju ID, bot the genus name
u_bio <- unique( kaidf$ID)
u_sample <- unique( kaidf$file)
## count data
# to each unique taxon found (u_bio), assign the number of counts found in each unique sample (u_sample)
ka_counts <- do.call("rbind", lapply( u_bio, function(aa){ # aa <- u_bio[100]
bb_df <- kaidf[ grep(aa, kaidf$ID) , ]
cc_vec <- unlist(lapply( u_sample, function(aaa){ # aaa <- u_sample[9]
as.numeric(bb_df[ match(aaa, bb_df$file) , "reads" ])
}))
} )
)
rownames(ka_counts) <- u_bio # [1:100]
colnames(ka_counts) <- u_sample
# only keep real numbers
ka_counts[ is.na(ka_counts)] <- 0
dim( ka_counts <- ka_counts[ , !apply( ka_counts, 2, function(aa){ all(aa == 0)}) ] )
## % data
# same, but make relative abudances using the values given by kaiju
ka_relab <- do.call("rbind", lapply( u_bio, function(aa){ # aa <- u_bio[100]
bb_df <- kaidf[ grep(aa, kaidf$ID) , ]
cc_vec <- unlist(lapply( u_sample, function(aaa){ # aaa <- u_sample[9]
as.numeric(bb_df[ match(aaa, bb_df$file) , "percent" ])
}))
} )
)
rownames(ka_relab) <- u_bio # [1:100]
colnames(ka_relab) <- u_sample
# only keep real numbers
ka_relab[ is.na(ka_relab)] <- 0
dim( ka_relab <- ka_relab[ , !apply( ka_relab, 2, function(aa){ all(aa == 0)}) ] )
# look
ka_counts[1:10,1:10]
ka_relab[1:10,1:10]
Again, we have to fabricate out own taxonomic table, while taking care to manage any irregularities.
ka_tax <- unique( t(# do.call("rbind",
apply( kaidf[ , c("ID", "taxon_name")], 1, function(aa){ # aa <- kaidf[ 1, c("ID", "taxon_name" )]
bb_vec <- aa[2]
if( grepl("Viruses", bb_vec)){
cc_vec <- c( rep( "Viruses", length(2:7)), aa[1])
}else{
cc_vec <- c( unlist(strsplit(unlist(bb_vec), ";"))[ 2:7], aa[1] )
}
cc_vec }) ))
# set names
rownames(ka_tax) <- ka_tax[,7]
colnames(ka_tax) <- c("p", "c", "o", "f", "g", "s", "ID") # Jun'22 - remove "d" here, increase index of all names by 1
head(ka_tax)
dim(ka_tax <- (unique(ka_tax)))
Note - at this stage weâve reduced gigabytes of sequence data into ~3
basic spreadsheets (.tsv
), irrespective of whether it came
from Kraken2
or Kaiju
(or some other program)
- from here on, the data used can be processed using standard methods
from biology and data-science, whether thatâs in R
,
python
, MSExcel
etc.
## general metadata
write.table(mgdat, "../output/rmicro__metadata__dat.tsv", sep = "\t")
saveRDS( mgdat, "../output/rmicro__metadata__dat.RDS")
## for Kraken2:
## tsv output
write.table(k2_counts, "../output/rmicro__krakenStnd__num.tsv", sep = "\t")
write.table(k2_relab, "../output/rmicro__krakenStnd__frac.tsv", sep = "\t")
write.table(k2_tax, "../output/rmicro__krakenStnd__tax.tsv", sep = "\t")
## RDS for R output - note samples are ROWS
saveRDS( t(k2_counts), "../output/rmicro__krakenStnd__num.RDS")
saveRDS( t(k2_relab), "../output/rmicro__krakenStnd__frac.RDS")
saveRDS( k2_tax, "../output/rmicro__krakenStnd__tax.RDS")
## for Kaiju:
## tsv output
write.table(ka_counts, "../output/rmicro__kaijuStnd__num.tsv", sep = "\t")
write.table(ka_relab, "../output/rmicro__kaijuStnd__frac.tsv", sep = "\t")
write.table(ka_tax, "../output/rmicro__kaijuStnd__tax.tsv", sep = "\t")
## RDS for R output - note samples are ROWS
saveRDS( t(ka_counts), "../output/rmicro__kaijuStnd__num.RDS")
saveRDS( t(ka_relab), "../output/rmicro__kaijuStnd__frac.RDS")
saveRDS( ka_tax, "../output/rmicro__kaijuStnd__tax.RDS")
R::phyloseq
More info on the gateway drug that is phyloseq
, see the
phyloseq main page, and
here
for importing data. Note again that here, it doesnât matter whether itâs
Kraken2
or Kaiju
(or some other program) -
from here on, the data used can be processed using standard methods from
biology and data-science
## note: we use the count data (making % is easy in R, but
## - we would have no way of knowing the original counts)
## note also: we did not extract any phylogenetic tree from
## - the shotgun data, so no tree is included below
library(phyloseq)
## Kraken2/Bracken ---
# look at the first ten rows, ten columns:
k2_counts[1:10, 1:10]
k2_relab[1:10, 1:10]
k2_tax[1:10, 1:5]
mgdat[1:10, ]
# note: we use the count data
k2_otu <- otu_table(k2_counts, taxa_are_rows = TRUE)
# convert from df to matrix to tax_table to avoid losing rownames...
k2_tax2 <- tax_table(as.matrix(k2_tax))
(k2_phyloseq_dat <- phyloseq(
k2_otu,
k2_tax2,
# note no phylogenetic tree
sample_data(mgdat)
))
saveRDS( k2_phyloseq_dat, "../output/rmicro__KrakenStnd__phylo.RDS")
# check: save and read back in. RDS is a handy way to save/load data
k2_phylo <- readRDS("../output/rmicro__KrakenStnd__phylo.RDS")
k2_phylo
## Kaiju ----------
# look at the first ten rows, ten columns:
ka_counts[1:10, 1:10]
ka_relab[1:10, 1:10]
ka_tax[1:10, 1:5]
mgdat[1:10, ]
# note: we use the count data..
# sample names need to match - here we solve an issue between "-" and "."
colnames(ka_counts) <- gsub("-", ".", colnames(ka_counts))
dim(kai_otu <- otu_table(ka_counts, taxa_are_rows = TRUE))
# minimum modification of sample names
colnames(kai_otu) <- paste0("X", colnames(kai_otu))
dim(kai_tax2 <- tax_table(ka_tax))
(kai_phyloseq_dat <- phyloseq(kai_otu, kai_tax2, sample_data(mgdat)))
saveRDS( kai_phyloseq_dat, "../output/rmicro__KaijuStnd__phylo.RDS")
# demo - save out and read back in again
kai_phylo <- readRDS("../output/rmicro__KaijuStnd__phylo.RDS")
kai_phylo
That was, hopefully, easy! Take this data, and have a look at the phyloseq guide to exploring
data, their inspirational F1000 expanded methods
paper, a quick list of related
commands, or other guides as recommended on the climber
page.