Rworkflow is complete, but page remains under construction đ§
The R work starts here đŹ, continued from work
assembling the metagenomic
data.
Getting metagenomic data from bash & co. to
R can be awkward - this is just a quick run-through of one
way it can be done, creating basic .tsv files (as enjoyed
by microsoft excel), bringing them into basic R first, and
then to the great R library phyloseq. Feel
free to suggest alternate routes.
Here, we do the following: for Kraken2+Bracken, and
separately for `Kaiju:
Bracken or
Kaiju into R:R in one of two ways: a.
reorganise it for Bracken b. reorganise it for
Kaiju.tsv text files, that can
be used in e.g. excel (if you like excel)phyloseq,
a R library specifically for doing microbial
community ecology, designed to keep all the community data in one place,
and make using R easier for microbiologists, virologists,
mycologists & co.Note: - itâs the same idea for
Kraken2-Bracken and for Kaiju, except that
step 2 is different (because Bracken output is different
from Kaiju output). Note also that weâre using the outputs
from the first
page.
RThis might not be necessary for you, but here we set up
R using conda - this creates a local version
that we have control over. We start off in *bash*, using conda
to get going.
# R! - requires conda to be installed
conda create -n fyR
conda activate fyR
# get a personal copy of R; -y flag for silent install
# we'll still have to install libraries of interest
conda install -c r -y r
# dir for microbial community
R=$WRK/5__rmicro
mkdir $R
mkdir $R/analysis $R/input $R/output $R/vis $R/documents
# arrange your outputs from Krak & kaiju
cp $KROUT/*tsv $R/input
cp $KaOUT/*tsv $R/input
# check picking up the correct env
which R
# here we go!
R
RPoint to your output files, and import them: here there are 2 files
for Kraken2, 1 file for Kaiju.
## Kraken2
# output from kraken2-bracken, combined to one file
krak2_brack_file="../input/rmicro__krakenStnd_abundances.tsv"
# output from single kraken2 report, converted to MPA
krak2_mpa_file="../input/rmicro__krakenStnd_taxonomy.tsv"
## Kaiju
# output from kaiju's kaiju2table output
kaij_file="../input/rmicro__total__kaiju_summary_species.tsv"
There are many, many ways to do things (generally and) in
R - note that this page avoids using R::tidy
solutions, but feel free to suggest some.
Kraken2+Bracken data:Point to output file, and import
k2dat <- read.table(file = krak2_brack_file, header=TRUE, quote = "", sep = '\t')
k2dat[1:10, 1:10]
prenames_k2dat <- paste0("k2_", stringr::str_pad(k2dat[,2], width=7, pad=0))
## non-unique names - assume only one dupe of each
prenames_k2dat[ duplicated(k2dat[,2])] <- gsub('k2_', 'k2-A_', prenames_k2dat[ duplicated(k2dat[,2])])
rownames(k2dat) <- prenames_k2dat
# counts
k2_counts <- k2dat[ , grep('_num', colnames(k2dat))]
colnames(k2_counts) <- gsub( ".bracken_num", "", colnames(k2_counts) )
# relab%
k2_relab <- k2dat[ , grep('_frac', colnames(k2dat))]
colnames(k2_relab) <- gsub( ".bracken_frac", "", colnames(k2_relab) )
We need to make taxonomic data from the material we have (we made
files for this using --report-zero-counts and
kreport2mpa.py). Note that this takes a little
organisation, as the taxonomic hierarchy is not the same across all
domains of life (because it is imaginary).
## start with data from the abundance table names --------------------------------
tax_a <- data.frame(taxon=k2dat[ ,1], k2_id=k2dat[ ,2], row.names = rownames(k2dat) , stringsAsFactors = FALSE)
tax_a[,1] <- gsub('\\/','', tax_a[,1])
tax_a[,1] <- gsub('ALOs_','ALOs-', tax_a[,1])
# tax_a[,1] <- gsub("'","", tax_a[,1]) # none?
## NOTE use quote="" to avoid nightmare of special characters like '"`/,| etc in names
tax_b <- read.table( krak2_mpa_file, sep='\t', header=FALSE, fill = TRUE, stringsAsFactors = FALSE, quote="")
head(tax_b)
## regularise taxonomic ranks (not all have K:P:C:O:F:G:S etc.) -----------------
ranks <- c("k__", "p__", "c__", "o__" , "f__", "g__", "s__")
tax_b <- t(apply(tax_b, 1, function(aa){ # aa <- tax_b[1,]
sapply(ranks, function(aaa){ # aaa <- "c__"
bbb <- grep(aaa, unlist(aa), value=TRUE)
ifelse( length(bbb) == 0, "unkn.", bbb)
})
}))
dim(tax_b) ; str(tax_b)
## e.g. eukarya have loads of ranks, so many empty sapces to be filled, e.g.
# tax_b[4000:4200,]
## more tax issues due to names: doubled ranks:
prior_index <- (unlist( lapply(1:nrow(tax_b), function(aa){ if( sum( grepl("unkn.", tax_b[aa,])) == 6){aa} }) ) - 1)
# # compare:
# tax_b[ prior_index , ]
# tax_b[ prior_index+1 , ]
# amend, by doubling up (will remove duplicates later)
tax_b[ prior_index , "s__"] <- tax_b[ (prior_index+1) , "s__"]
# row is only useful if has a species: cannot have a species, and hit 6, And be worth keeping
tax_b <- tax_b[ !apply(tax_b, 1, function(aa) sum(grepl("unkn.", unlist(aa)) ) == 6) , ]
colnames(tax_b) <- c("D", "P", "C", "O", "F", "G", "S")
tax_a[ , "taxon"] <- gsub(" ", "_", paste0("s__", tax_a[ , "taxon"]))
head(tax_a) ; head(tax_a[ , "taxon"]) ; dim(tax_a)
head(tax_b) ; head(tax_b[ , "S"]) ; dim(tax_b)
## stitch & name ----------------------------------
tax_c <- merge(tax_b, tax_a, by.x="S", by.y="taxon")[, c(2:7,1,8)]
head(tax_c) ; dim(tax_c)
prenames_tax <- paste0("k2_", stringr::str_pad(tax_c[,"k2_id"], width=7, pad=0))
rownames(tax_c) <- prenames_tax
## check all on the same page -----------------------
all(rownames(k2dat) == rownames(k2_counts) & rownames(k2dat) == rownames(k2_relab))
all(rownames(k2dat) %in% rownames(tax_c))
kraken_ids <- sort(rownames(k2dat))
k2dat <- k2dat [kraken_ids , ]
k2_counts <- k2_counts [kraken_ids , ]
k2_relab <- k2_relab [kraken_ids , ]
tax_c <- tax_c[kraken_ids , ]
## remove if no values (kraken2 set to give all taxa)
pos_abund <- rownames(k2_relab)[rowSums(k2_relab) > 0]
k2_relab <- k2_relab[ pos_abund , ]
k2_counts <- k2_counts[ pos_abund , ]
k2_tax <- tax_c[ pos_abund , ]
Next, add in the metadata we use to organise this study. Environmental variables like pH, SCFA, dates, locations, and other study design data go in here. We create a minimal set based on the data available:
mgdat <- data.frame("sample" = colnames(k2_counts),
"type" = gsub("(.).*", "\\1", colnames(k2_counts), perl = TRUE),
"seqdepth" = colSums(k2_counts) )
# View(mgdat)
Next, check that all the data looks sane and normal.
## check - do all match? -----------------------
dim(k2_counts)
dim(k2_relab)
dim(k2_tax)
dim(mgdat)
# look
k2_counts[1:10, 1:10]
# View(k2_counts)
# hist( log10(k2_counts ))
Kraken2 on itâs own :(note that this isnât recommended
You might not want to use Bracken (or more likely, you
might not be able to make a Bracken database). It is not the best idea,
but you can
Kraken2 reports to one fileS) and strains (S1)
from that reportSpecies levelFirst, we get the species data. Be careful here that you donât
accidentally include the $TEST file too! This would
duplicate that sample.
## i n b a s h ================================================== ##
# presume you've made a load of kraken2 files:
ls -lsh $KROUT/*kraken_report
# be careful here that you don't accidentally include the $TEST file too! This would duplicate that sample
~/bin/KrakenTools/combine_kreports.py -r $KROUT/*kraken_report -o $KROUT/krakenStnd_kreportCombo.tsv
# table of sorts
grep -E 'perc\stot_all' $KROUT/krakenStnd_kreportCombo.tsv | sed 's/#//' > $KROUT/krakenStnd_kreportCombo_count.tsv
grep -E '\sS1?\s' $KROUT/krakenStnd_kreportCombo.tsv >> $KROUT/krakenStnd_kreportCombo_count.tsv
Put the file where it needs to be (assuming alongside
krak2_brack_file & krak2_mpa_file
described above). Note the structure of this data using
head(): there are two columns for each sample:
all (all reads from this point onwards in the tree), and
lvl (reads assigned to this specific level in the tree). If
we selected all, we would count the same values multiple
times at each level. The lvl counts are the abundances
assigned specifically to that taxonomic point - this is what we will
use.
As there will be several e.g. Streptococcus thermophilus
strains (S1 level), we will gather read abundances for all
those strains to Species level by finding all taxa whoâs
names contain Streptococcus thermophilus (and do this for all
taxa present).
Note that we also make a taxonomy table based on the
work above, to try match the feature IDs properly.
just_krak <- "../input/krakenStnd_kreportCombo.tsv"
krak2 <- read.table(just_krak, header=TRUE, sep="\t", quote = "")
rownames(krak2) <- paste0("k2_", stringr::str_pad(krak2$taxid, width=7, pad=0))
head(krak2)
# View(krak2)
# choose to keep the abundances at "_lvl", rather than at "_all"
kfeat <- krak2[ , grep("\\d_lvl$", colnames(krak2), value= TRUE, perl=TRUE) ]
head(kfeat)
# fix some names, esp for the next step
tax_x <- data.frame(taxon=krak2[ ,30], k2_id=krak2[ ,29], row.names = rownames(krak2) , stringsAsFactors = FALSE)
tax_x[,1] <- gsub('\\/','', tax_x[,1])
tax_x[,1] <- gsub('ALOs_','ALOs-', tax_x[,1])
tax_x[,1] <- gsub('[\\[\\]]','', tax_x[,1], perl = TRUE)
# make a clean genus and species column
tax_x$G <- gsub("^\\s*(\\w*)\\s.*", "\\1", tax_x$taxon, perl=TRUE)
tax_x$S <- gsub("^\\s*(\\w+)\\s(\\w+)\\s*.*", "\\2", tax_x$taxon, perl=TRUE)
head(tax_x)
# View(tax_x)
# now gather abundances based on perfect matches to each "Genus species" combo
kIDs_per_unique_binomials <- apply( unique(tax_x[,c("G", "S")]), 1, function(aa){ rownames(tax_x)[(tax_x$G == unlist(aa[1])) & (tax_x$S == unlist(aa[2]))] })
k2only_counts <- t(sapply( kIDs_per_unique_binomials, function(aa){ colSums(kfeat[ aa , ]) }))
# make a taxonomy file, based on teh work done above using krak2_mpa_file
k2only_tax2 <- tax_c[kraken_ids , ]
# check dimensions / orientation
dim(k2only_counts)
dim(k2only_tax)
From here, k2only_counts and k2only_tax can
be used instead of k2_counts & k2_tax made
above (i.e. use them to make phyloseq etc.), though you
will have to edit the names as necessary. Do also check that the
taxa/samples are on the right sides!
Kaiju data:In case we have outputs for Kaiju: point to output file,
and then manipulate that data. We extract count and % data
directly from the kaidat object, and set
kaiju IDs as the ânamesâ of the taxa.
Note, had to delete some â#â characters from file, which
gives errors.
## note the quote option, to stop strange names doing strange things
kaidat <- read.table(file = kaij_file, header = TRUE, sep = "\t", stringsAsFactors = FALSE, quote = "")
# remove 1+ header(s)
kaidat <- kaidat[ !grepl("file", kaidat[,1]) , ]
dim(kaidat)
head(kaidat)
# View(kaidat)
## fix sample names
kaidat$file <- gsub(".*/(.*)_kaiju.*", "\\1", kaidat$file)
kaidat$otu <- gsub( ".*;(.*);", "\\1", kaidat$taxon_name )
kaidat$ID <- paste0( "kai__", stringr::str_pad( kaidat[, "taxon_id"], width = 7, pad = 0) )
length(unique(kaidat$file)) # 20 files
length(unique(kaidat$otu)) # 11.7K OTUs
length(unique(kaidat$ID)) # 11.7K IDs
kaidat$percent <- as.numeric(as.character(kaidat$percent))
# should be very close to nsamples * 100
sum(kaidat$percent, na.rm = TRUE)
head(kaidat) # View(kaidat)
str(kaidat)
## assess total output ----------------------------
kaidf <- as.data.frame(kaidat)
kaidf[ , "reads"] <- as.numeric(kaidf[ , "reads"])
# again, counts and percent
sum( (kaidf[ , "reads"]))
sum( (kaidf[ , "percent"]), na.rm = TRUE) # nsample*100%
## nreads unclassified / sent to bin
kaidf$substrate <- gsub("([A-Z]).*", "\\1", kaidf$file)
# ggplot(kaidf, aes(log10(reads), fill = substrate)) + geom_boxplot() + labs(title = "total N reads per substrate")
## congeal -------------------------------------------
# use the Kaiju ID, bot the genus name
u_bio <- unique( kaidf$ID)
u_sample <- unique( kaidf$file)
## count data
# to each unique taxon found (u_bio), assign the number of counts found in each unique sample (u_sample)
ka_counts <- do.call("rbind", lapply( u_bio, function(aa){ # aa <- u_bio[100]
bb_df <- kaidf[ grep(aa, kaidf$ID) , ]
cc_vec <- unlist(lapply( u_sample, function(aaa){ # aaa <- u_sample[9]
as.numeric(bb_df[ match(aaa, bb_df$file) , "reads" ])
}))
} )
)
rownames(ka_counts) <- u_bio # [1:100]
colnames(ka_counts) <- u_sample
# only keep real numbers
ka_counts[ is.na(ka_counts)] <- 0
dim( ka_counts <- ka_counts[ , !apply( ka_counts, 2, function(aa){ all(aa == 0)}) ] )
## % data
# same, but make relative abudances using the values given by kaiju
ka_relab <- do.call("rbind", lapply( u_bio, function(aa){ # aa <- u_bio[100]
bb_df <- kaidf[ grep(aa, kaidf$ID) , ]
cc_vec <- unlist(lapply( u_sample, function(aaa){ # aaa <- u_sample[9]
as.numeric(bb_df[ match(aaa, bb_df$file) , "percent" ])
}))
} )
)
rownames(ka_relab) <- u_bio # [1:100]
colnames(ka_relab) <- u_sample
# only keep real numbers
ka_relab[ is.na(ka_relab)] <- 0
dim( ka_relab <- ka_relab[ , !apply( ka_relab, 2, function(aa){ all(aa == 0)}) ] )
# look
ka_counts[1:10,1:10]
ka_relab[1:10,1:10]
Again, we have to fabricate out own taxonomic table, while taking care to manage any irregularities.
ka_tax <- unique( t(# do.call("rbind",
apply( kaidf[ , c("ID", "taxon_name")], 1, function(aa){ # aa <- kaidf[ 1, c("ID", "taxon_name" )]
bb_vec <- aa[2]
if( grepl("Viruses", bb_vec)){
cc_vec <- c( rep( "Viruses", length(2:7)), aa[1])
}else{
cc_vec <- c( unlist(strsplit(unlist(bb_vec), ";"))[ 2:7], aa[1] )
}
cc_vec }) ))
# set names
rownames(ka_tax) <- ka_tax[,7]
colnames(ka_tax) <- c("p", "c", "o", "f", "g", "s", "ID") # Jun'22 - remove "d" here, increase index of all names by 1
head(ka_tax)
dim(ka_tax <- (unique(ka_tax)))
Note - at this stage weâve reduced gigabytes of sequence data into ~3
basic spreadsheets (.tsv), irrespective of whether it came
from Kraken2 or Kaiju (or some other program)
- from here on, the data used can be processed using standard methods
from biology and data-science, whether thatâs in R,
python, MSExcel etc.
## general metadata
write.table(mgdat, "../output/rmicro__metadata__dat.tsv", sep = "\t")
saveRDS( mgdat, "../output/rmicro__metadata__dat.RDS")
## for Kraken2:
## tsv output
write.table(k2_counts, "../output/rmicro__krakenStnd__num.tsv", sep = "\t")
write.table(k2_relab, "../output/rmicro__krakenStnd__frac.tsv", sep = "\t")
write.table(k2_tax, "../output/rmicro__krakenStnd__tax.tsv", sep = "\t")
## RDS for R output - note samples are ROWS
saveRDS( t(k2_counts), "../output/rmicro__krakenStnd__num.RDS")
saveRDS( t(k2_relab), "../output/rmicro__krakenStnd__frac.RDS")
saveRDS( k2_tax, "../output/rmicro__krakenStnd__tax.RDS")
## for Kaiju:
## tsv output
write.table(ka_counts, "../output/rmicro__kaijuStnd__num.tsv", sep = "\t")
write.table(ka_relab, "../output/rmicro__kaijuStnd__frac.tsv", sep = "\t")
write.table(ka_tax, "../output/rmicro__kaijuStnd__tax.tsv", sep = "\t")
## RDS for R output - note samples are ROWS
saveRDS( t(ka_counts), "../output/rmicro__kaijuStnd__num.RDS")
saveRDS( t(ka_relab), "../output/rmicro__kaijuStnd__frac.RDS")
saveRDS( ka_tax, "../output/rmicro__kaijuStnd__tax.RDS")
R::phyloseqMore info on the gateway drug that is phyloseq, see the
phyloseq main page, and
here
for importing data. Note again that here, it doesnât matter whether itâs
Kraken2 or Kaiju (or some other program) -
from here on, the data used can be processed using standard methods from
biology and data-science
## note: we use the count data (making % is easy in R, but
## - we would have no way of knowing the original counts)
## note also: we did not extract any phylogenetic tree from
## - the shotgun data, so no tree is included below
library(phyloseq)
## Kraken2/Bracken ---
# look at the first ten rows, ten columns:
k2_counts[1:10, 1:10]
k2_relab[1:10, 1:10]
k2_tax[1:10, 1:5]
mgdat[1:10, ]
# note: we use the count data
k2_otu <- otu_table(k2_counts, taxa_are_rows = TRUE)
# convert from df to matrix to tax_table to avoid losing rownames...
k2_tax2 <- tax_table(as.matrix(k2_tax))
(k2_phyloseq_dat <- phyloseq(
k2_otu,
k2_tax2,
# note no phylogenetic tree
sample_data(mgdat)
))
saveRDS( k2_phyloseq_dat, "../output/rmicro__KrakenStnd__phylo.RDS")
# check: save and read back in. RDS is a handy way to save/load data
k2_phylo <- readRDS("../output/rmicro__KrakenStnd__phylo.RDS")
k2_phylo
## Kaiju ----------
# look at the first ten rows, ten columns:
ka_counts[1:10, 1:10]
ka_relab[1:10, 1:10]
ka_tax[1:10, 1:5]
mgdat[1:10, ]
# note: we use the count data..
# sample names need to match - here we solve an issue between "-" and "."
colnames(ka_counts) <- gsub("-", ".", colnames(ka_counts))
dim(kai_otu <- otu_table(ka_counts, taxa_are_rows = TRUE))
# minimum modification of sample names
colnames(kai_otu) <- paste0("X", colnames(kai_otu))
dim(kai_tax2 <- tax_table(ka_tax))
(kai_phyloseq_dat <- phyloseq(kai_otu, kai_tax2, sample_data(mgdat)))
saveRDS( kai_phyloseq_dat, "../output/rmicro__KaijuStnd__phylo.RDS")
# demo - save out and read back in again
kai_phylo <- readRDS("../output/rmicro__KaijuStnd__phylo.RDS")
kai_phylo
That was, hopefully, easy! Take this data, and have a look at the phyloseq guide to exploring
data, their inspirational F1000 expanded methods
paper, a quick list of related
commands, or other guides as recommended on the climber
page.