Workflow is complete, but the page remains under construction 🚧


The R work starts here 😬, continuing from the work assembling the metagenomic data.


Introduction

Getting metagenomic data from bash & co. into R can be awkward - this is just a quick run-through of one way it can be done: creating basic .tsv files (as enjoyed by Microsoft Excel), bringing them into base R first, and then into the great R library phyloseq. Feel free to suggest alternate routes.

Here, we do the following for Kraken2+Bracken, and separately for Kaiju:

  1. import the data from Bracken or Kaiju into R
  2. re-arrange the outputs in R in one of two ways: a. reorganise them for Bracken, or b. reorganise them for Kaiju
  3. save this “basic” output as .tsv text files that can be used in e.g. Excel (if you like Excel)
  4. export this output to phyloseq, an R library specifically for microbial community ecology, designed to keep all the community data in one place and make using R easier for microbiologists, virologists, mycologists & co.

Note: it’s the same idea for Kraken2+Bracken and for Kaiju, except that step 2 differs (because Bracken output is different from Kaiju output). Note also that we’re using the outputs from the first page.

0. Set up R

This might not be necessary for you, but here we set up R using conda - this creates a local version that we have control over. We start off in *bash*, using conda to get going.

# R! - requires conda to be installed
conda create -n fyR
conda activate fyR

# get a personal copy of R; -y flag for silent install
# we'll still have to install libraries of interest 
conda install -c r -y r

# dir for microbial community
R=$WRK/5__rmicro
mkdir $R
mkdir $R/analysis $R/input $R/output $R/vis $R/documents

# arrange your outputs from Kraken2 & Kaiju
cp $KROUT/*tsv $R/input
cp $KaOUT/*tsv $R/input

# check picking up the correct env
which R

# here we go!
R
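
Once inside R, we still need the libraries used below. A minimal sketch of the one-off installs (phyloseq comes from Bioconductor, stringr from CRAN; phyloseq can also be installed via conda/bioconda if you prefer):

  # one-off installs, run from inside R; repositories/versions may differ on your system
  install.packages(c("stringr", "BiocManager"))
  BiocManager::install("phyloseq")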

1. Import data to R

Point to your output files, and import them: here there are 2 files for Kraken2, 1 file for Kaiju.

## Kraken2

# output from kraken2-bracken, combined to one file
  krak2_brack_file="../input/rmicro__krakenStnd_abundances.tsv"

# output from single kraken2 report, converted to MPA
  krak2_mpa_file="../input/rmicro__krakenStnd_taxonomy.tsv"


## Kaiju
  
# output from kaiju's kaiju2table output
  kaij_file="../input/rmicro__total__kaiju_summary_species.tsv"

2. Re-arrange data

There are many, many ways to do things, in R and in general - note that this page avoids tidyverse solutions, but feel free to suggest some.

2a. for Kraken2+Bracken data:

Point to the output file, and import it:

  k2dat <- read.table(file = krak2_brack_file, header=TRUE, quote = "", sep = '\t')
  k2dat[1:10, 1:10]
  
  prenames_k2dat <- paste0("k2_", stringr::str_pad(k2dat[,2], width=7, pad=0))
  ## non-unique names -  assume only one dupe of each
  prenames_k2dat[ duplicated(k2dat[,2])] <- gsub('k2_', 'k2-A_', prenames_k2dat[ duplicated(k2dat[,2])])
  rownames(k2dat) <- prenames_k2dat
  
  # counts
  k2_counts <- k2dat[ , grep('_num', colnames(k2dat))]
  colnames(k2_counts) <- gsub( ".bracken_num", "", colnames(k2_counts) )
  
  # relab%
  k2_relab <- k2dat[ , grep('_frac', colnames(k2dat))]
  colnames(k2_relab) <- gsub( ".bracken_frac", "", colnames(k2_relab) )

We need to make taxonomic data from the material we have (we made files for this using --report-zero-counts and kreport2mpa.py). Note that this takes a little organisation, as the taxonomic hierarchy is not the same across all domains of life (because it is imaginary).

## start with data from the abundance table names  --------------------------------

  tax_a <- data.frame(taxon=k2dat[ ,1], k2_id=k2dat[ ,2], row.names = rownames(k2dat) , stringsAsFactors = FALSE)
  tax_a[,1] <- gsub('\\/','', tax_a[,1])
  tax_a[,1] <- gsub('ALOs_','ALOs-', tax_a[,1])
  # tax_a[,1] <- gsub("'","", tax_a[,1])      # none?

  
  ## NOTE use quote="" to avoid nightmare of special characters like '"`/,| etc in names
  tax_b <- read.table( krak2_mpa_file, sep='\t', header=FALSE, fill = TRUE, stringsAsFactors = FALSE, quote="")
  head(tax_b)
  
## regularise taxonomic ranks (not all have K:P:C:O:F:G:S etc.)   -----------------
  
  ranks <- c("k__", "p__", "c__", "o__" , "f__", "g__", "s__")
  tax_b <- t(apply(tax_b, 1, function(aa){  # aa <- tax_b[1,]
    sapply(ranks, function(aaa){ # aaa <- "c__"
      bbb <- grep(aaa, unlist(aa), value=TRUE)
      ifelse( length(bbb) == 0, "unkn.", bbb)
    })
  }))
  dim(tax_b) ; str(tax_b)
  
  ## e.g. eukarya have loads of ranks, so many empty spaces to be filled, e.g.:
  # tax_b[4000:4200,]
  
## more tax issues due to names: doubled ranks:
  prior_index <- (unlist( lapply(1:nrow(tax_b), function(aa){ if( sum( grepl("unkn.", tax_b[aa,])) == 6){aa} }) ) - 1)
  # # compare:
  # tax_b[ prior_index , ]
  # tax_b[ prior_index+1 , ]
  # amend, by doubling up (will remove duplicates later)
  tax_b[ prior_index , "s__"] <- tax_b[ (prior_index+1) , "s__"]
  # a row is only useful if it has a species: rows with 6 unknown ranks have now had their species copied up (or never had one), so drop them
  tax_b <- tax_b[ !apply(tax_b, 1, function(aa) sum(grepl("unkn.", unlist(aa)) ) == 6) , ]
  
  colnames(tax_b) <- c("D", "P", "C", "O", "F", "G", "S")
  
  
  tax_a[ , "taxon"] <- gsub(" ", "_", paste0("s__", tax_a[ , "taxon"]))
  head(tax_a) ; head(tax_a[ , "taxon"]) ;  dim(tax_a)
  head(tax_b) ;   head(tax_b[ , "S"]) ;  dim(tax_b)
  
  
## stitch  &  name   ----------------------------------
  
  tax_c <- merge(tax_b, tax_a, by.x="S", by.y="taxon")[, c(2:7,1,8)]
  head(tax_c) ; dim(tax_c)
  prenames_tax <- paste0("k2_", stringr::str_pad(tax_c[,"k2_id"], width=7, pad=0))
  rownames(tax_c) <- prenames_tax
  
  
## check all on the same page   -----------------------
  
  all(rownames(k2dat) == rownames(k2_counts) & rownames(k2dat) == rownames(k2_relab))
  all(rownames(k2dat) %in% rownames(tax_c))
  
  kraken_ids <- sort(rownames(k2dat))
  k2dat <- k2dat [kraken_ids , ]
  k2_counts <- k2_counts [kraken_ids , ]
  k2_relab <- k2_relab [kraken_ids , ]
  tax_c <- tax_c[kraken_ids , ]
  
  ## remove if no values (kraken2 set to give all taxa)
  pos_abund <- rownames(k2_relab)[rowSums(k2_relab) > 0]
  k2_relab <- k2_relab[ pos_abund , ]
  k2_counts <- k2_counts[ pos_abund , ]
  k2_tax <- tax_c[ pos_abund , ]

Next, add in the metadata we use to organise this study. Environmental variables like pH, SCFA, dates, locations, and other study design data go in here. We create a minimal set based on the data available:

  mgdat <- data.frame("sample" = colnames(k2_counts),
                     "type" = gsub("(.).*", "\\1", colnames(k2_counts), perl = TRUE),
                     "seqdepth" = colSums(k2_counts) )
  # View(mgdat)
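
If you have a real sample sheet (pH, SCFA, dates, locations, etc.), a minimal sketch for attaching it - the file name and columns here are hypothetical, so adjust them to your own study design:

  # hypothetical metadata file, keyed by sample name
  # env <- read.table("../input/sample_metadata.tsv", header = TRUE, sep = "\t",
  #                    stringsAsFactors = FALSE, quote = "")
  # mgdat <- merge(mgdat, env, by = "sample", all.x = TRUE)
  # rownames(mgdat) <- mgdat$sample   # phyloseq's sample_data() matches samples by rownames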

Next, check that all the data looks sane and normal.

## check - do all match?   -----------------------

  dim(k2_counts)
  dim(k2_relab)
  dim(k2_tax)
  dim(mgdat)
  
  # look
  k2_counts[1:10, 1:10]
  
  # View(k2_counts)
  # hist( log10(k2_counts ))

2b. Kraken2 on its own :(

Note that this isn’t recommended.

You might not want to use Bracken (or, more likely, you might not be able to make a Bracken database). It is not the best idea, but you can:

  • combine all the Kraken2 reports to one file
  • get all the species (S) and strains (S1) from that report
  • extract the taxonomic info, and combine all the abundances at Species level

First, we get the species data. Be careful here that you don’t accidentally include the $TEST file too! This would duplicate that sample.

##   i n   b a s h   ================================================== ##

# presume you've made a load of kraken2 files:
ls -lsh $KROUT/*kraken_report

# be careful here that you don't accidentally include the $TEST file too! This would duplicate that sample
~/bin/KrakenTools/combine_kreports.py -r $KROUT/*kraken_report -o $KROUT/krakenStnd_kreportCombo.tsv

# table of sorts
grep -E 'perc\stot_all' $KROUT/krakenStnd_kreportCombo.tsv | sed 's/#//' > $KROUT/krakenStnd_kreportCombo_count.tsv
grep -E '\sS1?\s' $KROUT/krakenStnd_kreportCombo.tsv >> $KROUT/krakenStnd_kreportCombo_count.tsv

Put the file where it needs to be (assuming alongside krak2_brack_file & krak2_mpa_file described above). Note the structure of this data using head(): there are two columns for each sample: all (all reads from this point onwards in the tree), and lvl (reads assigned to this specific level in the tree). If we selected the all columns, we would count the same reads multiple times, once at each level. The lvl counts are the abundances assigned specifically to that taxonomic point - this is what we will use.

As there will be several strains of e.g. Streptococcus thermophilus (S1 level), we will gather read abundances for all those strains to Species level by finding all taxa whose names contain Streptococcus thermophilus (and do this for all taxa present).

Note that we also make a taxonomy table based on the work above, to try to match the feature IDs properly.
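
As a toy sketch of that gathering step (entirely made-up counts - the real version runs on the imported data below), three strain rows collapse into one species row:

just_a_toy <- data.frame(row.names = c("k2_0000001", "k2_0000002", "k2_0000003"),
                         G = rep("Streptococcus", 3),
                         S = rep("thermophilus", 3),
                         sampleA_lvl = c(10, 5, 2),
                         sampleB_lvl = c(0, 3, 1))
# sum the per-strain counts within each "Genus species" combination
rowsum(just_a_toy[ , c("sampleA_lvl", "sampleB_lvl")], group = paste(just_a_toy$G, just_a_toy$S))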

just_krak <- "../input/krakenStnd_kreportCombo_count.tsv"
krak2 <- read.table(just_krak, header=TRUE, sep="\t", quote = "")
rownames(krak2) <- paste0("k2_", stringr::str_pad(krak2$taxid, width=7, pad=0))
head(krak2)
# View(krak2)

# choose to keep the abundances at "_lvl", rather than at "_all" 
kfeat <- krak2[ , grep("\\d_lvl$", colnames(krak2), value= TRUE, perl=TRUE) ]
head(kfeat)
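
# optional sanity check on the "_lvl" vs "_all" distinction described above
# (this assumes the per-sample columns end in "_lvl" / "_all", as in combine_kreports.py output):
# an "_all" count covers a node plus everything below it, so it should never be
# smaller than the matching "_lvl" count
kall <- krak2[ , grep("\\d_all$", colnames(krak2), value = TRUE, perl = TRUE) ]
all(as.matrix(kall) >= as.matrix(kfeat))   # expect TRUE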

# fix some names, esp for the next step
tax_x <- data.frame(taxon=krak2[ ,30], k2_id=krak2[ ,29], row.names = rownames(krak2) , stringsAsFactors = FALSE)
tax_x[,1] <- gsub('\\/','', tax_x[,1])
tax_x[,1] <- gsub('ALOs_','ALOs-', tax_x[,1])
tax_x[,1] <- gsub('[\\[\\]]','', tax_x[,1], perl = TRUE)

# make a clean genus and species column
tax_x$G <- gsub("^\\s*(\\w*)\\s.*", "\\1", tax_x$taxon, perl=TRUE)
tax_x$S <- gsub("^\\s*(\\w+)\\s(\\w+)\\s*.*", "\\2", tax_x$taxon, perl=TRUE)
head(tax_x)
# View(tax_x)

# now gather abundances based on perfect matches to each "Genus species" combo
kIDs_per_unique_binomials <- apply( unique(tax_x[,c("G", "S")]), 1, function(aa){ rownames(tax_x)[(tax_x$G == unlist(aa[1])) & (tax_x$S == unlist(aa[2]))] })
k2only_counts <- t(sapply( kIDs_per_unique_binomials, function(aa){ colSums(kfeat[ aa , ]) }))

# make a taxonomy file, based on the work done above using krak2_mpa_file
k2only_tax <- tax_c[kraken_ids , ]

# check dimensions / orientation
dim(k2only_counts)
dim(k2only_tax)

From here, k2only_counts and k2only_tax can be used instead of k2_counts & k2_tax made above (i.e. use them to make phyloseq etc.), though you will have to edit the names as necessary. Do also check that the taxa/samples are on the right sides!
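
For example, a minimal sketch of slotting them in (reusing the object names from 2a, so the later phyloseq code runs unchanged - check dimensions and names first):

# k2_counts <- k2only_counts
# k2_tax    <- k2only_tax
# # taxa should be rows and samples columns before otu_table() is called below
# dim(k2_counts) ; dim(k2_tax)
# all(rownames(k2_counts) %in% rownames(k2_tax))   # FALSE means the IDs still need aligning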

2c. for Kaiju data:

If we have outputs from Kaiju: point to the output file, and then manipulate that data. We extract count and % data directly from the kaidat object, and set Kaiju IDs as the “names” of the taxa.

Note: we had to delete some “#” characters from the file, as they cause errors.
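
Alternatively, read.table's comment.char argument (which defaults to "#") can be switched off, which should avoid the hand-editing - an untested tweak of the call below:

  # kaidat <- read.table(file = kaij_file, header = TRUE, sep = "\t",
  #                      stringsAsFactors = FALSE, quote = "", comment.char = "")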

## note the quote option, to stop strange names doing strange things

  kaidat <- read.table(file = kaij_file, header = TRUE, sep = "\t", stringsAsFactors = FALSE, quote = "")
  # remove 1+ header(s)
  kaidat <- kaidat[ !grepl("file", kaidat[,1]) , ]
  dim(kaidat)
  head(kaidat)
  # View(kaidat)
  
  ## fix sample names
  kaidat$file <- gsub(".*/(.*)_kaiju.*", "\\1", kaidat$file)
  kaidat$otu <- gsub( ".*;(.*);", "\\1", kaidat$taxon_name )
  kaidat$ID <- paste0( "kai__", stringr::str_pad( kaidat[, "taxon_id"], width = 7, pad = 0) )
  length(unique(kaidat$file))         # 20 files
  length(unique(kaidat$otu))          # 11.7K OTUs
  length(unique(kaidat$ID))           # 11.7K IDs
  kaidat$percent <- as.numeric(as.character(kaidat$percent))
  
  # should be very close to nsamples * 100
  sum(kaidat$percent, na.rm = TRUE)

  head(kaidat) # View(kaidat)
  str(kaidat)

  
## assess total output    ----------------------------
  
  kaidf <- as.data.frame(kaidat) 
  kaidf[ , "reads"] <- as.numeric(kaidf[ , "reads"])
  
  # again, counts and percent
  sum( (kaidf[ , "reads"]))
  sum( (kaidf[ , "percent"]), na.rm = TRUE)  # nsample*100%
  
  ## nreads unclassified / sent to bin
  kaidf$substrate <- gsub("([A-Z]).*", "\\1", kaidf$file)
  # ggplot(kaidf, aes(log10(reads), fill = substrate)) + geom_boxplot() + labs(title = "total N reads per substrate")

  
## congeal  -------------------------------------------
  
  # use the Kaiju ID, not the genus name
    u_bio <- unique( kaidf$ID)
    u_sample <- unique( kaidf$file)
    
  ## count data
  # to each unique taxon found (u_bio), assign the number of counts found in each unique sample (u_sample)
    ka_counts <- do.call("rbind", lapply( u_bio, function(aa){   # aa <- u_bio[100]
      bb_df <- kaidf[ grep(aa, kaidf$ID) , ]
      cc_vec <- unlist(lapply( u_sample, function(aaa){ # aaa <- u_sample[9]
        as.numeric(bb_df[ match(aaa, bb_df$file) , "reads" ])
      }))
    } )
    )
    rownames(ka_counts) <- u_bio # [1:100]
    colnames(ka_counts) <- u_sample
    
    # replace NAs with 0, then drop any all-zero samples (columns)
    ka_counts[ is.na(ka_counts)] <- 0
    dim( ka_counts <- ka_counts[ , !apply( ka_counts, 2, function(aa){ all(aa == 0)}) ] )

        
  ## % data    
  # same, but make relative abundances using the values given by kaiju
    ka_relab <- do.call("rbind", lapply( u_bio, function(aa){   # aa <- u_bio[100]
      bb_df <- kaidf[ grep(aa, kaidf$ID) , ]
      cc_vec <- unlist(lapply( u_sample, function(aaa){ # aaa <- u_sample[9]
        as.numeric(bb_df[ match(aaa, bb_df$file) , "percent" ])
      }))
    } )
    )
    rownames(ka_relab) <- u_bio # [1:100]
    colnames(ka_relab) <- u_sample
    
    # replace NAs with 0, then drop any all-zero samples (columns)
    ka_relab[ is.na(ka_relab)] <- 0
    dim( ka_relab <- ka_relab[ , !apply( ka_relab, 2, function(aa){ all(aa == 0)}) ] )
    
    
    # look
    ka_counts[1:10,1:10]
    ka_relab[1:10,1:10]
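
    ## optional: a quicker cross-check of the "congeal" step above - base R's xtabs()
    ## builds the same kind of taxon-by-sample count table in one line, though with
    ## rows/columns in a different order to ka_counts
    # ka_counts_alt <- xtabs(reads ~ ID + file, data = kaidf)
    # dim(ka_counts_alt)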

Again, we have to fabricate our own taxonomic table, while taking care to manage any irregularities.

  ka_tax <- unique( t(
    apply( kaidf[ , c("ID", "taxon_name")], 1, function(aa){   # aa <- kaidf[ 1, c("ID", "taxon_name" )]
      bb_vec <- aa[2]
      if( grepl("Viruses", bb_vec)){ 
        cc_vec <- c( rep( "Viruses", length(2:7)), aa[1])
      }else{
        cc_vec <- c( unlist(strsplit(unlist(bb_vec), ";"))[ 2:7], aa[1] )
      }
      cc_vec  }) ))


  # set names
  rownames(ka_tax) <- ka_tax[,7]
  colnames(ka_tax) <- c("p", "c", "o", "f", "g", "s", "ID")   # Jun'22 - remove "d" here, increase index of all names by 1
  head(ka_tax)
  dim(ka_tax <- (unique(ka_tax)))

3. Save the output in a basic format

Note - at this stage we’ve reduced gigabytes of sequence data to ~3 basic spreadsheets (.tsv), irrespective of whether it came from Kraken2 or Kaiju (or some other program) - from here on, the data can be processed using standard methods from biology and data science, whether that’s in R, Python, MS Excel, etc.

## general metadata
    write.table(mgdat, "../output/rmicro__metadata__dat.tsv", sep = "\t")
    saveRDS( mgdat, "../output/rmicro__metadata__dat.RDS")


## for Kraken2:
  
  ## tsv output
    write.table(k2_counts, "../output/rmicro__krakenStnd__num.tsv", sep = "\t")
    write.table(k2_relab, "../output/rmicro__krakenStnd__frac.tsv", sep = "\t")
    write.table(k2_tax, "../output/rmicro__krakenStnd__tax.tsv", sep = "\t")
    
  ## RDS for R output - note samples are ROWS
    saveRDS( t(k2_counts), "../output/rmicro__krakenStnd__num.RDS")
    saveRDS( t(k2_relab), "../output/rmicro__krakenStnd__frac.RDS")
    saveRDS( k2_tax, "../output/rmicro__krakenStnd__tax.RDS")

        
## for Kaiju:

  ## tsv output
    write.table(ka_counts, "../output/rmicro__kaijuStnd__num.tsv", sep = "\t")
    write.table(ka_relab, "../output/rmicro__kaijuStnd__frac.tsv", sep = "\t")
    write.table(ka_tax, "../output/rmicro__kaijuStnd__tax.tsv", sep = "\t")
    
  ## RDS for R output - note samples are ROWS
    saveRDS( t(ka_counts), "../output/rmicro__kaijuStnd__num.RDS")
    saveRDS( t(ka_relab), "../output/rmicro__kaijuStnd__frac.RDS")
    saveRDS( ka_tax, "../output/rmicro__kaijuStnd__tax.RDS")

4. Export to R::phyloseq

For more info on the gateway drug that is phyloseq, see the phyloseq main page, and here for importing data. Note again that at this stage it doesn’t matter whether the data came from Kraken2 or Kaiju (or some other program) - from here on, it can be processed using standard methods from biology and data science.

  ## note: we use the count data (making % is easy in R, but 
  ##  - we would have no way of knowing the original counts)

  ## note also: we did not extract any phylogenetic tree from 
  ##  - the shotgun data, so no tree is included below

    library(phyloseq)

## Kraken2/Bracken   ---

  # look at the first ten rows, ten columns:
    k2_counts[1:10, 1:10]
    k2_relab[1:10, 1:10]
    k2_tax[1:10, 1:5]
    mgdat[1:10,  ]

    # note: we use the count data
    k2_otu <- otu_table(k2_counts, taxa_are_rows = TRUE)
    # convert from df to matrix to tax_table to avoid losing rownames...
    k2_tax2 <- tax_table(as.matrix(k2_tax))

    (k2_phyloseq_dat <- phyloseq(
      k2_otu,
      k2_tax2,
      # note no phylogenetic tree
      sample_data(mgdat)
      ))
    
    saveRDS( k2_phyloseq_dat, "../output/rmicro__KrakenStnd__phylo.RDS")
    # check: save and read back in. RDS is a handy way to save/load data
    k2_phylo <- readRDS("../output/rmicro__KrakenStnd__phylo.RDS")
    k2_phylo

            
## Kaiju   ----------
    
  # look at the first ten rows, ten columns:
    ka_counts[1:10, 1:10]
    ka_relab[1:10, 1:10]
    ka_tax[1:10, 1:5]
    mgdat[1:10, ]

  # note: we use the count data..
    # sample names need to match - here we solve an issue between "-" and "."
    colnames(ka_counts) <- gsub("-", ".", colnames(ka_counts))
    dim(kai_otu <- otu_table(ka_counts, taxa_are_rows = TRUE))
    # minimum modification of sample names
    colnames(kai_otu) <- paste0("X", colnames(kai_otu))
    dim(kai_tax2 <- tax_table(ka_tax))

    (kai_phyloseq_dat <- phyloseq(kai_otu, kai_tax2, sample_data(mgdat)))

    saveRDS( kai_phyloseq_dat, "../output/rmicro__KaijuStnd__phylo.RDS")
    # demo - save out and read back in again
    kai_phylo <- readRDS("../output/rmicro__KaijuStnd__phylo.RDS")
    kai_phylo

Next Steps

That was, hopefully, easy! Take this data, and have a look at the phyloseq guide to exploring data, their inspirational F1000 expanded methods paper, a quick list of related commands, or other guides as recommended on the climber page.
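
A few starting points, as a rough sketch using the k2_phylo object built above (the same calls work on kai_phylo; these plots are just examples, not the guides' own code):

    # agglomerate to phylum level (our Kraken2 rank columns are D/P/C/O/F/G/S)
    k2_phylum <- tax_glom(k2_phylo, taxrank = "P")

    # alpha diversity and a quick composition plot, split by sample type
    estimate_richness(k2_phylo, measures = c("Observed", "Shannon"))
    plot_richness(k2_phylo, x = "type", measures = c("Observed", "Shannon"))
    plot_bar(k2_phylum, fill = "P")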