

GreenGenes2-Kraken2 database: download here / end of page

GreenGenes2-Kraken2 database: skip to code only / end of page


Greengenes 2: son of Greengenes

Greengenes was a popular and comprehensive 16S database representing 201K sequences. It has recently been updated to a staggering 2M full-length sequences (plus 21M V4 sequences) as Greengenes2 (GG2, 2024).

GG2 is not natively available for use with Kraken2, everyone’s favourite k-mer based sequence classifier.

This guide aims to give quick guidance on parsing the GG2 database into an NCBI-like format (see taxdump), explicitly so that it can be used with the Kraken2 taxonomic classifier. Although the Kraken2 wiki is extensive, the method for building a custom database from scratch was not completely obvious, so it is being recorded here with the foreknowledge that jfg will forget what he has done.


Intro to kraken2 databases

We assume that you have R (+Biostrings, +parallel) and Kraken2 installed. The first step is to read the guide on the Kraken2 website about making custom databases.

These custom databases allow you to customise which parts of NCBI you use, and building one generally goes:

## not run (successfully...)

# what you want; note --download-library takes one library name per call (and the viral set is called "viral")
customDBs="archaea bacteria viral"
# where you want it
ncbiDB=/path/to/custom_ncbi

kraken2-build --download-taxonomy --db $ncbiDB
for lib in $customDBs ; do kraken2-build --download-library $lib --db $ncbiDB ; done
kraken2-build --add-to-library extra_seqs.fa --db $ncbiDB         # add extra_seqs.fa if needed
kraken2-build --build --db $ncbiDB --threads 10

Pretty cool! Having read this however, you might note that there are guides for customising databases from Silva, GTDB and NCBI - but not for creating one from scratch. Nor is Greengenes2 an officially supported database that can simply be downloaded elsewhere, so we will:


  1. download the correct GG2 sequences for the (2024.09) release
  2. download the correct GG2 taxonomy for the (2024.09) release, based on Greengenes2 “ID” (and not ASV or MD5)

Then, in R:

  1. get each unique taxon in the GG2 taxonomic tree
  2. using the Greengenes2 taxonomy, assign new, simple taxids to each node in the GG2 tree
  3. make a new names.dmp, including the new taxids for each node
  4. make a new nodes.dmp, including the new taxids for each node and that node’s parent (basically flattening the tree into a metadata table)
  5. saving out the files (nodes.dmp, names.dmp) with the correct delimiter for Kraken2 (\t|\t)

and finally, in bash:

  1. build and check the custom Greengenes2 database with Kraken2




Build a GG2 database - walkthrough


1. & 2. - get the correct files

We’re using the most recent GG2 release, currently 2024.09, available at the Greengenes2 development server - all due thanks to the authors & maintainers (see also their github).

These files are hosted at https://ftp.microbio.me/greengenes_release/current, along with several other outputs for their paper. Some of these are a backbone subset of sequences for the tree they created, some are intended for use with QIIME2 (.QZA) or phylogenetics (everything marked “phylo”), and others replace the header info with ASV or MD5 identifiers: we are only interested in the full list of sequences used (FASTA format), and the associated taxonomic information, which lists each sequence’s identifying FASTA header and taxonomic ranks.

It can be a little difficult to see the full filenames on the download site, so try hovering your mouse over the links / checking the linked filenames to be sure you have the right version:

  • 2024.09.seqs.fna.gz - 23M sequences, some full length, some way too long, some V4 only
  • 2024.09.taxonomy.id.tsv.gz - entry for each sequence, with the accession (the proper FASTA header ID in this case) in the first column; taxonomic ranks in the second column

# make a home for your database + assoc files; go there; download sequences
k2db=/path/to/custom_gg2

# we add folders that Kraken2 will expect
mkdir -p $k2db $k2db/taxonomy $k2db/library 

cd $k2db
wget https://ftp.microbio.me/greengenes_release/current/2024.09.seqs.fna.gz
wget https://ftp.microbio.me/greengenes_release/current/2024.09.taxonomy.id.tsv.gz

# peek at the first lines of each download
for f in ./*.gz ; do
  gzip -dc $f | head -n 10
done


The taxonomy.id file should look like the below: the long accessions are NCBI, and the short numbers are from the 23M V4 sequences added to the tree. Note the poor confidence scores: 0.3. These “confidence” estimates are an artefact of compatibility, and are not relevant here. Note also that Taxon is a single column, with the different d__; p__; c__; o__; ranks joined together by “; ”.

Feature ID      Taxon   Confidence
GB-GCA-003568775.1-MWMI01000008.1       d__Bacteria; p__; c__; o__; f__; g__; s__       0.3
GB-GCA-001552015.1-CP010514.1   d__Bacteria; p__; c__; o__; f__; g__; s__       0.3
GB-GCA-000008085.1-AE017199.1   d__Bacteria; p__; c__; o__; f__; g__; s__       0.3
22383138        d__Bacteria; p__; c__; o__; f__; g__; s__       0.3
07656960        d__Bacteria; p__; c__; o__; f__; g__; s__       0.3
01537540        d__Bacteria; p__; c__; o__; f__; g__; s__       0.3
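
For reference, each Taxon entry is one string of seven ranks joined by “; ” - a minimal R sketch using a literal example string (not read from the real file):

# split one Taxon string back into its d__ ... s__ ranks
strsplit("d__Bacteria; p__; c__; o__; f__; g__; s__", split = "; ")[[1]]
# [1] "d__Bacteria" "p__"  "c__"  "o__"  "f__"  "g__"  "s__"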


3. get each unique taxon

Session on. We start R, clear the workspace & memory, load libraries, load the files.

Loading seqs.fna and taxonomy.id.tsv can take a long time. If you want to experiment / debug, use a smaller set of taxonomy and sequences:

  • use bash to make a smaller seqs.fna file, e.g. gzip -dc 2024.09.seqs.fna.gz | head -n 20000 > subseqs.fna to take roughly the first 10k FASTA records, and load that instead (see the sketch after this list)
  • add nrows to your read.table call, e.g. read.table( ... nrows=10000 ) to read in only the first 10K lines
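
A minimal sketch of that debug-sized load, assuming you made subseqs.fna as above and are running R from the download directory:

# load only the cut-down FASTA and the first 10k taxonomy rows
library("Biostrings")
sub_seqs <- readDNAStringSet("subseqs.fna")
sub_tax  <- read.table("2024.09.taxonomy.id.tsv.gz", header = TRUE, sep = "\t", nrows = 10000)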

Just be aware that the smaller these test subsets are, the less likely they are to match up with one another! Next, we also set path to your database directory (but it could be any dir where you want to save the outputs), and choose how many n_cores to use in parallel (15 here because v. fancy).

rm(list=ls()) ; gc() 
library("Biostrings")
library("parallel")

# set these first - both are used below
n_cores <- 15
path <- "/path/to/custom_gg2"

# read in seqs.fna - 23,467,470 seqs
length(seqs <- readDNAStringSet( filepath = paste0(path, "/2024.09.seqs.fna.gz" ))) 

# read in ID taxonomy - 23M lines
head(tax <- read.table(paste0( path, "/2024.09.taxonomy.id.tsv.gz"), header = TRUE, sep = "\t"))   


We make a vector that holds the rank levels (see step 4), and then break that taxonomy up into separate columns, making a handy dataframe of all the taxa and accessions in GG2:

# note, taxonomy lacks root@1
rank_is <- c( "d__" = "domain", "p__" = "phylum", "c__" = "class", "o__" = "order", "f__" = "family", "g__" = "genus", "s__" = "species")      

# tax_dummies, just in case there's weird extra ranks (there aren't)
tax_dummies <- 3  
taxs <- as.data.frame(do.call("rbind", mclapply( tax$Taxon, function(aa){   # aa <- tax$Taxon[727748]                       ## parallel
  bb <- unlist(strsplit( aa, split = "\\; "))[1:(7+tax_dummies)]
  bb
}, mc.cores = n_cores, mc.cleanup = TRUE)), stringsAsFactors = FALSE)
colnames(taxs) <- rank_is
taxs$header <- tax$Feature.ID
rownames(taxs) <- taxs$header
head(taxs)
# rownames removed!
       domain phylum class order family genus species                            header
d__Bacteria    p__   c__   o__    f__   g__     s__ GB-GCA-003568775.1-MWMI01000008.1
d__Bacteria    p__   c__   o__    f__   g__     s__     GB-GCA-000008085.1-AE017199.1
d__Bacteria    p__   c__   o__    f__   g__     s__                          22383138
d__Bacteria    p__   c__   o__    f__   g__     s__                          07656960
d__Bacteria  p__Pates...  c__Pacei..  o__UBA9973  f__UBA9973  g__UBA9973     s__ 05829965
d__Bacteria  p__Pates...  c__Pacei..  o__UBA9973  f__UBA9973  g__UBA9973     s__ 09474592


Next, we lapply over that dataframe (in parallel, via mclapply) to record each sequence’s “effective taxon” (its deepest named rank), and then collect the unique taxa at each rank - these will be the key for assigning each unique taxon its own unique taxid.

taxs$effective_taxon <- unlist( mclapply( 1:nrow(taxs), function(aa){   # aa <- 400                  ## parallel
  bb <- taxs[ aa, 1:(7+tax_dummies) ]
  unlist(bb[ max(which( !( grepl("__$", bb, perl = TRUE) | is.na(bb) ) ))  ])
},  mc.cores = n_cores, mc.cleanup = TRUE))
head(taxs)

uniq_tax <- apply( taxs[, 1:(7+tax_dummies)], 2, function(aa){
  bb <- aa[ !( is.na(aa) ) ]     # keep empty ranks (e.g. "p__") so they can still be matched later
  sort(unique(unlist(bb)))
  })


Here, it might be useful to remove 16S sequences outside a sensible length range - so we only keep those longer than 1.4 kbp and shorter than 1.65 kbp. This means we’re losing the 21M V4-only regions, but stuffing 21M short fragments into a k-mer database sounds like a bad idea, and dropping them will greatly simplify database building.

s_widths <- width(seqs)
length(seqs_w <- seqs[ s_widths > 1400 & s_widths < 1650 ])         # 332,588
length( usable_seqs <- seqs_w[ names(seqs_w) %in% taxs$header ] )   # 312,905 


4. assign new, simple taxids

Above, we have an entry for each unique taxon (node) in the GG2 tree. Because of how Kraken2 works, the fake taxids we make need to start at 1 (root), and should not start with a 0 (apparently). We use floop to count how many taxids have been given out so far.

Note the assignment to a variable outside of the apply call using the superassignment operator <<- - not recommended practice in R, but handy for a running counter.

## root is 1, so floop/domains start from 2...
floop <- 1    

taxid_proto <- lapply( names(uniq_tax)[ sapply( uniq_tax, function(aa){ !identical( aa, character(0)) })],                  ## can't parallelise counting!!!
                function(aa){   # aa <- "phylum"                                                                            ## is very fast in any case
  if( aa == "domain"){ floop <<- 1 }
  bb <- uniq_tax[ aa ][[1]]
  # cc <- stringr::str_pad( (floop+1):(floop+length(bb)), width = 7, pad = "0" )
  cc <- (floop+1):(floop+length(bb))
  floop <<- floop + length(cc)
  list( "name_txt" = bb, "taxid" = cc)
})
names(taxid_proto) <- rank_is
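
It’s worth peeking at the key we’ve just built. For the domain rank it should look something like the below (a hedged example: exact taxids depend on the release and sorting, but with the 2024.09 taxonomy d__Bacteria lands on taxid 3, which reappears in the kraken2-inspect output later):

str(taxid_proto[["domain"]])
# > List of 2
# >  $ name_txt: chr [1:2] "d__Archaea" "d__Bacteria"
# >  $ taxid   : int [1:2] 2 3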


Here, we avoid a dictionary-style lookup of taxon:taxid across all ranks: instead we write a fn that decides the rank ahead of time using rank_is (i.e. from the matching “x__” prefix) and searches within that rank alone to find each taxon’s correct taxid (and a similar fn to find each taxon’s parent’s taxid):

identify <- function(aa){
    bb_rank <- rank_is[ gsub("__.*", "__", aa )]                           # aa <- "g__GWA2-37-10"
    if( aa %in% taxid_proto[[bb_rank]][[1]] ){
      taxid_proto[[bb_rank]][[2]][ match(aa, taxid_proto[[bb_rank]][[1]] ) ]
    }else{
      print(paste0("taxon ", aa, " wasn't in the taxid map"))
    }
  }

identify_parent <- function(aa){   # aa <- "g__UBA6532"         aa <- "s__"
  if( grepl("d__", aa)){
    "1"                      # snoonos get unos: domains hang straight off root (taxid 1)
  }else{
    bb_rank <- rank_is[ gsub("__.*", "__", aa) ]
    cc <- taxs[ match(aa, taxs[ , bb_rank ]) , ( which( names(taxs) == bb_rank) -1) ]   # the taxon one rank up
    identify( cc )
    }   
  }
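
A quick sanity check of the two fns (again hedged - exact numbers depend on your build, but a phylum’s parent should resolve to the taxid of its domain):

identify("d__Bacteria")                  # > 3
identify_parent("p__Pseudomonadota")     # > 3  (i.e. the taxid of d__Bacteria)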


5. make a new names.dmp

Here we’re copying the expected format of the nodes.dmp & names.dmp files. It can be useful to look at a “normal” version (e.g. less -S $STANDARD_KRAKEN2DB/taxonomy/names.dmp), which looks much like this (note the special \t|\t field delimiters between columns, and the \t| terminating each line - we’ll deal with those in steps 7. & 8.):

13    |     Dictyoglomus    |               |       scientific name |
14    |     ATCC 35947      |   ATCC 35947 <type strain   |     type material  |
14    |     Dictyoglomus sp. Rt46-B1        |   |       includes        |
14    |     Dictyoglomus thermophilum Saiki et al. 1985   |   |   authority  |
14    |     Dictyoglomus thermophilum       |       |       scientific name |
14    |     DSM 3960        |       DSM 3960 <type strain  |     type material   |
14    |     strain H-6-12   |       strain H-6-12 <type strain   |    type material   |
16    |     Methyliphilus   |               |       equivalent name |


We copy this format as best we can, using the identify() fn we made above, to look up each unique taxon we identified (3.), and find the unique taxid we assigned it (4.):

names_proto <- data.frame(
  "tax_id" = unlist( mclapply(                    # the id of node associated with this name
                              unlist( uniq_tax),
                              identify, mc.cores = n_cores, mc.cleanup = TRUE )),
  "name_txt" = unlist( uniq_tax),                 # name itself
  "unique name" = unlist( uniq_tax),              # the unique variant of this name if name not unique
  "name class" = "scientific name",               # synonym, common name, ... all lower case.
  stringsAsFactors = FALSE
)

head(names_proto)


6. make a new nodes.dmp

Same idea here! The only major change is that we also use the identify_parent() fn we made to get the taxid of each taxon’s parent taxon (e.g. for a given species, what is the taxid of the genus it belongs to?).

The other difference is that nodes.dmp has more fields - but note that we basically give columns 4-13 a fixed value suitable for all Bacterial/Archaeal taxa. It won’t affect Kraken2.

nodes_proto <- data.frame(
  # "tax_id" = unlist(sapply( names_proto$name_txt, identify )),                                           ## parallel
  "tax_id" = unlist(mclapply( names_proto$name_txt, identify, mc.cores = n_cores, mc.cleanup = TRUE )),                                           ## parallel
  "parent tax_id" = unlist( mclapply( names_proto$name_txt, identify_parent, mc.cores = n_cores, mc.cleanup = TRUE )),                             ## parallel
  "rank" = rank_is[ unlist( mclapply( names_proto$name_txt, function(aa){ gsub("(^\\w__).*", "\\1", aa) }, mc.cores = n_cores, mc.cleanup = TRUE )) ] ,     ## parallel
  "embl code" = "",                  # no one seems to know about this one
  "division id" = 0,                  # 0 = Bacteria (but also apparently Archaea...)
  "inherited div flag" = 1,           # 1 if node inherits division code from parent
  "genetic code id" = 11,           # 11 = Bact/Arch/plastid
  "inherited GC" = 1,                # 1 if node inherits genetic code from parent
  "mitochondrial genetic code" = 0,  # see gencode.dmp file
  "inherited MGC" = 1,                # 1 if node inherits mitochondrial gencode from parent
  "GenBank hidden" = 0,               # 1 if name is suppressed in GenBank entry lineage
  "hidden subtree root" = 0,          # 1 if this subtree has no sequence data yet
  "comments" = "")

head(nodes_proto)


Crucially, for both nodes.dmp and names.dmp, we must root our new fake taxids with the “root” taxon, which must be at taxid 1 (I think…feel free to check and let us know).

# - nodes.dmp (note 8,0,1, instead of 11,1,0)
nodes_proto <- rbind( 
  c( "1","1","no rank","","8","0","1","0","0","0","0","0",""),
  nodes_proto
)

# - and names.dmp
names_proto <- rbind( 
  c( "1", "all", "all", "synonym"),
  c( "1", "root", "root", "scientific name"),
  names_proto
)
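
Before moving on, a cheap consistency check (a sketch, not part of the original walkthrough): every parent taxid referenced in nodes_proto should itself exist as a node. Note that R’s make.names() has turned the “parent tax_id” column name into parent.tax_id:

# every parent must be a known node; root (taxid 1) is its own parent, so it passes too
all( nodes_proto$parent.tax_id %in% nodes_proto$tax_id )    # expect TRUE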


Equally, the sequences need new accessions which reflect the newly created taxids that we’ve added to nodes.dmp and names.dmp. These names include:

  • the accession text the sequence came with
  • a string that reads “kraken:taxid”, which I assume Kraken2 uses to find the next bit:
  • the all-important new taxid.

These are separated by a pipe (the “|” character), and put together they should look like this:

MJ030-2-barcode69-umi1705bins-ubs-122|kraken:taxid|1491

Kraken2’s documentation on Custom database building indicates that “kraken:taxid|123456” could be at the front or back, as long as it is properly delimited by |; troubleshooting via genAI suggested that the above order is preferred (but again, feel free to explore this variable).

names(usable_seqs) <- paste0( 
  names(usable_seqs), 
  "|kraken:taxid|", 
  unlist( mclapply( taxs[ names(usable_seqs) , "effective_taxon"], identify, mc.cores = n_cores, mc.cleanup = TRUE ) 
          )
  )
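
One more optional check (a sketch): if any effective taxon failed to resolve, identify() will have returned its warning text instead of a taxid, and that text is now sitting in the FASTA header - so grep the headers for it:

# count headers that picked up the "wasn't in the taxid map" message instead of a taxid
sum( grepl("wasn't in the taxid map", names(usable_seqs), fixed = TRUE) )    # expect 0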


7. saving out the files

Always worth checking the structure and look of these files. Certainly easier to examine in a local session in RStudio or similar, but they should look identical in structure to the standard Kraken2 ones:

head( nodes_proto )
head( names_proto )
head( usable_seqs )


Finally, write out the filtered sequences, plus nodes.dmp & names.dmp. Check out that weird sep arg - again, this is a key part of how NCBI taxdump files are structured, so Kraken2 needs it:

# reference FASTA sequences first:
writeXStringSet( usable_seqs, 
                 filepath = paste0( path, "/gg2_2024.09__FullLength313k__seqs.fna"), 
                 compress = FALSE, format = "fasta" )

# nodes and names.dmp files, with odd "\t|\t" separator 
write.table(names_proto, file = paste0( path, "/gg2_2024.09__FullLength313k__names.dmp"), 
            sep = '\t|\t', col.names = FALSE, row.names = FALSE, quote = FALSE)

write.table(nodes_proto, file = paste0( path, "/gg2_2024.09__FullLength313k__nodes.dmp"), 
            sep = '\t|\t', col.names = FALSE, row.names = FALSE, quote = FALSE)
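
If curious, peek at what actually got written - at this point each line still ends after the last field; the trailing \t| only gets added by sed in step 8:

# first few lines of the freshly written names.dmp
readLines( paste0( path, "/gg2_2024.09__FullLength313k__names.dmp"), n = 3 )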


8. kraken2-build the GG2 database

Structuring yr Kraken2 database

This will work! Kraken2 only needs the three files below, in two folders inside your database directory, to build a database. All the other taxdump files (accession2taxid maps, other .dmp files) are ancillary. Step 1 is to arrange this manually.

NB: Doing this manually skips the first step in the standard database building (e.g. kraken2-build --download-library bacteria --db $k2db)

$ tree $k2db
# >  ├── library
# >  │   └── seqs.fna
# >  └── taxonomy
# >      ├── names.dmp
# >      └── nodes.dmp


Make folders, copy the files over. Note that the files end up named nodes.dmp, names.dmp, seqs.fna (non-compressed):

# where you saved the dmp & fna files from R
outdir=/path/to/custom_outputs/

# where you want your database to go
k2db=/path/to/custom_gg2_database/
mkdir -p $k2db $k2db/taxonomy $k2db/library

# copy from wherever you made it, to wherever your database is going
cp $outdir/*nodes.dmp $k2db/taxonomy/nodes.dmp
cp $outdir/*names.dmp $k2db/taxonomy/names.dmp
cp $outdir/*seqs.fna $k2db/library/seqs.fna

head $k2db/taxonomy/nodes.dmp
head $k2db/taxonomy/names.dmp
head $k2db/library/seqs.fna


Originally, I was also building seqid2taxid.map, but Kraken2 is quite happy to do that step itself, so let it away. However, we do still need to add a trailing \t| to the end of each line in the .dmp files - otherwise the database will build and classify to taxids, but will not tell you what the actual taxa are - which would not be much use.

sed -i -e 's/$/\t|/g' $k2db/taxonomy/names.dmp
sed -i -e 's/$/\t|/g' $k2db/taxonomy/nodes.dmp


Starting yr Kraken2 database

If making your own database from scratch, step 2 is to add sequences into the database structure from step 1. Note here we skip masking of repetitive sequences, as we’re working with 16S: we need all the signal we can get from this restricted range of sequences, and some low-information stretches are structurally important (as ever, thoughts welcome on this).

kraken2-build --add-to-library $k2db/library/seqs.fna --db $k2db --no-masking

Once all the pieces are in place, it’s (theoretically) as simple as:

# rub your hands together and 
kraken2-build --build --db $k2db --threads 15
    # >  Added "$k2db/library/seqs.fna" to library (/path/to/custom_gg2_database/)
    # >  Creating sequence ID to taxonomy ID map (step 1)...
    # >  Sequence ID to taxonomy ID map complete. [0.182s]
    # >  Estimating required capacity (step 2)...
    # >  Estimated hash table requirement: 38361964 bytes
    # >  Capacity estimation complete. [4.503s]
    # >  Building database files (step 3)...
    # >  Taxonomy parsed and converted.
    # >  CHT created with 15 bits reserved for taxid.
    # >  Completed processing of 625810 sequences, 938680526 bp        <<< much better! 
    # >  Writing data to disk...  complete.
    # >  Database files completed. [3m8.402s]
    # >  Database construction complete. [Total: 3m13.129s]

Inspecting the database itself (long output curtailed):

# press Q to exit from `less`
kraken2-inspect --db $k2db | less  
    # >  Database options: nucleotide db, k = 35, l = 31
    # >  Spaced mask = 11111111111111111111111111111111110011001100110011001100110011
    # >  Toggle mask = 1110001101111110001010001100010000100111000110110101101000101101
    # >  Total taxonomy nodes: 18637
    # >  Table size: 6672071
    # >  Table capacity: 9590491
    # >  Min clear hash value = 0
    # >  100.00  6672071 3502    R       1       root
    # >  97.61  6512515 309649  R1      3         d__Bacteria
    # >  17.49  1167211 7021    P       109         p__Pseudomonadota
    # >  12.18  812446  98233   C       255           c__Gammaproteobacteria
    # >  2.21  147697  13331   O       590             o__Burkholderiales
    # >  0.90  60050   14680   F       1682              f__Burkholderiaceae_A_595421
    # >  0.05  3614    401     G       8717                g__Solimicrobium
    # >  0.04  2647    2647    S       21332                 s__Solimicrobium aquaticum
    # >  0.00  324     324     S       21333                 s__Solimicrobium silvestre
    # >  0.00  115     115     S       23410                 s__Undibacterium aquatile
    # >  0.00  102     102     S       23413                 s__Undibacterium seohonense
    # >  0.00  25      25      S       23412                 s__Undibacterium nitidum
    # >  0.04  2868    371     G       7552                g__Paraburkholderia_580243
    # >  [...] etc.

Much success.


9. use kraken2 with the GG2 database

That’s only the beginning. Let’s try applying it to a DADA2 output of assigned ASVs:

kraken2 --db $k2db --use-names \
  --report-zero-counts --report $k2db/temp_k2_report \
  $WRK2/dada2_output/assigned_ASVs__2632.fna | less -S 

All 100% classified (though not necessarily to species level, as expected…):

# >  2632 sequences (1.08 Mbp) processed in 0.254s (622.7 Kseq/m, 256.12 Mbp/m).
# >    2632 sequences classified (100.00%)
# >    0 sequences unclassified (0.00%)

# classification output:
C       asv__4308       f__Rhizobiaceae (taxid 2720)    399     3:32 109:1   [...]
C       asv__4315       s__Actinotignum schaalii_A_386418 (taxid 10669) 415  [...]
C       asv__4326       s__Gemmiger_A_73129 qucibialis (taxid 14717)    425  [...]
C       asv__4334       g__Peptoniphilus_A (taxid 7692) 398     3:18 1217:4  [...]
C       asv__4405       g__Lentihominibacter (taxid 6546)       401     3:51 [...]
C       asv__4445       s__Neisseria sp000186165 (taxid 18038)  424     3:14 [...]
[...]

If we look at the Kraken2 report, we see everything broken down by rank and taxid, with the hierarchical taxonomy indented beneath each parent.

less -S $k2db/temp_k2_report 
  0.00  0       0       U       0       unclassified
100.00  2632    0       R       1       root
100.00  2632    0       R1      3         d__Bacteria
 39.86  1049    0       P       14          p__Bacillota_A_368345
 39.86  1049    0       C       191           c__Clostridia_258483
 21.12  556     1       O       883             o__Lachnospirales
 19.98  526     91      F       2360              f__Lachnospiraceae
  3.50  92      10      G       4377                g__Blautia_A_141781
  1.10  29      29      S       12002                 s__Blautia_A_141781 obeum
  0.91  24      24      S       12003                 s__Blautia_A_141781 wexlerae
  0.68  18      18      S       12000                 s__Blautia_A_141781 fusiformis
  0.34  9       9       S       11999                 s__Blautia_A_141781 caecimuris


Best of luck in yr bughunts.


Build a GG2 database - code only

Just the code plx? Shur shur shur.

R

      rm(list=ls()) ; gc() 

  ## here, download to /dl/path from R - note each system() call gets its own shell,
  ## so a lone "cd" doesn't stick; chain the commands with && instead
      system("cd /dl/path && wget https://ftp.microbio.me/greengenes_release/current/2024.09.seqs.fna.gz")
      system("cd /dl/path && wget https://ftp.microbio.me/greengenes_release/current/2024.09.taxonomy.id.tsv.gz")

      
  ##   get sequences   =========================================================
      
      # attributes, names, length, hist etc. for DNA work
      library("Biostrings")
      library("parallel")
      n_cores <- 15
      path <- "/dl/path"
      length(seqs <- readDNAStringSet( filepath = paste0(path, "/2024.09.seqs.fna.gz" )))  #   23,467,470
      head(tax <- read.table(paste0( path, "/2024.09.taxonomy.id.tsv.gz"), header = TRUE, sep = "\t"))   # 23M lines


  ##   make taxonomy   =========================================================
      
      rank_is <- c( "d__" = "domain", "p__" = "phylum", 
                    "c__" = "class", "o__" = "order", 
                    "f__" = "family", "g__" = "genus",
                    "s__" = "species")      
      
      # 3 taxonomic dummy levels added here for fun
      tax_dummies <- 3  
      taxs <- as.data.frame(do.call("rbind", mclapply( tax$Taxon, function(aa){  
        bb <- unlist(strsplit( aa, split = "\\; "))[1:(7+tax_dummies)]
        bb
      }, mc.cores = n_cores, mc.cleanup = TRUE)), stringsAsFactors = FALSE)
      colnames(taxs) <- rank_is
      taxs$header <- tax$Feature.ID
      rownames(taxs) <- taxs$header
    
    ## crucial - accession - taxon map
      taxs$effective_taxon <- unlist( mclapply( 1:nrow(taxs), function(aa){  
        bb <- taxs[ aa, 1:(7+tax_dummies) ]
        bb[ max(which( !( grepl("__$", bb, perl = TRUE) | is.na(bb) ) ))  ]
      },  mc.cores = n_cores, mc.cleanup = TRUE))

      uniq_tax <- apply( taxs[, 1:(7+tax_dummies)], 2, function(aa){
        bb <- aa[ !( is.na(aa) ) ]
        sort(unique(unlist(bb)))
        })


  ## subset to seqs in 1.4-1.65kbp range    ====================================
      
      s_widths <- width(seqs)
      # 332,588 in full version
      length(seqs_w <- seqs[ s_widths > 1400 & s_widths < 1650 ])
      # match with those in taxs -  312,905 in full version
      length( usable_seqs <- seqs_w[ names(seqs_w) %in% taxs$header ] )
      
      
  ## taxid proto   -----------------------------------------------------------

    ## root is 1, so floop/domains start from 2...
      floop <- 1    
    ## counter with universal assignment to var
      taxid_proto <- lapply( names(uniq_tax)[ 
        sapply( uniq_tax, function(aa){ !identical( aa, character(0)) })
        ], function(aa){   
          if( aa == "domain"){ floop <<- 1 }
          bb <- uniq_tax[ aa ][[1]]
          cc <- (floop+1):(floop+length(bb))    #  stringr::str_pad( (floop+1):(floop+length(bb)), width = 7, pad = "0" )
          floop <<- floop + length(cc)
          list( "name_txt" = bb, "taxid" = cc)
      })
      names(taxid_proto) <- rank_is

    ## no mult lookups - decide rank ahead of time, use rank_is & grep through that alone (e.g. based on prefix)
      identify <- function(aa){
        bb_rank <- rank_is[ gsub("__.*", "__", aa )]                           # aa <- "g__GWA2-37-10"
        if( aa %in% taxid_proto[[bb_rank]][[1]] ){
          taxid_proto[[bb_rank]][[2]][ match(aa, taxid_proto[[bb_rank]][[1]] ) ]
        }else{
          print(paste0("taxon ", aa, " wasn't in the taxid map"))
        }
      }

      identify_parent <- function(aa){   # aa <- "g__UBA6532"         aa <- "s__"
        if( grepl("d__", aa)){
          "1"                      # snoonos get unos
        }else{
          bb_rank <- rank_is[ gsub("__.*", "__", aa) ]
          cc <- taxs[ match(aa, taxs[ , bb_rank ]) , ( which( names(taxs) == bb_rank) -1) ] 
          identify( cc )
          }   
        }


  ## names proto   -----------------------------------------------------------
  
      names_proto <- data.frame(
        "tax_id" = unlist( mclapply( unlist( uniq_tax), identify, mc.cores = n_cores, mc.cleanup = TRUE )),
        "name_txt" = unlist( uniq_tax),
        "unique name" = "",                   # safety net ignored.
        "name class" = "scientific name",     # synonym, common name, ... lower case!
        stringsAsFactors = FALSE
      )

      
  ## proto nodes   -----------------------------------------------------------
      
      nodes_proto <- data.frame(
        "tax_id" = unlist(mclapply( names_proto$name_txt, identify, mc.cores = n_cores, mc.cleanup = TRUE )),                                           ## parallel
        "parent tax_id" = unlist( mclapply( names_proto$name_txt, identify_parent, mc.cores = n_cores, mc.cleanup = TRUE )),                             ## parallel
        "rank" = rank_is[ unlist( mclapply( names_proto$name_txt, function(aa){ gsub("(^\\w__).*", "\\1", aa) }, mc.cores = n_cores, mc.cleanup = TRUE )) ] ,     ## parallel
        "embl code" = "",                  # no one seems to know about this one
        "division id" = 0,                  # 0 = Bacteria (but also apparently Archaea...)
        "inherited div flag" = 1,                            
        "genetic code id" = 11,                   # 11 = Bact/Arch/plastid
        "inherited GC" = 1,                     # 1 if node inherits genetic code from parent
        "mitochondrial genetic code" = 0,       # see gencode.dmp file
        "inherited MGC" = 1,                # 1 if node inherits mitochondrial gencode from parent
        "GenBank hidden" = 0,               # 1 if name is suppressed in GenBank entry lineage
        "hidden subtree root" = 0,          # 1 if this subtree has no sequence data yet
        "comments" = "")

            
    ## rooting the tree - must be appended to :
      # - nodes.dmp (note 8,0,1, instead of 11,1,0)
      nodes_proto <- rbind( 
        c( "1","1","no rank","","8","0","1","0","0","0","0","0",""),
        nodes_proto
      )
      # - and names.dmp
      names_proto <- rbind( 
        c( "1", "all", "all", "synonym"),
        c( "1", "root", "root", "scientific name"),
        names_proto
      )

      
  ## modify the seq headers    =================================================
      
      names(usable_seqs) <- paste0(
        names(usable_seqs), 
        "|kraken:taxid|", 
        unlist( mclapply( taxs[ names(usable_seqs) , "effective_taxon"], identify, mc.cores = n_cores, mc.cleanup = TRUE )
                )
        )
            

##   out that!   ---------------------------------------------------------------

      # head( nodes_proto )
      # head( names_proto )
      # head( usable_seqs )

      ## note sep for nodes & names
      writeXStringSet( usable_seqs, filepath = paste0( path, "/gg2_2024.09__FullLength313k__seqs.fna"), 
                       compress = FALSE, format = "fasta" )
      
      write.table(names_proto, file = paste0( path, "/gg2_2024.09__FullLength313k__names.dmp"), 
                  sep = '\t|\t', col.names = FALSE, row.names = FALSE, quote = FALSE)
      write.table(nodes_proto, file = paste0( path, "/gg2_2024.09__FullLength313k__nodes.dmp"), 
                  sep = '\t|\t', col.names = FALSE, row.names = FALSE, quote = FALSE)


bash

Swap to bash:


outdir=/dl/path
k2db=/ref/gg2_db
# rm -rf $k2db
mkdir -p $k2db $k2db/taxonomy $k2db/library

cp $outdir/*nodes.dmp $k2db/taxonomy/nodes.dmp
cp $outdir/*names.dmp $k2db/taxonomy/names.dmp
cp $outdir/*seqs.fna $k2db/library/seqs.fna

# fix delimiters! 
sed -i -e 's/$/\t|/g' $k2db/taxonomy/names.dmp
sed -i -e 's/$/\t|/g' $k2db/taxonomy/nodes.dmp

kraken2-build --add-to-library $k2db/library/seqs.fna --db $k2db --no-masking

# full thing takes ~90 mins
kraken2-build --build --db $k2db --threads 15


# look
kraken2-inspect --db $k2db | head -n20


kraken2 --db $k2db \
  --use-names ${yourdata}/query16S.fna \
  --report-zero-counts --report ./query16S.report > ./query16S.output 

head -n25  ./query16S.report

# here endeth the guessin

Download GG2-Kraken2 database

GreenGenes2 for Kraken2: 1.4-1.65kbp subset (331K seqs)

Filtered the original 23M sequences to keep only those between 1.4 kbp and 1.65 kbp (331K sequences).

  • hash.k2d - 38MB
  • opts.k2d - 64B
  • taxo.k2d - 1.5MB

Should you feel so inclined

  • full_db_tar.gz - includes the hash, opts, & taxo.k2d, as well as nodes, map, & seq.fna files used: 170MB

Download here (Zenodo.org)

GreenGenes2 for Kraken2: complete release (23M seqs)

Built from the complete release, contains all sequences with a matching accession (ought to be 23M seqs).

  • hash.k2d - 184MB
  • opts.k2d - 64B
  • taxo.k2d - 1.8MB

Or indeed, run away with yrself:

  • full_db_tar.gz - includes the hash, opts, & taxo.k2d, as well as nodes, map, & seq.fna files used: 1.27GB

Download here (Zenodo.org)


Appendix

Considerations

Things to be aware of if using this data for your work (generally the same for both GG2 and SILVA 138.2):

  • Cake is a made up drug. We made completely fake taxids, so none of these will match NCBI taxonomy.
  • We did not mask repetitive sequences. Not certain whether avoiding DUSTmasking for 16S is “better”, but there is a strong & sensible case for avoiding it.
  • as mentioned, the database was first built with only “full length” sequences (1.4 - 1.65 kbp; 331K sequences) for the subset; it has since been extended to cover the full GG2 2024.09 release too.

Notes

A number of issues needed to be taken care of en route to the bottom of the page:

  • incorrect root (needed to be 1, not 0) (unclear how important this actually is)
  • .dmps: incorrect seps on .dmps (needed to be \t|\t)
  • .dmps: no terminator (\t|) at the end of each line in the .dmps (the end of a line is traditionally not delimited by sep, but by a newline, so go figure.)
  • .dmps: column 4 in names.dmp needed to be lower-case in order to report taxonomies (taxid assignment worked fine, but it wouldn’t tell you who they were)
  • not clear whether <acc>|kraken:taxid|000000 is better than kraken:taxid|000000|<acc>
  • remove sarcasm, bad jokes, curt insensitivities. Partially successful.