GreenGenes2-Kraken2database: download here / end of page
GreenGenes2-Kraken2database: skip to code only / end of page
Greengenes was a popular and comprehensive 16S database representing 201K sequences. Greengenes has been recently updated to a staggering 2M full-length (+21M V4 sequences) (GG2, 2024).
GG2 is not natively available for use with Kraken2,
everyone’s my favourite k-mer based sequence
classifier.
This guide hopes to give quick guidance to parsing the GG2 database
into a NCBI-like format (see taxdump), explicitly so that it can be used
with the Kraken2 taxonomic classifier. Although the Kraken2 wiki is
extensive, the method for developing a custom database from
scratch was not completely obvious, so it is being recorded here in
with the foreknowledge that jfg will forget what
he has done.
kraken2 databasesWe assume that you have R (+Biostrings,
+parallel) and Kraken2 installed. First point
is to read the guide on the Kraken2
website about making custom
databases.
These custom databases allow you to customise what part of the NCBI you use, and generally go:
## not run (successfully...)
# what you want
customDBs=archaea,bacteria,viruses
# where you want it
ncbiDB=/path/to/custom_ncbi
kraken2-build --download-taxonomy --db $ncbiDB
kraken2-build --download-library $customDBs --db $ncbiDB
kraken2-build --add-to-library extra_seqs.fa --db $ncbiDB # add extra_seqs.fa if needed
kraken2-build --build --db $ncbiDB --threads 10
Pretty cool! Having read this however, you might note that there are guides for customising many database formats from Silva, GTDB, NCBI - but not creating one from scratch. Nor is Greengenes2 an officially supported database that can be downloaded elsewhere, so we will:
2024.09)
release2024.09)
release, based on Greengenes2 “ID” (and not ASV or MD5)Then, in R:
taxids to each node in the GG2 treenames.dmp, including the new
taxids for each nodenodes.dmp, including the new
taxids for each node and that node’s parent (basically
flattening the tree into a metadata table)nodes.dmp,
names.dmp) with the correct delimiter for
Kraken2 (\t|\t)and finally, in bash:
Kraken2We’re using the most recent GG2 release, currently 2024.09,
available at the Greengenes2
development server - all due thanks to the authors & maintainers
(see also their github).
These files are hosted at https://ftp.microbio.me/greengenes_release/current,
along with several other outputs for their paper. Some of these are a
backbone subset of sequences for the tree they created,
some are intended for use with QIIME2 (.QZA), phylogenetics
(everything marked “phylo”), others replace the header info with ASV or MD5 sequences: we are only
interested in the full list of sequences used (FASTA
format), and the associated taxonomic information which lists each
sequences’ identifying FASTA header and taxonomic
ranks.
It can be a little difficult to see the full filenames on the download site, so try hovering your mouse over the links / checking the linked filenames to be sure you have the right version:
# make a home for your database + assoc files; go there; download sequences
k2db=/path/to/custom_gg2
# we add folders that Kraken2 will expect
mkdir -p $k2db $k2db/taxonomy $k2db/library
cd $k2db
wget https://ftp.microbio.me/greengenes_release/current/2024.09.seqs.fna.gz
wget https://ftp.microbio.me/greengenes_release/current/2024.09.taxonomy.id.tsv.gz
for f in ./*gz ;
do
gzip -dc $f | head -10 ;
done
The taxonomy.id file should look like below: the long
accessions are NCBI, and the short numbers are from the 23M V4 sequences
added to the tree. Note the poor confidence scores: 0.3. These
“confidence” estimated are an artefact of compatibility,
and are not relevant here. Note also that Taxon is a
single column, with the different d__; c__; o__;
ranks joined together by whitespace.
Feature ID Taxon Confidence
GB-GCA-003568775.1-MWMI01000008.1 d__Bacteria; p__; c__; o__; f__; g__; s__ 0.3
GB-GCA-001552015.1-CP010514.1 d__Bacteria; p__; c__; o__; f__; g__; s__ 0.3
GB-GCA-000008085.1-AE017199.1 d__Bacteria; p__; c__; o__; f__; g__; s__ 0.3
22383138 d__Bacteria; p__; c__; o__; f__; g__; s__ 0.3
07656960 d__Bacteria; p__; c__; o__; f__; g__; s__ 0.3
01537540 d__Bacteria; p__; c__; o__; f__; g__; s__ 0.3
Session on. We start R, clear the workspace
& memory, load libraries, load the files.
Loading seqs.fna and taxonomy.id.tsv can
take a long time. If you want to experiment / debug, use a smaller set
of taxonomy and sequences:
bash to make a smaller seqs.fna file:
e.g. gzip -dc 2024.09.seqs.fna | head -n 20000 > subseqs.fna
to use the first e.g. 10k FASTA sequences, load that instead.nrows to your read.table call,
e.g. read.table( ... nrows=10000 ) to read in only the
first e.g. 10K lines onlyJust be aware that the smaller these test subsets are the less likely
they are to match up with one another! Next, we also make
path the same as your database directory (but it could be
any dir where you want to save the outputs), and chose how
many n_cores to use in parallel (15 here
because v. fancy).
rm(list=ls()) ; gc()
library("Biostrings")
library("parallel")
# read in seqs.fna - 23,467,470 seqs
length(seqs <- readDNAStringSet( filepath = paste0(path, "/2024.09.seqs.fna.gz" )))
# read in ID taxonomy - 23M lines
head(tax <- read.table(paste0( path, "/2024.09.taxonomy.id.tsv.gz"), header = TRUE, sep = "\t"))
n_cores <- 15
path <- "/path/to/custom_gg2"
We make a vector that holds the rank levels (see step 4), and we’ll break up that taxonomy into different columns, making a handy dataframe of all the taxa and accessions in GG2:
# note, taxonomy lacks root@1
rank_is <- c( "d__" = "domain", "p__" = "phylum", "c__" = "class", "o__" = "order", "f__" = "family", "g__" = "genus", "s__" = "species")
# tax_dummies, just in case theres weird extra ranks (there aren't)
tax_dummies <- 3
taxs <- as.data.frame(do.call("rbind", mclapply( tax$Taxon, function(aa){ # aa <- tax$Taxon[727748] ## parallel
bb <- unlist(strsplit( aa, split = "\\; "))[1:(7+tax_dummies)]
bb
}, mc.cores = n_cores, mc.cleanup = TRUE)), stringsAsFactors = FALSE)
colnames(taxs) <- rank_is
taxs$header <- tax$Feature.ID
rownames(taxs) <- taxs$header
head(taxs)
# rownames removed!
domain phylum class order family genus species header
d__Bacteria p__ c__ o__ f__ g__ s__ GB-GCA-003568775.1-MWMI01000008.1
d__Bacteria p__ c__ o__ f__ g__ s__ GB-GCA-000008085.1-AE017199.1
d__Bacteria p__ c__ o__ f__ g__ s__ 22383138
d__Bacteria p__ c__ o__ f__ g__ s__ 07656960
d__Bacteria p__Pates... c__Pacei.. o__UBA9973 f__UBA9973 g__UBA9973 s__ 05829965
d__Bacteria p__Pates... c__Pacei.. o__UBA9973 f__UBA9973 g__UBA9973 s__ 09474592
Next, lapply over that dataframe (in
parallel=mclapply), creating a list with a unique entry for
each unique taxon - this will be the key for assigning each unique taxon
their own unique taxid.
taxs$effective_taxon <- mclapply( 1:nrow(taxs), function(aa){ # aa <- 400 aa <- taxs[ 400 , 1:(7+tax_dummies) ] ## parallel
bb <- taxs[ aa, 1:(7+tax_dummies) ]
unlist(bb[ max(which( !( grepl("__$", bb, perl = TRUE) | is.na(bb) ) )) ])
}, mc.cores = n_cores, mc.cleanup = TRUE)
head(taxs)
uniq_tax <- apply( taxs[, 1:(7+tax_dummies)], 2, function(aa){
bb <- aa[ !( is.na(aa) ) ] # prob need to keep these in order to match later?| grepl("__$", aa, perl = TRUE)
sort(unique(unlist(bb)))
})
Here, it might be useful to remove 16S sequences below a certain threshold length - so we only keep those longer than 1.4kbp, and shorter than 1.65 kbp. This means we’re losing 21M V4 regions, but that data sounds like a bad idea, and this will greatly simplify database building.
s_widths <- width(seqs)
length(seqs_w <- seqs[ s_widths > 1400 & s_widths < 1650 ]) # 332,588
length( usable_seqs <- seqs_w[ names(seqs_w) %in% taxs$header ] ) # 312,905
taxidsAbove, we have an entry for each unique taxon (node) in the GG2 tree.
Because of how Kraken2 works, the fake taxid’s
we make need to start at 1
(root), and should not start with a 0
(apparently). We use floop to count how many
taxids have been given out so far.
Note the python-like universal assignment
of values outside of the apply call using
<<- - this is not recommended practice in
R.
## root is 1, so floop/domains start from 2...
floop <- 1
taxid_proto <- lapply( names(uniq_tax)[ sapply( uniq_tax, function(aa){ !identical( aa, character(0)) })], ## can't parallelise counting!!!
function(aa){ # aa <- "phylum" ## is very fast in any case
if( aa == "domain"){ floop <<- 1 }
bb <- uniq_tax[ aa ][[1]]
# cc <- stringr::str_pad( (floop+1):(floop+length(bb)), width = 7, pad = "0" )
cc <- (floop+1):(floop+length(bb))
floop <<- floop + length(cc)
list( "name_txt" = bb, "taxid" = cc)
})
names(taxid_proto) <- rank_is
Here, avoid having to do a dictionary-style lookup of
taxon:taxid: instead write a fn to decide rank
ahead of time using rank_is & grepl
through that rank alone (i.e. based on matching “x__”
prefix) to find each taxons correct taxid (and a similar fn
to find each taxon’s parent’s taxid):
identify <- function(aa){
bb_rank <- rank_is[ gsub("__.*", "__", aa )] # aa <- "g__GWA2-37-10"
if( aa %in% taxid_proto[[bb_rank]][[1]] ){
taxid_proto[[bb_rank]][[2]][ match(aa, taxid_proto[[bb_rank]][[1]] ) ]
}else{
print(paste0("taxon ", aa, " wasn't in the taxid map"))
}
}
identify_parent <- function(aa){ # aa <- "g__UBA6532" aa <- "s__"
if( grepl("d__", aa)){
# "0000000" # snilches get zilches - possibly no leading 0s
"1" # snoonos get unos ~~snilches get zilches
}else{
bb_rank <- rank_is[ gsub("__.*", "__", aa) ]
cc <- taxs[ match(aa, taxs[ , bb_rank ]) , ( which( names(taxs) == bb_rank) -1) ]
identify( cc )
}
}
names.dmpHere we’re copying the expected format of nodes &
names.dmp files. It can be useful to look at a “normal”
version and see how it looks
(i.e. less -S $STANDARD_KRAKEN2DB/taxonomy/names.dmp), but
it looks much like this (note the special \t|\t field
delimiters between columns and at the end - we’ll deal with
those in steps 7. & 8.):
13 | Dictyoglomus | | scientific name |
14 | ATCC 35947 | ATCC 35947 <type strain | type material |
14 | Dictyoglomus sp. Rt46-B1 | | includes |
14 | Dictyoglomus thermophilum Saiki et al. 1985 | | authority |
14 | Dictyoglomus thermophilum | | scientific name |
14 | DSM 3960 | DSM 3960 <type strain | type material |
14 | strain H-6-12 | strain H-6-12 <type strain | type material |
16 | Methyliphilus | | equivalent name |
We copy this format as best we can, using the identify()
fn we made above, to look up each unique taxon we identified (3.), and
find the unique taxid we assigned it (4.):
names_proto <- data.frame(
"tax_id" = unlist( mclapply( # the id of node associated with this name
unlist( uniq_tax),
identify, mc.cores = n_cores, mc.cleanup = TRUE )),
"name_txt" = unlist( uniq_tax), # name itself
"unique name" = unlist( uniq_tax), # the unique variant of this name if name not unique
"name class" = "scientific name", # synonym, common name, ... all lower case.
stringsAsFactors = FALSE
)
head(names_proto)
nodes.dmpSame idea here! Only major change is that we also use the
identify_parent() fn we made to also get the
taxid of each taxons parent taxon (e.g. for a given
species, what is the taxid of the genus it belongs
to?).
The other difference is that nodes.dmp has more fields - but note
that we bascially give rows 4-13 a fixed value for all
Bacterial/Archaeal taxa. It won’t affect Kraken2.
nodes_proto <- data.frame(
# "tax_id" = unlist(sapply( names_proto$name_txt, identify )), ## parallel
"tax_id" = unlist(mclapply( names_proto$name_txt, identify, mc.cores = n_cores, mc.cleanup = TRUE )), ## parallel
"parent tax_id" = unlist( mclapply( names_proto$name_txt, identify_parent, mc.cores = n_cores, mc.cleanup = TRUE )), ## parallel
"rank" = rank_is[ unlist( mclapply( names_proto$name_txt, function(aa){ gsub("(^\\w__).*", "\\1", aa) }, mc.cores = n_cores, mc.cleanup = TRUE )) ] , ## parallel
"embl code" = "", # no one seems to know about this one
"division id" = 0, # 0 = Bacteria (but also apparently Archaea...)
"inherited div flag" = 1, # 1 if node inherits division code from parent
"genetic code id" = 11, # 11 = Bact/Arch/plastid
"inherited GC" = 1, # 1 if node inherits genetic code from parent
"mitochondrial genetic code" = 0, # see gencode.dmp file
"inherited MGC" = 1, # 1 if node inherits mitochondrial gencode from parent
"GenBank hidden" = 0, # 1 if name is suppressed in GenBank entry lineage
"hidden subtree root" = 0, # 1 if this subtree has no sequence data yet
"comments" = "")
head(nodes_proto)
Crucially to both nodes.dmp and names.dmp,
we must root our new fake taxid, with the
“root” taxon, which must be at
taxid 1 (I think…feel free to
check and let us know).
# - nodes.dmp (note 8,0,1, instread of 11,1,0)
nodes_proto <- rbind(
c( "1","1","no rank","","8","0","1","0","0","0","0","0",""),
nodes_proto
)
# - and names.dmp
names_proto <- rbind(
c( "1", "all", "all", "synonym"),
c( "1", "root", "root", "scientific name"),
names_proto
)
Equally, the sequences need new accessions which
reflect the newly created taxid that we’ve added to
nodes.dmp and names.dmp. These names are
include:
kraken:taxid”, which I assume
Kraken2 uses to find the next bit:taxid.These are separated by a pipe (the “|” character), and
put together should look like this:
MJ030-2-barcode69-umi1705bins-ubs-122|kraken:taxid|1491
Kraken2’s documentation on Custom database building
indicates that “kraken:taxid|123456” could be at the front
or back, as long as it is properly delimited by |;
troubleshooting via genAI suggested that the above order is preferred
(but again, feel free to explore this variable).
names(usable_seqs) <- paste0(
names(usable_seqs),
"|kraken:taxid|",
unlist( mclapply( taxs[ names(usable_seqs) , "effective_taxon"], identify, mc.cores = n_cores, mc.cleanup = TRUE )
)
)
Always worth checking the structure and look of these files.
Certainly easier to examine in a local session on Rstudio
or similar, but they should look similar (identical) to the standard
Kraken2 ones:
head( nodes_proto )
head( names_proto )
head( usable_seqs )
Finally, write out nodes.dmp &
names.dmp. Check out that weird sep arg -
again, this is a key part of how NCBI taxdump files are structured, so
Kraken2 needs it:
# reference FASTA sequences first:
writeXStringSet( usable_seqs,
filepath = paste0( path, "/gg2_2024.09__FullLength313k__seqs.fna"),
compress = FALSE, format = "fasta" )
# nodes and names.dmp files, with odd "\t|\t" separator
write.table(names_proto, file = paste0( path, "/gg2_2024.09__FullLength313k__names.dmp"),
sep = '\t|\t', col.names = FALSE, row.names = FALSE, quote = FALSE)
write.table(nodes_proto, file = paste0( path, "/gg2_2024.09__FullLength313k__nodes.dmp"),
sep = '\t|\t', col.names = FALSE, row.names = FALSE, quote = FALSE)
kraken2-build the GG2 databaseKraken2 databaseThis will work! Kraken2 only needs the three files, in two folders,
inside your database directory as below to build a database. All the
other taxdump files (accessions2taxids, other
.dmp files) are ancillary. Step 1 is to
arrange this manually.
NB: Doing this manually skips the first step in the standard database building (e.g.
kraken2-build --download-library bacteria --db $k2db)
$ tree $k2db
# > ├── library
# > │ └── seqs.fna
# > └── taxonomy
# > ├── names.dmp
# > └── nodes.dmp
Make folders, copy the files over. Note that the files end up named
nodes.dmp, names.dmp, seqs.fna
(non-compressed):
# where you saved the dmp & fna files from R
outdir=/path/to/custom_outputs/
# where you want your database to go
k2db=/path/to/custom_gg2_database/
mkdir -p $k2db $k2db/taxonomy $k2db/library
# copy from wherever you made it, to wherever your database is going
cp $outdir/*nodes.dmp $k2db/taxonomy/nodes.dmp
cp $outdir/*names.dmp $k2db/taxonomy/names.dmp
cp $outdir/*seqs.fna $k2db/library/seqs.fna
head $k2db/taxonomy/nodes.dmp
head $k2db/taxonomy/names.dmp
head $k2db/library/seqs.fna
Originally, I was also building seqid2taxid.map too, but
Kraken2 is quite happy to do that step itself, so let it away. However,
we do still need to add a \t|\t delimiter at the
end of the .dmp files - otherwise the database will work
and classify taxids, but will not tell you what the actual
taxon ranks are - which would not be much use.
sed -i -e 's/$/\t|/g' $k2db/taxonomy/names.dmp
sed -i -e 's/$/\t|/g' $k2db/taxonomy/nodes.dmp
Kraken2 databaseIf making your own database from scratch, step 2 is to add sequences into the database structure from step 1. Note here we skip masking of repetitive sequences, as we’re working with 16S sequences (we need all the signal we can get from this restricted range of sequences, and some “low” information stretches are structurally important. As ever, thoughts welcome on this).
kraken2-build --add-to-library $k2db/library/seqs.fna --db $k2db --no-masking
Once all the pieces are in place, it’s (theoretically) as simple as:
# rub your hands together and
kraken2-build --build --db $k2db --threads 15
# > Added "$k2db/library/seqs.fna" to library (/path/to/custom_gg2_database/)
# > Creating sequence ID to taxonomy ID map (step 1)...
# > Sequence ID to taxonomy ID map complete. [0.182s]
# > Estimating required capacity (step 2)...
# > Estimated hash table requirement: 38361964 bytes
# > Capacity estimation complete. [4.503s]
# > Building database files (step 3)...
# > Taxonomy parsed and converted.
# > CHT created with 15 bits reserved for taxid.
# > Completed processing of 625810 sequences, 938680526 bp <<< much better!
# > Writing data to disk... complete.
# > Database files completed. [3m8.402s]
# > Database construction complete. [Total: 3m13.129s]
Inspecting the database itself (long output curtailed):
# press Q to exit from `less`
kraken2-inspect --db $k2db | less
# > # > Database options: nucleotide db, k = 35, l = 31
# > # > Spaced mask = 11111111111111111111111111111111110011001100110011001100110011
# > # > Toggle mask = 1110001101111110001010001100010000100111000110110101101000101101
# > # > Total taxonomy nodes: 18637
# > # > Table size: 6672071
# > # > Table capacity: 9590491
# > # > Min clear hash value = 0
# > 100.00 6672071 3502 R 1 root
# > 97.61 6512515 309649 R1 3 d__Bacteria
# > 17.49 1167211 7021 P 109 p__Pseudomonadota
# > 12.18 812446 98233 C 255 c__Gammaproteobacteria
# > 2.21 147697 13331 O 590 o__Burkholderiales
# > 0.90 60050 14680 F 1682 f__Burkholderiaceae_A_595421
# > 0.05 3614 401 G 8717 g__Solimicrobium
# > 0.04 2647 2647 S 21332 s__Solimicrobium aquaticum
# > 0.00 324 324 S 21333 s__Solimicrobium silvestre
# > 0.00 115 115 S 23410 s__Undibacterium aquatile
# > 0.00 102 102 S 23413 s__Undibacterium seohonense
# > 0.00 25 25 S 23412 s__Undibacterium nitidum
# > 0.04 2868 371 G 7552 g__Paraburkholderia_580243
# > [...] etc.
Much success.
kraken2 with the GG2 databaseThat’s only the beginning. Lets try apply it to a DADA2
output of assigned ASVs:
kraken2 --db $k2db \
--use-names $WRK2/dada2_output/assigned_ASVs__2632.fna \
--report-zero-counts --report $k2db/temp_k2_report | less -S
All 100% classified (though not necessarily to species level, as expected…):
# > 2632 sequences (1.08 Mbp) processed in 0.254s (622.7 Kseq/m, 256.12 Mbp/m).
# > 2632 sequences classified (100.00%)
# > 0 sequences unclassified (0.00%)
# classification output:
C asv__4308 f__Rhizobiaceae (taxid 2720) 399 3:32 109:1 [...]
C asv__4315 s__Actinotignum schaalii_A_386418 (taxid 10669) 415 [...]
C asv__4326 s__Gemmiger_A_73129 qucibialis (taxid 14717) 425 [...]
C asv__4334 g__Peptoniphilus_A (taxid 7692) 398 3:18 1217:4 [...]
C asv__4405 g__Lentihominibacter (taxid 6546) 401 3:51 [...]
C asv__4445 s__Neisseria sp000186165 (taxid 18038) 424 3:14 [...]
[...]
If we look at the Kraken2 report, we see everything
broken down by taxid, and hierarchical taxonomy.
less -S $k2db/temp_k2_report
0.00 0 0 U 0 unclassified
100.00 2632 0 R 1 root
100.00 2632 0 R1 3 d__Bacteria
39.86 1049 0 P 14 p__Bacillota_A_368345
39.86 1049 0 C 191 c__Clostridia_258483
21.12 556 1 O 883 o__Lachnospirales
19.98 526 91 F 2360 f__Lachnospiraceae
3.50 92 10 G 4377 g__Blautia_A_141781
1.10 29 29 S 12002 s__Blautia_A_141781 obeum
0.91 24 24 S 12003 s__Blautia_A_141781 wexlerae
0.68 18 18 S 12000 s__Blautia_A_141781 fusiformis
0.34 9 9 S 11999 s__Blautia_A_141781 caecimuris
Best of luck in yr bughunts.
Just the code plx? Shur shur shur.
R rm(list=ls()) ; gc()
## here, download to /dl/path from R
system("cd /dl/path")
system("wget https://ftp.microbio.me/greengenes_release/current/2024.09.seqs.fna.gz")
system("wget https://ftp.microbio.me/greengenes_release/current/2024.09.taxonomy.id.tsv.gz")
## get sequences =========================================================
# attributes, names, length, hist etc. for DNA work
library("Biostrings")
library("parallel")
n_cores <- 15
path <- "/dl/path"
length(seqs <- readDNAStringSet( filepath = paste0(path, "/2024.09.seqs.fna.gz" ))) # 23,467,470
head(tax <- read.table(paste0( path, "/2024.09.taxonomy.id.tsv.gz"), header = TRUE, sep = "\t")) # 23M lines
## make taxonomy =========================================================
rank_is <- c( "d__" = "domain", "p__" = "phylum",
"c__" = "class", "o__" = "order",
"f__" = "family", "g__" = "genus",
"s__" = "species")
# 3 taxonomic dummy levels added here for fun
tax_dummies <- 3
taxs <- as.data.frame(do.call("rbind", mclapply( tax$Taxon, function(aa){
bb <- unlist(strsplit( aa, split = "\\; "))[1:(7+tax_dummies)]
bb
}, mc.cores = n_cores, mc.cleanup = TRUE)), stringsAsFactors = FALSE)
colnames(taxs) <- rank_is
taxs$header <- tax$Feature.ID
rownames(taxs) <- taxs$header
## crucial - accession - taxon map
taxs$effective_taxon <- unlist( mclapply( 1:nrow(taxs), function(aa){
bb <- taxs[ aa, 1:(7+tax_dummies) ]
bb[ max(which( !( grepl("__$", bb, perl = TRUE) | is.na(bb) ) )) ]
}, mc.cores = n_cores, mc.cleanup = TRUE))
uniq_tax <- apply( taxs[, 1:(7+tax_dummies)], 2, function(aa){
bb <- aa[ !( is.na(aa) ) ]
sort(unique(unlist(bb)))
})
## subset to seqs in 1.4-1.65kbp range ====================================
s_widths <- width(seqs)
# 332,588 in full version
length(seqs_w <- seqs[ s_widths > 1400 & s_widths < 1650 ])
# match with those in taxs - 312,905 in full version
length( usable_seqs <- seqs_w[ names(seqs_w) %in% taxs$header ] )
## taxid proto -----------------------------------------------------------
## root is 1, so floop/domains start from 2...
floop <- 1
## counter with universal assignment to var
taxid_proto <- lapply( names(uniq_tax)[
sapply( uniq_tax, function(aa){ !identical( aa, character(0)) })
], function(aa){
if( aa == "domain"){ floop <<- 1 }
bb <- uniq_tax[ aa ][[1]]
cc <- (floop+1):(floop+length(bb)) # stringr::str_pad( (floop+1):(floop+length(bb)), width = 7, pad = "0" )
floop <<- floop + length(cc)
list( "name_txt" = bb, "taxid" = cc)
})
names(taxid_proto) <- rank_is
## no mult lookups - decide rank ahead of time, use rank_is & grep through that alone (e.g. based on prefix)
identify <- function(aa){
bb_rank <- rank_is[ gsub("__.*", "__", aa )] # aa <- "g__GWA2-37-10"
if( aa %in% taxid_proto[[bb_rank]][[1]] ){
taxid_proto[[bb_rank]][[2]][ match(aa, taxid_proto[[bb_rank]][[1]] ) ]
}else{
print(paste0("taxon ", aa, " wasn't in the taxid map"))
}
}
identify_parent <- function(aa){ # aa <- "g__UBA6532" aa <- "s__"
if( grepl("d__", aa)){
"1" # snoonos get unos
}else{
bb_rank <- rank_is[ gsub("__.*", "__", aa) ]
cc <- taxs[ match(aa, taxs[ , bb_rank ]) , ( which( names(taxs) == bb_rank) -1) ]
identify( cc )
}
}
## names proto -----------------------------------------------------------
names_proto <- data.frame(
"tax_id" = unlist( mclapply( unlist( uniq_tax), identify, mc.cores = n_cores, mc.cleanup = TRUE )),
"name_txt" = unlist( uniq_tax),
"unique name" = "", # safety net ignored.
"name class" = "scientific name", # synonym, common name, ... lower case!
stringsAsFactors = FALSE
)
## proto nodes -----------------------------------------------------------
nodes_proto <- data.frame(
"tax_id" = unlist(mclapply( names_proto$name_txt, identify, mc.cores = n_cores, mc.cleanup = TRUE )), ## parallel
"parent tax_id" = unlist( mclapply( names_proto$name_txt, identify_parent, mc.cores = n_cores, mc.cleanup = TRUE )), ## parallel
"rank" = rank_is[ unlist( mclapply( names_proto$name_txt, function(aa){ gsub("(^\\w__).*", "\\1", aa) }, mc.cores = n_cores, mc.cleanup = TRUE )) ] , ## parallel
"embl code" = "", # no one seems to know about this one
"division id" = 0, # 0 = Bacteria (but also apparently Archaea...)
"inherited div flag" = 1,
"genetic code id" = 11, # 11 = Bact/Arch/plastid
"inherited GC" = 1, # 1 if node inherits genetic code from parent
"mitochondrial genetic code" = 0, # see gencode.dmp file
"inherited MGC" = 1, # 1 if node inherits mitochondrial gencode from parent
"GenBank hidden" = 0, # 1 if name is suppressed in GenBank entry lineage
"hidden subtree root" = 0, # 1 if this subtree has no sequence data yet
"comments" = "")
## rooting the tree - must be appended to :
# - nodes.dmp (note 8,0,1, instead of 11,1,0)
nodes_proto <- rbind(
c( "1","1","no rank","","8","0","1","0","0","0","0","0",""),
nodes_proto
)
# - and names.dmp
names_proto <- rbind(
c( "1", "all", "all", "synonym"),
c( "1", "root", "root", "scientific name"),
names_proto
)
## modify the seq headers =================================================
names(usable_seqs) <- paste0(
names(usable_seqs),
"|kraken:taxid|",
unlist( mclapply( taxs[ names(usable_seqs) , "effective_taxon"], identify, mc.cores = n_cores, mc.cleanup = TRUE )
)
)
## out that! ---------------------------------------------------------------
# head( nodes_proto )
# head( names_proto )
# head( usable_seqs )
## note sep for nodes & names
writeXStringSet( usable_seqs, filepath = paste0( path, "/gg2_2024.09__FullLength313k__seqs.fna"),
compress = FALSE, format = "fasta" )
write.table(names_proto, file = paste0( path, "/gg2_2024.09__FullLength313k__names.dmp"),
sep = '\t|\t', col.names = FALSE, row.names = FALSE, quote = FALSE)
write.table(nodes_proto, file = paste0( path, "/gg2_2024.09__FullLength313k__nodes.dmp"),
sep = '\t|\t', col.names = FALSE, row.names = FALSE, quote = FALSE)
bashSwap to bash:
outdir=/dl/path
k2db=/ref/gg2_db
# rm -rf $k2db
mkdir -p $k2db $k2db/taxonomy $k2db/library
cp $outdir/*nodes.dmp $k2db/taxonomy/nodes.dmp
cp $outdir/*names.dmp $k2db/taxonomy/names.dmp
cp $outdir/*seqs.fna $k2db/library/seqs.fna
# fix delimiters!
sed -i -e 's/$/\t|/g' $k2db/taxonomy/names.dmp
sed -i -e 's/$/\t|/g' $k2db/taxonomy/nodes.dmp
kraken2-build --add-to-library $k2db/library/seqs.fna --db $k2db --no-masking
# full thing takes ~90 mins
kraken2-build --build --db $k2db --threads 15
# look
kraken2-inspect --db $k2db | head -n20
kraken2 --db $k2db \
--use-names ${yourdata}/query16S.fna \
--report-zero-counts --report ./query16S.report > ./query16S.output
head -n25 ./query16S.report
# here endeth the guessin
Filtered the original 23M sequences to keep only sequences between 1.3Kbp and 1.65Kbp only (331K sequences).
hash.k2d - 38MBopts.k2d - 64Btaxo.k2d - 1.5MBShould you feel so inclined
full_db_tar.gz - includes the hash,
opts, & taxo.k2d, as well as
nodes, map, & seq.fna files
used: 170MBBuilt from the complete release, contains all sequences with a matching accession (ought to be 23M seqs).
hash.k2d - 184MBopts.k2d - 64Btaxo.k2d - 1.8MBOr indeed, run away with yrself:
full_db_tar.gz - includes the hash,
opts, & taxo.k2d, as well as
nodes, map, & seq.fna files
used: 1.27GBThings to be aware of if using this data for your work (generally same between GG2 / S138.2):
taxids, so none of these
will match NCBI taxonomy.A number of issues that needed to be taken care of en route to the bottom of the page.
.dmps: incorrect seps on .dmps (needed to
be \t|\t).dmps: no sep (\t|\t) at end of
each line in .dmps (the end of a line is traditionally not
delimited by sep, but by a newline, so go figure.).dmps: column 4 in names.dmp needed to be
lower-case in order to report taxonomies (taxid assignment
worked fine, but it wouldn’t tell you who they were)<acc>|kraken:taxid|000000 is
better than kraken:taxid|000000|<acc>taxonomizR
- interface to NCBI’s taxdump, which was the real confusion when
starting