Silva138.2-Kraken2database: download here / see end of page
Silva138.2-Kraken2database: skip to code only
Here we take the most recent release of the Silva database (v.138.2 Silva; 2.2M sequences, varying lengths), and format it for use explicitly with the Kraken2 classifier.
See also all the roundabout talk regarding the GG2 version, including a basic intro to building Kraken2 databases.
of note:
taxids do not match NCBI! They don’t even match
between the SSU & LSU datasets (let alone matching GG2).
Heartbreaking.species rank at all, and nor
does it assign species-level taxids. Instead, every species
is assigned the taxid of it’s parent, thereby “extending”
its parent genus. Having done no research on the matter, I’m not sure
why this should be, but it is at least partly the motivation for this
code.SILVA_138.2_SSUReftax_silva.fasta.gz,
and not the NR99 version (tiny differences are the entire
point of using ASVs). However, mighty benjjneb and
co. use NR99 for their NB-RDP method. Should we follow suit?
Probably not - DADA2 uses a twostep
solution,
whereas we are aiming for perfect-firstime-e’rytime.It’s worth looking at the three Silva files we need
before going any further (they’re quite different from the GG2 files,
and see also the
README.txt).
208K -rw-rw-r-- 1 hndbls hndbls 206K Jul 11 2024 tax_slv_ssu_138.2.txt.gz
28M -rw-rw-r-- 1 hndbls hndbls 28M Jul 11 2024 taxmap_slv_ssu_ref_138.2.txt.gz
677M -rw-rw-r-- 1 hndbls hndbls 677M Jul 11 2024 SILVA_138.2_SSURef_tax_silva.fasta.gz
# for a quick look at the contents, try
gzip -dc /some/silva.file.gz | head
tax_slv_[ls]su_VERSION.txt ->
linkEach organism’s:
taxida for enironmental; w
for due update; or blank.Silva DB version number.The last two do not matter for this, but path-taxid-rank
are very helpful.
taxmap_TAXNAME_[ls]su_VERSION.txt ->
linkFile contains:
primaryAccession : found in the header of the
sequences. Largey unique, but some are duplicated when more than one hit
is detected per reference sequence.start : starting 5’ postiton of the reference sequence
in the entrystop : ending 3’ postiton of the reference sequence in
the entrypath : taxonomic descent to the entry, separated by
“;’, and stopping at the parent rank. Note that this is
not a standard number of fields, and entries up to 14
ranks are common.organism_name : name of the organism. Poorly
standardised, can include ([#\, and Crom knows what else.
To fix this, make sure to use
quote = "", comment= "" when reading the table into
R.taxid : the taxonomic ID of this entry. Again, note
that all the species from the same genus will share the
taxid of that genus.SILVA_138.2_SSURef_tax_silva.fasta.gz ->
linkThe reference FASTA sequences, complete with
all-important header:
primaryAccession in the taxmap file (#2
above) does not contain version numbering (e.g.,
AY846372), while the accession in the seqs.fna
header does (e.g. AY846372.1.1779). We will have to either
remove the versioning, or edit the header to start with just the
accession (i.e., “>AY846372|..., and
not”>AY846372.1.1779|...
);organism_name above, the header contains
spaces and other stupid characters like [, (,
‘/’, and #. Stay wide.A, T,
G, and C.From Silva, we need to create the following using the
taxids we copy from Silva/create for species level:
names.dmp: organism_name,
taxid, and a bunch of other unimportant stuff we can
fakenodes.dmp: taxid,
parent_taxid, rank, and a bunch more
unimportant stuff, which we can also fakeseqs.fna: FASTA-format DNA sequences, with
the header reformatted as:
“>---#1---|kraken:taxid|---#2---”, where
---#1--- is the accession the sequence came with, and
---#2--- is the taxid value (either the one
copied from Silva, or the one we create for species-level
assignment). Any additional fields must be separated with a pipe
(|), and spaces are not allowed in the
header.Good to keep in mind, Kraken2 just wants matching
names.dmp, nodes.dmp, and
seqs.fna, as per the screenshot below. Note that for
names.dmp and nodes.dmp, each field is
delimited by \t|\t (tab, pipe, tab), including one at the
very end for nodes.dmp! We set these when we write out the
file from R, using the sep = "..." arg, and we
fix that final field in bash.
codeR: re-format Silva # nano $ref/k2__silva138.2/r_format_s1382-ALL-SEQS.R
rm(list=ls()) ; gc()
# attributes, names, length, hist etc. for DNA work
library("Biostrings")
library("parallel")
n_cores <- 15
## import SILVA -----------------------------------------------------------
## 3 silva files already downloaded to this location
path <- "/ref/dl"
## sequnces
# note silva seqs are in RNA, so need to RNA -> DNA
length(seqs <- DNAStringSet(
readRNAStringSet(
filepath = paste0(path, "/SILVA_138.2_SSURef_tax_silva.fasta.gz" )))) # 2,224,690
## taxmap
head(tax <- read.table(paste0( path,
"/taxmap_slv_ssu_ref_138.2.txt.gz"), header = TRUE,
sep = "\t", quote = "", comment= "" )) # 2,224,691 lines
## taxonomic ranks
ranks <- read.table(paste0( path, "/tax_slv_ssu_138.2.txt.gz"),
header = FALSE, sep = "\t", quote = "") # 2,224,691 lines
colnames(ranks) <- c("path", "taxid", "rank", "comment", "release")
head(seqs)
head(ranks)
head(tax)
## get parent_taxid
ranks$effective_taxon <- gsub( ".*;([\\w\\s\\.\\-\\(\\)\\[\\]]*);$",
"\\1", ranks$path, perl = TRUE )
# remove odd characters for better matching
ranks$path_debracket <- gsub("\\(|\\)", "_",
gsub("\\[|\\]", "_", ranks$path))
ranks$parent_taxid <- unlist( mclapply( 1:nrow(ranks), function(aa){ # aa <- 13
bb <- strsplit( ranks[ aa , ]["path"][[1]], ";")[[1]]
if( length(bb) > 1){
cc <- unlist( ranks[ grep(
gsub("\\(|\\)", "_",
paste0( c(bb[ 1:length(bb)-1 ],"$"), collapse=";"),
perl=TRUE),
ranks$path_debracket, perl = TRUE) , "taxid" ])
}else{
cc <- 1 # assume its at domain level
}
if( length( cc ) == 0 ){
0
}else if( identical( cc, character(0)) | is.na(cc) | is.null(cc) ){
NA
}else{
cc
}
}, mc.cores = n_cores, mc.cleanup = TRUE, mc.preschedule = TRUE))
head(ranks)
## expand SILVA taxid to include SPECIES ----------------------------------
## new taxid for species (& strains, substrains, etc.)
tax$taxid_novo <- tax$taxid
floop <- 13370000
# with each unique taxid in tax, get all taxa sharing that parent, and count up from floop
taxf <- do.call( "rbind", lapply( unique(tax$taxid), function(aa){ # aa <- 58941
bb <- tax[ tax$taxid_novo == aa , ]
cc <- floop+nrow(bb)
bb[ , "parent_taxid" ] <- aa
bb[ , "taxid_novo" ] <- (floop+1):cc
floop <<- cc
bb
}))
## update rank assignment
taxf$rank <- sapply( taxf$organism_name, function(aa){
if( grepl( "substr", aa)){
"substrain"
}else if( grepl( "subsp", aa)){
"strain"
}else if( grepl( "subsp", aa)){
"subspecies"
}else{
"species"
}
})
head(taxf)
## proto names -----------------------------------------------------------
head(
spec_names <- rbind(
# species-level stuffing
data.frame(
taxid = taxf$taxid_novo,
name_txt = taxf$organism_name,
"unique name" = "",
"name class" = "scientific name",
stringsAsFactors = FALSE),
# everything else
data.frame(
taxid = ranks$taxid,
name_txt = ranks$effective_taxon,
"unique name" = "",
"name class" = "scientific name",
stringsAsFactors = FALSE)
) )
## proto nodes -----------------------------------------------------------
head(
spec_nodes <- rbind(
# more species-level stuffing
data.frame(
"tax_id" = taxf$taxid_novo,
"parent tax_id" = taxf$parent_taxid,
"rank" = taxf$rank,
"embl code" = "", # no one seems to know about this one
"division id" = 0, # 0 = Bacteria (but also apparently Archaea...)
"inherited div flag" = 1,
"genetic code id" = 11, # 11 = Bact/Arch/plastid
"inherited GC" = 1, # 1 if node inherits genetic code from parent
"mitochondrial genetic code" = 0, # see gencode.dmp file
"inherited MGC" = 1, # 1 if node inherits mitochondrial gencode from parent
"GenBank hidden" = 0, # 1 if name is suppressed in GenBank entry lineage
"hidden subtree root" = 0, # 1 if this subtree has no sequence data yet
"comments" = ""),
data.frame(
"tax_id" = ranks$taxid,
"parent tax_id" = ranks$parent_taxid,
"rank" = ranks$rank,
"embl code" = "",
"division id" = 0,
"inherited div flag" = 1,
"genetic code id" = 11,
"inherited GC" = 1,
"mitochondrial genetic code" = 0,
"inherited MGC" = 1,
"GenBank hidden" = 0,
"hidden subtree root" = 0,
"comments" = "")
) )
## rooting the tree - must be appended to :
# - names.dmp
names_proto <- rbind(
c( "1", "all", "all", "synonym"),
c( "1", "root", "root", "scientific name"),
spec_names
)
# - and nodes.dmp (note 8,0,1, instead of 11,1,0)
nodes_proto <- rbind(
c( "1","1","no rank","","8","0","1","0","0","0","0","0",""),
spec_nodes
)
## modify the seq headers -------------------------------------------------
## small difference in sets does not motivate making a subset DB like in GG2
# s_widths <- width(seqs) # subset lengths :: Q1:569 - median:1180 - Q3:2260
# length(seqs_w <- seqs[ s_widths > 1000 & s_widths < 1650 ]) # 2,072,388
length(seqs_w <- seqs) # 2,224,690
## seqs in current subset
length( usable_seqs <- seqs_w[ gsub("\\..*", "", names(seqs_w)) %in% tax$primaryAccession ] ) # 2,224,690
## accession-wrangling not necessary elsewhere, as we allow K2 to build the seqid2taxid map for us
names(usable_seqs) <- gsub( " ", "-",
paste0(
gsub( "\\..*", "", names(usable_seqs)),
"|",
names(usable_seqs),
"|kraken:taxid|",
unlist( mclapply( names(usable_seqs), function(aa){ # aa <- names(seqs_w)[403]
# - first match only below, due to nested hits in the DB
taxf[ match( gsub( "\\..*","", aa), taxf$primaryAccession ) , "taxid_novo" ]
}, mc.cores = n_cores, mc.cleanup = TRUE, mc.preschedule = TRUE ))
))
## DB issue: some accessions represent multiple overlapping htis
# bad_acctors <- unlist(sapply(unique(taxf$primaryAccession), function(aa){ if( sum( taxf$primaryAccession == aa) > 1){aa} }))
# dplyr::filter(taxf, primaryAccession %in% bad_acctors )
# # - solution :: ignore, as the seq-id relationship remains valid
## out that! ---------------------------------------------------------------
head( nodes_proto )
head( names_proto )
head( usable_seqs )
## note sep for nodes & names
writeXStringSet( usable_seqs, filepath = paste0( path, "/silva_138.2__complete2.2M__seqs.fna"),
compress = FALSE, format = "fasta" )
write.table(names_proto, file = paste0( path, "/silva_138.2__complete2.2M__names.dmp"),
sep = '\t|\t', col.names = FALSE, row.names = FALSE, quote = FALSE)
write.table(nodes_proto, file = paste0( path, "/silva_138.2__complete2.2M__nodes.dmp"),
sep = '\t|\t', col.names = FALSE, row.names = FALSE, quote = FALSE)
bash : post-process & build
Kraken2Having made the species levels & taxids, and output
the nodes.dmp, names.dmp, &
seqs.fna, we now build the Kraken2 database.
We still need to do a small fix to the end of the nodes.dmp
and names.dmp files, as below. I couldn’t get the database
to build without doing that.
NB: Building a database from scratch skips the first step in the standard database building (e.g.
kraken2-build --download-library bacteria --db $k2db)
outdir=/ref/dl
k2db=/ref/k2__silva138.2
mkdir -p $k2db $k2db/taxonomy $k2db/library
cp $outdir/*complete2.2M__nodes.dmp $k2db/taxonomy/nodes.dmp
cp $outdir/*complete2.2M__names.dmp $k2db/taxonomy/names.dmp
cp $outdir/*complete2.2M__seqs.fna $k2db/library/seqs.fna
# fix delimiters!
sed -i -e 's/$/\t|/g' $k2db/taxonomy/names.dmp
sed -i -e 's/$/\t|/g' $k2db/taxonomy/nodes.dmp
Your setup should now look like this:
$ tree $k2db
# > ├── library
# > │ └── seqs.fna
# > └── taxonomy
# > ├── names.dmp
# > └── nodes.dmp
Now try building the database:
# note that we avoid masking, to maximise content of 16S
kraken2-build --add-to-library $k2db/library/seqs.fna --db $k2db --no-masking
# full thing takes GG2: ~90 mins Silva138.2: about ~100 mins
time kraken2-build --build --db $k2db --threads 15
# look
kraken2-inspect --db $k2db | head -n20
kraken2 --db $k2db \
--use-names ${yourdata}/query16S.fna \
--report-zero-counts --report ./query16S.report > ./query16S.output
head -n25 ./query16S.report
Complete, species-level 16S Silva database for Kraken2 (2.2M
sequences). Silva is highly curated, thus filtering by
sequence length did not really impact the dataset size/feasibility.
Enjoy working with the full dataset!
hash.k2d - 172MBopts.k2d - 64Btaxo.k2d - 163MBOr indeed, run back and forth, forever:
full_db_tar.gz - includes the hash,
opts, & taxo.k2d, as well as
nodes, map, & seq.fna files
used: 1.01GBDownload here (Zenodo.org - might still be processing…)
$ tree $k2db
# > ├── library
# > │ ├── added
# > │ └── seqs.fna
# > └── taxonomy
# > │ ├── names.dmp
# > │ ├── nodes.dmp
# > │ └── prelim_map.txt
# > ├── seqid2taxid.map
# > ├── hash.k2d
# > ├── opts.k2d
# > └── taxo.k2d
Things to be aware of if using this data for your work:
taxids, so none of these
will match NCBI taxonomy (but none of these 16S databases match
that anyway).A number of issues that needed to be taken care of en route to the bottom of the page.
.dmps: incorrect seps on .dmps (needed to
be \t|\t).dmps: no sep (\t|\t) at end of
each line in .dmps (the end of a line is traditionally not
delimited by sep, but by a newline, so go figure.).dmps: column 4 in names.dmp needed to be
lower-case in order to report taxonomies (taxid assignment
worked fine, but it wouldn’t tell you who they were)<acc>|kraken:taxid|000000 is
better than kraken:taxid|000000|<acc>taxonomizR
- interface to NCBI’s taxdump, which was the real confusion when
starting