Databases

Note

The databases on this page require Metabuli v1.2.0 or later. For older versions, please refer to the Old Database page.

Pre-built databases are provided for common use cases. All databases can be downloaded from here.


Summary

Note

Please refer to the Metabuli-Bracken page for instructions on using Bracken with Metabuli's databases.

| Database Name | Taxonomy | Size (GB) | Bracken | Contents | Link |
|---|---|---|---|---|---|
| gtdb226 | GTDB | 378 | - | GTDB R226 genomes | Download |
| refseq_standard | NCBI | 111 | Yes | RefSeq archaea, bacteria, virus, plasmid, protozoa, fungi, and human | Download |
| hrgm2 | GTDB | 85 | Yes | Human Reference Gut Microbiome v2 (HRGM2) | Download |
| hrom | GTDB | 42 | Yes | Human Reference Oral Microbiome (HROM) | Download |
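
As a rough aid for choosing a database, the sizes in the summary table can be compared against the disk space you have available. The sketch below is only illustrative: the `headroom` factor is a hypothetical safety margin, and the page does not state whether Size(GB) refers to the download archive or the unpacked database.

```python
# Sizes in GB, copied from the summary table on this page.
DATABASES = {
    "gtdb226": 378,
    "refseq_standard": 111,
    "hrgm2": 85,
    "hrom": 42,
}


def databases_that_fit(free_gb, headroom=1.0):
    """Return names of databases whose listed size, scaled by `headroom`,
    fits into `free_gb`, ordered from smallest to largest.

    `headroom` is a hypothetical safety factor (e.g. 1.2 for 20% spare space).
    """
    fitting = [
        (size, name)
        for name, size in DATABASES.items()
        if size * headroom <= free_gb
    ]
    return [name for _size, name in sorted(fitting)]


print(databases_that_fit(120))  # ['hrom', 'hrgm2', 'refseq_standard']
```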

gtdb226

  • Citation: GTDB R226 (Parks et al., 2026)
  • Species representative genomes with CheckM2 completeness > 90% and contamination < 5%.
  • Includes 90,791 of the 143,614 species in GTDB R226.
  • Human genome (T2T-CHM13v2.0) and RefSeq Virus (2026-03-31) are added.
  • build options: --space-mask 11101110111 --custom-metamer reduced_15_pattern.txt --syncmer 1 --smer-len 6
    • Because many genomes are included, syncmers are used to reduce database size and improve classification speed.
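
To give an intuition for why syncmers shrink a database, here is a minimal sketch of the generic closed-syncmer rule from the literature: a k-mer is kept only if its smallest s-mer sits at its first or last position, so only a subset of all k-mers is stored. This is an assumption-laden toy, not Metabuli's implementation: the page does not say which syncmer variant `--syncmer 1` selects or what alphabet it operates on, and the toy uses short nucleotide strings with small `k` and `s` purely for readability.

```python
def is_closed_syncmer(kmer, s):
    """True if the lexicographically smallest s-mer of `kmer` occurs at the
    first or the last possible offset (generic closed-syncmer rule)."""
    smers = [kmer[i:i + s] for i in range(len(kmer) - s + 1)]
    smallest = min(smers)
    return smers[0] == smallest or smers[-1] == smallest


def select_syncmers(seq, k, s):
    """Return the k-mers of `seq` that pass the closed-syncmer rule."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return [kmer for kmer in kmers if is_closed_syncmer(kmer, s)]


# Toy input: 2 overlapping 4-mers, of which only one is kept.
print(select_syncmers("CAGTA", k=4, s=2))  # ['AGTA']
```

Because the rule depends only on the k-mer itself, the same k-mers are selected from the reference and from a read, which is what lets a reduced database still be matched consistently.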

refseq_standard

Note

Eukaryotic Classification Notice: The default minimum amino acid match count for eukaryotes (--min-aa-euk) is 16, which is significantly more stringent than the prokaryotic requirement of 11. When using the classify module, set --min-aa-euk to a value between 11 and 13 to obtain more eukaryotic classifications.
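
The notice above can be turned into a small helper that assembles a classify command with a relaxed `--min-aa-euk`. This is a sketch under assumptions: the positional layout (reads, database directory, output directory, job ID) reflects my understanding of the Metabuli command line and should be checked against `metabuli classify -h`; all file and directory names are hypothetical. The helper only builds the argument list and does not run anything.

```python
import shlex


def classify_command(reads, db_dir, out_dir, job_id, min_aa_euk=None):
    """Assemble a `metabuli classify` argument list without executing it.

    The positional order used here is an assumption; verify it with
    `metabuli classify -h` before use.
    """
    cmd = ["metabuli", "classify", *reads, db_dir, out_dir, job_id]
    if min_aa_euk is not None:
        # --min-aa-euk is the option named in the notice above.
        cmd += ["--min-aa-euk", str(min_aa_euk)]
    return cmd


# Hypothetical paired-end input and paths.
cmd = classify_command(
    ["reads_1.fq", "reads_2.fq"], "refseq_standard", "results", "sample1",
    min_aa_euk=12,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the layout has been confirmed for your installed version.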

  • Metabuli version of Kraken2's PlusPF database (2026-02-26 update)
    • The same set of genomes as in Kraken2's PlusPF database is used.
      • RefSeq Complete Genome or Chromosome level assemblies: archaea, bacteria, virus, protozoa, fungi, and human
      • RefSeq plasmids and UniVec_Core
    • Difference from Kraken2's PlusPF:
      • Sequences deprecated between 2026-02-26 and 2026-04-02 are excluded.
        • List of excluded sequences
          • NC_002193
          • NC_010021
          • NC_018496
          • NC_018497
          • NC_024996
          • NC_030892
          • NZ_CM136992
          • NZ_CM136993
          • NZ_CM136994
          • NZ_CM136995
          • NZ_CP103377
          • NZ_CP103378
          • NZ_CP103379
          • NZ_CP103380
          • NZ_CP126834
          • NZ_CP126841
          • NZ_CP126850
          • NZ_CP168307
          • NZ_CP180736
          • NZ_CP180737
          • NZ_CP181249
          • NZ_CP199310
          • NZ_JADRXB020000004
          • NZ_JADRXB020000007
          • NZ_JAWQLS010000002
          • NZ_JAWQLS010000003
          • NZ_JAWQLS010000004
          • NZ_JAWQLS010000005
          • NZ_JBPJAM010000036
          • NZ_JBRYHD010000002
          • NZ_JBRYHD010000003
          • NZ_JBRYHE010000002
          • NZ_JBRYHE010000003
          • NZ_JBRYHF010000002
          • NZ_JBRYHF010000003
          • NZ_JBRYHG010000002
          • NZ_JBRYHG010000003
          • NZ_JBTORD010000003
          • NZ_JBTORD010000004
      • 4,936 more plasmids are included because the RefSeq plasmid set has been updated.
  • Bracken support: The Bracken database is bundled with Kraken2's PlusPF database.
  • build options: --space-mask 11101110111 --custom-metamer reduced_15_pattern.txt --syncmer 1 --smer-len 6
    • Because many genomes are included, syncmers are used to reduce database size and improve classification speed.
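
If you want to reproduce the exclusion step on your own copy of the PlusPF sequences, the list of excluded accessions above can be applied as a simple FASTA filter. The sketch below assumes headers start with `accession.version` (for example `>NC_002193.1 ...`); headers with a prefix such as a taxid tag would need a different parser. Only three accessions from the list are shown, and `NC_000001.1` in the usage example is a made-up placeholder.

```python
# A subset of the excluded accessions listed above (versions are not given there).
EXCLUDED = {"NC_002193", "NC_010021", "NZ_CP103377"}


def accession_of(header):
    """Extract the unversioned accession from a FASTA header line,
    e.g. '>NC_002193.1 some description' -> 'NC_002193'."""
    parts = header.lstrip(">").split()
    return parts[0].split(".")[0] if parts else ""


def drop_excluded(fasta_lines, excluded=EXCLUDED):
    """Yield FASTA lines, skipping every record whose accession is excluded."""
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            keep = accession_of(line) not in excluded
        if keep:
            yield line


records = [">NC_002193.1 deprecated", "ACGT", ">NC_000001.1 placeholder", "GGCC"]
print(list(drop_excluded(records)))  # ['>NC_000001.1 placeholder', 'GGCC']
```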

hrgm2

  • Citation: Human Reference Gut Microbiome v2 (HRGM2).
  • HRGM2 statistics:
    • Only near-complete genomes (Completeness ≥ 90%, Contamination ≤ 5%, and GUNC CSS < 0.45)
    • 155,211 genomes representing 4,824 species.
  • Human genome (T2T-CHM13v2.0) and RefSeq Virus (2026-03-31) are added.
  • Bracken support:
    • Download the Bracken database from the HRGM2 page here.
    • NOTE: The HRGM2 Bracken databases only include prokaryotic genomes. Viral and eukaryotic portions of Bracken results should be interpreted with caution.
  • build options: --space-mask 11101110111 --custom-metamer reduced_15_pattern.txt
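
For readers unfamiliar with spaced masks, the following toy shows the usual spaced-seed reading of a pattern such as `11101110111`: positions marked `1` are compared and positions marked `0` are ignored, so two windows that differ only at masked-out positions still match. This is a generic illustration, not a description of Metabuli internals; the page does not specify which units the mask positions correspond to, and the letter strings below are arbitrary.

```python
MASK = "11101110111"  # value passed to --space-mask in the build options above


def apply_mask(window, mask=MASK):
    """Keep only the characters of `window` at positions where `mask` is '1'."""
    if len(window) != len(mask):
        raise ValueError("window and mask must have the same length")
    return "".join(ch for ch, bit in zip(window, mask) if bit == "1")


def masked_match(a, b, mask=MASK):
    """True if `a` and `b` agree at every position the mask keeps."""
    return apply_mask(a, mask) == apply_mask(b, mask)


print(apply_mask("ABCDEFGHIJK"))                   # ABCEFGIJK
print(masked_match("ABCDEFGHIJK", "ABCXEFGYIJK"))  # True
print(masked_match("ABCDEFGHIJK", "XBCDEFGHIJK"))  # False
```

The mask keeps 9 of its 11 positions, so mismatches at the two `0` positions are tolerated.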

hrom

  • Citation: Human Reference Oral Microbiome (HROM).
  • HROM statistics:
    • 72,641 high-quality genomes (Completeness ≥ 90%, Contamination ≤ 5%, and GUNC CSS < 0.45) representing 3,426 species are used.
  • Human genome (T2T-CHM13v2.0) and RefSeq Virus (2026-03-31) are added.
  • Bracken support:
    • Download the Bracken database from the HROM page here.
    • NOTE: The HROM Bracken databases only include prokaryotic genomes. Viral and eukaryotic portions of Bracken results should be interpreted with caution.
  • build options: --space-mask 11101110111 --custom-metamer reduced_15_pattern.txt