updateDB
The updateDB module adds new sequences to an existing Metabuli database. It supports both GTDB-based and NCBI/custom taxonomy-based databases, and allows adding sequences for existing taxa as well as new taxa.
Note
If you want to upgrade to a new GTDB release, build a new database from scratch using the build module instead.
GTDB-based Database
Add GTDB genomes
You can add assemblies that belong to the same GTDB release as the existing database. For example, if your existing database was built with GTDB R214.1, you can add assemblies from GTDB R214.1 that were not included in the original build.
Note
Reference FASTA file names (or paths) must include an assembly accession matching the pattern GC[AF]_[0-9]+\.[0-9]+ (e.g., GCF_028750015.1). Files from RefSeq or GenBank meet this requirement.
metabuli updateDB --gtdb 1 <NEW_DBDIR> <FASTA_LIST> <GTDB_TAXDUMP/taxid.map> <OLD_DBDIR>
| Argument | Description |
|---|---|
| NEW_DBDIR | Directory where the updated database will be generated |
| FASTA_LIST | File listing absolute paths to new FASTA files |
| GTDB_TAXDUMP/taxid.map | Path to the taxid.map file from the GTDB taxdump |
| OLD_DBDIR | Directory of the existing database to update |
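The file-name requirement in the note above can be checked before starting an update. The following is a minimal sketch, not part of Metabuli: the helper names `find_accession` and `check_fasta_list` and the example paths are hypothetical, and only the accession pattern is taken from the note. It also flags paths that are not absolute, since FASTA_LIST is described as a list of absolute paths.

```python
import re
from pathlib import PurePosixPath

# Pattern quoted in the note above; everything else here is illustrative.
ACCESSION_RE = re.compile(r"GC[AF]_[0-9]+\.[0-9]+")


def find_accession(path):
    """Return the first assembly accession found in a path, or None."""
    match = ACCESSION_RE.search(path)
    return match.group(0) if match else None


def check_fasta_list(lines):
    """Split paths into (accepted, rejected).

    accepted maps each path to the accession found in it; rejected lists
    paths that are not absolute or that contain no accession.
    """
    accepted, rejected = {}, []
    for line in lines:
        path = line.strip()
        if not path:
            continue  # ignore blank lines
        accession = find_accession(path)
        if accession and PurePosixPath(path).is_absolute():
            accepted[path] = accession
        else:
            rejected.append(path)
    return accepted, rejected


# Hypothetical paths, for illustration only.
paths = [
    "/data/genomes/GCF_028750015.1_genomic.fna",
    "/data/genomes/my_isolate.fna",
    "genomes/GCA_000001405.29.fna",
]
accepted, rejected = check_fasta_list(paths)
print(accepted)
print(rejected)
```

Here the second path is rejected because it carries no accession, and the third because it is relative. Renaming or symlinking such files so that the path contains the accession is one way to meet the requirement.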
Add sequences of new taxa to a GTDB database
Warning
Mixing taxonomies within the same domain is not recommended. For example, adding prokaryotes using NCBI taxonomy to a GTDB database will cause issues. However, adding eukaryotes or viruses using NCBI taxonomy is fine since GTDB does not cover them.
Option A — Using createnewtaxalist
If you have accession2taxid and taxonomy dump files for the new sequences, use createnewtaxalist to prepare the input automatically:
This generates:
- OUTDIR/newtaxa.tsv: input for --new-taxa
- OUTDIR/newtaxa.accession2taxid
Then update the database:
metabuli updateDB --gtdb 1 <NEW_DBDIR> <FASTA_LIST> <OUTDIR/newtaxa.accession2taxid> <OLD_DBDIR> \
--new-taxa <OUTDIR/newtaxa.tsv>
Option B — Manually prepare a new taxa list
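Provide a four-column, tab-separated file for --new-taxa (no header). As in the example below, each row gives a new taxon's ID, its parent's ID, its rank, and its name.
Each new taxon must be linked, directly or through other new taxa in the file, to a taxon already present in the existing database's taxonomy.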
Example — adding Saccharomyces cerevisiae to a GTDB database where Fungi is absent:
10000013 10000012 species Saccharomyces cerevisiae
10000012 10000011 genus Saccharomyces
10000011 10000010 family Saccharomycetaceae
10000010 10000009 order Saccharomycetales
10000009 10000008 class Saccharomycetes
10000008 10000007 phylum Ascomycota
10000007 10000000 kingdom Fungi
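A hand-written list like the one above is easy to get wrong, so a quick consistency check may help. This is a minimal sketch under stated assumptions: the column meanings (taxon ID, parent ID, rank, name) are inferred from the example, the functions `parse_new_taxa` and `unresolved_parents` are hypothetical helpers rather than Metabuli commands, and 10000000 is assumed to be a taxon that already exists in the database being updated.

```python
def parse_new_taxa(text):
    """Parse tab-separated rows into (taxid, parent, rank, name) tuples."""
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(f"line {line_no}: expected 4 columns, got {len(fields)}")
        taxid, parent, rank, name = fields
        rows.append((int(taxid), int(parent), rank, name))
    return rows


def unresolved_parents(rows, existing_taxids):
    """Return parent IDs that are neither new taxa in the file nor existing taxa."""
    new_ids = {taxid for taxid, _, _, _ in rows}
    return sorted({
        parent
        for _, parent, _, _ in rows
        if parent not in new_ids and parent not in existing_taxids
    })


# The example rows from above, joined with tabs.
example = "\n".join([
    "10000013\t10000012\tspecies\tSaccharomyces cerevisiae",
    "10000012\t10000011\tgenus\tSaccharomyces",
    "10000011\t10000010\tfamily\tSaccharomycetaceae",
    "10000010\t10000009\torder\tSaccharomycetales",
    "10000009\t10000008\tclass\tSaccharomycetes",
    "10000008\t10000007\tphylum\tAscomycota",
    "10000007\t10000000\tkingdom\tFungi",
])
rows = parse_new_taxa(example)

# Assumption: 10000000 is already present in the existing database.
print(unresolved_parents(rows, existing_taxids={10000000}))  # []
print(unresolved_parents(rows, existing_taxids=set()))       # [10000000]
```

An empty result means every lineage in the file eventually reaches a taxon that is already known. Note that names such as "Saccharomyces cerevisiae" contain spaces, so the columns must be separated by tabs.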
Corresponding accession2taxid:
NCBI / Custom Taxonomy-based Database
Add sequences of existing taxa
1. Prepare two files
- New FASTA file list. Each sequence must have a unique >accession.version or >accession header.
- NCBI-style accession2taxid file. Sequences whose accessions are absent from it are skipped, and version numbers are ignored:
accession accession.version taxID gi
SequenceA SequenceA.1 960611 0
SequenceB SequenceB.1 960612 0
NoVersionOkay NoVersionOkay 960613 0
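Because sequences without an accession2taxid entry are skipped, it can be worth listing them before running the update. The sketch below is illustrative only. It assumes that the accession is the first whitespace-delimited token of the FASTA header and that "ignoring the version" means dropping everything from the first dot onward; Metabuli's own parsing may differ. The helper names and the FASTA records are hypothetical.

```python
def read_accession2taxid(text):
    """Map accession (without version) to taxID from NCBI-style rows."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row (its third field is not numeric).
        if len(fields) < 3 or not fields[2].isdigit():
            continue
        mapping[fields[0]] = int(fields[2])
    return mapping


def header_accession(header):
    """Extract the accession from a FASTA header, dropping '>' and any version."""
    tokens = header[1:].split()
    return tokens[0].split(".")[0] if tokens else ""


def skipped_headers(fasta_text, mapping):
    """Return headers whose accession has no entry in the mapping."""
    return [
        line
        for line in fasta_text.splitlines()
        if line.startswith(">") and header_accession(line) not in mapping
    ]


accession2taxid = "\n".join([
    "accession\taccession.version\ttaxID\tgi",
    "SequenceA\tSequenceA.1\t960611\t0",
    "SequenceB\tSequenceB.1\t960612\t0",
    "NoVersionOkay\tNoVersionOkay\t960613\t0",
])
# Hypothetical FASTA records, for illustration only.
fasta = "\n".join([
    ">SequenceA.1 some description",
    "ACGT",
    ">NoVersionOkay",
    "GGCC",
    ">Unlisted.2",
    "TTAA",
])
mapping = read_accession2taxid(accession2taxid)
print(skipped_headers(fasta, mapping))
```

In this example only the `>Unlisted.2` record is reported, since its accession does not appear in the accession2taxid rows.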
2. Update the database
metabuli updateDB <NEW_DBDIR> <FASTA_LIST> <accession2taxid> <OLD_DBDIR>
| Argument | Description |
|---|---|
| NEW_DBDIR | Directory where the updated database will be generated |
| FASTA_LIST | File listing paths to new FASTA files |
| accession2taxid | NCBI-style accession2taxid file |
| OLD_DBDIR | Directory of the existing database to update |
Add sequences of new taxa
Use the same --new-taxa approach described in the GTDB section above.
Options
| Option | Default | Description |
|---|---|---|
| --gtdb | 0 | Set to 1 for a GTDB-based update |
| --threads | all | Number of threads to use |
| --max-ram | 128 (GiB) | Maximum RAM usage in GiB |
| --accession-level | 0 | Set to 1 to include accession-level classification |
| --make-library | 0 (GTDB) / 1 (NCBI) | Enable for faster execution when many species share a single FASTA file |
| --new-taxa | - | TSV file listing new taxa to add |
| --cds-info | - | File listing absolute paths to CDS files |
| --validate-input | 0 | Set to 1 to validate FASTA file format |
| --validate-db | 0 | Set to 1 to validate database files |
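When updates are scripted, the command line can be assembled programmatically. The following is a minimal sketch, assuming the positional order shown in the updateDB command earlier on this page (new database directory, FASTA list, accession2taxid or taxid.map, old database directory) and using only option names from the table above. The function `build_update_command` and the example paths are hypothetical, and the sketch only builds the argument list; it does not run Metabuli.

```python
import shlex

# Option names taken from the Options table; values are passed through unchecked.
KNOWN_OPTIONS = {
    "gtdb", "threads", "max-ram", "accession-level", "make-library",
    "new-taxa", "cds-info", "validate-input", "validate-db",
}


def build_update_command(new_dbdir, fasta_list, accession2taxid, old_dbdir, **options):
    """Assemble an updateDB argument list.

    Keyword names use underscores (max_ram) and are converted to
    hyphenated flags (--max-ram).
    """
    argv = ["metabuli", "updateDB", new_dbdir, fasta_list, accession2taxid, old_dbdir]
    for name, value in options.items():
        flag = name.replace("_", "-")
        if flag not in KNOWN_OPTIONS:
            raise ValueError(f"unknown option: --{flag}")
        argv += [f"--{flag}", str(value)]
    return argv


# Hypothetical paths, for illustration only.
argv = build_update_command(
    "new_db", "fasta_list.txt", "taxdump/taxid.map", "old_db",
    gtdb=1, threads=16, max_ram=64,
)
print(shlex.join(argv))
```

The resulting list can be handed to `subprocess.run(argv, check=True)`. Rejecting unknown option names catches typos before a long-running job is launched.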