classify

This command classifies metagenomic reads by comparing them to a reference database.

Note

  • Set --max-ram according to your system's available RAM.
  • Run adapter trimming and quality filtering before classification.
    • fastp (short reads) and fastplong (long reads) are integrated into the Metabuli App.

Usage

metabuli classify <i:FASTA/Q> <i:DBDIR> <o:OUTDIR> <Job ID> [options]
Argument  Description
FASTA/Q   Input (gzipped) FASTA/Q file(s). Provide two files for paired-end samples.
DBDIR     Reference database directory.
OUTDIR    Directory to write output files.
Job ID    Prefix for output file names.

Examples

# Paired-end short reads (two input files)
metabuli classify read_1.fna read_2.fna DBDIR OUTDIR JOB_ID
# Single-end short reads
metabuli classify --seq-mode 1 read.fna DBDIR OUTDIR JOB_ID
# Long reads
metabuli classify --seq-mode 3 read.fna DBDIR OUTDIR JOB_ID

Important Options

Option          Default  Description
--precise       0        Use presets for precise mode. 1: short-read, 2: HiFi long-read.
-e              1.0      Ignore matches with an E-value larger than this value. Set to 0 to disable the filter.
--max-ram       128      Maximum RAM usage in GiB.
--threads       all      Number of threads to use.
--min-score     0        Minimum score to classify a read.
--min-sp-score  0        Minimum score to classify at or below species rank.
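
When classification is scripted over many samples, the command line can be assembled programmatically. The following is a minimal Python sketch, not part of Metabuli: the helper name build_classify_cmd and its parameters are hypothetical, and it only uses the positional arguments and the --seq-mode, --max-ram, and --threads options shown on this page. It places --seq-mode before the read files, as in the examples above, and the remaining options after the positional arguments, as in the usage line.

```python
import shlex


def build_classify_cmd(reads, db_dir, out_dir, job_id,
                       seq_mode=None, max_ram=None, threads=None):
    """Return an argument list for `metabuli classify` (hypothetical helper)."""
    if not 1 <= len(reads) <= 2:
        raise ValueError("provide one read file, or two for paired-end samples")
    cmd = ["metabuli", "classify"]
    if seq_mode is not None:
        cmd += ["--seq-mode", str(seq_mode)]
    cmd += list(reads) + [db_dir, out_dir, job_id]
    if max_ram is not None:
        cmd += ["--max-ram", str(max_ram)]   # GiB
    if threads is not None:
        cmd += ["--threads", str(threads)]
    return cmd


# Paired-end sample, capped at 32 GiB of RAM and 8 threads (illustrative values).
cmd = build_classify_cmd(["read_1.fna", "read_2.fna"], "DBDIR", "OUTDIR", "JOB_ID",
                         max_ram=32, threads=8)
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) once Metabuli is installed and the paths are real.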

Other Options

Option             Default  Description
--validate-input   0        Set to 1 to validate the query file format.
--validate-db      0        Set to 1 to validate database files.
--lineage          0        Set to 1 to print the full lineage.
--priority-taxid*  -        Comma-separated list of tax IDs. In case of a tie, favors these taxa and their child taxa instead of the LCA.
--syncmer*         0        Set to 1 to use syncmers instead of all k-mers.
--smer-len         5        s-mer length used for syncmer selection. Compression factor = (k-s+1)/2.

Tip

  • Specifying --priority-taxid for virus clades can help in detecting viruses. Virus sequences often match both the virus genome and host genomes that contain integrated viral sequences, which leads to a tie and a classification to the LCA of the virus and the host. Prioritizing the virus taxID mitigates this. Please refer to this issue.
  • --syncmer and --smer-len can be used for faster classification even when the database was built without syncmers.
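
As a worked example of the compression factor formula given for --smer-len, the short Python sketch below evaluates (k-s+1)/2 for the default s = 5. The value k = 8 is an illustrative assumption made here and is not stated on this page; the function name is hypothetical.

```python
def syncmer_compression_factor(k, s):
    """Compression factor (k - s + 1) / 2 for k-mer length k and s-mer length s."""
    if not 0 < s <= k:
        raise ValueError("s must satisfy 0 < s <= k")
    return (k - s + 1) / 2


# k = 8 is an assumed, illustrative k-mer length; s = 5 is the --smer-len default.
print(syncmer_compression_factor(8, 5))  # 2.0
```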

Output Files

classify writes three output files to OUTDIR, each prefixed with the job ID:

File                        Description
JOB_ID_classifications.tsv  Read-by-read classification results
JOB_ID_report.tsv           Summary report in Kraken2 format
JOB_ID_krona.html           Interactive Krona taxonomy chart

Tip

You can open JOB_ID_report.tsv in the Metabuli App to explore a Sankey plot.

Output File Formats

1. JOB_ID_classifications.tsv

Column  Name               Description
1       is_classified      1 if classified, 0 if not
2       name               Read ID
3       taxID              Taxonomy ID in the taxonomy dump files used for database creation
4       query_length       Effective read length
5       score              DNA-level identity score
6       e_value            E-value of observed amino acid matches (-1: not supported)
7       rank               Taxonomic rank of the assigned taxon
8       taxID:match_count  List of taxID:k-mer_match_count pairs

Example

#is_classified  name    taxID   query_length    score      e_value      rank           taxID:match_count
1               read_1  2688    294             0.627551   4.45084e-36  subspecies     2688:65
1               read_2  2688    294             0.816327   0            subspecies     2688:78
0               read_3  0       294             0          -            no rank
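
For downstream filtering, the classification table can be loaded with a few lines of Python. This is a minimal sketch under stated assumptions rather than an official parser: it assumes the file is tab-separated with a header line starting with #, that the last column may be empty for unclassified reads, and that multiple taxID:count pairs are separated by whitespace. The function name read_classifications and the 0.7 score cutoff are hypothetical choices for illustration.

```python
import csv
import io

COLUMNS = ["is_classified", "name", "taxID", "query_length",
           "score", "e_value", "rank", "matches"]


def read_classifications(handle):
    """Parse rows of a JOB_ID_classifications.tsv-style table into dicts."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and the header
        fields += [""] * (len(COLUMNS) - len(fields))  # pad a missing last column
        row = dict(zip(COLUMNS, fields))
        row["is_classified"] = row["is_classified"] == "1"
        row["query_length"] = int(row["query_length"])
        row["score"] = float(row["score"])
        # e_value stays a string because unclassified reads show "-".
        # Assumption: taxID:count pairs are whitespace-separated.
        row["matches"] = {tax_id: int(count) for tax_id, count in
                          (pair.rsplit(":", 1) for pair in row["matches"].split())}
        rows.append(row)
    return rows


# The three example rows above, written with tab separators.
sample = (
    "#is_classified\tname\ttaxID\tquery_length\tscore\te_value\trank\ttaxID:match_count\n"
    "1\tread_1\t2688\t294\t0.627551\t4.45084e-36\tsubspecies\t2688:65\n"
    "1\tread_2\t2688\t294\t0.816327\t0\tsubspecies\t2688:78\n"
    "0\tread_3\t0\t294\t0\t-\tno rank\t\n"
)
rows = read_classifications(io.StringIO(sample))
confident = [r["name"] for r in rows if r["is_classified"] and r["score"] >= 0.7]
print(confident)  # ['read_2']
```

For a real run, replace io.StringIO(sample) with open("OUTDIR/JOB_ID_classifications.tsv", newline="").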

2. JOB_ID_report.tsv

Column  Name              Description
1       clade_proportion  Percentage of reads classified to the clade rooted at this taxon
2       clade_count       Number of reads classified to the clade rooted at this taxon
3       taxon_count       Number of reads classified directly to this taxon
4       rank              Taxonomic rank
5       taxID             Taxonomy ID
6       name              Taxonomic name

Example

#clade_proportion  clade_count  taxon_count  rank          taxID   name
33.73              77571        77571         no rank       0       unclassified
66.27              152429       132           no rank       1       root
64.05              147319       2021          superkingdom  8034    d__Bacteria
22.22              51102        3             phylum        22784   p__Firmicutes
22.07              50752        361           class         22785   c__Bacilli
17.12              39382        57            order         123658  o__Bacillales
15.81              36359        3             family        126766  f__Bacillaceae
15.79              36312        26613         genus         126767  g__Bacillus
2.47               5677         4115          species       170517  s__Bacillus amyloliquefaciens
0.38               883          883           subspecies    170531  RS_GCF_001705195.1
0.16               360          360           subspecies    170523  RS_GCF_003868675.1
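
The report can be summarized in the same way. The Python sketch below is a hedged illustration, not an official reader: it assumes a tab-separated file whose header line starts with #, and it strips leading whitespace from names in case they are indented by depth, as in Kraken2-style reports. The function name read_report is hypothetical. It uses four rows from the example above to recover the total read count (unclassified plus root) and to list species-level clades.

```python
import csv
import io


def read_report(handle):
    """Parse rows of a JOB_ID_report.tsv-style table into dicts."""
    taxa = []
    for fields in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not fields or fields[0].startswith("#"):
            continue  # skip blank lines and the header
        proportion, clade_count, taxon_count, rank, tax_id, name = fields[:6]
        taxa.append({
            "clade_proportion": float(proportion),
            "clade_count": int(clade_count),
            "taxon_count": int(taxon_count),
            "rank": rank.strip(),
            "taxID": tax_id.strip(),
            "name": name.strip(),  # drop any depth indentation
        })
    return taxa


# A subset of the example rows above, written with tab separators.
sample = (
    "#clade_proportion\tclade_count\ttaxon_count\trank\ttaxID\tname\n"
    "33.73\t77571\t77571\tno rank\t0\tunclassified\n"
    "66.27\t152429\t132\tno rank\t1\troot\n"
    "15.79\t36312\t26613\tgenus\t126767\tg__Bacillus\n"
    "2.47\t5677\t4115\tspecies\t170517\ts__Bacillus amyloliquefaciens\n"
)
taxa = read_report(io.StringIO(sample))

# Total reads = unclassified (taxID 0) + everything under root (taxID 1).
total_reads = sum(t["clade_count"] for t in taxa if t["taxID"] in ("0", "1"))
species = [(t["name"], t["clade_count"]) for t in taxa if t["rank"] == "species"]
print(total_reads)  # 230000
print(species)      # [('s__Bacillus amyloliquefaciens', 5677)]
```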

3. JOB_ID_krona.html
