BASIC PROTOCOL 2: Download MIDAS Reference Database
miriam.goldman, chunyu.zhao
Abstract
This protocol describes how to download all or part of a MIDASDB, a set of custom files constructed from microbial genome sequences and containing all the information needed to metagenotype the species detected in a set of shotgun-metagenomic samples. MIDAS2 provides two prebuilt MIDASDBs sourced from large, public microbial genome collections: MIDASDB-UHGG (4,644 species / 286,997 genomes) based on the Unified Human Gastrointestinal Genome catalog (v1) [9] and MIDASDB-GTDB (47,893 species / 258,405 genomes) based on the Genome Taxonomy Database (v202) [10]. Support Protocol 2 describes how to build a new MIDASDB locally from a custom genome collection. A MIDASDB should be downloaded or built before any other MIDAS2 protocols can be run.
A MIDASDB has three components: single-copy marker genes (SCGs), representative genomes (rep-genome), and pangenomes (pan-genome). Each species contributes sequences to all three components. When the MIDASDB is preloaded, individual MIDAS2 commands do not need to download the necessary files automatically. As a result, per-sample analyses can be run in parallel without the risk of processes interfering with one another.
Steps
Initialize a local copy of MIDASDB-UHGG
midas2 database --init --midasdb_name uhgg --midasdb_dir midasdb_uhgg
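A quick check of the target directory after initialization can catch a failed or interrupted run. The sketch below assumes only that --init writes at least one file into the directory given to --midasdb_dir. The helper name midasdb_dir_populated is hypothetical and is not part of MIDAS2.

```shell
# Hypothetical helper (not part of MIDAS2): succeed only if the given
# directory exists and contains at least one entry.
midasdb_dir_populated() {
    dir="$1"
    [ -d "$dir" ] || return 1
    # ls -A lists every entry except . and ..; empty output means an empty directory.
    [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}
```

For example, running midasdb_dir_populated midasdb_uhgg right after the initialization command would return a non-zero status if the directory is missing or empty.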
Customize the MIDASDB download. In Basic Protocol 1, 22 species were present in at least one sample (list_of_species.tsv). The database components (both rep-genome and pan-genome) will now be downloaded for only these 22 species.
midas2 database --download --midasdb_name uhgg --midasdb_dir midasdb_uhgg --species_list list_of_species.tsv
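Before starting a long download, it can help to confirm that the species list contains the expected number of entries. The sketch below assumes that list_of_species.tsv holds one species per line with no header row. The helper name count_species is hypothetical and is not part of MIDAS2.

```shell
# Hypothetical helper (not part of MIDAS2): count the non-blank lines in a file.
# Assumes one species per line and no header row.
count_species() {
    # grep -c prints the number of lines containing at least one
    # non-whitespace character.
    grep -c '[^[:space:]]' "$1"
}
```

Under that assumption, count_species list_of_species.tsv would be expected to print 22 for the list produced in Basic Protocol 1.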
The download has completed successfully when the command midas2 database --download finishes and no error is reported.
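The success criterion above can be checked in a script rather than by eye. The sketch below assumes that midas2 exits with a non-zero status when it reports an error, which is the usual command-line convention but is not confirmed here. The wrapper name run_and_report is hypothetical.

```shell
# Hypothetical wrapper (not part of MIDAS2): run a command, report whether it
# exited with status 0, and pass that status back to the caller.
run_and_report() {
    if "$@"; then
        echo "OK: command finished without error" >&2
        return 0
    else
        # In the else branch, $? still holds the exit status of the command.
        status=$?
        echo "FAILED with exit status $status: $*" >&2
        return "$status"
    fi
}
```

The download command from the previous step could then be passed as arguments, for example run_and_report midas2 database --download --midasdb_name uhgg --midasdb_dir midasdb_uhgg --species_list list_of_species.tsv.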