BASIC PROTOCOL 1: Species Prescreening
miriam.goldman, chunyu.zhao
Abstract
Reference-based metagenotyping depends crucially on the choice and customization of the reference database. A typical MIDAS2 workflow therefore starts with a species prescreening step for each metagenome, which allows the reference database to be customized to match the species in the sample. This protocol describes the species selection step: estimating species coverage per sample, merging the single-sample profiling results, and generating a list of species confidently detected in at least one sample. MIDAS2 estimates species coverage per sample by aligning reads to a database of sequences of 15 universal, single-copy genes (SCGs) and using the median (or mean) coverage of each species’ SCGs.
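To make the median summary concrete, the sketch below computes the median of 15 per-gene coverage values for a single species. It is only an illustration of the statistic, not MIDAS2's implementation: the helper name median_coverage and the coverage values are made up for this example.

```shell
# Hypothetical helper (not part of MIDAS2): median of the numbers passed as arguments.
median_coverage() {
  [ "$#" -gt 0 ] || return 1
  printf '%s\n' "$@" | LC_ALL=C sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR % 2) { print v[(NR + 1) / 2] }               # odd count: middle value
      else        { print (v[NR / 2] + v[NR / 2 + 1]) / 2 } # even count: mean of the two middle values
    }'
}

# Made-up coverages for the 15 SCGs of one species (illustrative values only).
median_coverage 0 0 1.2 2.5 2.8 3.0 3.1 3.3 3.4 3.6 3.9 4.0 4.2 5.5 9.7
# prints 3.3
```

In this toy example the two genes with zero coverage and the one gene at 9.7 barely affect the result, which is one reason a median can be preferable to a mean when a few marker genes are missed or over-recruited.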
Steps
Install MIDAS2 (see Support Protocol 1).
Create a work folder containing the FASTQ files (here, example input files are downloaded from Zenodo).
Initialize a local copy of a MIDAS Reference Database (MIDASDB). Here the SCG data from the UHGG MIDASDB is downloaded:
midas2 database --init --midasdb_name uhgg \
  --midasdb_dir midasdb_uhgg
Run the single-sample species analysis to identify confidently detectable (i.e., relatively abundant) species in each sample, looping through the samples. The output for each sample is created automatically under midas2_output/SRR172902/species and midas2_output/SRR172903/species.
for sample_name in SRR172902 SRR172903
do
  midas2 run_species --sample_name "${sample_name}" \
    -1 "reads/${sample_name}.fastq.gz" \
    --midasdb_name uhgg --midasdb_dir midasdb_uhgg \
    --num_cores 4 midas2_output
done
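Once the loop has finished, the per-sample results can be inspected from the command line. The sketch below lists the species whose median_marker_coverage exceeds a chosen threshold. It assumes that each species directory contains a tab-separated table with a header line and columns named species_id and median_marker_coverage. The file name species_profile.tsv used in the commented example, the helper name detected_species, and the threshold of 2 are assumptions for illustration and should be checked against the actual output.

```shell
# Hypothetical helper (not part of MIDAS2): print species_id for rows whose
# median_marker_coverage is greater than MIN_COV.
# Usage: detected_species PROFILE_TSV MIN_COV
detected_species() {
  awk -F '\t' -v min_cov="$2" '
    NR == 1 {
      # Locate columns by name so the helper does not depend on column order.
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("species_id" in col) || !("median_marker_coverage" in col)) {
        print "species_id/median_marker_coverage columns not found" > "/dev/stderr"
        exit 1
      }
      next
    }
    $(col["median_marker_coverage"]) + 0 > min_cov + 0 { print $(col["species_id"]) }
  ' "$1"
}

# Example call (file name and threshold are assumptions):
# detected_species midas2_output/SRR172902/species/species_profile.tsv 2
```

Looking the columns up by header name means the helper fails loudly, rather than silently filtering on the wrong column, if the table layout differs from what is assumed here.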
Prepare the sample manifest file, a tab-separated table with the columns sample_name and midas_outdir. It is used to merge results across samples, both here and in the SNV and CNV modules. The following commands generate the manifest for SRR172902 and SRR172903.
printf 'sample_name\tmidas_outdir\n' > list_of_samples.tsv
ls reads | awk -F '.' -v OFS='\t' '{print $1, "midas2_output"}' >> list_of_samples.tsv
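Because the manifest is derived from file names, it is worth checking before merging. For example, a reads folder with several files per sample would yield duplicate sample names. The sketch below checks for the expected header, exactly two non-empty tab-separated fields per row, and unique sample names. The helper name check_manifest is made up for this example, and the rules it enforces are a reasonable guess at what a well-formed manifest looks like rather than MIDAS2's own validation.

```shell
# Hypothetical helper (not part of MIDAS2): sanity-check a sample manifest.
# Usage: check_manifest list_of_samples.tsv   (exit status 0 means the checks passed)
check_manifest() {
  awk -F '\t' '
    NR == 1 {
      if ($1 != "sample_name" || $2 != "midas_outdir") {
        print "unexpected header: " $0 > "/dev/stderr"; bad = 1
      }
      next
    }
    NF != 2 || $1 == "" || $2 == "" {
      print "line " NR ": expected two non-empty fields" > "/dev/stderr"; bad = 1
    }
    seen[$1]++ {
      print "line " NR ": duplicate sample_name " $1 > "/dev/stderr"; bad = 1
    }
    END {
      if (NR < 2) { print "no samples listed" > "/dev/stderr"; bad = 1 }
      exit (bad ? 1 : 0)
    }
  ' "$1"
}

# Example call:
# check_manifest list_of_samples.tsv && echo "manifest looks OK"
```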
Merge the species profiling results for the samples listed in list_of_samples.tsv. The --min_cov flag sets the minimum median_marker_coverage a species must reach in a sample to be counted toward its prevalence. The output files are created automatically under the directory midas2_output/merge/species.
midas2 merge_species --samples_list list_of_samples.tsv --min_cov 0.01 midas2_output/merge
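The merged results can then be reduced to the list of species detected in at least one sample. The sketch below assumes that the merge step writes a tab-separated prevalence table with a header line and columns named species_id and sample_counts. That column layout, the file name species_prevalence.tsv in the commented example, and the helper name species_in_min_samples are assumptions for illustration and should be verified against the files in midas2_output/merge/species.

```shell
# Hypothetical helper (not part of MIDAS2): list species present in at least
# MIN_SAMPLES samples, based on an assumed "sample_counts" column.
# Usage: species_in_min_samples PREVALENCE_TSV MIN_SAMPLES
species_in_min_samples() {
  awk -F '\t' -v min_n="$2" '
    NR == 1 {
      for (i = 1; i <= NF; i++) col[$i] = i
      if (!("species_id" in col) || !("sample_counts" in col)) {
        print "species_id/sample_counts columns not found" > "/dev/stderr"
        exit 1
      }
      next
    }
    $(col["sample_counts"]) + 0 >= min_n + 0 { print $(col["species_id"]) }
  ' "$1"
}

# Example call (file name is an assumption):
# species_in_min_samples midas2_output/merge/species/species_prevalence.tsv 1 > species_list.tsv
```

Note that whether a species counts as present in a sample depends on the --min_cov value passed to the merge step, so a low setting such as 0.01 gives a permissive list.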