BASIC PROTOCOL 4: Pan-genome Copy Number Variant Calling

Abstract

This protocol describes the CNV module of MIDAS2, which takes as input metagenomic sequencing reads from a set of samples and generates files with CNV genotypes for each sample for all detected species. There are two steps for population CNV calling: (1) single-sample quantification of copy number for each gene in the pangenome of each species with the midas2 run_genes command and (2) population CNV calling with the midas2 merge_genes command. Basic Protocols 1 (Species) and 2 (MIDASDB) should be run before this protocol.

Steps

Perform species prescreening as described in Basic Protocol 1.

BASIC PROTOCOL 1: Species Prescreening

Download MIDASDB as described in Basic Protocol 2.

BASIC PROTOCOL 2: Download MIDAS Reference Database

Execute the run_genes command for each sample

Note

Conceptually, a typical invocation of the run_genes command proceeds by four steps:Select the list of species abundant enough for accurate metagenotyping based on the species profiling results and user-defined species selection criterion. Taking SRR172902 as an example, run_genes expect to find the species profiling results at midas2_output/SRR172902/species/species_profile.tsv.Compile the pangenomes for these species and build a sample-customized pan-genome bowtie2 index.Align reads to this index with bowtie2.For each gene in the pan-genome, normalize gene coverage by the mean coverage of all that species’ SCGs to estimate copy number per cell [11].Output a read mapping summary and CNV estimates for each species.

#midas2 run_genes 
for sample_name in SRR172902 SRR172903
do
    midas2 run_genes \
    --sample_name ${sample_name} \
    -1 reads/${sample_name}.fastq.gz \ 
    --midasdb_name uhgg --midasdb_dir midasdb_uhgg \
    --species_list 100122,100277 \
    --select_by median_marker_coverage,unique_fraction_covered \
    --select_threshold=0,0.6 \
    --num_cores 8 midas2_output
done

Note

The number of CPUs used is specified via --num_cores 8.

Prepare sample manifest file for merging purpose. We can use the same list_of_samples.tsv generated by step 6 in Basic Protocol 1.

BASIC PROTOCOL 1: Species Prescreening

Upon the completion of run_genes for all the samples listed in the list_of_samples.tsv, MIDAS2 merges the CNV profiles across samples with the merge_genes command.

midas2 merge_genes --samples_list list_of_samples.tsv \
--midasdb_name uhgg --midasdb_dir midasdb_uhgg \
--min_copy 0.5 \
--num_cores 2 midas2_output/merge

Population pangenome CNV analysis has finished successfully when all the following output files are created under the directory midas2_output/merge/genes/ without any error message

Note

genes_summary.tsv: merged single-sample CNV summary containing information such as mean_coverage.<species_id>/<species_id>.genes_copynum.tsv.lz4: gene-by-sample matrix of copy-number estimates<species_id>/<species_id>.genes_preabs q.tsv.lz4: gene-by-sample matrix of gene presence / absence<species_id>/<species_id>.genes_depth.tsv.lz4: gene-by-sample read coverage matrix<species_id>/<species_id>.genes_reads.tsv.lz4: gene-by-sample read counts matrix

Abstract

Steps

推荐阅读