SARS-CoV-2 consensus genome reconstruction, quality control, and lineage analysis
Benjamin Schwessinger
Abstract
This protocol is part of the ANU Biosecurity mini-research project #2, "An SARS-COV2 incursion scenario: Genomics, phylogenetics, and incursions." This mini-research project is modeled on the yearly Quality Assurance Program of The Royal College of Pathologists of Australasia (RCPAQAP), which we take part in together with ACT Pathology.
This research project is split into two major parts, identical to how the official RCPAQAP is run every year.
Part #1 focuses on the 'wet lab': sequencing SARS-CoV-2 from real-world RNA samples provided by ACT Pathology especially for our ANU biosecurity course (Thank YOU!). Here you will amplify and sequence five (5) RNA samples per research group. You will assess the SARS-CoV-2 genome sequences for their lineage assignments using online programs, put the sequences into a global context, estimate the collection date based on genetic information, and describe mutations in the spike protein.
Part #2 focuses on the 'dry lab': investigating a hypothetical incursion scenario in the fictional city of Fantastica. You will combine genomic surveillance of SARS-CoV-2 with case interview data to trace the spread of SARS-CoV-2 in the community and into high-risk settings. We will provide you with real, publicly available SARS-CoV-2 genomes and fictional case interviews. You will put these two together to trace the spread and suggest potential improvements to containment strategies, with a focus on high-risk settings.
This protocol describes the analysis component of Part #1. The metrics you are supposed to report for each of your samples are mostly borrowed from the official SARS-CoV-2 QAP. Don't worry if not all of these mean something to you at the moment, as we will explain them again during the prac. In case all/most of your samples have < 50% genome coverage, please also include the analysis of MakeUp for points 1 to 7 and TimeMakeUp for point 8. You can access the MakeUp data here (ANU only). Make sure to read the README file so you understand what each item relates to.
The metrics you have to report for each of your samples (or the MakeUps) include the following.
- Consensus genome coverage.
- Average read depth; You might want to include detailed read depth plots here as well.
- Pangolin Lineage.
- NextClade Lineage.
- Base pair differences relative to the original SARS-CoV-2 genome.
- Amino acid replacements and deletions in the S (spike) protein sequence.
- Evaluation of whether your and/or the MakeUp samples would meet the QC cut-off of 90% genome coverage, plus any other metrics you deem important for QC. Would you 'flag' any of your samples as standing out, e.g. as a negative control?
- Approximate sampling date (month and year) of your sequences.
You must report the versions of all tools used in your report and the day the analysis was performed. This is extremely important for reporting, as lineage naming and related conventions change VERY frequently during a pandemic or any outbreak.
This protocol is applicable for week 9.
The following links might be useful for your report:
The original publication that describes the sequencing protocol is here: https://academic.oup.com/biomethods/article/5/1/bpaa014/5873518?login=true
Original sources describing the consensus reconstruction from raw reads are here: https://artic.network/ncov-2019, https://artic.network/ncov-2019/ncov2019-bioinformatics-sop.html, https://labs.epi2me.io/, https://www.nature.com/articles/s41467-020-20075-6
Other websites and resources used in the protocol: https://igv.org/, https://clades.nextstrain.org/, https://pangolin.cog-uk.io/, https://genome.ucsc.edu/cgi-bin/hgPhyloPlace
Before start
You must study the protocol carefully before you start. If anything is unclear post questions directly here on protocols.io.
Steps
Section I: Consensus genome reconstruction using Epi2Me and the 'wf-artic'.
This section aims to reconstruct each sample's consensus genome based on the publication here https://academic.oup.com/biomethods/article/5/1/bpaa014/5873518?login=true and an adaptation of the ARTIC SARS-CoV-2 bioinformatics workflow https://artic.network/ncov-2019
The principal steps are:
- Read filtering on quality and length.
- Aligning reads to the reference.
- Normalizing to an average per-base read depth of 200x.
- Trimming of primer sequences.
- Variant calling for both pools independently.
- Combining variant calls of both pools.
- Generation of the sample's consensus genome.
- Generation of QC statistics like genome coverage, average read depth, etc.
- Additional analyses like lineage calling, variant quantification, variant impact, etc.
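For context, the steps above can be sketched as the roughly equivalent ARTIC command-line workflow. This is illustrative only (Epi2Me wraps these steps for you), and the barcode, length cut-offs, scheme version, and output names below are example values, not this run's settings:

```shell
# Illustrative sketch -- Epi2Me runs these steps for you.
# Barcode, lengths, scheme, and sample names are placeholders.

# Step 1: filter reads by length (quality filtering already happened
# via the fastq_pass folder):
artic guppyplex --min-length 400 --max-length 700 \
    --directory fastq_pass/barcode01 --output barcode01.fastq

# Steps 2-7: align to the reference, normalise to ~200x, trim primers,
# call variants per primer pool, combine the calls, and build the consensus:
artic minion --medaka --normalise 200 --threads 4 \
    --read-file barcode01.fastq \
    --scheme-directory primer-schemes nCoV-2019/V3 sample01
```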
We will run all of this within Epi2Me up to step 8. The input for the Epi2Me analysis will be the "fastq_pass" folder of the MinKNOW basecalling output. This folder contains one "barcodeXX" folder for each barcode identified by the basecalling software.
Overall, the Epi2Me "wf-artic" workflow takes two inputs.
- The basecalled fastq output of MinKNOW (or equivalent e.g. guppy).
- A sample sheet in CSV (comma-separated values) format with the columns "barcode", "alias", and "type".
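A minimal sample sheet might look like the sketch below. The barcode numbers, aliases, and "type" values are examples only; use your own group's barcodes and sample names:

```shell
# Write a minimal wf-artic sample sheet. The barcode/alias/type values are
# placeholders -- substitute your own group's barcodes and sample names.
cat > sample_sheet.csv <<'EOF'
barcode,alias,type
barcode01,sample_A,test_sample
barcode02,sample_B,test_sample
barcode03,NTC,negative_control
EOF
```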
You might also want to execute this section on the MakeUp dataset if your data was of poor quality.
Your basecalled fastq data, demultiplexed by barcode, should be in the 'fastq_pass' folder, in a path similar to this: C:\data\20230426-SARS-COV2-MakeUpData\no_sample\20230426_1208_MN20227_AOI905_2cab113b\fastq_pass.

Before we start with the workflow we need to make a backup (e.g. a copy of the 'fastq_pass' folder on the desktop for now) and then delete all the barcode folders we have not specified in the sample sheet. We also need to delete all folders of barcodes we have used but for which we have not received any data.
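If you prefer the command line for this clean-up, a small sketch follows. The helper name `keep_only` and the folder names are illustrative; run it only on your backup copy, never on the original run folder:

```shell
# keep_only DIR barcodeXX [barcodeYY ...]
# Deletes every barcode* subfolder of DIR that is NOT listed as an argument.
# Run it on a backup copy of fastq_pass, never on the original run folder.
keep_only() {
    run_dir=$1; shift
    for dir in "$run_dir"/barcode*/; do
        [ -d "$dir" ] || continue
        name=$(basename "$dir")
        keep=no
        for wanted in "$@"; do
            [ "$name" = "$wanted" ] && keep=yes
        done
        [ "$keep" = yes ] || rm -r "$dir"
    done
}

# Example: keep_only fastq_pass_backup barcode01 barcode02 barcode03
```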

Now we are ready to set-up and run the Epi2Me workflow for artic analysis.
We will use the "all_consensus.fasta" for most downstream analyses. It contains the consensus fasta sequences of your genomes.
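As a quick sanity check before the web tools, you can estimate genome coverage per sample directly from "all_consensus.fasta". This is a sketch assuming the standard 29,903 bp Wuhan-Hu-1 reference length, with N bases counted as missing; the helper name `genome_coverage` is illustrative:

```shell
# Percent of the 29,903 bp reference covered by non-N bases, per FASTA record.
genome_coverage() {
    awk 'function report(   len, ns) {
             len = length(seq)
             ns = gsub(/[Nn]/, "N", seq)   # gsub() returns the number of Ns
             printf "%s\t%.1f%%\n", name, 100 * (len - ns) / 29903
         }
         /^>/ { if (name) report(); name = substr($1, 2); seq = ""; next }
         { seq = seq $0 }
         END { if (name) report() }' "$1"
}

# Example: genome_coverage all_consensus.fasta
```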
Section II: Visualization of the whole-genome alignment and principal steps 2-7 of Section I
You will use IGV to look at one of your read sets mapped against the reference genome to get a better understanding of the "raw" data.
Select "Genomes" > "Load Genome from File..." and add the "MN908947_3.fasta" reference shared here (ANU only).

<img src="https://static.yanyin.tech/literature_test/protocol_io_true/protocols.io.14egn2kp6g5d/mru3b5nw27.jpg" alt='The file can end in ".txt" or better ".fasta"' loading="lazy" title='The file can end in ".txt" or better ".fasta"'/>
Load one of the "XXX.primertrimmed.rg.sorted.bam" files from the "output" folder of Epi2Me here (ANU only) or from your own dataset.
Now you can see the alignments of all reads as a pileup against the reference, along with the read coverage and variants.

You can zoom into certain regions, e.g. the S gene (positions 21,563-25,384), to see how the reads differ from the reference via the differently colored bars in the grey area labelled "variants" in the picture above. There is loads more to explore. Ask if you have any questions.
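To put a number on what IGV shows, you can average per-base depths over a region. The sketch below assumes you have first exported depths with samtools (the BAM name is a placeholder from your own output folder), then averages column 3; the helper name `mean_depth` is illustrative:

```shell
# Average per-base depth from "samtools depth" output (chrom<TAB>pos<TAB>depth).
# Generate the input first, e.g. (BAM and region names are placeholders):
#   samtools depth -a -r MN908947.3:21563-25384 \
#       sample.primertrimmed.rg.sorted.bam > s_gene.depth
mean_depth() {
    awk '{ sum += $3; n++ } END { if (n) printf "%.1f\n", sum / n; else print 0 }' "$1"
}

# Example: mean_depth s_gene.depth
```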
Section III: Nextclade for lineage calling, QC measures, amino acid substitution identification and so much more
We will use Nextclade https://clades.nextstrain.org/ to get some quality assessment metrics, lineage calling, number of mutations, and much more.

Open the webpage https://clades.nextstrain.org/, select SARS-CoV-2 and drag and drop your consensus sequences into it.



Explore the results page to get to the following:
- Pango lineage assignment.
- Genome coverage.
- Number of mutations relative to reference.
- Number of Ns.
- Gaps.
- Changes to the amino acid sequence of the S protein.
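Nextclade also lets you export its results table (e.g. as nextclade.tsv), which is handy for your report. The sketch below pulls a few columns by header name; the column names ("seqName", "Nextclade_pango", "qc.overallStatus") are assumptions based on recent Nextclade exports, so check them against the first line of your own file:

```shell
# Print selected Nextclade columns by header name (tab-separated input).
# Column names are assumptions from recent Nextclade exports -- verify
# against your file's header line before relying on them.
nextclade_summary() {
    awk -F'\t' -v OFS='\t' '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i }
        { print $col["seqName"], $col["Nextclade_pango"], $col["qc.overallStatus"] }
    ' "$1"
}

# Example: nextclade_summary nextclade.tsv
```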
Section IV: Pangolin lineage assignment
Let's assign the latest lineage designation with Pangolin.
Navigate to https://pangolin.cog-uk.io/.

Drag and drop your sequences, start analysis, and hope for the best.
Section V: Use UShER to place your samples into a global context and see their closest public neighbors.
Navigate to https://genome.ucsc.edu/cgi-bin/hgPhyloPlace.

Ask questions during the prac.