Evolution of Viral Genomes: Interplay Between Selection, Recombination, and Other Forces

Stephanie J. Spielman, Steven Weaver, Stephen D. Shank, Brittany Rife Magalis, Michael Li, Sergei L. Kosakovsky Pond

Published: 2021-09-03 DOI: 10.17504/protocols.io.brcwm2xe

Abstract

Natural selection is a fundamental force shaping organismal evolution, as it both maintains function and enables adaptation and innovation. Viruses, with their typically short and largely coding genomes, experience strong and diverse selective forces, sometimes acting on timescales that can be directly measured. These selection pressures emerge from an antagonistic interplay between rapidly changing fitness requirements (immune and antiviral responses from hosts, transmission between hosts, or colonization of new host species) and functional imperatives (the ability to infect hosts or host cells and replicate within hosts). Indeed, computational methods to quantify these evolutionary forces using molecular sequence data were initially, dating back to the 1980s, applied to the study of viral pathogens. This preference largely emerged because the strong selective forces are easiest to detect in viruses, and, of course, viruses have clear biomedical relevance. Recent commoditization of affordable high-throughput sequencing has made it possible to generate truly massive genomic data sets, on which powerful and accurate methods can yield a very detailed depiction of when, where, and (sometimes) how viral pathogens respond to various selective forces.

Here, we present recent statistical developments and state-of-the-art methods to identify and characterize these selection pressures from protein-coding sequence alignments and phylogenies. Methods described here can reveal critical information about various evolutionary regimes, including whole-gene selection, lineage-specific selection, and site-specific selection acting upon viral genomes, while accounting for confounding biological processes, such as recombination and variation in mutation rates.

Steps

3.1 How to Run a Selection Analysis

Note

There is a uniform workflow to run any of the described methods, either locally (on one’s own computer and/or a high-performance computing environment) in HyPhy or using the Datamonkey web-service, available at www.datamonkey.org. The version of HyPhy that supports all of the analyses is a command-line program, i.e., it must be run from a terminal prompt (similar to most other bioinformatics packages) in Linux or Mac OS X. It is also possible to run the program in Windows, with an appropriate POSIX emulation environment (e.g., MinGW) installed.

To execute a selection analysis locally, the following steps will need to be taken:

1.1.

Prepare your coding sequence alignment. In general, any duplicate sequences should be removed before analysis. Most importantly, it is imperative that the sequence alignment be in the correct reading frame, meaning that alignment must be performed with codon structure in mind. A common approach to ensure this criterion is met is to generate the alignment using translated amino-acid data and then back-translate to the original nucleotide sequences.
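The back-translation approach described above can be sketched as follows. This is a minimal illustration, not part of HyPhy: the function name `back_translate` is hypothetical, gaps are assumed to be written as `-`, and the coding sequence is assumed to be in frame with any terminal stop codon already removed.

```python
def back_translate(protein_aln: str, cds: str) -> str:
    """Thread an unaligned coding sequence onto its aligned protein sequence.

    Every residue in the protein alignment is replaced by the codon that
    encodes it, and every gap ('-') by a three-nucleotide gap, so the
    resulting nucleotide alignment stays in the correct reading frame.
    """
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length is not a multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    residues = sum(1 for aa in protein_aln if aa != "-")
    if residues != len(codons):
        raise ValueError("protein and coding sequence lengths disagree")

    out, next_codon = [], 0
    for aa in protein_aln:
        if aa == "-":
            out.append("---")           # gap column spans one whole codon
        else:
            out.append(codons[next_codon])
            next_codon += 1
    return "".join(out)


print(back_translate("M-KV", "ATGAAAGTT"))  # ATG---AAAGTT
```

Applying the same function to every sequence in the protein alignment yields a codon alignment whose length is a multiple of three.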

1.2.

Prepare a phylogenetic tree from the multiple sequence alignment. Note that certain analyses may require a labeled phylogenetic tree, as indicated within each subsequent tutorial. Keep in mind that for most selection analyses, a tree topology is a nuisance parameter. Hence, while it is advisable to use good practices when inferring trees, minor errors in tree inference tend to have minor effects on gene- and site-level inference. A notable exception occurs when lineage-specific selection is investigated; in this case, ensuring high-quality tree topologies is important.

1.3.

An essential and strongly recommended step before analyzing data for selection is to screen sequences for recombination. If recombinant sequences are naively analyzed without an appropriate phylogenetic correction, inference results are likely to be biased (Posada et al. [33]) (see the section on Screening sequences for recombination later in this chapter).

1.4.

Prepare your data (alignment and phylogeny) for input to HyPhy. There are three ways to provide a dataset for HyPhy analysis, each of which will trigger a different analysis prompt at runtime:

Two separate files containing the alignment and phylogeny, respectively. In this circumstance, HyPhy issues two successive prompts: the first for the file containing the alignment, and the second for the file containing the tree.
A single file containing an alignment in one of the formats supported by HyPhy (FASTA, MEGA, and PHYLIP), with a Newick-formatted phylogeny included at the bottom of this file. In this circumstance, HyPhy issues two successive prompts: the first for the file containing the alignment, and the second asking whether to accept the tree found in the file (provide the affirmative response, e.g., “y,” to accept it).
A NEXUS file containing both the alignment and phylogeny. In this circumstance, HyPhy automatically accepts the provided phylogeny and therefore only issues a single prompt for the file containing the alignment. This is also the format that can be used to specify partitioned data, which is necessary to account for recombination.
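The second input option (one file holding the alignment followed by a Newick tree) can be assembled with a few lines of scripting. The sketch below assumes a FASTA alignment already held as text; the function name `combine_alignment_and_tree` and the toy sequences are hypothetical.

```python
def combine_alignment_and_tree(fasta_text: str, newick: str) -> str:
    """Return FASTA text with a Newick tree appended on its own final line."""
    tree = newick.strip()
    if not tree.endswith(";"):
        raise ValueError("a Newick string is expected to end with ';'")
    # Exactly one newline separates the last sequence from the tree.
    return fasta_text.rstrip("\n") + "\n" + tree + "\n"


fasta = ">HUM\nATGAAA\n>PAN\nATGAAG\n>GOR\nATGAAA\n"
print(combine_alignment_and_tree(fasta, "((HUM,PAN),GOR);"))
```

The returned text can be written to disk and supplied at the alignment prompt; the taxon names in the tree need to match the sequence names in the alignment.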

1.5.

Execute the appropriate method in HyPhy, selecting options suitable for the specific analysis.

Note

Each method will provide live on-the-screen progress updates and, when finished, a text summary of the analysis. The output is generated in Markdown (see footnote 2), which can either be read directly as text or formatted using one of many Markdown viewers.

When an analysis is finished, HyPhy will write a JSON file with numerous details about the analysis to disk. By convention, this file will be placed in the same directory as the input alignment file, with the method name and a .json extension appended, e.g., flu_ha.nex.BUSTED.json for an input alignment named flu_ha.nex analyzed by the method BUSTED. All results contained in this JSON file can be explored visually within a web browser using a web application from the hyphy-vision suite of tools, accessible at vision.hyphy.org. Since JSON files can be easily accessed by scripting and data-analysis languages, these are also well-suited for incorporation into pipelines (see footnote 3).

When run through www.datamonkey.org, this entire workflow is automated: one simply uploads an alignment, selects options for the analysis, and waits for the job to finish. Once the job has completed, the results will be displayed in an interactive application within the web browser. Note that Datamonkey will automatically remove duplicate sequences before executing any analysis.
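For pipeline use, the JSON file written by a local analysis can be located and loaded with a few lines of scripting. The sketch below relies only on the file-naming convention stated above; the helper names `results_path` and `load_results` are hypothetical, and no particular key names inside the JSON are assumed.

```python
import json
from pathlib import Path


def results_path(alignment_path: str, method: str) -> Path:
    """Build the conventional results path: <alignment>.<METHOD>.json."""
    return Path(f"{alignment_path}.{method}.json")


def load_results(alignment_path: str, method: str) -> dict:
    """Load the JSON results written next to the input alignment."""
    with results_path(alignment_path, method).open() as handle:
        return json.load(handle)


print(results_path("flu_ha.nex", "BUSTED").name)  # flu_ha.nex.BUSTED.json
```

Once an analysis has finished, `sorted(load_results("flu_ha.nex", "BUSTED"))` lists the top-level sections of the file, which is a convenient starting point for exploring its layout.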

3.2 BUSTED

Note

What Biological Question Is the Method Designed to Answer?

Is there evidence that some sites in the alignment have been subject to positive diversifying selection, either pervasive (throughout the evolutionary tree) or episodic (only on some lineages)? In other words, BUSTED asks whether a given gene has been subject to positive, diversifying selection at any site, at any time [26]. If a priori information about lineages of interest is available (e.g., due to migration, change in the environment, etc.), then BUSTED can be restricted to test for selection only on a subset of tree lineages, potentially boosting power.

Recommended Applications

Annotating a collection of alignments with a binary attribute: Has this alignment been subject to positive diversifying selection (yes/no)? [34].
Testing small- or low-divergence alignments (i.e., ≲ 10 sequences) for evidence of positive diversifying selection, where neither branch- nor site-level methods have sufficient power to detect weak, but present, signal.

Note

Statistical Test Procedure

Each (branch, site) pair evolves with one of three ω ratios, ω₁ ≤ ω₂ ≤ 1 or ω₃ ≥ 1, with the ratio chosen independently of other (branch, site) pairs with probability p₁, p₂, p₃ (normalized to sum to 1). The three-rate ω distribution is estimated jointly from the entire alignment, i.e., rates are shared by all (branch, site) combinations. Therefore, BUSTED is technically a “branch-site” model [16], although it is not intended to detect individual sites which drive signal of selection.

The test for episodic diversifying selection is performed by comparing the full model versus the nested null model, where ω₃ is constrained to 1. Statistical significance is obtained by the likelihood ratio test, assuming the $\chi^2_2$ asymptotic distribution of the likelihood ratio statistic under the null model.

When only some of the branches are chosen for testing, and the remainder are designated as the background, two independent three-rate ω distributions are fitted: one for the test branches, and one for the background branches. Testing for selection is carried out by constraining the distribution on the test branches as described above.
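The likelihood ratio test can be reproduced by hand from the two log-likelihoods that HyPhy reports. The sketch below is a minimal illustration that assumes a χ² reference distribution with two degrees of freedom; with the log-likelihoods printed in Listing 1, that assumption reproduces the reported p-value. The function names are hypothetical.

```python
import math


def lrt_statistic(loglik_alternative: float, loglik_null: float) -> float:
    """Twice the log-likelihood gain of the alternative over the null model."""
    return 2.0 * (loglik_alternative - loglik_null)


def chi2_sf_2df(x: float) -> float:
    """Upper tail of a chi-squared distribution with 2 degrees of freedom.

    For 2 degrees of freedom the tail has the closed form exp(-x / 2).
    """
    return 1.0 if x <= 0 else math.exp(-x / 2.0)


# Log-likelihoods of the full and constrained BUSTED fits from Listing 1.
lrt = lrt_statistic(-5319.67, -5326.18)
p_value = chi2_sf_2df(lrt)
print(f"LRT = {lrt:.2f}, p = {p_value:.4f}")  # LRT = 13.02, p = 0.0015
```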

Note

Example Analysis

To begin, we will perform a BUSTED analysis using a dataset of primate-specific KSR2, kinase suppressor of RAS2, genes from Enard et al. [5]. This gene has been implicated as a so-called ‘virus-interacting protein,’ and previous work has suggested it has experienced adaptation in mammalian lineages due to selective pressures exerted by viruses [5]. We will test all lineages for positive selection (rather than specifying a subset of “test” branches), thereby asking the question: “Has KSR2 been subject to diversifying selection at some time during evolution in primates?”

To run BUSTED, open a terminal session and enter HYPHYMP from the command line to launch the HyPhy analysis menu. Enter 1 (Selection Analyses) and then 5 to reach the BUSTED analysis menu, and supply values for the following prompts:

Choose genetic code. This option tells HyPhy which translation table to use for codon-level analyses. Enter 1 to use the Universal genetic code.
Select a coding sequence alignment file. Provide the full path to the dataset of interest: /path/to/data/ksr2.fna .
A tree was found in the data file…Would you like to use it (y/n)? Enter “y” to use the tree.
Choose the set of branches to test for selection. Enter 1 to test all branches for selection.
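For batch work, the menu answers listed above could be supplied non-interactively. The sketch below is an assumption-laden illustration rather than a documented HyPhy interface: it assumes that HYPHYMP reads its menu answers from standard input when they are piped in, and that the menu numbering matches the steps above (numbering can differ between HyPhy versions). The function names are hypothetical.

```python
import subprocess


def busted_menu_answers(alignment_path: str) -> str:
    """Answers in prompt order: Selection Analyses, BUSTED, Universal code,
    alignment file, accept the embedded tree, test all branches."""
    answers = ["1", "5", "1", alignment_path, "y", "1"]
    return "\n".join(answers) + "\n"


def run_busted(alignment_path: str, executable: str = "HYPHYMP") -> str:
    """Run BUSTED by piping the answers to the executable (assumed behavior)."""
    completed = subprocess.run(
        [executable],
        input=busted_menu_answers(alignment_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout  # the Markdown progress report


print(repr(busted_menu_answers("/path/to/data/ksr2.fna")))
```

It is worth running the analysis interactively once to confirm the prompt order before relying on scripted answers.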

BUSTED will now run to completion, printing status indicators to screen while it runs. For an example of how this output will look when rendered into HTML (or similarly, PDF), see this link: http://bit.ly/2vsRZrh.

Listing 1 Partial BUSTED screen output

### Branches to test for selection in the BUSTED analysis
* Selected 15 branches to test in the BUSTED analysis: `HUM, PAN, Node6, GOR, Node5, PON, Node4, GIB, Node3, MAC, BAB, Node12, Node2, MAR, BUS`

### Obtaining branch lengths and nucleotide substitution biases under the nucleotide GTR model
* Log(L) = -5768.01, AIC-c = 11582.06 (23 estimated parameters)

### Obtaining the global omega estimate based on relative GTR branch lengths and nucleotide substitution biases
* Log(L) = -5342.48, AIC-c = 10745.17 (30 estimated parameters)
* non-synonymous/synonymous rate ratio for *test* = 0.0342

### Improving branch lengths, nucleotide substitution biases, and global dN/dS ratios under a full codon model
* Log(L) = -5333.46, AIC-c = 10727.13 (30 estimated parameters)
* non-synonymous/synonymous rate ratio for *test* = 0.0307

### Performing the full (dN/dS > 1 allowed) branch-site model fit
* Log(L) = -5319.67, AIC-c = 10707.62 (34 estimated parameters)
* For *test* branches, the following rate distribution for branch-site combinations was inferred

| Selection mode         | dN/dS   | Proportion, % | Notes |
|------------------------|---------|---------------|-------|
| Negative selection     | 0.024   | 99.151        |       |
| Negative selection     | 0.085   | 0.812         |       |
| Diversifying selection | 118.143 | 0.037         |       |

### Performing the constrained (dN/dS > 1 not allowed) model fit
* Log(L) = -5326.18, AIC-c = 10718.63 (33 estimated parameters)
* For *test* branches under the null (no dN/dS > 1 model), the following rate distribution for branch-site combinations was inferred

| Selection mode     | dN/dS | Proportion, % | Notes                |
|--------------------|-------|---------------|----------------------|
| Negative selection | 0.000 | 10.598        |                      |
| Negative selection | 0.000 | 86.086        | Collapsed rate class |
| Neutral evolution  | 1.000 | 3.316         |                      |

----
## Branch-site unrestricted statistical test of episodic diversification [BUSTED]
Likelihood ratio test for episodic diversifying positive selection, **p = 0.0015**.

Note

Interpreting Results

The results printed to the terminal indicate a highly significant result (P = 0.0015) in the test for whole-gene selection. Analysis with BUSTED therefore provides robust evidence that KSR2 experienced episodic positive selection in the primates. Because we performed the BUSTED analysis on the entire tree (i.e., without a specified set of test branches), we do not know from this result along which lineages KSR2 was subject to positive selection. We can conclude only that a non-zero proportion of sites on some lineage(s) in the primate tree experienced diversifying selection pressure.

The output additionally provides information about the specific BUSTED model fits to the test data, including the inferred ω distributions and corresponding weights. The BUSTED alternative model (shown under the output header Performing the full (dN/dS > 1 allowed) branch-site model fit) found that a very small proportion (only ∼0.037%) of branch-site combinations evolved under a very large ω of over 100 (118.143). Importantly, neither the proportion nor the ω estimate is precise, because both were derived from a small subset of the data. As such, all that the BUSTED test establishes is that the proportion of sites along test lineages (here, the entire phylogeny) with ω > 1 is non-zero. For example, if BUSTED had inferred a rate category of ω = 10 on a different gene, it would not be correct to claim that this gene evolves under weaker selection than does KSR2. A formal statistical test would have to be carried out to establish such a claim.

Conversely, had the result not been statistically significant, we would not be able to reject the null hypothesis that no positive selection had occurred in KSR2. Importantly, however, a negative finding would not unequivocally rule out the presence of positive selection. This outcome could be due to a lack of statistical power, wherein the provided data did not contain a sufficiently strong signal of selection.

BUSTED’s fixed a priori assumption of model complexity (a three-rate ω distribution) may lead to over-parameterized (or under-parameterized) models. For example, in the constrained model for KSR2, two of the three rate classes have the same value of ω (0.0), implying that one of them is unnecessary. HyPhy will report this to the screen as a diagnostic message “Collapsed rate class,” but there is no corrective action that needs to be taken. These messages simply point to low-complexity data.

We will additionally take this opportunity to showcase the visual power of our accompanying web application, HyPhy-Vision. Figure 1 displays the rendering of the output ksr2.fna.BUSTED.json as it appears in HyPhy-Vision. On this site, users can interactively view and explore inference results, view figures and charts, and perform other tasks.

Fig. 1 Example analysis visualization in HyPhy-Vision of BUSTED results. (a) The summary section provides a brief overview of the analysis performed, including information about the inputted data (which can be downloaded via the linked file name) and primary results from the hypothesis test performed. (b) The model statistics section provides information about models fitted to the data. In BUSTED, this section additionally includes an interactive display of site evidence ratios, which can be interpreted as a descriptive measure for which sites may have contributed to the selection signal. (c) The tree section displays the phylogeny as fitted under all inferred models and data partitions, if specified. Tree views can be toggled under the Options drop-down menu. (d) Graphical views of each model’s inferred ω distribution can be viewed when clicking on a given row’s plot icon in the Model fits table seen in (b).

Rules of Thumb for BUSTED Use

Best applied to small- or medium-sized datasets (e.g., up to 100 sequences). Larger datasets will take longer to run and may not be well described by a fixed complexity model.
If one suspects that only a small subset of lineages is subject to selection, e.g., because the phenotype, environment, or fitness changed along those branches, designating those a priori as the test set will significantly boost power.
In simulation studies, BUSTED performs best when a sufficient proportion (5–10%) of branch-site combinations is subject to positive diversifying selection, and the effect size (ω value) is reasonably large (e.g., ≥ 3).

3.3 RELAX

Note

What Biological Question Is the Method Designed to Answer?

Is there evidence that the strength of selection has been relaxed (or conversely intensified) on a specified group of lineages (Test) relative to a set of reference lineages (Reference)? We note that the RELAX framework can perform this specific hypothesis test as well as fit a suite of descriptive models which address, for example, overall rate differences between test and reference branches or lineage-specific inferences of selection relaxation. We focus our attention here on RELAX’s hypothesis testing abilities. More information about descriptive analyses is available on hyphy.org as well as in RELAX’s primary publication [43]. Importantly, RELAX is not designed to detect diversifying selection specifically.

Recommended Applications

Testing for a systematic shift (relaxation/intensification) in the distribution of selection pressure associated with major biological transitions such as host switching in viruses [6] or lifestyle evolution in bacteria (i.e., transition from free-living to endosymbiotic lifestyle [43]).
Comparing selective regimes between two subsets of branches in the tree, e.g., to investigate selective differences among transmission routes in HIV-1 [42].

Note

Statistical Test Procedure

Given a tree with at least two sets of branches, one of which is designated as Test, and the other as Reference, the core version of RELAX compares two nested models, which follow the same general framework as BUSTED. Each (branch, site) combination is drawn independently from a 3-rate ω distribution. The evolutionary rates for Test branches are functions of those for Reference branches. Specifically,

$$\omega_{\text{Test}} = \omega_{\text{Reference}}^{K},$$

where K is the relaxation or intensification parameter. The alternative model infers K from the data, and the null model sets K = 1. Statistical significance is obtained by the likelihood ratio test, assuming the $\chi^2_1$ asymptotic distribution under the null model. A significant result of K > 1 indicates that selection strength has been intensified along the test branches, and a significant result of K < 1 indicates that selection strength has been relaxed along the test branches. In other words, for K < 1 the Test ω values shrink toward neutrality (ω = 1) relative to Reference, and for K > 1 they move away from neutrality.

If some branches in the tree belong to neither the Test nor the Reference set, they are allocated to a group with its own (Unclassified) distribution of ω, which is uncoupled from the testing procedure.
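The coupling between Test and Reference rates can be illustrated numerically. The sketch below assumes the exponent form ω_Test = ω_Reference^K and a χ² reference distribution with one degree of freedom for the likelihood ratio test; both assumptions agree, up to rounding, with the numbers printed in Listing 2 (K = 0.73, the reference ω values, and the two log-likelihoods). The function names are hypothetical.

```python
import math


def relax_transform(reference_omegas, k):
    """Map Reference omega values to Test values via omega ** K."""
    return [omega ** k for omega in reference_omegas]


def relax_p_value(loglik_alternative: float, loglik_null: float) -> float:
    """Likelihood ratio test p-value, assuming chi-squared with 1 d.f."""
    lrt = 2.0 * (loglik_alternative - loglik_null)
    # Upper tail of chi-squared(1) expressed through the complementary
    # error function: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(max(lrt, 0.0) / 2.0))


reference = [0.009, 0.035, 1.591]           # reference omegas, Listing 2
test = relax_transform(reference, k=0.73)   # compare with 0.031, 0.086, 1.406
print([round(omega, 3) for omega in test])
print(round(relax_p_value(-14337.22, -14342.33), 4))  # 0.0014
```

With K < 1, every transformed value lies closer to 1 than its reference counterpart, which is the sense in which selection is relaxed. Small discrepancies from the listed Test values are expected because the listing reports K and ω to only a few decimal places.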

10.

Note

Example Analysis

We will perform a RELAX analysis using a dataset of Influenza A PB2 subunit sequences from Tamuri et al. [41]. The PB2 subunit, which is part of influenza’s RNA polymerase complex, has emerged as a critical determinant of influenza infectivity and, as a consequence, host range [9, 18]. The dataset we examine here contains sequences from both avian host and human host strains (see footnote 4). Previous studies have shown that this host switch is correlated with significant shifts in selection pressures and preferred amino acids at key sites in PB2 [36, 40, 41]. We now re-analyze this dataset using RELAX to ask a different but related question: “Was the shift from avian to human hosts associated with a relaxation of selection pressures in Influenza A PB2?”

RELAX requires an a priori specification of test and reference lineages, although not all lineages in a tree need to be classified. As such, you must label your test (and reference, if desired) branches in the input phylogeny. We provide an online widget to assist with tree labeling at http://phylotree.hyphy.org. The dataset we have provided for this analysis already has a labeled phylogeny, with the human host lineages labeled as “test.”
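Branch labels can also be added programmatically when the test set follows a naming pattern. The sketch below assumes that labels are written as a `{label}` suffix on a node name in the Newick string, which is the annotation the labeling widget is understood to produce; the function name `label_tips`, the toy tree, and the `_human` naming pattern are hypothetical. Only tips are labeled here, so internal branches would still need to be labeled separately, for example with the widget.

```python
import re


def label_tips(newick: str, should_label, label: str = "test") -> str:
    """Append {label} to every tip name for which should_label(name) is true."""
    # Tip names directly follow "(" or ","; internal node names follow ")".
    tip_name = re.compile(r"(?<=[(,])([^(),:;{}\s]+)")

    def add_label(match):
        name = match.group(1)
        return f"{name}{{{label}}}" if should_label(name) else name

    return tip_name.sub(add_label, newick)


tree = "((A_human:0.1,B_avian:0.2):0.05,C_human:0.3);"
print(label_tips(tree, lambda name: name.endswith("_human")))
# ((A_human{test}:0.1,B_avian:0.2):0.05,C_human{test}:0.3);
```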

To run RELAX, open a terminal session and enter HYPHYMP from the command line to launch the HyPhy analysis menu. Enter 1 (Selection Analyses) and then 7 to reach the RELAX analysis menu, and supply values for the following prompts:

Choose genetic code. Enter 1 to use the Universal genetic code.
Select a coding sequence alignment file. Provide the full path to the dataset of interest: /path/to/data/pb2.fna .
A tree was found in the data file…Would you like to use it (y/n)? Enter “y” to use the tree.
Choose the set of branches to test for selection. This option asks you to specify the label inside your tree used to specify the test lineages. You can either select all unlabeled branches, or HyPhy will show all labels it found in the tree you provided. Enter 1 to select the branches labeled as “test” as the test set in RELAX analysis. Note that when multiple labels are present in your tree, HyPhy will issue an additional prompt to choose the set of Reference branches, in the event that some branches should remain Unclassified.
Analysis type. This option asks you to specify the scope of RELAX analysis. Selecting “Minimal” will run the RELAX hypothesis test, and selecting “All” will run hypothesis testing and fit two additional descriptive models, described earlier. Here, we will perform only hypothesis testing to determine whether the data shows evidence for a relaxation or intensification of selection intensity between the test and reference lineages. Enter the option 2 to run the “Minimal” analysis.

RELAX will now run to completion, printing status indicators to screen while it runs.

11.

Listing 2 Partial RELAX screen output

### Obtaining branch lengths and nucleotide substitution biases under the nucleotide GTR model
* Log(L) = -16755.26, AIC-c = 33660.66 (75 estimated parameters)

### Obtaining the global omega estimate based on relative GTR branch lengths and nucleotide substitution biases
* Log(L) = -14410.97, AIC-c = 28988.46 (83 estimated parameters)
* non-synonymous/synonymous rate ratio for *Reference* = 0.0401
* non-synonymous/synonymous rate ratio for *Test* = 0.0604

### Improving branch lengths, nucleotide substitution biases, and global dN/dS ratios under a full codon model
* Log(L) = -14354.67, AIC-c = 28875.86 (83 estimated parameters)
* non-synonymous/synonymous rate ratio for *Reference* = 0.0358
* non-synonymous/synonymous rate ratio for *Test* = 0.0609

### Fitting the alternative model to test K != 1
* Log(L) = -14337.22, AIC-c = 28849.02 (87 estimated parameters)
* Relaxation/intensification parameter (K) = 0.73
* The following rate distribution was inferred for **test** branches

| Selection mode         | dN/dS | Proportion, % | Notes |
|------------------------|-------|---------------|-------|
| Negative selection     | 0.031 | 94.752        |       |
| Negative selection     | 0.086 | 2.951         |       |
| Diversifying selection | 1.406 | 2.297         |       |

* The following rate distribution was inferred for **reference** branches

| Selection mode         | dN/dS | Proportion, % | Notes |
|------------------------|-------|---------------|-------|
| Negative selection     | 0.009 | 94.752        |       |
| Negative selection     | 0.035 | 2.951         |       |
| Diversifying selection | 1.591 | 2.297         |       |

### Fitting the null (K := 1) model
* Log(L) = -14342.33, AIC-c = 28857.22 (86 estimated parameters)
* The following rate distribution for test/reference branches was inferred

| Selection mode         | dN/dS | Proportion, % | Notes |
|------------------------|-------|---------------|-------|
| Negative selection     | 0.010 | 94.149        |       |
| Negative selection     | 0.021 | 3.391         |       |
| Diversifying selection | 1.735 | 2.460         |       |

----
## Test for relaxation (or intensification) of selection [RELAX]
Likelihood ratio test **p = 0.0014**.
> Evidence for *relaxation of selection* among **test** branches _relative_ to the **reference** branches at P<=0.05
----

12.

Note

Interpreting Results

On this data, RELAX has inferred a relaxation parameter K = 0.73 with a highly significant P = 0.0014. Therefore, there is evidence to reject the null hypothesis that selection pressure has not been shifted in the test (here, human host) lineages. We instead have strong evidence that selection has been relaxed (because the inferred K < 1) in the human host lineages. In other words, selection in the test branches has generally moved towards neutrality (ω = 1) compared to the reference branches. This finding is consistent with the evolutionary changes that typically occur during a virus host-switching event, wherein selection stringency will be reduced to facilitate viral adaptation.

Keep in mind that RELAX defines relaxation (or intensification) in a fairly restrictive fashion. In other words, all selective regimes (i.e., all ω rates), both negative and positive, must weaken or strengthen. Therefore, certain relaxation scenarios, for example, when only positive selection is relaxed but negative selection is maintained, may result in a non-significant RELAX test even though selection has changed.

Rules of Thumb for RELAX Use

Always provide a labeled phylogeny indicating which branches to include in the “test” lineages. You can additionally label “reference” lineages if you wish to keep some branches as unclassified. It is convenient to use the phylotree.js online widget at http://phylotree.hyphy.org/ to label branches before analysis.

3.4 aBSREL

13.

Note

It is often of interest to determine whether a specific lineage or lineage(s) have been subject to selection. Such analyses have historically been performed using the so-called branch or branch-site class of models, which allow evolutionary rates to vary across branches or across sites and branches [16, 45, 46]. Early versions of branch-site models allowed users to compare selection pressure on a pre-selected set of “foreground” branches to a pre-selected set of “background” branches, on which positive selection was disallowed [45, 46]. (Note that this approach is similar to how BUSTED performs gene-wide selection inference [26].) Later efforts demonstrated that disallowing positive selection on background branches could lead to highly elevated false positive rates and advocated a strategy wherein any branch, regardless of data partition, could evolve at any rate [16]. This strategy has been described as the BS-REL model in HyPhy [16]. However, in BS-REL, each branch was constrained to have three rate categories, an assumption with little justification.

Since then, we have developed a greatly improved branch-site model called aBSREL (“adaptive branch-site random effects likelihood”). Rather than assuming that each branch should be fit with three rate classes, aBSREL infers, using small-sample Akaike Information Criterion correction (AICc), the optimal number of rate categories per branch. In this manner, computational complexity and the number of parameters are greatly reduced, leading to a tractable runtime for larger datasets that could not otherwise be studied with earlier branch-site models.
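The role of AICc in choosing the number of rate classes can be illustrated with the standard small-sample formula, AICc = −2 ln L + 2k + 2k(k + 1)/(n − k − 1). The sketch below is not aBSREL’s implementation: it simply picks the candidate with the lowest AICc, whereas aBSREL’s own branch-by-branch search is not reproduced here. The log-likelihoods, parameter counts, and sample size are hypothetical values chosen for illustration.

```python
def aicc(log_likelihood: float, n_parameters: int, sample_size: int) -> float:
    """Small-sample corrected Akaike Information Criterion."""
    k, n = n_parameters, sample_size
    if n - k - 1 <= 0:
        raise ValueError("sample size is too small for the AICc correction")
    return -2.0 * log_likelihood + 2.0 * k + (2.0 * k * (k + 1)) / (n - k - 1)


def best_rate_class_count(fits: dict, sample_size: int) -> int:
    """fits maps a rate-class count to (log-likelihood, parameter count)."""
    return min(fits, key=lambda c: aicc(fits[c][0], fits[c][1], sample_size))


# Hypothetical fits for one branch with 1, 2, and 3 omega rate classes.
hypothetical_fits = {1: (-1250.0, 1), 2: (-1240.0, 3), 3: (-1239.5, 5)}
print(best_rate_class_count(hypothetical_fits, sample_size=300))  # 2
```

In this toy example the third rate class improves the likelihood too little to offset its extra parameters, so the two-class model is preferred.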

14.

Note

What Biological Question Is the Method Designed to Answer?

Like classical branch-site models, aBSREL asks whether some proportion of sites is subject to positive selection along specific branches or lineages of a phylogeny.

Recommended Applications

Exploratory testing for evidence of lineage-specific positive diversifying selection in small- to medium-sized alignments (up to 100 sequences).
Targeted testing of branches selected a priori for positive diversifying selection. This includes alignments with prohibitive runtimes under older branch-site models (up to ∼1000 sequences) [37].

15.

Note

Statistical Test Procedure

aBSREL uses the information-theoretic criterion AICc to automatically determine the complexity of the evolutionary process at every branch [37]. As a heuristic optimization, aBSREL will always examine branches in order from longest to shortest, because longer branches tend to be the ones requiring more complex models. In this adaptive model, one rate class is allowed to assume any value of ω > 1, whereas every other inferred rate class is constrained to ω ≤ 1. In the null model, all ω categories are constrained to ω ≤ 1. For any branch inferred to have sufficient rate variation (i.e., more than one rate category) where one rate category is described by ω > 1, aBSREL will proceed to fit a null model to this branch. If, instead, the maximum inferred ω on a branch is ≤ 1, the null model has exactly the same fit as the alternative model, and the resulting P-value is 1. The test for lineage-specific diversifying selection is performed by comparing the full model versus the nested null model, and statistical significance is obtained by the likelihood ratio test. Significance is evaluated using a mixture of 𝜒₀², 𝜒₁², and 𝜒₂² distributions (with mixing proportions determined via simulations by Smith et al. [37]). Finally, aBSREL will correct all P-values obtained from individual tests for multiple comparisons using the Bonferroni–Holm procedure to control family-wise false-positive rates (i.e., the probability of generating one or more false positives, when all null hypotheses are correct).

One can either select a specific set of branches in order to test a specific a priori hypothesis or one can perform an exploratory analysis across the entire phylogeny by testing all branches for selection. The former approach may have substantially more power to detect selection, especially if only a few branches in a large tree are chosen, due to the decreased volume of multiple testing. However, this approach does carry the risk of failing to identify branches subject to positive selection that have not been included in the test set.
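As an illustration of how a likelihood ratio test statistic is converted to a P-value under a mixture reference distribution, the sketch below assumes a mixture of 𝜒² distributions with 0, 1, and 2 degrees of freedom. The mixing weights (0.5, 0.2, 0.3) are an assumption that is not stated in this text; they happen to reproduce, to within rounding, the uncorrected P-values printed in Listing 3, but they should be checked against the HyPhy version in use.

```python
import math

def chi2_sf(x, df):
    """Survival function P(X > x) for a chi-square variable with 0, 1, or 2 d.f."""
    if df == 0:
        return 0.0 if x > 0 else 1.0          # point mass at zero
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))  # closed form for 1 d.f.
    if df == 2:
        return math.exp(-x / 2.0)             # closed form for 2 d.f.
    raise ValueError("this sketch only supports df = 0, 1, 2")

def mixture_p_value(lrt, weights=(0.5, 0.2, 0.3)):
    """P-value of an LRT statistic under a weighted chi2_0 / chi2_1 / chi2_2 mixture.

    The default weights are an assumption made for illustration.
    """
    if lrt <= 0:
        return 1.0  # identical null and alternative fits
    return sum(w * chi2_sf(lrt, df) for df, w in enumerate(weights))

# LRT statistics copied from the aBSREL screen output in Listing 3.
for branch, lrt in [("Separator", 14.13), ("0564_4", 4.81), ("0564_7", 0.00)]:
    print(f"{branch}: LRT = {lrt:.2f}, uncorrected p = {mixture_p_value(lrt):.5f}")
```

Under these assumed weights the first two statistics map to approximately 0.0003 and 0.033, in line with the 0.00029 and 0.03281 reported by aBSREL, and a statistic of zero maps to a P-value of 1.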

16.

Note

Example Analysis

Here, we will demonstrate aBSREL use and interpretation using a dataset of HIV-1 env sequences collected from an epidemiologically linked donor–recipient transmission pair [7]. This dataset can be found in the provided file hiv1_transmission.fna.

To run aBSREL, open a terminal session and enter HYPHYMP from the command line to launch the HyPhy analysis menu. Enter 1 (Selection Analyses) and then 6 to reach the aBSREL analysis menu, and supply values for the following prompts:

Choose genetic code. This option tells HyPhy which translation table to use for codon-level analyses. Enter 1 to use the Universal genetic code.
Select a coding sequence alignment file. Provide the full path to the dataset of interest: /path/to/hiv1_transmission.fna.
A tree was found in the data file…Would you like to use it (y/n)? Enter “y” to use the included tree.
Choose the set of branches to test for selection. You can now select on which branches aBSREL should conduct a formal hypothesis test for positive selection. Enter 1 to test all branches for selection.

aBSREL will now run to completion, printing status indicators to screen while it runs (some output abbreviated).

17.

Listing 3 Partial aBSREL screen output

### Obtaining branch lengths and nucleotide substitution biases under the nucleotide GTR model
* Log(L) = -5524.50, AIC-c = 11153.08 (52 estimated parameters)

### Fitting the baseline model with a single dN/dS class per branch, and no site-to-site variation.
* Log(L) = -5402.40, AIC-c = 11009.72 (102 estimated parameters)
* Branch-level non-synonymous/synonymous rate ratio distribution has median 0.66, and 95% of the weight in 0.00--5.41

### Determining the optimal number of rate classes per branch using a step up procedure

| Branch    | Length | Rates | Max. dN/dS      | Log(L)   | AIC-c    | Best AIC-c so far |
|-----------|--------|-------|-----------------|----------|----------|-------------------|
| 0564_22   | 0.01   | 2     | 1.96 (52.27%)   | -5402.41 | 11013.78 | 11009.72          |
| 0564_7    | 0.01   | 2     | 0.74 ( 5.19%)   | -5402.40 | 11013.76 | 11009.72          |
| Separator | 0.01   | 2     | 197.32 ( 3.95%) | -5397.53 | 11004.02 | 11004.02          |
| Separator | 0.01   | 3     | 180.22 ( 4.08%) | -5397.53 | 11008.06 | 11004.02          |
| 0564_4    | 0.01   | 2     | 29.79 ( 2.15%)  | -5394.37 | 11001.74 | 11001.74          |
| 0564_4    | 0.01   | 3     | 29.78 ( 2.15%)  | -5394.37 | 11005.78 | 11001.74          |
| 0564_3    | 0.01   | 2     | 126.86 ( 3.14%) | -5388.59 | 10994.22 | 10994.22          |
| 0564_3    | 0.01   | 3     | 135.96 ( 3.05%) | -5388.59 | 10998.25 | 10994.22          |
| 0564_9    | 0.01   | 2     | 10.01 ( 8.61%)  | -5388.37 | 10997.82 | 10994.22          |
...
| Node53    | 0.00   | 2     | 1.00 (100.00%)  | -5371.63 | 10976.46 | 10971.76          |
| 0557_6    | 0.00   | 2     | 27.66 (100.00%) | -5371.32 | 10975.83 | 10971.76          |
| 0557_21   | 0.00   | 2     | 0.25 ( 1.96%)   | -5371.30 | 10975.80 | 10971.76          |
| 0557_7    | 0.00   | 2     | 0.25 ( 1.96%)   | -5371.30 | 10975.80 | 10971.76          |

### Rate class analyses summary
* 38 branches with **1** rate classes
* 6 branches with **2** rate classes

### Improving parameter estimates of the adaptive rate class model
* Log(L) = -5370.66, AIC-c = 10970.49 (114 estimated parameters)

### Testing selected branches for selection

| Branch    | Rates | Max. dN/dS      | Test LRT | Uncorrected p-value |
|-----------|-------|-----------------|----------|---------------------|
| 0564_22   | 1     | 1.22 (100.00%)  | 0.11     | 0.43015             |
| 0564_7    | 1     | 0.61 (100.00%)  | 0.00     | 1.00000             |
| Separator | 2     | 197.72 ( 3.95%) | 14.13    | 0.00029             |
| 0564_4    | 2     | 28.89 ( 2.15%)  | 4.81     | 0.03281             |
| 0564_3    | 2     | 127.66 ( 3.14%) | 14.06    | 0.00030             |
| 0564_9    | 1     | 0.72 (100.00%)  | 0.00     | 1.00000             |
| 0564_1    | 1     | 1.07 (100.00%)  | 0.01     | 0.48208             |
...
| 0557_21   | 1     | 1.00 (100.00%)  | 0.00     | 1.00000             |
| 0557_7    | 1     | 1.00 (100.00%)  | 0.00     | 1.00000             |
----
### Adaptive branch site random effects likelihood test
Likelihood ratio test for episodic diversifying positive selection at Holm-Bonferroni corrected _p = 0.0500_ found **3** branches under selection among **44** tested.

* Node35, p-value = 0.00018
* Separator, p-value = 0.01251
* 0564_3, p-value = 0.01266

18.

Note

Interpreting Results

The first printed markdown table ("Determining the optimal number of rate classes per branch using a step up procedure") summarizes the model selection process. For example, when two ω rates were assigned to branch Separator, this improved the AICc score of the fit (compared to the single-rate model) from 11,009.72 to 11,004.02. However, allocating three ω rates to the same branch worsens the score to 11,008.06. Therefore, the aBSREL model will use two ω rates at this branch.

The second printed markdown table ("Testing selected branches for selection") shows the results of tests for episodic selection on individual branches. At branch 0564_4, for example, the tested model includes two ω rates, with the positive selection class taking on value 28.89 (2.15% proportion of the mixture). Constraining this rate to range between 0 and 1 yields the likelihood ratio test statistic of 4.81, which maps to a P-value (before multiple test correction) of 0.03281.

Finally, aBSREL reports three branches under episodic diversifying selection pressure. Further examination of results using HyPhy-Vision shows that these branches are found (a) along the transmission event from donor to recipient, and (b) within a highly diverged clade in the donor (Fig. 2). The first finding is consistent with an expected increase in evolutionary rate when a virus infects a new host and encounters novel host immunity, and the second finding is consistent with intrahost adaptive dynamics of the donor’s long-term HIV infection. Importantly, a close examination of the markdown-output table under the header "Testing selected branches for selection" reveals several branches with uncorrected P-values whose significance was lost upon applying the Bonferroni–Holm correction, e.g., 0564_4, whose uncorrected P = 0.03281. This result illustrates the potential loss of power incurred by this aBSREL exploratory analysis.
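The step-up choice of rate classes described for branch Separator can be sketched as follows. This is a simplified illustration, not HyPhy's implementation: it assumes the procedure keeps adding a rate class only while the AICc score decreases, and the function names are hypothetical. The AICc scores are copied from Listing 3; the `aicc` helper implements the standard small-sample formula, for which the sample size used by HyPhy is not given in this text.

```python
def aicc(log_likelihood, n_params, n_samples):
    """Standard small-sample corrected AIC: -2 lnL + 2k + 2k(k + 1) / (n - k - 1)."""
    k = n_params
    return -2.0 * log_likelihood + 2.0 * k + (2.0 * k * (k + 1)) / (n_samples - k - 1)

def select_rate_classes(aicc_by_rates):
    """Step up from one rate class and stop once an extra class no longer lowers AICc.

    aicc_by_rates maps the number of omega classes on a branch to the AICc of that fit.
    Returns (chosen number of classes, AICc of the chosen fit).
    """
    best_rates = 1
    best_score = aicc_by_rates[1]
    rates = 2
    while rates in aicc_by_rates and aicc_by_rates[rates] < best_score:
        best_rates, best_score = rates, aicc_by_rates[rates]
        rates += 1
    return best_rates, best_score

# Scores for branch Separator from Listing 3; the 1-class entry is the
# "Best AIC-c so far" value in effect when this branch was examined.
separator = {1: 11009.72, 2: 11004.02, 3: 11008.06}
print(select_rate_classes(separator))  # two classes are retained
```

The same comparison explains why branches such as 0564_22 keep a single rate class: their two-class score (11013.78) is higher than the best score available at that point (11009.72).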
Rules of Thumb for aBSREL Use

A priori identification of branches to test for selection will generally increase power to detect selection on those branches. That said, to maintain statistical robustness, we strongly discourage performing multiple separate tests for selection on different branch sets. Such an approach will necessarily introduce false positives. In such a case, we recommend performing an exploratory analysis wherein all branches are considered.

Exploratory analyses of very large datasets are unlikely to yield many significant results, because correcting for multiple testing will reduce power as the number of branches grows, while the amount of statistical signal does not increase for larger datasets. One option is to thin out large phylogenies (before performing any testing), retaining major clades and lineages of interest.
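The effect of multiple-testing correction on power can be seen in a short sketch of the Holm step-down procedure. This is a generic implementation written for illustration, not code taken from HyPhy. Three uncorrected P-values are copied from Listing 3; the remaining 41 tests are filled with a placeholder value of 1.0 because the listing is abbreviated, so the adjusted values will not exactly match the corrected P-values printed by aBSREL.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted P-values in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest P first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest P-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # adjusted values never decrease
        adjusted[idx] = running_max
    return adjusted

# Uncorrected P-values for Separator, 0564_3, and 0564_4 (Listing 3),
# padded with placeholder values to reach the 44 branches tested.
uncorrected = [0.00029, 0.00030, 0.03281] + [1.0] * 41
adjusted = holm_bonferroni(uncorrected)

for name, raw, adj in zip(["Separator", "0564_3", "0564_4"], uncorrected, adjusted):
    print(f"{name}: uncorrected = {raw:.5f}, Holm-adjusted = {adj:.5f}, "
          f"significant at 0.05: {adj <= 0.05}")
```

With 44 tests, an uncorrected P-value of 0.03281 is multiplied by a factor of roughly 40 and is no longer significant, whereas the same value would remain significant if only that one branch had been selected a priori.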

Fig. 2 HyPhy-Vision tree viewer depicting the fitted aBSREL adaptive model for the HIV-1 data. Branches are colored by their inferred ω distribution, as indicated in the legend. Lineages identified as under positive selection at P < 0.05 after correction for multiple testing are shown with thick branches, with color distributions representing the relative values and proportions of inferred ω categories. Note that taxon labels beginning with “0564” represent HIV-1 sequences derived from the donor patient, and labels beginning with “0557” represent HIV-1 sequences derived from the recipient patient.

3.5 Site-Level Selection: MEME, FEL, SLAC, and FUBAR

19.

Note

What Biological Question Is the Method Designed to Answer?

The methods FEL, SLAC, and FUBAR address the question: Which site(s) in a gene are subject to pervasive, i.e., consistently across the entire phylogeny, diversifying selection? MEME addresses a more general question: Which site(s) in a gene are subject to pervasive or episodic, i.e., only on a single lineage or subset of lineages, diversifying selection?

Recommended Applications

MEME is the sole method in HyPhy for detecting selection at individual sites that considers both pervasive and episodic selection. MEME is therefore our recommended method if maximum power is desired.

The phenomenon of pervasive selection is generally most prevalent in pathogen evolution and any biological system influenced by evolutionary arms race dynamics (or balancing selection), including adaptive immune escape by viruses. As such, FEL, SLAC, and FUBAR are ideally suited to identify sites under positive selection which represent candidate sites subject to strong selective pressures across the entire phylogeny. Each of these methods has a particular use case as well:

FEL is our recommended method for analyzing small-to-medium size datasets when one wishes only to study pervasive selection at individual sites.
FUBAR is our recommended method for detecting pervasive selection at individual sites on large (> 500 sequences) datasets for which other methods have prohibitive runtimes, unless you have access to a computer cluster.
SLAC provides legacy functionality as a counting-based method adapted for phylogenetic applications. In general, this method will be the least statistically robust.

20.

Note

Statistical Test Procedure

Each method presented here employs a distinct algorithmic approach to inferring selection. FEL uses maximum likelihood to fit a codon model to each site, thereby estimating a value for dN and dS at each site. FEL tests for selection with the likelihood ratio test using the 𝜒₁² distribution, asking whether the dN estimate differs significantly from the inferred dS estimate (dN > dS indicating positive selection, and dN < dS indicating negative selection).

SLAC represents the most basic inference method and is an extension of the Suzuki–Gojobori counting-based method [39] for phylogenetically related sequences (as opposed to sequence pairs). SLAC uses maximum likelihood to infer ancestral characters for each site across the phylogeny and then directly counts the number of synonymous and non-synonymous changes which have occurred at each site over evolutionary time. SLAC then tests for selection by testing whether or not there are too many or too few non-synonymous changes compared to what is expected under neutrality. The neutral expectation is derived based on the phylogeny-wide estimated numbers of synonymous and non-synonymous nucleotide sites at a given codon. The statistical test employs the binomial distribution to compute significance, e.g., how likely is it to observe 13 non-synonymous and 1 synonymous substitutions at a site, if the expected synonymous to non-synonymous substitution count ratio under neutrality is 1:4?

MEME employs a mixed-effects maximum likelihood approach. For each site, MEME infers two ω rate classes and corresponding weights representing the probability that the site evolves under each rate class at a given branch. To this end, MEME infers a single α (dS) parameter and two separate β (dN) parameters, β₋ and β₊. The ω rates per site, therefore, consist of β₊∕α and β₋∕α. MEME uses this framework to fit a null and an alternative model, both enforcing the constraint β₋ ≤ α. The null model disallows positive selection by enforcing the constraint β₊ ≤ α, whereas the alternative model places no constraint on β₊. MEME uses the likelihood ratio test to compare null and alternative model fits, with significance assessed using a mixture of 𝜒₀², 𝜒₁², and 𝜒₂² distributions.

FUBAR takes a Bayesian approach to selection inference and is a particular case of statistical models developed in the context of document classification (latent Dirichlet allocation). The key innovation of FUBAR’s approach is its use of an a priori specified grid of dN and dS values (typically 20 × 20), spanning the range of negative, neutral, and positive selection regimes, whose likelihoods can be pre-computed and used throughout analysis (rather than having to re-compute likelihoods during optimization as traditional random-effects approaches do [12, 29]). This approach, combined with other algorithmic advances, speeds computation time by at least an order of magnitude compared to FEL, while yielding comparable statistical performance. FUBAR estimates every model parameter except the proportion of sites allocated to each grid point using simple (and fast) nucleotide models. The proportions are estimated using an MCMC procedure, and non-neutral evolution at each site is inferred using a straightforward naive empirical Bayes approach [29]. Sites are called positively or negatively selected if the corresponding posterior probabilities are sufficiently high.

Note that FEL and SLAC report both positively and negatively selected sites, but MEME and FUBAR report only sites under positive selection.
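The binomial question posed in the SLAC description (13 non-synonymous and 1 synonymous substitution when the neutral synonymous to non-synonymous ratio is 1:4) can be worked through directly. The sketch below assumes a one-sided upper-tail test for an excess of non-synonymous changes; the function name is hypothetical, and SLAC's own implementation may differ in detail.

```python
from math import comb

def binomial_upper_tail(n_nonsyn, n_syn, p_nonsyn):
    """P(at least n_nonsyn non-synonymous changes out of n_nonsyn + n_syn substitutions)
    when each substitution is non-synonymous with probability p_nonsyn under neutrality."""
    total = n_nonsyn + n_syn
    return sum(
        comb(total, k) * p_nonsyn ** k * (1.0 - p_nonsyn) ** (total - k)
        for k in range(n_nonsyn, total + 1)
    )

# A 1:4 synonymous to non-synonymous ratio means 4 of every 5 neutral changes are non-synonymous.
p_value = binomial_upper_tail(n_nonsyn=13, n_syn=1, p_nonsyn=4 / 5)
print(f"P(N >= 13 of 14 | neutral) = {p_value:.3f}")
```

The resulting probability is about 0.198, so even a site with 13 of 14 changes being non-synonymous would not be called positively selected at P ≤ 0.1 under this neutral expectation. This illustrates why counting-based tests need many substitutions to reach significance.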

21.

Note

Example Analysis

We will demonstrate the use and interpretation of site-level methods using data from influenza strain H3N2 (the “Hong Kong flu”), the primary circulating strain of seasonal influenza since the late 1960s. We specifically will assess selection on the H3 hemagglutinin, the influenza surface protein which is responsible for host cell binding. Hemagglutinin experiences rapid evolution triggered by host immune escape, and previous studies have identified numerous signatures of positive diversifying selection in H3 sequences with a particular concentration around the host-binding domain [28].

We base analyses here on an alignment from Meyer and Wilke [22] of H3 sequences sampled over time since the 1991–1992 influenza season. We removed all partial and strongly outlying sequences (i.e., those with excessive divergence) from the original dataset before proceeding, yielding 2555 sequences to comprise our “full” H3 dataset. We further subsetted this alignment to two smaller alignments with comparable numbers of taxa but spanning different evolutionary time frames: The first smaller alignment (“trunk”) contains 163 sequences sampled along the influenza H3 trunk, whereas the second smaller alignment (“shallow”) contains 121 sequences sampled from a single clade (Fig. 3). Therefore, while these two smaller datasets contain a comparable number of sequences, the trunk dataset spans a much longer time frame and contains substantially more sequence divergence relative to the shallow dataset. Indeed, the trunk dataset has a total tree length (sum of branch lengths, in units substitutions/site/unit time) of 0.43, whereas the shallow dataset has a total tree length of 0.12, meaning that the trunk dataset contains roughly 3.6 times the amount of sequence divergence seen in the shallow dataset. We have compiled results for all three datasets analyzed with all four methods (Table 1).
We now describe, using the trunk dataset as an example, how to run each of these analyses in HyPhy.

Fig. 3 Phylogeny of H3 hemagglutinin sequences analyzed here. Tip colors indicate the sequences selected for each dataset.

Table 1 Sites identified as positively selected across the H3 datasets analyzed here. Bold sites are those identified by multiple methods for a given dataset. Bold italicized sites are those identified in more than one dataset, generally by more than one method. Numbers in parentheses give the total number of positively selected sites identified with the given method and dataset. ∗For FUBAR, significance is assessed as posterior probability ≥ 0.9.

FEL: Launch HyPhy from the command line, and enter options 1 (Selection Analyses) and then 2 to reach the FEL analysis menu, and supply values for the following prompts:

Choose genetic code. Enter 1 to use the Universal genetic code.
Select a coding sequence alignment file. Provide the full path to the dataset of interest: /path/to/data/h3_trunk.fna.
A tree was found in the data file…Would you like to use it (y/n)? Enter “y” to use the tree.
Choose the set of branches to test for selection. This option allows you to specify the branches along which site-level inference should be performed. Enter 1 to test all branches for selection.
Use synonymous rate variation? This option asks you to specify whether the dS parameter in the codon model should be allowed to vary across sites (“Yes”) or be fixed to 1 at all sites (“No”). Enter 1 to use a model with synonymous rate variation.
Select the P-value used to perform the test at (permissible range = [0,1], default value = 0.1). Provide the default threshold of 0.1.

FEL will now run to completion and print status indicators to the screen, including results for any site found to be under selection (either positive or negative). Abbreviated results are shown below.

22.

Listing 4 Partial FEL screen output

### Obtaining branch lengths and nucleotide rates under the GTR model
* Log(L) = -7506.06

### Obtaining the global omega estimate based on relative GTR branch lengths and nucleotide substitution biases
* Log(L) = -7302.10
* non-synonymous/synonymous rate ratio for *test* = 0.2923

### Improving branch lengths, nucleotide substitution biases, and global dN/dS ratios under a full codon model
* Log(L) = -7289.65
* non-synonymous/synonymous rate ratio = 0.2598

### For partition 1 these sites are significant at p <=0.1

| Codon | Partition | alpha | beta  | LRT   | Selection detected? |
|:-----:|:---------:|:-----:|:-----:|:-----:|:-------------------:|
...
| 146   | 1         | 3.818 | 0.000 | 7.336 | Neg. p = 0.0068     |
| 152   | 1         | 1.968 | 0.000 | 3.634 | Neg. p = 0.0566     |
| 154   | 1         | 0.000 | 3.912 | 4.652 | Pos. p = 0.0310     |
| 159   | 1         | 4.413 | 0.716 | 2.972 | Neg. p = 0.0847     |
| 164   | 1         | 2.082 | 0.000 | 2.713 | Neg. p = 0.0995     |
| 176   | 1         | 1.659 | 0.000 | 2.986 | Neg. p = 0.0840     |
| 177   | 1         | 6.393 | 0.000 | 8.421 | Neg. p = 0.0037     |
| 181   | 1         | 1.928 | 0.000 | 3.286 | Neg. p = 0.0699     |
| 190   | 1         | 2.085 | 0.000 | 2.715 | Neg. p = 0.0994     |
| 201   | 1         | 1.645 | 0.000 | 3.370 | Neg. p = 0.0664     |
| 208   | 1         | 0.000 | 3.625 | 4.668 | Pos. p = 0.0307     |
...

### ** Found _3_ sites under pervasive positive diversifying and _115_ sites under negative selection at p <= 0.1**

Note

Inference details for codons with significant likelihood ratio tests for positive or negative selection are reported to the screen.

Codon: The codon where non-neutral evolution has been detected.
Partition: Allows one to keep track of which subset of the alignment a particular site belongs to. This is important for recombination-corrected partition analyses.
alpha: Site-specific synonymous substitution rate.
beta: Site-specific non-synonymous substitution rate.
LRT: Site-specific likelihood ratio test statistic for non-neutral evolution (alpha ≠ beta).
Selection detected?: Selection classification (positive or negative) and the corresponding P-value.

Note that the “Codon” and “Partition” columns are common to all site-specific analyses.

MEME and SLAC: SLAC and MEME follow the same menu prompts as FEL, with the exception that only FEL will prompt for synonymous rate variation. Instead, SLAC has a different prompt for Step 5: Select the number of samples used to assess ancestral reconstruction uncertainty. If this number is positive, then HyPhy will draw samples from the distribution of ancestral states and use them to measure whether or not inference is sensitive to ancestral inference uncertainty. When you encounter this option, provide the default value of 100 (or 0 to forego sampling). MEME does not emit any additional prompts.
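The relationship between the alpha, beta, and LRT columns and the final call can be sketched as follows, assuming that the P-value is the upper tail of a 𝜒² distribution with one degree of freedom evaluated at the LRT statistic. The function name and return format are hypothetical; the input values are copied from Listing 4.

```python
import math

def fel_site_call(alpha, beta, lrt, threshold=0.1):
    """Classify a site from FEL-style estimates.

    Returns (label, p_value); the P-value is the chi-square (1 d.f.) upper tail of the LRT.
    """
    p_value = math.erfc(math.sqrt(lrt / 2.0))  # closed-form survival function for 1 d.f.
    if p_value > threshold:
        return "Not significant", p_value
    label = "Pos." if beta > alpha else "Neg."
    return label, p_value

# (codon, alpha, beta, LRT) rows copied from the FEL screen output.
rows = [(146, 3.818, 0.000, 7.336), (154, 0.000, 3.912, 4.652), (208, 0.000, 3.625, 4.668)]
for codon, alpha, beta, lrt in rows:
    label, p = fel_site_call(alpha, beta, lrt)
    print(f"codon {codon}: {label} p = {p:.4f}")
```

Under this assumption, the computed values agree with the printed P-values to within rounding of the LRT statistic (for example, approximately 0.031 for codon 154).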

23.

Listing 5 Partial SLAC screen output

...
### For partition 1 these sites are significant at p <=0.1

| Codon | Partition | S     | N     | dS    | dN    | Selection detected? |
|:-----:|:---------:|:-----:|:-----:|:-----:|:-----:|:-------------------:|
...
| 146   | 1         | 3.000 | 0.000 | 3.000 | 0.000 | Neg. p = 0.037      |
| 154   | 1         | 0.000 | 8.000 | 0.000 | 4.000 | Pos. p = 0.039      |
| 177   | 1         | 3.000 | 0.000 | 4.038 | 0.000 | Neg. p = 0.020      |
| 208   | 1         | 0.000 | 6.000 | 0.000 | 2.994 | Pos. p = 0.089      |

...
### Ancestor sampling analysis

> Generating 100 ancestral sequence samples to obtain confidence intervals

Resampling results for partition 1

| Codon | Part. | S [median, IQR]  | N [median, IQR]  | dS [median, IQR] | dN [median, IQR] | p-value [median, IQR] |
|:-----:|:-----:|:----------------:|:----------------:|:----------------:|:----------------:|:---------------------:|
...
| 146   | 1     | 3.00 [3.00-3.00] | 0.00 [0.00-0.00] | 3.00 [3.00-3.00] | 0.00 [0.00-0.00] | 0.04 [0.04-0.04]      |
| 154   | 1     | 0.00 [0.00-0.00] | 8.00 [8.00-8.00] | 0.00 [0.00-0.00] | 4.00 [4.00-4.00] | 0.04 [0.04-0.04]      |
| 177   | 1     | 3.00 [3.00-4.00] | 0.00 [0.00-0.00] | 4.04 [4.04-5.38] | 0.00 [0.00-0.00] | 0.02 [0.01-0.02]      |
| 208   | 1     | 0.00 [0.00-0.00] | 6.00 [6.00-6.00] | 0.00 [0.00-0.00] | 2.99 [2.99-2.99] | 0.09 [0.09-0.09]      |

 ...

Note

SLAC reports several key quantities for codons with significant P-values for positive or negative selection to the screen.

S: The number of synonymous substitutions inferred at this site.
N: The number of non-synonymous substitutions inferred at this site.
dS: Estimated site-specific synonymous rate.
dN: Estimated site-specific non-synonymous rate.
Selection detected?: Selection classification (positive or negative) and the corresponding P-value for the binomial test.

If the user elected to perform ancestral resampling, another table is reported, showing how much these quantities are affected by ancestral state reconstruction uncertainty. For example, at codon 177, some ancestral reconstructions yielded 3 synonymous substitutions, whereas others yielded 4; however, this was not sufficient to move the P-value across the significance threshold.

24.

Listing 6 Partial MEME screen output

...
| Codon | Partition | alpha | beta+  | p+    | LRT   | Episodic selection detected? | # branches |
|:-----:|:---------:|:-----:|:------:|:-----:|:-----:|:----------------------------:|:----------:|
| 64    | 1         | 0.000 | 14.717 | 0.204 | 3.512 | Yes, p = 0.0816              | 5          |
| 154   | 1         | 0.000 | 35.302 | 0.145 | 5.334 | Yes, p = 0.0317              | 8          |
| 171   | 1         | 0.000 | 45.005 | 0.017 | 5.753 | Yes, p = 0.0256              | 1          |
| 208   | 1         | 0.000 | 59.749 | 0.089 | 5.554 | Yes, p = 0.0283              | 6          |
| 242   | 1         | 1.839 | 34.114 | 0.216 | 4.273 | Yes, p = 0.0549              | 7          |
| 402   | 1         | 0.000 | 10.476 | 0.091 | 3.493 | Yes, p = 0.0824              | 2          |

### ** Found _6_ sites under episodic diversifying positive selection at p <= 0.1**

Note

MEME prints information only about codons subject to positive selection, since MEME does not directly test for negative selection.

alpha: Site-specific synonymous substitution rate.
beta+: Site-specific non-synonymous substitution rate for the positive selection category.
p+: Site-specific weight (∼ proportion of branches) assigned to the positive selection category.
LRT: Site-specific likelihood ratio test statistic for episodic diversifying selection (beta+ > alpha and p+ > 0).
Episodic selection detected?: Selection classification (yes) and the corresponding P-value.
# branches: An exploratory estimate of the number of individual branches which have sufficient empirical Bayes support for positive selection; since MEME pools signal from multiple branches, there may be overall evidence for selection without necessarily implicating any individual branches.

25.

FUBAR: To run FUBAR, launch HyPhy from the command line, and enter options 1 (Selection Analyses) and then 4 to reach the FUBAR analysis menu, and supply values for the following prompts (see footnote 5):

Choose genetic code. Enter 1 to use the Universal genetic code.
Select a coding sequence alignment file. Provide the full path to the dataset of interest: /path/to/data/h3_trunk.fna.
A tree was found in the data file…Would you like to use it (y/n)? Enter “y” to use the tree.
Number of grid points per dimension. This option controls the resolution of the FUBAR analysis by setting how many discrete dN and dS values can be inferred, along an N × N grid. We will use the default value of 20 (leading to a 20 × 20 grid of dN and dS values). FUBAR will now pre-compute likelihoods for each value in the grid.
Number of MCMC chains to run. This option determines the number of Markov Chain Monte Carlo chains to run during Bayesian inference of evolutionary rates. Enter the default value of 5 to run 5 chains.
The length of each chain. This option controls for how long each MCMC chain should be run. Enter the default value of 2000000 to run each chain for two million generations (thus obtaining two million samples).
Use this many samples as burn-in. This option determines how many initial samples drawn from the MCMC chain should be discarded as burn-in, as is standard in Bayesian analyses. Enter the default value of 1000000, leaving one million draws per chain.
How many samples should be drawn from each chain. This option determines the final number of samples to draw from the one million post-burn-in draws per chain. Enter the default value of 100.
The concentration parameter of the Dirichlet prior. This option controls the shape of the Dirichlet prior distribution. Enter the default value of 0.5.

26.

Listing 7 Partial FUBAR screen output

...
### Tabulating site-level results
| Codon | Partition | alpha | beta   | N.eff    | Posterior prob for positive selection |
|:-----:|:---------:|:-----:|:------:|:--------:|:-------------------------------------:|
| 61    | 1         | 0.753 | 4.365  | 64.549   | Pos. posterior = 0.9262               |
| 64    | 1         | 0.753 | 3.920  | 77.106   | Pos. posterior = 0.9095               |
| 69    | 1         | 0.730 | 4.447  | 64.182   | Pos. posterior = 0.9325               |
| 154   | 1         | 0.637 | 6.595  | 53.312   | Pos. posterior = 0.9826               |
| 208   | 1         | 0.622 | 5.908  | 55.794   | Pos. posterior = 0.9731               |
| 242   | 1         | 2.215 | 12.055 | 1489.879 | Pos. posterior = 0.9131               |
----
## FUBAR inferred 6 sites subject to diversifying positive selection at posterior probability >= 0.9
Of these, 0.36 are expected to be false positives (95% confidence interval of 0-2)

Note

Like other site analyses, FUBAR will print a number of inferences about each individual site detected to be under pervasive positive selection.

alpha: The posterior estimate of the synonymous substitution rate at a site.
beta: The posterior estimate of the non-synonymous substitution rate at a site.
N.eff: An estimate of the effective sample size for inferring positive selection at this site; smaller values (e.g., < 20) imply that the MCMC procedure may have failed to sample the parameter space well, and longer chains (or more chains) might be warranted.
Posterior prob for positive selection: The estimated posterior probability for pervasive diversifying selection (dN∕dS > 1).
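The expected number of false positives printed at the end of the FUBAR output can be related to the per-site posterior probabilities with a short calculation. The sketch below assumes that this quantity is the sum of (1 − posterior) over the reported sites; this assumption reproduces the value of 0.36 for the sites in Listing 7, but HyPhy's internal calculation (including the reported interval) may differ.

```python
def expected_false_positives(posteriors):
    """Sum of (1 - posterior) over sites called positively selected."""
    return sum(1.0 - p for p in posteriors)

# Posterior probabilities copied from Listing 7, keyed by codon.
fubar_sites = {61: 0.9262, 64: 0.9095, 69: 0.9325, 154: 0.9826, 208: 0.9731, 242: 0.9131}

threshold = 0.9
called = {codon: p for codon, p in fubar_sites.items() if p >= threshold}
print(f"{len(called)} sites at posterior >= {threshold}")
print(f"expected false positives: {expected_false_positives(called.values()):.2f}")
```

Raising the posterior threshold reduces the expected number of false positives at the cost of reporting fewer sites; for example, only codons 154 and 208 would remain at a threshold of 0.95.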

27.

Note

Interpreting Results Sites identified as positively selected by each method, across all three datasets, are given in Table 1. In general, we expect MEME to be the most comprehensive and robust of all site-level methods because it uniquely considers both pervasive and episodic selection [24]. In addition, power studies have shown that FUBAR is expected to outperform FEL and SLAC under most circumstances [25]. Finally, we expect that SLAC will be the least robust method due to its reliance on a relatively naive counting-based approach [12].

These expectations are generally borne out in the results obtained here in our brief study of H3 selection. For the full H3 dataset of 2555 sequences, MEME identified 16 sites, and FEL identified 15 sites under positive selection. All sites were identical except for the following: MEME uniquely identified sites 151 and 208, and FEL uniquely identified site 237. Interestingly, site 208 was additionally identified as positively selected by all methods on the trunk H3 dataset. Combined, these results demonstrate MEME’s ability to identify sites subject to both pervasive and episodic selection, as site 208 appears to be under pervasive selection only along the H3 trunk. Because FEL uses a less stringent test statistic distribution (𝜒₁²) to call significance, occasionally sites subject to pervasive selection near the significance thresholds may be detected by FEL but missed by MEME (e.g., site 237, with FEL reporting P = 0.08 and MEME reporting P = 0.105).

FUBAR identified two fewer selected sites in the full H3 alignment compared to FEL (which is a directly comparable test), missing sites 19 (posterior 0.83), 277 (posterior 0.59), and 292 (posterior 0.89) relative to FEL, but adding site 160 (FEL P = 0.8).

In addition to differences across methods, we expect to see some important differences for sites inferred across the full, shallow, and trunk H3 datasets. Because the trunk and full H3 datasets span similar time frames, we expect sites returned for these two datasets to have the most overlap. In addition, sites found to be under selection in the shallow lineage may not be detected across the full H3 phylogeny, as selection may have been fleeting, weak, or constrained to the specific shallow clade examined here. For example, site 49 was specifically selected in the shallow H3 lineage alone, as indicated by three of the four methods. In contrast, sites 19 and 241 were found to be selected in both the shallow and the full H3 datasets, but this signal was not apparent when the trunk lineage was examined independently, perhaps because these sites experience only transient changes that do not propagate along the trunk.

What are some potential reasons for seeing discrepancies in inferences across H3 datasets? Site 154, for example, is positively selected in both the full H3 phylogeny and the trunk H3 lineage, but not the shallow H3 lineage. This result suggests that site 154 may have experienced pervasive selection throughout H3 evolution, but its signal in the shallow clade alone was either too weak to detect or selection was attenuated in the shallow clade. In addition, sites which appeared only in the shallow clade analyses may have experienced lineage-specific selection where the signal was too weak to detect when the entire phylogeny was considered.

Furthermore, while MEME, FEL, and FUBAR were able to detect selected sites in the shallow H3 lineage, SLAC did not identify any such sites. This is because SLAC requires a large number of substitutions, which are unlikely to have occurred in the shallow sample, to achieve significance. Overall, we emphasize that in many cases different site-level methods will not identify exactly the same set of sites under selection, although, as the H3 example shows, the agreement between methods is typically good.

Rules of Thumb for Site-Level Detection of Selection

Small datasets, i.e., ≤ 10 sequences (especially when coupled with low divergence), are unlikely to yield any sites under selection. Consider using gene-wide methods like BUSTED or aBSREL to look for selection in these cases.
On large datasets (e.g., > 500 sequences), all methods tend to give similar results (but see the MEME exception below), hence the default method of choice is FUBAR, since its run time is dramatically shorter than that of FEL or MEME, and its statistical performance is better than that of SLAC.
MEME tends to be the most sensitive method, because it is the only one designed to detect episodic selection. Indeed, SLAC, FEL, and FUBAR may all call a site subject to episodic positive selection negatively selected, if a burst of selection is followed by strong conservation. MEME is often able to tease the two processes apart and correctly call such sites positively selected. Hence, MEME should be the preferred method, unless it is computationally prohibitive.
We cannot universally recommend running all the available methods on a given dataset and then aggregating the results, as done in Table 1, for several reasons. Firstly, while it may be tempting to use agreement between all methods as a hedge against false positives, i.e., calling a site selected only if all the methods agree on it, doing so reduces the power of the analysis to that of the least sensitive method. Secondly, while comparing the sites on which methods disagree can potentially reveal critical information (e.g., a site detected by MEME but not FUBAR may be under strong episodic selection), considerable effort and diligence must be put into disentangling meaningful biological differences from statistical artifacts. Thirdly, the statistical strategy must be decided before the analysis commences by determining which is more important to optimize: specificity (reducing false positives) or sensitivity (reducing false negatives). For example, if little is known about a gene, it may be advisable to generate the most inclusive list of sites that could be subject to selection for subsequent testing using other approaches; in this case, the most sensitive method or the union of all methods may be appropriate.
We strongly recommend against performing multiple testing or false discovery rate correction on individual site results. Firstly, methods are calibrated to not generate excessive false positives on strictly neutral data. In most genes, most sites will be under relatively strong negative selection, making the statistical testing procedure conservative. Secondly, multiple testing corrections will nearly always yield no significant results on small to moderate sized datasets. Thirdly, some key assumptions of methods for correcting false discovery rates are not applicable to site-level testing. For example, a typical collection of results from site-level testing will contain very few, if any, sites whose P-values reflect true neutrality (dN∕dS = 1).
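To make the comparison of discordant calls concrete, the following sketch tabulates which methods called each site. It is restricted to the sites that the text above names as discordant on the full H3 dataset; the sites shared by all methods are listed in Table 1 and are deliberately omitted here. The inclusion of site 237 for FUBAR is inferred from the statement that FUBAR differs from FEL only at sites 19, 277, 292, and 160, and should be treated as an assumption. The function name is hypothetical.

```python
from collections import defaultdict

# Discordant calls on the full H3 dataset, taken from the discussion above.
# Sites on which all methods agree are intentionally not included.
DISCORDANT_CALLS = {
    "MEME": {19, 151, 208, 277, 292},
    "FEL": {19, 237, 277, 292},
    "FUBAR": {160, 237},  # 237 inferred, see lead-in
}


def calls_by_site(method_sites):
    """Map each site to the alphabetically sorted list of methods calling it."""
    table = defaultdict(list)
    for method in sorted(method_sites):
        for site in method_sites[method]:
            table[site].append(method)
    return dict(table)


if __name__ == "__main__":
    table = calls_by_site(DISCORDANT_CALLS)
    for site in sorted(table):
        print(site, ", ".join(table[site]))
    # The union is the most inclusive (most sensitive) candidate list.
    union = set().union(*DISCORDANT_CALLS.values())
    print("Union of discordant sites:", sorted(union))
```

A table of this kind is only a starting point: as noted above, deciding whether a discordant site reflects biology or a statistical artifact still requires inspecting the per-site estimates.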

3.6 Screening Sequences for Recombination

28.

Note

A critical aspect of sequence analysis we have not yet covered is the detection of and correction for intragenic recombination in an alignment of homologous sequences. Because recombination is such a key biological process in many viral pathogens, we strongly advocate screening an alignment for recombination before proceeding with additional analyses, unless there is a sound biological reason to discount it (e.g., intragenic recombination in Influenza A is negligibly rare). Indeed, because recombination causes different regions of an alignment to be related by different phylogenies, its presence can heavily influence selection detection and other downstream applications.

There are many computational approaches to finding evidence of recombination in a sequence alignment [32]; at their core, however, many such methods look for evidence of phylogenetic incongruence. Here, we demonstrate one such method, GARD (genetic algorithms for recombination detection), which we have found to perform very well among a wide range of approaches on simulated data [14]. Note that at this time, GARD will not produce a JSON file as output but instead several text files containing inference information, as well as a final partitioned alignment for downstream use if recombination was detected.

3.7 GARD

29.

Note

What Biological Question Is the Method Designed to Answer? Have sequences in the given alignment undergone recombination, and if so what are the recombination breakpoints and segment-specific phylogenies?

Recommended Applications: GARD is geared towards mapping the breakpoints and detecting segments of the alignment which can be adequately described by a single tree topology. Therefore, alignments, particularly alignments of viral sequences, should be screened for the presence of recombination before performing any selection inference. The NEXUS output from GARD can be directly used as input for most downstream selection detection analyses.

30.

Note

Statistical Test Procedure GARD employs a genetic algorithm to find a solution to a complex optimization problem by mimicking processes of biological evolution (mutation, recombination, and selection) in a population of competing solutions. In this application of genetic algorithms, we are evolving a population of “chromosomes” that specify different numbers and locations of recombination breakpoints in the alignment with the objective of detecting topological incongruence, i.e., support for different phylogenies by separate regions of the alignment. The “fitness” of each chromosome is determined by using maximum likelihood methods to evaluate a separate phylogeny for each non-recombinant fragment defined by the breakpoints (e.g., to the left and to the right of a breakpoint in Fig. 4), and computing a goodness of fit (AICc ) for each such model. The genetic algorithm searches for the number and placement of breakpoints yielding the best AICc and also reports confidence values for inferred breakpoint locations based on the contribution of each considered model weighted by how well the model fit the data. For computational expedience, the current implementation of GARD infers topologies for each segment using neighbor joining [37] based on the TN93 pairwise distance estimator [41] and then fits a user-specified nucleotide evolutionary model using maximum likelihood to obtain AICc scores.
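The goodness-of-fit score and the model weighting mentioned above can be sketched as follows. This is not GARD's implementation: it uses the textbook small-sample corrected AIC and standard Akaike weights, and both the function names and the choice of what counts as the sample size are assumptions. The scores in the usage example are the c-AIC values of the best 0- to 3-breakpoint models printed in Listing 8; GARD itself weights all models it has examined, not just these four.

```python
import math


def aicc(log_likelihood, n_params, sample_size):
    """Textbook small-sample AIC: 2k - 2 ln L + 2k(k + 1) / (n - k - 1)."""
    if sample_size - n_params - 1 <= 0:
        raise ValueError("sample_size must exceed n_params + 1")
    aic = 2.0 * n_params - 2.0 * log_likelihood
    return aic + 2.0 * n_params * (n_params + 1) / (sample_size - n_params - 1)


def akaike_weights(scores):
    """Convert information-criterion scores into normalized model weights."""
    best = min(scores)
    relative = [math.exp(-0.5 * (score - best)) for score in scores]
    total = sum(relative)
    return [value / total for value in relative]


if __name__ == "__main__":
    # Best c-AIC for 0, 1, 2, and 3 breakpoints (final generation, Listing 8).
    scores = [11914.85, 11804.56, 11783.92, 11778.94]
    for n_breakpoints, weight in enumerate(akaike_weights(scores)):
        print(n_breakpoints, "breakpoints: weight = %.4f" % weight)
```

Lower scores are better, so a model that improves the score by only a few points retains non-negligible weight, while one that is worse by a hundred points contributes essentially nothing.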

Fig. 4 Phylogenetic incongruence caused by the presence of a recombinant sequence in an alignment. Sequence R is a product of homologous recombination between sequences A and B. Phylogenies reconstructed from sequences A, B, R and an outgroup sequence (O) will differ based on which part of the alignment is being considered. To the left of the breakpoint, R clusters with A, whereas to the right of the breakpoint R clusters with B

31.

Note

Example Analysis 1 We will demonstrate the use of GARD, as well as its benefits for downstream analysis, using a dataset consisting of 13 glycoprotein sequences from Cache Valley Fever virus (cvf.fna). We will first use GARD to detect recombination in this dataset, and then we will process both the GARD-informed data and the original alignment (with no recombination assumed) with FEL to see how the presence of recombination may confound selection inference.

Importantly, GARD specifically requires the use of HyPhy’s MPI-enabled executable, HYPHYMPI. To run GARD from the command line, you will need an operating system with MPI headers and libraries installed so that this executable can be compiled. Here, we will describe how to use GARD from the command line, but we emphasize that GARD is fully implemented and available on www.datamonkey.org and takes the same input options described here.

To run GARD, open a terminal session and start HYPHYMPI in the appropriate MPI environment (e.g., MPIRUN in OpenMPI) from the command line to launch the HyPhy analysis menu. Enter 12 (Recombination) and then 1 to reach the GARD analysis menu, and supply values for the following prompts:

Nucleotide file to screen: Provide the full path to the dataset of interest: /path/to/data/cvf.fna.
Please enter a 6-character model designation (e.g., 010010 defines HKY85). This option controls which nucleotide substitution model is to be used for analysis, using PAUP notational shorthand. The six-character shorthand allows the user to specify the entire spectrum from F81 (000000) to GTR (012345), which we recommend as the default option. Provide the value 012345 for this prompt.
Rate variation options. This option determines how site-to-site rate variation should be modeled. The option None will discount site-to-site rate variation, allowing the analysis to run several times faster than other options but also creating the risk of mistaking rate heterogeneity for recombination. As such, we can only recommend this option for extremely small alignments (i.e., 3–5 sequences). The option General Discrete (the default) models rate variation using an N-bin general discrete distribution, and option Beta-Gamma models rate variation using an adaptively discretized distribution, a more flexible version of the standard Gamma+4 model. Enter option 2 to select the General Discrete model.
How many distribution bins [2–32]? If rate variation was selected in the previous step, this option allows the user to decide how many different rate classes should be included in the model. We recommend using 3 rate classes by default, as both General Discrete and Beta-Gamma distributions are flexible enough to reliably capture rate variability in the majority of alignments with only a few rate classes. Therefore, enter the value 3.
Save results to. For this option, provide a full path to the output file to which you would like GARD to write results. The supplied file name will ultimately contain an HTML-formatted summary of the analysis. HyPhy will generate several other files with names obtained by appending suffixes (as in _suffix) to the main result file. In particular, the _finalout file stores the original alignment in NEXUS format with inferred non-recombinant sections of the alignment saved in the ASSUMPTIONS block and trees inferred for each partition in the TREES block. This NEXUS file can be input into many recombination-aware analyses in HyPhy and other programs that can read NEXUS. The _ga_details file contains two lines of information about each model examined by the genetic algorithm: its AICc score and the location of breakpoints in the model. Finally, the _ga_splits file stores information about the location of breakpoints and trees inferred for each alignment region under the best model found by the GA.
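The six-character model designation requested in the second prompt can be decoded with a short helper. The sketch below assumes the conventional PAUP ordering of the six substitution types (AC, AG, AT, CG, CT, GT), in which positions sharing a digit share a rate; under that assumption 010010 groups the two transitions (AG and CT) separately from the four transversions, as expected for HKY85. The function names are hypothetical and are not part of HyPhy.

```python
# Assumed position-to-substitution mapping of the PAUP-style shorthand.
SUBSTITUTION_PAIRS = ("AC", "AG", "AT", "CG", "CT", "GT")


def rate_classes(designation):
    """Group the six substitution types by the digit assigned to each."""
    if len(designation) != 6 or not designation.isdigit():
        raise ValueError("expected six digits, e.g. 012345")
    classes = {}
    for pair, label in zip(SUBSTITUTION_PAIRS, designation):
        classes.setdefault(label, []).append(pair)
    return classes


def free_rate_parameters(designation):
    """Number of relative rates to estimate (one class serves as reference)."""
    return len(rate_classes(designation)) - 1


if __name__ == "__main__":
    for name, code in (("F81", "000000"), ("HKY85", "010010"), ("GTR", "012345")):
        print(name, code, rate_classes(code), free_rate_parameters(code))
```

Seen this way, the recommended 012345 setting simply gives every substitution type its own rate, which is the most general option the shorthand can express.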

GARD will now run to completion, printing status indicators to screen while it runs.

32.

Listing 8 Partial GARD output

Fitting a baseline nucleotide model...
Done with single partition analysis. Log(L) = -5921.9511901113, c-AIC = 11914.85153276497
Starting the GA ...

GENERATION 2 with 1 breakpoints (~0% converged)
Breakpoints     c-AIC   Delta c-AIC [BP      1]
           0 11914.85
           1 11804.56      110.291        1393
GA has considered         92/         328 (92 over all runs) unique models
Total run time           0 hrs 0 mins 2 seconds
Throughput                  46.00 models/second
Allocated time remaining 999 hrs 59 mins 58 seconds (approx. 165599908 more models.)
...
GENERATION 52 with 4 breakpoints (~100% converged)
Breakpoints     c-AIC   Delta c-AIC [BP      1] [BP      2] [BP      3] [BP      4]
           0 11914.85
           1 11804.56      110.291        1445
           2 11783.92       20.638          617       1490
           3 11778.94        4.978          587         962       1475
           4 11778.94        0.000          587         962       1475
GA has considered         268/    473490550 (1356 over all runs) unique models
Total run time            0 hrs 4  mins 2 seconds
Throughput                 5.60 models/second
Allocated time remaining 999 hrs 55 mins 58 seconds (approx. 20170544.82644628 more models.)
Performing the final optimization...

33.

Note

Interpreting Results GARD found evidence of recombination in this dataset with three breakpoints, yielding a 135.9 point AICc improvement over the model without recombination. Among all models with three breakpoints in the Cache Valley Virus glycoprotein alignment, the best model places them at nucleotides 587, 962, and 1475. Importantly, if GARD had reported that the best model had 0 breakpoints, we could conclude that no evidence of recombination had been found. Note that because genetic algorithms are stochastic, there is no guarantee that replicate runs will converge to exactly the same quantitative results. When there is a strong signal of recombination breakpoints in the data, however, the qualitative results (number and general location of breakpoints) should be fairly robust.
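The numbers quoted above can be recovered directly from the final table in Listing 8. The sketch below parses the rows of that table, pasted in as a string, and is only an illustration: the parser and its function names are hypothetical, and it assumes the column layout shown in the listing (number of breakpoints, c-AIC, delta c-AIC, then breakpoint positions).

```python
# Rows of the final-generation table from Listing 8.
FINAL_TABLE = """\
           0 11914.85
           1 11804.56      110.291        1445
           2 11783.92       20.638          617       1490
           3 11778.94        4.978          587         962       1475
           4 11778.94        0.000          587         962       1475
"""


def parse_breakpoint_table(text):
    """Return (n_breakpoints, c_aic, breakpoint_positions) for each data row."""
    models = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            n_breakpoints = int(fields[0])
            score = float(fields[1])
        except ValueError:
            continue  # header or status line, not a table row
        positions = [int(field) for field in fields[3:]]
        models.append((n_breakpoints, score, positions))
    return models


def best_model(models):
    """Lowest c-AIC wins; ties go to the smaller number of breakpoints."""
    return min(models, key=lambda model: (model[1], model[0]))


if __name__ == "__main__":
    models = parse_breakpoint_table(FINAL_TABLE)
    best = best_model(models)
    improvement = models[0][1] - best[1]
    print("Best model: %d breakpoints at %s" % (best[0], best[2]))
    print("c-AIC improvement over no recombination: %.1f" % improvement)
```

Breaking ties in favor of fewer breakpoints mirrors the listing, where allowing a fourth breakpoint produced no further improvement in c-AIC.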

34.

Note

Example Analysis 2 The NEXUS file that GARD produced is a partitioned dataset, wherein different groups of sites are described by different trees. Most HyPhy selection analyses discussed here (see footnote 6), including MEME, FUBAR, FEL, SLAC, and BUSTED, are able to analyze partitioned data. To demonstrate the importance of screening for recombination, we will now compare results for a FEL analysis performed on the original alignment of 13 Cache Valley Virus glycoproteins, as well as on the GARD-inferred partitioned alignment. All steps here were carried out as described earlier in this chapter.

35.

Note

Interpreting Results FEL inference on the GARD-processed partitioned Cache Valley Virus data does not detect sites under selection at P ≤ 0.1. By contrast, FEL inference on the unpartitioned Cache Valley Virus data (i.e., not pre-screened for recombination) detects three positively selected sites at P ≤ 0.1 (212, 516, and 558 at P = 0.08, P = 0.03, and P = 0.09, respectively). From these results, we can clearly tell that not screening for recombination has the potential for adverse consequences, including an increased false positive rate, as seen here. As such, we strongly encourage users to screen alignments for recombination, if such processes are suspected, before proceeding to selection detection.

3.8 Accounting for Synonymous Rate Variation

36.

Note

A critical genomic process that one must consider when detecting selection is the phenomenon of synonymous rate variation, wherein the rate of synonymous codon evolution (represented by dS in the context of codon models and representing mutation rate) varies across species, genes, and even intragenic positions. In particular, intragenic synonymous rate variation has been identified across domains of life [11, 20, 30] and can arise from a variety of evolutionary processes, including selection on mRNA secondary structure [2], gene expression [4], GC-biased gene conversion [10], and other neutral mutation processes. For example, even the genomic context of a given nucleotide can influence its mutation rate; indeed, experimental work has shown that GC-neighboring sites can feature up to a 75-fold increase in mutation rate [20, 38]. In addition, the synonymous rate at certain sites may be elevated due to the mutational vulnerability of the non-template DNA strand during transcription [20]. These processes must be accounted for in order to ensure an appropriate baseline dS is used when testing for selection.

We demonstrate the importance of considering synonymous rate variation for selection inference using a dataset of 10 mammalian CD2 genes, which code for a specific T-cell surface adhesion molecule [21]. We use FEL to detect selection in this dataset under two specifications: with synonymous rate variation (“yes” in prompt 4 in the FEL analysis menu), and without synonymous rate variation (“no” in prompt 4 in the FEL analysis menu).

37.

Note

Interpreting Results At P ≤ 0.1, analysis of CD2 with synonymous rate variation revealed a total of 14 sites under positive selection. By contrast, CD2 analysis with FEL without dS variation only detected four sites under positive selection (Fig. 5). Similarly, analysis with dS variation revealed 27 sites under purifying selection, but analysis without dS variation revealed only 15 sites under purifying selection. Most importantly, all sites detected when dS was fixed to 1 were a subset of the sites identified by the model with dS variation (Fig. 5). Together, these results demonstrate that ignoring dS variation can induce both an increased false negative rate regarding positive selection detection and an overall decrease in power to detect any selective regime. We acknowledge that it is possible that the opposite conclusion might be true, namely, that additional sites identified by FEL with dS variation might instead be false positives. However, in our experience, this is much less frequently the case [12].

Fig. 5 Sites identified as positively (red) and negatively (blue) selected in CD2 at P ≤ 0.1 by FEL run with (above the line) and without dS variation (below the line). Sites with arrows represent those identified as selected by FEL with dS variation that were not identified by FEL when dS variation was ignored

4 Tips

38.

Note

Here we provide some helpful notes on HyPhy usage.

An actively maintained board for usage questions and filing bug reports is available at https://github.com/veg/hyphy/issues.
Each HyPhy analysis described here will export a JSON file. This file can either be uploaded to HyPhy-Vision for visual examination, or it can be easily parsed in a standard scripting language using standard packages, for example, the json package in Python or the jsonlite package in R. All fields used in these output files are defined at http://hyphy.org.
Mac OS(X) users may need to install a new set of compilers (i.e., gcc-6) that are compatible with OpenMP in order to have full functionality from the HYPHYMP executable, as is described on the HyPhy website.
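As a minimal sketch of the JSON parsing mentioned above, the following Python snippet loads a result file with the standard json package and summarizes its top-level fields. No HyPhy field names are assumed; the file path in the usage example is a placeholder, and the helper names are hypothetical. Consult the field definitions on the HyPhy website before relying on any particular key.

```python
import json


def load_results(path):
    """Read a HyPhy JSON result file into a Python dictionary."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_top_level(results):
    """Map each top-level key to (type name, length or None)."""
    summary = {}
    for key, value in results.items():
        size = len(value) if isinstance(value, (dict, list, str)) else None
        summary[key] = (type(value).__name__, size)
    return summary


if __name__ == "__main__":
    import sys

    # Usage: python summarize_json.py /path/to/analysis_output.json
    if len(sys.argv) == 2:
        for key, (kind, size) in summarize_top_level(load_results(sys.argv[1])).items():
            print(key, kind, size)
```

Listing the top-level keys first is a cheap way to orient oneself in an unfamiliar output file before writing analysis-specific extraction code.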

5 Exercises

39.

Earlier, we performed a BUSTED analysis without designating a specific subset of test lineages. For this exercise, we will analyze the HIV-1 transmission dataset with BUSTED in two different ways: testing all branches, and testing only recipient-derived HIV-1 sequences. The input data for this exercise, with an appropriately labeled phylogeny, is available in exercises/hiv1_transmission_exercise1.fna. When prompted for the test lineages, select either All or the branches labeled test.

Is there evidence (compare model fits using the small sample AIC) that test branches have a different selective regime than the rest of the tree?
The entire dataset should provide evidence for episodic diversification, but the recipient only analysis should return a negative result. What does this mean biologically, i.e., where does the selection signal come from?

40.

Investigate the effect of recombination on site-specific inference of episodic selection using MEME. Run MEME on exercises/cvf.fna (single partition data, i.e., assuming no recombination), and then on the same dataset screened for recombination using GARD (exercises/cvf_gard.nexus), testing for selection on all branches, with P = 0.1. Compare the list of sites detected to be under selection by the two analyses.

Which analysis generated more positive results?
Do you think these results are true or false positives? How does this compare to the FEL analysis we described in the text?
Compare site-wise estimates of substitution rates (e.g., α ) between the two analyses. Is there a discernible bias introduced by not accounting for recombination?

41.

When analyzing intraspecies or intrahost data, dN∕dS estimates may be inflated because not all observed sequence variation is due to substitutions; some variants are simply mutations that have not yet been filtered by selection [17, 23, 31, 35]. In other words, dN∕dS may be elevated by intraspecies/intrahost polymorphism that should not necessarily be attributed to positive selection. One simple approach to mitigating this undesirable effect is to restrict site-specific analyses to Internal branches only. Internal branches are less likely to contain spurious polymorphic variants because they encompass at least one process on which selection can act (i.e., transmission and/or multiple rounds of replication). Apply MEME and FEL to an intrahost sample of HIV-1 sequences, found in exercises/JS1774.nex, from an infected individual analyzed in Lorenzo-Redondo et al. [19], first choosing to test All branches, and next choosing Internal branches.

42.

Compare the lists of selected sites between All/Internal analyses. How different are they?

43.

Use RELAX to formally test whether or not selective regimes ( dN ∕ dS distributions) are different between terminal and internal branches in exercises/JS1774.nex .
