Protein Sequence Analysis Using the MPI Bioinformatics Toolkit

Martin Steinegger, Felix Gabler, Seung-Zin Nam, Sebastian Till, Milot Mirdita, Johannes Söding, Andrei N. Lupas, Vikram Alva

Published: 2020-12-14 DOI: 10.1002/cpbi.108

CLANS

cluster analysis

HHpred

HMM

homology

profile hidden Markov models

sequence comparison

sequence similarity searches

structure prediction

Abstract

The MPI Bioinformatics Toolkit (https://toolkit.tuebingen.mpg.de) provides interactive access to a wide range of the best-performing bioinformatics tools and databases, including the state-of-the-art protein sequence comparison methods HHblits and HHpred. The Toolkit currently includes 35 external and in-house tools, covering functionalities such as sequence similarity searching, prediction of sequence features, and sequence classification. Due to this breadth of functionality, the tight interconnection of its constituent tools, and its ease of use, the Toolkit has become an important resource for biomedical research and for teaching protein sequence analysis to students in the life sciences. In this article, we provide detailed information on utilizing the three most widely accessed tools within the Toolkit: HHpred for the detection of homologs, HHpred in conjunction with MODELLER for structure prediction and homology modeling, and CLANS for the visualization of relationships in large sequence datasets. © 2020 The Authors.

Basic Protocol 1 : Sequence similarity searching using HHpred

Alternate Protocol : Pairwise sequence comparison using HHpred

Support Protocol : Building a custom multiple sequence alignment using PSI-BLAST and forwarding it as input to HHpred

Basic Protocol 2 : Calculation of homology models using HHpred and MODELLER

Basic Protocol 3 : Cluster analysis using CLANS

INTRODUCTION

The structure, function, and evolution of new or uncharacterized proteins are routinely inferred based on their homology to proteins with experimentally characterized properties. Sequence searches are a common first step in this process, as sequence similarity is widely accepted as the best marker for substantiating homologous relationships. Over the years, many high-quality sequence search methods [e.g., BLAST (Altschul et al., 1997; Ladunga, 2017), HMMER (Potter et al., 2018; Prakash, Jeffryes, Bateman, & Finn, 2017), HHblits (Remmert, Biegert, Hauser, & Soding, 2011), HHpred (Soding, 2005; Steinegger et al., 2019)]; protein sequence and domain databases [SCOPe (Fox, Brenner, & Chandonia, 2014), ECOD (Cheng et al., 2014; Schaeffer, Liao, & Grishin, 2018), Pfam (Coggill, Finn, & Bateman, 2008; El-Gebali et al., 2019), RefSeq (O'Leary et al., 2016), UniProt (Pundir, Martin, O'Donovan, & The UniProt Consortium, 2016; The UniProt Consortium, 2019)]; and integrative Web resources [the EMBL-EBI Bioinformatics Web Services (Madeira et al., 2019; Madeira, Madhusoodanan, Lee, Tivey, & Lopez, 2019), the SIB Bioinformatics Resource Portal (Swiss Institute of Bioinformatics Members, 2016), National Center for Biotechnology Information Web Resources (NCBI Resource Coordinators, 2018; Gibney & Baxevanis, 2011; Yang, Derbyshire, Yamashita, & Marchler-Bauer, 2020)] have been developed to help researchers make meaningful inferences based on homology. Driven by our work at the interface of computational and experimental biology, we launched the MPI Bioinformatics Toolkit in 2005 to provide researchers in the life sciences with easy, web-based access to the best-performing bioinformatics tools and databases (Biegert, Mayer, Remmert, Soding, & Lupas, 2006). 
The Toolkit has been in continuous operation ever since, and we replaced the first version with an entirely new one built using more scalable and robust web technologies in 2017 (Alva, Nam, Soding, & Lupas, 2016; Zimmermann et al., 2018). The Toolkit currently includes 35 in-house and external tools for sequence similarity searching [e.g., PSI-BLAST (Altschul et al., 1997), HHblits, HHpred]; calculation of multiple sequence alignments [ClustalΩ (Sievers et al., 2011), T-Coffee (Notredame, Higgins, & Heringa, 2000)]; prediction of secondary structure and sequence features [Quick2D, PCOILS (Gruber, Soding, & Lupas, 2006), TPRpred (Karpenahalli, Lupas, & Soding, 2007)]; and sequence classification [CLANS (Frickey & Lupas, 2004), MMseqs2 (Mirdita, Steinegger, & Soding, 2019)].

Over the years, the Toolkit has established itself as an important resource for molecular biology research, mainly due to the sensitive sequence-comparison tools HHblits and HHpred, which, in many instances, can detect homologous relationships that are not readily recognized by other tools. A further strength of the Toolkit lies in the tight interconnection of the tools, allowing the results of one tool to be forwarded as input to others; for instance, the output of a PSI-BLAST search could be forwarded to ClustalΩ to obtain a multiple sequence alignment (MSA) of the identified matches or to MMseqs2 to obtain a reduced set filtered by pairwise sequence identity. Finally, our implementations of some external tools offer enhanced features, such as versions of the NCBI nonredundant (nr) database for PSI-BLAST that are clustered down to 30% (nr30), 50% (nr50), 70% (nr70), or 90% (nr90) sequence identity.

In this article, we describe detailed protocols for the application of the three most frequently used tools. Basic Protocol 1 describes how to use HHpred to search for remote homologs of a protein and make inferences about its domain composition, structure, function, and evolution. The Alternate Protocol describes the pairwise comparison mode of HHpred, which allows two protein sequences or MSAs to be compared with each other. The Support Protocol describes how to build a custom, high-quality MSA starting with a protein sequence and use it as input for HHpred. Basic Protocol 2 describes how to use HHpred in conjunction with MODELLER (Sali & Blundell, 1993) to build a three-dimensional (3D) structural model for a protein sequence of interest. Basic Protocol 3 describes the use of PSI-BLAST in conjunction with CLANS to detect distant homologs of a protein of interest and then visualize the relationships between the detected homologs. To demonstrate these protocols, we use as an example the experimentally uncharacterized FtsZ protein of the Asgard group archaeon Prometheoarchaeum syntrophicum strain MK-D1, which currently represents the closest cultured prokaryotic relative of eukaryotes (Imachi et al., 2020). In most bacteria, many archaea, all chloroplasts, and some mitochondria, with the latter two representing endosymbiosis-derived eukaryotic organelles, FtsZ forms filaments that assemble into a ring (Z-ring) at the future site of cell division (Lowe & Amos, 1998; Margolin, 2005; Szwedziak, Wang, Bharat, Tsim, & Lowe, 2014). Notably, eukaryotic tubulins, which polymerize to form microtubules, a major component of the cytoskeleton, are remotely homologous to FtsZ (Nogales, Downing, Amos, & Lowe, 1998). FtsZ and tubulins are GTPases that comprise an N-terminal GTP-binding domain with a highly conserved GGGTG(T/S)G motif associated with GTP binding and a C-terminal regulatory domain (Erickson, 1998). 
Strikingly, the pairwise sequence identity between FtsZ and tubulins is lower than 15%. Therefore, most sequence search methods fail to substantiate a homologous relationship between them. We note that the structure, function, and evolution of FtsZ and tubulins have been studied extensively, and that their evolutionary relatedness is also widely accepted (Erickson, 1998; Nogales et al., 1998). However, for instructional purposes, we will assume that the homology between them is unclear. In the following, we show how the Toolkit could be used to investigate the relationship between FtsZ and tubulins.

Basic Protocol 1: SEQUENCE SIMILARITY SEARCHING USING HHpred

An almost ubiquitous first step in the characterization of a protein is the identification of functionally and structurally characterized homologs using BLAST (Altschul et al., 1997) or HMMER (Potter et al., 2018). Frequently, however, these search methods fail to detect statistically significant connections to characterized proteins. In many such cases, the more sensitive sequence search method HHpred (Steinegger et al., 2019), which is based on the comparison of profile hidden Markov models (HMMs), is able to establish connections to remotely homologous, characterized proteins. Starting from a single protein sequence, HHpred builds a multiple sequence alignment using HHblits (Steinegger et al., 2019) or PSI-BLAST (Altschul et al., 1997) and annotates the obtained alignment with the predicted secondary structure using PSIPRED (Jones, 1999). Next, this annotated alignment is converted to a profile HMM and compared to each profile HMM in user-selected target databases, which represent proteins of known structure or annotated protein families. Such databases are, for example, the Pfam (El-Gebali et al., 2019), CDD (Lu et al., 2020), and SMART (Letunic & Bork, 2018) domain databases; the SCOPe (Fox et al., 2014) and ECOD (Cheng et al., 2014) structural classification databases; the Protein Data Bank (Berman et al., 2000); and proteomes of several model organisms. Database HMMs are built using three iterations of HHblits over UniRef30 (Mirdita et al., 2017), which is a version of the UniRef sequence database (Suzek et al., 2015) clustered into groups of similar sequences at a length coverage of at least 80% and a maximum pairwise sequence identity of 30%. Like query HMMs, database HMMs include secondary structure information, either predicted by PSIPRED or assigned based on 3D structure by DSSP (Joosten et al., 2011; Kabsch & Sander, 1983). The inclusion of secondary structure information significantly increases the sensitivity of HHpred. 
The output of HHpred is a list of the closest homologs, with pairwise alignments.

Necessary Resources

Hardware

A desktop computer, a laptop, or a tablet with Internet access

Software

An up-to-date, JavaScript-enabled Web browser (preferably Google Chrome, Mozilla Firefox, or Apple Safari)

Input files

A protein sequence (in FASTA format or as plain text) or a multiple sequence alignment (MSA) of protein sequences (in FASTA, STOCKHOLM, or CLUSTAL format)

Submission page of HHpred

1. Navigate your Web browser to the submission page of HHpred at https://toolkit.tuebingen.mpg.de/tools/hhpred.

Note

The submission page of HHpred is organized into two tabs, an ‘Input’ tab (Fig. 1A) and a ‘Parameters’ tab (Fig. 1B). The ‘Input’ tab contains a large text box for pasting the query protein sequence or MSA, and drop-down lists for choosing the target profile HMM database(s). It also includes options for pasting an example protein sequence (‘Paste Example’), uploading the input sequence as a file (‘Upload File’), and activating the pairwise comparison mode (‘Align two sequences/MSAs’). The ‘Parameters’ tab provides drop-down lists for customizing different input parameters (Fig. 1B). Options to access the help pages, toggle between windowed mode and full-screen mode, enter a custom job identifier, and submit a job are provided on both tabs.

Submission page of HHpred. The submission page of all tools within the Toolkit, including HHpred, contains two tabs: (A) ‘Input’ and (B) ‘Parameters.’ In the ‘Input’ tab, the amino acid sequence of FtsZ from P. syntrophicum in FASTA format (UniProt ID: A0A5B9D775) is shown as an example, and the target database is set to PDB_mmCIF30 (version of July 23, 2020). In the ‘Parameters’ tab, default values are set for all parameters, except ‘Max target hits’ (= 500).

2. Paste the amino acid sequence of your protein of interest (in FASTA format or as plain text) or an MSA (in FASTA, CLUSTAL, or STOCKHOLM format) into the large textbox (Fig. 1A). Alternatively, the input sequence or MSA can be uploaded using the ‘Upload File’ option. Follow the Support Protocol to build a custom MSA, starting with a protein sequence of interest.

Note

If you do not have the amino acid sequence of your protein of interest at hand, you can retrieve it from the protein database at NCBI (https://www.ncbi.nlm.nih.gov/protein) or the UniProt database (https://www.uniprot.org). The query sequence or MSA is validated as soon as it is pasted or uploaded, and an error message is displayed if it is not in one of the permitted formats. Upper- and lowercase letters, as well as the special characters for a gap (‘.’, ‘-’) and stop codon (‘*’), are allowed in the amino acid sequence. If your input sequence is longer than 2000 residues, we advise you to cut it into overlapping blocks of less than 2000 residues and search with these blocks separately. Generating MSAs for long sequences is computationally very expensive and might result in your jobs running for several hours.
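The advice on long sequences can be sketched in a few lines of Python. This is a minimal illustration, not part of the Toolkit: the function name `split_into_blocks` and the default block size (1900) and overlap (200) are assumptions chosen only to keep every block below 2000 residues.

```python
def split_into_blocks(sequence, block_size=1900, overlap=200):
    """Split a sequence into overlapping blocks of at most block_size residues.

    Returns a list of (start_position, block) tuples, where start_position
    is the 1-based position of the block's first residue in the input.
    block_size and overlap are illustrative defaults, not Toolkit settings.
    """
    if overlap < 0 or block_size <= overlap:
        raise ValueError("block_size must be larger than a non-negative overlap")
    if len(sequence) <= block_size:
        return [(1, sequence)]

    step = block_size - overlap  # distance between consecutive block starts
    blocks = []
    start = 0
    while True:
        end = min(start + block_size, len(sequence))
        blocks.append((start + 1, sequence[start:end]))
        if end == len(sequence):
            break
        start += step
    return blocks


def blocks_to_fasta(name, blocks):
    """Format the blocks as FASTA records labeled with their residue ranges."""
    records = []
    for start, block in blocks:
        records.append(f">{name}_{start}-{start + len(block) - 1}\n{block}")
    return "\n".join(records) + "\n"
```

Each resulting FASTA record can then be submitted as a separate job; the residue range in the header makes it easier to map matches back onto the full-length sequence.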

Note

In Figure 1A, we use the putative FtsZ protein of the archaeon P. syntrophicum as an example (UniProt ID: A0A5B9D775; NCBI ID: WP_147661771).

3. Select target profile HMM database(s) against which you wish to compare the query protein (Fig. 1A).

Note

The target profile HMM databases are organized into two different drop-down lists, one for structural and annotated sequence family databases and the other for proteomes of several archaeal, bacterial, and eukaryotic model organisms. Presently, up to four databases can be selected at a time from one or both drop-down lists. Detailed information on the databases is available in the help pages. The choice of target database primarily depends on the research question one is trying to address. To identify a homolog of known structure and function, an obvious first choice is the PDB_mmCIF70 or the PDB_mmCIF30 database. These are versions of the Protein Data Bank (PDB), a repository for all publicly available 3D structures of proteins, filtered for a maximum pairwise sequence identity of 70% (PDB_mmCIF70) or 30% (PDB_mmCIF30). To make inferences about function, evolutionary history, and domain architecture, searches can also be carried out against the expert-curated structural classification databases ECOD and SCOPe, both of which organize proteins of known structure into hierarchies of families, superfamilies, and folds based on their evolutionary history. For these databases, we offer versions that are filtered for a maximum pairwise sequence identity of 70% (ECOD_F70 and SCOPe70). Since annotated sequence family databases such as PfamA, SMART, and CDD include conserved domains, both with or without characterized function or structure, they can be beneficial for the inference of function. Finally, proteomes of model organisms can be searched to identify extremely divergent homologs. We update these target databases regularly and also include new ones. For instance, we recently included a profile HMM database comprising all manually curated viral proteins in the UniProt database (UniProt-SwissProt-viral70).

Note

For our example sequence, we will run four separate searches against four different target databases: PDB_mmCIF30 (Fig. 1A) to identify homologs of known structure and function, ECOD_F70 and Pfam-A to identify domains, and the proteome of Saccharomyces cerevisiae to identify its homologs in eukaryotes.

4. Customize input parameters in the ‘Parameters’ tab (Fig. 1B). The default values for the various parameters are set to yield the best results for most standard cases, and we recommend using them, at least in the initial steps of the analysis.

Note

Detailed information on each parameter is available in the help pages. Upon selection of custom values for parameters, the corresponding drop-down lists are highlighted in light red, and a ‘Reset’ button for reloading the default values is introduced (Fig. 1B). The Toolkit caches the last used custom values and reloads them when a new submission is initiated. The sensitivity of HHpred relies heavily on the quality of MSAs. By default, three iterations of HHblits over the UniRef30 database are used to build the query MSA; if the input is an MSA, the number of iterations (‘MSA generation iterations’) is set to 0. In some instances, the depth of identified homologs is too low in UniRef30, which is filtered at 30% sequence identity. For instance, highly conserved proteins such as ribosomal proteins, ubiquitin, and heat shock proteins are poorly represented in UniRef30. In such cases, building the MSA with PSI-BLAST over nr70, which is a version of the nonredundant protein sequence database filtered for a maximum pairwise sequence identity of 70%, is recommended (‘MSA generation method’). On the other hand, in cases where you know that your protein's orthologs are extremely divergent, you could increase the sensitivity of an HHpred search substantially by trying to include more remotely homologous sequences into the query MSA. You could, for instance, use more iterations of HHblits or PSI-BLAST, or make the MSA building criteria less stringent by increasing the ‘E-value cutoff for MSA generation’ (hits with an E-value better than this cutoff are used in the next search iteration or, in the last iteration, for building the query HMM). We note that a corrupted query alignment, typically resulting from the inclusion of non-homologous sequences, especially repetitive or low-complexity ones, is the main source of high-scoring false positives. 
If you suspect that the hits yielded by your search are false positives, make the MSA-building criteria more stringent by adjusting the values for ‘E-value cutoff for MSA generation’, ‘Min seq identity of MSA hits with query’, and ‘Min coverage of MSA hits’. Finally, using an expert-built or expert-edited alignment as input may significantly increase the sensitivity and reliability of an HHpred search. Since scoring the secondary structure similarity of query and template sequences improves the sensitivity of HHpred searches significantly, ‘Secondary structure scoring’ is turned on by default.

Note

For our example, we will set ‘Max target hits’, which controls how many matches will be displayed in the results, to 500, and use default values for all other input parameters (Fig. 1B).

5. Optionally, assign your job a custom identifier by entering one in the ‘Custom Job ID’ text field (Fig. 1). The identifier should contain at least two characters. If this text field is left empty, an identifier is assigned automatically.

Note

For our example jobs, we will let the Toolkit assign identifiers automatically.

6. Start your search by pressing the ‘Submit’ button.

Note

Upon submitting a job, a new tab that shows the current status of the job is appended. Also, an entry for the job is added to the job pane located on the left of the screen. This pane provides an overview of all jobs in the current session, allows their sorting by different criteria, gives access to individual jobs, and includes a search box to identify jobs. A job starts immediately or is queued, depending on the load on our compute cluster at the time of submission. If a previously completed job with identical input and parameters is found, an option to reload the results is offered. In the job pane, jobs are color-coded based on their current status: queued jobs are colored gray, running jobs yellow, completed jobs green, failed jobs red, and jobs with an identical copy in our database lavender. A new submission with modified parameters and target database(s) can be initiated from a running or completed job by simply switching to the ‘Input’ or ‘Parameters’ tabs.

HHpred search results

7. Typical HHpred searches take about 5 min to complete. However, searches involving long input sequences (>600 residues), large input MSAs, a higher number of MSA generation iterations (4 or more), or multiple target databases could take hours to complete.

Note

Upon successful completion, the status tab is removed from the job view, and five new tabs are appended: ‘Results’, ‘Raw Output’, ‘Probability Plot’, ‘Query Template MSA’, and ‘Query MSA’ (Fig. 2).

‘Results’ tab of HHpred. The results yielded by an HHpred search are presented in an interactive format and organized into three sections: (A) ‘Visualization’, (B) ‘Hitlist’, and (C) ‘Alignments’. An HHpred search with FtsZ of P. syntrophicum over the PDB_mmCIF30 database (version of July 23, 2020, performed on September 3, 2020) yielded 297 matches, including several high-scoring matches to FtsZ proteins of archaea and bacteria as well as to eukaryotic tubulins. In the ‘Alignments’ sections, the conserved GTP-binding motif (GGGTG(T/S)G) is marked.

8. The ‘Results’ tab presents information on the detected matches in a user-friendly and interactive manner (Fig. 2).

Note

The output is organized into three sections: ‘Visualization’, ‘Hitlist’, and ‘Alignments’ (Fig. 2). These sections can be accessed directly, without having to scroll to them, using the quick links (‘Vis’, ‘Hits’, and ‘Aln’) in the floating toolbar offered at the top of the tab. The number of detected matches and a measure of the diversity of the query MSA (Neff) are displayed directly above the ‘Visualization’ section. Neff ranges between 1 (for a single sequence) and 20 (for an extremely deep MSA). Additionally, the user is alerted if the query MSA contains too few sequences or if a signal peptide, coiled coils, intrinsically disordered regions, or transmembrane segments are detected.
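To see why a diversity measure of this kind is bounded by 1 and 20, it helps to compute a simplified version by hand. The sketch below takes the exponential of the average per-column Shannon entropy of an alignment. It is an illustration only: the function name `simple_neff` is hypothetical, all sequences are weighted equally, and gaps are ignored, so it should not be expected to reproduce the Neff values reported by HHpred.

```python
import math
from collections import Counter


def simple_neff(msa):
    """Exponential of the mean per-column Shannon entropy of an alignment.

    msa is a list of equal-length aligned sequences. Gap characters
    ('-' and '.') are ignored. Simplified illustration: every sequence
    counts equally, unlike weighted schemes used by real tools.
    """
    n_cols = len(msa[0])
    if any(len(row) != n_cols for row in msa):
        raise ValueError("all aligned sequences must have the same length")

    total_entropy = 0.0
    counted_cols = 0
    for col in range(n_cols):
        residues = [row[col].upper() for row in msa if row[col] not in "-."]
        if not residues:
            continue  # skip columns that contain only gaps
        n = len(residues)
        counts = Counter(residues)
        entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
        total_entropy += entropy
        counted_cols += 1

    if counted_cols == 0:
        raise ValueError("alignment contains no residues")
    return math.exp(total_entropy / counted_cols)
```

A single sequence has zero entropy in every column, which gives a value of 1. A column in which all 20 amino acids occur equally often has entropy ln 20, which gives the upper bound of 20.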

Note

The ‘Visualization’ section (Fig. 2A) shows the query sequence as a slider bar. The database matches are shown as horizontal bars underneath, indicating their coverage with respect to the query. The bars are color coded according to their significance from red (very significant) to orange, yellow, green, and cyan, to blue (less significant). In this section, only hits with a probability value of more than 40% are shown. Placing the mouse cursor over a bar shows a textual description of the match, and clicking on it takes one to the corresponding query-template alignment in the ‘Alignments’ section.

Note

The ‘Hitlist’ section (Fig. 2B) provides a tabular listing of matches sorted by their probability of being a true positive (column ‘Probability’). It includes columns with information on identifiers (‘Hit’), descriptions (‘Name’), E-values (‘E-value’), raw scores (‘Score’), secondary structure scores (‘SS’), lengths of the aligned region (‘Aligned Cols’), and lengths of the template (‘Target Length’). The hits can be sorted by clicking on the column headers or filtered by a keyword using the search box above the table. Clicking on an index number in the leftmost column takes one to the corresponding query-template alignment in the ‘Alignments’ section. For the calculation of probability, the raw score (‘Score’) and the secondary structure score (‘SS’) are considered. The raw score is computed using the Viterbi HMM-HMM alignment and the secondary structure score from the alignment of secondary structure assignments between query and template, as provided by PSIPRED (3-state) or determined by DSSP (8-state). The E-value is the average number of false positives (wrong hits) with a score better than the one for the given template in the target database(s). While E-values close to 0 signify a very reliable hit, an E-value of 10 indicates that about 10 wrong hits are expected to be found in the database with a score at least this good. The P-value is the E-value divided by the number of sequences in the database. It is the probability that a wrong hit will score at least this well in a pairwise comparison. Unlike Probability, E-value and P-value are calculated without taking the secondary structure score into account.
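The relationship between the E-value and the P-value described above can be written out as a short Python sketch. The function name and the database size used in any call are hypothetical; the actual number of entries depends on the selected target database(s).

```python
def p_value_from_e_value(e_value, n_database_entries):
    """Return the P-value for a hit, given its E-value and the database size.

    Follows the definition in the text: the P-value is the E-value divided
    by the number of sequences in the searched database.
    """
    if n_database_entries <= 0:
        raise ValueError("the database must contain at least one entry")
    return e_value / n_database_entries
```

For example, with a hypothetical database of 100,000 entries, an E-value of 10 (about 10 expected wrong hits in the whole database) corresponds to a P-value of 0.0001 for a single pairwise comparison.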

Note

The ‘Alignments’ section (Fig. 2C) provides pairwise alignments between query and template for all matches. Each entry starts with a row of hyperlinks. These always include a link to an alignment of the 100 most distinct database sequences used to generate the template HMM, and may also provide links to external resources, for example, in order to visualize the structure of the template (if the search was carried out over PDB_mmCIF30, PDB_mmCIF70, ECOD_F70, or SCOPe70). The entry header then lists a description of the match and the scores for Probability, E-value, Score, Aligned cols, Identities, Similarity, and Template Neff (a measure of the diversity of the template MSA). The alignment between the query and the template itself is split into one or more blocks in which lines corresponding to the query are marked with ‘Q’ and those corresponding to the template with ‘T’. The amino acid residues are colored based on their physicochemical properties. ‘ss_pred’ and ‘ss_dssp’ display secondary structure predicted by PSIPRED and assigned by DSSP, respectively. The three states predicted by PSIPRED are: H (α-helix; colored red), E (extended strand; blue), and C (residues not in H or E); upper-case and lower-case letters indicate high and low prediction confidence, respectively. The eight states assigned by DSSP are: H (α-helix; red), B (residue in isolated β-bridge), C (loop or irregular element), E (extended strand; blue), G (3₁₀-helix), I (π-helix), T (hydrogen-bonded turn), and S (bend). In the consensus sequences, upper- and lower-case characters indicate high (≥60%) and moderate (≥40%) conservation, respectively. The row between the query and template consensus sequences indicates the quality of the column-column match: ‘|’ very good, ‘+’ good, ‘·’ neutral, ‘−’ bad, and ‘=’ very bad.

Note

The search with our protein of interest, FtsZ of P. syntrophicum, over the PDB_mmCIF30 database (version of July 23, 2020, performed on September 3, 2020) yielded 297 matches (Fig. 2). The top six matches were to FtsZ proteins of archaea and bacteria and the FtsZ-like plasmid replication protein (RepX) from Bacillus cereus, all at a probability value of 100% and E-values better than 8.3e-30. Furthermore, the pairwise alignments showed the conservation of the GTP-binding motif (GGGTG(T/S)G), substantiating that our query protein is a member of the FtsZ family (Fig. 2C). The next best match was Tubulin alpha-1B of mammals at a probability value of 99.92% and E-value of 3.4e-24, indicating that our query protein is homologous to eukaryotic tubulins. The proteins share an overall pairwise sequence identity of only ∼14%, but both possess a highly conserved GTP-binding motif. The search against the ECOD_F70 and Pfam-A databases indicated that our query protein comprises two domains, an N-terminal Tubulin/FtsZ family GTPase domain followed by an FtsZ-family C-terminal-like domain (Fig. 3). The search against the proteome of S. cerevisiae yielded as best matches tubulins and Dml1p, a protein involved in the partitioning of mitochondria, all at probability values greater than 97% (Fig. 4).

Results of HHpred searches with P. syntrophicum FtsZ over ECOD_F70 (A) and Pfam-A (B). The results indicated that our query protein consists of two domains, an N-terminal Tubulin/FtsZ family domain followed by an FtsZ-family C-terminal-like domain.

Results of an HHpred search with P. syntrophicum FtsZ over the proteome of S. cerevisiae. On September 3, 2020, this search returned a total of 68 matches, with tubulins and Dml1p being the best matches. In the ‘Alignments’ section, the conserved GTP-binding motif (GGGTG(T/S)G) is marked.

9. The ‘Raw Output’ tab allows visualizing and downloading the raw output file yielded by an HHpred search (Fig. 5A). It is advisable to download and save this file for future reference.
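A saved raw output file can also be processed with a script. The following Python sketch collects the per-match scores listed in the alignment entry headers. It rests on an assumption about the file layout that should be checked against your own download: that each alignment entry contains one line of whitespace-separated `key=value` tokens beginning with `Probab=` (for example `E-value=...`, `Score=...`, `Identities=...%`). The function name is hypothetical.

```python
def parse_score_lines(raw_text):
    """Collect score lines from the text of a downloaded raw output file.

    Assumes (unverified for any particular Toolkit version) that every
    alignment entry has a line of 'key=value' tokens starting with
    'Probab='. Returns one dict per such line; numeric values are
    converted to float, and a trailing '%' is dropped before conversion.
    """
    records = []
    for line in raw_text.splitlines():
        if not line.lstrip().startswith("Probab="):
            continue
        record = {}
        for token in line.split():
            key, separator, value = token.partition("=")
            if not separator:
                continue  # not a key=value token
            try:
                record[key] = float(value.rstrip("%"))
            except ValueError:
                record[key] = value  # keep non-numeric values as text
        records.append(record)
    return records
```

The function takes the file contents as a string, so it can be applied to text read from the downloaded file with `open(path).read()`, where `path` points to wherever you saved the raw output.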

‘Raw Output’ and ‘Query MSA’ tabs of HHpred. The ‘Raw Output’ tab (A) provides access to the output file produced by HHpred in plain text format, whereas the ‘Query MSA’ tab (B) provides access to the MSA built by HHpred for the query. The latter also allows an MSA of all or just selected sequences to be forwarded as input to other tools.

10. The ‘Probability Plot’ tab displays a cumulative histogram of the hits and can be used to obtain a count of matches with probability values higher or lower than a given value.
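The counts that can be read off such a cumulative histogram are straightforward to reproduce from a list of probability values, for instance ones taken from the ‘Hitlist’ table. The Python sketch below is illustrative only; the function names are hypothetical and no particular binning used by the Toolkit is implied.

```python
def count_hits(probabilities, threshold):
    """Return (n_at_or_above, n_below) for the given probability threshold."""
    at_or_above = sum(1 for p in probabilities if p >= threshold)
    return at_or_above, len(probabilities) - at_or_above


def cumulative_counts(probabilities, thresholds):
    """Map each threshold to the number of hits with probability >= threshold."""
    return {t: count_hits(probabilities, t)[0] for t in thresholds}
```

Calling `cumulative_counts` with a descending series of thresholds such as `[90, 70, 50]` yields the non-decreasing counts that a cumulative histogram displays.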

11. The ‘Query Template MSA’ tab provides access to an MSA comprising the query sequence and sequences of all the obtained hits. It provides options to download the complete alignment (‘Download MSA’) or to forward the alignment to other tools (‘Forward Selected’), either completely (‘Select All’) or only for individually selected sequences.

12. The ‘Query MSA’ tab provides access to the MSA built by the HHpred server for the query (Fig. 5B). The tab displays the 200 most divergent sequences and allows an MSA of selected or all sequences to be forwarded to other tools (‘Forward Selected’). This tab also includes options for downloading this reduced query alignment or the full alignment in A3M format, a space-efficient format that we use internally to store alignments. Alignments in A3M format can be converted to FASTA using the FormatSeq tool offered within our Toolkit (https://toolkit.tuebingen.mpg.de/tools/formatseq).
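For readers who prefer to convert a downloaded alignment locally, the following Python sketch illustrates what the conversion involves. It is not the FormatSeq implementation. It assumes the usual A3M convention: upper-case letters and ‘-’ occupy match columns, lower-case letters are insertions relative to those columns, and the gaps that would align with other sequences' insertions are omitted (which is what makes the format compact). The function names are hypothetical.

```python
def read_records(text):
    """Parse FASTA-like text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_columns(sequence):
    """Separate match-column characters from the insertions around them.

    Returns (matches, inserts): inserts[0] holds insertions before the
    first match column, inserts[i + 1] those following match column i.
    """
    matches, inserts = [], [""]
    for char in sequence:
        if char == ".":
            continue  # optional padding for insert gaps carries no residue
        if char.islower():
            inserts[-1] += char
        else:
            matches.append(char)
            inserts.append("")
    return matches, inserts


def a3m_to_fasta(text):
    """Convert A3M text to an aligned FASTA string with equal-length rows."""
    parsed = [(h,) + split_columns(s) for h, s in read_records(text)]
    if not parsed:
        return ""
    n_match = len(parsed[0][1])
    if any(len(matches) != n_match for _, matches, _ in parsed):
        raise ValueError("sequences differ in their number of match columns")

    # Widest insertion at each slot determines how much padding is needed.
    widths = [max(len(ins[i]) for _, _, ins in parsed) for i in range(n_match + 1)]

    lines = []
    for header, matches, inserts in parsed:
        row = [inserts[0].upper().ljust(widths[0], "-")]
        for i, match_char in enumerate(matches):
            row.append(match_char)
            row.append(inserts[i + 1].upper().ljust(widths[i + 1], "-"))
        lines.append(f">{header}\n{''.join(row)}")
    return "\n".join(lines) + "\n"
```

In this sketch, inserted residues are written in upper case and padded with ‘-’, so every row of the output has the same length, as required for aligned FASTA.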

Alternate Protocol: PAIRWISE SEQUENCE COMPARISON USING HHpred

The pairwise mode of HHpred allows the comparison of two sequences or MSAs. This is particularly useful when you wish to substantiate a homologous relationship between two proteins that you suspect to be homologous, compare proteins that do not exist in our profile HMM databases, or obtain an HMM-HMM based alignment of two distantly related proteins. HHpred builds MSAs for the two input sequences using HHblits or PSI-BLAST, assigns secondary structure using PSIPRED, and converts the annotated MSAs to profile HMMs. In the next step, it compares the computed HMMs and reports an alignment if a match is found that satisfies the cutoffs set in ‘Parameters’. For proteins that contain multiple homologous repeats or domains, it typically reports two or more alignments. For detailed information on using HHpred, please refer to Basic Protocol 1.