Exploring Curated Conformational Ensembles of Intrinsically Disordered Proteins in the Protein Ensemble Database

Federica Quaglia, András Hatos, Damiano Piovesan, Silvio C. E. Tosatto, Tamas Lazar, Peter Tompa

Published: 2021-07-12 DOI: 10.1002/cpz1.192

intrinsically disordered proteins

Abstract

The Protein Ensemble Database (PED; https://proteinensemble.org/) is the major repository of conformational ensembles of intrinsically disordered proteins (IDPs). Conformational ensembles of IDPs are primarily provided by their authors or occasionally collected from literature, and are subsequently deposited in PED along with the corresponding structured, manually curated metadata. The modeling of conformational ensembles usually relies on experimental data from small-angle X-ray scattering (SAXS), fluorescence resonance energy transfer (FRET), and NMR spectroscopy, on molecular dynamics (MD) simulations, or on a combination of these techniques. The growing number of scientific studies based on these data, along with the astounding and swift progress in the field of protein intrinsic disorder, has required a significant update and upgrade of PED, first published in 2014. To this end, the database was entirely renewed in 2020 and now has a dedicated team of biocurators who provide manually curated descriptions of the methods and conditions applied to generate the conformational ensembles and check the consistency of the data.

Here, we present a detailed description of how to explore PED through its protein pages and experimental pages, and how to interpret entries of conformational ensembles. We describe how to efficiently search conformational ensembles deposited in PED by means of its web interface and API. We demonstrate how to make sense of the PED protein page and its associated experimental entry pages with reference to the yeast Sic1 use case. © 2021 The Authors. Current Protocols published by Wiley Periodicals LLC.

[Correction added on May 17, 2022, after first online publication: CRUI-CARE funding statement has been added.]

Basic Protocol 1 : Performing a search in PED

Support Protocol 1 : Programmatic access with the PED API

Basic Protocol 2 : Interpreting the protein page and the experimental entry page—the Sic1 use case

Support Protocol 2 : Downloading options

Support Protocol 3 : Understanding the validation report—the Sic1 use case

Basic Protocol 3 : Submitting new conformational ensembles to PED

Basic Protocol 4 : Providing feedback in PED

INTRODUCTION

Intrinsically disordered proteins (IDPs) and disordered regions (IDRs) represent segments of a polypeptide chain that lack a stable three-dimensional tertiary fold (Dunker et al., 2001; Tompa, 2002; Wright & Dyson, 1999). The diverse roles of these regions stem from the dynamic nature of their structure, e.g., IDRs can function by molecular recognition, or display sites for post-translational modification, flexible linkers, or entropic bristles (Dunker, Brown, David Lawson, Iakoucheva, & Obradović, 2002; Lee et al., 2014; Piovesan et al., 2017). These experimentally verified IDRs and their corresponding functions, as well as their interaction partners and structural transitions, are catalogued in the manually curated DisProt database (Hatos et al., 2019). As a consequence of their inherently dynamic structure, instead of a single snapshot, this group of proteins is characterized by a representative set of conformers reflecting internal heterogeneity, called the conformational ensemble (Fisher, Huang, & Stultz, 2010; Jensen, Ruigrok, & Blackledge, 2013; Tompa, 2011).

Experimental data to characterize conformational ensembles are primarily derived from nuclear magnetic resonance (NMR) spectroscopy measurements that provide residue-specific information in the form of structural restraints (proximity, dynamics and secondary structure) and small-angle X-ray scattering (SAXS) curves characteristic of the shape and compactness of the proteins (Bernadó & Svergun, 2012; Jensen et al., 2013; Marsh & Forman-Kay, 2012; Schramm et al., 2019). These data can be readily accessed in dedicated primary databases such as the Biological Magnetic Resonance Data Bank (BMRB; Romero et al., 2020) and SASBDB (Kikhney, Borges, Molodenskiy, Jeffries, & Svergun, 2020). Förster resonance energy transfer (FRET) experiments can also provide data on the global compactness and dynamics of the protein in the form of FRET efficiencies between donor and acceptor fluorescent tags grafted onto the polypeptide chain (Fuertes et al., 2017; Nath et al., 2012; Schramm et al., 2019).

Reweighting approaches for ensemble calculation usually rely on the generation of a pool of thousands of conformers based on random coil generation tools [e.g., Flexible-Meccano (Ozenne et al., 2012), TraDES (Feldman & Hogue, 2002)] or molecular dynamics (MD) simulations (Chan-Yao-Chong et al., 2019; Rauscher et al., 2015; Robustelli, Piana, & Shaw, 2018), followed by the refinement of the pools by the addition of experimental data to select sub-ensembles that better fit the data (Köfinger et al., 2019; Krzeminski, Marsh, Neale, Choy, & Forman-Kay, 2013; Rangan et al., 2018; Varadi et al., 2014). In contrast, restraining approaches introduce experimentally derived constraints/restraints and only generate conformations that comply with the measurement data (Cavalli, Camilloni, & Vendruscolo, 2013; Rangan et al., 2018; Varadi et al., 2014). The resulting final ensembles are sets of conformers, ideally with atomic coordinates, that sufficiently reflect the measurements. However, because the set of experimental restraints is limited, multiple conformational ensembles can fit the data equally well, which makes ensemble determination an inherently underdetermined problem (Lazar et al., 2020). As the number of available ensembles covering a given protein region grows, comparing these alternative ensembles might shed light on this problem. The Protein Ensemble Database (abbreviated now as PED, initially pe-DB) was launched to create a deposition database making these conformational ensembles available to the public in a structured and well-organized manner (Varadi et al., 2014).
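The pool-then-select idea behind the reweighting approaches described above can be illustrated with a toy sketch. It is not any of the cited methods: the greedy selection rule, the single scalar observable per conformer, and all numbers are assumptions made purely for illustration.

```python
from statistics import mean


def select_subensemble(pool_values, target, size):
    """Greedily pick `size` conformers whose average observable approaches `target`.

    pool_values: one back-calculated observable per conformer in the pool
                 (hypothetical values, e.g., a compactness measure).
    target:      the experimentally measured ensemble average.
    Returns the indices of the selected conformers.
    """
    if not 0 < size <= len(pool_values):
        raise ValueError("size must be between 1 and the pool size")
    remaining = list(range(len(pool_values)))
    chosen = []
    running_sum = 0.0
    for _ in range(size):
        n = len(chosen) + 1
        # Add the conformer that brings the sub-ensemble average closest to the target.
        best = min(
            remaining,
            key=lambda i: abs((running_sum + pool_values[i]) / n - target),
        )
        chosen.append(best)
        remaining.remove(best)
        running_sum += pool_values[best]
    return chosen


# Hypothetical pool of 41 conformers with observables from 10.0 to 30.0.
pool = [10.0 + 0.5 * i for i in range(41)]
target = 14.0  # hypothetical experimental average

picked = select_subensemble(pool, target, size=5)
sub_mean = mean(pool[i] for i in picked)
print("pool average:", mean(pool))
print("sub-ensemble average:", sub_mean, "from conformers", sorted(picked))
```

Many different five-conformer subsets reproduce the same average here, which is a small-scale picture of why fitting a limited set of measurements leaves the problem underdetermined.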

PED is now a service of the Italian node of ELIXIR, the European infrastructure for biological data, and a central resource of the ELIXIR IDP user community (Davey et al., 2019). First published in 2014 and updated in 2016 (Varadi & Tompa, 2015; Varadi et al., 2014), PED was revamped in 2020 and extensively expanded to include six times more data than the previous version (Lazar et al., 2021). A redesigned graphical interface provides an improved user experience. Two types of pages can be found in PED: the Protein entry pages, which group together experiments at the protein level, and the Experimental entry pages, which revolve around the experiments and correspond to the submitted entries. The high quality of the PED experiment-centric entries is now ensured both by manual curation from a dedicated team of biocurators and by an automatic validation pipeline.

In this article we describe in detail how to perform a search in PED (Basic Protocol 1), visualize and interpret protein-centric and experiment-centric entries in PED (Basic Protocol 2), submit new conformational ensembles (Basic Protocol 3), and provide feedback (Basic Protocol 4). We also describe programmatic access with the PED REST API (Support Protocol 1), the downloading options in PED (Support Protocol 2), and how to understand the conformational ensemble validation report (Support Protocol 3).

For abbreviations used in the text, see List of Abbreviations at the end of this article.

Basic Protocol 1: PERFORMING A SEARCH IN PED

To search the Protein Ensemble Database (PED), please visit the webpage at https://proteinensemble.org/. PED is an open-access database freely available without registration. In this protocol, we describe the different options for searching the database, browsing the ensembles, and retrieving relevant information.

Necessary Resources

Hardware

While PED works best on laptop or desktop computers, it is also accessible from smartphones and tablets. An active and stable internet connection is required.

Software

Internet browser. Browsing is optimized for Firefox (http://www.mozilla.org/firefox/), Google Chrome (http://www.google.com/chrome/), and Safari (http://www.apple.com/safari/).

Input data

No input data are required.

Performing a text search

1. Open a web browser and connect to the Main page of the Protein Ensemble Database at https://proteinensemble.org/.

2. Searches in PED can be performed in two ways:

Users can perform a search using the “Search” box at the top-middle of the PED home page. After typing the query text in the box, click the “Search” button (Fig. 1).

Users can look for a specific protein, e.g., Sic1 from yeast, by submitting the protein name, e.g., Sic1, or its corresponding UniProtKB (UniProt Consortium, 2019) accession, P38634. This will redirect them to a list of associated PED entries.

Users might also be interested in looking for a specific publication. To do so, provide a valid identifier of the publication, i.e., the corresponding PubMed (White, 2020) identifier (PMID) or DOI.

Alternatively, users can perform an advanced search by connecting to the Browse page of the Protein Ensemble Database at https://proteinensemble.org/browse/. Navigation to the Browse page is also possible from the Main page by clicking the corresponding button in the navigation panel, i.e., the first button on the top-right, titled “Browse” (Fig. 1). A search can be performed here using the search box by typing the query text, or a combination of queries, e.g., “UniProt ACC” and “Experimental tag.” This search field is auto-updating, so there is no need to click a button to register each change.

3. Select a term from the drop-down menu on the Browse page. The option selected from the drop-down list will limit the content being searched. Users can perform the following functions:

Perform a free text search in the database by selecting the “Free text” term from the drop-down menu.
For a specific experiment-centric page, select the “Entry identifier” term and look for a specific PED identifier, e.g., PED00001.
For all the experiment-centric pages, i.e., entries of conformational ensembles, associated with a specific protein, select the “UniProt ACC” term and enter the accession, e.g., P38634. This will retrieve all the experiment-centric entries associated with yeast Sic1.
Select the “Experimental tag” term to look for a specific experimental tag, e.g., “NMR” or “EOM.” This will retrieve all the entries associated with the tag(s). Three categories of tags are available, depending on the experimental setup used to generate the conformational ensemble, and are highlighted in the Browse table in yellow for the tags related to “measurement methods,” in orange for the tags describing “ensemble generation methods,” and in red for the tags associated with “molecular dynamics” simulations.
Find conformational ensembles from a specific author by selecting the “Data owner” term and looking for the author's name. A list of the entries from that author will appear.
Find all the entries curated by a specific biocurator by selecting the “Biocurator name” term and specifying the biocurator's name.
Find a specific protein. By selecting the “Protein name” term and looking for “Sic1,” it is possible to retrieve all the entries deposited in PED related to that protein, i.e., PED00001, PED00023, PED00159, PED00160, and PED00161.
Find a specific publication by selecting the “Publication title” option and entering the title of the publication into the search box, or search for the identifier of a publication (PubMed identifier or DOI) by selecting the “Publication ID” term.
Look for PED entries cross-referenced to entries in DisProt, e.g., DP00631, or to raw experimental data deposited in SASBDB or BMRB, e.g., BMRB entry “16659,” by selecting the “Cross-reference (PDB, BMRB, SASBDB)” term.
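When queries are prepared in bulk, the choice of drop-down term listed above can be sketched as a small helper. This is only an illustration and not part of PED: the function name is hypothetical, the PED and DisProt identifier patterns are inferred from the examples in the text (PED00001, DP00631), and the DOI and PMID patterns are common heuristics. The UniProt accession pattern is the one published by UniProt.

```python
import re

# UniProt accession format as documented by UniProt.
UNIPROT_ACC = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
PED_ID = re.compile(r"PED\d{5}")        # assumption, based on e.g. PED00001
DISPROT_ID = re.compile(r"DP\d{5}")     # assumption, based on e.g. DP00631
DOI = re.compile(r"10\.\d{4,9}/\S+")    # heuristic DOI shape
DIGITS = re.compile(r"\d+")             # PMID or a numeric cross-reference

CROSS_REF = "Cross-reference (PDB, BMRB, SASBDB)"


def suggest_search_terms(query):
    """Return the Browse drop-down term(s) that plausibly match `query`."""
    q = query.strip()
    if PED_ID.fullmatch(q):
        return ["Entry identifier"]
    if UNIPROT_ACC.fullmatch(q):
        return ["UniProt ACC"]
    if DISPROT_ID.fullmatch(q):
        return [CROSS_REF]
    if DOI.fullmatch(q):
        return ["Publication ID"]
    if DIGITS.fullmatch(q):
        # Digits alone are ambiguous: a PMID or, e.g., a BMRB cross-reference.
        return ["Publication ID", CROSS_REF]
    return ["Free text"]


for example in ["P38634", "PED00001", "DP00631", "10.1002/cpz1.192", "16659", "Sic1"]:
    print(example, "->", suggest_search_terms(example))
```

Anything the patterns do not recognize falls back to “Free text”; names such as “Sic1” could equally be searched with the “Protein name” term.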

4. Users can customize the table columns shown on the Browse page to visualize more/fewer details of the entries. By default, all columns are shown, i.e., “UniProt ACC,” “PED identifier,” “PED title,” “Coverage (to UniProt),” “Number of ensembles,” and “Number of conformers.”

Note

A more detailed description of the ways to search the Protein Ensemble Database follows, with dedicated examples.

Searching for PED entries associated with a specific protein: The p53 use case

This section demonstrates how to look for all the entries—protein-centric and experiment-centric—associated with a specific use case, the p53 protein.

5. Open a web browser and connect to PED at https://proteinensemble.org/, then select the “Browse” button from the top bar. Alternatively, directly access the Browse page of the Protein Ensemble Database at https://proteinensemble.org/browse/ and select the type of search next to the search box.

6. Entries associated with p53 can be retrieved by selecting the “Protein name” term and entering p53 in the search box. This will immediately retrieve all the entries related to p53 in PED. The displayed results include PED identifiers associated with more than one UniProt ACC, since p53 forms complexes with other proteins. For example, PED00037 (https://proteinensemble.org/PED00037), the experimental entry titled “Solution Structure of p53-TAD::Cbp-TAZ1 fusion protein,” is related to both the CREBBP (https://proteinensemble.org/P45481) and p53 (https://proteinensemble.org/P04637) protein pages.

7. Users can then choose if they want to explore the protein-centric or the experiment-centric pages.

To access the protein-centric page of p53, click on the “P04637” button in the first column of the results table. This will redirect the user to the protein entry, where it is possible to visualize all the conformational ensembles associated with the protein.
To access the experimental entry “Solution Structure of p53-TAD::Cbp-TAZ1 fusion protein,” click on the “PED00037” button in the second column of the results table of the Browse page to be redirected to the page dedicated to this conformational ensemble.
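The URLs quoted in the steps above suggest that protein pages and experimental entry pages share the pattern https://proteinensemble.org/{identifier}. The sketch below builds such links and shows how one experimental entry can belong to several protein pages. The helper name and the dictionary are illustrative assumptions, and the mapping contains only the accessions mentioned in this protocol.

```python
from collections import defaultdict

BASE_URL = "https://proteinensemble.org"


def page_url(identifier):
    """Build a PED page URL from a UniProt accession or a PED identifier."""
    identifier = identifier.strip()
    if not identifier.isalnum():
        raise ValueError(f"unexpected identifier: {identifier!r}")
    return f"{BASE_URL}/{identifier}"


# Experimental entry -> UniProt accessions, copied from the examples in the text.
entry_to_proteins = {
    "PED00037": ["P45481", "P04637"],  # p53-TAD::Cbp-TAZ1 fusion protein
    "PED00087": ["P04637"],            # partial: the Tfb1 accession is not given in the text
}

# Invert the mapping to obtain the protein-centric view.
protein_to_entries = defaultdict(list)
for entry_id, accessions in entry_to_proteins.items():
    for accession in accessions:
        protein_to_entries[accession].append(entry_id)

for accession, entry_ids in sorted(protein_to_entries.items()):
    print(page_url(accession), "->", [page_url(e) for e in entry_ids])
```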

Users can also directly search for the UniProt accession number of the p53 protein, i.e., P04637.

8. Open a web browser, connect to PED at https://proteinensemble.org/, then select the “Browse” button from the top bar.

9. Alternatively, directly access the Browse page of the Protein Ensemble Database at https://proteinensemble.org/browse/ and select the type of search next to the search box.

10. Select the “UniProt ACC” term from the drop-down menu. Type the UniProt accession, e.g., P04637 for the p53 protein, into the search box to find the experimental entries containing region(s) of that protein (Fig. 2).

Figure 2. Searching for a UniProt accession number in the Browse page of PED.

11. Users can then choose if they want to explore the protein-centric page associated with the p53 protein or the experiment-centric pages.

To access the protein-centric page of p53, click on the “P04637” button (https://proteinensemble.org/P04637) in the first column of the results table. This will redirect the user to the protein entry, where it is possible to visualize all the conformational ensembles associated with the protein.
To access a specific experimental entry, “Structural ensemble of the complex between Tfb1 (2-115) and the activation domain of p53 (20-73),” click on “PED00087” (https://proteinensemble.org/PED00087) in the second column of the results table of the Browse page to be redirected to the page dedicated to this conformational ensemble.

Exploring specific experimental setup using the controlled vocabulary of PED

To explore whether a specific experimental setup has been used by someone to model ensembles, users can employ the searchable experimental tags of the controlled vocabulary created and maintained by the PED biocurator team. For instance, one might be interested in a setup where both NMR and SAXS data were used to refine an MD-based pool of conformations generated by an AMBER force field.

12. Open a web browser and connect to PED at https://proteinensemble.org/, then select the “Browse” button from the top bar. Alternatively, directly access the Browse page of the Protein Ensemble Database at https://proteinensemble.org/browse/ and select the type of search next to the search box.

13. To search terms from the PED controlled vocabulary (https://proteinensemble.org/about#cv; Fig. 3A), select “Experimental tag” from the drop-down menu.

Figure 3. Searching for entries with method-specific keyword terms from the controlled vocabulary of PED (A) using multiple search fields (Experimental tag) in the Browse page (B).

14. Type the “AMBER” keyword into the search box to see the entries that contain the term in their ensemble determination methodology.

15. You can add more tags to your search by clicking on the “Add” button on the right. The new search field that appears will be set to the default value “Free text.”

16. Change the type of the second search field to “Experimental tag” as before and enter the second keyword, “NMR.”

17. Click again on the “Add” button on the right.

18. Change the field from “Free text” to “Experimental tag” and type your third keyword, “SAXS” (Fig. 3B). At the time of writing, two ensembles fulfill the selected criteria, PED00180 and PED00181, generated by the same authors with almost identical methodology (Chan-Yao-Chong et al., 2019, 2020).

Note

Browse the ontology of the controlled vocabulary to find experimental tags relevant to you. You can do this on the About page (https://proteinensemble.org/about; Fig. 3A) by opening the sub-ontologies with a simple click on the arrows to their left or by typing keywords into the filtering textbox.
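Stacking several “Experimental tag” fields, as in the steps above, narrows the results to entries that carry every tag. The same logic can be sketched on a local table of entries. The function name, the case-insensitive matching, and the two EXAMPLE records are assumptions for illustration; only the AMBER, NMR, and SAXS tags of PED00180 and PED00181 come from the text, and those entries may carry further tags.

```python
def entries_with_all_tags(entries, required_tags):
    """Return the identifiers of entries that carry every tag in `required_tags`."""
    required = {tag.upper() for tag in required_tags}
    return sorted(
        entry_id
        for entry_id, tags in entries.items()
        if required <= {tag.upper() for tag in tags}
    )


entries = {
    "PED00180": {"AMBER", "NMR", "SAXS"},  # tags named in the protocol
    "PED00181": {"AMBER", "NMR", "SAXS"},  # tags named in the protocol
    "EXAMPLE_A": {"NMR", "EOM"},           # hypothetical record
    "EXAMPLE_B": {"AMBER", "NMR"},         # hypothetical record
}

print(entries_with_all_tags(entries, ["AMBER"]))
print(entries_with_all_tags(entries, ["AMBER", "NMR", "SAXS"]))
```

Each added tag can only shrink the result list, which is why the three-tag query returns fewer entries than the single “AMBER” query.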

Support Protocol 1: PROGRAMMATIC ACCESS WITH THE PED API

PED has a RESTful API for programmatic access, which can be used to retrieve entries and the corresponding ensemble models or to perform a customized search in the database. The Help page of PED lists all API endpoints with short descriptions and examples (https://proteinensemble.org/help#api). These endpoints are available from URLs of the form https://proteinensemble.org/api/{endpoint_name}. Below, we illustrate three procedures, with explanations, using a Python script as a client: (i) how to retrieve metadata of entries; (ii) how to download ensembles; and (iii) how to search for the available ensembles associated with a UniProt accession number, alone or in combination with a given method.
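As a minimal sketch of the URL pattern described above, the helper below fills in the {endpoint_name} part and fetches the response using only the Python standard library. The function names are hypothetical, "example-endpoint" is a placeholder rather than a real endpoint, and the JSON response format is an assumption; the actual endpoint names and formats are those listed on the PED Help page.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

API_ROOT = "https://proteinensemble.org/api"


def api_url(endpoint_name):
    """Fill the {endpoint_name} part of the API URL pattern."""
    path = endpoint_name.strip().strip("/")
    if not path:
        raise ValueError("endpoint_name must not be empty")
    return f"{API_ROOT}/{quote(path, safe='/')}"


def get_json(endpoint_name, timeout=30):
    """Fetch an endpoint and decode the body, assuming it returns JSON."""
    request = Request(api_url(endpoint_name), headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Only builds the URL; no network request is made here.
print(api_url("example-endpoint"))
```

Calling get_json with a real endpoint name from the Help page performs the request; keeping URL construction separate makes the pattern easy to check offline.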