NCBI's Conserved Domain Database and Tools for Protein Domain Analysis

Mingzhang Yang, Myra K. Derbyshire, Roxanne A. Yamashita, Aron Marchler-Bauer

Published: 2019-12-18 DOI: 10.1002/cpbi.90

Abstract

The Conserved Domain Database (CDD) is a freely available resource for the annotation of sequences with the locations of conserved protein domain footprints, as well as functional sites and motifs inferred from these footprints. It includes protein domain and protein family models curated in house by CDD staff, as well as imported from a variety of other sources. The latest CDD release (v3.17, April 2019) contains more than 57,000 domain models, of which almost 15,000 were curated by CDD staff. The CDD curation effort increases coverage and provides finer-grained classifications of common and widely distributed protein domain families, for which a wealth of functional and structural data have become available. The CDD maintains both live search capabilities and an archive of pre-computed domain annotations for a selected subset of sequences tracked by the NCBI's Entrez protein database. These can be retrieved or computed for a single sequence using CD-Search or in bulk using Batch CD-Search, or computed via standalone RPS-BLAST plus the rpsbproc software package. The CDD can be accessed via https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml. The three protocols listed here describe how to perform a CD-Search (Basic Protocol 1), a Batch CD-Search (Basic Protocol 2), and a Standalone RPS-BLAST and rpsbproc (Basic Protocol 3). © 2019 The Authors.

Basic Protocol 1: CD-Search

Basic Protocol 2: Batch CD-Search

Basic Protocol 3: Standalone RPS-BLAST and rpsbproc

INTRODUCTION

The Conserved Domain Database (CDD) of the National Center for Biotechnology Information (NCBI) is a collection of protein family and protein domain models. A domain is defined as a compact, discrete unit of 3D structure, typically in the range of 50 to 200 amino acids in size, and as a unit of molecular evolution that can be utilized to establish evolutionary classifications; a domain is usually associated with discrete aspects of protein function, such as enzyme activity, membrane transport, or nucleic-acid binding, to name a few. Domain models in the CDD include many fine-grained hierarchical classifications for selected domain families established with the help of phylogenetic analyses and manually curated by CDD staff, as well as sets of domain models imported from external high-quality and comprehensive resources, collected as annotated multiple sequence alignments and converted into position-specific score matrices. The current CDD collection (version 3.17) contains 57,242 total models: 14,908 models from the CDD curation effort, 35 NCBIfams (Haft et al., 2018), 1012 models from SMART v6.0 (Letunic, Doerks, & Bork, 2014), 16,709 models from Pfam v31 (Finn et al., 2016), 4873 COGs v1.0 (Tatusov et al., 2001), 10,885 NCBI Protein Clusters (Klimke et al., 2009), and 4488 models from TIGRFAM v15 (Haft et al., 2013).

The conserved domain summary pages give access to a wealth of data associated with each domain family, including hierarchical classifications, taxonomic information, sequence alignments, structural interaction data, domain architectures, functional site annotations, and literature. Figure 1 diagrams some of the variety of information available to the user in navigating the CDD. To take advantage of these multiple types of information, the CDD uses Reverse Position-Specific BLAST (RPS-BLAST), known in its interactive web-based implementation as CD-Search (Conserved Domain Search), to match protein sequences with domain and family models. It provides a live search service for protein and nucleotide queries, as well as pre-computed domain and site annotations (at a pre-set E-value) for the majority of protein sequences in the NCBI's Entrez system. The CDD has been integrated with several resources at the NCBI, including BLAST, Protein, and Gene, and with external collections such as InterPro (Apweiler et al., 2000; Mitchell et al., 2019; https://www.ebi.ac.uk/interpro), in order to provide a comprehensive workflow that will fit most users' needs.

Figure 1. Some of the wealth of information available through the Conserved Domain Database (CDD), which includes hierarchical classifications, taxonomic information, aligned sequences, structural interaction data, domain architectures, functional site annotations, and current literature sources.

You can access the CDD resource by using CD-Search for a single nucleotide or protein sequence query, Batch CD-Search for up to 4000 queries at a time, or standalone RPS-BLAST plus rpsbproc to run searches on your local infrastructure. You can also query Entrez (https://www.ncbi.nlm.nih.gov/cdd/) to access the CDD's domain information. In Basic Protocols 1 to 3, we describe how to use each of these services so that you can customize the settings, and we outline commonly used workflows. In addition, we provide links to Help documentation (Table 1) to aid you as you navigate these pages.
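As a hedged illustration of the Entrez route, the sketch below builds an E-utilities `esearch` request against the Entrez `cdd` database. The E-utilities endpoint, the helper names `build_cdd_esearch_url` and `fetch_cdd_ids`, and the example search term are not taken from this article; they are assumptions made for illustration only.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the NCBI E-utilities esearch endpoint (not listed in Table 1).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_cdd_esearch_url(term, retmax=20):
    """Return an esearch URL that queries the Entrez 'cdd' database."""
    params = {"db": "cdd", "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urllib.parse.urlencode(params)


def fetch_cdd_ids(term, retmax=20):
    """Run the search (requires network access) and return matching record IDs."""
    with urllib.request.urlopen(build_cdd_esearch_url(term, retmax)) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]


if __name__ == "__main__":
    # Hypothetical search term; only the URL is printed, no request is sent.
    print(build_cdd_esearch_url("FERM domain"))
```

Separating URL construction from the network call keeps the request easy to inspect before anything is sent to the shared NCBI servers.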

Table 1. URLs and FTP Sites Associated with the CDD Protocols Described in This Paper

https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml	CDD home page
https://www.ncbi.nlm.nih.gov/cdd	Entrez interface to the CDD
https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi	CD-Search Interface
https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml	CDD Help documentation
https://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi	BATCH Web CD-Search interface
https://www.ncbi.nlm.nih.gov/Structure/cdd/cdd_help.shtml#BatchRPSBInput	Batch CD-Search Help documentation
https://www.ncbi.nlm.nih.gov/Structure/cdd/docs/cdd_news.html	CDD News page (for most recent domain database versions)
https://blast.ncbi.nlm.nih.gov/Blast.cgi	BLAST Homepage
https://www.ncbi.nlm.nih.gov/books/NBK279690/	BLAST® Command Line Applications User Manual
https://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=BlastHelp	BLAST Help documentation
https://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3dtut.shtml	CDD Cn3D tutorial
https://www.ncbi.nlm.nih.gov/Structure/cdtree/cdtree.shtml	CDD CDTree: protein domain hierarchy viewer and editor
https://www.ncbi.nlm.nih.gov/sparcle	Entrez interface to SPARCLE
https://www.ncbi.nlm.nih.gov/Structure/sparcle/docs/sparcle_help.html	SPARCLE Help documentation
https://www.ncbi.nlm.nih.gov/books/NBK3837/	Entrez Help documentation
https://ftp.ncbi.nih.gov/pub/mmdb/cdd	CDD FTP site; see the README file for content
https://ftp.ncbi.nih.gov/pub/mmdb/cdd/rpsbproc/	CDD rpsbproc FTP site
https://ftp.ncbi.nih.gov/pub/mmdb/cdd/rpsbproc/README	rpsbproc README file
https://ftp.ncbi.nih.gov/pub/mmdb/cdd/little_endian	CDD pre-formatted search databases FTP site
https://ftp.ncbi.nih.gov/blast/executables/LATEST/	NCBI BLAST executables FTP site
https://ftp.ncbi.nih.gov/toolbox	NCBI C++ toolkit distribution FTP site

Basic Protocol 1: CD-SEARCH

The NCBI's CD-Search service (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi; Figure 2) allows users to query a nucleotide or protein sequence against the CDD database via a sequence identifier or by pasting in the sequence in FASTA or raw text format. For the majority of queries provided as valid sequence identifiers, the default CD-Search settings display results of pre-computed RPS-BLAST searches (storing up to 500 hits each) that were run against the entire CDD database at an E-value threshold of 0.01. The entire database includes CDs curated by CDD staff along with models from additional sources: Pfam (Finn et al., 2016), SMART (Letunic et al., 2014), KOG (Tatusov et al., 2003), COG (Tatusov et al., 2001), PRotein K(c)lusters (PRK; Klimke et al., 2009), and TIGRFAMs (Haft et al., 2013). The results are displayed by default in a concise format that shows the best-scoring domain model for each region of the query sequence plus the associated domain superfamily. If a region is annotated by a model that does not score well enough to be classified as a “specific hit,” only the superfamily annotation is shown. Default CD-Search parameters employ a score adjustment to address compositional bias, which largely abolishes the need to mask out low-complexity regions. Basic Protocol 1 demonstrates how to identify protein domains for a single nucleotide or protein sequence.
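The idea of showing only the best-scoring model per query region can be approximated as follows. This is a sketch under stated assumptions, not the CD-Search implementation: the `Hit` record, its field names, and the simple overlap rule are hypothetical, and the actual service also distinguishes specific hits from superfamily-only annotations. Only the E-value threshold of 0.01 comes from the text above.

```python
from dataclasses import dataclass

E_VALUE_THRESHOLD = 0.01  # default threshold stated in the text


@dataclass
class Hit:
    """Hypothetical record for one domain hit on the query."""
    model: str       # domain model name or accession
    start: int       # first aligned query position (1-based, inclusive)
    end: int         # last aligned query position (inclusive)
    evalue: float
    bitscore: float


def overlaps(a, b):
    """True if two hits share at least one query position."""
    return a.start <= b.end and b.start <= a.end


def concise_hits(hits, threshold=E_VALUE_THRESHOLD):
    """Keep the best-scoring hit per region, dropping hits above the threshold.

    Assumption: any overlap counts as 'the same region'; the real rule
    used by CD-Search may differ.
    """
    kept = []
    for hit in sorted(hits, key=lambda h: h.bitscore, reverse=True):
        if hit.evalue > threshold:
            continue  # not significant at the chosen E-value threshold
        if not any(overlaps(hit, other) for other in kept):
            kept.append(hit)
    return sorted(kept, key=lambda h: h.start)


if __name__ == "__main__":
    # Made-up hits, for illustration only.
    demo = [
        Hit("model_A", 1, 100, 1e-30, 200.0),
        Hit("model_B", 50, 150, 1e-10, 90.0),
        Hit("model_C", 200, 300, 0.5, 20.0),
        Hit("model_D", 160, 250, 1e-5, 60.0),
    ]
    for hit in concise_hits(demo):
        print(hit.model, hit.start, hit.end, hit.evalue)
```

Sorting by score before the overlap check is a greedy choice: a strong hit always displaces weaker overlapping ones, which mirrors the intent of a concise view but ignores domain hierarchy.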

Figure 2. CD-Search page on the Web server https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi/.

Necessary Resources

Hardware

Workstation with Internet access

Software

Web browser

Files

Protein sequence in FASTA format, accession number, or gi (GeneInfo) number

1. Open the protein sequence search page: https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi (see Figure 2).

2. In the text box, type the accession or gi number, or paste in the sequence of your protein or nucleotide of interest in FASTA format.

3. To run the search with the default settings, press the Submit button.

4. View the results as they appear in HTML format (see Figure 3).

Figure 3. CD-Search results for query gi 223460687 (mouse myosin VIIB protein) showing the Protein Classification, Graphical summary, and List of domain hits results.

5. Select the scope of your graphical summary display by going to the top right-hand corner of the display and using the View pulldown menu to select either Concise Results, Standard Results, or Full Results.

Note

The default display is the Concise Results view, as shown in Figure 3. See the Guidelines for Understanding Results section of this article for an explanation of the different views.

6. Mouse over the annotations marked by triangles under the Query sequence in the Graphical Summary to reveal a pop-up window with information about a functional feature mapped to the query sequence via a domain hit. The pop-up window links to a CD summary page, which shows the multiple sequence alignment of protein sequences used to curate the model, annotated with hash marks denoting the location of the conserved feature residues, and provides the option to examine evidence supporting the feature.

Note

An example of a result for an annotated site is shown in Figure 4.

Figure 4. CD-Search results for query gi 223460687 (mouse myosin VIIB protein) showing mouse-over pop-up of the ATP-binding site annotation.

7. Mouse over the cartoon of the CD domain to reveal a pop-up panel showing the E-value, accession ID, name, and description. This also highlights the corresponding domain hit (shown in green) in the List of domain hits.

Note

An example of this type of result for a CD domain is shown in Figure 5.

Figure 5. CD-Search results for query gi 223460687 (mouse myosin VIIB protein) showing mouse-over pop-up of the FERM domain.

8. Click on the plus [+] in the List of domain hits to see how your query is aligned with the domain model.

Note

An example of an expanded domain hit for a CD domain is shown in Figure 6.

Figure 6. CD-Search results for query gi 223460687 (mouse myosin VIIB protein) showing expanded FERM domain with domain definition and query alignment to CD.

9. To launch and view the CD summary page for your domain of interest, click on the CD link in the List of domain hits, or click on the cartoon “bubble” of the CD of interest or on the symbols (triangles) indicating the location of feature annotations. Invoking the CD summary pages via links from the Graphical Summary embeds your query in the sequence alignment on the CD summary page.

Note

An example of a CD domain summary page is shown in Figure 7.

Note

CD-Search results can also be accessed through the protein BLAST results pages, because CD-Searches are also run during protein BLAST searches. In the recently revised BLAST results pages, adopted in August 2019, CD-Search results appear under the Graphic Summary tab. If conserved domain hits are detected on the query sequence, you will see the message “Putative conserved domains have been detected”; clicking on the image below this message will take you to the familiar CD-Search results page for your query.

Note

An example of a BLAST search of gi 223460687 (mouse myosin VIIB protein) showing the CD-Search result is shown in Figure 8.

Figure 7. CD-Search results summary page for FERM domain with domain definition and user query added to the CD multiple sequence alignment.

Figure 8. CD-Search results page for the query gi 223460687 (mouse myosin VIIB protein) found under the Graphic Summary tab in BLAST.
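Step 2 of this protocol accepts several kinds of input: a gi number, an accession, or a pasted sequence. When preparing queries in a script, it can help to check which kind you have before submitting. The sketch below is one possible way to do that; the function name `classify_query` and the loose regular expressions are assumptions for illustration and do not reproduce how the CD-Search server validates input.

```python
import re

# Assumption: a deliberately loose accession pattern (letters/underscore,
# then digits, optional ".version"), e.g. "NP_001234.1". Real accession
# formats are more varied than this.
ACCESSION_RE = re.compile(r"[A-Za-z_]+\d+(\.\d+)?")
# Raw sequence: letters plus '*', '-' and whitespace, no header line.
RAW_SEQUENCE_RE = re.compile(r"[A-Za-z*\-\s]+")


def classify_query(text):
    """Return a rough label for a single query pasted into the text box."""
    stripped = text.strip()
    if not stripped:
        return "empty"
    if stripped.startswith(">"):
        return "fasta"          # FASTA records begin with a '>' header line
    if stripped.isdigit():
        return "gi"             # gi numbers are purely numeric
    if ACCESSION_RE.fullmatch(stripped):
        return "accession"
    if RAW_SEQUENCE_RE.fullmatch(stripped):
        return "raw_sequence"
    return "unknown"


if __name__ == "__main__":
    for example in ["223460687", "NP_001234.1", ">query\nMKVLAAGG", "MKVLAAGG"]:
        print(repr(example), "->", classify_query(example))
```

The accession `NP_001234.1` and the short peptide string are made-up examples; only gi 223460687 appears in this protocol.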

Basic Protocol 2: BATCH CD-SEARCH

Use Batch CD-Search (https://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi) to compute and retrieve domain annotations for a batch of protein queries. Basic Protocol 2 demonstrates how to identify protein domains for a batch of up to 4000 protein sequences. This limit may be adjusted in the future because of the high peak usage of this shared resource.
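If you have more queries than a single submission allows, one simple approach is to split them into consecutive batches before submitting. The sketch below assumes the 4000-query limit stated above; the constant `BATCH_LIMIT` and the helper `split_into_batches` are hypothetical names, and the limit should be checked against the current Batch CD-Search Help documentation because it may change.

```python
BATCH_LIMIT = 4000  # per-submission limit stated in the text; subject to change


def split_into_batches(queries, limit=BATCH_LIMIT):
    """Split a list of query identifiers or sequences into batches of at most `limit`."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [queries[i:i + limit] for i in range(0, len(queries), limit)]


if __name__ == "__main__":
    # Made-up identifiers, for illustration only.
    example_queries = [f"query_{n}" for n in range(9000)]
    batches = split_into_batches(example_queries)
    print([len(batch) for batch in batches])
```

Submitting the batches one after another, rather than in parallel, is the more considerate choice for a shared resource with high peak usage.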