Getting Started with the IDG KMC Datasets and Tools

Eryk Kropiwnicki, Jessica L. Binder, Jeremy J. Yang, Jayme Holmes, Alexander Lachmann, Daniel J. B. Clarke, Timothy Sheils, Keith J. Kelleher, Vincent T. Metzger, Cristian G. Bologa, Tudor I. Oprea, Avi Ma'ayan

Published: 2022-01-27 DOI: 10.1002/cpz1.355

bioinformatics

data visualization

disease ontology

drug discovery

drug targets

druggable genome

web applications

Abstract

The Illuminating the Druggable Genome (IDG) consortium is a National Institutes of Health (NIH) Common Fund program designed to enhance our knowledge of under-studied proteins, more specifically, proteins unannotated within the three most commonly drug-targeted protein families: G-protein coupled receptors, ion channels, and protein kinases. Since 2014, the IDG Knowledge Management Center (IDG-KMC) has generated several open-access datasets and resources that jointly serve as a highly translational machine-learning-ready knowledgebase focused on human protein-coding genes and their products. The goal of the IDG-KMC is to develop comprehensive integrated knowledge for the druggable genome to illuminate the uncharacterized or poorly annotated portion of the druggable genome. The tools derived from the IDG-KMC provide either user-friendly visualizations or ways to impute the knowledge about potential targets using machine learning strategies. In the following protocols, we describe how to use each web-based tool to accelerate illumination in under-studied proteins. © 2022 The Authors. Current Protocols published by Wiley Periodicals LLC.

Basic Protocol 1 : Interacting with the Pharos user interface

Basic Protocol 2 : Accessing the data in Harmonizome

Basic Protocol 3 : The ARCHS4 resource

Basic Protocol 4 : Making predictions about gene function with PrismExp

Basic Protocol 5 : Using Geneshot to illuminate knowledge about under-studied targets

Basic Protocol 6 : Exploring under-studied targets with TIN-X

Basic Protocol 7 : Interacting with the DrugCentral user interface

Basic Protocol 8 : Estimating Anti-SARS-CoV-2 activities with DrugCentral REDIAL-2020

Basic Protocol 9 : Drug Set Enrichment Analysis using Drugmonizome

Basic Protocol 10 : The Drugmonizome-ML Appyter

Basic Protocol 11 : The Harmonizome-ML Appyter

Basic Protocol 12 : GWAS target illumination with TIGA

Basic Protocol 13 : Prioritizing kinases for lists of proteins and phosphoproteins with KEA3

Basic Protocol 14 : Converting PubMed searches to drug sets with the DrugShot Appyter

INTRODUCTION

There are approximately 25,000 protein-coding genes (Venter et al., 2001) in the human genome. Abnormal protein expression is associated with many human diseases, which makes proteins critical targets for therapeutic agents. Approximately 15% of protein-coding genes are considered part of the “druggable genome.” This means that these proteins can modulate cellular behavior when targeted by experimental small molecule compounds (Hopkins & Groom, 2002; Johns, Russ, & Fu, 2012; Lipinski, Lombardo, Dominy, & Feeney, 2001; Russ & Lampel, 2005). Moreover, only a few hundred targets represent the existing clinical pharmacopeia, leaving a massive swath of pharmacology that remains unexploited. Therefore, 85% of druggable proteins remain to be explored as potential therapeutic targets. Much of the druggable genome encodes three critical protein families: non-olfactory G-protein-coupled receptors (GPCRs), ion channels, and protein kinases. Critically, we currently lack crucial knowledge about the function of many proteins from these families and their roles in health and disease. A better understanding of these proteins, structurally or functionally, could shed light on new avenues of investigation for basic science and therapeutic discovery (Oprea et al., 2018).

In this article, we provide several protocols to guide users through the use of IDG tools that accomplish specific computational tasks related to illuminating the druggable genome. In Basic Protocol 1, we describe how users can query the Pharos web interface (Sheils et al., 2021) to search for data related to gene targets. Basic Protocol 2 explains how to use Harmonizome (Rouillard et al., 2016), a web application that stores gene-attribute associations from various sources that can be readily visualized and leveraged for machine learning. Basic Protocol 3 describes ARCHS4 (Lachmann et al., 2018), a web application that provides easy access to RNA-sequencing data from human and mouse experiments and also includes gene landing pages for all human genes with gene function predictions based on mRNA co-expression. Basic Protocol 4 describes PrismExp (Lachmann, Rizzo, Bartal, Jeon, & Clarke, 2021), a machine learning Appyter (Clarke et al., 2021) that improves gene function predictions from gene co-expression correlation data by vertically partitioning the global gene-gene co-expression matrix used by ARCHS4. Basic Protocol 5 teaches the user how to use Geneshot (Lachmann et al., 2019), a web application that facilitates querying of biomedical search terms to retrieve prioritized lists of genes related to the search terms. In Basic Protocol 6, we introduce TIN-X (Cannon et al., 2017), the Target Importance and Novelty eXplorer. We demonstrate how to query and explore interesting disease-target associations based on novelty and importance metrics derived from natural language processing (NLP) of PubMed abstracts. Basic Protocol 7 describes DrugCentral (Avram et al., 2021), a comprehensive database of approved drugs that includes information relating to drug side effects, mode of action, indications, pharmacologic action, and other information.
Basic Protocol 8 explains REDIAL-2020 (KC et al., 2021), an ensemble machine learning platform that extends the information available in DrugCentral to predict drugs and small molecules that may have anti-SARS-CoV-2 activity. In Basic Protocol 9 we discuss Drugmonizome (Kropiwnicki et al., 2021), a web application that facilitates drug set enrichment analysis and allows users to submit a drug set of interest to retrieve enriched terms that all, or most, of the members of the input set share. Basic Protocol 10 describes Drugmonizome-ML (Kropiwnicki et al., 2021), an Appyter that extends the information available in Drugmonizome to build on-the-fly machine learning models for predicting novel drug and small molecule attributes. In a similar vein, Basic Protocol 11 discusses Harmonizome-ML, an Appyter that enables users to utilize the datasets from Harmonizome to build machine learning models that predict novel gene-attribute associations. Basic Protocol 12 includes a discussion of TIGA (Yang et al., 2021), Target Illumination GWAS Analytics, a tool that summarizes gene-trait associations derived from genome-wide association studies (GWAS) with rational and intuitive evidence metrics. In Basic Protocol 13, we describe how users can submit an input list of genes or differentially phosphorylated proteins to KEA3 for kinase enrichment analysis (Kuleshov et al., 2021) to infer kinases associated with the input list. Basic Protocol 14 explains how to use DrugShot, an Appyter that allows for the querying of biomedical search terms to retrieve known and predicted lists of drugs and small molecules related to the query term.

Basic Protocol 1: INTERACTING WITH THE PHAROS USER INTERFACE

Pharos is the user interface to the Knowledge Management Center (KMC) for the IDG program, providing facile access to most data types collected by the KMC (Nguyen et al., 2017; Sheils et al., 2020). Given the complexity of the data surrounding any target, efficient and intuitive visualization has been a high priority for users to navigate and summarize search results and rapidly identify patterns. Underlying the interface is a GraphQL API that provides programmatic access to all KMC data, enabling the incorporation of IDG resources with other applications.

Necessary Resources

Hardware

Desktop or a laptop computer, or a mobile device, with a fast Internet connection

Software

An up-to-date web browser such as Google Chrome (https://www.google.com/chrome/), Mozilla Firefox (https://www.mozilla.org/en-US/firefox/), Apple Safari (https://www.apple.com/safari/), or Microsoft Edge (https://www.microsoft.com/en-us/edge)

Search targets

1.Navigate to Pharos (https://pharos.nih.gov).

2.To search for a target, click on the search box on the main page or in the top left corner of subsequent pages. Enter STAT3. Note that multiple search types are available in the drop-down menu (Fig. 1).

Typeahead search results for STAT3; scroll or arrow down to view more options.

3.It is possible to search by pathway or view a list of diseases or ligands associated with a target. Additionally, pressing Enter or Return will allow a text-based search, which will return a list of results featuring ‘STAT3’ anywhere in the text.

4.Press Enter or Return, or click the magnifying glass icon to search for the ‘STAT3’ text string.

5.A list of 81 targets is returned, with ‘STAT3’ being at the top of the list. The rest of the targets will have the phrase ‘STAT3’ somewhere within the target details (Fig. 2).

Search Targets for STAT3 search results page.

6.Click on the STAT3 card to view the target details.

View target details

7.Follow the steps above, or alternatively, click on the STAT3 (Target) option from the search box auto-complete. This will navigate directly to the STAT3 target details page.

8.The target details page is divided into several sections that highlight an area of knowledge about the target.

9.Scroll down to the “Protein Summary” section. A brief description of the target, as well as several identifiers, are available. In addition, the central radar plot charts the relative knowledge of a target compared to the rest of the Target Central Resource Database (TCRD) on a 0 to 1 scale. This data is sourced from Harmonizome, which is discussed further in Basic Protocol 2 (Fig. 3).

Target details page for STAT3; the radar chart in the center depicts data from Harmonizome.

10.Scroll down to the next section, “IDG Development Level Summary.” Displayed here is the current development level. Each level has the criteria listed, as well as links to the data for each property (Fig. 4).

IDG development level summary section that shows the current development level, and criteria met. Links provide the ability to view either the original source, or the relevant data in Pharos.

11.On the left side panel, click on “Disease Associations by Source.” This will navigate within the page to a section displaying disease associations from a variety of sources.

12.Scroll down to the “Disease Novelty (Tin-x)” section, just below Disease Associations. A scatterplot is visible that shows TIN-X data. This data is explained in Basic Protocol 6. Briefly, it is derived from natural language processing of PubMed abstracts and charts a target's importance to a disease, as well as the novelty of that target to the disease. A dense chart indicates a large amount of knowledge about a target and its disease associations, whereas a sparser chart indicates that the target is not frequently studied and has fewer disease associations (Fig. 5).

Scatterplot depicting TIN-X data for STAT3. Hovering over a data point opens up a tooltip, providing novelty and importance data for the disease.

13.Scroll down to the next section “GWAS Traits.” Here a table of GWAS traits is displayed. This list focuses on scoring and ranking protein-coding genes associated with traits from genome-wide association studies. This allows the discovery of traits most associated with a target, but also less emphasized traits (Fig. 6).

GWAS traits, and the associated TIGA scatterplot. For a more in-depth exploration of this data, click “Explore on Target Illumination GWAS Analytics.”

Finding a list of under-studied targets that share disease associations with STAT3

14.From the STAT3 target details page, click on “Disease Associations by Source” on the left panel.

15.Click on the “Find Similar Targets” button, directly under the panel header (Fig. 7).

Additional functions available within Pharos are shown within blue buttons. Users can click to browse filtered lists for targets similar to the current target, or associated diseases or ligands.

16.The targets list page is now shown, with a target similarity filter applied, showing 17,876 targets (Fig. 8).

List of targets that share associated diseases with STAT3. The Jaccard index is a numerical value for the ratio of overlap between the diseases associated with each target and those associated with the original target (STAT3). The Venn diagram is a visual representation of this ratio, color coded by Target Development Level (TDL).

17.To refine this list for targets of interest to the IDG program (mentioned in Basic Protocol 1), click on the “Refined (2020)” checkbox in the IDG Target Lists filter panel on the left side of the page. The list of targets shown is reduced to 290.

18.To find only dark targets in this list, click the “Tdark” value in the Target Development Level filter panel, returning 48 targets (Fig. 9).

Note

Dark targets are the most under-studied proteins from the three gene families with the most known druggable targets: GPCRs, ion channels, and kinases.

The target list from Figure 8, filtered to display targets with a Target Development Level of Tdark that are on the Refined (2020) IDG target list. Click on “Click for details…” to view an expanded list of the overlapping values.

19.Click on the “click for details…” text on the TMEM63A target card to view a list of associated diseases that this target shares with STAT3 (Fig. 10).

Expanded view of the Associated Disease Similarity section of the target card.
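
The similarity value shown on each target card can be illustrated with a minimal sketch. It assumes the standard definition of the Jaccard index (size of the intersection divided by size of the union); the exact computation inside Pharos may differ, and the disease names below are hypothetical placeholders rather than real Pharos data.

```python
def jaccard_index(diseases_a, diseases_b):
    """Return |A intersect B| / |A union B| for two collections of disease names."""
    a, b = set(diseases_a), set(diseases_b)
    union = a | b
    if not union:
        # Two empty sets share nothing; report 0.0 instead of dividing by zero.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical disease sets, for illustration only.
query_target = {"disease A", "disease B", "disease C", "disease D"}
candidate_target = {"disease C", "disease D", "disease E"}

shared = sorted(query_target & candidate_target)
score = jaccard_index(query_target, candidate_target)
print(shared, round(score, 2))
```

Here the two sets share 2 of 5 distinct diseases, so the index is 0.4. The shared list corresponds to what the expanded “click for details…” view displays for a target card.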

Download target list

20.Click on the downward-facing arrow on the right side of the Targets header (Fig. 11).

Target toolbar illustrating the download button on the right side. To the left of the download button is the upload button, which allows for the uploading of custom lists, to explore in the Pharos interface.

21.A window will pop open displaying a list of fields that can be selected (Fig. 12).

22.Click on the Associated Diseases checkbox. Note that many fields are deactivated, to reduce the overall file size.

23.Click on Name and Target Development Level under the Single Value Fields heading.

24.Click the Run Download Query Button. A file download dialog will open. Depending on the complexity of the target list and the fields selected, this may take some time.

25.After the file is downloaded, this list of targets can be used as a starting point for many of the protocols listed below.
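
As a starting point for downstream use, the downloaded list can be summarized with a short script. The sketch below assumes the download is a CSV file with one row per target-disease pair and columns named "Name", "Target Development Level", and "Associated Disease"; these column names and the row layout are assumptions and should be checked against the header of the actual file. The sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def diseases_by_target(csv_file,
                       name_col="Name",
                       tdl_col="Target Development Level",
                       disease_col="Associated Disease"):
    """Group associated diseases by target name (column names are assumptions)."""
    targets = defaultdict(lambda: {"tdl": None, "diseases": set()})
    for row in csv.DictReader(csv_file):
        entry = targets[row[name_col]]
        entry["tdl"] = row[tdl_col]
        if row.get(disease_col):
            entry["diseases"].add(row[disease_col])
    return dict(targets)


# Hypothetical stand-in for the downloaded file.
sample = io.StringIO(
    "Name,Target Development Level,Associated Disease\n"
    "Target one,Tdark,disease A\n"
    "Target one,Tdark,disease B\n"
    "Target two,Tdark,disease A\n"
)

summary = diseases_by_target(sample)
for name, info in summary.items():
    print(name, info["tdl"], sorted(info["diseases"]))
```

For a real download, replace the in-memory sample with an opened file, for example `with open(path, newline="") as fh: summary = diseases_by_target(fh)`, where `path` points to the downloaded CSV.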

GraphQL queries

26.Click on API on the main Pharos header.

27.A code “sandbox” is now visible, allowing testing of GraphQL queries to fetch complex data from Pharos. A distinct feature of GraphQL is the ability of the consumer to determine the exact fields returned from the query, as opposed to a SQL query, where the data returned is determined by the database developer.

28.Click the “Edit & Run” button for one of the Sample Queries on the left panel, and then the “Play” button in the top center. This will execute the query on the server and display the JSON results in the right panel.

29.Click on the “Docs” tab on the right side of the page. A menu will open up that displays the queries available, the inputs required, and the responses and properties returned. Click on the “Docs” tab again to close the menu.

30.Replace the text in the left column with this query:

query PaginateData {
  batch(
    filter: {
      facets: [
        { facet: "Target Development Level", values: ["Tdark"] }
        { facet: "IDG Target Lists", values: ["Refined (2020)"] }
      ]
      similarity: "(P40763, Associated Disease)"
    }
  ) {
    results: targetResult {
      count
      targets(skip: 0, top: 100) {
        name
        gene: sym
        accession: uniprot
        idgTDL: tdl
        similarityDetails: similarity {
          commonOptions
        }
      }
    }
  }
}

31.Press the play button. This query fetches all dark targets of interest to the IDG that share associated diseases with STAT3. Returned are the target name, gene symbol, UniProt ID, IDG TDL, and shared associated diseases (Fig. 13).

GraphQL sandbox interface. Examples on the left side and documentation on the right allow for highly customizable data requests.
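
The same query can also be sent from a script instead of the sandbox. The following is a minimal sketch using only the Python standard library. The endpoint URL is an assumption that should be confirmed on the Pharos API page, and the function names (`build_payload`, `fetch_page`, `fetch_all`) are hypothetical helpers introduced for illustration. The `skip` and `top` arguments from step 30 are filled in per request so that larger result sets can be retrieved page by page.

```python
import json
import urllib.request
from string import Template

# Assumption: public Pharos GraphQL endpoint; verify on the Pharos API page.
PHAROS_GRAPHQL_URL = "https://pharos-api.ncats.io/graphql"

# The query from step 30, with skip/top left as placeholders.
QUERY_TEMPLATE = Template("""
query PaginateData {
  batch(
    filter: {
      facets: [
        { facet: "Target Development Level", values: ["Tdark"] }
        { facet: "IDG Target Lists", values: ["Refined (2020)"] }
      ]
      similarity: "(P40763, Associated Disease)"
    }
  ) {
    results: targetResult {
      count
      targets(skip: $skip, top: $top) {
        name
        gene: sym
        accession: uniprot
        idgTDL: tdl
        similarityDetails: similarity {
          commonOptions
        }
      }
    }
  }
}
""")


def build_payload(skip=0, top=100):
    """Return the JSON-encoded request body for one page of results."""
    query = QUERY_TEMPLATE.substitute(skip=int(skip), top=int(top))
    return json.dumps({"query": query}).encode("utf-8")


def fetch_page(skip=0, top=100, url=PHAROS_GRAPHQL_URL):
    """POST the query and return the decoded JSON response (network call)."""
    request = urllib.request.Request(
        url,
        data=build_payload(skip, top),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


def fetch_all(page_size=100):
    """Collect every target by advancing skip until count is reached."""
    targets, skip = [], 0
    while True:
        results = fetch_page(skip, page_size)["data"]["batch"]["results"]
        targets.extend(results["targets"])
        skip += page_size
        if skip >= results["count"] or not results["targets"]:
            return targets


# Offline check of the request body; no network access is needed for this.
payload = json.loads(build_payload(skip=0, top=100))
print(payload["query"].count("Tdark"))
```

Calling `fetch_all()` contacts the server and returns a list of dictionaries keyed by the aliases used in the query (`name`, `gene`, `accession`, `idgTDL`, `similarityDetails`).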

Entire relational database download page

32.Navigate to the TCRD website (http://juniper.health.unm.edu/tcrd/).

33.Click on the “Downloads” tab on the navigation bar at the top of the page to be redirected to a table of downloadable files, e.g., a MySQL dump of the full TCRD (latest.sql.gz).
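
Before loading a dump of this kind into a MySQL server, it can be useful to see which tables it defines. The sketch below streams a gzip-compressed SQL dump and collects table names from its CREATE TABLE statements. It assumes the dump writes each statement as ``CREATE TABLE `name` (`` at the start of a line, which is the usual mysqldump layout but has not been verified for latest.sql.gz. The in-memory sample and its table names are hypothetical.

```python
import gzip
import io
import re

# Matches the usual mysqldump form: CREATE TABLE `name` (
CREATE_TABLE = re.compile(r"^CREATE TABLE (?:IF NOT EXISTS )?`([^`]+)`")


def list_tables(dump_file):
    """Yield table names found in a text-mode SQL dump, line by line."""
    for line in dump_file:
        match = CREATE_TABLE.match(line)
        if match:
            yield match.group(1)


# Hypothetical miniature dump standing in for the real download.
sample_sql = (
    "-- MySQL dump\n"
    "DROP TABLE IF EXISTS `target`;\n"
    "CREATE TABLE `target` (\n"
    "  `id` int NOT NULL\n"
    ");\n"
    "CREATE TABLE `protein` (\n"
    "  `id` int NOT NULL\n"
    ");\n"
)
compressed = gzip.compress(sample_sql.encode("utf-8"))

# gzip.open accepts a path or a file object; text mode decodes as it reads.
with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as dump:
    tables = list(list_tables(dump))
print(tables)
```

For the real file, pass its path to `gzip.open` in place of the in-memory buffer; adding `errors="replace"` guards against bytes that do not decode as UTF-8. Because the file is read line by line, the full dump never has to fit in memory.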

Basic Protocol 2: ACCESSING THE DATA IN HARMONIZOME

The Harmonizome resource contains processed datasets detailing functional associations between genes/proteins and their attributes extracted from 66 online resources. The information from the original datasets was distilled into attribute tables that define significant associations between genes and their attributes, where attributes could be other genes, proteins, pathways, cell lines, tissues, experimental perturbations, diseases, phenotypes, drugs, or other entities depending on the dataset. The Harmonizome web application can be accessed from https://maayanlab.cloud/Harmonizome/ (Rouillard et al., 2016).