Bioinformatic Analysis to Investigate Metaproteome Composition Using Trans-Proteomic Pipeline

Steven He, Shoba Ranganathan

Published: 2022-07-21 DOI: 10.1002/cpz1.506

Abstract

With evidence emerging that the microbiome has a role in the onset of many human diseases, including cancer, analyzing these microbial communities and their proteins (i.e., the metaproteome) has become a powerful research tool. The Trans-Proteomic Pipeline (TPP) is a free, comprehensive software suite that facilitates the analysis of mass spectrometry (MS) data. By utilizing available microbial proteomes, TPP can identify microbial proteins and species, with an acceptable peptide false-discovery rate (FDR). An application to a publicly available oral cancer dataset is presented as an example to identify the viral metaproteome on the oral cancer invasive tumor front. © 2022 The Authors. Current Protocols published by Wiley Periodicals LLC.

Basic Protocol 1: Collection of data and resources

Basic Protocol 2: Analysis of MS data using TPP

Basic Protocol 3: Analysis of TPP output using R in RStudio

INTRODUCTION

Since the advent of the Human Microbiome Project in 2007 (Turnbaugh et al., 2007), there has been growing research interest in the human microbiome, which refers to the collective aggregate of all microorganisms, including the fungal mycobiome, colonizing on/within human tissues such as the skin, digestive tract, and genitalia. The loss of biodiversity and disruption of microbial homeostasis within these microbial communities, otherwise known as dysbiosis, has since been associated with a wide range of diseases and health conditions, including inflammatory bowel disease (Ni, Wu, Albenberg, & Tomov, 2017), autism spectrum disorders (Kang et al., 2017), pre-term birth (Proctor et al., 2019), and a wide range of cancers (Aykut et al., 2019; Picardo, Coburn, & Hansen, 2019). By studying the metaproteome, which refers to the collective proteome encoded by the microbiome, researchers are able to gain new insights into the microbial compositions associated with different disease states, and also identify differentially expressed functional pathways, if any.

The current paper aims to assist interested researchers in performing metaproteomic analyses using publicly available datasets from repositories such as ProteomeXchange (Vizcaíno et al., 2014), where MS data are available for a wide range of human diseases but have often been analyzed only in the context of the human proteome. The paper also explains the use of the programming language R (Ihaka & Gentleman, 1996) and of the Trans-Proteomic Pipeline (TPP), a free, high-quality suite of processing tools for the analysis of mass spectrometry (MS) data (Deutsch et al., 2011, 2015). TPP is freely available from the Seattle Proteome Center website (http://tools.proteomecenter.org/wiki/index.php?title=Software:TPP).

Basic Protocol 1: COLLECTION OF DATA AND RESOURCES

Here we describe how to access and download various data resources required for analysis, including the collection of MS data, reference data, and taxonomic information, from the most common publicly available databases.

The Proteomics IDEntifications (PRIDE) database is one of the most commonly used MS data repositories. Specialized, disease-specific repositories also exist, such as the Clinical Proteomic Tumor Analysis Consortium (CPTAC) data portal (https://proteomics.cancer.gov/data-portal), which holds data from various cancer tumors (Edwards et al., 2015), but these will not be discussed here. Of note, annotation quality varies by dataset, as does the methodology used for data acquisition, and both should be taken into consideration when planning your own analyses. In particular, it is recommended to select datasets where mechanical disruption (e.g., ultrasonication, bead beating) and detergent (e.g., SDS) have been used for protein extraction (Zhang & Figeys, 2019).

The UniProt (https://www.uniprot.org/proteomes/) database is one of the largest repositories for protein reference sequences (Bateman, 2019). UniProt has the advantage of providing a comprehensive, high-quality, and freely accessible resource of protein sequence and functional information. It incorporates both the manually annotated and curated SwissProt resource and the computationally analyzed TrEMBL data, which await full manual annotation. UniProt is also regularly updated and contains microbial (bacterial, viral, and fungal) proteomes.

While the use of existing reference data has the advantage of being easily generalized to most experimental designs, it comes at the expense of database specificity. Other methods to prepare reference databases include the use of prior sample metagenome/metatranscriptome data for construction, or of alternative reference database sources that are specific to certain biological niches. Several examples of such human and murine microbiome databases, which can substitute for UniProt, are provided in Table 1. The databases selected for inclusion here all have protein FASTA sequence files readily available for reference database construction. While the level of protein annotation for the mouse reference gut microbiome (MRGM; Kim et al., 2021) is somewhat lacking, it remains one of the more comprehensive murine gut reference sources. Although bacterial reference data are readily available from these databases, options for the fungal and viral components of the microbiome currently remain limited in these repositories, representing potential avenues for future research and development. Additionally, it is important to note that as new organisms continue to be sequenced, reference databases need to be updated regularly. To construct up-to-date and niche reference databases, alternative methods of database construction, comprehensively reviewed elsewhere (Blakeley-Ruiz & Kleiner, 2022), may need to be adopted.

Table 1. Alternative Targeted Microbiome Databases/Datasets for Metaproteomic Reference Sequences

Database	Utility	Limitations
Expanded human oral microbiome database (Escapa et al., 2018)	Curated information on bacteria present in the human mouth and aerodigestive tract based on metagenomic sequencing; corresponding NCBI taxonomy IDs provided for identified species; periodically updated with corresponding release notes	Lack of information regarding fungal and viral species
Unified human gastrointestinal protein catalog (Almeida et al., 2021)	Curated information on prokaryotic microbiota of the human gut based on metagenomic sequencing; protein coding sequences are available for individual organisms, or collectively clustered at 100%, 95%, 90%, and 50% amino acid identity; periodically updated approximately every 6–12 months	Lack of information regarding fungal and viral species; not readily able to map species information to existing NCBI taxonomy IDs
Mouse oral microbiome database (Joseph et al., 2021)	Curated information on bacteria present in the murine oral cavity based on metagenomic sequencing; corresponding NCBI taxonomy IDs provided for identified species	Lack of information regarding fungal and viral species; evidence of updates, though schedule/frequency is unclear
Mouse reference gut microbiome (Kim et al., 2021)	Curated information on bacteria present in mouse gut based on metagenomic sequencing; protein coding sequences available clustered at 100%, 95%, 90%, 70%, and 50% amino acid identity with lowest common ancestor identified; functional annotation based on EggNOG database (Huerta-Cepas et al., 2019)	Lack of information regarding fungal and viral species; lack of protein annotation in .faa FASTA files; not readily able to map species information to existing NCBI taxonomy IDs; evidence of updates, though schedule/frequency is unclear

Finally, the National Center for Biotechnology Information (NCBI) taxonomy database (https://www.ncbi.nlm.nih.gov/taxonomy/) is one of the largest sources of taxonomic information (Schoch et al., 2020), and is also the taxonomy used by UniProt.

This protocol is divided into the following parts:

1. Collection of raw MS data from PRIDE
2. Collection of microbial proteomes from UniProt for reference database creation
3. Collection of taxonomic information from NCBI.

Necessary Resources

Hardware

Any computer system with Internet access and a browser. Depending on the size of the datasets to be downloaded, additional storage may be required in the form of external hard drives.

Software

For File Transfer Protocol (FTP) downloads, depending on how the data are organized, the software FileZilla may be used to streamline the download of many files. FileZilla is freely available at https://filezilla-project.org/ and is supported on Windows, macOS, and Linux. Additionally, software capable of extracting files from the .gz format is required; 7-Zip is a free option available at https://www.7-zip.org/. Depending on the size of the microbial database being used, additional software capable of reading and editing large text files, such as EmEditor (https://www.emeditor.com), is also recommended.

Collection of raw MS data from PRIDE

1. Using your Internet browser, navigate to the PRIDE database (https://www.ebi.ac.uk/pride/) (Martens et al., 2005). Enter a search query to retrieve relevant datasets. If a publication has uploaded its dataset to PRIDE, the dataset can instead be retrieved by entering its PRIDE identifier. This is often in the format “PXD” followed by a 6-digit number (e.g., PXD123456).

2. Clicking on an entry will bring up an information summary about the dataset. Scrolling down to the bottom of the page will show the project files that are available for download. The project files from PXD007232 (https://www.ebi.ac.uk/pride/archive/projects/PXD007232), an MS study on oral cancer (Carnielli et al., 2018), are shown as an example in Figure 1.

Figure 1. Screenshot of PRIDE PXD007232 database project files. (A) Input field to search for specific files within a project. (B) Link to open the FTP page of a project dataset. (C) Link to initiate download of an individual file. (D) Drop-down menu to change the number of files displayed per page.

3. Individual files from a project dataset can be downloaded selectively by clicking “FTP” next to the file of interest. When there are many files to download, it is recommended to click “Project FTP” and use the FileZilla FTP download software, freely available from https://filezilla-project.org/.

4. Clicking “Project FTP” will open a new window with a list of all available files for a particular project. Open FileZilla and copy the website URL into the “Host” field in FileZilla to establish a remote connection with the PRIDE FTP server (Fig. 2A).

Figure 2. FileZilla software with remote connection to PRIDE database FTP server project PXD007232. (A) Input field to connect to host; when copying URLs, ensure that the input begins with “ftp”. (B) Local directory pane. Navigate to the desired location to which the files will be downloaded; in this case files will be downloaded to the folder “MS_dataset”. (C) Remote site pane; shows the directory layout of the FTP server that has been connected to. (D) Remote file directory. Shows files in the currently selected folder from (C). Multiple files can be selected, and then actions can be performed by right-clicking.

5. Once the connection is established, use the local navigation pane to select the destination to which files will be downloaded (Fig. 2B). If the URL was entered correctly in step 4, there should be no need to use the remote site navigation (Fig. 2C), and the project files should be visible (Fig. 2D). After selecting all desired files, right-click and select “Download” to begin downloading them. This approach is more convenient than individual downloads when a large number of MS data files are required.
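For readers who prefer a scripted alternative to the FileZilla steps above, the following is a minimal Python sketch using only the standard library. It is not part of the published protocol. The function names, the `.raw` file suffix, and the example URL in the usage note are illustrative assumptions; the real FTP URL must be copied from the “Project FTP” link as described in step 4, and the identifier check only reflects the “PXD plus six digits” pattern that PRIDE identifiers often follow.

```python
import re
from ftplib import FTP
from pathlib import Path
from urllib.parse import urlparse

# Pattern for the common identifier form described in step 1 (assumption:
# exactly "PXD" followed by six digits; other forms may exist).
PXD_PATTERN = re.compile(r"^PXD\d{6}$")


def is_pxd_identifier(text):
    """Return True if text looks like 'PXD' followed by six digits."""
    return bool(PXD_PATTERN.match(text))


def split_ftp_url(url):
    """Split an ftp:// URL into (host, remote_directory)."""
    parsed = urlparse(url)
    if parsed.scheme != "ftp":
        # Mirrors the Figure 2A advice that the input must begin with "ftp".
        raise ValueError("URL must begin with ftp://")
    return parsed.hostname, parsed.path


def download_project(url, dest_dir, suffix=".raw"):
    """Download every file whose name ends in `suffix` from an FTP directory.

    The `.raw` default is an assumption; adjust it to the file type that the
    chosen project actually provides.
    """
    host, remote_dir = split_ftp_url(url)
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    downloaded = []
    with FTP(host) as ftp:
        ftp.login()  # anonymous login
        ftp.cwd(remote_dir)
        for name in ftp.nlst():
            if not name.lower().endswith(suffix):
                continue
            with open(dest / name, "wb") as handle:
                ftp.retrbinary(f"RETR {name}", handle.write)
            downloaded.append(name)
    return downloaded
```

A call such as `download_project("ftp://ftp.example.org/path/to/project", "MS_dataset")` (placeholder URL) would save the matching files into the local folder “MS_dataset”, the same destination used in Figure 2.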

Collection of microbial proteomes from UniProt for reference database creation

6. Open your preferred Internet browser and navigate to UniProt proteomes (https://www.uniprot.org/proteomes).

7. On the left-hand sidebar under “Filter by”, select “Reference proteomes” (Fig. 3A).

Figure 3. Screenshot of UniProt. (A) Sidebar selection to filter by reference proteomes. (B) Sidebar selection to filter by either Bacteria or Viruses, depending on the desired microbes to be searched for. (C) Sidebar selection to show all proteins mapped to the currently selected species’ proteomes.

8. On the same left-hand sidebar, select “Bacteria” or “Viruses” to filter by the desired microbe. Here, “Viruses” will be used as an example (Fig. 3B).

9. With all viral reference proteomes now selected, on the left-hand sidebar under “Map to”, select “UniProtKB”. This will show all protein entries mapping to the selected viral proteomes. Selecting “Download” from above the results window will allow all returned entries to be downloaded into a single FASTA file, which can be used as the reference database. For the example reference database, all viral reference proteomes were downloaded in FASTA format, totaling 594,570 protein sequences (515,513 viral proteins, 79,057 human proteins; files downloaded 15 February 2022).

10. Of note, as fungi are grouped in Eukaryota on the sidebar, a separate query, “taxonomy:Fungi [4751]”, must be entered into the search bar to obtain fungal sequences. After this, select “Reference proteomes” and map to “UniProtKB” as previously described before downloading as FASTA.

11. Assuming that metaproteomic analysis is being performed on human samples, a copy of the human reference proteome should also be downloaded (available from https://www.uniprot.org/proteomes/UP000005640) and appended to the previously downloaded FASTA reference database (e.g., by copy and paste). Including the host proteome improves the FDR, as spectra originating from host proteins are then less likely to be falsely matched to microbial peptides. If an alternate host species is being used (e.g., mouse), then the corresponding reference proteome should be appended instead.

Note

If the database file exceeds approximately 1 GB in size, terminal commands or additional reading/editing software such as EmEditor may be required to perform this step. If the final database file exceeds 2 GB in size, it is recommended to split it into multiple smaller files of no more than approximately 2 GB each, using software such as EmEditor.
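As an alternative to copy and paste or a text editor for large files, the appending and splitting described above can be sketched in Python. This is a hedged illustration, not the authors' procedure: the function names and file names are hypothetical, and the tally relies on the `OX=` (organism taxonomy identifier) field that UniProt writes into its FASTA header lines. Files are streamed rather than loaded whole, and splitting happens only at header lines so that no sequence record is cut in two.

```python
import re
from collections import Counter
from pathlib import Path

OX_PATTERN = re.compile(r"\bOX=(\d+)")  # taxonomy ID field in UniProt headers


def _ends_with_newline(path):
    """Return True if the file is missing, empty, or ends with a newline."""
    p = Path(path)
    if not p.exists() or p.stat().st_size == 0:
        return True
    with open(p, "rb") as handle:
        handle.seek(-1, 2)
        return handle.read(1) == b"\n"


def append_fasta(source, target, chunk_size=1 << 20):
    """Append `source` to `target` in chunks, so neither file is read whole."""
    needs_newline = not _ends_with_newline(target)
    with open(source, "rb") as src, open(target, "ab") as dst:
        if needs_newline:
            dst.write(b"\n")  # keep the first appended header on its own line
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)


def count_by_taxid(fasta_path):
    """Count sequences per OX= taxonomy ID (e.g., 9606 for human)."""
    counts = Counter()
    with open(fasta_path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith(">"):
                match = OX_PATTERN.search(line)
                counts[match.group(1) if match else "unknown"] += 1
    return counts


def split_fasta(fasta_path, out_dir, max_bytes=2 * 1024**3):
    """Split a FASTA file into parts, starting a new part at the first header
    line seen after `max_bytes` has been written. A part can therefore exceed
    the limit by up to one record, but records are never split."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    parts, handle_out, written = [], None, 0
    with open(fasta_path, "rb") as src:
        for line in src:
            if handle_out is None or (line.startswith(b">") and written >= max_bytes):
                if handle_out is not None:
                    handle_out.close()
                part = out / f"{Path(fasta_path).stem}_part{len(parts) + 1}.fasta"
                handle_out = open(part, "wb")
                parts.append(part)
                written = 0
            handle_out.write(line)
            written += len(line)
    if handle_out is not None:
        handle_out.close()
    return parts
```

For example, `append_fasta("human.fasta", "viral.fasta")` followed by `count_by_taxid("viral.fasta")["9606"]` (hypothetical file names) gives the number of human sequences now present in the combined database, which can be checked against the count reported by UniProt at download time.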

Collection of taxonomic information from NCBI

12. The National Center for Biotechnology Information (NCBI) taxonomy database, on which UniProt's taxonomy is based, can be downloaded from https://ftp.ncbi.nih.gov/pub/taxonomy/new_taxdump/. Download one of “new_taxdump.tar.Z”, “new_taxdump.tar.gz”, or “new_taxdump.zip”; all three contain the same data and are provided for convenience in unpacking on different operating systems. The file “taxdump_readme.txt” provides more information.

13. Extract “rankedlineage.dmp” from the downloaded file. The .dmp file contains the taxonomic information for all organisms in the NCBI database, and will be used in Basic Protocol 3.
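The extraction in step 13 can also be scripted. The sketch below, in Python with only the standard library, pulls “rankedlineage.dmp” out of “new_taxdump.tar.gz” and reads lineages for a chosen set of taxonomy IDs. It assumes the usual NCBI .dmp layout, in which fields are separated by a tab, a pipe, and a tab, and each line ends with a tab and a pipe; only the first two columns (taxonomy ID and name) are named here, because the order of the remaining rank columns should be checked in “taxdump_readme.txt”. The function names are hypothetical, and Basic Protocol 3 itself uses R for the downstream analysis.

```python
import shutil
import tarfile
from pathlib import Path


def extract_member(archive_path, file_name, dest_dir):
    """Copy one named file out of a .tar.gz archive into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if member.isfile() and Path(member.name).name == file_name:
                target = dest / file_name
                with archive.extractfile(member) as src, open(target, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                return target
    raise FileNotFoundError(f"{file_name} not found in {archive_path}")


def parse_dmp_line(line):
    """Split one .dmp line into fields (separator: tab, pipe, tab)."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the end-of-line marker
    return [field.strip() for field in line.split("\t|\t")]


def load_lineages(dmp_path, wanted_ids):
    """Return {tax_id: {"tax_name": ..., "ranks": [...]}} for the wanted IDs.

    Only requested IDs are kept, because the full file is large.
    """
    wanted = {str(tax_id) for tax_id in wanted_ids}
    lineages = {}
    with open(dmp_path, "r", encoding="utf-8") as handle:
        for line in handle:
            fields = parse_dmp_line(line)
            if fields[0] in wanted:
                # Remaining columns are rank names in the order documented
                # in taxdump_readme.txt (not assumed here).
                lineages[fields[0]] = {"tax_name": fields[1], "ranks": fields[2:]}
    return lineages
```

Restricting the lookup to the taxonomy IDs that actually occur in the reference database (for instance, the `OX=` values in the UniProt FASTA headers) keeps the resulting table small enough to pass on to later analysis steps.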

Basic Protocol 2: ANALYSIS OF MS DATA USING TPP

Here we describe the use of various TPP modules for the analysis of the downloaded MS data using the constructed reference database. This includes conversion of the MS data, database searching, and peptide and protein validation. The protocol covers the following workflows:

1. Starting up TPP
2. Conversion of proprietary file formats using msconvert
3. Database searching using Comet
4. Peptide validation using PeptideProphet
5. Protein inference using ProteinProphet.