Using ExpressAnalyst for Comprehensive Gene Expression Analysis in Model and Non-Model Organisms

Guangyan Zhou, Jessica Ewald, Jianguo Xia, Yao Lu

Published: 2023-11-06 DOI: 10.1002/cpz1.922

differential expression analysis

Abstract

ExpressAnalyst is a web-based platform that enables intuitive, end-to-end transcriptomics and proteomics data analysis. Users can start from FASTQ files, gene/protein abundance tables, or gene/protein lists. ExpressAnalyst will perform read quantification, gene expression table processing and normalization, differential expression analysis, or meta-analysis with complex study designs. The results are presented via various interactive visualizations such as volcano plots, heatmaps, networks, and ridgeline charts, with built-in functional enrichment analysis to allow flexible data exploration and understanding. ExpressAnalyst currently contains built-in support for 29 common organisms. For non-model organisms without good reference genomes, it can perform comprehensive transcriptome profiling directly from RNA-seq reads. These common tasks are covered in 11 Basic Protocols. © 2023 The Authors. Current Protocols published by Wiley Periodicals LLC.

Basic Protocol 1 : RNA-seq count table uploading, processing, and normalization

Basic Protocol 2 : Differential expression analysis with linear models

Basic Protocol 3 : Functional analysis with volcano plot, enrichment network, and ridgeline visualization

Basic Protocol 4 : Hierarchical clustering analysis of transcriptomics data using interactive heatmaps

Basic Protocol 5 : Cross-species gene expression analysis based on ortholog mapping results

Basic Protocol 6 : Proteomics and microarray data processing and normalization

Basic Protocol 7 : Preparing multiple gene expression tables for meta-analysis

Basic Protocol 8 : Statistical and functional meta-analysis of gene expression data

Basic Protocol 9 : Functional analysis of transcriptomics signatures

Basic Protocol 10 : Dose-response and time-series data analysis

Basic Protocol 11 : RNA-seq reads processing and quantification with and without reference transcriptomes

INTRODUCTION

With the rapid progress in sequencing and mass spectrometry technologies, studies involving omics data collection are becoming ubiquitous in life sciences. Making sense of these large, complex omics datasets requires advanced and specialized analysis pipelines, and many researchers do not have the bioinformatics or programming skills to handle these data (Alyass et al., 2015). There is an urgent demand for user-friendly software to relieve the omics data analysis bottleneck. Here we provide detailed protocols on using ExpressAnalyst, a web-based platform that provides end-to-end support for common tasks involved in transcriptomics data analysis (Liu et al., 2023). While many of the modules were originally designed for RNA-seq or microarray data (Zhou et al., 2019), we have added proteomics-specific annotation libraries and normalization methods so that the differential expression and functional analysis methods can be used to analyze abundance tables from proteomics.

The core statistical and functional analysis modules were originally part of the NetworkAnalyst tool, and our previous protocol (Xia et al., 2015) covers some of this functionality. The general statistical and functional analysis modules were split from the network analysis module to form the basis of ExpressAnalyst. ExpressAnalyst was then expanded to include bulk RNA-seq processing as well as annotation and functional libraries for ecological species (common reference transcriptomes and Seq2Fun ortholog IDs), as described in our recent publication (Liu et al., 2023). All modules were further modified to support complex metadata, including continuous variables and the ability to consider multiple factors during differential expression analysis, and to support proteomics intensity/abundance tables. Finally, ExpressAnalyst integrates the FastBMD workflow to enable dose-response and time-series analysis (Ewald et al., 2021). The web interface has also been configured to display a live R command history throughout the analysis. Users with basic R scripting skills can install the ExpressAnalystR package (see Internet Resources) for batch processing and for transparent, reproducible analysis. The command history can also be reported as supplementary material in any publication using ExpressAnalyst.

A general transcriptomics analysis has four main steps: raw data processing, filtering and normalization, statistical analysis, and functional analysis (Fig. 1) (Conesa et al., 2016). Each step produces different results: raw data processing generates a table of expression values; the filtering and normalization step produces a clean, normalized table; statistical analysis generates a list of significant features; and functional analysis produces a list of impacted pathways and biological processes. ExpressAnalyst has different modules that serve as “entry points” to the general pipeline: if researchers download and save their results, they can start the analysis at any of the steps indicated in Figure 1. The various protocols that address each of the four steps are outlined in Figure 1. While RNA-seq read processing is chronologically first, we present it last (Basic Protocol 11) as it has the most complicated hardware and software requirements and is not performed frequently. RNA-seq read quantification is usually performed only once and often by dedicated bioinformaticians at a core facility, and the resulting count table is provided to researchers as the most common starting point for exploratory analysis. Basic Protocols 1 to 4 cover filtering and normalization, statistical analysis, and functional analysis of a standard RNA-seq count table. Basic Protocol 5 covers the same main steps but with a cross-species dataset that includes multiple non-model species. It highlights features within ExpressAnalyst that were designed specifically for species without high-quality reference genomes or transcriptomes. Basic Protocol 6 covers filtering and normalization methods specific for microarray or proteomics tables. Basic Protocols 7 and 8 introduce meta-analysis of a set of expression tables including filtering, normalization, and statistical and functional analysis. 
Basic Protocol 9 briefly covers how lists of significant features generated by previous statistical analyses can be uploaded for comparison and functional analysis. Basic Protocol 10 introduces a specialized statistical and functional analysis for dose-response or time-series expression data. Finally, Basic Protocol 11 describes a unified workflow for processing RNA-seq FASTQ files from both model and non-model species. Together, these protocols introduce how ExpressAnalyst empowers researchers to comprehensively analyze their own transcriptomics or proteomics datasets, without programming skills or advanced bioinformatics experience.

Figure 1. Overview of the basic protocol scope. The four main steps of a transcriptomics pipeline are outlined on the right side, along with the relevant basic protocols that cover each step. The icons on the left side are the modules in ExpressAnalyst that accept different input formats.

Here, we present 11 basic protocols to introduce readers to the different ExpressAnalyst modules that can be used for raw data processing, statistical analysis, and functional analysis, outlining which workflows are appropriate for model vs non-model species, transcriptomics vs proteomics datasets, and for various common data input formats. They are summarized below.

Basic Protocol 1: How to upload, process, and normalize an RNA-seq count table in preparation for statistical and functional analysis.

Basic Protocol 2: How to perform differential expression analysis for simple and complex experimental designs.

Basic Protocol 3: How to perform functional analysis and interpret the results with volcano plots, enrichment networks, and ridgeline charts.

Basic Protocol 4: How to use hierarchical clustering and heatmaps to perform an unsupervised, exploratory analysis.

Basic Protocol 5: How to perform statistical and functional analysis of a cross-species RNA-seq count table generated by ortholog mapping with Seq2Fun.

Basic Protocol 6: How to filter and normalize microarray and proteomics intensity tables.

Basic Protocol 7: How to upload, process, and normalize a set of gene expression tables for meta-analysis.

Basic Protocol 8: How to perform statistical and functional meta-analysis of gene expression data.

Basic Protocol 9: How to analyze single or multiple gene expression signatures.

Basic Protocol 10: How to perform dose-response and time-series analysis.

Basic Protocol 11: How to process FASTQ files to obtain a gene count table with or without using a reference transcriptome.

Basic Protocol 1: RNA-seq COUNT TABLE UPLOADING, PROCESSING, AND NORMALIZATION

The objective of this protocol is to prepare the data for downstream differential expression and functional analysis using ExpressAnalyst. This includes formatting the input files, mapping transcript identifiers to the internal annotation database, performing a basic quality check on the data, and applying filtering and normalization to remove non-informative genes and to correct for systematic technical differences. This protocol assumes that RNA-seq reads have already been aligned to a transcriptome and summarized in a count table, which is the case for most researchers. If this is not the case and you must start from FASTQ files, please see Basic Protocol 11. This protocol is written specifically for RNA-seq count data. ExpressAnalyst also accepts abundance tables produced from microarray or proteomics experiments. Many of the overall concepts are the same; however, count data require specific normalization techniques. For a discussion of microarray intensity and proteomics abundance data processing, please see Basic Protocol 6.

Basic Protocols 1 to 4 use the same dataset, an RNA-seq count file measured in mouse liver (Diamante et al., 2021). It has been previously shown that bisphenol-A (BPA) exposure during pregnancy leads to cardiometabolic disease in offspring. The objective of the original study was to elucidate the mode of action underlying this outcome. The authors exposed pregnant mice to BPA and collected RNA-seq data in the liver from offspring of both sexes, along with bodyweight, insulin secretion, and targeted lipids in the liver and plasma samples. Differential gene expression analysis was conducted between the exposed and control groups to understand the observed phenotypic differences and metabolic outcomes.

Necessary Resources

Hardware

A computer with internet access

Software

An up-to-date web browser such as Google Chrome, Mozilla Firefox, or Safari, with JavaScript enabled (see Internet Resources)

Files

None

1. Go to the ExpressAnalyst home page (https://www.expressanalyst.ca) and click the “Tutorials” link at the top menu bar to visit the tutorial page. Scroll to the bottom of the page and find the “Dataset for the ExpressAnalyst Current Protocol” data section. Download the two text files labeled “mouse_counts.csv” and “mouse_metadata.csv.” Open them in a spreadsheet program or a text editor to view the data format (Fig. 2).

Note

The most frequent help requests that we receive are related to data and metadata formatting. The input files are displayed in Figure 2, including a gene count table for 16 samples (Fig. 2A), and a metadata table describing these samples (Fig. 2B). The first two columns in the metadata table show the main study design: 4 BPA exposed and 4 controls for both male and female mice. Liver and plasma lipids were also measured, which are the continuous values displayed in the remainder of the metadata columns.

Figure 2. Example omics data and metadata tables. The gene count table (A) shows the required format for an RNA-seq count table, with sample names in columns and transcript identifiers (here, Official Gene Symbol) in rows. The row surrounded by the red dashed box is an example of a transcript with no counts detected for any sample. These rows will be automatically removed during data upload. The metadata table (B) has a different format, with metadata variables in columns and sample names in rows. Notice the shared sample names in A and B (highlighted with the solid blue boxes). Exactly matching sample names are required for ExpressAnalyst to correctly pair information across input files. Missing values in the metadata should be left blank, as shown in the two empty cells in B surrounded by the dashed orange box. In both A and B, the column headers (sample names and metadata variable names) should not contain spaces or special characters (e.g., % or /), which could lead to errors during data upload.
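The formatting rules in this caption can be checked locally before upload. The following Python sketch is only an illustration of those rules and is not part of ExpressAnalyst: the helper names, the sample names, and the exact set of characters treated as safe are all assumptions made for the example.

```python
import re

# Assumption: letters, digits, underscore, dot, and hyphen are safe in headers.
# ExpressAnalyst may accept or reject a different set of characters.
SAFE_NAME = re.compile(r"^[A-Za-z0-9_.\-]+$")

def check_tables(count_samples, meta_samples, meta_variables):
    """Return a list of formatting problems (an empty list means none found)."""
    problems = []
    # Column headers in both tables: sample names and metadata variable names.
    for name in list(count_samples) + list(meta_variables):
        if not SAFE_NAME.match(name):
            problems.append(f"unsafe header name: {name!r}")
    # Sample names must match exactly between the two files.
    only_counts = sorted(set(count_samples) - set(meta_samples))
    only_meta = sorted(set(meta_samples) - set(count_samples))
    if only_counts:
        problems.append(f"samples missing from metadata: {only_counts}")
    if only_meta:
        problems.append(f"samples missing from count table: {only_meta}")
    return problems

def drop_all_zero_rows(counts):
    """Remove features that have zero counts in every sample."""
    return {gene: row for gene, row in counts.items() if any(v != 0 for v in row)}

# Hypothetical sample names; the third count-table header contains spaces.
problems = check_tables(
    count_samples=["F_Ctrl_1", "F_BPA_1", "M BPA 1"],
    meta_samples=["F_Ctrl_1", "F_BPA_1", "M_BPA_1"],
    meta_variables=["Treatment", "Sex", "liver_TG"],
)
for problem in problems:
    print(problem)
```

Running a check of this kind on the header rows of both files catches the most common upload errors (mismatched sample names and unsafe characters) before any data leave your computer.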

2. Go back to the ExpressAnalyst home page and click “Start Here” to access the Module Overview page. Locate the “Statistical & Functional Analysis” section and click the “Start Here” button underneath the single gene expression table input type. On the Data Upload page, set the organism to “M. musculus (mouse),” leave the “Analysis Type” as “Differential Expression,” set the data type to “Counts (bulk RNA-seq),” and the ID type to “Official Gene Symbol.”

Note

Users must specify the correct organism and ID type so that ExpressAnalyst can map feature IDs to its internal annotation databases. Upon data upload, all IDs are converted to Entrez IDs. When multiple features are mapped to the same Entrez ID (for example, different transcript isoforms for the same gene), their expression values are summed if the data are counts and averaged if the data are intensities (microarray, proteomics). If your data are pre-normalized counts, you should upload them as intensities.
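The collapsing rule described in this note (sum for counts, average for intensities) can be sketched as follows. This is a minimal illustration and not the ExpressAnalyst implementation: the function name, the transcript and gene identifiers, and the choice to keep unmapped features under their original ID are assumptions.

```python
from collections import defaultdict

def collapse_features(rows, id_map, data_type):
    """Collapse features that map to the same gene ID.

    rows: list of (feature_id, [values per sample])
    id_map: dict mapping feature_id -> gene ID (hypothetical mapping table)
    data_type: "counts" (values are summed) or "intensities" (values are averaged)
    """
    if data_type not in ("counts", "intensities"):
        raise ValueError("data_type must be 'counts' or 'intensities'")
    grouped = defaultdict(list)
    for feature_id, values in rows:
        # Assumption: unmapped features are kept under their original ID.
        grouped[id_map.get(feature_id, feature_id)].append(values)
    collapsed = {}
    for gene, members in grouped.items():
        totals = [sum(column) for column in zip(*members)]
        if data_type == "counts":
            collapsed[gene] = totals
        else:
            collapsed[gene] = [t / len(members) for t in totals]
    return collapsed

# Two isoforms (Tx1a, Tx1b) map to the same hypothetical gene ID "101".
rows = [("Tx1a", [10, 20]), ("Tx1b", [5, 0]), ("Tx2", [7, 3])]
id_map = {"Tx1a": "101", "Tx1b": "101", "Tx2": "202"}
print(collapse_features(rows, id_map, "counts"))
print(collapse_features(rows, id_map, "intensities"))
```

Summing is the natural choice for counts because reads from different isoforms of a gene add up to the gene-level total, whereas intensities are already on a relative scale and are averaged instead.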

Note

It is possible to skip the annotation step by leaving the organism and ID type as “Not specified.” This may be desirable if your species or ID type is not supported, or if you'd like to retain the transcript-level resolution. Please note that in this case, functional analysis will be disabled as functional analysis requires gene level annotation.

3. Choose the “mouse_counts.csv” file for the data file, and the “mouse_metadata.csv” for the metadata file. Leave the “Metadata included” box unchecked and click “Submit.” Once the upload has finished, various summary messages will be displayed in the top right corner. Click “Proceed” to view this information in more detail on the next page.

Note

ExpressAnalyst accepts files in either comma-separated (.csv) or tab-delimited (.txt) format. Users have the option of either embedding metadata in the count table and uploading a single file, or formatting metadata in a separate table and uploading two files. The former strategy can be used for simple experimental designs with one or two metadata variables, while the latter is suitable for datasets with complex metadata.

4. On the Data Quality Check page, examine the text summary of the uploaded datasets in the gray box at the top of the “Omics data overview” tab. It shows the sample size, the percentage of features that are matched to the annotation database, as well as the number and type of experimental factors.

Note

The annotation libraries in ExpressAnalyst are updated about once per year, based on the latest ID versions available from NCBI (Entrez, RefSeq), Ensembl, and UniProt (Brown et al., 2015; UniProt Consortium, 2019; Zerbino et al., 2018). If your data were annotated many years ago, you may have a lower percentage of features that map to the ExpressAnalyst database. Also, Official Gene Symbols generally have a lower mapping rate than the other ID types since there can be many synonyms for the same gene, not all of which may be present in our database.

5. Scroll down to view various diagnostic graphics, the first of which is the “Box plot.” Since the expression values range from zero to >10,000, it is clear that these are unnormalized count values. Click each of the additional tabs to view the “Count sum” (displays the total counts from all genes for each sample, also called the sequencing depth), “PCA plot” (scatterplots of the top two principal components), and “Density plot” (distribution of count values for each sample) of the uploaded data. The density plot appears in the shape of an “L,” which is caused by the large range and right-skewed distribution of raw count values.

Note

The figures shown under the dataset summary are useful for visually identifying outlier samples, assessing whether the data are normalized or not, determining appropriate filtering thresholds, and providing a benchmark to compare the effects of normalization. Deciding whether a sample is an outlier that should be removed is not a straightforward process. In general, we wish to remove samples that are substantially different from other samples based on technical reasons. A sample might be different due to biological reasons, in which case it should not be removed as this will bias the downstream statistical analysis and potentially lead to incorrect interpretation of the results. Unfortunately, it is not usually possible to determine whether an outlier is due to technical or biological reasons from the data alone. One guiding principle is that biological variability tends to have a smaller range than technical variability, hence if an outlier is extreme, we can usually assume it is a technical outlier and safely remove it without compromising our statistical inference. Examples of extreme outliers are when the first principal component (PC1) explains >70% of the variability and has a single or small number of samples on the extreme end of PC1, or when the count sum of a sample is several orders of magnitude smaller than the other samples (i.e., sequencing effects).
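As a rough illustration of the count-sum criterion above, the sketch below computes the sequencing depth of each sample and flags any sample whose depth is far below the median. The function names, the toy depths, and the cutoff of two orders of magnitude are assumptions chosen for the example; the note itself only says "several orders of magnitude," and the final decision to remove a sample remains a judgment call.

```python
import math
from statistics import median

def count_sums(counts, samples):
    """Total counts per sample (the sequencing depth)."""
    return {s: sum(row[i] for row in counts.values()) for i, s in enumerate(samples)}

def flag_low_depth(depths, orders_of_magnitude=2.0):
    """Flag samples whose depth is far below the median depth on a log10 scale.

    The default cutoff (2 orders of magnitude) is an assumption for illustration.
    """
    med = median(depths.values())
    if med <= 0:
        raise ValueError("median sequencing depth must be positive")
    flagged = []
    for sample, depth in depths.items():
        if depth <= 0 or math.log10(med / depth) >= orders_of_magnitude:
            flagged.append(sample)
    return flagged

# Toy depths: three samples near 20 million reads and one near 15 thousand.
depths = {"S1": 21_000_000, "S2": 19_000_000, "S3": 20_000_000, "S4": 15_000}
print(flag_low_depth(depths))
```

A flagged sample should still be inspected in the box plot and PCA plot before removal, since this check only captures one kind of technical outlier.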

6. Go back to the top of the page and click the “Metadata overview” tab. Scroll down to the metadata table (Fig. 3) and verify that each variable has been correctly recognized as either “Discrete” (Treatment, Sex) or “Continuous” (4 liver and 7 plasma lipid variables). The classification of each metadata variable can be updated using the dropdown menus below individual variable names. Depending on your screen size, some metadata variables may not be visible. To see the additional columns, simply scroll to the right within the table area.

Note

Scrolling to the far right reveals a pencil icon and a garbage can icon for each row, allowing users to edit or delete any metadata values or sample names. If you remove a sample from the metadata table, the corresponding sample in the omics data table will also be removed automatically.

Figure 3. Metadata editor. ExpressAnalyst automatically infers the type of each metadata variable (‘Discrete’ or ‘Continuous’). Users can use the dropdown menus below each variable name (1) to make corrections. Other tools for editing the metadata information, for example the order of categorical factor classes, can be accessed through the ‘Edit metadata column’ button (2). The pencil icon (3) can be used to edit individual metadata values.

7. Click the “Edit metadata column” button above the metadata table (Fig. 3). Navigate to the “Order (factor-level)” tab and make sure the “Treatment” variable is selected in the dropdown. By default, discrete metadata classes are sorted alphabetically in all downstream plots. However, in some cases, a different order might make more sense. Here, we wish to always plot the control samples on the left and the BPA-exposed samples on the right. Click the “Control” value and use the up-arrow button on the left to move it above “BPA” and click “Update.” Click “Proceed.”

Note

The other tabs allow users to include/exclude metadata variables from being displayed as options during the downstream statistical analysis, as well as to specify the primary metadata. The primary metadata is used to annotate most visualizations throughout the remainder of the analysis, so users should select the metadata that they are most interested in.

8. Click “Proceed.” A dialog will appear, warning that a few missing values were detected in the metadata and explaining how they will be handled in the downstream steps. Click “OK.”

Note

The downstream differential expression requires complete metadata for any variables included in the analysis. For example, if both the “Sex” and “liver_TG” variables are included in differential expression analysis (DEA), any sample with at least one missing value for either “Sex” or “liver_TG” will be excluded prior to computing the DEA statistics for that comparison.
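The complete-case rule in this note can be sketched as a small filter over the metadata table. This is an illustration only: the function name and the toy values are assumptions, and blank cells are represented here as empty strings.

```python
def complete_cases(metadata, variables):
    """Return the samples that have a non-missing value for every listed variable.

    metadata: dict mapping sample name -> {variable name: value}
    variables: metadata variables included in the comparison
    Blank cells are represented as "" or None in this sketch.
    """
    kept = []
    for sample, values in metadata.items():
        if all(values.get(v) not in (None, "") for v in variables):
            kept.append(sample)
    return kept

# Toy metadata with one blank liver_TG value (values are made up).
metadata = {
    "S1": {"Sex": "F", "liver_TG": "12.3"},
    "S2": {"Sex": "M", "liver_TG": ""},
    "S3": {"Sex": "M", "liver_TG": "9.8"},
}
print(complete_cases(metadata, ["Sex", "liver_TG"]))  # S2 is excluded
print(complete_cases(metadata, ["Sex"]))              # all samples are kept
```

Because exclusion depends on which variables enter a given comparison, the same sample can be dropped from one analysis and retained in another.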

9. Leave the default filtering settings (“Filter unannotated features” checked, “Low abundance” threshold set to 4, and “Variance filter” set to 15), change the normalization method to “Relative log expression normalization,” and click “Submit.”

Note

Filtering out transcripts that are of low confidence or uninformative to the research context can increase the statistical power of the downstream DEA (Bourgon et al., 2010). We can use summary statistics that are agnostic to the metadata (unsupervised), such as average abundance and variability across all samples, to flag transcripts for exclusion. Transcripts with a low average abundance near the detection limit are likely unreliable, while transcripts with a very low variability across all samples are unlikely to correlate with any metadata labels. Note that one should avoid using metadata labels relevant to downstream analysis to decide which transcripts to exclude, as doing so introduces bias to the DEA and other supervised methods (Bourgon et al., 2010).
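The two unsupervised filters discussed in this note can be sketched as below. The exact definitions used by ExpressAnalyst are not given here, so this sketch makes two labeled assumptions: the abundance filter removes features whose mean count is below the threshold, and the variance filter removes the given percentage of remaining features with the lowest variance. The function name and toy counts are also hypothetical.

```python
from statistics import mean, pvariance

def unsupervised_filter(counts, min_mean=4, variance_percentile=15):
    """Filter features using only summary statistics, never metadata labels.

    Assumption 1: drop features whose mean count is below min_mean.
    Assumption 2: drop the lowest variance_percentile percent of the
    remaining features, ranked by variance across all samples.
    """
    abundant = {g: row for g, row in counts.items() if mean(row) >= min_mean}
    ranked = sorted(abundant, key=lambda g: pvariance(abundant[g]))
    n_drop = int(len(ranked) * variance_percentile / 100)
    dropped = set(ranked[:n_drop])
    return {g: row for g, row in abundant.items() if g not in dropped}

# Toy count table: g1 and g7 are near the detection limit, g2 never varies.
counts = {
    "g1": [0, 1, 0, 2],
    "g2": [50, 50, 50, 50],
    "g3": [10, 200, 15, 180],
    "g4": [100, 120, 90, 110],
    "g5": [5, 6, 500, 450],
    "g6": [30, 35, 28, 33],
    "g7": [3, 2, 4, 1],
    "g8": [1000, 10, 900, 20],
    "g9": [60, 70, 65, 80],
}
filtered = unsupervised_filter(counts)
print(sorted(filtered))
```

Note that neither step looks at the Treatment or Sex labels, which is what keeps the filter from biasing the downstream differential expression statistics.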

Note

The purpose of normalization is to make expression profiles more comparable across samples, and to transform them to be more suitable for statistical analysis and visualization. All normalization options for RNA-seq counts (other than ‘None’) are from the “voom” methods in the limma R package which transform the data to “Log2-counts per million” or logCPM (Law et al., 2014). This transforms data to the log scale and normalizes for sequencing depth, which often varies across samples. Another potential issue with RNA-seq data relates to its compositional nature (Lovell et al., 2015), and the large range of abundances across different transcripts. As we saw in the boxplots and the density plots on the data overview page, most transcripts have counts in the low 10s to 100s range, but a small percentage have many more than this (>15,000). This means that a small number of transcripts can account for >50% of the total counts. If these highly abundant transcripts vary substantially across experimental conditions, they can influence the relative values (such as counts or logCPM) of many other transcripts, even if these transcripts do not change on an absolute scale (Lovell et al., 2015). This is a challenging issue to correct for and impacts different datasets to different extents. The last three normalization methods (“Upper Quantile Normalization,” “Trimmed Mean of M-values,” and “Relative log expression normalization”) implement different strategies to address it (Law et al., 2014).
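To make the two ideas in this note concrete, the sketch below computes log2-counts per million using the offsets described for voom (Law et al., 2014) and a median-of-ratios scaling factor, which is the idea behind relative log expression normalization. It is a simplified illustration under stated assumptions and does not reproduce the limma or edgeR implementations; the function names and toy counts are hypothetical.

```python
import math
from statistics import median

def log_cpm(counts, lib_sizes=None):
    """log2-counts per million with the offsets used by voom:
    log2((count + 0.5) / (library size + 1) * 1e6)."""
    n = len(next(iter(counts.values())))
    if lib_sizes is None:
        lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n)]
    return {
        gene: [math.log2((row[j] + 0.5) / (lib_sizes[j] + 1) * 1e6) for j in range(n)]
        for gene, row in counts.items()
    }

def rle_size_factors(counts):
    """Median-of-ratios size factors (simplified relative log expression).

    For each sample, take the median ratio of its counts to the per-gene
    geometric mean, using only genes with nonzero counts in every sample.
    """
    n = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n)]
    for row in counts.values():
        if min(row) <= 0:
            continue  # geometric mean is undefined when a count is zero
        log_geo_mean = sum(math.log(v) for v in row) / n
        for j, v in enumerate(row):
            log_ratios[j].append(math.log(v) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]

# Toy data: sample 2 was sequenced exactly twice as deeply as sample 1.
counts = {"a": [10, 20], "b": [100, 200], "c": [1000, 2000]}
print(log_cpm(counts))
print(rle_size_factors(counts))
```

In this toy table the logCPM values are nearly identical across the two samples, showing that the depth difference has been removed. Because the size factors are medians across genes, they are less sensitive than the raw library size to a few highly abundant transcripts, which is why this family of methods helps with the compositional issue described above.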

10. Scroll down to the figures in the lower half of the page and consider the “Box plot” and “Density plot” tabs.

Note

Both plots look very different after normalization. The normalized expression values are now below 15, no longer have a right-skewed distribution at the sample level, and have very similar distributions across samples. This indicates that the data have been normalized for sequencing depth and transformed to a log scale.

11. Click on the “PCA plot” tab to examine the data patterns based on principal component analysis.

Note

Principal component analysis (PCA) is a widely used dimensionality reduction method that can summarize main variability trends in high-dimensional omics data into a few dimensions for intuitive visualization. In the PCA plot based on the first two principal components, we see that the data fall into four clear clusters (Fig. 4D). The samples are colored according to the primary metadata (Treatment), which reveals that the “Control” and “BPA” samples are separated along PC2. Inspecting the sample names, we can see that the samples are separated according to Sex along PC1, with male on the right and female on the left. This is a sign that there is a strong biological signal with respect to our main metadata of interest. Sometimes the PCA shows that the main trends in the data are related to technical variables such as batch or sample preparation protocols. This does not mean that there are no meaningful patterns in the data related to our metadata of interest, but that they explain less variability in the data than the technical parameters. In these cases, the technical parameters should be accounted for during the statistical analysis. This topic will be covered in Basic Protocol 2.

Figure 4. Diagnostic plots before and after normalization. The box plot in (A) and PCA plot in (B) show the uploaded count data, while (C) and (D) show the data after filtering and normalization. Note the difference in metadata order in (B) and (D) after re-ordering the treatment classes from BPA-Control to Control-BPA.

12. Click on the “Mean-variance plot” tab to explore the relationship between the mean and variance of transcript expression values.

Note

There is typically a relationship between the mean and variance of transcript expression values (Liu et al., 2015). While trends may vary across datasets, a typical relationship for unfiltered RNA-seq data is shown in Figure 5A: moving from left to right along the x-axis, the transcript standard deviation increases for a short section, peaks, and then decreases with increasing mean expression values. The initial increasing section is where transcripts are at or near the detection limit, hence the standard deviation is lower than we would expect based on the mean expression, due to the high numbers of zero values. The goal is to eliminate the initial upswing (see the dashed red box in Fig. 5A) by setting an appropriate abundance filter, to produce a consistently decreasing mean-variance trend as in Figure 5B. If the upswing area is not removed, the abundance filter should be increased.
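The mean-variance trend described in this note is assessed visually in ExpressAnalyst, but the underlying calculation is simple. The sketch below computes the mean and standard deviation for each transcript, summarizes the trend in bins of increasing mean, and applies a crude check for an initial upswing. The function names, the binning scheme, and the upswing heuristic are assumptions for illustration and are not the method used by the tool.

```python
from statistics import mean, median, stdev

def mean_sd_points(log_expr):
    """Per-transcript (mean, standard deviation) pairs, sorted by mean."""
    return sorted((mean(row), stdev(row)) for row in log_expr.values())

def binned_trend(points, n_bins=5):
    """Summarize the trend as (mean of means, median SD) per bin of transcripts."""
    size = max(1, len(points) // n_bins)
    trend = []
    for start in range(0, len(points), size):
        chunk = points[start:start + size]
        trend.append((mean([m for m, _ in chunk]), median([s for _, s in chunk])))
    return trend

def has_initial_upswing(trend):
    """Crude heuristic: True if the SD peaks after the lowest-expression bin."""
    sds = [s for _, s in trend]
    return sds.index(max(sds)) > 0

# Toy normalized values: the lowest-expressed transcript has a deflated SD.
log_expr = {
    "t1": [0.1, 0.2, 0.1, 0.3],
    "t2": [1.0, 3.0, 0.5, 2.5],
    "t3": [5.0, 6.0, 5.5, 6.5],
    "t4": [9.0, 9.2, 9.1, 9.3],
}
trend = binned_trend(mean_sd_points(log_expr), n_bins=4)
print(trend)
print(has_initial_upswing(trend))
```

If a check of this kind still reports an upswing after filtering, that corresponds to the situation in Figure 5A, and the abundance filter would be increased as described above.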

Figure 5. Mean-variance trend of RNA-seq data. Plots showing the mean expression values on the x-axis and standard deviation of expression values on the y-axis for all transcripts, both before (A) and after (B) applying abundance and variance filters. The dashed red box in (A) highlights the portion of the trend with a positive mean-variance association.

13. Click the “Show R Commands” link in the top right corner to view the R commands history.

Note

The R functions used in ExpressAnalyst are publicly available on the GitHub page (https://github.com/xia-lab/ExpressAnalystR) and can be installed as an R package for local analysis (Fig. 6B). Throughout the analysis, the executed functions are tracked in the R command history. These features are implemented in ExpressAnalyst for transparency and reproducibility, so that users can see exactly which analyses have been performed.

Figure 6. ExpressAnalystR command history. Throughout the analysis, click the “Show R Commands” link in the top right corner to display this list of R commands used during the analysis (A). Instructions for installing and using ExpressAnalystR (B) are available under the “ExpressAnalystR” tab on the homepage.

Basic Protocol 2: DIFFERENTIAL EXPRESSION ANALYSIS WITH LINEAR MODELS

The general objective of differential expression analysis (DEA) is to identify genes or transcripts associated with specific experimental factors of interest, while accounting for other major sources of variability within the data (Law et al., 2016). The observed expression patterns can be explained by a combination of technical, biological, environmental, and experimental sources. Technical sources can include different sample preparation or sequencing depths across samples. Biological sources include factors such as sex, age, and circadian rhythm, while examples of environmental sources may encompass the geographic locations of sample collection or lifestyle parameters such as smoking or diet. Finally, experimental sources include any independent variable imposed by the researcher, such as chemical treatments or gene knockouts. In this protocol, we introduce the concepts behind using generalized linear models for performing DEA of gene expression data, explain differences between the main DEA algorithms, and describe how to configure DEA for common experimental designs.
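As a concrete illustration of accounting for another source of variability, consider the dataset from Basic Protocol 1, where Treatment is the factor of interest and Sex is a biological covariate. A minimal sketch of a per-gene model, written in our own notation and assuming the simplest case of a linear model fitted to normalized log-scale expression values, is:

```latex
y_{gi} = \beta_{g0} + \beta_{g1}\,\mathrm{Treatment}_i + \beta_{g2}\,\mathrm{Sex}_i + \varepsilon_{gi},
\qquad H_0 : \beta_{g1} = 0
```

Here y_{gi} is the expression of gene g in sample i, Treatment_i and Sex_i are indicator variables (for example, 1 for BPA and 0 for Control), and ε_{gi} is the residual error. Testing whether β_{g1} differs from zero asks whether the gene is associated with treatment after the contribution of sex has been accounted for.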