RNA-Seq Data Analysis: A Practical Guide for Model and Non-Model Organisms

Enrique Pola-Sánchez, Karen Magdalena Hernández-Martínez, Rafael Pérez-Estrada, Nelly Sélem-Mójica, June Simpson, María Jazmín Abraham-Juárez, Alfredo Herrera-Estrella, José Manuel Villalobos-Escobedo

Published: 2024-05-29 DOI: 10.1002/cpz1.1054

Abstract

RNA sequencing (RNA-seq) has emerged as a powerful tool for assessing genome-wide gene expression, revolutionizing various fields of biology. However, analyzing large RNA-seq datasets can be challenging, especially for students or researchers lacking bioinformatics experience. To address these challenges, we present a comprehensive guide to provide step-by-step workflows for analyzing RNA-seq data, from raw reads to functional enrichment analysis, starting with considerations for experimental design. This is designed to aid students and researchers working with any organism, irrespective of whether an assembled genome is available. Within this guide, we employ various recognized bioinformatics tools to navigate the landscape of RNA-seq analysis and discuss the advantages and disadvantages of different tools for the same task. Our protocol focuses on clarity, reproducibility, and practicality to enable users to navigate the complexities of RNA-seq data analysis easily and gain valuable biological insights from the datasets. Additionally, all scripts and a sample dataset are available in a GitHub repository to facilitate the implementation of the analysis pipeline. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC.

Basic Protocol 1 : Analysis of data from a model plant with an available reference genome

Basic Protocol 2 : Gene ontology enrichment analysis

Basic Protocol 3 : De novo assembly of data from non-model plants

INTRODUCTION

The development of high-throughput sequencing technologies has made it possible to uncover many organisms' transcriptional profiles at the whole-genome level. The technology of RNA-seq, or messenger RNA (transcriptome) sequencing, has permitted researchers to explore gene expression patterns with great precision in hundreds of model and non-model organisms. Even without a reference genome, a wealth of understanding related to processes such as cellular development, gene function, and responses to environmental stimuli, among others, has been uncovered (Stark et al., 2019). The RNA-seq methodology has often set the basis for developing molecular genetic analysis in non-model organisms and has become an essential tool. Researchers focused mainly on wet lab and field work sometimes struggle to exploit the data available from “next-generation sequencing” because they lack experience in bioinformatics, which is perceived to require in-depth computational and programming skills. This guide is intended to help bridge this gap.

Conventionally, an RNA-seq experiment involves subjecting organisms or cells to at least two experimental conditions and subsequently freezing the tissue or cell group of interest to halt transcriptional activity and initiate the RNA extraction process. Once the RNA is purified, cDNA libraries are constructed for all transcripts present at a given moment in the cell. These transcripts commonly include different classes of RNAs, such as tRNAs, ribosomal RNA, noncoding RNA, and messenger RNAs (mRNAs; Stark et al., 2019). However, given their importance, researchers are mainly interested in mRNAs that give rise to proteins. In recent years, there has also been a growing interest in determining the differential expression of other types of RNAs, such as small interfering RNAs (siRNAs) and microRNAs (Kasschau et al., 2007; Liu et al., 2019; Mehdi et al., 2021), which also play important roles in the regulation of different cellular processes. Once the cDNA libraries are built, they are sequenced using any of various different possible sequencing technologies; some of the technologies most widely used currently are those available commercially from Illumina.

In general, these protocols present the quality analysis of the FastQ files, the trimming and mapping process, and the differential gene expression analysis. Furthermore, we have included global analyses to facilitate the interpretation of the analyzed data, such as functional category enrichment analysis and data visualization through easily interpretable graphics. We also present a processing approach for non-model organisms for which no assembled and annotated genome is available. We encourage users of these protocols to share their scripts and output data on freely accessible platforms such as GitHub, providing the scientific community with the steps needed to obtain the results presented in scientific reports such as theses or indexed papers. All scripts and flowcharts provided here are documented in a GitHub repository, making it easier for users to implement these protocols and ensure repeatability when analyzing this type of data.

STRATEGIC PLANNING

When planning a differential gene expression experiment using RNA-seq, it is essential to consider several factors. First, the experimental design must clearly define the conditions that will be contrasted to correspond to the biological question of interest: the more precisely defined these are, the better the results. Second, having at least three biological replicates per condition is advisable. It is crucial to strive for as much consistency as possible between these replicates during the experiments. Given that RNA-seq is highly susceptible to small fluctuations in experimental conditions, experimental noise can often be reflected in the results.

Additionally, it is essential to consider the number of reads obtained after sequencing. Typically, having around 10 million reads per library is sufficient for most organisms. However, in the case of plants with large genomes, such as maize, which is ∼2.4 Gb, it is advisable to aim for at least 20-30 million reads for a meaningful analysis of differential gene expression.

The sequencing files for each sample are recorded in FastQ format, in which each read is represented by four lines: the read identifier, the sequence expressed in the four bases (CGAT; and in some cases “N” to indicate an undefined base), a separator line beginning with “+”, and the quality with which each base was read. The quality is encoded in ASCII characters and corresponds to Phred scores (Ewing & Green, 1998). These values are used to assess sequencing quality, and conventionally, RNA-seq analysis begins by examining read quality using the FastQC program. After this analysis, a trimming process to eliminate low-quality sequences and remove adapters is highly recommended. For this process, Trimmomatic has been one of the most versatile and valuable programs developed so far (Bolger et al., 2014). After data are trimmed and high-quality FASTQ files are obtained, the next stage is mapping the RNA-seq sequences to an assembled and annotated genome if this option is available. When these resources are not available, it is necessary to perform a de novo assembly.
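For reference, each read in a FastQ file therefore occupies four such lines. The record below is a hypothetical example (the identifier, bases, and quality characters are invented purely for illustration):

  • @SRR0000001.1 1 length=40
  • AGCTGCGATCCATTGNACGTAGGCTTACGATCCATGGATC
  • +
  • IIIIIHHHGGFFF#FFFFFFIIIIHHHGGGFFFFFFFIII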

Once the read counts per gene are obtained, the user can analyze differential expression levels to identify those genes whose expression changes (or differs) significantly between the two conditions being compared. These protocols present a differential expression analysis based on the DESeq2 package, which runs in the R environment (Love et al., 2014). As an alternative, we also recommend edgeR (Robinson et al., 2010), which provides examples for a variety of use cases in its manual and performs essentially the same processing as DESeq2. For practical purposes, we have not covered edgeR extensively, but the differences are minor. Figure 1 proposes a general workflow for model and non-model organisms.
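For users who prefer edgeR, a minimal sketch of an equivalent quasi-likelihood workflow is shown below. It assumes a count matrix (countData) and a condition factor (condition) such as those built in Basic Protocol 1, and it is only an outline of the approach described in the edgeR manual, not a replacement for it:

  • # Minimal edgeR sketch (assumes countData and condition as defined in Basic Protocol 1)
  • library(edgeR)
  • y <- DGEList(counts = countData, group = condition)
  • # Filter lowly expressed genes and normalize library sizes
  • keep <- filterByExpr(y)
  • y <- y[keep, , keep.lib.sizes = FALSE]
  • y <- calcNormFactors(y)
  • # Estimate dispersions and test the second coefficient (Treatment vs. Control)
  • design <- model.matrix(~ condition)
  • y <- estimateDisp(y, design)
  • fit <- glmQLFit(y, design)
  • qlf <- glmQLFTest(fit, coef = 2)
  • topTags(qlf)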

Proposed workflows for model (A) and non-model organisms (B).

Basic Protocol 1: ANALYSIS OF DATA FROM A MODEL PLANT WITH AN AVAILABLE REFERENCE GENOME

This protocol meticulously outlines RNA-seq data analysis using the Arabidopsis thaliana reference genome. A typical RNA-seq analysis involves quality control and filtering of reads based on quality measures. Additionally, it includes alignment with the reference genome and, ultimately, the generation of the count matrix. These steps are executed using various programs, and their key characteristics and essential parameters are detailed below. Following this protocol as described will enable the analysis of differential expression in the dataset and the visualization of results through graphs such as principal component analysis (PCA) plots, M (log ratio)-A (mean average) (MA) plots, volcano plots, and heatmaps, through which differentially expressed genes can be identified.

Efficient progress in these tasks can be achieved by utilizing the command line within a Unix/Linux environment. We conducted our work on a 64-bit Ubuntu 23.04 operating system. For Windows users, a Windows system employing the Windows Subsystem for Linux (WSL; https://learn.microsoft.com/en-us/windows/wsl/install) is a viable option. For macOS users, it is feasible to execute identical commands in the terminal because of the shared Unix foundation of both operating systems.

Necessary Resources

Software

Hardware

Part 1: Installing packages and creating directories

1.To install the bioinformatics tools that we will use in this protocol, go to the terminal and create a general folder in which the analysis will be carried out:

  • user:∼$ mkdir project_2023

2.Move into the folder project_2023 using the cd command, and create subdirectories using the mkdir command:

  • user:∼$ cd project_2023/
  • user:∼/project_2023$ mkdir bioinformatics_tools
  • user:∼/project_2023$ mkdir raw_data
  • user:∼/project_2023$ mkdir trimming_data
  • user:∼/project_2023$ mkdir alignment_hisat2
  • user:∼/project_2023$ mkdir quantification_featureCounts
  • user:∼/project_2023$ mkdir genome_arabidopsis
  • user:∼/project_2023$ mkdir index_hisat2

3.To look at the folders created, use the ls command:

  • user:∼/project_2023$ ls

4.Install the required bioinformatics packages from the Ubuntu repositories. The sudo command enables users to execute programs with the security privileges of another user, typically the root user, granting temporary superuser access. Enter each of the following commands separately in the terminal:

  • user:∼/project_2023/bioinformatics_tools$ sudo apt update
  • user:∼/project_2023/bioinformatics_tools$ sudo apt install zip unzip
  • user:∼/project_2023/bioinformatics_tools$ sudo apt install fastqc
  • user:∼/project_2023/bioinformatics_tools$ sudo apt install r-base
  • user:∼/project_2023/bioinformatics_tools$ sudo apt install hisat2
  • user:∼/project_2023/bioinformatics_tools$ sudo apt install subread

Note
When using a macOS system, users can search for the program in the Conda repository (https://anaconda.org/bioconda/repo) and execute the specified installation command.

5.To check the version or request help, you can execute the following command:

  • user:∼/project_2023/bioinformatics_tools$ fastqc --version
  • user:∼/project_2023/bioinformatics_tools$ fastqc --help

6.Trimmomatic installation : To install Trimmomatic, download the available zip file from http://www.usadellab.org/cms/?page=trimmomatic using the wget command:
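The exact download link can change between releases; assuming the version 0.39 binary zip currently listed on that page, the download command would look like the following (copy the current link from the website if it differs):

  • user:∼/project_2023/bioinformatics_tools$ wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip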

7.Decompress:

  • user:∼/project_2023/bioinformatics_tools$ unzip Trimmomatic-0.39.zip

Note
If an additional folder named Trimmomatic-0.39 is now visible, you have successfully installed version 0.39, which is the most recent version as of the date of this article.

8.To install the complete R system, use:

  • user:∼/project_2023/bioinformatics_tools$ sudo apt-get install r-base

Note
You have now finished installing all the tools needed to carry out Basic Protocol 1.

Part 2: Data download

The data we will use as our example here pertain to a study published in The Plant Journal (Villalobos-Escobedo et al., 2020) characterizing the roles of the Nox genes of the fungus Trichoderma atroviride during its interaction with A. thaliana. The raw RNA-seq data are publicly available in the NCBI SRA database at www.ncbi.nlm.nih.gov/sra/ with the accession number PRJNA575031. To simplify the process, we will use the European Nucleotide Archive (https://www.ebi.ac.uk/ena/browser/home) to find compressed versions of the FASTQ files.

9a. To download the files, copy each file link address by right-clicking on each file and, using the wget command, proceed to download into the raw_data folder (once per file):

  • user:∼/project_2023/raw_data$ wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR102/004/SRR10207204/SRR10207204_1.fastq.gz

9b. Another way to download sequencing files is using the SRA Toolkit. To utilize this method, you must first install the SRA Toolkit by following the instructions on GitHub (https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit). Once that is installed, you can download the FASTQ files as follows:

  • user:∼/project_2023/raw_data$ fasterq-dump --split-files SRR10207204

Note
This process will take some time, but in the end, you should find the corresponding FASTQ files.

10.In the end, all 12 available library files should have been downloaded:

  • SRR10207204_1.fastq.gz
  • SRR10207204_2.fastq.gz
  • SRR10207206_1.fastq.gz
  • SRR10207206_2.fastq.gz
  • SRR10207210_1.fastq.gz
  • SRR10207210_2.fastq.gz
  • SRR10207212_1.fastq.gz
  • SRR10207212_2.fastq.gz
  • SRR10207216_1.fastq.gz
  • SRR10207216_2.fastq.gz
  • SRR10207218_1.fastq.gz
  • SRR10207218_2.fastq.gz

Note
These files correspond to the biological replicates of the control, “Only Arabidopsis growing five dpi (without Trichoderma),” and the treatment with Trichoderma, “Arabidopsis with wild-type strain five dpi.”

Part 3: Quality check of the raw reads with FastQC

One initial stage in an RNA-seq data analysis is assessing data quality. To accomplish this, we use FastQC v0.11.9, a widely recognized tool enabling quality checks on sequencing data (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/).

11.To evaluate the quality of the raw data, navigate to the raw_data directory and execute the following command:

  • user:∼/project_2023/raw_data$ fastqc *.gz

Note
With this command, we instruct FastQC to perform quality analysis for all files with a .gz extension.

12.If you need to determine the quality of a particular file, simply run fastqc followed by the file name:

  • user:∼/project_2023/raw_data$ fastqc SRR10207204_1.fastq.gz

Note
The quality analysis carried out by FastQC will yield two files, a .zip and a .html file. The .html file contains the quality analysis performed by FastQC, which can be viewed using a web browser.

13.Maintaining order in your working directories is essential, so we suggest creating a folder for the raw data quality into which the files generated from the quality analysis can be moved:

  • user:∼/project_2023/raw_data$ mkdir quality_raw
  • user:∼/project_2023/raw_data$ mv *.zip quality_raw/
  • user:∼/project_2023/raw_data$ mv *.html quality_raw/

Part 4: Read trimming with Trimmomatic

Now that we have assessed the quality of the raw data, a common next task is read trimming. For this purpose, various bioinformatics tools have been developed, including Trimmomatic, Trim Galore (available at https://github.com/FelixKrueger/TrimGalore), Cutadapt, Skewer, SOAPnuke, and fastp. These tools are designed to remove low-quality bases, adapter sequences, and bases that exceed a predefined threshold (Bolger et al., 2014; Chen, Chen, et al., 2018; Chen, Zhou, et al., 2018; Jiang et al., 2014; Martin, 2011). We carry out the read-trimming process for this protocol using Trimmomatic, which can handle both paired-end and single-end Illumina sequencing data.

14.Trimmomatic: Use Trimmomatic to perform read trimming with moderately stringent settings, writing the output into the trimming_data folder. The command is the following:

  • user:∼/project_2023/trimming_data$ java -jar /home/enriquepola/project_2023/bioinformatics_tools/Trimmomatic-0.39/trimmomatic-0.39.jar PE -phred33 ../raw_data/SRR10207204_1.fastq.gz ../raw_data/SRR10207204_2.fastq.gz SRR10207204_P_1.fastq.gz SRR10207204_U_1.fastq.gz SRR10207204_P_2.fastq.gz SRR10207204_U_2.fastq.gz ILLUMINACLIP:/home/enriquepola/project_2023/bioinformatics_tools/Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:36

Note
With this command, we ask Trimmomatic to scan each read with a 4-base-wide sliding window and to “trim,” or remove, data when the average quality per base drops below 15 (SLIDINGWINDOW:4:15), to drop reads shorter than 36 bases (MINLEN:36), and to remove adapters (ILLUMINACLIP:TruSeq3-PE.fa:2:30:10).

15.The Trimmomatic output should look like this:

TrimmomaticPE: Started with arguments:

  • -phred33 ../raw_data/SRR10207204_1.fastq.gz ../raw_data/SRR10207204_2.fastq.gz SRR10207204_P_1.fastq.gz SRR10207204_U_1.fastq.gz SRR10207204_P_2.fastq.gz SRR10207204_U_2.fastq.gz ILLUMINACLIP:/home/enriquepola/project_2023/bioinformatics_tools/Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:36

Multiple cores found: Using 4 threads

  • Using PrefixPair: ‘TACACTCTTTCCCTACACGACGCTCTTCCGATCT’ and ‘GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT’
  • ILLUMINACLIP: Using 1 prefix pair, 0 forward/reverse sequences, 0 forward-only sequences, 0 reverse-only sequences
  • Input Read Pairs: 11645096 Both Surviving: 11117409 (95.47%) Forward Only Surviving: 178980 (1.54%) Reverse Only Surviving:265455 (2.28%) Dropped: 83252 (0.71%)

TrimmomaticPE: Completed successfully

16.The last part of the Trimmomatic output shows the percentage of reads that survived the process, which is reasonably satisfactory. To optimize the execution, we can implement a “for” loop in which the last three digits of each accession are used as the sample variable, because they are the only characters that vary between files:

  • user:∼/project_2023/trimming_data$ for sample in 204 206 210 212 216 218; do echo java -jar /home/enriquepola/project_2023/bioinformatics_tools/Trimmomatic-0.39/trimmomatic-0.39.jar PE -phred33 ../raw_data/SRR10207${sample}_1.fastq.gz ../raw_data/SRR10207${sample}_2.fastq.gz SRR10207${sample}_P_1.fastq.gz SRR10207${sample}_U_1.fastq.gz SRR10207${sample}_P_2.fastq.gz SRR10207${sample}_U_2.fastq.gz ILLUMINACLIP:/home/enriquepola/project_2023/bioinformatics_tools/Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:15 MINLEN:36; done|sh

Note
The results of read trimming with Trimmomatic are shown in Figure 2.

Quality analysis performed by FastQC. (A and B) Per-base sequence quality of the SRR10207218_1.fastq.gz library before (A) and after (B) read trimming with Trimmomatic.

Part 5: Alignment of reads

Once the read-trimming process is completed, the next stage in RNA-seq analysis is to align reads to a reference. When RNA-seq data come from an organism for which a reference genome is available, it is possible to infer the identity of the expressed transcripts by mapping the reads to the genome or transcriptome (Conesa et al., 2016). To align reads to a reference genome, tools such as HISAT2, STAR, TopHat2, Bowtie 2, and Rsubread can be used (Dobin et al., 2013; Kim et al., 2013; Kim et al., 2019; Liao et al., 2019; Langmead & Salzberg, 2012). Conversely, Kallisto or Salmon can be employed for pseudoalignment/quasi-mapping toward a reference transcriptome (Bray et al., 2016; Patro et al., 2017). In this protocol, we will only use HISAT2 (version 2.2.1), but we have provided an alternative protocol for the use of Kallisto and Salmon on GitHub (https://github.com/jmvillalobos/RNA-seq-protocol).
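As a brief point of reference for the pseudoalignment route (the full alternative protocol is on GitHub), a minimal Kallisto sketch consists of indexing a cDNA FASTA and then quantifying each pair of trimmed files. The index and output names below are placeholders, and the cDNA FASTA is assumed to have been downloaded from EnsemblPlants:

  • user:∼/project_2023$ kallisto index -i athaliana_cdna.idx Arabidopsis_thaliana.TAIR10.cdna.all.fa
  • user:∼/project_2023$ kallisto quant -i athaliana_cdna.idx -o kallisto_SRR10207204 -t 4 trimming_data/SRR10207204_P_1.fastq.gz trimming_data/SRR10207204_P_2.fastq.gz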

Part 5.1: Alignment to a reference genome using HISAT2

The aligner we will use is HISAT2, which employs a graph-based data structure and alignment algorithm for rapid and sensitive alignment of sequencing reads to a genome and a comprehensive collection of minor variants (Kim et al., 2019). Before commencing read mapping, it is essential to have an index of the reference genome. For this purpose, the specific reference genome is necessary; for the example data, we use the TAIR10 version of the A. thaliana genome, which can be obtained from EnsemblPlants (https://plants.ensembl.org/info/data/ftp/index.html). Subsequently, proceed to decompress the genome using the gunzip command and create the index in the index_hisat2 folder.

17.Download genome data:
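The FTP path below assumes EnsemblPlants release 57 (the same release as the GTF annotation used later in this protocol); if the current release differs, copy the link for the “dna.toplevel” FASTA file directly from the EnsemblPlants download page:

  • user:∼/project_2023/genome_arabidopsis$ wget https://ftp.ensemblgenomes.org/pub/plants/release-57/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz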

18.Decompress:

  • user:∼/project_2023/genome_arabidopsis$ gunzip Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz

Note
If you want to keep the original compressed (.gz) file, add the -k option between gunzip and the name of the file to be decompressed.

19.Create the index in the index_hisat2 folder with HISAT2:

  • user:∼/project_2023/index_hisat2$ hisat2-build -p 4 ../genome_arabidopsis/Arabidopsis_thaliana.TAIR10.dna.toplevel.fa genome

Where:

  • hisat2-build is the HISAT2 command used to build the index;
  • -p specifies the number of computer cores to be used—in this case, four are used;
  • ../genome_arabidopsis indicates the genome path, including the specific file (Arabidopsis_thaliana.TAIR10.dna.toplevel.fa);
  • genome is the name assigned to the files resulting from the index.

20.The process should generate a directory containing eight files:

  • user:∼/project_2023/index_hisat2$ ls
  • genome.1.ht2
  • genome.2.ht2
  • genome.3.ht2
  • genome.4.ht2
  • genome.5.ht2
  • genome.6.ht2
  • genome.7.ht2
  • genome.8.ht2

21.Once the index has been successfully created, it is possible to align the clean reads to the reference genome (per each sample) in the alignment_hisat2 folder:

  • user:∼/project_2023/alignment_hisat2$ hisat2 -p 4 -x ../index_hisat2/genome -1 ../trimming_data/SRR10207204_P_1.fastq.gz -2 ../trimming_data/SRR10207204_P_2.fastq.gz -S SRR10207204.sam

Where:

  • -x is the index path and base name of the reference genome index;
  • -1 and -2 are the paired-end files to be aligned, with their respective paths;
  • -S is the output of the SAM alignment file.

22.Generate a loop to speed up the analysis:

  • user:∼/project_2023/alignment_hisat2$ for sample in 204 206 210 212 216 218; do echo hisat2 -p 4 -x ../index_hisat2/genome -1 ../trimming_data/SRR10207${sample}_P_1.fastq.gz -2 ../trimming_data/SRR10207${sample}_P_2.fastq.gz -S SRR10207${sample}.sam; done|sh

23.Each alignment will generate an alignment output, which is essential as it indicates the percentage of reads that are aligned to the genome:

11117409 reads; of these:

  • 11117409 (100.00%) were paired; of these:
  • 415513 (3.74%) aligned concordantly 0 times
  • 10136746 (91.18%) aligned concordantly exactly 1 time
  • 565150 (5.08%) aligned concordantly >1 time
  • —-
  • 415513 pairs aligned concordantly 0 times; of these:
  • 59514 (14.32%) aligned discordantly 1 time
  • —-

355999 pairs aligned 0 times concordantly or discordantly; of these:

  • 711998 mates make up the pairs; of these:
  • 556751 (78.20%) aligned 0 times
  • 141225 (19.84%) aligned exactly 1 time
  • 14022 (1.97%) aligned >1 times
  • 97.50% overall alignment rate

Part 5.2: Convert SAM to BAM file and sort by coordinates

The alignment result produces SAM (sequence alignment map) files, which are text files containing sequence alignment information in the reference genome (Li et al., 2009). BAM (binary alignment map) files are compressed versions of SAM files containing the same information but in a non-human-readable format. BAM files are more efficient and smaller, making them easier to process. After generating a BAM file, it is essential to sort and index it because reads appear in arbitrary order in the original file. The choice to sort by sequence identifier or genomic coordinates depends on the application, with sorting by coordinates being common for genomic data. Samtools is a commonly used tool to perform these tasks (Danecek et al., 2021).
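Samtools is not among the packages installed in Part 1; if it is not already present on your system, it can usually be installed from the Ubuntu repositories in the same way as the other tools:

  • user:∼/project_2023/bioinformatics_tools$ sudo apt install samtools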

24.To convert SAM to BAM format and sort it, execute the following command in the folder containing the SAM alignment files (per sample):

  • user:∼/project_2023/alignment_hisat2$ samtools sort SRR10207204.sam -o SRR10207204_sorted.bam

This should generate a sorted file:

  • user:∼/project_2023/alignment_hisat2$ ls
  • SRR10207204.sam SRR10207204_sorted.bam

25.Generate a loop to speed up the analysis.

  • user:∼/project_2023/alignment_hisat2$ for sample in 204 206 210 212 216 218; do echo samtools sort SRR10207${sample}.sam -o SRR10207${sample}_sorted.bam; done|sh

Note
Once all the sorted .bam files have been generated, the .sam files are not needed for the later stage (quantification), so deleting them is a good option to free up storage space.
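If you choose to delete them, a command along the following lines will remove the intermediate SAM files; run it only after confirming that all the sorted .bam files were generated correctly:

  • user:∼/project_2023/alignment_hisat2$ rm SRR10207*.sam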

Part 6: Quantification of the number of mapped reads

Read quantification is a crucial task in RNA-seq analysis. The central operation in this context involves counting the number of reads overlapping specific genomic features. Specialized tools such as featureCounts and HTSeq have emerged as critical elements for this task, enabling efficient counting of reads mapped to particular genomic features (Anders et al., 2015; Liao et al., 2014).

26.Before starting the quantification, it is essential to download the EnsemblPlants GTF annotation file and unzip it in the genome_arabidopsis directory as follows:
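As with the genome FASTA, the link below assumes EnsemblPlants release 57; copy the current GTF link from the EnsemblPlants download page if the release has changed:

  • user:∼/project_2023/genome_arabidopsis$ wget https://ftp.ensemblgenomes.org/pub/plants/release-57/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.57.gtf.gz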

27.Decompress:

  • user:∼/project_2023/genome_arabidopsis$ gunzip Arabidopsis_thaliana.TAIR10.57.gtf.gz

Part 7: Quantifying with featureCounts

We will use the files sorted by coordinates we obtained previously to quantify the number of mapped reads.

28.Navigate to the quantification_featureCounts folder and execute the following command:

  • user:∼/project_2023/quantification_featureCounts$ featureCounts -p --countReadPairs -t exon -g gene_id -a ../genome_arabidopsis/Arabidopsis_thaliana.TAIR10.57.gtf -o SRR10207204.txt ../alignment_hisat2/SRR10207204_sorted.bam

Note
featureCounts version 2.0.3 was used.

29.We can use a “for” loop to automate the quantification of all samples, following a similar approach to that used with Trimmomatic, HISAT2, and Samtools. To simplify the process further, we propose creating a shell script that dynamically operates on the sorted.bam files in the alignment_hisat2 directory. This script can be inserted into a text editor such as Nano or Vim and saved with the name featureCounts.sh. Subsequently, necessary permissions are granted using the command chmod +x featureCounts.sh, and finally, the script is executed in the quantification_featureCounts directory.

  • #!/bin/bash

  • # Input folder containing sorted BAM files

  • input_folder="../alignment_hisat2/"

  • # GTF file

  • gtf_file="../genome_arabidopsis/Arabidopsis_thaliana.TAIR10.57.gtf"

  • # Output folder for featureCounts results

  • output_folder="featureCounts_output/"

  • # Create the output folder if it does not exist

  • mkdir -p $output_folder

  • # Loop over each BAM file in the input folder

  • for bam_file in $input_folder/*.bam; do

  • # Get the base name of the file without the .bam extension

  • base_name=$(basename $bam_file .bam)

  • # Run featureCounts for the current file

  • featureCounts -p --countReadPairs -t exon -g gene_id -a $gtf_file -o $output_folder/${base_name}.txt $bam_file

  • done

30.The user can save this script from the command line:

  • vim featureCounts.sh

Note
To save the modifications and close the file in Vim, press Esc and then type :wq (in Nano, save with Ctrl+O and exit with Ctrl+X).

31.Run the script:

  • user:∼/project_2023/quantification_featureCounts$ ./featureCounts.sh

Note
In this script, the “for” loop iterates over each BAM file in the specified folder (input_folder). The output of featureCounts for each sample is a .txt file containing the count numbers associated with the feature and a .summary file containing quantification result metrics. With these results, it is possible to generate a count matrix that allows us to determine gene expression between the evaluated conditions of the study.

32.To achieve this, run a series of commands that generate a count matrix for all samples with their respective identifiers. For this purpose, install two data manipulation packages:

  • user:∼/project_2023/quantification_featureCounts$ sudo apt install moreutils
  • user:∼/project_2023/quantification_featureCounts$ sudo apt install parallel

Note
As with other installations, we recommend using the Conda repository for macOS systems.

33.Extract counts (column 7):

  • user:∼/project_2023/quantification_featureCounts$ ls -1 *.txt | parallel "cat {} | sed '1d' | cut -f7 > {/.}_clean.txt"

34.Extract id (column 1):

  • user:∼/project_2023/quantification_featureCounts$ ls -1 *.txt | head -1 | xargs cut -f1 | sed '1d' > genes.txt

35.Arrange data in a matrix:

  • user:∼/project_2023/quantification_featureCounts$ paste genes.txt *clean.txt > matriz_arabidopsis_2023.txt

Note
Now, it is necessary to edit the headers of the count matrix; we can do this manually, matching the accession number with the treatment (Table 1). The result should look as shown in Table 2.
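As an alternative to editing the header row by hand, it can be rewritten from the command line. The sketch below is only an assumption about the column order: it presumes the *clean.txt files were pasted in alphabetical order of accession (204, 206, 210, 212, 216, 218) and uses the accession-to-treatment mapping in Table 1, so verify the column order in your own matrix before applying it:

  • user:∼/project_2023/quantification_featureCounts$ sed -i '1s/.*/Geneid\tControl_1\tTreatment_1\tControl_2\tTreatment_2\tControl_3\tTreatment_3/' matriz_arabidopsis_2023.txt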

Table 1. Accession and Treatment Numbers of the Samples Used for Analysis
Accession number Library name Replicates Treatment
SRR10207204 Only Arabidopsis growing 5 dpi (without Trichoderma) Biological replicate 1 Control_1
SRR10207210 Only Arabidopsis growing 5 dpi (without Trichoderma) Biological replicate 2 Control_2
SRR10207216 Only Arabidopsis growing 5 dpi (without Trichoderma) Biological replicate 3 Control_3
SRR10207206 Arabidopsis with wild-type strain 5 dpi Biological replicate 1 Treatment_1
SRR10207212 Arabidopsis with wild-type strain 5 dpi Biological replicate 2 Treatment_2
SRR10207218 Arabidopsis with wild-type strain 5 dpi Biological replicate 3 Treatment_3
Table 2. Preview of the Count Table Built With the Files Generated by featureCounts
Geneid Control_1 Control_2 Control_3 Treatment_1 Treatment_2 Treatment_3
AT1G30814 0 1 0 0 1 0
AT1G78930 54 88 25 71 44 7
AT1G71695 339 297 90 372 157 37
AT1G58983 0 0 0 0 0 0
AT1G12980 0 0 0 0 0 0
AT1G45223 0 0 0 0 0 0
AT1G56250 3 4 0 149 164 30
AT1G66852 0 0 0 0 0 0
AT1G69810 756 911 319 893 434 74
AT1G72450 159 186 63 510 273 52

Part 8: Differential gene expression analysis

Differential expression analysis is essential in interpreting RNA-seq data, allowing the determination of quantitative changes in gene expression levels between experimental groups (Fig. 3). Specialized tools such as edgeR and DESeq2, based on negative binomial distributions, are prominent in differential expression analysis (Love et al., 2014; Robinson et al., 2010). In this specific protocol, we will focus on implementing DESeq2 to explore changes in gene expression of A. thaliana plants inoculated with T. atroviride (Treatment) versus non-inoculated plants (Control).

Graphics generated from the analysis of differential gene expression. (A) PCA plot illustrates the variation between treatment and control replicates. (B) MA plot displays the log2(fold change) of the treatment over the mean of normalized counts. (C) Volcano plot highlights in red the genes differentially expressed in the treatment versus control condition. (D) Heatmap showcases the 10 most prominent differentially expressed genes in the treatment-versus-control comparison.

To perform the differential expression analysis, we will use R and the packages indicated in the script described here, provided on GitHub, as a supplementary script (https://github.com/jmvillalobos/RNA-seq-protocol).

36.Install packages to use in the R session:

  • Library Installation

  • Installing DESeq2

  • if (!require("BiocManager", quietly = TRUE))

  • install.packages("BiocManager")

  • BiocManager::install("DESeq2")

  • Installing ggplot2

  • install.packages("ggplot2")

  • Installing EnhancedVolcano

  • BiocManager::install("EnhancedVolcano")

  • Installing pheatmap

  • install.packages("pheatmap")

37.Load the library for differential gene expression (DGE) analysis and plotting:

  • Loading the library for DGE

  • library(DESeq2)

  • Loading the library for plots

  • library("ggplot2")

  • Loading the library for the volcano plot

  • library("EnhancedVolcano")

  • Loading the library for heatmap

  • library("pheatmap")

38.Load the contingency table of the accounts from featureCounts into the R session:

  • Setting the path to the featureCounts count matrix

  • setwd("C:/project_2023/quantification_featureCounts")

  • Viewing Files in the Directory

  • list.files()

  • Reading the count matrix

  • countData <- read.delim("./matriz_arabidopsis_2023.txt", header = TRUE, row.names = 1)

  • head(countData)

  • Define the new order of the replicates

  • column_order <- c("Control_1", "Control_2", "Control_3", "Treatment_1", "Treatment_2", "Treatment_3")

  • Rearrange the columns

  • countData <- countData[, column_order]

  • Displays the header of the new table

  • head(countData)

39.Create the “data.frame” object to perform differential expression analysis:

  • Description of samples

  • condition <- factor(c("Control", "Control", "Control", "Treatment", "Treatment", "Treatment"))

  • colData <- data.frame(row.names = colnames(countData), condition)

  • head(colData)

  • Creating a DESeqDataSet

  • dds <- DESeqDataSetFromMatrix(countData = countData, colData = colData, design = ~ condition)

  • dds

40.Generate a PCA plot to verify consistency between replicates.

Note
PCA reveals sample variations and flags potential inconsistencies among replicates (see the Understanding Results section for details).

  • Generating a PCA Plot

  • rld <- rlog(dds, blind = F)

  • plotPCA(rld, intgroup = "condition") + geom_text(aes(label=name),

  • vjust=0.2)

41.Filter the genes to be considered in the analysis by the number of reads.

Note
It is not necessary to perform filtering before applying DESeq2 functions; however, the developers of DESeq2 recommend pre-filtering low-count genes before applying some of its functions. This practice offers two significant advantages: first, by removing rows with a very low number of reads, it reduces the memory size of the dds data object, thus improving computational efficiency. Second, this pre-filtering can enhance the clarity of visualizations by eliminating features that do not provide relevant information. Therefore, in our protocol, we have chosen to pre-filter to retain only those rows that have a count >10.

  • Filtering genes with very low expression

  • dds <- dds[rowSums(counts(dds)) > 10,]

  • dds

  • Performing DGE analysis

  • dds <- DESeq(dds)

  • dds

  • Extracting results

  • res <- results(dds)

  • res

  • Summary of DGE

  • summary(res)

  • Sorting the summary list by p-adj

  • res <- res[order(res$padj),]

  • head(res)

42.Set the comparison of interest.

Note
In this section, we define the comparison of interest as “Treatment vs. Control.” If working with additional conditions, the user would need to adjust the corresponding values in the “contrast” argument; for example, contrast = c("condition", "Light", "Control") or contrast = c("condition", "Infection", "Control"). It is also possible to adjust cutoff parameters such as padj, which is a transformation of the p-value after conducting multiple tests. The padj values are adjusted using methods such as the Benjamini-Hochberg correction, which controls the false discovery rate (FDR) when multiple hypothesis tests are conducted simultaneously. Additionally, it is possible to adjust the fold-change (FC) threshold, which does not necessarily have to be treated as a filter but rather is a widely used reference in the literature, for example, log2FC = 1; it is not necessary to impose a specific FC value if you decide against it. This allows the analysis to be adapted to different experimental designs without altering the code structure.

  • Extracting contrasts between conditions

  • Treatment_vs_Control <- results(dds, contrast = c("condition", "Treatment", "Control"))

  • summary(Treatment_vs_Control)

  • Obtaining a list of differentially expressed genes (DEGs) with stricter filtering

  • deg <- subset(Treatment_vs_Control, padj < 0.05 & abs(log2FoldChange) > 1)

  • print(deg)

  • summary(deg)

43.Export results:

  • Exporting the DEGs table

  • write.csv(deg, file = "DEG_Treatment_vs_Control_strict.csv")

  • Printing and exporting up-DEGs

  • up <- subset(deg, log2FoldChange > 1)

  • print(up)

  • summary(up)

  • write.csv(up, file = "Up_Treatment_vs_Control_strict.csv")

  • Printing and exporting down-DEGs

  • down <- subset(deg, log2FoldChange < (-1))

  • print(down)

  • summary(down)

  • write.csv(down, file = "Down_Treatment_vs_Control_strict.csv")

Note
Differential expression tables can be downloaded from https://github.com/jmvillalobos/RNA-seq-protocol. The output of a differential expression table generated with DESeq2 should look like Table 3.

Table 3. Preview of the Differential Expression Tables Generated With DESeq2 (Up- and Downregulated Genes)
baseMean log2FoldChange lfcSE stat p-value padj
AT1G56250 54.351119 5.87533399 0.65010368 9.03753377 1.60E-19 7.14E-18
AT1G72450 163.108104 1.02177379 0.20647886 4.94856367 7.48E-07 8.23E-06
AT1G79700 53.0739385 −1.7029217 0.34330325 −4.9604009 7.03E-07 7.79E-06
AT1G52200 1506.68968 −1.3297206 0.16382535 −8.116696 4.79E-16 1.61E-14
AT1G80870 310.280349 −1.1105098 0.21694316 −5.1188976 3.07E-07 3.61E-06
AT1G24575 68.0515638 2.75544694 0.35408813 7.78181113 7.15E-15 2.15E-13
AT1G53625 128.236024 2.25057089 0.28583177 7.87376033 3.44E-15 1.07E-13
AT1G51440 9.75718656 1.98154125 0.70965146 2.7922739 .0052339 0.02253924
AT1G80840 91.2133689 3.4596998 0.3633223 9.52239873 1.69E-21 8.81E-20
AT1G68470 54.7916789 1.87428097 0.32386298 5.78726522 7.15E-09 1.08E-07

44.To generate a basic volcano plot:

  • Generating a volcano plot with EnhancedVolcano

  • EnhancedVolcano(res, lab = rownames(res), x = 'log2FoldChange', y = 'pvalue')

45.To generate an MA plot:

  • Setting the contrast of interest

  • plotMA(Treatment_vs_Control, alpha = 0.05, main = "Inoculated with Trichoderma vs Control", xlab = "mean of normalized counts")

46.To generate a heatmap:

  • Selecting the top 10 DEGs in the Treatment_vs_Control comparison

  • res_ordered <- Treatment_vs_Control[order(Treatment_vs_Control$padj),]

  • top_genes <- row.names(res_ordered)[1:10]

  • Extracting and normalizing counts

  • counts <- counts(dds, normalized = TRUE)

  • counts_top <- counts[top_genes,]

  • Applying a logarithmic transformation to counts

  • log_counts_top <- log2(counts_top + 1)

  • Creating an annotation data frame based on condition information (colData)

  • df <- colData

  • Displaying the annotation data frame

  • df

  • Generating a heatmap using the pheatmap library

  • heatmap_20 <- pheatmap(log_counts_top, annotation = df)

Note
The results of differential expression analysis are shown in Figure 3.

Basic Protocol 2: GENE ONTOLOGY ENRICHMENT ANALYSIS

In Basic Protocol 2, we will perform a functional enrichment analysis to comprehensively understand the underlying biological processes (Gene Ontology or GO terms; Fig. 4). Various tools are available for this type of analysis, such as CAMERA, GOseq, topGO, clusterProfiler, and gprofiler2 (Alexa & Rahnenfuhrer, 2023; Robertson & Robertson, 2024; Kolberg et al., 2020; Wu & Smyth, 2012; Wu et al., 2021; Young et al., 2010). In this protocol, we focus on using clusterProfiler because of its user-friendly interface and efficient integration with the differential expression output generated by DESeq2. clusterProfiler performs over-representation analysis (ORA; Boyle et al., 2004), a widely used approach to determine whether known biological functions or processes are over-represented (enriched) in an experimentally derived list of genes, such as a list of differentially expressed genes. To perform this analysis, follow the R script described here and on GitHub (https://github.com/jmvillalobos/RNA-seq-protocol).

GO term enrichment results obtained using clusterProfiler display enriched categories for genes in the treatment versus control contrast (upregulated genes). (A) Dot plot, (B) bar plot, and (C) enrichment map (emap) plot showcase the top enriched GO terms based on over-representation analysis.

In terms of systems resources, efficient progress in this task can be achieved by utilizing the command line within a Unix/Linux environment. We conducted this work on a 64-bit Ubuntu 23.04 operating system. For Windows users, a Windows system employing the Windows Subsystem for Linux (WSL; https://learn.microsoft.com/en-us/windows/wsl/install) is a viable option. For macOS users, it is feasible to execute identical commands in the terminal because of the shared Unix foundation of both operating systems. This protocol is compatible with the R desktop application, contingent on your operating system.

Necessary Resources

Software

Hardware

  • A minimum of 4 GB of RAM and 1 GB of storage space for optimal performance

Part 1: Installation of the package to use in the R session

1.Run R script:

  • Installing clusterProfiler

  • if (!require("BiocManager", quietly = TRUE))

  • install.packages("BiocManager")
  • BiocManager::install("clusterProfiler")

  • Installing biomaRt

  • BiocManager::install("biomaRt")

  • Installing enrichplot

  • BiocManager::install("enrichplot")

  • Installing org.At.tair.db

  • BiocManager::install("org.At.tair.db")

  • Installing tidyverse

  • install.packages("tidyverse")

2.Load required libraries:

  • library("clusterProfiler")

  • library("biomaRt")

  • library("enrichplot")

  • library("org.At.tair.db")

  • library("tidyverse")

3.Load the DESeq2 table into the R session:

  • Setting the path to the DESeq2 table.

  • setwd("C:/project_2023/quantification_featureCounts")

  • list.files()

  • Read the DESeq2 table (up or down).

  • diff_genes <- read_delim(file = "Up_Treatment_vs_Control_strict.csv", delim = ",")

  • Assign names to the first column.

  • colnames(diff_genes)[1] <- "genes"

  • Create a new table with the columns of interest.

  • diff_genes <- diff_genes[, c("genes", "log2FoldChange")]

  • Save the new table to a file.

  • write.table(diff_genes, file = "diff_genes_up.tsv", sep = "\t", row.names = FALSE, quote = FALSE)

4.Perform DEGs annotation with Ensembl and biomaRt:

  • #Loading Arabidopsis database from EnsemblPlants with biomaRt
  • arabidopsis_mart <- useMart(biomart = 'plants_mart',

  • host = 'plants.ensembl.org', dataset = 'athaliana_eg_gene')
  • head(listAttributes(arabidopsis_mart))

  • #Retrieve the entrezgene_id and description for the DEGs
  • annot_diff <- getBM(

  • values = diff_genes$genes,
  • mart = arabidopsis_mart,
  • attributes = c('ensembl_gene_id', 'entrezgene_id',
  • 'description'), filters = 'ensembl_gene_id')

Note
We annotated the differentially expressed genes, obtaining functional categories, using the “athaliana_eg_gene” dataset.

Part 2: Over-representation analysis (ORA) with clusterProfiler

5.Change the working directory:

  • setwd("C:/project_2023/quantification_featureCounts")

6.Read all Arabidopsis genes from a file (universe):

  • universe_arabidopsis <- read.delim("matriz_arabidopsis_2023.txt", header = TRUE, stringsAsFactors = FALSE)[,1]

  • #Retrieve the entrezgene_id and description for the universe
  • annot_universe <- getBM(

  • values = universe_arabidopsis,
  • mart = arabidopsis_mart,
  • attributes = c('ensembl_gene_id', 'entrezgene_id',
  • 'description'), filters = 'ensembl_gene_id')

Note
The universe_arabidopsis file is used as the set of genes considered the “universe” in the functional over-representation analysis. The term “universe” refers to the complete set of genes used as a basis for comparison with a subset of differentially expressed genes (DEGs). This comparison is designed to determine whether specific biological terms or functions are overrepresented in the DEGs compared to the entire set.

7.For compatibility with the enrichGO function, annot_universe genes must be characters, not integers, so it is necessary to convert them:

  • annot_universe$entrezgene_id <- as.character(annot_universe$entrezgene_id)

8.Perform ORA for the Gene Ontology Biological Process class:

  • ora_analysis_BP <- enrichGO(

  • gene = annot_diff$entrezgene_id,
  • universe = annot_universe$entrezgene_id,
  • OrgDb = org.At.tair.db,
  • keyType = "ENTREZID",
  • ont = "BP",
  • pAdjustMethod = "BH",
  • qvalueCutoff = 0.05,
  • readable = TRUE,
  • pool = FALSE
  • )

9.Simplify the ORA results:

  • ora_analysis_BP_final <- clusterProfiler::simplify(ora_analysis_BP)

10.Export results and plot figures:

  • Write simplified results in a CSV file.

  • write_delim(

  • x = as.data.frame(ora_analysis_BP@result),
  • path = "go_results_up.csv",
  • delim = ","
  • )
  • Plot 1 - Dotplot.

  • dotplot(ora_analysis_BP_final, showCategory = 10)

  • Plot 2 - Barplot.

  • barplot(ora_analysis_BP_final, showCategory = 10)

  • Plot 3 - Enrichment Map (Emap Plot).

  • ora_analysis_BP <- pairwise_termsim(ora_analysis_BP, method = "JC")

  • emapplot(ora_analysis_BP, color = "qvalue", showCategory = 15)

Note
The results of gene ontology enrichment analysis are shown in Figure 4.

Note
For the differential expression and GO terms enrichment analyses, we have used packages such as biomaRt, enrichplot, org.At.tair.db, EnhancedVolcano, and clusterProfiler (Blighe et al., 2019; Carlson, 2019; Durinck et al., 2005; Durinck et al., 2009; Wu et al., 2021;  Yu, 2023).

Basic Protocol 3: DE NOVO ASSEMBLY OF DATA FROM NON-MODEL PLANTS

In Basic Protocol 1, we analyzed the transcriptome using a reference genome; in the absence of a reference genome, it is necessary to construct the transcriptome de novo. For this purpose, in Basic Protocol 3, we use a dataset from Sarwar et al. (2019), who generated it while exploring drought tolerance in Agave sisalana. Employing Trinity (Grabherr et al., 2011), we perform de novo assembly on a curated dataset of 10,000 reads for efficient execution on personal computers. Part of the workflow used in this protocol is shown in Figure 1. The dataset is available on GitHub (https://github.com/jmvillalobos/RNA-Seq-protocol).

In terms of systems, efficient progress in this task can be achieved utilizing the command line within a Unix/Linux environment. We conducted our work on a 64-bit Ubuntu 23.04 operating system. For Windows users, a Windows system employing the Windows Subsystem for Linux (WSL; https://learn.microsoft.com/en-us/windows/wsl/install) is a viable option. For macOS users, it is feasible to execute identical commands in the terminal because of the shared Unix foundation of both operating systems. It is important to note that accurate data analysis demands substantial computational resources to execute effectively (see below).

Necessary Resources

Software

Hardware

  • A minimum of 4 GB RAM and 5 GB storage space (we recommend ∼1 GB of RAM for every 1 million Illumina read pairs)

Part 1: Assembling with RNA-seq data using Trinity

1.Establish directories and install Trinity:

  • user:∼/project_2023$ mkdir novo_assembly
  • user:∼/project_2023/novo_assembly$ mkdir trimming_data_agave
  • user:∼/project_2023/novo_assembly$ mkdir trinity_analysis

Note
Before commencing de novo assembly, create the specified directories and deposit the clean data in the trimming_data_agave folder.

2.Install Trinity and CD-HIT:

  • user:∼/project_2023/bioinformatics_tools$ sudo apt install trinityrnaseq
  • user:∼/project_2023/bioinformatics_tools$ sudo apt install cd-hit

Note
We used Trinity version 2.13.2 and CD-HIT version 4.8.1.

3.Organize data in preparation for de novo assembly using Trinity: Before initiating the de novo assembly, we consolidate sequencing data from the treatment and control conditions into a unified input file for Trinity (--samples_file). To accomplish this, furnish Trinity with a list of FASTQ files organized by treatment/control and replicate name, adhering to the structure outlined in a file named samples.txt, as explained below.

  • user:∼/project_2023/novo_assembly/trinity_analysis$ head samples.txt
  • Tra Tra_SRR5137658 ../trimming_data_agave/Tra_SRR5137658_1_P.fastq.gz ../trimming_data_agave/Tra_SRR5137658_2_P.fastq.gz
  • Tra Tra_SRR5137660 ../trimming_data_agave/Tra_SRR5137660_1_P.fastq.gz ../trimming_data_agave/Tra_SRR5137660_2_P.fastq.gz
  • Tra Tra_SRR5137663 ../trimming_data_agave/Tra_SRR5137663_1_P.fastq.gz ../trimming_data_agave/Tra_SRR5137663_2_P.fastq.gz
  • Con Con_SRR5137659 ../trimming_data_agave/Con_SRR5137659_1_P.fastq.gz ../trimming_data_agave/Con_SRR5137659_2_P.fastq.gz
  • Con Con_SRR5137661 ../trimming_data_agave/Con_SRR5137661_1_P.fastq.gz ../trimming_data_agave/Con_SRR5137661_2_P.fastq.gz
  • Con Con_SRR5137662 ../trimming_data_agave/Con_SRR5137662_1_P.fastq.gz ../trimming_data_agave/Con_SRR5137662_2_P.fastq.gz

4.Using the samples.txt file, perform de novo transcriptome assembly of the reads with Trinity as follows:

  • user:∼/project_2023/novo_assembly/trinity_analysis$ Trinity --seqType fq --samples_file samples.txt --CPU 4 --max_memory 12G

Note
The CPU and max_memory parameters should be set according to the resources available on the user's computer. It is important to highlight that it is essential to carry out the assembly using a high-performance computing server for real data analysis.

5.After the assembly, you will obtain two files and a directory; the file trinity_out_dir.Trinity.fasta represents the assembled transcripts:

  • user:∼/project_2023/novo_assembly/trinity_analysis$ ls -l
  • trinity_out_dir
  • trinity_out_dir.Trinity.fasta
  • trinity_out_dir.Trinity.fasta.gene_trans_map

6.Proceeding with the process, rename the files for enhanced manageability:

  • user:∼/project_2023/novo_assembly/trinity_analysis$ mv trinity_out_dir.Trinity.fasta Trinity.fasta
  • user:∼/project_2023/novo_assembly/trinity_analysis$ mv trinity_out_dir.Trinity.fasta.gene_trans_map Trinity.fasta.gene_trans_map

7.The assembly header should look like this:

  • user:∼/project_2023/novo_assembly/trinity_analysis$ head Trinity.fasta
  • >TRINITY_DN992_c0_g1_i1 len=223 path=[0:0-222]

  • CTGGGGTCAGAGAAGCTGTAAAGTTATGCACTACTGCTGGCGTGAAGGTGCGGATGGTCACTGGAGATAACCTTAAGACTGCTAAAGCCATTGCTCTGGAGTGTGGGATACTTGATTCGGAAGCAGAAGCAACAGAGCCCACATTGATACACGGACAAACATTCCGTGCGATGCGTGAAAAAGATAGAGACTCAGTTGCTGACAGGATCTCTGTGATGGGAAG
  • >TRINITY_DN967_c1_g1_i1 len=217 path=[0:0-216]

  • CAACCAAGACCTGTACTTCAAGAAGACCGTGAAATATGTCGGAGAGCCAATGACTCATTTGGAGTCAATTGCTTCATCTGCTGTACGTGCTGCAATTAAAGTTAAAGCTTCTGTCATCGTTGTCTTTACTTCATCTGGAAGGGCAGCTAGATTAATTGCAAAGTATAGGCCTACAATGCCTGTATTATCAGTTGTCATTCCTCGGCTCAAAACAAAC

Note
The “gene” identifier corresponds to the prefix of the transcript identifier (e.g., TRINITY_DN992_c0_g1), and distinct isoforms of that “gene” will have varying isoform numbers in the identifier suffix (e.g., TRINITY_DN992_c0_g1_i1 and TRINITY_DN992_c0_g1_i2, representing two different isoform sequences reconstructed for the single gene TRINITY_DN992_c0_g1).

Part 2: Evaluating the assembly

After completing the de novo assembly, it is crucial to evaluate its quality, meticulously considering potential errors. Before proceeding to further analyses, it is vital to perform a comprehensive assessment of the de novo transcriptome assembly, as a subpar assembly may lead to misinterpretation of gene identification and differential expression analysis (Raghavan et al., 2022). Key quality metrics include the read alignment rate, the ExN50 statistic, and the count of universal genes with matches according to the BUSCO values (Simão et al., 2015). Notably, in this protocol, we exclusively visualize the basic statistics of the assembly.

In a high-quality assembly, one would expect the reads used to construct the assembly to exhibit a high alignment percentage; an alignment rate of 70% or more could indicate satisfactory transcriptome quality. For BUSCO, there is no consensus about the optimal value, as it is a relative metric that can vary between different assemblers or assembly configurations. Nevertheless, the general expectation is to recover as many complete universal genes as possible.

In this context, the N50 value is another important metric, indicating that at least half of the assembled bases are contained within contigs equal to or longer than that value. However, this metric is not well suited to transcriptome assemblies, because there the goal is to recover many assembled sequences rather than to construct a few long contigs. A more appropriate statistic for evaluating transcriptome assemblies is the ExN50 statistic, which excludes contigs with low expression that tend to be very short because of insufficient read coverage. This statistic often yields higher values than the conventional N50 statistic and is more representative in this context. The ExN50 value in an RNA-seq assembly conducted with Trinity can vary significantly depending on several factors, such as the species under study, the quality of the sequencing data, and the transcriptome complexity. In general terms, a high ExN50 value suggests a more complete and contiguous assembly, which is highly desirable. However, there is no universally “good” threshold for ExN50, as its assessment depends heavily on the aforementioned factors. Consequently, it is advisable to interpret the ExN50 value in the specific context of each project and to consider other assembly quality parameters for a more comprehensive and accurate evaluation. In general, all of these metrics depend on several factors, such as the species studied, the quality of the sequencing data, and the assembly tools used.
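BUSCO itself is not installed in the steps above; assuming it has been installed separately (for example, through Conda), a typical transcriptome-mode run on the assembly would look like the sketch below, where the lineage dataset (here viridiplantae_odb10) is an assumption that should be matched to the organism under study:

  • user:∼/project_2023/novo_assembly/trinity_analysis$ busco -i Trinity.fasta -l viridiplantae_odb10 -m transcriptome -o busco_agave -c 4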

8.To determine the count of assembled transcripts, employ the following command:

  • user:∼/project_2023/novo_assembly/trinity_analysis$ grep '>' Trinity.fasta | wc -l
  • 2116

9.If desired, more comprehensive assembly statistics can be obtained using a script bundled with Trinity called TrinityStats.pl. To locate the script's path, execute the following command:

  • user:∼/project_2023/novo_assembly/trinity_analysis$ dpkg-query -L trinityrnaseq | grep TrinityStats.pl

Note
We used this command to retrieve the paths of the scripts bundled with Trinity.

10.After identifying the path, execute it on the file Trinity.fasta:

  • user:∼/project_2023/novo_assembly/trinity_analysis$ perl /usr/lib/trinityrnaseq/util/TrinityStats.pl Trinity.fasta

  • ################################

  • Counts of transcripts, etc.

  • ################################

  • Total trinity ‘genes’: 1884

  • Total Trinity transcripts: 2116

  • Percent GC: 49.26

  • ########################################

  • Stats based on ALL transcript contigs:

  • ########################################

  • Contig N10: 1268

  • Contig N20: 981

  • Contig N30: 736

  • Contig N40: 585

  • Contig N50: 462

  • Median contig length: 315

  • Average contig: 431.78

  • Total assembled bases: 913641

  • #####################################################

  • Stats based on ONLY LONGEST ISOFORM per ‘GENE’:

  • #####################################################

  • Contig N10: 1227

  • Contig N20: 845

  • Contig N30: 632

  • Contig N40: 504

  • Contig N50: 413

  • Median contig length: 303

  • Average contig: 402.23

  • Total assembled bases: 757802

Part 3: Redundancy removal

De novo transcriptome assemblers often generate many sequences, exceeding the gene count in the genome. This complexity results from transcriptional artifacts, pre-mRNA, noncoding RNA (ncRNA), and isoforms generated through alternative splicing processes (Freedman et al., 2021). Assembly reduction is commonly employed to simplify and obtain a manageable, nonredundant set of sequences (Raghavan et al., 2022). Practical tools for this purpose include CD-HIT, Corset, Grouper, and Compacta (Davidson & Oshlack, 2014; Fu et al., 2012; Li & Godzik, 2006; Malik et al., 2018; Razo-Mendivil et al., 2020).

11.To eliminate redundancy from the transcriptome, employ CD-HIT by executing the following command:

  • user:∼/project_2023/novo_assembly/trinity_analysis$ cd-hit-est -i Trinity.fasta -o Trinity_90.fasta -c 0.9 -n 9 -T 4 -M 3000

Note
For details about the parameters, please review the CD-HIT online manual (https://www.bioinformatics.org/cd-hit/cd-hit-user-guide).

12.After this clustering process, the user can revisit the statistics, as mentioned previously, and observe a reduction in the number of contigs.

  • user:∼/project_2023/novo_assembly/trinity_analysis$ perl /usr/lib/trinityrnaseq/util/TrinityStats.pl Trinity_90.fasta

  • ################################

  • Counts of transcripts, etc.

  • ################################

  • Total trinity ‘genes’: 1869

  • Total trinity transcripts: 1904

  • Percent GC: 49.19

  • ########################################

  • Stats based on ALL transcript contigs:

  • ########################################

  • Contig N10: 1198

  • Contig N20: 857

  • Contig N30: 646

  • Contig N40: 514

  • Contig N50: 419

  • Median contig length: 305

  • Average contig: 405.38

  • Total assembled bases: 771847

  • #####################################################

  • Stats based on ONLY LONGEST ISOFORM per ‘GENE’:

  • #####################################################

  • Contig N10: 1227

  • Contig N20: 843

  • Contig N30: 630

  • Contig N40: 505

  • Contig N50: 413

  • Median contig length: 303

  • Average contig: 402.57

  • Total assembled bases: 752400

Part 4: Transcript expression quantification with Kallisto

After eliminating redundancy from the assembly, the next step is to estimate the expression values of the transcripts. This task is performed with Kallisto, executed through a helper script bundled with Trinity.

13.The quantification process can be carried out according to the following details for each replicate within a particular condition:

  • user:∼/project_2023/novo_assembly/trinity_analysis$ perl /usr/lib/trinityrnaseq/util/align_and_estimate_abundance.pl --seqType fq --samples_file samples.txt --transcripts Trinity_90.fasta --est_method kallisto --trinity_mode --prep_reference

14.This procedure generates six new directories, each corresponding to the quantification conducted by Kallisto for individual replicates:

  • user:∼/project_2023/novo_assembly/trinity_analysis$ ls -l

  • drwxr-xr-x 2 user 4096 Jan 17 15:54 Con_SRR5137659

  • drwxr-xr-x 2 user 4096 Jan 17 15:54 Con_SRR5137661

  • drwxr-xr-x 2 user 4096 Jan 17 15:54 Con_SRR5137662

  • drwxr-xr-x 2 user 4096 Jan 17 15:54 Tra_SRR5137658

  • drwxr-xr-x 2 user 4096 Jan 17 15:54 Tra_SRR5137660

  • drwxr-xr-x 2 user 4096 Jan 17 15:54 Tra_SRR5137663

15.You can now inspect the contents of each directory as follows:

  • user:∼/project_2023/novo_assembly/trinity_analysis$ ls -l Tra_SRR5137660/
  • -rw-r--r-- 1 user 82266 Jan 17 15:54 abundance.tsv
  • -rw-r--r-- 1 user 85282 Jan 17 15:54 abundance.tsv.genes
  • -rw-r--r-- 1 user 614 Jan 17 15:54 run_info.json

16.You can further examine one of the files containing the quantification in transcripts per million (TPM) for a specific replicate:

  • user:∼/project_2023/novo_assembly/trinity_analysis$ head Tra_SRR5137660/abundance.tsv |column -t
  • target_id                length  eff_length  est_counts  tpm
  • TRINITY_DN992_c0_g1_i1   217     67.3252     0           0
  • TRINITY_DN966_c0_g1_i1   883     715.953     1           90.7221
  • TRINITY_DN967_c0_g1_i1   405     239.298     2           542.861
  • TRINITY_DN989_c0_g1_i1   261     103.773     1           625.913
  • TRINITY_DN981_c0_g1_i1   462     295.647     0           0
  • TRINITY_DN971_c0_g1_i1   267     109.287     0           0
  • TRINITY_DN929_c0_g1_i1   961     793.953     8           654.475
  • TRINITY_DN926_c0_g1_i1   212     63.6193     1           1020.96
  • TRINITY_DN912_c0_g1_i1   210     62.1871     2           2088.95
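For reference, the tpm column is the estimated count normalized by effective length and rescaled so that the values in a sample sum to one million:

TPM<sub>i</sub> = 10<sup>6</sup> × (est_counts<sub>i</sub>/eff_length<sub>i</sub>) / Σ<sub>j</sub>(est_counts<sub>j</sub>/eff_length<sub>j</sub>)

As a quick sanity check on the rows above, TRINITY_DN966_c0_g1_i1 and TRINITY_DN989_c0_g1_i1 each have one estimated count, so their TPM values differ only by the ratio of their effective lengths (715.953/103.773 ≈ 6.9, matching 625.913/90.7221 ≈ 6.9).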

Part 5: Counting and gene expression matrices

Using the abundance.tsv files produced by Kallisto, a matrix of estimated counts and a matrix of TPM expression values, normalized across samples with the TMM method, can be generated.

17.Initially, compile a list of the abundance.tsv files for each replicate:

  • user:∼/project_2023/novo_assembly/trinity_analysis$ find Tra_* Con_* -name "abundance.tsv" | tee quant_files.list

  • Tra_SRR5137658/abundance.tsv

  • Tra_SRR5137660/abundance.tsv

  • Tra_SRR5137663/abundance.tsv

  • Con_SRR5137659/abundance.tsv

  • Con_SRR5137661/abundance.tsv

  • Con_SRR5137662/abundance.tsv

18.Using the quant_files.list file, employ another Trinity script to generate two matrices:

  • user:∼/project_2023/novo_assembly/trinity_analysis$ perl /usr/lib/trinityrnaseq/util/abundance_estimates_to_matrix.pl --est_method kallisto --out_prefix kallisto --name_sample_by_basedir --quant_files quant_files.list --gene_trans_map Trinity_90.fasta.gene_trans_map

You have now produced count matrices at both the gene and isoform levels. You can proceed with the differential expression analysis using the scripts bundled with Trinity (as sketched below) or follow Basic Protocol 1 manually, with minimal adjustments for your conditions.
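As a minimal sketch of the Trinity route (not run in this demonstration), the bundled run_DE_analysis.pl script can take the gene-level count matrix together with the same samples.txt file used above. The script location shown here mirrors the /usr/lib/trinityrnaseq path used earlier for this installation, and the choice of DESeq2 as the method and DESeq2_gene as the output directory are assumptions; running it requires R with the DESeq2 package installed.

  • user:∼/project_2023/novo_assembly/trinity_analysis$ perl /usr/lib/trinityrnaseq/Analysis/DifferentialExpression/run_DE_analysis.pl --matrix kallisto.gene.counts.matrix --method DESeq2 --samples_file samples.txt --output DESeq2_gene

The output directory should contain one table of differential expression results per pairwise comparison of conditions, which can then be filtered with the same p<sub>adj</sub> and log<sub>2</sub>(fold change) cutoffs applied in Basic Protocol 1.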

COMMENTARY

Background Information

This article offers a comprehensive guide for performing differential gene expression analysis using two distinct approaches. In the first approach (presented in Basic Protocols 1 and 2), HISAT2 is employed to align reads to the reference genome; in the second approach (detailed in an alternative protocol on GitHub at https://github.com/jmvillalobos/RNA-seq-protocol), Kallisto is utilized, relying on pseudoalignment against the organism's transcript sequences. Given the nature of the analysis, we strongly recommend using HISAT2, as it tends to yield more precise results. However, our comparison of results at the level of differentially expressed genes highlights the remarkable similarity between the two strategies (Fig. 5). Therefore, for users with limited computational resources, opting for Kallisto can be an excellent strategy to significantly reduce time and computational load.

Figure 5. Differentially expressed genes (DEGs) obtained by HISAT2 and Kallisto. (A) Bar graph illustrating the number of induced and repressed genes for each bioinformatics tool. (B and C) Venn diagrams illustrating upregulated (B) and downregulated (C) genes shared and unique between HISAT2 and Kallisto. Those meeting the criteria of p<sub>adj</sub> <0.05 and abs[log<sub>2</sub>(fold change)] >1, according to DESeq2 analysis, were considered. We generated this Venn diagram using the online tool available at https://bioinformatics.psb.ugent.be/webtools/Venn/.

Additionally, Basic Protocol 3 addresses the application of the Trinity software in a scenario without a reference genome, conducting de novo assembly for a dataset from a non-model organism. Although the process we describe here was expedited by limiting the reads to only 10,000, we must emphasize the substantial demand for computational resources when working with more realistic datasets. For this reason, in real-world scenarios, we recommend performing the assembly on servers with higher RAM resources. We encourage the use of services such as Amazon Web Services.

This protocol is a practical tool for experienced researchers and a valuable resource for students seeking to delve into RNA-seq analysis. Discussing differences between mappers, models for differential expression analysis, and considerations when working with or without a reference genome provides a deeper and more contextual understanding for new users. This inclusive approach facilitates the entry and participation of students in this research field and contributes to the continuous development of skills in genomic data analysis.

Critical Parameters

A crucial step in executing this series of protocols is carefully selecting a suitable Linux-based system. When opting for Ubuntu, version 20.04 or later is strongly recommended. Another vital consideration is to run commands with resource settings adjusted to the user's computer, without exceeding the available RAM and number of threads.
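As a quick illustration, the following standard Linux commands report the installed RAM, free disk space, available CPU threads, and Ubuntu release, which can guide the memory and thread settings used throughout these protocols (e.g., -M and -T for CD-HIT); lsb_release assumes the lsb-release package is present.

  • user:∼/project_2023$ free -h
  • user:∼/project_2023$ df -h .
  • user:∼/project_2023$ nproc
  • user:∼/project_2023$ lsb_release -a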

Although many of the commands in these protocols are executed with default parameters, it is highly advisable to review the available parameters thoroughly before running each tool, as appropriate settings may vary depending on the organism or project objectives. Nevertheless, these protocols are fully functional and readily adaptable to a wide range of organisms.

Troubleshooting Table

For a list of possible problems, causes, and solutions, see Table 4.

Table 4. Troubleshooting Guide
Problem | Possible cause | Solution
Killed process | Limited memory or disk space | Use a high-performance server
Unsuccessful program installation | No compatible version available for your system | Install via conda

Understanding Results

Basic Protocol 1

In this protocol, we tested differential gene expression in A. thaliana inoculated with spores of the fungus T. atroviride (Treatment) compared to non-inoculated plants (Control). The principal component analysis (PCA) plot reveals consistent clustering of the biological replicates for each condition (Fig. 3A). This type of analysis is of fundamental importance in RNA-seq experiments because it provides valuable insights from the initial stages of the investigation: PCA visualizes sample variability and alerts us to possible discrepancies between replicates.

Next, an MA plot showing the log base 2 (log2) fold changes is obtained (Fig. 3B). In this graph, we set a significance cutoff (alpha) of 0.05, and points below this threshold are colored blue. The volcano plot in Figure 3C represents the differentially expressed genes in the Treatment versus Control contrast, both induced and repressed. This graph is informative because it highlights the genes with the most significant changes in expression based on the log2(fold change) and adjusted p-value criteria; the visual representation provides a precise and rapid perspective on the magnitude of changes in gene expression, allowing efficient identification of the most relevant genes in the contrast. Finally, Figure 3D shows the top ten differentially expressed genes of this contrast, comprising a well-defined group of genes that are expressed at low levels in the control condition and overexpressed under treatment. These differential expression results are consistent with those reported in the article from which these data are taken (Villalobos-Escobedo et al., 2020).

Basic Protocol 2

After identifying the differentially expressed genes, a quick way to grasp the overall landscape of the biological phenomenon in experimental comparisons is to conduct a functional category enrichment analysis. This analysis relies heavily on the annotation level of the genome used. In the case of Arabidopsis, the genomic annotation is quite comprehensive, making the use of GO terms highly convenient. Our protocol employs the clusterProfiler tool because of its versatility in enrichment analysis and in generating graphs that aid result comprehension. We present three useful types of graphs (Fig. 4). First, a dot plot allows visualization of the proportion of differentially expressed genes per category (GeneRatio), the significance of enrichment (adjusted p-value, p<sub>adj</sub>), and the gene count. A similar graph is the bar plot, although the GeneRatio is not shown there, to simplify the figure and enhance comprehension. Lastly, we introduce the emap plot, which displays the relationships between GO terms and indicates how related these terms are, potentially revealing linked, enriched processes. These graphs are valuable tools to demonstrate result consistency and comprehensively highlight significant processes.

Basic Protocol 3

This protocol involves the analysis of data derived from an organism that lacks a reference genome (a non-model organism). It shares with Basic Protocol 1 the quality analysis phase with FastQC and the trimming of reads with Trimmomatic. Subsequently, an assembly is carried out using Trinity with a curated data set. Examining assembly statistics such as the alignment rate, ExN50, and BUSCO values is essential. However, this protocol does not address these statistics because of the low quality of the assembly obtained, which we attribute to the use of a minimal data set so that the process could run on a personal computer in this demonstration. Eliminating redundancy in the assembly is highly recommended; here, CD-HIT was used successfully to reduce redundant transcripts. Subsequently, an abundance quantification analysis was performed using Kallisto, generating count matrices at the isoform level (kallisto.isoform.counts.matrix) and at the gene level (kallisto.gene.counts.matrix). Because of space limitations, Basic Protocol 3 does not include details of the differential expression analysis and transcriptome annotation steps. In general, however, these are very similar to those discussed in Basic Protocols 1 and 2.

Time Considerations

The time required for the analyses presented varies depending on the protocol used and the amount of data to be analyzed. Basic Protocols 1 and 2 will take ∼1 day of work, depending mainly on the computational power of the equipment used for analysis. Execution of Basic Protocol 3 is estimated to take ∼4 hr.

Acknowledgments

E.P.-S. and K.M.H.-M. received grants no. 989265 and 827463 for their M.Sc. theses from CONAHCyT Mexico. The authors thank Tania Quintana and Laura Gálvez for executing this protocol series and giving us their valuable comments and suggestions for improvement.

Author Contributions

Enrique Pola-Sánchez : Conceptualization; methodology; writing—original draft; writing—review and editing. Karen Magdalena Hernández-Martínez : Methodology; visualization; writing—review and editing. Rafael Pérez-Estrada : Visualization. Nelly Selem-Mojica : Writing—review and editing. June Simpson : Writing—review and editing. María Jazmín Abraham-Juárez : Writing—review and editing. Alfredo Herrera-Estrella : Writing—review and editing. José Manuel Villalobos-Escobedo : Conceptualization; methodology, writing—original draft; writing—review and editing.

Conflict of Interest

The authors declare no conflicts of interest.

Open Research

Data Availability Statement

The data are openly available in the public repository Zenodo at https://zenodo.org/records/10537097.

Literature Cited

  • Alexa, A., & Rahnenfuhrer, J. (2023). topGO: Enrichment analysis for gene ontology. GO R package version 2.54.0. https://bioconductor.org/packages/topGO
  • Anders, S., Pyl, P. T., & Huber, W. (2015). HTSeq—a python framework to work with high-throughput sequencing data. Bioinformatics , 31(2), 166–169. https://doi.org/10.1093/bioinformatics/btu638
  • Blighe, K., Rana, S., & Lewis, M. (2019). EnhancedVolcano: Publication-ready volcano plots with enhanced coloring and labeling. R package version 1(0), https://bioconductor.org/packages/release/bioc/html/EnhancedVolcano.html
  • Bolger, A. M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics , 30(15), 2114–2120. https://doi.org/10.1093/bioinformatics/btu170
  • Boyle, E. I., Weng, S., Gollub, J., Jin, H., Botstein, D., Cherry, J. M., & Sherlock, G. (2004). GO::TermFinder—open source software for accessing gene ontology information and finding significantly enriched gene ontology terms associated with a list of genes. Bioinformatics , 20(18), 3710–3715. https://doi.org/10.1093/bioinformatics/bth456
  • Bray, N. L., Pimentel, H., Melsted, P., & Pachter, L. (2016). Near-optimal probabilistic RNA-Seq quantification. Nature Biotechnology , 34(5), 525–527. https://doi.org/10.1038/nbt.3519
  • Carlson, M. (2019). org.At.tair.db: Genome wide annotation for Arabidopsis. R package version 3.8.2. https://bioconductor.org/packages/org.At.tair.db/
  • Chen, Y., Chen, Y., Shi, C., Huang, Z., Zhang, Y., Li, S., Li, Y., Ye, J., Yu, C., Li, Z., Zhang, X., Wang, J., Yang, H., Fang, L., & Chen, Q. (2018). SOAPnuke: A MapReduce acceleration-supported software for integrated quality control and preprocessing of high-throughput sequencing data. Gigascience , 7(1), 1–6. https://doi.org/10.1093/gigascience/gix120
  • Chen, S., Zhou, Y., Chen, Y., & Gu, J. (2018). fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics , 34(17), i884–i890. https://doi.org/10.1093/bioinformatics/bty560
  • Conesa, A., Madrigal, P., Tarazona, S., Gomez-Cabrero, D., Cervera, A., McPherson, A., Szcześniak, M. W., Gaffney, D. J., Elo, L. L., Zhang, X., & Mortazavi, A. (2016). A survey of best practices for RNA-Seq data analysis. Genome Biology , 17, 13. https://doi.org/10.1186/s13059-016-0881-8
  • Danecek, P., Bonfield, J. K., Liddle, J., Marshall, J., Ohan, V., Pollard, M. O., Whitwham, A., Keane, T., McCarthy, S. A., Davies, R. M., & Li, H. (2021). Twelve years of SAMtools and BCFtools. GigaScience , 10(2), giab008. https://doi.org/10.1093/gigascience/giab008
  • Davidson, N. M., & Oshlack, A. (2014). Corset: Enabling differential gene expression analysis for de novo assembled transcriptomes. Genome Biology , 15(7), 410. https://doi.org/10.1186/s13059-014-0410-6
  • Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., & Gingeras, T. R. (2013). STAR: Ultrafast universal RNA-Seq aligner. Bioinformatics , 29(1), 15–21. https://doi.org/10.1093/bioinformatics/bts635
  • Durinck, S., Moreau, Y., Kasprzyk, A., Davis, S., de Moor, B., Brazma, A., & Huber, W. (2005). BioMart and Bioconductor: A powerful link between biological databases and microarray data analysis. Bioinformatics , 21(16), 3439–3440. https://doi.org/10.1093/bioinformatics/bti525
  • Durinck, S., Spellman, P., Birney, E., & Huber, W. (2009). Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nature Protocols , 4, 1184–1191. https://doi.org/10.1038/nprot.2009.97
  • Ewing, B., & Green, P. (1998). Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Research , 8(3), 186–194. https://doi.org/10.1101/gr.8.3.186
  • Freedman, A. H., Clamp, M., & Sackton, T. B. (2021). Error, noise and bias in de novo transcriptome assemblies. Molecular Ecology Resources , 21(1), 18–29. https://doi.org/10.1111/1755-0998.13156
  • Fu, L., Niu, B., Zhu, Z., Wu, S., & Li, W. (2012). CD-HIT: Accelerated for clustering the next-generation sequencing data. Bioinformatics , 28(23), 3150–3152. https://doi.org/10.1093/bioinformatics/bts565
  • Grabherr, M. G., Haas, B. J., Yassour, M., Levin, J. Z., Thompson, D. A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q., Chen, Z., Mauceli, E., Hacohen, N., Gnirke, A., Rhind, N., di Palma, F., Birren, B. W., Nusbaum, C., Lindblad-Toh, K., … Regev, A. (2011). Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology , 29(7), 644–652. https://doi.org/10.1038/nbt.1883
  • Jiang, H., Lei, R., Ding, S. W., & Zhu, S. (2014). Skewer: A fast and accurate adapter trimmer for next-generation sequencing paired-end reads. BMC Bioinformatics , 15, 182. https://doi.org/10.1186/1471-2105-15-182
  • Kasschau, K. D., Fahlgren, N., Chapman, E. J., Sullivan, C. M., Cumbie, J. S., Givan, S. A., & Carrington, J. C. (2007). Genome-wide profiling and analysis of Arabidopsis siRNAs. PLoS Biology , 5(3), e57. https://doi.org/10.1371/journal.pbio.0050057
  • Kim, D., Paggi, J. M., Park, C., Bennett, C., & Salzberg, S. L. (2019). Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nature Biotechnology , 37(8), 907–915. https://doi.org/10.1038/s41587-019-0201-4
  • Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., & Salzberg, S. L. (2013). TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology , 14(4), R36. https://doi.org/10.1186/gb-2013-14-4-r36
  • Kolberg, L., Raudvere, U., Kuzmin, I., Vilo, J., & Peterson, H. (2020). gprofiler2—an R package for gene list functional enrichment analysis and namespace conversion toolset g:Profiler. F1000Research , 9, ELIXIR–709. https://doi.org/10.12688/f1000research.24956.2
  • Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods , 9(4), 357–359. https://doi.org/10.1038/nmeth.1923
  • Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R., 1000 Genome Project Data Processing Subgroup. (2009). The sequence alignment/map format and SAMtools. Bioinformatics , 25(16), 2078–2079. https://doi.org/10.1093/bioinformatics/btp352
  • Li, W., & Godzik, A. (2006). Cd-hit: A fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics , 22(13), 1658–1659. https://doi.org/10.1093/bioinformatics/btl158
  • Liao, Y., Smyth, G. K., & Shi, W. (2014). featureCounts: An efficient general-purpose program for assigning sequence reads to genomic features. Bioinformatics , 30(7), 923–930. https://doi.org/10.1093/bioinformatics/btt656
  • Liao, Y., Smyth, G. K., & Shi, W. (2019). The R package Rsubread is easier, faster, cheaper and better for alignment and quantification of RNA sequencing reads. Nucleic Acids Research , 47(8), e47. https://doi.org/10.1093/nar/gkz114
  • Liu, T., Tang, J., Chen, L., Zeng, J., Wen, J., Yi, B., Ma, C., Tu, J., Fu, T., & Shen, J. (2019). Differential expression of miRNAs and their targets in wax-deficient rapeseed. Scientific Reports , 9(1), 12201. https://doi.org/10.1038/s41598-019-48439-z
  • Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. Genome Biology , 15(12), 1–21. https://doi.org/10.1186/s13059-014-0550-8
  • Malik, L., Almodaresi, F., & Patro, R. (2018). Grouper: Graph-based clustering and annotation for improved de novo transcriptome analysis. Bioinformatics , 34(19), 3265–3272. https://doi.org/10.1093/bioinformatics/bty378
  • Martin, M. (2011). Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal , 17(1), 10–12. https://doi.org/10.14806/ej.17.1.200
  • Mehdi, S. M. M., Krishnamoorthy, S., Szczesniak, M. W., & Ludwików, A. (2021). Identification of novel miRNAs and their target genes in the response to abscisic acid in Arabidopsis. International Journal of Molecular Sciences , 22(13), 7153. https://doi.org/10.3390/ijms22137153
  • Patro, R., Duggal, G., Love, M. I., Irizarry, R. A., & Kingsford, C. (2017). Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods , 14(4), 417–419. https://doi.org/10.1038/nmeth.4197
  • Raghavan, V., Kraft, L., Mesny, F., & Rigerte, L. (2022). A simple guide to de novo transcriptome assembly and annotation. Briefings in Bioinformatics , 23(2), bbab563. https://doi.org/10.1093/bib/bbab563
  • Razo-Mendivil, F. G., Martínez, O., & Hayano-Kanashiro, C. (2020). Compacta: A fast contig clustering tool for de novo assembled transcriptomes. BMC Genomics , 21(1), 148. https://doi.org/10.1186/s12864-020-6528-x
  • Robertson, H., & Robertson, N. (2024). TOP: TOP constructs transferable model across gene expression platforms. R package version 1.4.0. https://doi.org/10.18129/B9.bioc.top
  • Robinson, M. D., McCarthy, D. J., & Smyth, G. K. (2010). edgeR: A Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics , 26(1), 139–140. https://doi.org/10.1093/bioinformatics/btp616
  • Sarwar, M. B., Ahmad, Z., Rashid, B., Hassan, S., Gregersen, P. L., Leyva, M. O., Nagy, I., Asp, T., & Husnain, T. (2019). De novo assembly of Agave sisalana transcriptome in response to drought stress provides insight into the tolerance mechanisms. Scientific Reports , 9(1), 396. https://doi.org/10.1038/s41598-018-35891-6
  • Simão, F. A., Waterhouse, R. M, Ioannidis, P., Kriventseva, E. V., & Zdobnov, E. M. (2015). BUSCO: Assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics , 31(19), 3210–3212. https://doi.org/10.1093/bioinformatics/btv351
  • Stark, R., Grzelak, M., & Hadfield, J. (2019). RNA sequencing: The teenage years. Nature Reviews Genetics , 20(11), 631–656. https://doi.org/10.1038/s41576-019-0150-2
  • Villalobos-Escobedo, J. M., Esparza-Reynoso, S., Pelagio-Flores, R., López-Ramírez, F., Ruiz-Herrera, L. F., López-Bucio, J., & Herrera-Estrella, A. (2020). The fungal NADPH oxidase is an essential element for the molecular dialog between Trichoderma and Arabidopsis. The Plant Journal , 103(6), 2178–2192. https://doi.org/10.1111/tpj.14891
  • Wu, T., Hu, E., Xu, S., Chen, M., Guo, P., Dai, Z., Feng, T., Zhou, L., Tang, W., Zhan, L., Fu, X., Liu, S., Bo, X., & Yu, G. (2021). clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation , 2(3), 100141. https://doi.org/10.1016/j.xinn.2021.100141
  • Wu, D., & Smyth, G. K. (2012). Camera: A competitive gene set test accounting for inter-gene correlation. Nucleic Acids Research , 40(17), e133. https://doi.org/10.1093/nar/gks461
  • Young, M. D., Wakefield, M. J., Smyth, G. K., & Oshlack, A. (2010). Gene ontology analysis for RNA-Seq: Accounting for selection bias. Genome Biology , 11(2), R14. https://doi.org/10.1186/gb-2010-11-2-r14
  • Yu, G. (2023). enrichplot: Visualization of Functional Enrichment Result. R package version 1.22.0. https://bioconductor.org/packages/enrichplot
