A Computational Workflow for Analysis of 3′ Tag-Seq Data
Akshay D. Paropkari, Akshay D. Paropkari, Priyanka S. Bapat, Priyanka S. Bapat, Suzanne S. Sindi, Suzanne S. Sindi, Clarissa J. Nobile, Clarissa J. Nobile
Abstract
RNA-sequencing (RNA-seq) is a gold-standard method to profile genome-wide changes in gene expression. RNA-seq uses high-throughput sequencing technology to quantify the amount of RNA in a biological sample. With the increasing popularity of RNA-seq, many variations on the protocol have been proposed to extract unique and relevant information from biological samples. 3′ Tag-Seq (also called TagSeq, 3′ Tag-RNA-Seq, and Quant-Seq 3′ mRNA-Seq) is one RNA-seq variation where the 3′ end of the transcript is selected and amplified to yield one copy of cDNA from each transcript in the biological sample. We present a simple, easy-to-use, and publicly available computational workflow to analyze 3′ Tag-Seq data. The workflow begins by trimming sequence adapters from raw FASTQ files. The trimmed sequence reads are checked for quality using FastQC and aligned to the reference genome, and then read counts are obtained using STAR. Differential gene expression analysis is performed using DESeq2, based on differential analysis of gene count data. The outputs of this workflow are MA plots, tables of differentially expressed genes, and UpSet plots. This protocol is intended for users specifically interested in analyzing 3′ Tag-Seq data, and thus normalizations based on transcript length are not performed within the workflow. Future updates to this workflow could include custom analyses based on the gene counts table as well as data visualization enhancements. © 2023 The Authors. Current Protocols published by Wiley Periodicals LLC.
Basic Protocol : Running the 3′ Tag-Seq workflow
Support Protocol : Generating genome indices
INTRODUCTION
RNA sequencing (RNA-seq) is a widely used method to detect genome-wide changes in gene expression (Wang, Gerstein, & Snyder, 2009). It was first used in Saccharomyces cerevisiae to identify the gene expression patterns of all genes, exons, and their boundaries across the yeast genome (Nagalakshmi et al., 2008). For humans as well as many model organisms including yeast, fruit flies, and mice, RNA-seq has been instrumental in providing high-resolution, functionally relevant genome annotations (Cherry et al., 2012; Gnerre et al., 2011; IHGSC, 2004; Matthews et al., 2015). Briefly, the RNA-seq protocol begins with the isolation of total RNA, which is typically enriched/selected for polyA+ RNA or alternatively depleted for rRNA (Zhao, Zhang, Gamini, Zhang, & von Schack, 2018). After this step, double-stranded cDNA is synthesized via reverse transcription from the RNA, resulting in cDNA libraries. The cDNA molecules are fragmented and sequence adapters are added to the cDNA fragments. Then, the cDNA fragments are subjected to high-throughput sequencing to generate short sequence reads that are used for downstream analyses.
Whole-transcript RNA-seq workflows produce data providing information on quantifying gene expression, novel transcripts, alternatively spliced genes, and allele-specific expression (Wang et al., 2009). Although this information is valuable, the goal of many biological research projects is simply to identify changes in gene expression patterns between conditions of interest. Given this simplified goal, classical RNA-seq protocols are overly complex and more costly than is necessary to obtain genome-wide gene expression changes (Ma et al., 2019; Wang et al., 2009). To simplify classical RNA-seq protocols, recent advances in the field have provided alternative RNA-seq methods that are used to address specific biological questions (Moll, Ante, Seitz, & Reda, 2014; Morrissy et al., 2011). One of these alternative methods is 3′ Tag-Seq (also called TagSeq, 3′ Tag-RNA-Seq, and Quant-Seq 3′ mRNA-Seq). In this method, cDNA libraries are reverse-transcribed from only the 3′ end of the mRNA, resulting in a single copy of cDNA from each transcript (Ma et al., 2019; Moll et al., 2014; Torres, Metta, Ottenwälder, & Schlötterer, 2008). Compared to classical RNA-seq methods, 3′ Tag-Seq is simpler, quicker, and less costly while providing sufficient sequencing depth for differential gene expression analysis (Ma et al., 2019). These benefits make 3′ Tag-Seq an ideal choice for researchers whose goal is to identify changes in patterns of gene expression between two or more conditions.
Analysis of sequencing datasets from classical RNA-seq data was complex when this technology was first introduced (Wang et al., 2009). Even today, analysis of RNA-seq data can be challenging because it is highly dependent on the experimental design used to create the sequencing libraries (Conesa et al., 2016). Consequently, there is no one-size-fits-all workflow for analysis of RNA-seq output reads. Here, we present a simple, easy-to-use, and publicly available computational workflow to analyze 3′ Tag-Seq data, which allows the user to use default analysis parameters or adjust parameters to fit their unique experimental design considerations. The workflow begins by trimming sequence adapters from raw FASTQ files (Cock, Fields, Goto, Heuer, & Rice, 2010; Conesa et al., 2016). The trimmed sequence reads are checked for quality using FastQC and aligned to the reference genome, and read counts are obtained using STAR (Dobin et al., 2013). Differential gene expression analysis is then performed using DESeq2, based on differential analysis of gene count data (Love, Huber, & Anders, 2014). The outputs of this workflow are MA plots, tables of differentially expressed genes, and UpSet plots. This workflow is well-suited for users with minimal computational expertise.
Basic Protocol: RUNNING THE 3′ Tag-Seq WORKFLOW
In the following section, we describe in detail our 3′ Tag-Seq analysis workflow. Figure 1 provides a summary of the computational workflow. Briefly the pipeline begins with the processing of raw RNA-seq FASTQ files and ends with a table output of differential gene expression (Love et al., 2014).

Necessary Resources
Hardware
- An internet-connected machine running Linux 64-bit Ubuntu version 20.04.1 with at least 32 GB RAM. The number of threads can be provided by the user, but 16 threads is recommended.
Software
- The workflow uses the Conda command line tool environment to install all required software and tools. Conda software can be accessed at https://docs.conda.io/en/latest/miniconda.html. The workflow is saved in a bash script file called pipeline.sh. The source code and documentation can be found on GitHub at https://github.com/akshayparopkari/RNAseq/wiki. The Conda environment file is provided in Supplemental file 1 (see Supporting Information).
Other requirements
- Access to a computational cluster and login information
- Basic knowledge of Unix
- Raw FASTQ sequencing data
- Sample metadata
Download the RNA-seq workflow on a local machine
In Linux and MacOS, use the built-in Terminal application. In Windows, download and use Git Bash (https://gitforwindows.org/).
1.Navigate to the desired directory to download this folder on your machine.
2.Make script files executable.
- cd RNAseq/
- chmod u+x pipeline.sh
- chmod u+x format_counts_table.py
Load the Conda virtual environment
Conda enables virtual environments that contain the required software packages/libraries to be installed and set up. In this instance, the RNA-seq Conda environment contains the BBMap suite, STAR alignment software, and FASTQC tool (Bushnell, 2014; Dobin et al., 2013). Additionally, required Python and R libraries and their dependencies are also installed.
3.Create Conda environment using Supplemental File 1.
- conda env create -f Supplemental_file_1.yml -n RNAseq
The environment only needs to be created once. For subsequent analysis, the environment can be activated to run the analysis using the command:
- conda activate RNAseq
Create an input data folder
4.The main script of 3′ Tag-Seq is the pipeline.sh file. This single bash script contains all the preprocessing steps: QC filtering with BBDuk, generating QC summaries with FastQC, and alignment and gene counting with STAR. The pipeline.sh script takes in a single input, which is a folder/directory containing:
- all raw FASTQ sequence files
- the sample metadata Excel file
The raw FASTQ sequence files may be either compressed (using gzip) or uncompressed. The file names must start with the sample ID, followed by an underscore and the rest of the file name. For example, projectname_date_L001.fastq.gz should be named sampleid_projectname_date_L001.fastq.gz. The first part of the file name before the first underscore dictates the sample the script is processing. The sample metadata file contains all metadata associated with the input samples including sample ID, genotype, condition, treatment, time, etc. For this repository, the sample metadata file must contain at least two columns: SampleID and Condition. Table 1 is an example of a sample metadata file, where the first two columns (SampleID and Condition) are required and the third and subsequent columns (e.g., FASTQ_file) are optional but highly recommended. A comprehensive metadata file also enables convenient sample submission to a sequence read archive (SRA) once your manuscript is published.
SampleID | Condition | FASTQ_file | Other_Sample_Info |
---|---|---|---|
Sample1A | WT | Sample1A_S8_L001_R1_001.fastq.gz | … |
Sample1B | Mutant | Sample1B_S8_L001_R1_001.fastq.gz | … |
Sample2A | WT | Sample2A_S8_L001_R1_001.fastq.gz | … |
Sample2B | Mutant | Sample2B_S8_L001_R1_001.fastq.gz | … |
Sample3A | WT | Sample3A_S8_L001_R1_001.fastq.gz | … |
Sample3B | Mutant | Sample3B_S8_L001_R1_001.fastq.gz | … |
… | … | … | … |
-
This file can be used as a template to generate a metadata file.
Transfer data to/from a cloud computing resource to a local machine via command line
5.Below is a common usage of the secure copy scp function, which is one of the commands used to transfer files to/from a cloud computing resource. The other command is a secure file transfer protocol sftp. Please refer to the cloud computing resource wiki for detailed instructions on sftp function.
- scp FROM TO
where FROM is the source location and TO is the destination location.
Run the RNA-seq pipeline
The RNA-seq Conda environment must be activated before attempting to execute the pipeline.
6.Run the RNA-seq pipeline.
- INPUTFOLDER=“path/to/your/input/folder” # enter your data folder with FASTQ files here
- bash pipeline.sh "
INPUTFOLDER"/preprocess.log
Output files:
- pipeline.sh creates multiple output files, which can be useful to gain insights into specific samples to address any discrepancy in the data. The three important files to check are:
- gene_raw_counts.txt: a tab-separated file of raw gene counts for all samples, with gene names as rows and samples as columns
- deseq2_lfc.txt: a tab-separated file from the DESeq2 analysis
- MA_plot.pdf: a .pdf file depicting volcano plots of log fold changes against mean gene expression
Additional information about other output files:
- All files ending “_trimmed.fastq” are trimmed sequences from BBmap and are saved in the trim_log directory
- All files ending “.bam” are alignment files generated by STAR and are saved in the STAR_log directory
- All files ending “ReadsPerGene.out.tab” are gene count files for each sample generated by STAR and are saved in the STAR_log directory
- All files ending “Log.out”, “Log.final.out”, and “Log.progress.out” are intermediary alignment files generated by STAR and are saved in the STAR_log directory
Visualize overlaps in multiple experimental conditions
7.Use the overlap_upsetR.R script to visualize overlaps in genes for multiple experimental conditions. The overlap is represented as an UpSet plot (Lex, Gehlenborg, Strobelt, Vuillemot, & Pfister, 2014). UpSet plots are an extension of Venn diagrams and are useful when there are more than three categories/sets of conditions/samples to consider. The overlap_upsetR.R takes one input (either “up” or “down”) to calculate overlap between various samples/conditions. Users need to supply an input directory in the code on line 38 and run the following command to get the output UpSet plot:
To visualize genes upregulated in multiple conditions/samples:
- overlap_upsetR.R up
To visualize genes downregulated in multiple conditions/samples:
- overlap_upsetR.R down
Figure 2 is an example of an UpSet plot output of Candida albicans RNA-seq data showing upregulated genes overlapping various conditions.

Support Protocol: GENERATING GENOME INDICES
During the alignment step, STAR utilizes genome index files for mapping sequenced reads to a reference genome. This protocol describes how to generate genome indices for the C. albicans assembly 21 genome as an example. These steps can be used to generate genome indices for your reference genome of choice.
Necessary Resources
- See Basic Protocol.
1.In your home folder, download C. albicans chromosomal sequences from the Candida Genome Database (CGD; http://www.candidagenome.org/; Skrzypek et al., 2017).
- wget http://www.candidagenome.org/download/sequence/C_albicans_SC5314/Assembly21/current/C_albicans_SC5314_A21_current_chromosomes.fasta.gz
- gunzip C_albicans_SC5314_A21_current_chromosomes.fasta.gz
2.Download the C. albicans genome annotation GTF file from the CGD.
- wget http://www.candidagenome.org/download/gff/C_albicans_SC5314/Assembly21/C_albicans_SC5314_A21_current_features.gtf
- gunzip C_albicans_SC5314_A21_current_features.gtf
3.Activate the 3′ Tag-Seq Conda environment.
- module load anaconda3
- source activate RNA-seq
4.Generate STAR genomes.
- mkdir ca_genome/
- cd ca_genome/
- STAR --runMode genomeGenerate --genomeDir./ --genomeFastaFiles ∼/ C_albicans_SC5314_A21_current_chromosomes.fasta
STAR will generate output index files in the ca_genome folder.
COMMENTARY
Background Information
We present a simple and user-friendly workflow to analyze 3′ Tag-Seq experimental data. This workflow allows one to determine differentially expressed genes in the experimental conditions of interest. The workflow also includes a useful UpSet plot script to allow further analysis of differential gene expression data.
Given that the experimental methodology behind 3′ Tag-Seq is focused on the 3′ end of the transcript, information related to splicing events is not captured, and this is a limitation of this methodology.
We note that this workflow is distinct from other pipelines that analyze traditional RNA-seq experimental data in terms of gene counting. In traditional RNA-seq analyses, gene counts are normalized by the length of the transcript or fragment to obtain comparable count values across genes and across sampling conditions. For traditional RNA-seq experiments, transcripts are fragmented and reverse-transcribed into cDNA. As a result, longer transcripts generate more cDNA, necessitating length-based normalization procedures during analysis. For 3′ Tag-Seq experiments, only the 3′-end of the transcript is reverse-transcribed, eliminating this need for length-based normalizations. This advantage of 3′ Tag-Seq facilitates an overall simpler analysis workflow compared to traditional RNA-seq in obtaining accurate read count estimates from biological samples.
Critical Parameters and Troubleshooting
The script only takes in a single input—a folder with raw FastQ files. The FastQ files can be either compressed or uncompressed.
See Table 2 for troubleshooting tips.
Issue | Possible cause | Solution |
---|---|---|
No gene count table generated by STAR | STAR was unable to locate the reference genome | Follow steps listed in Section 2 of STAR documentation |
https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf | ||
No output from DESEq2 analysis | Metadata file was formatted incorrectly | Follow steps listed in GitHub Wiki |
https://github.com/akshayparopkari/RNAseq/wiki/4.-Creating-input-data-folder |
Understanding Results
The outputs of this pipeline are MA plots, tables of differentially expressed genes, and UpSet plots. The output files are saved in the input folder supplied with the script. The DGE output table consists of six numerical columns: baseMean, log2FoldChange, lfcSE, stat, pvalue, and padj. The details of these columns are explained in “Analyzing RNA-seq data with DESeq2” (https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html). To obtain a list of genes that are differentially regulated between two conditions, users can focus on the log2FoldChange and padj columns. The log2FoldChange indicates the magnitude of change in expression values between the two experimental conditions and the padj value provides statistical significance towards the calculated magnitude change. The availability of these output files including the MA-plot indicates a successful run of this pipeline.
Acknowledgments
We thank all past and present members of the Nobile and Sindi laboratories for feedback on the manuscript. We are especially grateful to Deepika Gunasekaran for constructive comments and support on the computational workflow as well as to the anonymous reviewers for providing feedback that significantly improved the utility of the workflow. This work was supported by the National Institutes of Health (NIH) National Institute of General Medical Sciences (NIGMS) award number R35GM124594 and by the Kamangar family in the form of an endowed chair to C.J.N. The content is the sole responsibility of the authors and does not represent the views of the funders. The funders had no role in the design of the study; the collection, analysis, or interpretation of the data; the writing of the manuscript; or the decision to publish the results.
Author Contributions
Akshay Paropkari : Conceptualization, data curation, formal analysis, investigation, methodology, resources, software, validation, visualization, writing (original draft). Priyanka Bapat : Data curation, formal analysis, investigation, methodology, validation. Suzanne Sindi : Resources, supervision, writing (review and editing). Clarissa Nobile : Conceptualization, funding acquisition, project administration, resources, supervision, writing (review and editing).
Conflict of Interest
C.J.N. is a cofounder of BioSynesis, Inc., a company developing diagnostics and therapeutics for biofilm infections. BioSynesis, Inc. was not involved in the content of this manuscript.
Open Research
Data Availability Statement
The source files that support this protocol are freely available from GitHub (https://github.com/akshayparopkari/RNAseq). The example RNA-seq data used to generate Figure 2 can be obtained from the corresponding authors upon request. The test dataset for the TF028 (rme1 mutant) strain versus the wildtype strain can be accessed at NCBI GEO (https://www.ncbi.nlm.nih.gov/gds) under accession number GSE200778.
Supporting Information
cpz1664-sup-0001-SuppMat.yml
Conda environment file for running the 3′ Tag-Seq workflow.
Supporting Information
Filename | Description |
---|---|
cpz1664-sup-0001-SuppMat.yml5.2 KB | Supporting Information |
Please note: The publisher is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.
Literature Cited
- Bushnell, B. (2014). BBMap: A fast, accurate, splice-aware aligner (report no. LBNL-7065E). Lawrence Berkeley National Lab (LBNL), Berkeley, Calif. Retrieved from https://www.osti.gov/biblio/1241166
- Cherry, J. M., Hong, E. L., Amundsen, C., Balakrishnan, R., Binkley, G., Chan, E. T., … Wong, E. D. (2012). Saccharomyces Genome Database: The genomics resource of budding yeast. Nucleic Acids Research , 40(D1), D700–D705. doi: 10.1093/nar/gkr1029
- Cock, P. J. A., Fields, C. J., Goto, N., Heuer, M. L., & Rice, P. M. (2010). The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Research , 38(6), 1767–1771. doi: 10.1093/nar/gkp1137
- Conesa, A., Madrigal, P., Tarazona, S., Gomez-Cabrero, D., Cervera, A., McPherson, A., … Mortazavi, A. (2016). A survey of best practices for RNA-seq data analysis. Genome Biology , 17(1), 1–19. doi: 10.1186/s13059-016-0881-8
- Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., … Gingeras, T. R. (2013). STAR: Ultrafast universal RNA-seq aligner. Bioinformatics , 29(1), 15–21. doi: 10.1093/bioinformatics/bts635
- Gnerre, S., MacCallum, I., Przybylski, D., Ribeiro, F. J., Burton, J. N., Walker, B. J., … Jaffe, D. B. (2011). High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences of the United States of America , 108(4), 1513–1518. doi: 10.1073/pnas.1017351108
- International Human Genome Sequencing Consortium (IHGSC). (2004). Finishing the euchromatic sequence of the human genome. Nature , 431(7011), 931–945. doi: 10.1038/nature03001
- Lex, A., Gehlenborg, N., Strobelt, H., Vuillemot, R., & Pfister, H. (2014). UpSet: Visualization of intersecting sets. IEEE Transactions on Visualization and Computer Graphics , 20(12), 1983–1992. doi: 10.1109/TVCG.2014.2346248
- Love, M. I., Huber, W., & Anders, S. (2014). Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology , 15(12), 1–21. doi: 10.1186/s13059-014-0550-8
- Ma, F., Fuqua, B. K., Hasin, Y., Yukhtman, C., Vulpe, C. D., Lusis, A. J., & Pellegrini, M. (2019). A comparison between whole transcript and 3′ RNA sequencing methods using Kapa and Lexogen library preparation methods. BMC Genomics , 20(1), 1–12. doi: 10.1186/s12864-018-5393-3
- Matthews, B. B., Dos Santos, G., Crosby, M. A., Emmert, D. B., St Pierre, S. E., & Gramates, L. S., … FlyBase Consortium. (2015). Gene model annotations for Drosophila melanogaster : Impact of high-throughput data. G3: Genes/Genomes/Genetics , 5(8), 1721–1736. doi: 10.1534/g3.115.018929
- Moll, P., Ante, M., Seitz, A., & Reda, T. (2014). QuantSeq 3′ mRNA sequencing for RNA quantification. Nature Methods , 11(12), i–iii. doi: 10.1038/nmeth.f.376
- Morrissy, S., Zhao, Y., Delaney, A., Asano, J., Dhalla, N., Li, I., … Marra, M. (2011). Tag-Seq: Next-generation tag sequencing for gene expression profiling. In M. Harbers & G. Kahl (Eds.), Tag-Based Next Generation Sequencing (pp. 211–241). Wiley-VCH. doi: 10.1002/9783527644582.ch13
- Nagalakshmi, U., Wang, Z., Waern, K., Shou, C., Raha, D., Gerstein, M., & Snyder, M. (2008). The transcriptional landscape of the yeast genome defined by RNA sequencing. Science , 320(5881), 1344–1349. doi: 10.1126/science.1158441
- Skrzypek, M. S., Binkley, J., Binkley, G., Miyasato, S. R., Simison, M., & Sherlock, G. (2017). The Candida Genome Database (CGD): Incorporation of assembly 22, systematic identifiers and visualization of high throughput sequencing data. Nucleic Acids Research , 45(D1), D592–D596. doi: 10.1093/nar/gkw924
- Torres, T. T., Metta, M., Ottenwälder, B., & Schlötterer, C. (2008). Gene expression profiling by massively parallel sequencing. Genome Research , 18(1), 172–177. doi: 10.1101/gr.6984908
- Wang, Z., Gerstein, M., & Snyder, M. (2009). RNA-Seq: A revolutionary tool for transcriptomics. Nature Reviews Genetics , 10(1), 57–63. doi: 10.1038/nrg2484
- Zhao, S., Zhang, Y., Gamini, R., Zhang, B., & von Schack, D. (2018). Evaluation of two main RNA-seq approaches for gene quantification in clinical RNA sequencing: PolyA+ selection versus rRNA depletion. Scientific Reports , 8(1), 4781. doi: 10.1038/s41598-018-23226-4
Internet Resources
Candida albicans assembly 21 chromosomal sequence FASTA file.
Candida albicans assembly 21 chromosomal features GTF file.
Babraham Bioinformatics—FastQC A Quality Control tool for High Throughput Sequence Data. (n.d.). Retrieved November 7, 2021.
FileZilla software.
CyberDuck software.
Citing Literature
Number of times cited according to CrossRef: 1
- Shu-Yun Zhang, Xiufeng Gan, Baoguo Shen, Jian Jiang, Huimin Shen, Yuhang Lei, Qiuju Liang, Chenglian Bai, Changjiang Huang, Wencan Wu, Ying Guo, Yang Song, Jiangfei Chen, 6PPD and its metabolite 6PPDQ induce different developmental toxicities and phenotypes in embryonic zebrafish, Journal of Hazardous Materials, 10.1016/j.jhazmat.2023.131601, 455 , (131601), (2023).