Analysis of CRAC Datasets

Clémentine Delan-Forino, David Tollervey

Published: 2021-09-03 DOI: 10.17504/protocols.io.bntjmekn

Abstract

The RNA exosome complex functions in both the accurate processing and rapid degradation of many classes of RNA in eukaryotes and Archaea. Functional and structural analyses indicate that RNA can either be threaded through the central channel of the exosome or more directly access the active sites of the ribonucleases Rrp44 and Rrp6, but in most cases, it remains unclear how many substrates follow each pathway in vivo. Here we describe the method for using a UV cross-linking technique termed CRAC to generate stringent, transcriptome-wide mapping of exosome–substrate interaction sites in vivo and at base-pair resolution.

We present a protocol for the identification of RNA interaction sites for the exosome, using UV cross-linking and analysis of cDNA (CRAC) [1, 2]. A number of related protocols for the identification of sites of RNA–protein interaction have been reported, including HITS-CLIP, CLIP-Seq, iCLIP, eCLIP, and others [3, 4, 5, 6]. These all exploit protein immunoprecipitation to isolate protein–RNA complexes. CRAC is distinguished by the inclusion of tandem affinity purification and denaturing purification, allowing greater stringency in the recovery of authentic RNA–protein interaction sites.

To allow CRAC analyses, strains are created that express a “bait” protein with a tripartite tag. This generally consists of His6, followed by a TEV-protease cleavage site, then two copies of the z-domain from Protein A (HTP). The tag is inserted at the C terminus of the endogenous gene within the chromosome. The fusion construct is the only version of the protein expressed and this is under the control of the endogenous promoter. Several alternative tags have been successfully used, including a version with N-terminal fusion to a tag consisting of 3× FLAG-PreSission protease (PP) cleavage site-His6 (FPH) [7]. This is a smaller construct and is suitable for use on proteins with structures that are incompatible with C-terminal tagging. An additional variant is the insertion of a PP site into a protein that is also HTP tagged. This allows the separation of different domains of multidomain proteins. Importantly, the intact protein is cross-linked in the living cell, with domain separation in vitro. This has been successfully applied to the exosome subunit Rrp44/Dis3 to specifically identify binding sites for the PIN endonuclease domain [8].

Briefly, during standard CRAC analyses, covalently linked protein–RNA complexes are generated in vivo by irradiation with UV-C (254 nm). This generates RNA radicals that rapidly react with proteins in direct contact with the affected nucleotide (zero length cross-linking). The cells are then lysed and complexes with the bait protein are purified using an IgG column. Protein–RNA complexes are specifically eluted by TEV cleavage of the fusion protein and cross-linked RNAs trimmed using RNase A/T1, leaving a protected “footprint” of the protein binding site on the RNA. Trimmed complexes are denatured using 6 M guanidinium, immobilized on Ni-NTA affinity resin and washed under denaturing conditions to dissociate copurifying proteins and complexes. The subsequent enzymatic steps are all performed on-column, during which RNA 3′ and 5′ ends are prepared, labeled with ³²P (to allow RNA–protein complexes to be followed during gel separation) and linkers ligated. Note, however, that alternatives to using ³²P labeling have been reported (e.g., [6]). The linker-ligated RNA–protein complexes are eluted from the Ni-NTA resin and size selected on a denaturing SDS-PAGE gel. Following elution, the bound RNA is released by degradation of the bait protein using treatment with Proteinase K. The recovered RNA fragments are identified by reverse transcription, PCR amplification and sequencing using an Illumina platform.

Relative to CLIP-related protocols, CRAC offers the advantages of stringent purification, which substantially reduces background, and on-bead linker ligation, which simplifies separation of reaction constituents during successive enzymatic steps. It also avoids the need to generate the high-affinity antibodies required for immunoprecipitation. A potential disadvantage is that, despite their ubiquitous use in yeast studies, tagged constructs may not be fully functional. This can be partially mitigated by confirming the ability of the tagged protein to support normal cell growth and/or RNA processing, or by comparing the behavior of N- and C-terminal tagged constructs. Additionally, because linkers are ligated to the protein–RNA complex, UV cross-linking of the RNA at, or near, the 5′ or 3′ end may sterically hinder on-column (de)phosphorylation and/or linker ligation. With these caveats, CRAC has been successfully applied to >50 proteins in budding yeast, and in other systems ranging from pathogenic bacteria to virus-infected mouse cells [7, 9].

Before start

Analysis of sequences obtained from exosome subunit CRAC experiments is done using custom scripts and software packages. The pyCRAC [11] software, a suite of Python scripts that can be used to analyze sequencing data obtained from protein–RNA UV cross-linking protocols, includes most of the necessary tools. Here, we describe the main processing steps and the pyCRAC modules most commonly used in our analysis.

Steps

Preprocessing Step: Demultiplexing, Quality Filtering, Trimming of Adapters

Note

The 5′ adapters mentioned in previous sections contain barcodes allowing multiplexing of several samples in a sequencing lane. In addition to barcodes, 5′ adapters contain three random nucleotides allowing removal of PCR duplicates. This allows detection of reads with the same start and end positions that arise from PCR duplication of a single cDNA rather than independent linker ligation events.

For multiplexed samples, first split the sequencing output file by barcode, using pyBarcodeFilter.py from the pyCRAC package:

$ pyBarcodeFilter.py -b barcodes.list -f multiplexed_input.fastq

where barcodes.list is a tab-delimited text file listing the barcodes used in the experiment with the corresponding sample names, which are used in the output file names.

<Note title="Citation" type="success" ><span>Here is an example of how the file should appear (barcode and sample name separated by a tab):</span><span>NNNTAAGC	Rrp44-HTP_L5Aa</span><span>NNNATTAGC	Rrp6-HTP_L5Ab</span><span>NNNGTGAGC	Rrp44-exo-HTP_L5Bb</span></Note>

The random nucleotides will be stripped in this step and will be placed into the header of each sequence of the output fastq files. Later steps can make use of this information in order to collapse PCR duplicates (see step 5).
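
The barcode handling described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the pyBarcodeFilter.py implementation: the function name `split_barcode`, the `##` header convention, and the example read are assumptions made for this sketch.

```python
import re

def split_barcode(header, seq, qual, barcodes):
    """Assign a read to a sample and move its random nucleotides to the header.

    barcodes maps patterns such as 'NNNTAAGC' to sample names, where each
    'N' stands for one random nucleotide. Returns (sample, header, seq, qual)
    or None when no barcode matches the 5' end of the read.
    """
    for pattern, sample in barcodes.items():
        # Turn each N into a capture group so the random bases can be recovered.
        regex = "".join("([ACGT])" if c == "N" else c for c in pattern)
        match = re.match(regex, seq)
        if match:
            random_nt = "".join(match.groups())
            trimmed = len(pattern)
            # Hypothetical header convention: append the random bases after '##'.
            return sample, f"{header}##{random_nt}", seq[trimmed:], qual[trimmed:]
    return None

barcodes = {"NNNTAAGC": "Rrp44-HTP_L5Aa", "NNNATTAGC": "Rrp6-HTP_L5Ab"}
result = split_barcode("@read1", "ACGTAAGCTTGGCATTC", "I" * 17, barcodes)
print(result)
```

Keeping the random nucleotides in the header means they travel with the read through trimming, so they remain available when duplicates are collapsed.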



<Note title="Note" type="warning" ><span>It is important to note that the standard version of this script requires the adapters to be designed as shown in Table 1.</span></Note>

Sequencing data can then be quality filtered and adapters trimmed using Flexbar [12] with parameters -at 1 -ao 4:

$ flexbar -r input.fastq -f solexa -as TGGAATTCTCGGGTGCCAAGG -at 1 -ao 4 -u 3 -m 7 -n 16 -t flexbar.fastq

where input.fastq and flexbar.fastq are the input and output fastq file names, respectively.

When useful, for instance when the proportion of 3′ oligoadenylated reads must be calculated (see step 9), the “-g” parameter can be added to tag reads from which the 3′ adapter was removed. Then “grep” can be used to retain only these reads:

$ grep -A 3 --no-group-separator removal flexbar.fastq > flexbar_adaptercontaining.fastq
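
As an alternative to grep, the same selection can be sketched in Python so that only header lines are inspected. This is a minimal sketch that assumes standard 4-line FASTQ records and that the tag added by “-g” contains the word “removal” in the header, as the grep command above relies on. The function name `records_with_tag` and the example headers are hypothetical.

```python
import io

def records_with_tag(handle, tag="removal"):
    """Yield 4-line FASTQ records whose header line contains `tag`."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            break  # end of file
        if tag in record[0]:
            yield "".join(record)

# Two made-up records; only the first carries the (assumed) tag in its header.
example = io.StringIO(
    "@read1_Flexbar_removal\nACGT\n+\nIIII\n"
    "@read2\nTTGA\n+\nIIII\n"
)
kept = list(records_with_tag(example))
print(len(kept))
```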

Collapsing

Sequences can then be collapsed using the random nucleotides present in the 5′ linker (as mentioned in step 2). The pyFastqDuplicateRemover.py script from the pyCRAC software counts reads with identical ends and identical random nucleotides in the 5′ barcode as one:

$ nohup pyFastqDuplicateRemover.py -f flexbar.fastq -o flexbar_comp.fasta &
where flexbar_comp.fasta is the collapsed output file.



<Note title="Note" type="warning" ><span>This step can be skipped if the analysis aims to study ribosomal RNA. With the linkers mentioned above, collapsing retains at most 64 copies of any given sequence (3 random nucleotides = 4<sup>3</sup> possibilities); since the exosome binds strongly to pre-rRNA, collapsing would flatten exosome binding peaks across the pre-rRNA. However, this step is essential for studying exosome binding on RNA polymerase II transcripts.</span></Note>
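
The collapsing logic can be illustrated with a short Python sketch. It is not the pyFastqDuplicateRemover.py code: the function `collapse_duplicates` and the example reads are assumptions, and reads are represented simply as (random nucleotides, sequence) pairs.

```python
def collapse_duplicates(reads):
    """Collapse reads sharing both sequence and random barcode.

    reads is an iterable of (random_nt, sequence) pairs. Returns a dict
    mapping each unique pair to the number of raw reads it represents.
    """
    counts = {}
    for random_nt, sequence in reads:
        key = (random_nt, sequence)
        counts[key] = counts.get(key, 0) + 1
    return counts

reads = [
    ("ACG", "TTGGCATTC"),  # original molecule
    ("ACG", "TTGGCATTC"),  # PCR duplicate: same barcode, same sequence
    ("GTA", "TTGGCATTC"),  # independent ligation event: different barcode
]
collapsed = collapse_duplicates(reads)
print(len(collapsed))  # number of reads kept after collapsing
print(4 ** 3)          # maximum number of distinct 3-nt random barcodes
```

Because only 4^3 = 64 barcodes exist, a sequence recovered thousands of times (as for pre-rRNA) can never contribute more than 64 reads after collapsing, which is why peaks on very abundant substrates are flattened.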

Alignment

Reads should then be aligned to the Saccharomyces cerevisiae genome (SGD v64) using Novoalign (Novocraft) with genome annotation from Ensembl (EF4.74) [13], supplemented with noncoding sequences as described [14], with parameters -r Random:

$ novoalign -f flexbar_comp.fasta -s 1 -r Random -d Saccharomyces_cerevisiae.EF4.74.novoindex > flexbar_comp.novo
where Saccharomyces_cerevisiae.EF4.74.novoindex is the genome-specific index file generated by novoindex, and flexbar_comp.novo is the output file name.

<Note title="Note" type="warning" ><span>The “-r Unique” or “-r All” parameters are especially useful for studying exosome binding across tRNAs, which share common sequences [10]. “-r Unique” will lead to preferential loss of a subset of sequences (e.g., ribosomal sequences, which are represented by two identical <i>RDN37</i> sequences in the yeast reference genome).</span></Note>

<Note title="Note" type="warning" ><span>By default, NovoAlign filters out all reads shorter than 17 nt (as shorter reads are unlikely to map uniquely to the yeast genome). For datasets obtained from Rrp44 CRAC, it was useful to align shorter sequences [15] enriched for species targeted to the Rrp44 exonuclease site and bypassing the exosome channel (Rrp44 protects 9 nt while exosome + Rrp44 protects 31–33 nt). In some analyses, we therefore used the “-l 9” parameter (instead of the default, -l 17).</span></Note>
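
The reason short reads rarely map uniquely can be seen from a back-of-the-envelope estimate. The sketch below assumes equal base frequencies, counts both strands, and uses an approximate S. cerevisiae genome size of 12.1 Mb; these are simplifying assumptions for illustration and not part of the NovoAlign algorithm.

```python
def expected_random_hits(read_length, genome_size=12_100_000):
    """Expected number of chance matches of a random read in the genome.

    Assumes equal base frequencies and counts both strands; genome_size
    is an approximate value for S. cerevisiae.
    """
    return 2 * genome_size / 4 ** read_length

for length in (9, 17):
    print(length, expected_random_hits(length))
```

Under these assumptions a 9-nt read is expected to match by chance at roughly 90 positions, whereas a 17-nt read is expected to match at far fewer than one, so hits from 9-nt alignments need to be interpreted with corresponding caution.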

Counting Overlaps with Genomic Features

To study distribution of reads across the genome, use pyReadCounters.py from the pyCRAC package. A GTF format file for genome annotation is required by the pyCRAC software and is critical to the interpretation of the output of the pyCRAC pipeline. pyCRAC is sensitive to the formatting within the GTF file and we find it useful to check the annotated GTF file using the pyCheckGTFfile.py command to ensure that the GTF file is suitable for use with the pyCRAC software:

$ pyCheckGTFfile.py --gtf annotation.gtf -o annotation_checked.gtf

where annotation.gtf is a GTF format file of the genome annotation.

$ pyReadCounters.py -f flexbar_comp.novo --gtf=annotation_checked.gtf --rpkm


<Note title="Citation" type="success" ><span>The output files are: </span><span>(1) a GTF file that can be used as input in numerous analyses within the pyCRAC package; </span><span>(2) a hit table file presenting the counts of reads mapped to each genomic feature within each defined RNA class, both as absolute values and as read numbers normalized per kilobase per million (if the --rpkm parameter is specified in the command line).</span></Note>
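
For orientation, the per-kilobase-per-million normalization can be written as a one-line function. This is the standard RPKM definition; exactly which read total pyReadCounters.py uses as the denominator is not stated here, so `total_mapped_reads` and the example numbers are assumptions.

```python
def rpkm(read_count, feature_length_nt, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count * 1e9 / (feature_length_nt * total_mapped_reads)

# Example: 500 reads on a 2,000-nt feature in a library of 5 million mapped reads.
print(rpkm(500, 2000, 5_000_000))
```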

Distribution along Genes

To observe binding distribution of exosome subunits across individual genes, use pyPileup.py from the pyCRAC package. The output is a tab-delimited file that can be plotted to obtain a visual overview of binding along the gene of interest. This gives particularly good quality plots for RNAs that are strongly targeted by the exosome:

$ pyPileup.py -f flexbar_comp.novo --gtf=annotation_checked.gtf --tab=sequence.tab -g gene.list -r 0 &

where sequence.tab is a tab-delimited file with gene names and sequences, and gene.list is a text file with the names of the genes for which you want to generate output files.

<Note title="Note" type="warning" ><span>The -r parameter allows the user to specify the length of the flanks to be added to the 5′ and 3′ ends of genes.</span></Note>
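
Conceptually, a pileup counts how many reads cover each nucleotide of a gene, optionally extended by flanks. The sketch below illustrates this with made-up coordinates; the function `pileup` is hypothetical, and the actual pyPileup.py output may contain further columns that are not modeled here.

```python
def pileup(read_intervals, gene_start, gene_end, flank=0):
    """Count reads covering each position of a gene plus optional flanks.

    Coordinates are 1-based and inclusive. Returns a list of
    (position, hits) tuples from gene_start - flank to gene_end + flank.
    """
    first, last = gene_start - flank, gene_end + flank
    hits = [0] * (last - first + 1)
    for start, end in read_intervals:
        # Clip each read to the window before counting.
        for pos in range(max(start, first), min(end, last) + 1):
            hits[pos - first] += 1
    return [(first + i, n) for i, n in enumerate(hits)]

profile = pileup([(98, 103), (101, 105)], gene_start=100, gene_end=104, flank=2)
for position, n in profile:
    print(position, n, sep="\t")
```

The resulting two-column table can be loaded into any plotting tool to draw the binding profile along the gene.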

<Note title="Note" type="warning" ><span>To study binding across a particular class of RNA, metagene plots are generated. We used custom-made scripts that are not yet available online. However, the computeMatrix, plotProfile, and plotHeatmap modules of the deepTools software allow for similar analyses [16].</span></Note>

Oligo-A Reads

Selection of reads containing nonencoded 3′ A tracts allows identification of targets oligoadenylated by TRAMP prior to binding of the exosome. We use custom-made scripts giving as output files:

a fasta file containing only oligo-A reads, used for downstream analyses
a text file with the ratio of oligo-A to total reads, and
a text file with the list of nonencoded 3′ tails.
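
Since the custom scripts are not shown, the following is only a sketch of how nonencoded 3′ A tails might be detected. It assumes that, for each read, the genomic sequence of the same length starting at the aligned 5′ end is available. The function `nonencoded_a_tail`, the `min_len` threshold, and the example sequences are all hypothetical.

```python
def nonencoded_a_tail(read, reference, min_len=2):
    """Return the nonencoded 3' oligo(A) tail of a read, or '' if none.

    reference is the genomic sequence, of the same length as the read,
    at the position where the read's 5' end aligns.
    """
    # Find the start of the terminal run of A residues in the read.
    start = len(read)
    while start > 0 and read[start - 1] == "A":
        start -= 1
    # A residues that match the genome could be encoded, so skip them.
    while start < len(read) and reference[start] == "A":
        start += 1
    tail = read[start:]
    return tail if len(tail) >= min_len else ""

reads = [
    ("ACGTTGAAAA", "ACGTTGACCG"),  # one encoded A followed by three added As
    ("ACGTTGACCG", "ACGTTGACCG"),  # fully encoded, no tail
]
tails = [nonencoded_a_tail(r, ref) for r, ref in reads]
oligo_a = [t for t in tails if t]
print(tails, len(oligo_a) / len(reads))
```

The list of tails and the ratio of oligo-A to total reads correspond to the second and third output files listed above.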