The COPILOT Raw Illumina Genotyping QC Protocol

Hamel Patel, Sang-Hyuck Lee, Gerome Breen, Stephen Menzel, Oyesola Ojewunmi, Richard J.B. Dobson

Published: 2022-04-22 DOI: 10.1002/cpz1.373

Abstract

The Illumina genotyping microarrays generate data in image format, which is processed by the platform-specific software GenomeStudio, followed by an array of complex bioinformatics analyses that rely on various software, different programming languages, and numerous dependencies to be installed and configured correctly. The entire process can be time-consuming, can lead to reproducibility errors, and can be a daunting task for bioinformaticians. To address this, we introduce the COPILOT protocol, which has been successfully used to transform raw Illumina genotype intensity data into high-quality analysis-ready data on tens of thousands of human patient samples that have been genotyped on a variety of Illumina genotyping arrays. This includes processing both mainstream and custom content genotyping chips with over 4 million markers per sample.

The COPILOT QC protocol consists of two distinct tandem procedures to process raw Illumina genotyping data. The first protocol is an up-to-date process to systematically QC raw Illumina microarray genotyping data using the Illumina-specific GenomeStudio software. The second protocol takes the output from the first protocol and further processes the data through the COPILOT (Containerised wOrkflow for Processing ILlumina genOtyping daTa) containerized QC pipeline, to automate an array of complex bioinformatics analyses to improve data quality through a secondary clustering algorithm and to automatically identify typical Genome-Wide Association Study (GWAS) data issues, including gender discrepancies, heterozygosity outliers, related individuals, and population outliers, through ancestry estimation. The data is returned to the user in analysis-ready PLINK binary format and is accompanied by a comprehensive and interactive HTML summary report file which quickly helps the user understand the data and guides the user for further data analyses. The COPILOT protocol and containerized pipeline are also available at https://khp-informatics.github.io/COPILOT/index.html. © 2022 The Authors. Current Protocols published by Wiley Periodicals LLC.

Basic Protocol 1 : Processing raw Illumina genotyping data using GenomeStudio

Basic Protocol 2 : COPILOT: A containerised workflow for processing Illumina genotyping data

INTRODUCTION

A genome-wide association study (GWAS) is an approach to identify genetic variants associated with a particular disease or phenotypic trait. Microarray-based GWAS remains a common approach for identifying these genetic associations across the whole genome. The Illumina genotyping arrays accomplish this by using pre-defined oligonucleotide probes designed to hybridize to specific regions of genomic DNA, followed by extension using chemically labeled nucleotides. The probe extension binds either red or green fluorescent agents, which can be interpreted by the Illumina-specific software GenomeStudio. This software determines the identity of alleles by automated clustering of samples based on the similarity of fluorescent intensity. However, the default clustering algorithm can fail to identify valid clusters and can also assign the wrong genotype to samples due to abnormal intensity patterns. This can be addressed by manually reviewing and recalling SNPs to increase the reliability, confidence, and overall quality of the data (SNP/sample call rates), making this a crucial quality control (QC) procedure prior to further QC using PLINK or genetic interpretation.

Through processing tens of thousands of human patient samples from various tissue sources and on a variety of Illumina genotyping arrays, including both mainstream and custom content genotyping chips with over 4 million markers per sample, we have accumulated extensive hands-on experience in processing raw Illumina genotype data for numerous genetic investigations (Coleman et al., 2016; Fabbri et al., 2018; Gardner et al., 2018; Harrison et al., 2017; Santoro et al., 2018; Traylor et al., 2017; Vassos et al., 2017; Voyle et al., 2017). We translate this knowledge and experience in this article into a detailed easy-to-follow step-by-step procedure, allowing users to effortlessly transform raw Illumina genotype intensity data to high-quality analysis-ready data.

Basic Protocol 1 walks a user through the initial QC processing of raw Illumina genotype intensity data using the Illumina-specific software called GenomeStudio. The protocol includes the initial setup of the software, loading of data, QC of internal standards, QC of genotype data using up-to-date criteria to identify and address problematic samples and SNPs, and exporting the processed data. Basic Protocol 2 describes how to use the COPILOT container to automate an array of complex bioinformatics analyses to further process and improve data quality, generate PLINK format analysis-ready data, and create the interactive HTML summary report.

STRATEGIC PLANNING

If the project consists of multiple batches that are genotyped at different time points, then it is advisable to duplicate a few samples across batches to identify and address discrepancies in genotyping of the same sample across batches. When processing different batches, the clustering positions from a processed batch can be used to cluster the next batch. This will significantly speed up the QC process and will be further explained in Basic Protocol 1.

In addition, users will require a minimum of 100 samples to reliably cluster the data in GenomeStudio, or a minimum of 100 female and 100 male samples if investigating the sex chromosomes. If sample numbers are low for a given project, then samples from another project genotyped in the same laboratory and using the same genotyping array (including version) can be merged with the data of interest to increase sample numbers to allow for reliable clustering of the intensity data in GenomeStudio.
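The sample-number guidance above can be checked before a project is started. The following is a minimal sketch, not part of the published protocol: the function name is hypothetical, and it assumes gender labels have already been normalised to "Female" and "Male".

```python
from collections import Counter

MIN_SAMPLES = 100  # minimum stated in the text for reliable clustering


def enough_samples(genders, sex_chromosomes=False):
    """Return True if a batch meets the minimum sample numbers.

    genders: list with one label per sample ("Female", "Male", or other).
    sex_chromosomes: set to True if the sex chromosomes will be investigated.
    """
    counts = Counter(genders)
    if sex_chromosomes:
        # At least 100 female and 100 male samples are needed.
        return counts["Female"] >= MIN_SAMPLES and counts["Male"] >= MIN_SAMPLES
    return len(genders) >= MIN_SAMPLES
```

If the check fails, the text suggests merging samples from another project genotyped in the same laboratory on the same array version.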

Basic Protocol 1: PROCESSING RAW ILLUMINA GENOTYPING DATA USING GenomeStudio

This protocol provides the user with complete instructions to load and thoroughly QC raw Illumina genotyping data using the Illumina GenomeStudio software. Genotype arrays contain several thousand or even millions of genetic markers on each array, which makes it impractical to visually verify every single marker. Therefore, different intensity-related criteria are used to identify genetic markers that deviate from their expected pattern and should be manually reviewed.

Necessary Resources

Hardware

As recommended by Illumina, the following hardware is required to run the GenomeStudio software:
CPU Speed–2.0 GHz or greater
Processor–64-bit, with 2 or more cores
Memory–8 GB or more
Hard Drive–100 GB or larger
Video Display–1280 × 1024
Operating System–Windows 7 or higher
Specific OS Requirements–Microsoft .NET Framework 3.5
Network Connection–1 GbE or faster

Software

GenomeStudio genotyping (the current version at the time of writing is GenomeStudio v2.0.5):
- https://emea.support.illumina.com/array/array_software/genomestudio/downloads.html

Files

The input files for GenomeStudio are:

Illumina IDAT files (.idat)–These are the raw intensity files generated by the Illumina microarrays
Illumina genotyping array manifest (.bpm)–This contains information on the target region
Illumina sample sheet (.csv)–This contains information on the sample and is usually provided by the genotyping laboratory. Additional phenotypic information can be added to this file to aid in the QC process. The sample sheet file contains the following columns:
- 1.Sample ID
- 2.SentrixBarcode (chip barcode on which the sample has been genotyped)
- 3.SentrixPosition (position on the chip of where the sample has been genotyped)
- 4.Sample_Plate (plate ID)
- 5.Sample_Well (well position of the sample)
- 6.Sample_Group (optional: Tissue source, i.e., saliva, buccal, blood, etc…)
- 7.Gender (optional: F, M, Female or Male, with the first letter always in uppercase. Any other nomenclature will be imported as “Unknown”. Without gender information the sex chromosomes cannot be reliably processed.)
- 8.Sample_Name (optional: alias name)
- 9.Replicate (optional: specify the “Sample_Name” of any replicated samples)
- 10.Parent 1 (optional: Father's “Sample_Name” if genotyped–can be used to identify parent-child or parent-parent-child SNP discrepancies)
- 11.Parent 2 (optional: Mother's “Sample_Name” if genotyped–can be used to identify parent-child or parent-parent-child SNP discrepancies)
- 12.Path (Full path to data directory containing the IDAT file for the sample)
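Because any gender value other than F, M, Female, or Male is imported as “Unknown”, it can help to screen the sample sheet before loading it. The sketch below is illustrative only: it assumes the sample rows have already been parsed into dictionaries keyed by the column names listed above (the exact header spelling in a real sample sheet may differ), and the function name is hypothetical.

```python
ACCEPTED_GENDERS = {"F", "M", "Female", "Male"}  # nomenclature listed above


def unknown_gender_samples(rows, id_column="Sample ID", gender_column="Gender"):
    """Return the IDs of samples whose gender would be imported as "Unknown".

    rows: list of dicts, one per sample, keyed by sample sheet column name.
    The default column names are assumptions taken from the list above.
    """
    flagged = []
    for row in rows:
        gender = (row.get(gender_column) or "").strip()
        # The check is case sensitive: "male" or "FEMALE" are not accepted.
        if gender not in ACCEPTED_GENDERS:
            flagged.append(row[id_column])
    return flagged
```

Samples returned by this check would need their gender corrected in the sample sheet if the sex chromosomes are to be processed reliably.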

1.Configuring GenomeStudio.

GenomeStudio uses the GenCall algorithm to cluster the raw intensity data and assign genotype calls. The software has default parameters and settings, which can be adjusted to speed up the internal calculations.

a.From the main window, select “Tools” > “Options” > “Project”:
- i.Check the “Exclude Female Y-SNPs from SNP Statistics” box.
- ii.Ensure the “No-call Threshold” is set to 0.15.
- iii.Click “Use for all New Projects”.
b.From the main window, select “Tools” > “Options” > “Module” > “Genotyping”:
- i.Check the “Use memory-based storage” box. This will massively speed up the clustering process; however, sufficient memory is required for this option. We have successfully used this option with a 32 GB RAM machine, clustering thousands of samples.

GenomeStudio will often reset these settings to default. Users are strongly advised to check these settings every time a project is created/opened.

2.Creating a new genotyping project.

a.From the main window, select “File” > “New Project” > “Genotyping” and follow the GenomeStudio Project Wizard.
- i.Under “Projects Repository”, navigate to and select the folder where you want to create and store the GenomeStudio project.
- ii.Under “Project Name”, create a project name (we suggest [PROJECT_NAME]_[DATE]_01.bsc) and click “Next”. The “_01” file will be the raw data without QC. We will save any modifications to this data as a separate file later in this protocol, which will have the same project name except it will end in “_02”, to represent a processed project.
- iii.Select “use sample sheet to load sample intensities” and click “Next”.
- iv.Specify the sample sheet, .idat file location, and manifest location, and click “Next”.
- v.Here we have an option to “import cluster positions from a cluster file” to cluster the data. From experience, we have discovered that using the default Illumina cluster file, or a cluster file generated in a different laboratory from the one in which the data was generated, can lead to poor clustering. This is generally due to laboratory-specific variation in the intensity data, which leads to cluster drift. We recommend not using a cluster file, but clustering the data using the data itself. Note that the clustering algorithm requires a minimum of 100 samples to work effectively. Once a genotype project has undergone QC, the cluster positions can be exported and re-used to cluster new data genotyped on the same genotype chip. To use an existing cluster file, check the “Import cluster positions from a cluster file” box and specify the cluster file (.egt) location. Otherwise, check “Cluster SNP's”. Generating a cluster file after QC is further detailed later in this protocol.
- vi.Check the “Calculate Sample and SNP Statistics” and “Pre-calculate” boxes.
- vii.Ensure the GenCall Threshold is set at 0.15 and click “Finish” to create the project.
Depending on the computing power and number of samples/SNPs, this process may take several minutes to several hours to cluster the data and calculate the statistics.
Once the project has been created, check that the sample numbers in GenomeStudio match those in the sample sheet. If there are any errors in the sample sheet, a sample is not present at the location specified in the sample sheet, or the .idat files are corrupt for a given sample, then GenomeStudio will not load the sample in question, and may not report an error. Therefore, it is vital to check that sample numbers in GenomeStudio correspond to those specified in the sample sheet.
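The comparison described above can also be done outside GenomeStudio. The following is a minimal sketch, assuming the sample IDs shown in the “Samples Table” have been copied or exported into a list; how that list is obtained is outside the scope of this sketch, and the function name is hypothetical.

```python
def missing_samples(sheet_ids, loaded_ids):
    """Return sample IDs listed in the sample sheet but absent from the project.

    sheet_ids: sample IDs in sample sheet order.
    loaded_ids: sample IDs that GenomeStudio actually loaded.
    """
    loaded = set(loaded_ids)
    # Preserve sample sheet order so missing rows are easy to locate.
    return [sample_id for sample_id in sheet_ids if sample_id not in loaded]
```

An empty list indicates that every sample in the sample sheet was loaded; any IDs returned point to rows whose path or intensity files should be checked.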

3.Overview of GenomeStudio windows.

An overview of the software window is shown in Figure 1. The three main panes that will be used are: (A) “SNP Graph”, which displays the intensity (green and red) of samples for a given SNP that is selected in the “SNP Table”. Samples are colored according to their genotype, where red represents AA homozygote genotype, purple represents AB heterozygote genotype, blue represents BB homozygote genotype, and black represents uncalled. The number of samples called within each cluster is displayed below the cluster. (B) “SNP Table” displays the SNP information, which includes clustering statistics, and (C) “Sample Table” displays sample information. The green boxes within the “SNP Table” and “Sample Table” panes contain options that will be used during the QC process, such as sorting of tables and filtering of rows. The remaining panes are the “Project”, “Errors Table”, and “Log”, and should be inspected for any errors; as they are not used thereafter, they can be closed to make the remaining panes bigger and easier to view.

4.Validating internal genotyping controls.

All Illumina genotyping arrays are equipped with internal probes to verify the accuracy and reproducibility of the assay. These internal controls should be checked before the individual sample or SNP QC to verify the reliability of the data. The internal controls do not have specific threshold criteria due to differing intensities across genotype assays. Therefore, the internal controls are based on relative intensities and should be checked according to the Illumina “evaluation of Infinium genotyping assay control training guide”, which is available on the Illumina website (Illumina, 2012).

To view the internal controls, from the main window, click on “Analysis” > “View Controls dashboard”.

5.Manually editing clusters.

During this protocol, SNPs may need to be manually edited to correctly cluster samples. To manually edit a cluster:

a.From the “SNP Graph” pane, press and hold the Shift key, and move the cursor to the center of a cluster. Once you are at the middle of the cluster, over the “+”, the cursor head will turn from an arrow to a directional “+” symbol.
b.Once the cursor head has turned into a “+”, left click and hold, and drag the cluster to the desired location.

Alternatively, users can select samples they would like to exclude/include from a cluster, and GenomeStudio will attempt to exclude/include these samples.

c.From the “SNP Graph” window, draw a box (left click and drag the cursor) around the sample(s) you want to include/exclude from a cluster.
d.Right-click on the highlighted sample(s) and choose from the following self-explanatory options:
- i.Define AA cluster using selected samples
- ii.Define AB cluster using selected samples
- iii.Define BB cluster using selected samples
- iv.Exclude selected samples
- v.Include selected samples

6.Manually removing SNPs.

During this protocol, some SNPs will be identified as incorrectly clustered, which cannot be rectified by manual re-clustering. These SNPs should be removed from the data. To manually exclude SNPs:

a.From the “SNP table”, right-click on the SNP that you want to exclude (many can be selected simultaneously by using the shift key).
b.Select “Zero Selected SNP”.

Alternatively:

c.From the “SNP table”, left click to highlight the SNP you want to exclude.
d.Press the “F2” button.

Users can highlight several SNPs and zero them all simultaneously using the right-click option; however, the F2 button does not work on multiple SNPs.

7.Excluding poor performing samples by GenCall score.

GenomeStudio assigns a GenCall (GC) score to each sample. The GC score is calculated by the sample clustering algorithm where each SNP is evaluated based on the angle of the clusters, dispersion of the clusters, the overlap between clusters, and intensity (Illumina, 2014). The GC score ranks samples based on how often the sample appears at the center of a cluster, with scores ranging from 0 to 1 where low values represent samples that do not cluster well with other samples. These poor clustering samples interfere with the clustering algorithm and are usually samples with low DNA quality. It is not unusual to remove up to 10% of samples. Once these samples are removed, users can re-cluster the data using good quality samples, which will create cleaner, tighter clusters and ultimately increase the sample/SNP call rate for the project.

To remove samples based on low GC scores:

a.From the main window, select “Analysis” > “Exclude samples by best run” > “Use GC10” > “Use sample ID”.

The p10GC score for each sample represents the 10th percentile of the distribution of GC scores across all genotypes and is used to evaluate the quality and performance of samples. To remove samples based on low p10GC scores:

b.Identify samples with a p10GC <0.4 and remove these samples.
- i.From the “Samples Table”, click on the “p10 GC” column header to highlight the column.
- ii.Click the “sort column (Ascending)”.
- iii.Highlight and remove samples with a low p10GC score (<0.4) by right-clicking on these samples and selecting “Exclude selected sample”.
- iv.When prompted, “Do you want to update SNP statistics for all SNPs”, click “No”.
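To make the p10GC criterion concrete, the sketch below computes the 10th percentile of per-genotype GC scores for each sample and flags those below 0.4. It is illustrative only: the text does not state how GenomeStudio computes the percentile, so the linear-interpolation method used here is an assumption, and the function names and input layout are hypothetical.

```python
def percentile(values, pct):
    """Percentile using linear interpolation between the closest ranks.

    This interpolation scheme is an assumption; GenomeStudio may differ.
    """
    if not values:
        raise ValueError("no values supplied")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def low_p10gc_samples(gc_scores_by_sample, threshold=0.4):
    """Return sorted sample IDs whose 10th-percentile GC score is below threshold.

    gc_scores_by_sample: dict mapping sample ID to a list of GC scores (0 to 1).
    """
    return sorted(
        sample
        for sample, scores in gc_scores_by_sample.items()
        if percentile(scores, 10) < threshold
    )
```

In practice the p10GC value is read directly from the “p10 GC” column of the “Samples Table”; the sketch only shows what the statistic summarises.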

8.Excluding poor performing samples by call rate.

Low sample call rates are a good indication of poorly clustering samples. These samples tend to be of low-quality DNA and do not cluster well with other samples, causing larger, more diffuse clusters, which can interfere with neighboring clusters. Removing these samples and re-clustering the data can lead to tighter, more defined clusters and can improve call rates for the remaining samples. From experience, samples with a call rate below 90% cannot be rescued to a call rate above 98%, while samples with a call rate above 90% are more likely to improve above 98% following the completion of this protocol. To remove samples with a call rate below 90%:

a.From the “Samples Table”, arrange samples by “Call Rate” and highlight all samples with a call rate of <0.90.
b.Right-click and “Exclude Selected Samples”.
c.When prompted, “Do you want to update SNP statistics for all SNPs”, click “No”.
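The call rate threshold can be illustrated with a short sketch. It assumes genotype calls are available as one list per sample and that uncalled genotypes are encoded as "--"; both the encoding and the function names are assumptions made for illustration, not a GenomeStudio export format.

```python
def call_rate(genotypes, no_call="--"):
    """Fraction of SNPs with a genotype call for one sample.

    no_call is the assumed marker for an uncalled genotype.
    """
    if not genotypes:
        return 0.0
    called = sum(1 for genotype in genotypes if genotype != no_call)
    return called / len(genotypes)


def samples_to_exclude(genotypes_by_sample, min_call_rate=0.90):
    """Return sorted sample IDs with a call rate below min_call_rate."""
    return sorted(
        sample
        for sample, genotypes in genotypes_by_sample.items()
        if call_rate(genotypes) < min_call_rate
    )
```

Samples between 0.90 and 0.98 are kept at this stage because, as noted above, they may improve after re-clustering.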

9.Re-clustering SNPs based on good clustering samples.

Following the removal of poorly performing sample(s), it is important to re-cluster the data to define clusters based on good data. If no sample was removed, then this step can be skipped. To re-cluster the data and re-calculate statistics:

a.From the “SNP Table”:
- i.Click on the “Select all” tab.
- ii.Right-click on SNPs and “Cluster Selected SNPs”.
- iii.When prompted, click “yes” to update SNP statistics for the selected SNPs.

10.Updating sample statistics

Removal of samples will cause some of the columns in the samples table to appear red. This indicates that the sample statistics have changed and have not been updated. The sample statistics can be updated as follows:

a.From the “Samples Table”:
- i.Click on the “Select all” tab.
- ii.Click the “Calculate” tab.

The sample statistics will now be updated and a zero call rate will be assigned to any samples removed.

11.Assigning gender-specific colors to SNP graph.

Throughout the QC process, gender information can be used for QC purposes and to identify sex-specific effects. Each dot in the “SNP Graph” represents a sample, and these dots can be given custom colors to distinguish them by any given phenotype, such as gender, disease status (case vs control), sample source, ethnicity, etc… To assign colors based on gender:

a.From the “Samples Table”:
- i.Click the “Filter rows” tab.
- ii.Select “Gender” from the “Columns” sub-window. Keep the operation function as “=” and, in the “Value” box, enter “Female”.
- iii.Select “Call rate” from the “Columns” sub-window. Change the “operations” to “!=” and enter the value “0”.
- iv.Ensure the “Action” under the “Sub-Statement” section is set to “AND”.
- v.Click the “→” button and then click “OK”.
b.From the “Samples Table”:
- i.Click the “Select all” tab to highlight all samples.
- ii.Right-click anywhere on the highlighted rows > “Mark Selected Rows” > “Add New” > write “Female” and change the color to “Yellow” > select “OK”.

This will highlight female samples in the “SNP Graph” as yellow. If there are samples with unknown genders, this process can be repeated to highlight males or the unknown samples as a different color, so that the user can differentiate between the three classes when performing QC.

12.Quality control of Y chromosome.

Females do not have a Y chromosome, and therefore should not be included in the Y chromosome clusters. GenomeStudio does not exclude female samples when clustering the Y chromosome SNPs, and therefore they are inadvertently clustered along with the male samples, leading to incorrect cluster formations. Female samples fail to bind the Y chromosome probe and generally lie at the bottom of the SNP graph. These samples need to be manually removed from clusters. To speed up this process, users can remove all female samples, re-cluster the Y chromosome based on male sample intensities alone, and then reintroduce the female samples.

a.From the “SNP Table”:
- i.Select “Clear filter” to remove any predefined filters.
b.Then, re-cluster the Y chromosome SNPs using male samples only. From the “Samples Table”:
- i.Click the “filter rows” tab and select “Gender” from the “Columns” sub-window. Set the operation function to “!=” and enter the value “Male”.
- ii.Select “Call rate” from the “Columns” sub-window, change “operations” to “!=”, and enter the value “0”.
- iii.Ensure that the “Action” under the “Sub-Statement” is set to “AND”.
- iv.Click the “→” button and then click “OK”.
c.The samples at this stage will be filtered to samples that are not male. These are female samples and samples with unknown gender assignments. They will now be temporarily labeled and removed to leave only male samples. From the “Samples Table”:
- i.Click the “select all” tab and right-click on the highlighted rows > “Mark selected Rows” > “Add New”. Label these samples as “temp_removed” and click “OK”. This will label all the non-male samples before removal, which will ensure they do not become mixed with previously removed samples (if any).
- ii.Click the “select all” tab, right-click on the highlighted rows, and click “Exclude Selected Samples”. When prompted “Do you wish to update SNP statistics for all SNPs”, select “No”.
d.The Y chromosome SNPs can then be re-clustered using male samples only. From the “SNP Table”:
- i.Click the “filter rows” button and select “Chr” from the “Columns” sub-window. Select the operation “=” and enter the value “Y” (this is case sensitive), then click the “→” tab and then click “OK”.
- ii.Click the “select all” button to highlight all Y chromosome SNPs and then right-click and select “Cluster Selected SNPs”. If prompted, do not update statistics.
e.The non-male samples can now be reintroduced. From the “Samples Table”:
- i.Right-click > “Select Marked Rows” > select “temp_removed”.
- ii.Right-click and “Include Selected Samples”. When prompted “Do you wish to update SNP statistics for all SNPs”, select “Yes”.
f.Now the Y chromosome SNPs can be processed. Female samples should not bind the Y chromosome probe, and as a result will exhibit a low binding intensity, which is represented in the “SNP Graph” with a low NormR intensity (<0.2). However, due to the repetitive nature of the Y chromosome and the fact that probe sequences are only 50 bp, probes targeting the Y chromosome may bind a different region of the genome, resulting in unusually high NormR intensity (>∼0.2). In the previous step, the female samples were highlighted to appear yellow within the SNP Graph pane. Using this color differentiation between males and females, the Y chromosome SNPs are processed. From the “SNP Table”, scroll through all Y chromosome SNPs and use the following criteria to manually re-cluster or zero any unreliable Y chromosome SNP:
- i.Female samples are expected to have a NormR intensity <0.2. If female samples have an unexpectedly higher value, then the Y chromosome probe has most likely bound to a different region of the genome and therefore the SNP needs to be manually removed. An example is provided in Figure 2.
- ii.If female samples have been clustered and the NormR intensity is <0.2, manually remove these samples from the cluster. An example is provided in Figure 3.
- iii.No samples should be assigned to the AB cluster.

Y chromosome SNP with high female sample intensities. Y chromosome SNPs should not include female samples in any cluster. In this scenario, the female samples are highlighted in yellow and have intensities similar to the male samples (red dots). Given the repetitive nature of the Y chromosome and the fact that probe sequences are only 50 bp, this Y chromosome probe has most likely bound to a different region of the genome, and is, therefore, an unreliable SNP and should be removed.

Y chromosome unclustered SNP. In this scenario, the female samples are highlighted in yellow. (A) Demonstrates GenomeStudio attempting to cluster the female samples, causing the clusters to overlap; due to the closeness of the clusters the SNP is given a no call (indicated by the pale coloring of the background and no numbers under the clusters). (B) The same SNP can be rescued by manually moving the clusters to exclude the female samples.
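The three review criteria for Y chromosome SNPs can be summarised as a small decision function. This is a sketch of the decision logic only: the review itself is performed visually in the “SNP Graph”, and the input layout, key names, and return labels below are hypothetical.

```python
FEMALE_NORM_R_MAX = 0.2  # female samples are expected to fall below this NormR


def review_y_snp(samples):
    """Suggest an action for one Y chromosome SNP.

    samples: list of dicts with keys "gender", "norm_r", and "call",
             where call is "AA", "AB", "BB", or None if uncalled.
    Returns "zero", "edit", "review", or "ok".
    """
    females = [s for s in samples if s["gender"] == "Female"]
    # High female intensity suggests the probe binds elsewhere in the genome.
    if any(s["norm_r"] >= FEMALE_NORM_R_MAX for s in females):
        return "zero"
    # Low-intensity females that were called must be removed from the cluster.
    if any(s["call"] is not None for s in females):
        return "edit"
    # No sample should be assigned to the AB cluster.
    if any(s["call"] == "AB" for s in samples):
        return "review"
    return "ok"
```

The order of the checks mirrors the criteria listed above: unreliable probes are flagged for removal first, then clusters that need manual editing.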

13.Quality control of X chromosome.

Males have a single X chromosome; therefore, X SNPs should have no male subjects in the heterozygote cluster (AB).

a.From the “SNP Table”:
- i.Select the “Clear filter” button to remove any predefined filters.
- ii.Click the “filter rows” button > select “Chr” from the Columns sub-window > select the operation “=” > enter the value “X” (this is case sensitive) > click the “→” tab > click “OK”.
- iii.Arrange SNPs by descending AB frequency by selecting the “AB Freq” column followed by the “sort column (descending)” tab.
- iv.Review these SNPs and ensure that no male sample is assigned to the heterozygote (AB) cluster. If the male samples cannot be manually removed from the heterozygote cluster, then the SNP should be zeroed. An example is provided in Figure 4.

X chromosome unreliable SNP. This X chromosome SNP has female samples highlighted in yellow and male samples are red, purple, or blue depending on their clustering location. Male samples have a single X chromosome and therefore cannot be heterozygote as shown above. If the male samples cannot be manually removed from the heterozygote cluster, then the SNP should be zeroed.
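The X chromosome rule (no male sample in the AB cluster) can be expressed as a simple check. The sketch below assumes genotype calls and gender labels are available per SNP as (gender, call) pairs; this layout and the function name are hypothetical.

```python
def x_snps_with_male_hets(calls_by_snp):
    """Return (SNP name, number of male AB calls) pairs, worst first.

    calls_by_snp: dict mapping SNP name to a list of (gender, call) tuples.
    Only SNPs with at least one male heterozygote call are returned.
    """
    flagged = {}
    for snp, calls in calls_by_snp.items():
        male_hets = sum(
            1 for gender, call in calls if gender == "Male" and call == "AB"
        )
        if male_hets:
            flagged[snp] = male_hets
    return sorted(flagged.items(), key=lambda item: item[1], reverse=True)
```

SNPs returned by this check correspond to those that need manual editing or zeroing in the review step above.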

14.Quality control of MT chromosome.

The MT chromosome is maternally inherited and should not show any heterozygote clusters.

a.From the “SNP Table”:
- i.Select the “Clear filter” button to remove any predefined filters.
- ii.Use the “filter rows” button to select only MT chromosome SNPs and arrange by descending AB frequency by selecting the “AB Freq” column followed by the “sort column (descending)” tab.
- iii.Manually review all SNPs, ensuring there are no heterozygote clusters.

15.Quality control of XY chromosome.

The XY SNPs are the pseudoautosomal region (PAR) SNPs, which are present on both the X and Y chromosomes; therefore, these SNPs may show male heterozygotes. There is no specific QC process for XY SNPs, and these should be processed along with the autosomal chromosomes.

16.Quality control of autosomal chromosomes.

The following steps will identify unreliably clustered SNPs that require manual reviewing and validation. Since the X, Y, and MT chromosomes have already been reviewed, these SNPs should be excluded from all the following filters; however, the XY SNPs can be processed along with the autosomal chromosomes. The following filters are all applied from the “SNP Table” window by selecting the “Filter Rows” tab. Some filters need to be selected from the “Sub Columns” window (within the “Filter Rows” option), which can be activated by selecting the chip name (usually the third-to-last selection) in the “Columns” window.

Exclude the X, Y, and MT chromosomes. This filter should remain active when applying the autosomal chromosome filters. From the “SNP Table”:Use the “filter rows” button to select “Chr != X” and “Chr != Y” and “Chr != MT”. Click “OK”.
Review all SNPs that have failed to cluster:Use the “filter rows” button to add an additional filter where the “Call Freq = 0”. These SNPs have failed to cluster. Review these SNPs to see if any can be rescued by manual reclustering.
Review low-intensity SNPs. Remove the previous filter and individually add the following filters to identify SNPs with unexpected low intensities (example provided in Fig.5). Review SNPs after each filter, remove, and then add the next filter.Use the “filter rows” button to select “AB Freq != 0” and select the chip name from the “Columns” window to activate the “Sub Columns” window. Then, from the “Sub Columns” window, select “AB R Mean”, and select operation “<” and value “0.2”.Remove the previous filter. Use the “filter rows” button to select “AA Freq != 0” and select the chip name from the “Columns” window to activate the “Sub Columns” window. Then, from the “Sub Columns” window, select “AA R Mean”, and select operation “<” and value “0.2”.Remove the previous filter. Use the “filter rows” button to select “BB Freq != 0” and select the chip name from the “Columns” window to activate the “Sub Columns” window. Then, from the “Sub Columns” window, select “BB R Mean”, and select operation “<” and value “0.2”.
Review clusters that are closer to neighboring clusters than expected. Occasionally, clusters will be overlapping. If the clusters cannot be separated, then these SNPs should be zeroed. An example is provided in Figure6.Remove the previous filter. Use the “filter rows” button to select “Cluster Sep < 0.35". The data can be arranged by selecting the sort by ascending the “Cluster sep” column to have the SNPs with most likely overlapping clusters at the top of the table.
Review unusual levels of samples in the heterozygote (AB) cluster (Figs.7and8). Remove the previous filter, individually add the following filters, and review the filtered SNPs.Use the “filter rows” button to select “AB frequency > 0.6”. This will identify SNPs with excess samples in the heterozygote cluster.Use the “filter rows” button to select “AB frequency = 0 and minor allele frequency > 0”. This will identify SNPs with a lack of samples in the heterozygote cluster.Use the “filter rows” button to select “AA frequency = 1 and Call Freq < 1”. This will identify missed AB calls. If the number of SNPs to review is high, then call Freq can be reduced to a large value such as Call Freq = 0.98.Use the “filter rows” button to select “BB frequency = 1 and Call Freq < 1”. This will Identify missed AB calls. If the number of SNPs to review is high, then call Freq can be reduced to a large value such as Call Freq = 0.98.Use the “filter rows” button to select “Het excess > 0.3”. This will identify SNPs with excess samples in the heterozygote cluster to the expected Hardy-Weinberg Equilibrium.Use the “filter rows” button to select “Het excess < -0.3”. This will identify SNPs with a lack of samples in the heterozygote cluster to the expected Hardy-Weinberg Equilibrium.
Review SNPs where their clusters are outside their expected theta position (Fig.9). Remove the previous filter and individually add the following filters and review the filtered SNPs.Use the “filter rows” button to select “AA theta mean >0.3 and AA Freq !=0”.Use the “filter rows” button to select “BB theta mean <0.7 and BB Freq !=0”.Use the “filter rows” button to select “AB theta mean <0.3 and AB Freq !=0”.Use the “filter rows” button to select “AB theta mean >0.7 and AB Freq !=0”.
Review unusually large clusters (Fig.10). Remove the previous filter and individually add the following filters and review the filtered SNPs.Use the “filter rows” button to select “AA theta deviation >0.025”.Use the “filter rows” button to select “AB theta deviation ≥0.07”.Use the “filter rows” button to select “BB theta deviation >0.025”.
Review SNPs where the GenomeStudio algorithm has low confidence in clustering. The GenTrain score ranges from 0 to 1, where high values indicate high confidence in clustering.

Remove the previous filter and use the “filter rows” button to filter SNPs with “GenTrain score < 0.7 and Edited != 1”. The “Edited” column records SNPs that have been manually edited, so this filter excludes manually edited clusters from the filtered list.
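The review filters above can also be applied outside GenomeStudio to an exported copy of the SNP statistics, for example to count how many SNPs each filter would flag before starting manual review. The following is a minimal sketch only: the thresholds are those listed in the steps above, but the dictionary keys (such as "AB Freq" or "AA T Mean") are assumed column names and should be matched to the headers of your own export.

```python
# Review filters from the steps above, keyed by the filter expression.
# The keys used to read each statistic (e.g. "AB Freq", "AA T Mean") are
# ASSUMED column names; adjust them to the headers in your own export.
REVIEW_FILTERS = {
    "AB Freq > 0.6": lambda s: s["AB Freq"] > 0.6,
    "AB Freq = 0 and Minor Freq > 0": lambda s: s["AB Freq"] == 0 and s["Minor Freq"] > 0,
    "AA Freq = 1 and Call Freq < 1": lambda s: s["AA Freq"] == 1 and s["Call Freq"] < 1,
    "BB Freq = 1 and Call Freq < 1": lambda s: s["BB Freq"] == 1 and s["Call Freq"] < 1,
    "Het Excess > 0.3": lambda s: s["Het Excess"] > 0.3,
    "Het Excess < -0.3": lambda s: s["Het Excess"] < -0.3,
    "AA T Mean > 0.3 and AA Freq != 0": lambda s: s["AA T Mean"] > 0.3 and s["AA Freq"] != 0,
    "BB T Mean < 0.7 and BB Freq != 0": lambda s: s["BB T Mean"] < 0.7 and s["BB Freq"] != 0,
    "AB T Mean < 0.3 and AB Freq != 0": lambda s: s["AB T Mean"] < 0.3 and s["AB Freq"] != 0,
    "AB T Mean > 0.7 and AB Freq != 0": lambda s: s["AB T Mean"] > 0.7 and s["AB Freq"] != 0,
    "AA T Dev > 0.025": lambda s: s["AA T Dev"] > 0.025,
    "AB T Dev >= 0.07": lambda s: s["AB T Dev"] >= 0.07,
    "BB T Dev > 0.025": lambda s: s["BB T Dev"] > 0.025,
    "GenTrain Score < 0.7 and Edited != 1": lambda s: s["GenTrain Score"] < 0.7 and s["Edited"] != 1,
}


def flag_snp(stats):
    """Return the review filters (in the order listed above) that a SNP triggers."""
    return [name for name, test in REVIEW_FILTERS.items() if test(stats)]


# Made-up statistics for a SNP with too many heterozygote calls.
example_snp = {
    "AA Freq": 0.2, "AB Freq": 0.7, "BB Freq": 0.1,
    "Minor Freq": 0.45, "Call Freq": 0.99, "Het Excess": 0.41,
    "AA T Mean": 0.03, "AB T Mean": 0.5, "BB T Mean": 0.97,
    "AA T Dev": 0.01, "AB T Dev": 0.03, "BB T Dev": 0.01,
    "GenTrain Score": 0.85, "Edited": 0,
}
flagged = flag_snp(example_snp)  # ["AB Freq > 0.6", "Het Excess > 0.3"]
```

A flagged SNP still needs to be inspected in the “SNP Graph” window; the sketch only reproduces the selection step, not the manual decision to edit or zero a SNP.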

Low-intensity SNP. This shows an example of a low-intensity SNP where the heterozygote and homozygote clusters have an unexpected Norm R intensity value below 0.2 and the clusters cannot be clearly differentiated from one another. This SNP should be zeroed.

Poor cluster separation. Using the cluster separation filter can help identify overlapping clusters. In this SNP, the heterozygote and homozygote clusters cannot be clearly distinguished from one another; therefore this SNP should be zeroed.

High levels of samples in the heterozygote cluster: The “Het excess >0.3” or “AB frequency >0.6” filter can identify clusters with excess samples in the heterozygote cluster, which can be merged clusters. This example illustrates two distinct clusters within the heterozygote cluster that can be manually separated into a homozygote and a heterozygote cluster.

Low levels of samples in the heterozygote cluster. Using the “Het excess < -0.3” filter can identify deficient heterozygote clusters based on what is expected under Hardy-Weinberg equilibrium. This SNP has two distinct clusters within the AA homozygote cluster that can be manually separated into a homozygote and a heterozygote cluster. If the clustering algorithm does not allow the clusters to be separated, then the SNP should be zeroed.

Unexpected cluster position: (A) The AA homozygote cluster has shifted to the right, where the AB cluster is expected, causing the AA homozygote and heterozygote cluster to be classified as AA homozygote. (B) The clusters can be manually moved to correct the misclassification.

Unusually large clusters. This SNP has diffused or “fat” clusters which cannot be clearly differentiated from one another; therefore, this SNP should be zeroed.

17. Quality control of replicated samples.

Samples can be replicated across batches for various reasons. GenomeStudio can identify discrepancies in SNPs between replicated samples and can flag them for the user to review. This can be extremely useful to identify batch effects or unreliable SNPs. Replicated samples are specified under the “replicate” column of the sample sheet when loading data. Alternatively, users can specify them during QC as follows:

From the main window, select “Analysis” > “Edit Replicates”, select the replicate samples under the “Sample_ID1” and “Sample_ID2” sub-windows, and then click “Add”. Once all replicates have been added, click “OK” and GenomeStudio will update the statistics.

Discrepancies in replicated samples will be indicated by a square box in the “SNP Graph” window (Fig. 11). To QC based on replicate samples:

b. From the “SNP Table”, remove all previous filters and use the “filter rows” button to filter SNPs with “Rep errors = ”.

Replicate error. A sample has been repeated twice in this assay and is indicated by the yellow square box. Both samples have a different genotype for the same SNP, and therefore this SNP is unreliable and should be zeroed.
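The logic behind a replicate error can be illustrated with a short sketch that compares the calls of each replicate pair SNP by SNP. This is not GenomeStudio's implementation: the data layout, the function name, and the "NC" label for a missing call are assumptions made for illustration.

```python
NO_CALL = "NC"  # ASSUMED label for a missing genotype call


def replicate_errors(genotypes, replicate_pairs):
    """Count, per SNP, the replicate pairs whose called genotypes disagree.

    genotypes: dict of sample ID -> dict of SNP name -> "AA", "AB", "BB" or NO_CALL
    replicate_pairs: iterable of (sample_id_1, sample_id_2) tuples
    """
    errors = {}
    for first, second in replicate_pairs:
        for snp, call_1 in genotypes[first].items():
            call_2 = genotypes[second].get(snp, NO_CALL)
            if NO_CALL in (call_1, call_2):
                continue  # a missing call cannot be judged as discordant
            if call_1 != call_2:
                errors[snp] = errors.get(snp, 0) + 1
    return errors


# Made-up example: one sample genotyped twice.
example_genotypes = {
    "S1": {"rs1": "AA", "rs2": "AB", "rs3": NO_CALL},
    "S1_rep": {"rs1": "AA", "rs2": "BB", "rs3": "AA"},
}
discordant = replicate_errors(example_genotypes, [("S1", "S1_rep")])  # {"rs2": 1}
```

SNPs returned with a non-zero count correspond to those worth reviewing; SNPs where one replicate has no call are skipped rather than counted.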

18. Quality control of parent-child samples.

Similar to replicated samples, parental information can be entered into the sample sheet when loading data into GenomeStudio or can be specified in GenomeStudio once a project has been created as follows:

From the main window, select “Analysis” > “Edit Parental Relationships” and select the appropriate samples under “Parent 1”, “Parent 2”, and “Child” sub-windows. Click “Add” and then click “OK”.

Discrepancies in parent-child (P-C errors) or parent-parent-child (P-P-C errors) will be indicated by a small “O” and “X” in the “SNP Graph” window, where “O” represents the parent and “X” represents the child (Fig. 12). To QC based on relationship:

b. From the “SNP Table”, remove all previous filters and use the “filter rows” button to filter SNPs with “P-P-C errors = 1” or, if information is available for only one parent, with “P-C errors = 1”.

Relationship errors. This SNP has the parents represented as a yellow “O” and the child represented as a yellow “X”. The SNP has been filtered due to P-P-C issues, as both parents are homozygote (AA) and the child is a heterozygote (AB). Furthermore, the AA cluster has three distinct clusters; the AB cluster consists of 2-3 clusters and the BB cluster is not well defined. Multiple clusters within a cluster can indicate failure in the assay, and as such should be removed.
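The inheritance rule behind P-C and P-P-C errors can be written down compactly. The sketch below is illustrative rather than GenomeStudio's own code: it assumes autosomal SNPs with genotypes in AB notation ("AA", "AB", "BB"), the function names are hypothetical, and no-calls should be excluded before the checks are applied.

```python
from itertools import product


def possible_child_genotypes(parent_1, parent_2):
    """All genotypes a child can inherit: one allele from each parent."""
    return {"".join(sorted(a + b)) for a, b in product(parent_1, parent_2)}


def is_ppc_error(parent_1, parent_2, child):
    """True if the child's genotype cannot arise from the two parental genotypes."""
    return child not in possible_child_genotypes(parent_1, parent_2)


def is_pc_error(parent, child):
    """True if parent and child share no allele (single-parent check)."""
    return not set(parent) & set(child)


# The situation described in the figure: both parents AA, child AB.
figure_case = is_ppc_error("AA", "AA", "AB")  # True
```

With only one parent available, a heterozygous child is always compatible with that parent, so the single-parent check can only detect opposite homozygotes (for example, parent AA and child BB).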

19. Predicting gender for samples.

GenomeStudio uses the X chromosome to estimate the gender of samples. Therefore, the gender of samples should only be predicted after processing the X chromosome:

From the “SNP Table”:
Select “Clear filter”.
Click the “Select all” button to highlight all SNPs, right-click, and select “Update Selected SNP Statistics”.
From the “Samples Table”:
Right-click and select “Recalculate Statistics for all Samples”.
Click the “Select all” button to highlight all samples, right-click, and select “Estimate Gender for selected samples”. When prompted “populate Gender column”, select “No”.
Select the “Column Chooser” icon and show “Gender Est” from Hidden Columns.
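To illustrate why the X chromosome must be cleaned before gender is estimated, the sketch below uses a common heuristic: a sample with a single X chromosome should show almost no heterozygous calls at non-pseudoautosomal X SNPs. GenomeStudio's exact algorithm and cutoffs are not described in this protocol, so the function name and the 0.05 threshold are purely illustrative assumptions.

```python
def estimate_sex_from_x(x_genotypes, max_male_het_rate=0.05):
    """Guess sex from the heterozygosity rate of X-chromosome genotype calls.

    x_genotypes: iterable of calls at non-pseudoautosomal X SNPs; anything
    other than "AA", "AB" or "BB" is treated as a no-call and ignored.
    max_male_het_rate: ILLUSTRATIVE threshold, not a GenomeStudio setting.
    """
    called = [g for g in x_genotypes if g in ("AA", "AB", "BB")]
    if not called:
        return "Unknown"
    het_rate = sum(g == "AB" for g in called) / len(called)
    return "Male" if het_rate <= max_male_het_rate else "Female"


likely_male = estimate_sex_from_x(["AA", "BB", "AA", "NC", "BB"])
likely_female = estimate_sex_from_x(["AA", "AB", "AB", "BB"])
```

Poorly clustered X-chromosome SNPs inflate or deflate the heterozygosity rate, which is why the estimate is only meaningful after the X chromosome has been processed.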

20. Updating statistics.

Following QC, the statistics for samples and SNPs should be updated. This step can be skipped if the gender was predicted in the previous step, as the gender prediction process incorporates updating statistics. To update statistics:

From the “SNP Table”:
Click “Select all”.
Right-click on the highlighted SNPs.
Click “Update Selected SNP Statistics”.
From the “Samples Table”:
Click “Select all”.
Right-click on the highlighted samples.
Click “Recalculate Statistics for all Samples”.

21. Saving cluster positions.

Once a project has been processed, the cluster positions for the SNPs can be exported and used to process the next batch or a new project. This can significantly reduce the time required for QC. However, it is recommended to merge and process batches together, as more samples lead to clearer, better-defined clusters. Nevertheless, exported cluster positions can be used to process different batches or projects that have been genotyped on the same genotyping array (including version number).

To export the cluster positions:

From the main window, click “File” → “Export Cluster Positions” → “For all SNPs”.
Select a location to store the cluster file (.egt) and save.

This file can be used to cluster new data that have been genotyped on the same genotyping array. When creating a new genotyping project, check the “Import cluster positions from a cluster file” box and specify the cluster file (.egt) location when prompted. This is further detailed in this protocol under step 2, “Creating a new genotyping project”.

22. Saving project.

It is strongly advised to save changes to the project while QC is ongoing and after QC has finished.

From the main window, click on “File” → “Save Project Copy As”.
Under “Projects Repository”, browse to the location where you would like to save the data.
Under “Project Name”, create a name for the project. Keeping in line with the naming convention used to create the project, we recommend using [PROJECT_NAME]_[DATE]_02.bsc, where the “02” indicates processed data.
Click “OK”.

23. Creating input file format for COPILOT container.

The COPILOT container will automate a series of bioinformatics analyses that will apply a secondary clustering algorithm to improve data quality, while also identifying potential issues with the data. The following steps will generate the required input format from GenomeStudio:

Remove all filters from the “Samples Table” and the “SNP Table”.
Select the “Full Data Table” tab.
Select “Column Chooser”.
From the “Displayed Columns” window select: Index, Address, Gen Train Score, Frac A, Frac C, Frac G, Frac T.
Click Hide.
From the “Displayed Subcolumns” window select: Score, Theta, R.
Click Hide.
From the “Hidden Subcolumns” window select: X, Y.
Click Show.
Ensure that the columns are in the order illustrated in Table 1.
Click the “Select all” icon to highlight all the SNPs.
Click the “Export displayed data to file” icon.
Browse to a location where you want to save the data, and save as [PROJECT_NAME]_[DATE]_intensity_data.report in tab-delimited format.

24. Creating PLINK data (optional).

Users can generate PLINK-formatted data directly from GenomeStudio and follow guidelines in Coleman et al. (2016) to perform a typical GWAS analysis. However, it is strongly advised to generate the COPILOT output format data as described above and further process the data through the COPILOT container to further increase data quality. Nevertheless, if users would like to generate PLINK-format data at this stage, then:

Download and install the PLINK plugin from the GenomeStudio website (version used in this protocol is v2.1.4) at https://support.illumina.com/downloads/genomestudio-2-0-plug-ins.html.
From the main window, click “Analysis” > “Reports” > “Report Wizard”.
Click “Custom Report” and select “PLINK InputReport 2.1.4 by Illumina, Inc. from Illumina, Inc.”.
Users are then provided with options on how to handle repeated samples (if present), by either selecting the best sample by GC Score or assigning unique IDs to each sample.
If multiple sample groups are available and have been specified in the sample sheet, GenomeStudio will provide an option to export by sample group. Select the group you would like to export and click “Next”.
Next, users are provided with options on how to handle zeroed samples. They can be removed at this stage, or users can have them exported into the PLINK file with a zero call rate. We recommend removing these zeroed samples at this stage to lower the computational burden.
This is followed by options on how to handle zeroed SNPs. As with zeroed samples, we recommend removing these zeroed SNPs to lower the computational burden.
Select the “Output Path”, provide a name in “Report Name”, and then click “Finish”.
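Before moving on, it can be worth checking that the exported PLINK files are internally consistent. The sketch below assumes the plug-in wrote a standard PLINK text fileset, that is, a .map file with one line per SNP and a .ped file with six leading columns followed by two allele columns per SNP for each sample. The function name is hypothetical, and the function takes lines rather than file names so that it can be tried on small in-memory examples.

```python
def check_plink_text_fileset(map_lines, ped_lines):
    """Return (n_snps, n_samples); raise ValueError if .ped and .map disagree."""
    # Each non-empty .map line describes one SNP.
    n_snps = sum(1 for line in map_lines if line.strip())
    # Six leading columns (family, individual, father, mother, sex, phenotype),
    # then two allele columns per SNP.
    expected_fields = 6 + 2 * n_snps
    n_samples = 0
    for line_number, line in enumerate(ped_lines, start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) != expected_fields:
            raise ValueError(
                f"PED line {line_number}: expected {expected_fields} fields, "
                f"found {len(fields)}"
            )
        n_samples += 1
    return n_snps, n_samples


# Made-up two-SNP, one-sample example.
example_map = ["1 rs1 0 1000", "1 rs2 0 2000"]
example_ped = ["F1 S1 0 0 1 -9 A A A G"]
counts = check_plink_text_fileset(example_map, example_ped)  # (2, 1)
```

For real data, the two arguments can be open file handles for the exported .map and .ped files, since the function only iterates over lines.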

Table 1. Input Format Required for COPILOT Container

Name	Chr	Position	Sample 1			Sample 2			…
			G Type	X	Y	G Type	X	Y	…
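The column order in Table 1 can be checked programmatically before the report is passed to the container. The sketch below is a hedged illustration, not part of COPILOT: it assumes that the exported file flattens the two header rows of Table 1 into a single tab-delimited line in which each sample contributes three columns named `<sample>.GType`, `<sample>.X`, and `<sample>.Y`. The function name and the suffix spelling are assumptions and should be compared with the first line of your own export.

```python
def samples_from_report_header(header_line, suffixes=("GType", "X", "Y")):
    """Validate the header layout and return the sample IDs in column order.

    suffixes: ASSUMED per-sample column suffixes; change them if your export
    labels the sub-columns differently.
    """
    columns = header_line.rstrip("\r\n").split("\t")
    if columns[:3] != ["Name", "Chr", "Position"]:
        raise ValueError("first three columns must be Name, Chr, Position")
    per_sample = columns[3:]
    if not per_sample or len(per_sample) % len(suffixes) != 0:
        raise ValueError("expected one GType/X/Y column triplet per sample")
    samples = []
    for start in range(0, len(per_sample), len(suffixes)):
        triplet = per_sample[start:start + len(suffixes)]
        sample_id = triplet[0].rsplit(".", 1)[0]
        expected = [f"{sample_id}.{suffix}" for suffix in suffixes]
        if triplet != expected:
            raise ValueError(f"unexpected columns for sample {sample_id!r}: {triplet}")
        samples.append(sample_id)
    return samples


example_header = "Name\tChr\tPosition\tS1.GType\tS1.X\tS1.Y\tS2.GType\tS2.X\tS2.Y\n"
sample_ids = samples_from_report_header(example_header)  # ["S1", "S2"]
```

A header that still contains hidden columns such as Theta or R, or that lists X and Y in the wrong order, raises an error instead of returning a sample list.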

Basic Protocol 2: COPILOT: A CONTAINERISED WORKFLOW FOR PROCESSING ILLUMINA GENOTYPING DATA

This protocol deploys a containerized workflow (COPILOT) that further processes genotype data that has been pre-processed in GenomeStudio. The COPILOT container starts by performing pre-analysis checks of the input data, calculates basic statistics, and prepares the data for the zCall rare variant–calling algorithm. zCall is an established tool (Goldstein et al., 2012) that attempts to assign genotype calls to SNPs that have been missed by the GenomeStudio GenCall algorithm, which commonly represent rare variants. This process increases the overall sample call rates, which in turn improves data quality. COPILOT then converts the data to PLINK format, uses the manifest file to update all alleles from AB format to the Illumina TOP strand, and performs multiple analyses as recommended in Coleman et al. (2016). These include iteratively removing SNPs and samples to a user-defined threshold, pruning data for linkage disequilibrium (LD), removing regions of high LD and non-autosomal regions, identifying potential sex discrepancies, calculating identity-by-descent (IBD), performing a heterozygosity test, and estimating the ancestry of each sample based on the 1000 Genomes reference panel (Altshuler et al., 2012).
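The idea of iteratively removing SNPs and samples can be sketched as follows. This does not reproduce COPILOT's implementation: the function names, the order of operations (SNPs first, then samples, at each threshold), and the example thresholds are assumptions chosen for illustration. The point it shows is that raising the call-rate threshold in steps avoids discarding good samples because of a few poor SNPs, and vice versa.

```python
def call_rate(flags):
    """Fraction of True values; 0.0 for an empty collection."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def iterative_call_rate_filter(calls, thresholds):
    """Alternately drop low call-rate SNPs and samples at increasing thresholds.

    calls: dict of sample ID -> dict of SNP name -> bool (True if called);
           every sample is assumed to carry an entry for every SNP.
    thresholds: increasing call-rate cutoffs, ending at the desired threshold.
    """
    samples = set(calls)
    snps = set()
    for per_sample in calls.values():
        snps.update(per_sample)
    for threshold in thresholds:
        # Call rates are recomputed on whatever survived the previous round.
        snps = {snp for snp in snps
                if call_rate(calls[s][snp] for s in samples) >= threshold}
        samples = {s for s in samples
                   if call_rate(calls[s][snp] for snp in snps) >= threshold}
    return samples, snps


# Made-up example: SNP "d" is poor in most samples and sample "S4" is poor overall.
example_calls = {
    "S1": {"a": True, "b": True, "c": True, "d": False},
    "S2": {"a": True, "b": True, "c": True, "d": False},
    "S3": {"a": True, "b": True, "c": True, "d": True},
    "S4": {"a": True, "b": False, "c": False, "d": False},
}
kept_samples, kept_snps = iterative_call_rate_filter(example_calls, thresholds=(0.5, 0.75))
```

In this example, samples S1 and S2 have a call rate of exactly 0.75 at the start, but reach 1.0 once the poor SNP "d" has been removed in the first round, so they are retained.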

Genotyping data often contains duplicate sample IDs, which can cause issues during data processing. COPILOT handles duplicate sample IDs by assigning temporary unique sample IDs to duplicate samples, allowing each duplicate sample to be treated independently during the data processing stage. The resulting processed data is provided back to the user in PLINK format, with any duplicate sample IDs reverted to the original ID, but with a record kept for users to re-identify them if required. An outline of the COPILOT process is provided in Figure 13.
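The duplicate-ID handling described above can be pictured with a small sketch. The suffix scheme (`_dup1`, `_dup2`, ...) and the function name are hypothetical; the naming that COPILOT actually uses is not described here. What the sketch shows is the general pattern: make IDs unique for processing, and keep a lookup so that the original IDs can be restored afterwards.

```python
from collections import Counter


def assign_temporary_ids(sample_ids, separator="_dup"):
    """Give duplicated sample IDs unique temporary names and record the mapping.

    Returns (temporary_ids, lookup), where lookup maps every temporary ID back
    to its original ID. The suffix scheme is HYPOTHETICAL, and the sketch does
    not guard against a suffixed name colliding with an existing sample ID.
    """
    totals = Counter(sample_ids)
    seen = Counter()
    temporary_ids = []
    lookup = {}
    for sample_id in sample_ids:
        if totals[sample_id] > 1:
            seen[sample_id] += 1
            temp_id = f"{sample_id}{separator}{seen[sample_id]}"
        else:
            temp_id = sample_id  # unique IDs are left untouched
        temporary_ids.append(temp_id)
        lookup[temp_id] = sample_id
    return temporary_ids, lookup


original_ids = ["S1", "S2", "S1"]
temp_ids, id_lookup = assign_temporary_ids(original_ids)
# temp_ids is ["S1_dup1", "S2", "S1_dup2"]
restored_ids = [id_lookup[temp_id] for temp_id in temp_ids]
```

Keeping the lookup alongside the processed data is what allows each duplicate to be re-identified after the IDs have been reverted.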