Recovering Misidentified Samples Through Genetic Discordance Clustering

Ingrid Kockum, Jesse Huang, Pernilla Stridh

Published: 2024-01-29 DOI: 10.1002/cpz1.972

Abstract

The many logistical and technical challenges associated with sample and data handling in large-scale genotyping studies can increase the risk of sample misidentification, which may compromise subsequent analyses. However, the standard quality assurance methods typical for large genotyping arrays can often be further utilized to identify and recover problematic samples. This article emphasizes the importance of identifying and correcting underlying sample misidentification rather than simply excluding known discrepancies, an approach that may leave related issues undetected. Lastly, we provide a screening protocol to complement standard quality assessments as a guideline for identifying mismatched samples and a tool for assessing the most common causes of sample misidentification. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC.

INTRODUCTION

Genome-wide association studies are crucial for investigating the genetic contribution to differences in disease susceptibilities and developmental traits. The surge of these studies in the last decade has provided insight into the interplay between genetic and environmental factors, allowing for a more comprehensive understanding of many complex diseases (Beck et al., 2020; Visscher et al., 2017). Implementing such studies requires a significant investment of resources, often brought forth through international collaboration and motivated by the need for a higher-resolution landscape of genetic influences. However, the logistical challenges associated with sample and data handling can often accumulate errors that counteract the benefits of establishing larger cohorts (Anderson et al., 2010; Laurie et al., 2010; Turner et al., 2011). Such mistakes can originate from several sources (e.g., sampling, biobank management, plating), and previous studies have indicated that >10% of samples in microarray data may be mislabeled (Broman et al., 2015; Khan et al., 2001; Zhang et al., 2009). Sample misidentification often generates noise, reducing the overall statistical power for identifying genetic associations; however, more systematic misidentifications may produce skewed data representations leading to biased findings (Broman et al., 2015). Studies of quantitative or secondary phenotypes are often more sensitive to detecting such biases than case-control studies due to challenges with normalization. Although the quality control of samples often leads to the exclusion of discrepancies, the sources of many issues relating to sample misidentification may not be thoroughly investigated (Ekstrom & Feenstra, 2012; Laurie et al., 2010) or corrected (Hunter-Zinck et al., 2020). This article highlights the importance of identifying discrepancies and correcting underlying issues rather than filtering out flagged samples.

Standard protocols and tools typically assess the quality of genotype data by both the individuals and markers (Fig. 1; Anderson et al., 2010; Purcell et al., 2007). This is a crucial process for minimizing potential biases and errors that may affect the accuracy and reliability of genetic associations. These standard protocols have been detailed extensively elsewhere (Anderson et al., 2010; Hunter-Zinck et al., 2020; Kockum et al., 2023; Turner et al., 2011); however, we have provided a brief overview of the standard procedures for assessing genotype data and a summary of the overall process in Figure 1.

Figure 1. Flow chart of standard quality assessment pipeline and cluster analysis of systematic misidentifications in genotype data. (A) The flow chart outlines the current standard quality assessment protocol used for genotype data included in genetic association studies (Anderson et al., 2010; https://meyer-lab-cshl.github.io/plinkQC/). The data is assessed for the quality of both markers (yellow) and individuals (blue) to filter out data that could affect the reliability of any genetic association. The advantages of the standard quality assurance protocol are that it removes bias and errors that could influence the genetic association to be tested, it is standardized between studies, and there are tools available that make the process computationally feasible. The disadvantage is that mislabeled samples may not be detected if they are within normal ranges on the quality assessment parameters. (B) Adding cluster analysis of quality flags before standard data analysis can fill this gap. Standard individual-based analytical approaches used for cluster analysis (green) include abnormal heterozygosity rate, discordance between genotype-predicted and reported sex, relatedness between samples, and divergent ancestry. The quality flags are clustered by plate to identify systematic patterns of errors and suggest appropriate corrections. The concordance of individual-based analyses for corrected samples is then tested to determine which misidentified samples can be corrected. (C) The advantages and disadvantages of the different analytical approaches are described.

In short, per-individual quality assessments usually include four steps (Anderson et al., 2010). (1) Discordances between genetic and clinical/self-reported sex are often identified and excluded as cases of sample misidentification. (2) A high rate of missing genotypes or outlying mean heterozygosity rate, which is the proportion of heterozygous genotypes relative to the sample population, likely indicates poor sample quality. Variations in DNA quality can impact the genotype call rate and accuracy. Therefore, individuals with high genotype missingness rate (undetermined genotypes for more than 3% to 7% of all markers) and abnormal heterozygosity should be removed. Heterozygosity rates <23% and >30% indicate inbreeding and contamination, respectively. (3) Relatedness between individuals is often assessed using identity-by-descent (IBD), the proportion of shared alleles among independent SNPs between sample pairs, to identify duplicate samples and related individuals. IBD estimates >0.1875 are often used to identify samples from relatives, and IBD >0.98 indicates duplicates or monozygotic twins. In population-based studies, one individual should be removed from each flagged pair to prevent bias due to over-representation. (4) Divergent ancestry is usually assessed by principal component analysis (PCA) to prevent confounding due to population stratification. Population outliers may be excluded to prevent deviations from the study population (e.g., immigration). Furthermore, such genotypic differences due to population origins between cases and controls may generate increased variability and spurious associations if uncorrected in the analysis (Cardon & Palmer, 2003).
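The per-individual checks above can be sketched as a small flagging routine. This is a minimal illustration rather than the authors' pipeline: the function and parameter names are hypothetical, the missingness cutoff uses the lower end (3%) of the range quoted above, and the heterozygosity and IBD cutoffs are the values stated in the text. Flags are returned rather than used to drop samples, so that they remain available for later clustering by plate.

```python
def flag_sample(reported_sex, genetic_sex, missing_rate, het_rate,
                max_missing=0.03, het_low=0.23, het_high=0.30):
    """Return the list of per-individual QC flags for one sample.

    All argument names are illustrative; rates are fractions (0.03 = 3%).
    """
    flags = []
    if reported_sex != genetic_sex:
        flags.append("sex_discordance")
    if missing_rate > max_missing:
        flags.append("high_missingness")
    if het_rate < het_low:
        flags.append("low_heterozygosity")    # possible inbreeding
    elif het_rate > het_high:
        flags.append("high_heterozygosity")   # possible contamination
    return flags


def classify_pair(ibd, related=0.1875, duplicate=0.98):
    """Classify a sample pair from its IBD estimate using the cutoffs in the text."""
    if ibd > duplicate:
        return "duplicate_or_mz_twin"
    if ibd > related:
        return "related"
    return "unrelated"
```

For example, `flag_sample("M", "F", 0.01, 0.26)` yields a single sex-discordance flag, while `classify_pair(0.5)` labels the pair as related.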

Quality control of the genetic markers is also critical to ensure the overall accuracy of genetic findings. Each marker's quality depends on the reliability of its genotyping call and any potential biases that may be introduced with the assessed phenotypes. Per-marker quality assessments usually include four steps. However, the cutoffs used to identify low-quality markers are less standardized and may differ between studies. (1) Markers with excessive missing genotypes, with a typical call rate threshold of 95%, indicate low genotyping quality and reliability and are often filtered out. A call rate threshold of up to 99% is sometimes used for markers with lower minor allele frequency (MAF, <5%) (Wellcome Trust Case Control Consortium, 2007). (2) Markers with a significant deviation from Hardy-Weinberg equilibrium (HWE) may suffer from errors in genotype calling and should be excluded. However, deviation from HWE can also indicate selection pressure and could therefore be due to the assessed phenotype; for this reason, only controls should be used to assess marker quality. (3) Discrepancies in the call rate between comparison groups (e.g., case/control), which may be due to inadequate sample randomization, are assessed to prevent the introduction of bias associated with differences in genotyping quality. (4) Markers are also filtered to retain those with a MAF of at least 2%, as low-frequency alleles are more difficult to call accurately.
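As a rough sketch of steps (1), (2), and (4), the per-marker metrics can be computed from genotype counts. The function names are hypothetical, the counts are assumed to come from control samples (so that the HWE test follows the recommendation above), and the HWE p-value cutoff of 1e-6 is an assumption, since the text does not give one. The HWE check is a simple one-degree-of-freedom chi-square test; an exact test may be preferable for small counts.

```python
import math


def marker_metrics(n_aa, n_ab, n_bb, n_missing):
    """Return (call_rate, maf, hwe_p) for one biallelic marker.

    Inputs are genotype counts (AA, AB, BB) and the number of missing calls.
    """
    called = n_aa + n_ab + n_bb
    if called == 0:
        return 0.0, 0.0, float("nan")
    call_rate = called / (called + n_missing)
    freq_a = (2 * n_aa + n_ab) / (2 * called)
    maf = min(freq_a, 1 - freq_a)
    # Expected genotype counts under Hardy-Weinberg equilibrium.
    expected = (called * freq_a ** 2,
                2 * called * freq_a * (1 - freq_a),
                called * (1 - freq_a) ** 2)
    chi2 = sum((o - e) ** 2 / e
               for o, e in zip((n_aa, n_ab, n_bb), expected) if e > 0)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    hwe_p = math.erfc(math.sqrt(chi2 / 2))
    return call_rate, maf, hwe_p


def passes_marker_qc(n_aa, n_ab, n_bb, n_missing,
                     min_call_rate=0.95, min_maf=0.02, min_hwe_p=1e-6):
    """Apply call-rate, MAF, and HWE filters; min_hwe_p is an assumed cutoff."""
    call_rate, maf, hwe_p = marker_metrics(n_aa, n_ab, n_bb, n_missing)
    return call_rate >= min_call_rate and maf >= min_maf and hwe_p >= min_hwe_p
```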

It is common practice to flag and exclude problematic samples before the analysis, and additional protocols to remove misidentified samples have been proposed (Hunter-Zinck et al., 2020). However, samples flagged by standard quality assessments often represent systematic mistakes affecting other samples that may have gone undetected (Table 1, Fig. 2). Recognizing patterns among misidentified samples during the quality control process may not only salvage excluded samples but can potentially identify and correct additional undetected issues.

Table 1. Potential Problems That Can Result in Systematic Misidentification of Samples

Error	Issue	Result	Action
Contamination	Samples are contaminated with DNA from another sample or sample handler	High heterozygosity, genotype errors	Contaminated samples should be excluded
Duplication	Samples are unintentionally placed twice either by replacing another sample or shifting remaining samples sequentially	Bias by overrepresentation, breach of genotype-phenotype relationship	Only one of any replicated sample should be included; shifted samples should be corrected and included
Shifts	Samples are shifted to a neighboring location due to either an unintentional overlap, skip in position, duplication, or deletion	Misidentification leading to a breach of genotype-phenotype relationship	Potential off-by-one and off-by-two pipetting errors should be investigated and corrected
Flips	Sample rows and columns are unintentionally flipped due to plating, pipetting, or digital errors	Sample misidentification; only samples on two opposing corners (first and last sample placed) will likely remain correct	Row-column flips should be investigated and corrected; blank wells and control samples are helpful
Plate orientation	The plate is misorientated (e.g., 180°) resulting in samples being mirrored	None of the samples on the plate will be correctly labeled	90°, 180°, or 270° rotations to correct orientation of samples
Plate swaps	Two plates are accidentally swapped resulting in sample identity swaps but intact plate position	None of the samples on the plate will be correctly labeled	Concordance tests can identify correct plate identity and sample ID should be corrected and included

Figure 2. Sample errors clustered by plate to detect misidentified samples. Examples of plates that illustrate common errors resulting in misidentification. Compared to the original plate (A), duplicate samples (purple) can cause off-by-one shifts (B), resulting in sample misidentification for part of the plate [(G), black circles]. Flipped columns and rows can originate from pipetting errors in the lab (C) or sorting/reshuffling within the data files (D), resulting in incorrect positioning of samples (H) or identities (I), respectively. The final example illustrates a 180° plate rotation (E), causing all samples to be repositioned (J). The misidentified samples detected by conventional QC methods in these examples range from 23% to 34% (F). Samples are colored according to reported sex (pink = female, blue = male). The samples that remain correct after the error occurs are labeled in green; misidentified samples are indicated with black circles; and samples that would be flagged in conventional genetic QC are labeled in red. Sections of rows/columns that are flipped are labeled in gray for original positions and black for new positions.

CONSIDERATIONS FOR GENETIC QUALITY ASSESSMENT

Cluster analyses of flagged problems are a complementary method that can improve current quality assessment protocols to identify patterns and rescue problematic samples (Fig. 1).

Clustering patterns of sex discordance

Sex discordance is an effective method for detecting significant systematic issues due to its reliability and frequent availability. Clustering sex discordance by sample handling parameters (e.g., pipetting, plating, batch, sample handler, analytical order) can identify the sources of possible mistakes along with patterns that may assist in deducing the cause of observed discrepancies. For example, we re-examined the genotype quality assurance procedures for a Swedish case-control cohort of multiple sclerosis patients and population-matched controls (International Multiple Sclerosis Genetics Consortium, 2019). By comparing genotype-predicted sex with sex derived from government-issued personal identification numbers, we could confirm sex discordances previously removed as part of the standard protocol. However, further classification by plate showed a clustering of 32 discrepancies within a single plate. Examining the patterns revealed a flip in the sample layout between columns and rows, leading to systematic misidentification for this plate. The sample identities were then corrected and confirmed using an independent genetic dataset (Olafsson et al., 2017). It is worth noting that only a third of the incorrect samples were initially identified and removed by standard sex concordance analysis, so the remaining samples would likely have introduced noise into any subsequent association analyses (Fig. 2).
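Grouping sex discordances by plate, as in the example above, can be sketched in a few lines. The record layout `(plate_id, reported_sex, genetic_sex)` and the 10% screening rate are illustrative assumptions, not values from the study; any threshold would need to be chosen against the background discordance rate of the cohort.

```python
from collections import Counter


def discordance_by_plate(records):
    """Summarize sex discordance per plate.

    records: iterable of (plate_id, reported_sex, genetic_sex) tuples.
    Returns {plate_id: (n_discordant, n_total, rate)}.
    """
    total = Counter()
    discordant = Counter()
    for plate, reported, genetic in records:
        total[plate] += 1
        if reported != genetic:
            discordant[plate] += 1
    return {p: (discordant[p], total[p], discordant[p] / total[p]) for p in total}


def suspicious_plates(records, min_rate=0.10):
    """List plates whose discordance rate reaches min_rate (assumed cutoff)."""
    stats = discordance_by_plate(records)
    return sorted(p for p, (_, _, rate) in stats.items() if rate >= min_rate)
```

Plates returned by `suspicious_plates` are candidates for a closer look at the layout, not confirmed errors; the pattern of discordant wells within the plate is what points to a shift, flip, or rotation.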

Systematic approaches for identifying sample/data mishandling

To facilitate similar analyses, we provide a supplementary Microsoft Excel-based tool where users can assess phenotype and genotype-derived data in a 96-well plate format (Samples Verification and Quality Control Tool, SVAQC; Huang et al., 2021). Data suspected of containing misidentified samples can be analyzed and corrected for a wide range of common sample and data handling mistakes. The concordance rate following each correction may be cross-compared by plate or sample layouts, along with the matching sample identity for each genotype. As illustrated in the example, the sex concordance rate within a plate can help determine systematic errors. However, the observed rate will depend on the type of mistake and the proportion of males and females on the plate. If we assume samples are randomly distributed and have an equal chance of mismatch, the predicted sex concordance rate (Rsc), given the number (n_M/n_F) or proportion (p_M/p_F) of males and females, respectively, on the plate would be:

$$R_{sc} = \frac{(p_M \times n_M) + (p_F \times n_F)}{n_M + n_F} \tag{1}$$

$$R_{sc} = p_M^2 + p_F^2 = p_M^2 + (1 - p_M)^2 = p_F^2 + (1 - p_F)^2 \tag{2}$$

This relationship is illustrated in Figure 3.

Figure 3. Predicted sex concordance rate. Heatmap illustrating distribution density of sex concordance rate based on the percentage of male/female in the sample population. Distribution was determined by simulated sex shuffling of 92 samples (p = 1000) for each percentage possibility. Solid and dashed lines illustrate the median and 95% confidence interval, respectively.
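The predicted sex concordance rate and the shuffling simulation described for Figure 3 can be reproduced approximately with the sketch below. The function names and the fixed random seed are assumptions for illustration, and the simulation returns only the mean rate rather than the full distribution shown in the heatmap.

```python
import random


def expected_sex_concordance(n_male, n_female):
    """Predicted concordance under random mismatch: Rsc = p_M^2 + p_F^2."""
    n = n_male + n_female
    p_m, p_f = n_male / n, n_female / n
    return p_m ** 2 + p_f ** 2


def simulate_sex_concordance(n_male, n_female, n_perm=1000, seed=0):
    """Mean concordance between the reported sexes and a random reshuffle."""
    rng = random.Random(seed)
    sexes = ["M"] * n_male + ["F"] * n_female
    rates = []
    for _ in range(n_perm):
        shuffled = sexes[:]
        rng.shuffle(shuffled)
        matches = sum(a == b for a, b in zip(sexes, shuffled))
        rates.append(matches / len(sexes))
    return sum(rates) / len(rates)
```

With 46 males and 46 females the predicted rate is 0.5, so roughly half of the samples on a fully scrambled plate would still pass a sex check. With 23 males and 69 females the predicted rate rises to 0.625, which illustrates why plates with an unbalanced sex ratio hide more misidentified samples.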

However, inconsistencies may occur in clinical or self-reported sex due to clerical errors. In self-reported sex, inconsistencies may also be due to miscommunication, e.g., reporting gender rather than biological sex, or related to privacy concerns.

Cross-validating sample identity in genetic data

Other phenotypes derived from genetic data, such as ethnicity, may also be useful for determining systematic errors. Genotypes projected onto a known reference, such as the 1000 Genomes Project, can be used to estimate ancestry (Prive et al., 2020). A set of markers can be retyped to confirm the identity of the samples, so-called fingerprinting (Kofanova et al., 2014). The number of SNPs required for fingerprinting depends on desired accuracy and the allele frequency of each SNP. As evidenced in Figure 4, we recommend three to seven SNPs not in linkage disequilibrium and with a minor allele frequency >30%.

Figure 4. Predicted genotype discordance rate by number of SNPs. Line plot illustrating the average genotype discordance rate based on the number of compared SNPs and the minor allele frequency for each SNP. Distribution was determined by simulation (1000 samples, p = 100, assuming Hardy-Weinberg equilibrium, no linkage disequilibrium between SNPs).
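The dependence of fingerprinting accuracy on SNP number and allele frequency can also be worked out analytically under the same assumptions as the simulation (Hardy-Weinberg equilibrium, no linkage disequilibrium). The sketch below computes the probability that two unrelated individuals share identical genotypes at every fingerprinting SNP by chance. This quantity is a simplification chosen for illustration and is not necessarily the statistic plotted in Figure 4; the function names are hypothetical, and all SNPs are assumed to share the same minor allele frequency.

```python
def genotype_frequencies(maf):
    """Genotype frequencies (major hom, het, minor hom) under HWE."""
    p = 1.0 - maf
    return (p * p, 2 * p * maf, maf * maf)


def chance_match_probability(maf, n_snps):
    """Probability that two unrelated samples match at all n_snps independent SNPs."""
    per_snp = sum(f * f for f in genotype_frequencies(maf))
    return per_snp ** n_snps
```

At a minor allele frequency of 0.5, a single SNP matches by chance with probability 0.375, and the probability falls multiplicatively with each additional independent SNP. SNPs with lower minor allele frequency are less informative because most individuals share the common homozygous genotype.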

Another approach for identifying major sample misidentifications by plate is to construct polygenic risk scores for known traits, such as the primary disease of study, and analyze the correlation between risk scores and the corresponding phenotype within each plate. Although the risk score is not predictive on an individual level for complex traits, a degree of correlation is expected between the risk score and phenotype, a lack of which may indicate broad sample misidentification within the plate. Similarly, the heritability of known traits can be estimated from the genotype data set and compared to expected estimates. A deviation in estimated heritability would suggest the presence of sample misidentification and the magnitude of the error.

Identifying mishandling with sample contamination

An excessive heterozygosity rate may indicate contamination resulting from improper sample handling or overlapping samples during plating. One may identify contaminated plates or systematic errors by assessing for high heterozygosity rates with sample handling parameters. For example, dense clustering of samples with a high heterozygosity rate and inconsistencies between samples and genotypes, such as missing data or empty wells, may indicate shifts during sample plating. In such cases, the identities of affected samples may be predicted and corrected during data preprocessing.

STUDY DESIGN AND ANALYTICAL CONSIDERATIONS

Data sets with corrected sample identities should be tested for concordance with independent parameters. Overlapping markers from an independent genotype array would provide the best check for individual consistency, although such data are not commonly available. Furthermore, matching individuals based on genotype, even among those that should not overlap, may help identify potential duplication and sample misplacement. The genotype concordance rate between non-duplicated individuals is typically <70%. In comparison, the same individuals or monozygotic twins tend to have >98% observed genotype concordance, with few cases falling between these values. Therefore, even without a validation dataset, assessing the presence of intentional and unintentional duplicates (IBS ∼1) can help assess certain plating issues.
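A pairwise genotype concordance screen along these lines might look as follows. It is a minimal sketch with hypothetical function names: genotypes are assumed to be coded as 0/1/2 minor-allele counts with `None` for missing calls, and the 98% cutoff is the value quoted above. An all-pairs comparison scales quadratically with sample number, so dedicated tools are more practical for large cohorts.

```python
from itertools import combinations


def concordance(g1, g2):
    """Fraction of markers with identical calls, ignoring missing (None) calls."""
    compared = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    if not compared:
        return float("nan")
    return sum(a == b for a, b in compared) / len(compared)


def likely_duplicates(genotypes, threshold=0.98):
    """Return (id1, id2, rate) for sample pairs with concordance above threshold.

    genotypes: dict mapping sample ID to a list of genotype calls.
    """
    pairs = []
    for (id1, g1), (id2, g2) in combinations(sorted(genotypes.items()), 2):
        rate = concordance(g1, g2)
        if rate > threshold:
            pairs.append((id1, id2, rate))
    return pairs
```

Pairs reported here include intentional replicates, so the output should be compared against the list of planned control duplicates before any pair is treated as evidence of a plating error.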

Defensive practices and preventative measures can limit mistakes and allow for easier detection of errors. A simple preventive measure is placing blanks in the last two wells of each plate. The asymmetric placement can be useful for identifying systematic mistakes in sample ordering or plate misorientation, including the previous example. Another strategy is to repeat samples between plates as a control measure (Hunter-Zinck et al., 2020), facilitating the correction of misidentified samples due to plate layout errors, although this may be cost-prohibitive. Automated pipetting may provide an appropriate solution to reduce the chance of human error; however, complete automation may not be feasible depending on the situation, as reviewed by Gut (2001). The effort to salvage sample errors can be time-consuming and limited by the data and resources available. However, the further assessment of misidentification patterns and clusters utilizes already generated flags to expose systematic errors that would have remained in the data after standard protocols. As a guide for correcting sample misidentification in future studies, we provide a checklist (Table 2), which accompanies the supplementary tool (SVAQC), for different approaches to assessing the most common causes of misidentification.

Table 2. Checklist for Troubleshooting

Parameters	Data	Analysis	Application to data/samples
QC	Sex discordance	Cluster sex discordance by plate to identify a systematic pattern, e.g., off-by-one shifts, flips, or other sources of sample misidentification (Fig. 1); identify the breakpoints or buffer zones to determine which samples are affected and which can be corrected	Assign the corrected identity to the sample(s) and flag/exclude samples in buffer zones; compare the plate layouts of original and corrected sample identity to understand the source of misidentification (i.e., pipetting errors, digital flips in plate layout, etc.); test additional parameters for concordance in the corrected sample plates to confirm the corrections and fine map the breakpoints in buffer zones
	Heterozygosity rate	Cluster heterozygosity outliers by plate to identify potentially contaminated samples as breakpoints for systematic shifts	If no pattern exists, the source of contamination is most likely a lab or pipetting error that only affects the sample, which can be excluded; if discordances follow a contaminated sample, it may be a breakpoint for a shift
	Sample relatedness (IBS)	Cluster samples showing high IBS scores by plate to identify unintentional duplications; additionally, note control samples with low relatedness to identify patterns	If duplicated samples are located on the same plate, they may represent a breakpoint for errors; blocks of duplicated samples can reveal sources of errors, e.g., swapped plates
	Principal component analysis (PCA)	Cluster discordance between 1000 Genomes Project estimated ethnicity and reported ethnicity by plate to identify a systematic pattern, such as off-by-one shifts, flips, or other sources of sample misidentification (Fig. 1); identify the breakpoints or buffer zones to determine which samples are affected and which can be corrected	Assign a corrected identity to the sample(s) and flag/exclude samples in buffer zones; compare the plate layouts of original and corrected sample identity to understand the source of misidentification (i.e., pipetting errors, digital flips in plate layout, etc.); test additional parameters, e.g., sex, for concordance in the corrected sample plates to confirm the corrections and fine map the breakpoints in buffer zones
Alternative	Sample confirmation by IBS	If data from an independent genotype array is available, calculate relatedness between samples from the same individual to confirm the identity or flag a discrepancy	High concordance between samples on two independent arrays can resolve concerns regarding identity; this can also confirm previous corrections; conversely, low relatedness between samples can flag misidentification, and identify an alternative sample as the correct match
	Alternative phenotypes	Compare genotype-derived and actual phenotype measures to confirm sample identities; the phenotypes must have a clear genetic component, e.g., ethnicity or ABO blood type	Cluster genotype-phenotype discordances by plate to reveal patterns of inconsistency
	Digital flips and rotations	Assess if the pattern of errors is due to data file mishandling (sorting, A > Z, etc.) or potential misorientation of the plate when pipetting	Check concordance rate by plate and sample sheet layout and observe if the rate improves with flipping or rotating systematically
	Blank wells	Check positions of blank wells in genotype data to confirm correct layout	Unexpected blanks may indicate shifts or errors; blanks can be used as anchors to correctly reposition sample layout
	Fingerprinting	A set of markers can be re-genotyped to confirm the identity of all or some of the samples	Samples with confirmed identities can then be used to reassess the accuracy of the plate layout and test concordance with other parameters, e.g., sex
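The "digital flips and rotations" check in Table 2 can be sketched by applying candidate layout transformations to the sample sheet and comparing the concordance rate of each against the genotype-derived data. The code below is an illustrative sketch, not the SVAQC tool: it assumes a plate represented as a list of rows (8 × 12 for a 96-well plate), all function names are hypothetical, and only three of the error types in Table 1 are modeled.

```python
def rotate_180(grid):
    """Plate turned by 180 degrees."""
    return [row[::-1] for row in grid[::-1]]


def fill_by_column(grid):
    """Samples listed row by row were instead placed column by column."""
    n_rows, n_cols = len(grid), len(grid[0])
    flat = [x for row in grid for x in row]
    return [[flat[c * n_rows + r] for c in range(n_cols)] for r in range(n_rows)]


def shift_by_one(grid):
    """Every sample moved one well forward in row order; first well left empty."""
    n_cols = len(grid[0])
    flat = [None] + [x for row in grid for x in row][:-1]
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]


def concordance_rate(expected, observed):
    """Fraction of non-empty wells where expected and observed labels agree."""
    pairs = [(e, o)
             for e_row, o_row in zip(expected, observed)
             for e, o in zip(e_row, o_row)
             if e is not None and o is not None]
    return sum(e == o for e, o in pairs) / len(pairs) if pairs else 0.0


TRANSFORMS = {
    "as_recorded": lambda g: g,
    "rotate_180": rotate_180,
    "fill_by_column": fill_by_column,
    "shift_by_one": shift_by_one,
}


def rank_transformations(sheet, genetic):
    """Score each candidate layout of the sample sheet against genotype-derived labels."""
    scores = {name: concordance_rate(fn(sheet), genetic)
              for name, fn in TRANSFORMS.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Here `sheet` holds the reported label (e.g., sex) for each well and `genetic` the genotype-predicted label for the same wells. A transformation that raises the concordance rate well above the rate predicted for random mismatch is a candidate explanation, which should then be confirmed with an independent parameter as described in the checklist.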

CONCLUDING REMARKS

In conclusion, poor quality of genotyping data can increase the risk of producing spurious genetic associations in exploratory studies, which may be erroneously pursued at the cost of both time and resources. Although standard quality control protocols can identify certain discrepancies, they may often overlook many affected samples. Therefore, along with establishing preventative measures, it is important to actively identify sources of sample misidentification and contamination to ensure the overall quality and reliability of findings.

AUTHOR CONTRIBUTIONS

Jesse Huang: Conceptualization; data curation; formal analysis; methodology; validation; visualization; writing – original draft; writing – review and editing. Ingrid Kockum: Conceptualization; data curation; funding acquisition; investigation; project administration; resources; writing – original draft; writing – review and editing. Pernilla Stridh: Conceptualization; data curation; formal analysis; methodology; validation; visualization; writing – original draft; writing – review and editing.

ACKNOWLEDGMENTS

This work was supported in part by Margaretha af Ugglas Foundation, MS Research Fund, the Neuro Foundation, an endMS Doctoral Studentship (EGID:3045) from the Multiple Sclerosis Society of Canada (to J.H.), and Horizon 2020 MultipleMS grant number 733161 (to I.K.).

CONFLICT OF INTEREST

The authors declare no competing interests.

Open Research

DATA AVAILABILITY STATEMENT

The referenced tool has been made publicly available through a repository (https://doi.org/10.17605/OSF.IO/5KPUC; Huang et al., 2021).

LITERATURE CITED

Anderson, C. A., Pettersson, F. H., Clarke, G. M., Cardon, L. R., Morris, A. P., & Zondervan, K. T. (2010). Data quality control in genetic case-control association studies. Nature Protocols, 5(9), 1564–1573. https://doi.org/10.1038/nprot.2010.116
Beck, T., Shorter, T., & Brookes, A. J. (2020). GWAS Central: A comprehensive resource for the discovery and comparison of genotype and phenotype data from genome-wide association studies. Nucleic Acids Research, 48(D1), D933–D940. https://doi.org/10.1093/nar/gkz895
Broman, K. W., Keller, M. P., Broman, A. T., Kendziorski, C., Yandell, B. S., Sen, S., & Attie, A. D. (2015). Identification and correction of sample mix-ups in expression genetic data: A case study. G3 Genes|Genomes|Genetics, 5(10), 2177–2186. https://doi.org/10.1534/g3.115.019778
Cardon, L. R., & Palmer, L. J. (2003). Population stratification and spurious allelic association. Lancet, 361(9357), 598–604. https://doi.org/10.1016/S0140-6736(03)12520-2
Ekstrom, C. T., & Feenstra, B. (2012). Detecting sample misidentifications in genetic association studies. Statistical Applications in Genetics and Molecular Biology, 11(3), 13. https://doi.org/10.1515/1544-6115.1772
Gut, I. G. (2001). Automation in genotyping of single nucleotide polymorphisms. Human Mutation, 17(6), 475–492. https://doi.org/10.1002/humu.1131
Huang, J., Kockum, I., & Stridh, P. (2021). Sample Verification and Quality Check Tool (SVAQC Tool). https://doi.org/10.17605/OSF.IO/5KPUC
Hunter-Zinck, H., Shi, Y., Li, M., Gorman, B. R., Ji, S. G., Sun, N., Webster, T., Liem, A., Hsieh, P., Devineni, P., Karnam, P., Gong, X., Radhakrishnan, L., Schmidt, J., Assimes, T. L., Huang, J., Pan, C., Humphries, D., Brophy, M., … Pyarajan, S. (2020). Genotyping array design and data quality control in the million veteran program. American Journal of Human Genetics, 106(4), 535–548. https://doi.org/10.1016/j.ajhg.2020.03.004
International Multiple Sclerosis Genetics Consortium. (2019). Multiple sclerosis genomic map implicates peripheral immune cells and microglia in susceptibility. Science, 365(6460), eaav7188. https://doi.org/10.1126/science.aav7188
Khan, J., Wei, J. S., Ringnér, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C., & Meltzer, P. S. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7(6), 673–679. https://doi.org/10.1038/89044
Kockum, I., Huang, J., & Stridh, P. (2023). Overview of genotyping technologies and methods. Current Protocols, 3, e727. https://doi.org/10.1002/cpz1.727
Kofanova, O. A., Mathieson, W., Thomas, G. A., & Betsou, F. (2014). DNA fingerprinting: A quality control case study for human biospecimen authentication. Biopreservation and Biobanking, 12(2), 151–153. https://doi.org/10.1089/bio.2013.0063
Laurie, C. C., Doheny, K. F., Mirel, D. B., Pugh, E. W., Bierut, L. J., Bhangale, T., Boehm, F., Caporaso, N. E., Cornelis, M. C., Edenberg, H. J., Gabriel, S. B., Harris, E. L., Hu, F. B., Jacobs, K. B., Kraft, P., Landi, M. T., Lumley, T., Manolio, T. A., McHugh, C., … GENEVA Investigators. (2010). Quality control and quality assurance in genotypic data for genome-wide association studies. Genetic Epidemiology, 34(6), 591–602. https://doi.org/10.1002/gepi.20516
Olafsson, S., Stridh, P., Bos, S. D., Ingason, A., Euesden, J., Sulem, P., Thorleifsson, G., Gustafsson, O., Johannesson, A., Geirsson, A. J., Thorsson, A. V., Sigurgeirsson, B., Ludviksson, B. R., Olafsson, E., Kristjansdottir, H., Jonasson, J. G., Olafsson, J. H., Orvar, K. B., Benediktsson, R., … Stefansson, K. (2017). Fourteen sequence variants that associate with multiple sclerosis discovered by meta-analysis informed by genetic correlations. NPJ Genomic Medicine, 2, 24. https://doi.org/10.1038/s41525-017-0027-2
Prive, F., Luu, K., Blum, M. G. B., McGrath, J. J., & Vilhjalmsson, B. J. (2020). Efficient toolkit implementing best practices for principal component analysis of population genetic data. Bioinformatics, 36(16), 4449–4457. https://doi.org/10.1093/bioinformatics/btaa520
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., Maller, J., Sklar, P., de Bakker, P. I., Daly, M. J., & Sham, P. C. (2007). PLINK: A tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics, 81(3), 559–575. https://doi.org/10.1086/519795
Turner, S., Armstrong, L. L., Bradford, Y., Carlson, C. S., Crawford, D. C., Crenshaw, A. T., de Andrade, M., Doheny, K. F., Haines, J. L., Hayes, G., Jarvik, G., Jiang, L., Kullo, I. J., Li, R., Ling, H., Manolio, T. A., Matsumoto, M., McCarty, C. A., McDavid, A. N., … Ritchie, M. D. (2011). Quality control procedures for genome-wide association studies. Current Protocols in Human Genetics, 68, 1.19.1–1.19.18. https://doi.org/10.1002/0471142905.hg0119s68
Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., Brown, M. A., & Yang, J. (2017). 10 Years of GWAS Discovery: Biology, Function, and Translation. American Journal of Human Genetics, 101(1), 5–22. https://doi.org/10.1016/j.ajhg.2017.06.005
Wellcome Trust Case Control Consortium. (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145), 661–678. https://doi.org/10.1038/nature05911
Zhang, C., Wu, C., Blanzieri, E., Zhou, Y., Wang, Y., Du, W., & Liang, Y. (2009). Methods for labeling error detection in microarrays based on the effect of data perturbation on the regression model. Bioinformatics, 25(20), 2708–2714. https://doi.org/10.1093/bioinformatics/btp478

Preparation of selective organ-targeting (SORT) lipid nanoparticles (LNPs) using multiple technical methods for tissue-specific mRNA delivery

Cytosine and adenosine base editing in human pluripotent stem cells using transient reporters for editing enrichment

Directed differentiation of human pluripotent stem cells into diverse organ-specific mesenchyme of the digestive and respiratory systems

MOF-derived nanoporous carbons with diverse tunable nanoarchitectures

查看全部

Sections

Figures

References

Abstract
INTRODUCTION
CONSIDERATIONS FOR GENETIC QUALITY ASSESSMENT
STUDY DESIGN AND ANALYTICAL CONSIDERATIONS
CONCLUDING REMARKS
Open Research
LITERATURE CITED

Figure 1
Flow chart of standard quality assessment pipeline and cluster analysis of systematic misidentifications in genotype data. (A) The flow chart outlines the current standard quality assessment protocol used for genotype data included in genetic association studies (Anderson et al., 2010; https://meyer-lab-cshl.github.io/plinkQC/). The data is assessed for the quality of both markers (yellow) and individuals (blue) to filter out data that could affect the reliability of any genetic association. The advantages of the standard quality assurance protocol are that it removes bias and errors that could influence the genetic association to be tested, it is standardized between studies, and there are tools available that make the process computationally feasible. The disadvantage is that mislabeled samples may not be detected if they are within normal ranges on the quality assessment parameters. (B) Adding cluster analysis of quality flags before standard data analysis can fill this gap. Standard individual-based analytical approaches used for cluster analysis (green) include abnormal heterozygosity rate, discordance between genotype-predicted and reported sex, relatedness between samples, and divergent ancestry. The quality flags are clustered by plate to identify systematic patterns of errors and suggest appropriate corrections. The concordance of individual-based analyses for corrected samples is then tested to determine which misidentified samples can be corrected. (C) The advantages and disadvantages of the different analytical approaches are described.
Figure 2
Sample errors clustered by plate to detect misidentified samples. Examples of plates that illustrate common errors resulting in misidentification. Compared to the original plate (A), duplicate samples (purple) can cause off-by-one shifts (B), resulting in sample misidentification for part of the plate (G, black circles). Flipped columns and rows can originate from pipetting errors in the lab (C) or sorting/reshuffling within the data files (D), resulting in incorrect positioning of samples (H) or identities (I), respectively. The final example illustrates a 180° plate rotation (E), causing all samples to be repositioned (J). The proportion of misidentified samples detected by conventional QC methods in these examples ranges from 23% to 34% (F). Samples are colored according to reported sex (pink = female, blue = male). The samples that remain correct after the error occurs are labeled in green; misidentified samples are indicated with black circles; and samples that would be flagged in conventional genetic QC are labeled in red. Sections of rows/columns that are flipped are labeled in gray for original positions and black for new positions.
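The error types in this figure can be sketched as index operations on a plate map. The example below assumes an 8 x 12 (96-well) layout filled row by row, which the caption does not state, and uses randomly assigned sex; all function names are hypothetical. It is an illustration of the error patterns, not a reproduction of the figure.

```python
import numpy as np

ROWS, COLS = 8, 12  # assumed 96-well layout; not specified in the caption


def original_plate():
    """Sample IDs 0..95 laid out row by row (the recorded plate map)."""
    return np.arange(ROWS * COLS).reshape(ROWS, COLS)


def duplicate_shift(plate, k):
    """Sample at flat position k is loaded twice; later samples move one well on."""
    flat = plate.ravel()
    shifted = np.concatenate([flat[: k + 1], flat[k:-1]])
    return shifted.reshape(plate.shape)


def flip_columns(plate, first, last):
    """Reverse the order of columns first..last (inclusive)."""
    out = plate.copy()
    out[:, first : last + 1] = plate[:, first : last + 1][:, ::-1]
    return out


def rotate_180(plate):
    """Rotated plate: well (r, c) holds the sample meant for (ROWS-1-r, COLS-1-c)."""
    return plate[::-1, ::-1].copy()


def misidentified_fraction(recorded, actual):
    """Fraction of wells whose actual sample differs from the recorded one."""
    return float(np.mean(recorded != actual))


def sex_flag_fraction(recorded, actual, sex):
    """Fraction of wells where reported sex differs from the sex of the actual sample."""
    return float(np.mean(sex[recorded] != sex[actual]))


rng = np.random.default_rng(0)
sex = rng.integers(0, 2, size=ROWS * COLS)  # 0 = female, 1 = male (toy data)
recorded = original_plate()
for name, actual in [
    ("duplicate shift", duplicate_shift(recorded, 47)),
    ("flipped columns", flip_columns(recorded, 2, 5)),
    ("180 degree rotation", rotate_180(recorded)),
]:
    print(name, misidentified_fraction(recorded, actual),
          sex_flag_fraction(recorded, actual, sex))
```

In this toy setup the sex check alone flags only a subset of the misidentified wells, which is the gap that plate-level pattern analysis is meant to close.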
Figure 3
Predicted sex concordance rate. Heatmap illustrating the distribution density of the sex concordance rate as a function of the percentage of males and females in the sample population. The distribution was determined by simulated sex shuffling of 92 samples (p = 1000) for each possible percentage. The solid and dashed lines indicate the median and 95% confidence interval, respectively.
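A shuffling simulation of this kind can be sketched as follows. The sketch assumes that "p = 1000" refers to 1000 random shuffles per sex ratio and uses the 2.5th and 97.5th percentiles of the simulated rates as the interval; both are assumptions, and the function name is hypothetical.

```python
import numpy as np


def simulate_sex_concordance(frac_male, n_samples=92, n_shuffles=1000, seed=0):
    """Median and 2.5/97.5 percentiles of sex concordance after random shuffling."""
    rng = np.random.default_rng(seed)
    n_male = int(round(frac_male * n_samples))
    reported = np.array([1] * n_male + [0] * (n_samples - n_male))
    rates = np.empty(n_shuffles)
    for i in range(n_shuffles):
        shuffled = rng.permutation(reported)  # random reassignment of samples
        rates[i] = np.mean(shuffled == reported)
    lo, median, hi = np.percentile(rates, [2.5, 50, 97.5])
    return float(median), float(lo), float(hi)


for frac in (0.1, 0.3, 0.5):
    print(frac, simulate_sex_concordance(frac))
```

For a male fraction m, the expected concordance under random shuffling is m² + (1 − m)², so the sex check is least informative when one sex dominates the plate and most informative near a 50/50 split.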
Figure 4
Predicted genotype discordance rate by number of SNPs. Line plot illustrating the average genotype discordance rate based on the number of compared SNPs and the minor allele frequency of each SNP. The distribution was determined by simulation (1000 samples, p = 100, assuming Hardy-Weinberg equilibrium and no linkage disequilibrium between SNPs).
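The following sketch shows one way such a simulation could be set up under the stated assumptions (Hardy-Weinberg equilibrium, independent SNPs). The caption does not define "p = 100" or the exact discordance measure, so the sketch reports two quantities as an interpretation only: the mean per-SNP discordance between mismatched samples and the fraction of mismatched pairs that differ at one or more of the compared SNPs. Function names are hypothetical.

```python
import numpy as np


def genotype_probs(maf):
    """Hardy-Weinberg genotype frequencies for 0, 1, 2 copies of the minor allele."""
    return np.array([(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2])


def per_snp_discordance(maf):
    """Analytic probability that two unrelated samples differ at one SNP."""
    p = genotype_probs(maf)
    return float(1.0 - np.sum(p ** 2))


def simulate_pair_discordance(maf, n_snps, n_samples=1000, seed=0):
    """Simulate genotypes and compare each sample with a mismatched partner."""
    rng = np.random.default_rng(seed)
    geno = rng.choice(3, size=(n_samples, n_snps), p=genotype_probs(maf))
    partner = np.roll(geno, 1, axis=0)  # sample i is compared with sample i-1
    differs = geno != partner
    mean_rate = float(differs.mean())            # average per-SNP discordance
    any_diff = float(differs.any(axis=1).mean()) # pairs distinguishable by >= 1 SNP
    return mean_rate, any_diff


for maf in (0.05, 0.2, 0.5):
    for n_snps in (1, 5, 20):
        print(maf, n_snps, per_snp_discordance(maf),
              simulate_pair_discordance(maf, n_snps))
```

Under these assumptions, common SNPs separate mismatched samples with far fewer markers than rare SNPs, because the per-SNP discordance probability rises with minor allele frequency.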

Anderson, C. A., Pettersson, F. H., Clarke, G. M., Cardon, L. R., Morris, A. P., & Zondervan, K. T. (2010). Data quality control in genetic case-control association studies. Nature Protocols, 5(9), 1564–1573. https://doi.org/10.1038/nprot.2010.116
Beck, T., Shorter, T., & Brookes, A. J. (2020). GWAS Central: A comprehensive resource for the discovery and comparison of genotype and phenotype data from genome-wide association studies. Nucleic Acids Research, 48(D1), D933–D940. https://doi.org/10.1093/nar/gkz895
Broman, K. W., Keller, M. P., Broman, A. T., Kendziorski, C., Yandell, B. S., Sen, S., & Attie, A. D. (2015). Identification and correction of sample mix-ups in expression genetic data: A case study. G3 Genes|Genomes|Genetics, 5(10), 2177–2186. https://doi.org/10.1534/g3.115.019778
Cardon, L. R., & Palmer, L. J. (2003). Population stratification and spurious allelic association. Lancet, 361(9357), 598–604. https://doi.org/10.1016/S0140-6736(03)12520-2
Ekstrom, C. T., & Feenstra, B. (2012). Detecting sample misidentifications in genetic association studies. Statistical Applications in Genetics and Molecular Biology, 11(3), 13. https://doi.org/10.1515/1544-6115.1772
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., Maller, J., Sklar, P., de Bakker, P. I., Daly, M. J., & Sham, P. C. (2007). PLINK: A tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics, 81(3), 559–575. https://doi.org/10.1086/519795
Gut, I. G. (2001). Automation in genotyping of single nucleotide polymorphisms. Human Mutation, 17(6), 475–492. https://doi.org/10.1002/humu.1131
Huang, J., Kockum, I., & Stridh, P. (2021). Sample Verification and Quality Check Tool (SVAQC Tool). https://doi.org/10.17605/OSF.IO/5KPUC
Hunter-Zinck, H., Shi, Y., Li, M., Gorman, B. R., Ji, S. G., Sun, N., Webster, T., Liem, A., Hsieh, P., Devineni, P., Karnam, P., Gong, X., Radhakrishnan, L., Schmidt, J., Assimes, T. L., Huang, J., Pan, C., Humphries, D., Brophy, M., … Pyarajan, S. (2020). Genotyping array design and data quality control in the million veteran program. American Journal of Human Genetics, 106(4), 535–548. https://doi.org/10.1016/j.ajhg.2020.03.004
International Multiple Sclerosis Genetics Consortium. (2019). Multiple sclerosis genomic map implicates peripheral immune cells and microglia in susceptibility. Science, 365(6460), eaav7188. https://doi.org/10.1126/science.aav7188
Khan, J., Wei, J. S., Ringnér, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C., & Meltzer, P. S. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7(6), 673–679. https://doi.org/10.1038/89044
Kockum, I., Huang, J., & Stridh, P. (2023). Overview of genotyping technologies and methods. Current Protocols, 3, e727. https://doi.org/10.1002/cpz1.727
Kofanova, O. A., Mathieson, W., Thomas, G. A., & Betsou, F. (2014). DNA fingerprinting: A quality control case study for human biospecimen authentication. Biopreservation and Biobanking, 12(2), 151–153. https://doi.org/10.1089/bio.2013.0063
Laurie, C. C., Doheny, K. F., Mirel, D. B., Pugh, E. W., Bierut, L. J., Bhangale, T., Boehm, F., Caporaso, N. E., Cornelis, M. C., Edenberg, H. J., Gabriel, S. B., Harris, E. L., Hu, F. B., Jacobs, K. B., Kraft, P., Landi, M. T., Lumley, T., Manolio, T. A., McHugh, C., … GENEVA Investigators. (2010). Quality control and quality assurance in genotypic data for genome-wide association studies. Genetic Epidemiology, 34(6), 591–602. https://doi.org/10.1002/gepi.20516
Olafsson, S., Stridh, P., Bos, S. D., Ingason, A., Euesden, J., Sulem, P., Thorleifsson, G., Gustafsson, O., Johannesson, A., Geirsson, A. J., Thorsson, A. V., Sigurgeirsson, B., Ludviksson, B. R., Olafsson, E., Kristjansdottir, H., Jonasson, J. G., Olafsson, J. H., Orvar, K. B., Benediktsson, R., … Stefansson, K. (2017). Fourteen sequence variants that associate with multiple sclerosis discovered by meta-analysis informed by genetic correlations. NPJ Genomic Medicine, 2, 24. https://doi.org/10.1038/s41525-017-0027-2
Prive, F., Luu, K., Blum, M. G. B., McGrath, J. J., & Vilhjalmsson, B. J. (2020). Efficient toolkit implementing best practices for principal component analysis of population genetic data. Bioinformatics, 36(16), 4449–4457. https://doi.org/10.1093/bioinformatics/btaa520
Turner, S., Armstrong, L. L., Bradford, Y., Carlson, C. S., Crawford, D. C., Crenshaw, A. T., de Andrade, M., Doheny, K. F., Haines, J. L., Hayes, G., Jarvik, G., Jiang, L., Kullo, I. J., Li, R., Ling, H., Manolio, T. A., Matsumoto, M., McCarty, C. A., McDavid, A. N., … Ritchie, M. D. (2011). Quality control procedures for genome-wide association studies. Current Protocols in Human Genetics, 68, 1.19.1–1.19.18. https://doi.org/10.1002/0471142905.hg0119s68
Wellcome Trust Case Control Consortium. (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145), 661–678. https://doi.org/10.1038/nature05911
Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., Brown, M. A., & Yang, J. (2017). 10 Years of GWAS Discovery: Biology, Function, and Translation. American Journal of Human Genetics, 101(1), 5–22. https://doi.org/10.1016/j.ajhg.2017.06.005
Zhang, C., Wu, C., Blanzieri, E., Zhou, Y., Wang, Y., Du, W., & Liang, Y. (2009). Methods for labeling error detection in microarrays based on the effect of data perturbation on the regression model. Bioinformatics, 25(20), 2708–2714. https://doi.org/10.1093/bioinformatics/btp478