Recovering Misidentified Samples Through Genetic Discordance Clustering
Ingrid Kockum, Jesse Huang, Pernilla Stridh
Abstract
The many logistical and technical challenges associated with sample and data handling in large-scale genotyping studies can increase the risk of sample misidentification, which may compromise subsequent analyses. However, the standard quality assurance methods typical for large genotyping arrays can often be further utilized to identify and recover problematic samples. This article emphasizes the importance of identifying and correcting underlying sample misidentification, which may include undetected issues, rather than simply excluding known discrepancies. Lastly, we provide a screening protocol to complement standard quality assessments as a guideline for identifying mismatched samples and a tool for assessing the most common causes of sample misidentification. © 2024 The Authors. Current Protocols published by Wiley Periodicals LLC.
INTRODUCTION
Genome-wide association studies are crucial for investigating the genetic contribution to differences in disease susceptibilities and developmental traits. The surge of these studies in the last decade has provided insight into the interplay between genetic and environmental factors, allowing for a more comprehensive understanding of many complex diseases (Beck et al., 2020; Visscher et al., 2017). Implementing such studies requires a significant investment of resources, often brought forth through international collaboration and motivated by the need for a higher-resolution landscape of genetic influences. However, the logistical challenges associated with sample and data handling can often accumulate errors that counteract the benefits of establishing larger cohorts (Anderson et al., 2010; Laurie et al., 2010; Turner et al., 2011). Such mistakes can originate from several sources (e.g., sampling, biobank management, plating), and previous studies have indicated that >10% of samples in microarray data may be mislabeled (Broman et al., 2015; Khan et al., 2001; Zhang et al., 2009). Sample misidentification often generates noise, reducing the overall statistical power for identifying genetic associations; however, more systematic misidentifications may produce skewed data representations leading to biased findings (Broman et al., 2015). Studies of quantitative or secondary phenotypes are often more sensitive to detecting such biases than case-control studies due to challenges with normalization. Although the quality control of samples often leads to the exclusion of discrepancies, the sources of many issues relating to sample misidentification may not be thoroughly investigated (Ekstrom & Feenstra, 2012; Laurie et al., 2010) or corrected (Hunter-Zinck et al., 2020). This article highlights the importance of identifying discrepancies and correcting underlying issues rather than filtering out flagged samples.
Standard protocols and tools typically assess the quality of genotype data by both the individuals and markers (Fig. 1; Anderson et al., 2010; Purcell et al., 2007). This is a crucial process for minimizing potential biases and errors that may affect the accuracy and reliability of genetic associations. These standard protocols have been detailed extensively elsewhere (Anderson et al., 2010; Hunter-Zinck et al., 2020; Kockum et al., 2023; Turner et al., 2011); however, we have provided a brief overview of the standard procedures for assessing genotype data and a summary of the overall process in Figure 1.

In short, per-individual quality assessments usually include four steps (Anderson et al., 2010). (1) Discordances between genetic and clinical/self-reported sex are often identified and excluded as cases of sample misidentification. (2) A high rate of missing genotypes or outlying mean heterozygosity rate, which is the proportion of heterozygous genotypes relative to the sample population, likely indicates poor sample quality. Variations in DNA quality can impact the genotype call rate and accuracy. Therefore, individuals with high genotype missingness rate (undetermined genotypes for more than 3% to 7% of all markers) and abnormal heterozygosity should be removed. Heterozygosity rates <23% and >30% indicate inbreeding and contamination, respectively. (3) Relatedness between individuals is often assessed using identity-by-descent (IBD), the proportion of shared alleles among independent SNPs between sample pairs, to identify duplicate samples and related individuals. IBD estimates >0.1875 are often used to identify samples from relatives, and IBD >0.98 indicates duplicates or monozygotic twins. In population-based studies, one individual should be removed from each flagged pair to prevent bias due to over-representation. (4) Divergent ancestry is usually assessed by principal component analysis (PCA) to prevent confounding due to population stratification. Population outliers may be excluded to prevent deviations from the study population (e.g., immigration). Furthermore, such genotypic differences due to population origins between cases and controls may generate increased variability and spurious associations if uncorrected in the analysis (Cardon & Palmer, 2003).
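The per-individual checks above can be sketched as a small flagging routine. This is a minimal illustration, not the authors' pipeline: the `SampleQC` record, the function names, and the choice of 3% as the missingness cutoff (the lower end of the 3% to 7% range) are assumptions made for the example. The heterozygosity and IBD thresholds are the ones quoted in the text.

```python
from dataclasses import dataclass


@dataclass
class SampleQC:
    """Hypothetical per-sample summary; field names are illustrative."""
    sample_id: str
    reported_sex: str    # "M" or "F"
    genetic_sex: str     # "M" or "F"
    missing_rate: float  # fraction of markers without a genotype call
    het_rate: float      # fraction of heterozygous genotypes


def flag_sample(s, max_missing=0.03, het_low=0.23, het_high=0.30):
    """Return the list of QC flags raised for one sample."""
    flags = []
    if s.reported_sex != s.genetic_sex:
        flags.append("sex_discordance")
    if s.missing_rate > max_missing:
        flags.append("high_missingness")
    if s.het_rate < het_low:
        flags.append("low_heterozygosity")   # possible inbreeding
    elif s.het_rate > het_high:
        flags.append("high_heterozygosity")  # possible contamination
    return flags


def classify_pair(ibd, related=0.1875, duplicate=0.98):
    """Classify a sample pair from its IBD estimate."""
    if ibd > duplicate:
        return "duplicate_or_mz_twin"
    if ibd > related:
        return "related"
    return "unrelated"
```

Returning flags instead of dropping samples keeps the information needed later to look for patterns among the flagged samples.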
Quality control of the genetic markers is also critical to ensure the overall accuracy of genetic findings. Each marker's quality depends on the reliability of its genotyping call and any potential biases that may be introduced with the assessed phenotypes. Per-marker quality assessments usually include four steps. However, the cutoffs used to identify low-quality markers are less standardized and may differ between studies. (1) Markers with excessive missing genotypes, with a typical call rate threshold of 95%, indicate low genotyping quality and reliability and are often filtered out. A call rate threshold of up to 99% is sometimes used for markers with lower minor allele frequency (MAF, <5%) (Wellcome Trust Case Control Consortium, 2007). (2) Markers with a significant deviation from Hardy-Weinberg equilibrium (HWE) may suffer from errors in genotype calling and should be excluded. However, a deviation from HWE can also reflect selection pressure and could therefore be due to the assessed phenotype; for this reason, only controls should be used to assess marker quality. (3) Discrepancies in the call rate between comparison groups (e.g., case/control), which may be due to inadequate sample randomization, are assessed to prevent the introduction of bias associated with differences in genotyping quality. (4) Reliability is also assessed and filtered based on a MAF of at least 2%, as low-frequency alleles are more difficult to call accurately.
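As an illustration of step (2), the sketch below computes a one-degree-of-freedom chi-square test for Hardy-Weinberg equilibrium from genotype counts, which would be taken from controls only. The function name is hypothetical, the chi-square approximation is one common choice (exact tests are often preferred for rare alleles), and no significance cutoff is implied because the text does not specify one.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) for HWE from counts of AA, AB, and BB genotypes.

    Returns (chi2, p_value). Intended for control samples only.
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip classes with zero expectation (monomorphic markers).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

For example, counts of 25, 50, and 25 match the HWE expectation exactly at an allele frequency of 0.5, whereas a marker with no heterozygotes at the same frequency deviates strongly.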
It is common practice to flag and exclude problematic samples before the analysis, and additional protocols to remove misidentified samples have been proposed (Hunter-Zinck et al., 2020). However, samples flagged by standard quality assessments often represent systematic mistakes affecting other samples that may have gone undetected (Table 1, Fig. 2). Recognizing patterns among misidentified samples during the quality control process may not only salvage excluded samples but can potentially identify and correct additional undetected issues.
Error | Issue | Result | Action |
---|---|---|---|
Contamination | Samples are contaminated with DNA from another sample or sample handler | High heterozygosity, genotype errors | Contaminated samples should be excluded |
Duplication | Samples are unintentionally placed twice either by replacing another sample or shifting remaining samples sequentially | Bias by overrepresentation, breach of genotype-phenotype relationship | Only one of any replicated sample should be included; shifted samples should be corrected and included |
Shifts | Samples are shifted to a neighboring location due to either an unintentional overlap, skip in position, duplication, or deletion | Misidentification leading to a breach of genotype-phenotype relationship | Potential off-by-one and off-by-two pipetting errors should be investigated and corrected |
Flips | Sample rows and columns are unintentionally flipped due to plating, pipetting, or digital errors | Sample misidentification; only samples on two opposing corners (first and last sample placed) will likely remain correct | Row-column flips should be investigated and corrected; blank wells and control samples are helpful |
Plate orientation | The plate is misorientated (e.g., 180°) resulting in samples being mirrored | None of the samples on the plate will be correctly labeled | 90°, 180°, or 270° rotations to correct orientation of samples |
Plate swaps | Two plates are accidentally swapped resulting in sample identity swaps but intact plate position | None of the samples on the plate will be correctly labeled | Concordance tests can identify correct plate identity and sample ID should be corrected and included |
![Figure 2. Sample errors clustered by plate to detect misidentified samples.](https://static.yanyin.tech/literature_test/cpz1972-fig-0002-m.jpg)
Figure 2. Sample errors clustered by plate to detect misidentified samples. Examples of plates that illustrate common errors resulting in misidentification. Compared to the original plate (A), duplicate samples (purple) can cause off-by-one shifts (B), resulting in sample misidentification for part of the plate (G, black circles). Flipped columns and rows can originate from pipetting errors in the lab (C) or sorting/reshuffling within the data files (D), resulting in incorrect positioning of samples (H) or identities (I), respectively. The final example illustrates a 180° plate rotation (E), causing all samples to be repositioned (J). The misidentified samples detected by conventional QC methods in these examples range from 23% to 34% (F). Samples are colored according to reported sex (pink = female, blue = male). The samples that remain correct after the error occurs are labeled in green; misidentified samples are indicated with black circles; and samples that would be flagged in conventional genetic QC are labeled in red. Sections of rows/columns that are flipped are labeled in gray for original positions and black for new positions.
CONSIDERATIONS FOR GENETIC QUALITY ASSESSMENT
Cluster analyses of flagged problems are a complementary method that can improve current quality assessment protocols to identify patterns and rescue problematic samples (Fig. 1).
Clustering patterns of sex discordance
Sex discordance is an effective method for detecting significant systematic issues due to its reliability and frequent availability. Clustering sex discordance by sample handling parameters (e.g., pipetting, plating, batch, sample handler, analytical order) can identify the sources of possible mistakes along with patterns that may assist in deducing the cause of observed discrepancies. For example, we re-examined the genotype quality assurance procedures for a Swedish case-control cohort of multiple sclerosis patients and population-matched controls (International Multiple Sclerosis Genetics Consortium, 2019). By comparing genotype-predicted sex with sex derived from government-issued personal identification numbers, we could confirm sex discordances previously removed as part of the standard protocol. However, further classification by plate showed a clustering of 32 discrepancies within a single plate. Examining the patterns revealed a flip in the sample layout between columns and rows, leading to systematic misidentification for this plate. The sample identities were then corrected and confirmed using an independent genetic dataset (Olafsson et al., 2017). It is worth noting that only a third of the incorrect samples were initially identified and removed by standard sex concordance analysis, so the remaining samples would likely have introduced noise in any subsequent association analyses (Fig. 2).
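Clustering discordances by plate, as in the example above, can be sketched in a few lines. This is an illustrative outline only: the input tuple layout, the function names, and the 10% cutoff used to highlight a plate are assumptions for the example, not values taken from the study.

```python
from collections import defaultdict


def discordance_by_plate(samples):
    """Summarize sex discordance per plate.

    samples: iterable of (plate_id, reported_sex, genetic_sex) tuples.
    Returns {plate_id: (n_discordant, n_total, discordance_rate)}.
    """
    counts = defaultdict(lambda: [0, 0])  # plate -> [discordant, total]
    for plate, reported, genetic in samples:
        counts[plate][1] += 1
        if reported != genetic:
            counts[plate][0] += 1
    return {plate: (d, t, d / t) for plate, (d, t) in counts.items()}


def suspicious_plates(samples, min_rate=0.10):
    """Plates whose discordance rate reaches an (assumed) cutoff."""
    summary = discordance_by_plate(samples)
    return sorted(p for p, (_, _, rate) in summary.items() if rate >= min_rate)
```

The same grouping can be repeated for any other handling parameter (batch, sample handler, analytical order) by substituting it for the plate identifier.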
Systematic approaches for identifying sample/data mishandling
To facilitate similar analyses, we provide a supplementary Microsoft Excel-based tool where users can assess phenotype and genotype-derived data in a 96-well plate format (Samples Verification and Quality Control Tool, SVAQC; Huang et al., 2021). Data suspected of containing misidentified samples can be analyzed and corrected for a wide range of common sample and data handling mistakes. The concordance rate following each correction may be cross-compared by plate or sample layouts, along with the matching sample identity for each genotype. As illustrated in the example, the sex concordance rate within a plate can help determine systematic errors. However, the observed rate will depend on the type of mistake and the proportion of males and females on the plate. If we assume samples are randomly distributed and have an equal chance of mismatch, the predicted sex concordance rate (Rsc), given the number (nM/nF) or proportion (pM/pF) of males and females, respectively, on the plate would be:
$$R_{sc} = \frac{(p_M \times n_M) + (p_F \times n_F)}{n_M + n_F} \tag{1}$$

$$R_{sc} = p_M^2 + p_F^2 = p_M^2 + (1 - p_M)^2 = p_F^2 + (1 - p_F)^2 \tag{2}$$
This relationship is illustrated in Figure 3.
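As a worked example of the predicted sex concordance rate, the short function below evaluates Equation 1 from the male and female counts on a plate. The function name is hypothetical, and the calculation relies on the same assumption as the text: samples are randomly distributed and equally likely to be mismatched.

```python
def expected_sex_concordance(n_male, n_female):
    """Predicted sex concordance rate (Rsc) for a fully misassigned plate."""
    total = n_male + n_female
    if total == 0:
        raise ValueError("plate has no samples")
    p_m = n_male / total
    p_f = n_female / total
    # Equation 1; algebraically equal to p_m**2 + p_f**2 (Equation 2).
    return (p_m * n_male + p_f * n_female) / total


# A 96-well plate with 48 males and 48 females gives 0.5;
# a plate with 24 males and 72 females gives 0.625.
balanced = expected_sex_concordance(48, 48)
skewed = expected_sex_concordance(24, 72)
```

In other words, on a balanced plate about half of randomly misassigned samples would still pass a sex check by chance, and the proportion that passes grows as the sex ratio becomes more skewed.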

However, inconsistencies may occur in clinical or self-reported sex due to clerical errors. In self-reported sex, inconsistencies may also be due to miscommunication, e.g., reporting gender rather than biological sex, or related to privacy concerns.
Cross-validating sample identity in genetic data
Other phenotypes derived from genetic data, such as ethnicity, may also be useful for determining systematic errors. Genotypes projected onto a known reference, such as the 1000 Genomes Project, can be used to estimate ancestry (Prive et al., 2020). A set of markers can be retyped to confirm the identity of the samples, so-called fingerprinting (Kofanova et al., 2014). The number of SNPs required for fingerprinting depends on desired accuracy and the allele frequency of each SNP. As evidenced in Figure 4, we recommend three to seven SNPs not in linkage disequilibrium and with a minor allele frequency >30%.
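To show how the number of SNPs and their allele frequencies affect fingerprinting accuracy, the sketch below estimates the chance that two unrelated individuals share identical genotypes across a panel. It assumes Hardy-Weinberg proportions, independent SNPs, and no genotyping error; these are simplifying assumptions for illustration and not necessarily the calculation behind Figure 4. The function names are hypothetical.

```python
def genotype_match_probability(maf):
    """Probability that two unrelated individuals share a genotype at one SNP.

    Assumes Hardy-Weinberg proportions for the three genotypes.
    """
    q = 1 - maf
    genotype_freqs = (maf * maf, 2 * maf * q, q * q)
    return sum(f * f for f in genotype_freqs)


def panel_match_probability(mafs):
    """Chance match probability over a panel of independent SNPs."""
    prob = 1.0
    for maf in mafs:
        prob *= genotype_match_probability(maf)
    return prob
```

At a minor allele frequency of 0.5 the per-SNP match probability is 0.375, its minimum, which is why SNPs with high minor allele frequency are the most informative for confirming identity.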

Another approach for identifying major sample misidentifications by plate is to construct polygenic risk scores for known traits, such as the primary disease of study, and analyze the correlation between risk scores and the corresponding phenotype within each plate. Although the risk score is not predictive on an individual level for complex traits, a degree of correlation is expected between the risk score and phenotype, a lack of which may indicate broad sample misidentification within the plate. Similarly, the heritability of known traits can be estimated from the genotype data set and compared to expected estimates. A deviation in estimated heritability would suggest the presence of sample misidentification and the magnitude of the error.
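The per-plate comparison of risk score and phenotype could be outlined as follows. This is a sketch under stated assumptions: polygenic risk scores are taken as already computed, the phenotype is coded numerically (e.g., 0 for control and 1 for case), and a plain Pearson correlation stands in for whatever association measure a given study prefers. All names are hypothetical.

```python
import math
from collections import defaultdict


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either variable is constant
    return sxy / math.sqrt(sxx * syy)


def prs_correlation_by_plate(records):
    """records: iterable of (plate_id, risk_score, phenotype) tuples."""
    by_plate = defaultdict(lambda: ([], []))
    for plate, score, phenotype in records:
        by_plate[plate][0].append(score)
        by_plate[plate][1].append(phenotype)
    return {plate: pearson(s, p) for plate, (s, p) in by_plate.items()}
```

A plate whose correlation is near zero or negative while other plates show the expected positive correlation would be a candidate for closer inspection, bearing in mind that estimates from a single plate are noisy.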
Identifying mishandling with sample contamination
An excessive heterozygosity rate may indicate contamination resulting from improper sample handling or overlapping samples during plating. Clustering samples with high heterozygosity rates by sample handling parameters (e.g., plate or batch) can identify contaminated plates or systematic errors. For example, dense clustering of samples with a high heterozygosity rate and inconsistencies between samples and genotypes, such as missing data or empty wells, may indicate shifts during sample plating. In such cases, the identities of affected samples may be predicted and corrected during data preprocessing.
STUDY DESIGN AND ANALYTICAL CONSIDERATIONS
Data sets with corrected sample identities should be tested for concordance with independent parameters. Overlapping markers from an independent genotype array would provide the ideal check for individual consistency, although such data are not commonly available. Furthermore, matching individuals based on genotype, even among those that should not overlap, may help identify potential duplication and sample misplacement. The genotype concordance rate between non-duplicated individuals is typically <70%. In comparison, the same individuals or monozygotic twins tend to have >98% observed genotype concordance, with sparse cases in between this range. Therefore, even without a validation dataset, assessing the presence of intentional and unintentional duplicates (IBS ∼1) can help assess certain plating issues.
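A pairwise genotype concordance check along these lines might look like the sketch below. Genotypes are assumed to be coded as 0, 1, or 2 copies of an allele with `None` for missing calls; that coding and the function names are assumptions for the example, while the 70% and 98% boundaries are the typical values quoted above.

```python
def genotype_concordance(a, b):
    """Fraction of jointly called markers with identical genotypes."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        raise ValueError("no overlapping genotype calls")
    return sum(x == y for x, y in pairs) / len(pairs)


def classify_concordance(rate, duplicate_min=0.98, unrelated_max=0.70):
    """Interpret a concordance rate using the typical ranges from the text."""
    if rate > duplicate_min:
        return "same_individual_or_mz_twin"
    if rate < unrelated_max:
        return "different_individuals"
    return "ambiguous"
```

Comparing every sample against the genotype expected for its label, and against all other samples when the expected match fails, can both confirm an identity and suggest the correct one.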
Defensive practices and preventative measures can limit mistakes and allow for easier detection of errors. A simple preventive measure is placing blanks in the last two wells of each plate. The asymmetric placement can be useful for identifying systematic mistakes in sample ordering or plate misorientation, including the previous example. Another strategy is to repeat samples between plates as a control measure (Hunter-Zinck et al., 2020), facilitating the correction of misidentified samples due to plate layout errors, although this may be cost-prohibitive. Automated pipetting may provide an appropriate solution to reduce the chance of human error; however, complete automation may not be feasible depending on the situation, as reviewed by Gut (2001). The effort to salvage sample errors can be time-consuming and limited by the data and resources available. However, the further assessment of misidentification patterns and clusters utilizes already generated flags to expose systematic errors that would have remained in the data after standard protocols. As a guide for correcting sample misidentification in future studies, we provide a checklist (Table 2), which accompanies the supplementary tool (SVAQC), for different approaches to assessing the most common causes of misidentification.
Parameters | Data | Analysis | Application to data/samples |
---|---|---|---|
QC | Sex discordance | Cluster sex discordance by plate to identify a systematic pattern, e.g., off-by-one shifts, flips, or other sources of sample misidentification (Fig. 1); identify the breakpoints or buffer zones to determine which samples are affected and which can be corrected | Assign the corrected identity to the sample(s) and flag/exclude samples in buffer zones; compare the plate layouts of original and corrected sample identity to understand the source of misidentification (i.e., pipetting errors, digital flips in plate layout, etc.); test additional parameters for concordance in the corrected sample plates to confirm the corrections and fine map the breakpoints in buffer zones |
QC | Heterozygosity rate | Cluster heterozygosity outliers by plate to identify potentially contaminated samples as breakpoints for systematic shifts | If no pattern exists, the source of contamination is most likely a lab or pipetting error that only affects the sample, which can be excluded; if discordances follow a contaminated sample, it may be a breakpoint for a shift |
QC | Sample relatedness (IBS) | Cluster samples showing high IBS scores by plate to identify unintentional duplications; additionally, note control samples with low relatedness to identify patterns | If duplicated samples are located on the same plate, they may represent a breakpoint for errors; blocks of duplicated samples can reveal sources of errors, e.g., swapped plates |
QC | Principal component analysis (PCA) | Cluster discordance between 1000 Genomes Project estimated ethnicity and reported ethnicity by plate to identify a systematic pattern, such as off-by-one shifts, flips, or other sources of sample misidentification (Fig. 1); identify the breakpoints or buffer zones to determine which samples are affected and which can be corrected | Assign a corrected identity to the sample(s) and flag/exclude samples in buffer zones; compare the plate layouts of original and corrected sample identity to understand the source of misidentification (i.e., pipetting errors, digital flips in plate layout, etc.); test additional parameters, e.g., sex, for concordance in the corrected sample plates to confirm the corrections and fine map the breakpoints in buffer zones |
Alternative | Sample confirmation by IBS | If data from an independent genotype array is available, calculate relatedness between samples from the same individual to confirm the identity or flag a discrepancy | High concordance between samples on two independent arrays can resolve concerns regarding identity; this can also confirm previous corrections; conversely, low relatedness between samples can flag misidentification, and identify an alternative sample as the correct match |
Alternative | Alternative phenotypes | Compare genotype-derived and actual phenotype measures to confirm sample identities; the phenotypes must have a clear genetic component, e.g., ethnicity or ABO blood type | Cluster genotype-phenotype discordances by plate to reveal patterns of inconsistency |
Alternative | Digital flips and rotations | Assess if the pattern of errors is due to data file mishandling (sorting, A > Z, etc.) or potential misorientation of the plate when pipetting | Check concordance rate by plate and sample sheet layout and observe if the rate improves with flipping or rotating systematically |
Alternative | Blank wells | Check positions of blank wells in genotype data to confirm correct layout | Unexpected blanks may indicate shifts or errors; blanks can be used as anchors to correctly reposition sample layout |
Alternative | Fingerprinting | A set of markers can be re-genotyped to confirm the identity of all or some of the samples | Samples with confirmed identities can then be used to reassess the accuracy of the plate layout and test concordance with other parameters, e.g., sex |
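The "digital flips and rotations" check in the table can be sketched as a search over candidate layout corrections, scoring each by sex concordance. This is a minimal illustration and not the SVAQC tool's implementation: it assumes a standard 8 × 12 plate with wells indexed in row-major order (A1, A2, ..., H12), and all function names and the set of candidate corrections are hypothetical.

```python
ROWS, COLS = 8, 12  # assumed 96-well plate, wells indexed row-major


def identity(i):
    return i


def rotate_180(i):
    # Well i holds the sample intended for the diametrically opposite well.
    return ROWS * COLS - 1 - i


def row_column_flip(i):
    # Plate was filled column by column instead of row by row.
    r, c = divmod(i, COLS)
    return c * ROWS + r


def shift_by(k):
    # Samples were displaced forward by k wells.
    return lambda i: i - k


def concordance(reported_sex, genetic_sex, mapping):
    """reported_sex is in sample sheet order; genetic_sex is in well order."""
    matches = total = 0
    for well, observed in enumerate(genetic_sex):
        src = mapping(well)
        if 0 <= src < len(reported_sex):
            total += 1
            matches += reported_sex[src] == observed
    return matches / total if total else 0.0


def rank_corrections(reported_sex, genetic_sex):
    """Score each candidate correction, best first."""
    candidates = {
        "none": identity,
        "rotate_180": rotate_180,
        "row_column_flip": row_column_flip,
        "shift_+1": shift_by(1),
        "shift_-1": shift_by(-1),
    }
    scores = {name: concordance(reported_sex, genetic_sex, f)
              for name, f in candidates.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

A correction that raises concordance to nearly 100% is a strong hint about the underlying error, but it should still be confirmed with an independent parameter such as blank well positions or fingerprinting before sample identities are reassigned.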
CONCLUDING REMARKS
In conclusion, poor quality of genotyping data can increase the risk of producing spurious genetic associations in exploratory studies, which may be erroneously pursued at the cost of both time and resources. Although standard quality control protocols can identify certain discrepancies, they often overlook many affected samples. Therefore, along with establishing preventative measures, it is important to actively identify sources of sample misidentification and contamination to ensure the overall quality and reliability of findings.
AUTHOR CONTRIBUTIONS
Jesse Huang: Conceptualization; data curation; formal analysis; methodology; validation; visualization; writing – original draft; writing – review and editing. Ingrid Kockum: Conceptualization; data curation; funding acquisition; investigation; project administration; resources; writing – original draft; writing – review and editing. Pernilla Stridh: Conceptualization; data curation; formal analysis; methodology; validation; visualization; writing – original draft; writing – review and editing.
ACKNOWLEDGMENTS
This work was supported in part by Margaretha af Ugglas Foundation, MS Research Fund, the Neuro Foundation, an endMS Doctoral Studentship (EGID:3045) from the Multiple Sclerosis Society of Canada (to J.H.), and Horizon 2020 MultipleMS grant number 733161 (to I.K.).
CONFLICT OF INTEREST
The authors declare no competing interests.
Open Research
DATA AVAILABILITY STATEMENT
The referenced tool has been made publicly available through a repository (https://doi.org/10.17605/OSF.IO/5KPUC; Huang et al., 2021).
LITERATURE CITED
- Anderson, C. A., Pettersson, F. H., Clarke, G. M., Cardon, L. R., Morris, A. P., & Zondervan, K. T. (2010). Data quality control in genetic case-control association studies. Nature Protocols, 5(9), 1564–1573. https://doi.org/10.1038/nprot.2010.116
- Beck, T., Shorter, T., & Brookes, A. J. (2020). GWAS Central: A comprehensive resource for the discovery and comparison of genotype and phenotype data from genome-wide association studies. Nucleic Acids Research, 48(D1), D933–D940. https://doi.org/10.1093/nar/gkz895
- Broman, K. W., Keller, M. P., Broman, A. T., Kendziorski, C., Yandell, B. S., Sen, S., & Attie, A. D. (2015). Identification and correction of sample mix-ups in expression genetic data: A case study. G3 Genes|Genomes|Genetics, 5(10), 2177–2186. https://doi.org/10.1534/g3.115.019778
- Cardon, L. R., & Palmer, L. J. (2003). Population stratification and spurious allelic association. Lancet, 361(9357), 598–604. https://doi.org/10.1016/S0140-6736(03)12520-2
- Ekstrom, C. T., & Feenstra, B. (2012). Detecting sample misidentifications in genetic association studies. Statistical Applications in Genetics and Molecular Biology, 11(3), 13. https://doi.org/10.1515/1544-6115.1772
- Gut, I. G. (2001). Automation in genotyping of single nucleotide polymorphisms. Human Mutation, 17(6), 475–492. https://doi.org/10.1002/humu.1131
- Huang, J., Kockum, I., & Stridh, P. (2021). Sample Verification and Quality Check Tool (SVAQC Tool). https://doi.org/10.17605/OSF.IO/5KPUC
- Hunter-Zinck, H., Shi, Y., Li, M., Gorman, B. R., Ji, S. G., Sun, N., Webster, T., Liem, A., Hsieh, P., Devineni, P., Karnam, P., Gong, X., Radhakrishnan, L., Schmidt, J., Assimes, T. L., Huang, J., Pan, C., Humphries, D., Brophy, M., … Pyarajan, S. (2020). Genotyping array design and data quality control in the million veteran program. American Journal of Human Genetics, 106(4), 535–548. https://doi.org/10.1016/j.ajhg.2020.03.004
- International Multiple Sclerosis Genetics Consortium. (2019). Multiple sclerosis genomic map implicates peripheral immune cells and microglia in susceptibility. Science, 365(6460), eaav7188. https://doi.org/10.1126/science.aav7188
- Khan, J., Wei, J. S., Ringnér, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C., & Meltzer, P. S. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7(6), 673–679. https://doi.org/10.1038/89044
- Kockum, I., Huang, J., & Stridh, P. (2023). Overview of genotyping technologies and methods. Current Protocols, 3, e727. https://doi.org/10.1002/cpz1.727
- Kofanova, O. A., Mathieson, W., Thomas, G. A., & Betsou, F. (2014). DNA fingerprinting: A quality control case study for human biospecimen authentication. Biopreservation and Biobanking, 12(2), 151–153. https://doi.org/10.1089/bio.2013.0063
- Laurie, C. C., Doheny, K. F., Mirel, D. B., Pugh, E. W., Bierut, L. J., Bhangale, T., Boehm, F., Caporaso, N. E., Cornelis, M. C., Edenberg, H. J., Gabriel, S. B., Harris, E. L., Hu, F. B., Jacobs, K. B., Kraft, P., Landi, M. T., Lumley, T., Manolio, T. A., McHugh, C., … GENEVA Investigators. (2010). Quality control and quality assurance in genotypic data for genome-wide association studies. Genetic Epidemiology, 34(6), 591–602. https://doi.org/10.1002/gepi.20516
- Olafsson, S., Stridh, P., Bos, S. D., Ingason, A., Euesden, J., Sulem, P., Thorleifsson, G., Gustafsson, O., Johannesson, A., Geirsson, A. J., Thorsson, A. V., Sigurgeirsson, B., Ludviksson, B. R., Olafsson, E., Kristjansdottir, H., Jonasson, J. G., Olafsson, J. H., Orvar, K. B., Benediktsson, R., … Stefansson, K. (2017). Fourteen sequence variants that associate with multiple sclerosis discovered by meta-analysis informed by genetic correlations. NPJ Genomic Medicine, 2, 24. https://doi.org/10.1038/s41525-017-0027-2
- Prive, F., Luu, K., Blum, M. G. B., McGrath, J. J., & Vilhjalmsson, B. J. (2020). Efficient toolkit implementing best practices for principal component analysis of population genetic data. Bioinformatics, 36(16), 4449–4457. https://doi.org/10.1093/bioinformatics/btaa520
- Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., Maller, J., Sklar, P., de Bakker, P. I., Daly, M. J., & Sham, P. C. (2007). PLINK: A tool set for whole-genome association and population-based linkage analyses. American Journal of Human Genetics, 81(3), 559–575. https://doi.org/10.1086/519795
- Turner, S., Armstrong, L. L., Bradford, Y., Carlson, C. S., Crawford, D. C., Crenshaw, A. T., de Andrade, M., Doheny, K. F., Haines, J. L., Hayes, G., Jarvik, G., Jiang, L., Kullo, I. J., Li, R., Ling, H., Manolio, T. A., Matsumoto, M., McCarty, C. A., McDavid, A. N., … Ritchie, M. D. (2011). Quality control procedures for genome-wide association studies. Current Protocols in Human Genetics, 68, 1.19.1–1.19.18. https://doi.org/10.1002/0471142905.hg0119s68
- Visscher, P. M., Wray, N. R., Zhang, Q., Sklar, P., McCarthy, M. I., Brown, M. A., & Yang, J. (2017). 10 Years of GWAS Discovery: Biology, Function, and Translation. American Journal of Human Genetics, 101(1), 5–22. https://doi.org/10.1016/j.ajhg.2017.06.005
- Wellcome Trust Case Control Consortium. (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145), 661–678. https://doi.org/10.1038/nature05911
- Zhang, C., Wu, C., Blanzieri, E., Zhou, Y., Wang, Y., Du, W., & Liang, Y. (2009). Methods for labeling error detection in microarrays based on the effect of data perturbation on the regression model. Bioinformatics, 25(20), 2708–2714. https://doi.org/10.1093/bioinformatics/btp478