Genotypic Imputation: How the Process Views Original Data and Fills Gaps

Abstract
Genotypic imputation is a statistical technique that uses population reference panels to infer the genotypes of variants missing from a genotyping dataset. In this article, I describe in detail how imputation interprets original calls, preserves or adjusts existing genotypes, and fills in ungenotyped SNPs, using a region of the HERC2 gene as an example. The question came up while I was reviewing imputation protocols this week.

1. Introduction
In genomic studies, genotyping chips usually measure only a subset of SNPs. Imputation allows “filling in” variants not directly observed, increasing the density of markers for association analyses and haplotype mapping. However, there are doubts about the extent to which the process can alter original calls, which are considered “true” by the laboratory.

2. Fundamentals of Imputation
Input (original calls)

Each genotyped position (e.g., a “present” SNP) carries a chromosome, a position, the alleles (REF/ALT), and a genotype (GT: 0/0, 0/1, or 1/1).

Example:

#CHROM POS ID REF ALT FORMAT SAMPLE
chr15 28365618 rs12913832 A G GT 0/0
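
To make the GT encoding concrete, here is a minimal sketch that reads one record in the simplified seven-column layout of the example and translates 0/1 indices back into bases. The function and class names are my own, and a full VCF record has additional columns (QUAL, FILTER, INFO) that this sketch does not handle; for real files a dedicated VCF library is the safer choice.

```python
from typing import NamedTuple


class Call(NamedTuple):
    chrom: str
    pos: int
    snp_id: str
    ref: str
    alt: str
    gt: str          # raw GT string, e.g. "0/1"
    alleles: tuple   # decoded bases, e.g. ("A", "G")


def parse_simplified_record(line: str) -> Call:
    """Parse one whitespace-separated record with the seven columns
    CHROM POS ID REF ALT FORMAT <sample> used in the simplified example."""
    chrom, pos, snp_id, ref, alt, fmt, gt = line.split()
    if fmt != "GT":
        raise ValueError(f"only a bare GT field is handled here, got {fmt!r}")
    bases = {"0": ref, "1": alt, ".": None}   # "." marks a missing allele
    # Accept both unphased ("/") and phased ("|") separators.
    alleles = tuple(bases[a] for a in gt.replace("|", "/").split("/"))
    return Call(chrom, int(pos), snp_id, ref, alt, gt, alleles)


call = parse_simplified_record("chr15 28365618 rs12913832 A G GT 0/0")
print(call.snp_id, call.alleles)
```

With REF = A and ALT = G, the genotype 0/0 decodes to A/A, 0/1 to A/G, and 1/1 to G/G.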

Reference Panel

Contains thousands of individuals with dense genotypes (e.g., 1000 Genomes).

Allows the identification of haplotype patterns (allele sequences) that occur in blocks.

Statistical Model

Imputation applies a hidden Markov model (HMM) that includes:

Genotypic error (err): probability that a call is incorrect (e.g., 0.1%).

Recombination: modeled from a genetic map file (distances in centiMorgans).

The algorithm assembles haplotypes by combining your data and the panel, estimating the sequence that maximizes the joint likelihood.
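
As an illustration of the idea, the sketch below implements a toy haploid copying HMM with a forward-backward pass. It is not the algorithm of any particular imputation program: the function name, the tiny panel, and the constant `switch` probability (standing in for a genetic map) are all assumptions made for the example, and real tools work on diploid data with far larger panels.

```python
def impute_haplotype(panel, target, err=0.001, switch=0.01):
    """Posterior probability that the copied allele is 1 at every site.

    panel  : list of reference haplotypes, each a list of 0/1 alleles
    target : one haplotype with 0/1 at typed sites and None where untyped
    err    : probability that an observed allele differs from the copied one
    switch : per-interval probability of jumping to a random reference
             haplotype (a constant stand-in for a genetic map)
    """
    n_hap, n_site = len(panel), len(target)

    def emission(h, s):
        if target[s] is None:              # untyped site: no information
            return 1.0
        return 1.0 - err if panel[h][s] == target[s] else err

    def normalise(values):
        total = sum(values)
        if total == 0.0:
            raise ValueError("observations impossible under the model; use err > 0")
        return [v / total for v in values]

    # Forward pass: fwd[s][h] is proportional to
    # P(observations up to site s, copying haplotype h at s).
    fwd = [normalise([emission(h, 0) for h in range(n_hap)])]
    for s in range(1, n_site):
        prev = fwd[-1]                     # sums to 1 after normalisation
        fwd.append(normalise([
            emission(h, s) * ((1 - switch) * prev[h] + switch / n_hap)
            for h in range(n_hap)
        ]))

    # Backward pass: bwd[s][h] is proportional to
    # P(observations after site s | copying haplotype h at s).
    bwd = [[1.0] * n_hap for _ in range(n_site)]
    for s in range(n_site - 2, -1, -1):
        weighted = [emission(h, s + 1) * bwd[s + 1][h] for h in range(n_hap)]
        jump = switch * sum(weighted) / n_hap
        bwd[s] = normalise([(1 - switch) * w + jump for w in weighted])

    # Combine both passes and sum the weight of haplotypes carrying allele 1.
    posterior = []
    for s in range(n_site):
        gamma = normalise([f * b for f, b in zip(fwd[s], bwd[s])])
        posterior.append(sum(g for g, hap in zip(gamma, panel) if hap[s] == 1))
    return posterior


# Made-up panel: three haplotypes carry 0-1-0, one carries 1-0-1.
panel = [[0, 1, 0], [0, 1, 0], [0, 1, 0], [1, 0, 1]]
target = [0, None, 0]                      # middle site was not on the chip
print([round(p, 4) for p in impute_haplotype(panel, target)])
```

The untyped middle site receives a high probability for allele 1 because the flanking typed alleles match the haplotypes that carry it. Note that the typed sites also receive a posterior, which is what lets the model disagree with an observed call when err is above zero.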

3. Preservation vs. Adjustment of Original Genotypes
Although imputation starts from the original calls, it does not consider them absolute. If, in a haplotype block, the measured genotype is very unlikely, the model can “correct” this call:

Error Model
Even existing GTs are weighted by an internal error rate.

Haplotype Concordance
The algorithm searches for the combination of alleles (including at the originally genotyped SNPs) that best fits the population pattern.
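
The interplay between the error rate and the haplotype context can be sketched with a single application of Bayes' rule. This is a simplification I am using for illustration, not the internal computation of any specific tool: `prior_observed` is a hypothetical number summarizing how plausible the observed allele is given the surrounding haplotype, and the site is assumed to be biallelic.

```python
def prob_call_is_correct(prior_observed, err):
    """Posterior probability that the observed allele is the true one.

    prior_observed : probability the haplotype context assigns to the
                     observed allele before looking at the chip call
    err            : probability that the chip reports the wrong allele
    Assumes a biallelic site, so a wrong call always shows the other allele.
    """
    agree = (1.0 - err) * prior_observed        # call is right
    disagree = err * (1.0 - prior_observed)     # call is a genotyping error
    return agree / (agree + disagree)


# How strongly must the context disagree before the call is doubted?
for prior in (0.5, 0.01, 0.001, 0.0001):
    print(prior, round(prob_call_is_correct(prior, err=0.001), 4))
```

With err = 0.001, the call only becomes a coin flip when the context gives the observed allele a prior of 0.001, so a measured genotype is overturned only in very unusual haplotype settings. With err = 0 the posterior is 1 for any nonzero prior, which is why a zero error rate pins the original calls.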

How to avoid unwanted changes

Setting err=0 (where the imputation tool exposes such a parameter) makes the model treat original calls as error-free, so they are kept and only missing positions are filled.

Post-imputation restoration: restore GTs from the measured VCF over the imputed one, to ensure that only missing positions are filled.
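
The restoration step can be sketched as a simple overlay of two genotype tables. The example below keeps genotypes in plain dictionaries keyed by rsID, which is an assumption made for brevity; with real VCFs it is safer to match on chromosome, position, REF and ALT. The function name and the imputed values are hypothetical.

```python
MISSING = {".", "./.", ".|."}


def unordered(gt):
    """Alleles of a GT string as a sorted list, ignoring phase."""
    return sorted(gt.replace("|", "/").split("/"))


def restore_measured(imputed, measured):
    """Overlay measured genotypes on imputed ones.

    Both arguments map a variant ID to a GT string. Returns the restored
    table plus the IDs of variants whose imputed genotype differed from
    the measured one. Missing measured calls are skipped, so the imputed
    value is kept for them. Restoring an unphased call discards the
    phase that imputation assigned to that variant.
    """
    restored = dict(imputed)
    changed = []
    for variant_id, gt in measured.items():
        if gt in MISSING:
            continue
        if variant_id in imputed and unordered(imputed[variant_id]) != unordered(gt):
            changed.append(variant_id)
        restored[variant_id] = gt
    return restored, changed


measured = {"rs12913832": "0/0"}
imputed = {"rs12913832": "0|1", "rs1129038": "1|1"}   # hypothetical output
restored, changed = restore_measured(imputed, measured)
print(restored)
print("imputation had altered:", changed)
```

Returning the list of altered variants is a deliberate choice: it gives a quick audit of how often imputation disagreed with the chip before the original calls are put back.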

4. Real Example: HERC2 Region (Eye Color)
SNP | Status | Population¹ | Your Chip | Imputation
rs12913832 | Present | 40 % A/A, 45 % A/G, 15 % G/G | A/A | A/A (maintained)
rs1129038 | Absent | T/T in 90 % of A/A haplotypes in rs12913832 | – | T/T imputed
¹ Patterns observed in the 1000 Genomes panel.

The A/A call at rs12913832 is compatible with the “A/A+T/T” haplotype and therefore maintained.

The missing SNP rs1129038 is inferred as T/T (1/1) because, in the population, T/T co-occurs in the same haplotype block with rs12913832 = A/A.

5. Conclusion
Genotypic imputation views your original data as observations subject to statistical noise and can, in theory, correct unlikely calls. For publications and clinical applications, it is recommended to preserve the measured genotypes, restoring them after the imputation step. This ensures that only genuinely missing variants are filled in, maintaining the reliability of the original data.
