POPH90111 · Genetic Epidemiology
Genetic Association Studies
Heritability said genes matter; association asks which ones. At its core a genetic-association study is a case-control study whose exposure is a genetic marker, usually a SNP, comparing how often the variant appears in cases versus controls. Scale that from a handful of candidate genes to millions of SNPs across the whole genome and you have a GWAS. The price of testing millions of hypotheses is a flood of false positives, which forces the famously strict genome-wide threshold p < 5×10⁻⁸ and two diagnostic plots: the Manhattan (where are the hits?) and the QQ (is the whole study inflated?). The chapter is calculation plus interpretation: build the 2×2 and compute the per-allele OR = ad/bc and an allelic χ² (df = 1), read a logistic-regression row (OR = e^β), then read the plots and rule out the key confounder, population stratification, which is fixed by adjusting for ancestry principal components.
What this chapter covers
- 3.1 The idea: exposure = a SNP; three reasons for a signal (causal / LD / stratification)
- Candidate-gene vs genome-wide design
- 3.2 The 2×2: per-allele OR = ad/bc and the allelic χ² (df = 1); genotypic df = 2
- 3.3 Multiple testing and the 5×10⁻⁸ genome-wide threshold (Bonferroni)
- 3.4 Reading the Manhattan plot (peaks above ≈7.3 = loci, not causal genes)
- 3.5 The QQ plot & genomic inflation (λGC); tail-lift vs whole-line lift
- 3.6 Population stratification (the key confounder) and how to fix it
- 3.7 The GWAS read-out drill: effect → test → multiple testing → calibration → replication
Worked example: per-allele OR and the allelic χ²
- Label the 2×2. Cases: a = 120 copies of risk allele A, b = 80 copies of the other allele. Controls: c = 90 copies of A, d = 110 copies of the other allele. N = 400 alleles.
- (a) Odds ratio. OR = ad/bc = (120 × 110) / (80 × 90) = 13200 / 7200 = 1.83, so each risk allele raises the odds by about 83%.
- (b) Expected counts. Row totals are 200 (cases) and 200 (controls); column totals are A = 210, other = 190. E(case, A) = 200 × 210 / 400 = 105, and similarly E(case, other) = 95, E(control, A) = 105, E(control, other) = 95.
- (b) Chi-square. χ² = Σ(O−E)²/E = (15²/105) + (15²/95) + (15²/105) + (15²/95) ≈ 2.14 + 2.37 + 2.14 + 2.37 ≈ 9.0; df = 1, and 9.0 > 3.84, so p < 0.05.
- (c) Genome-wide? The Bonferroni threshold is 0.05 / 1,000,000 = 5×10⁻⁸. Here p ≈ 0.003, which is far larger, so the result is nominally significant but not genome-wide significant. In a GWAS it would be treated as noise.
- Appraise. A genome-wide hit must also replicate independently and survive a clean QQ plot; the top SNP marks a locus in LD with the truth, not ‘the disease gene’.
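The worked example can be checked in a few lines. This is a minimal sketch, not a prescribed method: the function name `allelic_test` is made up for illustration, and the p-value relies on the fact that for df = 1 the χ² tail probability equals erfc(√(χ²/2)).

```python
import math

def allelic_test(a, b, c, d):
    """Per-allele OR and allelic chi-square (df = 1) from a 2x2 of allele counts.

    a, b = risk / other allele counts in cases; c, d = the same in controls.
    """
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n  # E = row x col / N
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # For df = 1, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return odds_ratio, chi2, p_value

# Counts from the worked example above.
odds_ratio, chi2, p_value = allelic_test(120, 80, 90, 110)
print(f"OR = {odds_ratio:.2f}, chi2 = {chi2:.2f}, p = {p_value:.4f}")
print("nominally significant:", chi2 > 3.84)
print("genome-wide significant:", p_value < 5e-8)
```

With these counts the script reproduces the hand calculation: an OR of about 1.83, χ² of about 9.0, nominal but not genome-wide significance.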
Key terms
- Allelic χ² test
- A test of genetic association comparing the observed allele counts in cases versus controls against those expected under no association, χ² = Σ(O−E)²/E with E = row total × column total / N. The allelic test has df = 1 (significant if χ² > 3.84 at α = 0.05); the genotypic test (three genotypes) has df = 2.
- Per-allele (additive) odds ratio
- The odds ratio per extra copy of the risk allele, OR = ad/bc from the allelic 2×2, or OR = e^β from a logistic regression that codes genotype 0/1/2 by the number of risk alleles. It is the standard GWAS effect measure, reported with a 95% CI and adjusted for ancestry principal components, age and sex.
- Genome-wide significance (5×10⁻⁸)
- The strict significance threshold for a GWAS, the Bonferroni correction for roughly one million effectively-independent common-variant tests (0.05 / 10⁶). On a Manhattan plot it sits at −log10(5×10⁻⁸) ≈ 7.3. A signal must clear this line and then replicate in an independent sample.
- Manhattan plot
- The GWAS summary plot: each dot is one SNP, with genomic position (chromosome 1→22) on the x-axis and −log10(p) on the y-axis. Genuine associations rise as ‘skyscrapers’ above the genome-wide line at ≈7.3. One peak is one locus — a cluster of correlated SNPs in LD — and the tallest SNP is the best tag, not necessarily the causal variant.
- Population stratification
- Confounding by ancestry: if cases and controls differ in ancestry and both allele frequencies and disease rates vary by ancestry, then any ancestry-marking SNP looks associated — a spurious hit driven by structure, not biology. It is the usual cause of a whole-line QQ lift (genomic inflation λGC > 1) and is fixed by adjusting for ancestry principal components, matching, or genomic control.
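To make the per-allele odds ratio defined above concrete, the sketch below converts a logistic-regression coefficient into an OR with a 95% CI. It assumes the usual Wald interval, exp(β ± 1.96 × SE); the helper name and the β and SE values are hypothetical and not taken from any study.

```python
import math

def odds_ratio_from_beta(beta, se, z=1.96):
    """Return (OR, lower, upper) from a log-odds coefficient and its standard error."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical regression row for a SNP coded 0/1/2 by risk-allele count.
beta, se = 0.18, 0.03
or_estimate, ci_low, ci_high = odds_ratio_from_beta(beta, se)
print(f"OR = {or_estimate:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
print("CI excludes 1:", ci_low > 1 or ci_high < 1)
```

A CI that excludes 1 only establishes nominal evidence of an effect; the genome-wide threshold and replication still apply.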
Genetic Association Studies FAQ
Why does a GWAS need a threshold as strict as 5×10⁻⁸?
Because of multiple testing. Test one SNP at α = 0.05 and you accept a 1-in-20 false-positive risk; test a million SNPs and you expect about 50,000 ‘significant’ results purely by chance. The Bonferroni correction divides 0.05 by the number of effectively-independent tests — about a million common-variant tests across the genome — giving 5×10⁻⁸. Memorise both the p-value and its −log10 ≈ 7.3.
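The arithmetic behind that answer is short enough to verify directly. The sketch below takes one million effectively-independent tests, as stated above; the variable names are illustrative only.

```python
import math

alpha = 0.05
n_tests = 1_000_000  # effectively-independent common-variant tests

expected_false_positives = alpha * n_tests       # hits expected by chance at alpha
bonferroni_threshold = alpha / n_tests           # genome-wide threshold
manhattan_line = -math.log10(bonferroni_threshold)

print(f"expected false positives at alpha: {expected_false_positives:.0f}")
print(f"Bonferroni threshold: {bonferroni_threshold:.0e}")
print(f"-log10(threshold): {manhattan_line:.1f}")
```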
What is the difference between a Manhattan plot and a QQ plot?
The Manhattan plot shows where the hits are — −log10(p) across the chromosomes, with peaks above ≈7.3 marking associated loci. The QQ plot is a calibration check you read before trusting any peak: observed −log10(p) against the null expectation. A late upward tail only = genuine associations (the good GWAS); a whole-line lift from the origin = genomic inflation (λ > 1), the signature of population stratification or other artefact.
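As a rough illustration of what a QQ plot actually plots, the sketch below pairs each observed −log10(p) with its expectation under the null. It assumes the common plotting position (rank − 0.5)/n for the expected p-value, which is one convention among several, and the toy p-values are invented for the example.

```python
import math

def qq_points(p_values):
    """Pair each observed -log10(p) with its null expectation, smallest p first."""
    n = len(p_values)
    points = []
    for rank, p in enumerate(sorted(p_values), start=1):
        expected_p = (rank - 0.5) / n  # assumed plotting position under the null
        points.append((-math.log10(expected_p), -math.log10(p)))
    return points

toy_p_values = [0.5, 0.1, 0.9, 0.3]  # made-up values, far too few for a real QQ plot
points = qq_points(toy_p_values)
for expected, observed in points:
    print(f"expected {expected:.2f}  observed {observed:.2f}")
```

Points that follow the diagonal (observed ≈ expected) indicate a well-calibrated study; the tail and whole-line patterns described above are departures from that diagonal.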
Why can’t I call the top Manhattan SNP ‘the disease gene’?
Because it is almost always just the best tag in LD with the true causal variant, not the cause itself, and a single peak can span several genes. The honest statement is ‘a locus at chromosome X is associated; fine-mapping is needed to identify the causal variant’. Reporting the top SNP as the disease gene is a classic association-chapter error.
How do you detect and fix population stratification?
Detect it from a whole-line QQ lift and λGC > 1. Fix it by (1) adjusting for ancestry principal components in the logistic model — the standard solution; (2) matching cases and controls on ancestry; (3) genomic control (divide χ² by λ); or (4) family-based designs. A Hardy–Weinberg deviation in controls is also a useful QC flag for genotyping error or structure.
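Option (3) can be sketched as follows. The code assumes the common definition of λGC as the median observed 1-df χ² statistic divided by the median of the χ² distribution with 1 df (about 0.455); the function names and the toy statistics are illustrative, and capping λ at 1 is a convention assumed here rather than something stated above.

```python
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of the chi-square distribution with 1 df

def genomic_inflation(chi2_stats):
    """lambda_GC = median observed chi-square / median expected under the null."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN

def genomic_control(chi2_stats):
    """Divide every statistic by lambda; lambda is capped at 1 so nothing is inflated."""
    lam = max(genomic_inflation(chi2_stats), 1.0)
    return [x / lam for x in chi2_stats]

toy_chi2 = [0.1, 0.3, 0.5, 0.9, 2.0]  # made-up statistics for illustration
lam = genomic_inflation(toy_chi2)
corrected = genomic_control(toy_chi2)
print(f"lambda_GC = {lam:.3f}")
print("corrected:", [round(x, 3) for x in corrected])
```

Genomic control rescales every statistic by the same factor, so it suits mild, uniform inflation; adjusting for ancestry principal components remains the standard solution named above.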
Exam move
Anchor the chapter on the one calculation that recurs: 2×2 → OR = ad/bc and the allelic χ² with E = row×col/N (df = 1, critical value 3.84; genotypic df = 2). Then memorise the GWAS gates in order — effect size (OR away from 1, CI excludes 1), test (small p), multiple testing (p < 5×10⁻⁸), calibration (QQ diagonal except a tail; λ ≈ 1), confounding (adjust for principal components), confirmation (replication), localisation (fine-mapping). Practise reading both plots: Manhattan peaks above ≈7.3 are loci not causal genes; a QQ tail-lift is real signal but a whole-line lift is inflation → population stratification. The appraisal line to write: a genome-wide hit marks a locus in LD with the truth and must replicate.