Genetic Epidemiology
Sem 1 2026 · Side 1 of 2
Foundations → heritability → association
0 · How To Use This · read first
This subject is a pipeline: UNDERSTAND (is there a genetic role? — aggregation, heritability) → DISCOVER (which variants? — LD, GWAS, MR) → CHARACTERISE (how risky? — penetrance, modifiers, G×E) → USE (screening). Side 1 = understand & discover; side 2 = characterise & use.
Assessment shape: online MCQ 10% (10 Qs, 1-week window) + written A1 40% (Modules 1–3) + A2 50% (Modules 4–8). All online / take-home — no invigilated exam.
Every assignment task is one of three: (a) calculate + interpret, (b) discuss findings, (c) critically appraise a design. So the two high-value moves are: plug the right formula, then judge the design's bias. A Stata .do file is even handed out for A1 Q1 — expect software-based calculation, then a written interpretation.
1 · Genetics Primer · Extra-Module 1
Locus = position on a chromosome. Allele = the base(s) there; minor allele = rarer one. Genotype = the pair (e.g. TT, TC, CC); homo- vs hetero-zygous.
- Polymorphism — common variant (>1% freq), e.g. a SNP; small/no effect
- Pathogenic mutation — major deleterious effect → big risk
- Germline = inherited, in every cell → familial risk (sample blood/buccal)
- Somatic = acquired, tumour only → not inherited (sample biopsy)
- Minor allele = the less common allele at the locus in the population
Mutation classes: silent (usually benign), missense (changes amino acid), nonsense (premature stop), frameshift indels (corrupt every downstream codon → usually pathogenic). CNV = larger gain/loss.
2 · Modes of Inheritance · risk inequalities
Defined on Pr(phenotype | # risk alleles), not on "having" the trait:
Autosomal:
Dominant: Pr(2)=Pr(1) > Pr(0)
Recessive: Pr(2) > Pr(1)=Pr(0)
Codominant: Pr(2) > Pr(1) > Pr(0)
Carrier risk can be <1 (incomplete penetrance) and non-carrier risk >0 (phenocopies/sporadic). So a dominant variant can still have penetrance below 100%.
Segregation (Punnett)
Each parent passes one randomly-chosen allele. Aa×aa → ½ Aa, ½ aa (no AA). Aa×Aa → ¼ AA, ½ Aa, ¼ aa ⇒ P(child carries ≥1 A)=¾, P(AA)=¼.
Trap: the genotype gives the expected probability distribution, not the realised counts in a small sibship.
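The segregation fractions above can be checked by enumerating the four equally likely allele pairings. This is a minimal Python sketch; the function name `cross` and the two-letter genotype strings are illustrative choices, not part of the subject materials.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1, parent2):
    """Expected offspring genotype probabilities for a one-locus cross.

    Each parent is a two-character genotype string such as "Aa"
    (hypothetical encoding: upper case = A allele, lower case = a allele).
    """
    # Each parent transmits one of its two alleles with probability 1/2,
    # so the 2 x 2 = 4 pairings are equally likely.
    pairings = ("".join(sorted(a + b)) for a, b in product(parent1, parent2))
    counts = Counter(pairings)
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


carrier_x_carrier = cross("Aa", "Aa")
p_at_least_one_A = sum(p for g, p in carrier_x_carrier.items() if "A" in g)
print(p_at_least_one_A)  # 3/4
```

These are expected proportions only; a real sibship of three or four children can easily deviate from them.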
2b · Germline vs Somatic · sample choice
- Inherited colorectal-cancer family risk → germline → sample blood / buccal swab
- Tumour responds differently to chemo, no family history → somatic → sample the tumour biopsy
3 · Allele & Genotype Freq · calculate
From counts n(AA), n(Aa), n(aa) in N people:
Allele frequency (per chromosome): p = [2·n(AA) + n(Aa)] / 2N · q = 1 − p
Worked: 100 people = 64 CC, 32 CT, 4 TT. T alleles = 2·4+32 = 40; total alleles = 2·100 = 200 ⇒ freq(T)=40/200=0.20, freq(C)=0.80 (20% of all alleles at this locus are T).
Carrier frequency (per person): carrier freq = p² + 2pq = 1 − q²
Worked: risk-allele freq 0.1 ⇒ 0.1² + 2(0.1)(0.9) = 0.01 + 0.18 = 0.19 (19% carry ≥1 copy). Equivalently 1 − q² = 1 − 0.81 = 0.19.
Why it matters: the variant is the exposure; carrier freq = exposure prevalence → drives sample size/power. Rare variants need huge or enriched samples. Trap: allele freq (per-chromosome, ÷2N) ≠ carrier/genotype freq (per-person, ÷N). At T freq p = 0.01, TT is very rare (p² = 0.0001) yet carriers are ~2% (1 − 0.99² ≈ 0.0199), so design power around the carrier count.
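The two worked examples above can be reproduced with a few lines of Python. This is a sketch only; the function and argument names are illustrative, and `carrier_freq` assumes Hardy–Weinberg proportions.

```python
def allele_freq(n_hom, n_het, n_other_hom):
    """Per-chromosome frequency of an allele: (2*hom + het) / 2N."""
    n_people = n_hom + n_het + n_other_hom
    return (2 * n_hom + n_het) / (2 * n_people)


def carrier_freq(p):
    """Per-person proportion with >= 1 copy of an allele of frequency p.

    Assumes Hardy-Weinberg proportions: p^2 + 2pq = 1 - q^2.
    """
    q = 1 - p
    return 1 - q ** 2


# 100 people: 4 TT, 32 CT, 64 CC (numbers from the worked example)
freq_t = allele_freq(n_hom=4, n_het=32, n_other_hom=64)
carriers_common = carrier_freq(0.10)   # about 0.19
carriers_rare = carrier_freq(0.01)     # about 0.0199
print(round(freq_t, 2), round(carriers_common, 2), round(carriers_rare, 4))
```

Keeping the two functions separate mirrors the trap: one divides by chromosomes (2N), the other describes people.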
4 · Hardy–Weinberg · canon
Holds in a large, randomly-mating population with no selection, migration or mutation ⇒ genotype freqs are constant across generations & predicted by allele freqs. This is exactly the genotype split used for carrier frequency:
HWE: p² + 2pq + q² = 1
AA=p² · Aa=2pq · aa=q²
Test (χ² goodness-of-fit): χ² = Σ (O − E)² / E · df = 1
significant if χ² > 3.84 (α=0.05)
df=1: 3 genotype classes − 1 − 1 (estimated allele freq). Deviation in CONTROLS ⇒ genotyping error / population stratification → GWAS QC check.
Worked: p(T)=0.20 in N=100 ⇒ expected 100·0.2²=4 TT, 100·2(0.2)(0.8)=32 TC, 100·0.8²=64 CC. Observed 4/32/64 match exactly ⇒ χ²≈0 ⇒ in HWE (QC passes).
If instead observed = 10 TT, 20 TC, 70 CC (T allele freq still (2·10+20)/200 = 0.20), then χ² = (10−4)²/4 + (20−32)²/32 + (70−64)²/64 ≈ 9 + 4.5 + 0.6 = 14.1 > 3.84 ⇒ reject HWE ⇒ in controls, suspect a genotyping error or population stratification and exclude/recheck the SNP.
Trap: HWE deviation in cases can be a real disease association — so test HWE conventionally in controls.
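The HWE goodness-of-fit calculation above can be sketched in Python as follows. The function name and argument order are illustrative assumptions; the 3.84 cut-off is the df = 1, α = 0.05 critical value quoted above.

```python
CHI2_CRIT_DF1 = 3.84  # critical value for df = 1 at alpha = 0.05


def hwe_chi2(n_tt, n_tc, n_cc):
    """Chi-square goodness-of-fit statistic against Hardy-Weinberg proportions."""
    n = n_tt + n_tc + n_cc
    p = (2 * n_tt + n_tc) / (2 * n)   # T allele frequency estimated from the data
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_tt, n_tc, n_cc)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


in_hwe = hwe_chi2(4, 32, 64)      # observed equals expected, statistic near 0
out_of_hwe = hwe_chi2(10, 20, 70)  # about 14.06, well above 3.84
print(in_hwe < CHI2_CRIT_DF1, out_of_hwe > CHI2_CRIT_DF1)  # True True
```

With very small expected counts (rare alleles) an exact test is usually preferred over this χ² approximation.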
5 · Linkage Disequilibrium · why GWAS works
Two loci are in LD when their alleles are statistically correlated in the population, typically because nearby loci are co-inherited. A marker SNP associated with disease can therefore flag a nearby causal variant.
LD measures: D = P(AB) − P(A)P(B)
D' = D/D_max ∈ [−1,1] · D'=1 ⇒ complete LD
r² = D² / [P(A)P(a)P(B)P(b)] ∈ [0,1]
r² is the metric that matters for tagging/power: r²=1 ⇒ marker perfectly proxies the causal SNP; r²=0.5 ⇒ need ~2× the cases to detect the same indirect signal. A haplotype = the specific alleles inherited together on one chromosome.
Trap: D' and r² answer different questions. D'=1 (no recombination) can coexist with low r² when the two SNPs have different allele frequencies — for tagging/power it is r², not D', that counts.
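The D' versus r² trap can be made concrete with a small Python sketch. The haplotype and allele frequencies below are hypothetical numbers chosen for illustration (allele B occurs only on A-bearing chromosomes), and `ld_measures` is an illustrative name.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) from haplotype frequency P(AB) and allele freqs.

    Assumes 0 < p_a < 1 and 0 < p_b < 1.
    """
    d = p_ab - p_a * p_b
    # D_max is the largest |D| possible given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0
    r2 = d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Hypothetical: P(A) = 0.5, P(B) = 0.1, and every B sits on an A chromosome.
d, d_prime, r2 = ld_measures(p_ab=0.1, p_a=0.5, p_b=0.1)
print(round(d, 3), round(d_prime, 3), round(r2, 3))  # 0.05 1.0 0.111
```

Here D' = 1 but r² ≈ 0.11, so the marker would be a poor tag: roughly 1/r² ≈ 9 times the sample would be needed to detect the same indirect signal.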
6 · Familial Aggregation · Module 1
Families share genes + environment + can be followed over time. Stronger aggregation in genetically closer relatives ⇒ evidence for (not proof of) inherited aetiology — because closer relatives also share more environment.
| Degree | Relatives | Genes shared |
|---|---|---|
| 1st | parents, sibs, children | ½ |
| 2nd | grandparents, aunts, half-sibs | ¼ |
| 3rd | first cousins | ⅛ |
Design → measure → bias
| Design | Measure | Watch |
|---|---|---|
| Case-control | OR | recall, selection |
| Retro cohort | RR, SMR | recall, selection |
| Prospective fam. | RR/HR | slow; no recall bias |
| Twin | heritability | not pop-repr. |
| Adoption | genes vs env | rare, hard |
| Migrant | rate compare | healthy-migrant |
7 · Aggregation Measures · plug numbers
From the 2×2 (proband case/control × relative affected/unaffected), cells a,b,c,d:
Effect estimates: OR = (a·d)/(b·c)
RR = [a/(a+b)] / [c/(c+d)]
SMR = Observed / Expected
λ_R = risk in type-R relative / prevalence K
FRR = RR given affected 1st-degree relative
SMR worked: mothers of cases O=45, E (population rates × person-time) =17.7 ⇒ SMR ≈ 2.5. λ_R >1 and declining with relatedness ⇒ genetic; the rate of decline hints polygenic vs single-gene. OR ≈ RR only when disease is rare.
OR worked: any affected sister 13/462 in cases vs 1/405 in controls ⇒ OR = (13·404)/(449·1) ≈ 11.7 (95% CI 1.7–98.2). The very wide CI (only one exposed control) ⇒ imprecise — report the CI, not just the point estimate, and beware the small-cell instability.
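The point estimates in the two worked examples can be reproduced with the sketch below. Function names are illustrative; the cell layout assumed is a = case-proband relatives affected, b = case-proband relatives unaffected, c and d the same for control probands. Confidence intervals are deliberately left out because the method behind the quoted interval is not stated.

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio from a 2x2 table."""
    return (a * d) / (b * c)


def risk_ratio(a, b, c, d):
    """Risk in the first row divided by risk in the second row."""
    return (a / (a + b)) / (c / (c + d))


def smr(observed, expected):
    """Standardised morbidity/mortality ratio."""
    return observed / expected


# Affected sister: 13 of 462 cases vs 1 of 405 controls (from the example)
or_sister = odds_ratio(a=13, b=462 - 13, c=1, d=405 - 1)
rr_sister = risk_ratio(a=13, b=462 - 13, c=1, d=405 - 1)
smr_mothers = smr(observed=45, expected=17.7)
print(round(or_sister, 1), round(rr_sister, 1), round(smr_mothers, 1))  # 11.7 11.4 2.5
```

OR and RR are close here because an affected sister is rare in both groups.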
8 · Migrant & FH Quality · interpret
- Migrant rate stays like source ⇒ genetics (or similar env)
- Shifts toward host ⇒ environment
- Migrant vs descendants differ ⇒ a critical age of exposure
Family-history misclassification: non-differential (random) ⇒ bias toward null; differential (cases recall better) ⇒ bias away from null, inflating OR/RR. Fix with standardised questionnaires, multiple informants, validation against registries/pathology/death records, trained interviewers.
8b · Family Designs · M1 extras
Case-control-family / case-family: relatives directly interviewed ⇒ OR / RR / SMR; relatives of controls are hard to recruit, and the case-family design needs a population registry.
Outcome can be analysed as dichotomous (affected y/n), ordinal (number affected) or multinomial — match the analysis to how family history was coded.
9 · Heritability · Module 2
= proportion of phenotypic variance due to genetic variance. A property of a population in an environment, not an individual. Variance = SD² (e.g. height SD 9.29 ⇒ variance ≈ 86).
Variance partition: Vp = Vg + Ve
Vg = Va + Vd (+ Vi)
Broad-sense H² = Vg/Vp
Narrow-sense h² = Va/Vp (h² ≤ H²)
Narrow-sense (additive Va) predicts relative resemblance & response to selection; Vd = dominance, Vi = epistatic/interaction variance. Estimate variance separately by sex & zygosity (M>F; DZ>MZ spread).
10 · Twin Studies · the engine
MZ share ~100% genes; DZ ~50% (like full sibs). Both share rearing env → comparing them isolates genetics; twins control for age & shared env.
Binary: concordance = proportion of pairs both affected; conc_MZ > conc_DZ ⇒ genetic. Continuous: correlate twin-1 vs twin-2.
Falconer's heritability: h² = 2 (r_MZ − r_DZ) (continuous)
h² = 2 (conc_MZ − conc_DZ) (binary)
Worked: female height r_MZ=0.78, r_DZ=0.46 ⇒ h² = 2(0.78−0.46) = 0.64 — 64% of variance in female height is additively genetic. Interpret: "consistent with, but not proof of, an inherited genetic aetiology."
Genetic variance from heritability: Vg = h² × Vp. With Vp≈86 and h²=0.64 ⇒ Vg≈55. Opposite-sex DZ pairs & the twins-reared-apart (TRA) design extend the model to probe sex effects and shared environment.
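Falconer's estimate and the step from h² to genetic variance can be sketched as below. The function names are illustrative, and Vp = 86 is the rounded height variance used in the worked example.

```python
def falconer_h2(r_mz, r_dz):
    """Falconer's estimate: twice the MZ-DZ difference in correlation."""
    return 2 * (r_mz - r_dz)


def genetic_variance(h2, v_p):
    """Variance attributable to genes: heritability times phenotypic variance."""
    return h2 * v_p


h2_height = falconer_h2(r_mz=0.78, r_dz=0.46)      # about 0.64
vg_height = genetic_variance(h2_height, v_p=86.0)  # about 55
print(round(h2_height, 2), round(vg_height))  # 0.64 55
```

The same `falconer_h2` call works with concordances in place of correlations for a binary trait, with the caveats noted for liability-scale interpretation.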
11 · ACE Model · variance components
Split Vp into A additive genetic, C common/shared env, E unique env + error. From twin correlations:
ACE from r_MZ, r_DZ: r_MZ = A + C · r_DZ = ½A + C
A = 2(r_MZ − r_DZ) (= Falconer)
C = 2·r_DZ − r_MZ · E = 1 − r_MZ
Worked: r_MZ=0.78, r_DZ=0.46 ⇒ A=2(0.78−0.46)=0.64; C=2(0.46)−0.78=0.14; E=1−0.78=0.22. Check: A+C+E = 0.64+0.14+0.22 = 1.00 ✓.
So C is the part of resemblance shared equally by both twin types; E (incl. measurement error) is the only thing that makes MZ co-twins differ. Trap — equal-environments assumption: if MZ pairs are treated more alike than DZ, shared env masquerades as genes ⇒ h² overestimated.
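The ACE decomposition follows directly from solving the two correlation equations; a minimal Python sketch (illustrative function name, standardised components) is:

```python
def ace_components(r_mz, r_dz):
    """Solve r_MZ = A + C and r_DZ = A/2 + C for standardised A, C, E."""
    a = 2 * (r_mz - r_dz)   # additive genetic
    c = 2 * r_dz - r_mz     # common (shared) environment
    e = 1 - r_mz            # unique environment plus measurement error
    return a, c, e


a, c, e = ace_components(r_mz=0.78, r_dz=0.46)
print(round(a, 2), round(c, 2), round(e, 2))  # 0.64 0.14 0.22
```

A + C + E equals 1 by construction, so the sum is a check on arithmetic, not on whether the model fits.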
11b · Classic Twin Model · 5 assumptions
- MZ share A=1.0, DZ share A=0.5 (like full sibs)
- MZ & DZ share C equally (equal-environments)
- Random mating (no assortative mating inflating r_DZ)
- No gene–environment interaction/correlation
- Trait measured the same way in both twin types
Break any assumption ⇒ biased h². Concordance/correlation are estimated separately by sex & zygosity because variance differs.
Binary worked: conc_MZ=0.40, conc_DZ=0.15 ⇒ h² ≈ 2(0.40−0.15) = 0.50 (a rough estimate from raw concordances). MZ>DZ concordance is the signal; near-equality (conc_MZ≈conc_DZ) ⇒ shared environment, not genes, drives the resemblance.
12 · Liability-Threshold · binary traits
Assume an unobserved continuous liability (genes+env), ~Normal; disease occurs above a threshold set by prevalence. Puts yes/no disease onto a continuous scale so variance/heritability methods apply.
liability ~Normal · disease = tail beyond threshold
Tail area = prevalence. Relatives of cases sit at a right-shifted liability distribution ⇒ larger tail ⇒ higher risk, the model's link from heritability to a yes/no trait. Trap: heritability of liability ≠ heritability "of the disease," and is very sensitive to the assumed prevalence (which sets where the threshold T sits).
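The threshold picture can be sketched numerically with Python's standard library. The prevalences and the 0.5 SD mean shift below are hypothetical illustration values, and the relatives' distribution is simplified to a unit-variance normal with a shifted mean.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def liability_threshold(prevalence):
    """Threshold T on a standard-normal liability scale with tail area = prevalence."""
    return STANDARD_NORMAL.inv_cdf(1.0 - prevalence)


def risk_given_mean_shift(prevalence, mean_shift):
    """Tail area beyond T for a group whose mean liability is shifted.

    Simplifying assumption: the shifted group keeps unit variance.
    """
    t = liability_threshold(prevalence)
    return 1.0 - NormalDist(mu=mean_shift, sigma=1.0).cdf(t)


t_rare = liability_threshold(0.01)     # about 2.33 SD
t_common = liability_threshold(0.10)   # about 1.28 SD
risk_relatives = risk_given_mean_shift(0.01, mean_shift=0.5)  # about 0.034
print(round(t_rare, 2), round(t_common, 2), round(risk_relatives, 3))
```

Moving the assumed prevalence from 1% to 10% moves T by about one standard deviation, which is why liability-scale estimates depend so heavily on that assumption.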
13 · Heritability Cautions · assignment gold
High h² does NOT mean: (a) the trait is unmodifiable; (b) genes cause between-group/between-population differences; or (c) anything about an individual. It is a population- & environment-specific quantity.
Missing heritability: GWAS-discovered SNPs explain far less variance than the twin-study h². Candidate causes: private (family-specific) mutations, rare moderate-risk variants, additional undiscovered common SNPs, gene–gene interactions, and non-genetic factors correlated within relatives.
So twin-estimated h² and GWAS-explained variance are different quantities — don't expect the discovered SNPs to "add up" to the twin h². High h² ≠ "untreatable": environment can still shift the whole distribution (height is highly heritable yet population mean rose with nutrition).
14 · Genetic Association · Module 3
= a case-control study where the exposure is a genetic marker (a SNP). Association arises if the SNP causes disease, is in LD with a causal variant, or is confounded by ancestry (stratification).
Candidate-gene = a few pre-specified, biologically-motivated SNPs; GWAS = hundreds of thousands–millions of SNPs, scanned agnostically across the whole genome. The marker is the exposure; cases vs controls are compared on marker frequency, reported as an OR + 95% CI per SNP.
An association is useful for prediction even if non-causal. Three reasons a SNP associates with disease:
- the SNP causes disease (directly functional)
- it is in LD with a nearby causal variant (still useful for prediction)
- artefact of confounding by ancestry (stratification)
Only the first two replicate in an independent sample — the third is what replication + PC-adjustment are designed to kill.
A genetic/polygenic risk score sums many such SNPs and is ~Normal in the population, sliding people along a continuous risk axis rather than a single yes/no genotype — the basis for risk stratification in M8.
15 · Association Tests · χ² / logistic
| Test | Table | df |
|---|---|---|
| Allelic | 2×2 allele×status | 1 |
| Genotypic | 2×3 genotype×status | 2 |
| Dominant | AA+Aa vs aa | 1 |
| Recessive | AA vs Aa+aa | 1 |
| Additive | per-allele 0/1/2 | 1 |
Chi-square & logistic OR: χ² = Σ(O−E)²/E → large χ² → small p
logit P(D) = β₀ + β₁·genotype + covariates
OR = e^β₁ · OR = (a·d)/(b·c)
Per-allele coding (0,1,2) ⇒ OR per extra risk allele; adjust for ancestry principal components, age, sex. State the mode of inheritance up front; testing several models multiplies the tests and so needs a stricter threshold.
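As an illustration of the allelic (2×2, df = 1) test in the table above, the sketch below converts genotype counts to allele counts and computes the OR, Pearson χ² and p-value. The genotype counts are hypothetical, the function names are illustrative, and covariate adjustment (ancestry PCs, age, sex) would need logistic regression in a statistics package instead.

```python
import math


def allele_counts(n_rr, n_rn, n_nn):
    """(risk alleles, non-risk alleles) from genotype counts RR, Rn, nn."""
    return 2 * n_rr + n_rn, 2 * n_nn + n_rn


def allelic_test(case_genotypes, control_genotypes):
    """Allelic odds ratio, Pearson chi-square (df = 1) and its p-value."""
    a, b = allele_counts(*case_genotypes)      # cases: risk, non-risk
    c, d = allele_counts(*control_genotypes)   # controls: risk, non-risk
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    odds_ratio = (a * d) / (b * c)
    # For df = 1 the upper-tail chi-square probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return odds_ratio, chi2, p_value


# Hypothetical counts (RR, Rn, nn): 100 cases and 100 controls
or_allelic, chi2_allelic, p_allelic = allelic_test((30, 50, 20), (20, 50, 30))
print(round(or_allelic, 2), round(chi2_allelic, 2), round(p_allelic, 4))
```

With these made-up counts the result clears α = 0.05 for a single test but would be nowhere near a multiple-testing threshold.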
16 · Multiple Testing · the GWAS problem
Testing millions of SNPs hugely inflates the type-1 error / false-positive rate; at α=0.05, 1 in 20 truly-null SNPs looks "significant" by chance alone.
Thresholds: Bonferroni α = 0.05 / (# tests)
genome-wide significance = 5×10⁻⁸
5×10⁻⁸ ≈ 0.05 / 10⁶ independent common-variant tests; hits must replicate independently. Worked: a candidate study of 50 SNPs ⇒ Bonferroni α = 0.05/50 = 0.001 — a SNP at p=0.01 is not significant after correction. Trap: Bonferroni is conservative (LD makes tests correlated) but 5×10⁻⁸ is the field standard — use it for GWAS.
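The Bonferroni arithmetic above can be wrapped in two small helpers; names and defaults are illustrative.

```python
def bonferroni_alpha(n_tests, family_alpha=0.05):
    """Per-test significance threshold after Bonferroni correction."""
    return family_alpha / n_tests


def is_significant(p_value, n_tests, family_alpha=0.05):
    """True if p_value beats the Bonferroni-corrected threshold."""
    return p_value < bonferroni_alpha(n_tests, family_alpha)


candidate_alpha = bonferroni_alpha(50)        # 0.001
gwas_alpha = bonferroni_alpha(1_000_000)      # about 5e-8
print(candidate_alpha, is_significant(0.01, 50))  # 0.001 False
```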
17 · Manhattan & QQ · read the plot
Manhattan: x = genomic position, y = −log₁₀(p). Peaks crossing −log₁₀(5×10⁻⁸) ≈ 7.3 = associated loci.
QQ plot: observed vs expected −log₁₀(p) under the null. On the diagonal = no inflation; an early, whole-line upward lift = stratification / cryptic relatedness / artefact (genomic inflation λ_GC; λ≈1 is good); a departure only in the extreme tail = genuine signal.
Trap: don't read a single Manhattan peak as "the causal gene" — the top SNP is usually the best tag in LD with the true causal variant, so fine-mapping is needed to localise the cause.
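Both reference values for reading these plots can be computed directly. The sketch below is a minimal illustration with assumed function names: it gives the height of the genome-wide significance line and estimates λ_GC as the median observed df = 1 χ² statistic divided by its null median (about 0.455). The "null" p-values are an evenly spaced synthetic grid, not real data.

```python
import math
from statistics import NormalDist, median

MEDIAN_CHI2_DF1 = 0.4549364  # median of a chi-square distribution with 1 df


def manhattan_threshold(p=5e-8):
    """Height of the genome-wide significance line on a -log10(p) axis."""
    return -math.log10(p)


def lambda_gc(p_values):
    """Genomic inflation factor from two-sided p-values in (0, 1]."""
    z = NormalDist()
    # A two-sided p-value maps back to a df = 1 chi-square via z^2.
    chi2 = [z.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return median(chi2) / MEDIAN_CHI2_DF1


# Synthetic, perfectly uniform p-values stand in for a well-behaved null scan.
null_p = [(i - 0.5) / 999 for i in range(1, 1000)]
print(round(manhattan_threshold(), 2), round(lambda_gc(null_p), 3))  # 7.3 1.0
```

Because λ_GC is based on the median, a handful of genuine hits in the extreme tail barely moves it, whereas stratification lifts the whole distribution.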
18 · Pop. Stratification · key confounder
Cases & controls differ in ancestry; both allele freqs & disease rates vary by ancestry ⇒ spurious association (confounding). Fixes: match on ancestry, adjust for principal components, genomic control (λ_GC), or family-based designs; HWE deviation in controls helps flag it.
This is why a hit must replicate in an independent sample and why GWAS report λ_GC: a clean QQ plot (λ≈1) is the reassurance that genuine signal, not stratification, is driving the Manhattan peaks. λ_GC clearly above 1 ⇒ correct for the inflation (e.g. PC adjustment or genomic control) before trusting any hit.
Formula Belt · side 1
p=[2n(AA)+n(Aa)]/2N · carrier=p²+2pq
HWE p²+2pq+q²=1 · χ²=Σ(O−E)²/E, df=1
r²=D²/[P(A)P(a)P(B)P(b)] · OR=ad/bc
h²=2(r_MZ−r_DZ) · A=2(r_MZ−r_DZ)
SMR=O/E · λ_R=risk in type-R relative/K · GWS 5×10⁻⁸