POPH90111 · Genetic Epidemiology
Genetic Epidemiology Foundations
Genetic epidemiology asks one question first: do genes contribute to who gets a disease? Before you can measure that, you need the working vocabulary — what an allele, genotype and haplotype are, how Hardy–Weinberg equilibrium turns an allele frequency into genotype and carrier frequencies, and why a non-causal marker can stand in for a cause through linkage disequilibrium. Then comes the first real method: familial aggregation — the question of whether disease clusters in families more than chance allows, read off a pedigree and quantified from a 2×2 table as an odds ratio, relative risk, standardised morbidity ratio or recurrence-risk ratio λR. This chapter is the foundation (Extra-Module 1) plus Module 1, and it sets up the mantra you write at every later stage: familial aggregation is evidence for, but not proof of, an inherited genetic aetiology — because relatives also share an environment.
What this chapter covers
- 1.1 Genes vs environment: why families resemble each other
- Degrees of relatedness (1st ½, 2nd ¼, 3rd ⅛)
- 1.2 The vocabulary: alleles, genotypes, haplotypes; germline vs somatic
- Modes of inheritance defined on risk (dominant / recessive / codominant)
- 1.3 Hardy–Weinberg equilibrium: p² + 2pq + q² = 1, carrier frequency, HWE as QC
- 1.4 Linkage disequilibrium: D, D′ and r², the tagging metric
- 1.5 Reading a pedigree
- 1.6 Familial-aggregation study designs and their signature biases
- 1.7 Measuring aggregation: the 2×2 → OR, RR, SMR, λR
Worked example: an odds ratio from a familial-aggregation 2×2
- +1 · Lay out the 2×2. Rows = proband status (case / control); columns = sister affected (yes / no). a = 13 (case, sister affected), b = 449 (case, sister not affected), c = 1 (control, sister affected), d = 404 (control, sister not affected).
- +2 · Apply OR = ad/bc. OR = (13 × 404) / (449 × 1) = 5252 / 449 ≈ 11.7 (the course’s reported 95% CI is roughly 1.7–98.2).
- +1 · Interpret. OR > 1 with a confidence interval excluding 1 → positive familial aggregation: the odds of having an affected sister are about 12 times higher for case probands than for control probands.
- +1 · Appraise. This is “evidence for, but not proof of, a genetic aetiology”: shared environment is the rival explanation, and a case-control design is open to recall bias (cases over-report affected relatives) and selection bias.
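The arithmetic in the worked example can be checked with a few lines of Python. This is a minimal sketch, not the course's own method: the function name `odds_ratio_with_ci` is made up for illustration, and the interval uses the Woolf (log-scale) approximation, which is an assumption here. With a cell count of 1 that approximation is rough, so it should not be expected to reproduce the course's reported interval exactly.

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.959964):
    """Return OR = ad/bc and a Woolf (log-scale) confidence interval.

    a, b = case probands with / without an affected sister
    c, d = control probands with / without an affected sister
    z    = normal quantile (1.96 for a 95% interval)
    """
    or_hat = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation (an assumption).
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log)
    upper = math.exp(math.log(or_hat) + z * se_log)
    return or_hat, lower, upper

# Cell counts taken from the worked example.
or_hat, lower, upper = odds_ratio_with_ci(a=13, b=449, c=1, d=404)
print(f"OR = {or_hat:.1f}, approximate 95% CI {lower:.1f} to {upper:.1f}")
```

The point estimate matches the hand calculation (about 11.7). The approximate interval is wide and still excludes 1, which is the feature the interpretation step relies on.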
Key terms
- Familial aggregation
- The clustering of a disease in families more than chance would allow, measured by an OR, RR, SMR or recurrence-risk ratio λR. It is evidence for — never proof of — a genetic role, because relatives also share an environment, the rival explanation.
- Hardy–Weinberg equilibrium (HWE)
- The relationship that turns allele frequencies into genotype frequencies: for alleles with frequencies p and q (p + q = 1), the genotypes split as p², 2pq and q², so p² + 2pq + q² = 1. The carrier frequency for the allele with frequency p is p² + 2pq = 1 − q². A deviation from these proportions in controls flags genotyping error or population structure, so HWE doubles as a quality-control check.
- Linkage disequilibrium (LD)
- The non-random association between alleles at nearby loci, which arises because neighbouring variants tend to be co-inherited. Measured by D, the scaled D′, and r²; r² is the tagging metric: r² = 1 means a genotyped marker perfectly proxies an unmeasured causal variant, and you need about 1/r² times the sample size to detect the association through the marker with the same power as genotyping the causal variant directly.
- Germline vs somatic variant
- A germline variant is inherited and present in every cell (sample blood or a buccal swab); a somatic variant is acquired and present only in descendant cells, such as a tumour clone (sample the tumour biopsy). Inherited family risk is a germline question; why one tumour behaves differently is somatic.
- Standardised morbidity ratio (SMR)
- Observed cases divided by the number expected if the relatives experienced population age- and sex-specific rates: SMR = O / E, where E is the sum over age-sex strata of (population rate × the relatives’ person-time in that stratum). An SMR above 1 means relatives of cases are at raised risk; it is a cohort-design measure.
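The SMR definition above can be sketched in Python as follows. All numbers are hypothetical and chosen only to make the arithmetic easy to follow; the stratum labels, rates, person-years and the function names `expected_cases` and `smr` are illustrative assumptions, not course data.

```python
def expected_cases(strata):
    """Expected cases E: sum of (population rate x relatives' person-years)
    over age/sex strata."""
    return sum(rate * person_years for rate, person_years in strata)

def smr(observed, strata):
    """Standardised morbidity ratio: observed cases O divided by expected E."""
    return observed / expected_cases(strata)

# Hypothetical strata: (population rate per person-year, relatives' person-years)
strata = [
    (0.001, 2000.0),  # e.g. women aged 40-49
    (0.002, 1500.0),  # e.g. women aged 50-59
    (0.004, 1000.0),  # e.g. women aged 60-69
]
observed = 18  # hypothetical number of cases seen among the relatives

expected = expected_cases(strata)
ratio = smr(observed, strata)
print(f"E = {expected:.1f}, SMR = {ratio:.2f}")
```

With these made-up inputs the relatives would have twice the number of cases expected from population rates. As with any aggregation measure, that would be evidence for, but not proof of, a genetic aetiology.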
Genetic Epidemiology Foundations FAQ
Why does familial aggregation not prove a disease is genetic?
Because relatives share two things that are hopelessly tangled: their DNA and their environment. Families eat, live and are exposed alike, so a disease that clusters in families could be inherited or simply the product of a shared environment. Aggregation that is stronger in closer relatives and declines with relatedness points toward genes, but shared environment is always the rival explanation — which is why the course wants you to write ‘evidence for, but not proof of, a genetic aetiology’ every time.
What is the difference between allele frequency and carrier frequency?
Allele frequency is counted per chromosome (denominator 2N): it tells you how common a variant is in the gene pool. Carrier frequency is counted per person (denominator N): it is the proportion of people carrying at least one copy. Under Hardy–Weinberg, with risk-allele frequency p and q = 1 − p, the carrier frequency is p² + 2pq = 1 − q². Confusing the two and testing HWE in cases rather than controls are the classic foundations-chapter slips.
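The two denominators are easy to see in code. The sketch below uses hypothetical genotype counts (chosen so that they sit exactly in Hardy–Weinberg proportions), and the function names are illustrative assumptions.

```python
def allele_frequency(n_AA, n_Aa, n_aa):
    """Frequency p of allele A, counted per chromosome (denominator 2N)."""
    n_people = n_AA + n_Aa + n_aa
    return (2 * n_AA + n_Aa) / (2 * n_people)

def hwe_genotype_frequencies(p):
    """Expected genotype frequencies (AA, Aa, aa) under HWE: p^2, 2pq, q^2."""
    q = 1 - p
    return p * p, 2 * p * q, q * q

def carrier_frequency(p):
    """Proportion of people with at least one A under HWE: p^2 + 2pq = 1 - q^2."""
    q = 1 - p
    return 1 - q * q

# Hypothetical genotype counts in 1000 people, where A is the risk allele.
n_AA, n_Aa, n_aa = 10, 180, 810

p = allele_frequency(n_AA, n_Aa, n_aa)              # per chromosome
carriers_hwe = carrier_frequency(p)                 # per person, from HWE
carriers_observed = (n_AA + n_Aa) / (n_AA + n_Aa + n_aa)  # per person, counted

print(f"allele frequency p = {p:.2f}")
print(f"carrier frequency (HWE) = {carriers_hwe:.2f}")
print(f"carrier frequency (observed) = {carriers_observed:.2f}")
```

Here the allele frequency is 0.10 while the carrier frequency is 0.19, nearly double, because most carriers are heterozygotes who contribute only one copy each.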
When can a SNP that isn't causal still be useful?
When it is in linkage disequilibrium with the true causal variant. Most associated SNPs are not themselves causal; they are correlated with a nearby cause because nearby loci are co-inherited. The r² between marker and cause measures how well the marker ‘tags’ it — at r² = 1 the marker is a perfect proxy. This is the single principle that makes marker-based gene discovery (and GWAS) work.
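To make the tagging idea concrete, here is a minimal sketch that computes D, D′ and r² for two biallelic loci from a haplotype frequency and the two allele frequencies. The input frequencies are hypothetical and the function name `ld_metrics` is an assumption. D′ is returned as an absolute value, which is one common convention.

```python
def ld_metrics(p_A, p_B, p_AB):
    """Return (D, |D'|, r^2) for alleles A and B at two loci.

    p_A, p_B = allele frequencies; p_AB = frequency of the A-B haplotype.
    """
    d = p_AB - p_A * p_B
    # D' rescales D by the largest value it could take given the allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, d_prime, r2

# Hypothetical case: the rarer allele B is only ever found on A-bearing haplotypes.
d, d_prime, r2 = ld_metrics(p_A=0.3, p_B=0.2, p_AB=0.2)
inflation = 1 / r2  # approximate sample-size multiplier for indirect detection

print(f"D = {d:.2f}, D' = {d_prime:.2f}, r2 = {r2:.2f}")
print(f"sample-size multiplier ~ {inflation:.2f}")
```

In this made-up case D′ equals 1 but r² is only about 0.58, because the two alleles differ in frequency. The marker is therefore an imperfect proxy, and roughly 1.7 times the sample would be needed to detect the association through it.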
Which familial-aggregation design avoids recall bias?
The prospective family cohort: you recruit relatives and follow them forward, so disease is recorded as it happens rather than remembered. Case-control and retrospective cohort designs ask people to recall affected relatives, and cases tend to over-report — differential recall biases the estimate away from the null and inflates the OR/RR. The price of the prospective design is that it is slow, costly and can introduce a screening effect.
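A small numeric illustration shows how differential recall pushes the estimate away from the null. Everything here is hypothetical: the table has no true aggregation (OR = 1), and the sketch then assumes that 18 case probands without an affected relative wrongly report one while controls report accurately. The function names are illustrative.

```python
def odds_ratio(a, b, c, d):
    """OR = ad/bc for a 2x2 table (a, b = cases; c, d = controls)."""
    return (a * d) / (b * c)

def with_case_over_reporting(a, b, c, d, extra_case_reports):
    """Shift cases from 'no affected relative' to 'affected relative'
    to mimic over-reporting by cases only (differential recall)."""
    return a + extra_case_reports, b - extra_case_reports, c, d

# Hypothetical truth: 50 of 500 cases and 50 of 500 controls have an affected relative.
true_table = (50, 450, 50, 450)
reported_table = with_case_over_reporting(*true_table, extra_case_reports=18)

true_or = odds_ratio(*true_table)
biased_or = odds_ratio(*reported_table)
print(f"true OR = {true_or:.2f}, reported OR = {biased_or:.2f}")
```

Under these assumptions the reported OR rises to about 1.4 even though the true OR is 1, which is the inflation a prospective design avoids by recording disease as it occurs.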
Exam move
Lock down the three foundational calculations until they are automatic: the Hardy–Weinberg split (allele frequency from genotype counts, then carrier frequency 1 − q²), the LD reasoning (quote r², not D′, when the question is about tagging or power), and above all the 2×2 → OR = ad/bc that runs through the whole subject. Then pair every familial-aggregation design with its signature bias — recall and selection bias in case-control and retrospective cohort, the screening effect in prospective cohorts — and learn the migrant-study read-off (rate stays like the source → genes/shared environment; shifts toward the host → environment). Close every aggregation interpretation with the mantra, and never quote an RR off a case-control table.