DATA1001 · Foundations Of Data Science
Study Design
Before a single number is computed, one question fixes what you are allowed to conclude: how were the data produced? The same association — coffee drinkers live longer, helmet-wearers crash less — supports a causal claim from a randomised experiment but only an associational claim from an observational study. This is the single most heavily examined critique in DATA1001, and it costs nothing to get right except discipline. The chapter runs four moves: name each variable's type (it dictates the legal summary and plot), name the study design (it dictates whether you may say causes), spot the confounder (the lurking third variable that fakes or even reverses a link), and name the bias — selection, non-response or measurement — that a big sample will not fix. Get the design right and the rest of the course is reading the answer correctly.
What this chapter covers
- 01 Data types: quantitative (discrete / continuous) vs categorical (nominal / ordinal)
- 02 Observational vs randomised, and why the design drives the conclusion
- 03 Confounding and Simpson's paradox: when a third variable lies
- 04 Bias: selection, non-response, measurement (size cures variance, not bias)
- 05 Population vs sample · parameter vs statistic · simple random sampling
Worked example: read the design, name the limit
- +2 (a) Design: observational, with self-selected respondents. It is not an experiment, so no causal claim is at stake; the intended inference is from the respondents to all "customers".
- +1 (b) Selection bias: only visitors who chose to answer are counted; people who feel strongly (either way) over-respond.
- +1 (b) Non-response bias: non-visitors and non-responders are excluded entirely and may differ systematically.
- +1 (c) No: a larger website sample would not fix this. Bias is a systematic offset that more data cannot remove; what is needed is a probability sample drawn from the customer list.
Key terms
- Observational study
- A study where the investigator only observes and the units chose their own exposure. Confounding is uncontrolled, so the only legal verb is "is associated with" — never "causes".
- Randomised controlled experiment
- A study where the investigator assigns the treatment at random. Randomisation balances every confounder, known and unknown, across the groups on average, so a difference in outcome larger than chance can explain is attributed to the treatment.
- Confounder
- A variable linked to both the exposure and the outcome, by a route other than through the exposure. It makes the raw association untrustworthy; in its sharpest form (Simpson's paradox) it can reverse the trend when subgroups are pooled.
- Bias
- A systematic push away from the truth that more data does not remove — a big biased sample is still biased. The three types to name on sight are selection, non-response and measurement bias.
- Simple random sample (SRS)
- A sample in which every unit, and every set of n units, is equally likely to be chosen. Only probability sampling supports valid inference; convenience and quota samples do not.
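To make the SRS definition concrete, here is a minimal sketch in Python. The customer IDs, the frame size of 1000 and the helper name `simple_random_sample` are all hypothetical; the point is only that `random.sample` draws without replacement so that every set of n units is equally likely.

```python
import random

def simple_random_sample(units, n, seed=None):
    """Draw n distinct units; every subset of size n is equally likely."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    return rng.sample(list(units), n)

# Hypothetical sampling frame: customer IDs 1..1000
frame = range(1, 1001)
sample = simple_random_sample(frame, 50, seed=1)

print(len(sample), "units drawn,", len(set(sample)), "distinct")
```

Contrast this with a convenience sample such as "the first 50 customers who answered": there the units pick themselves, so no probability can be attached to who ends up in the sample.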
Study Design FAQ
When can I say one thing "causes" another in DATA1001?
Only when the data come from a randomised controlled experiment. Randomisation is the one device that balances confounders you didn't even think to measure, so a difference in outcome can be attributed to the treatment. From an observational study — no matter how large or how strong the association — the only legal verb is "is associated with". This observational-vs-randomised distinction is the most examined critique in the subject (LO9).
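A small simulation can show why randomisation earns the word "causes". The sketch below is illustrative only: the hidden confounder (age), the population size and the self-selection rule are assumptions made up for the example. It compares how unbalanced age is between treated and control groups under self-selection versus a fair coin flip.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical hidden confounder: the age of 10,000 units
ages = [rng.uniform(20, 80) for _ in range(10_000)]

# Self-selection (assumed rule): older units are more likely to take the treatment
self_selected = [rng.random() < (age - 20) / 60 for age in ages]

# Randomisation: a fair coin flip that ignores age entirely
randomised = [rng.random() < 0.5 for _ in ages]

def age_gap(treated_flags):
    """Mean age of the treated group minus mean age of the control group."""
    treated = [a for a, t in zip(ages, treated_flags) if t]
    control = [a for a, t in zip(ages, treated_flags) if not t]
    return statistics.mean(treated) - statistics.mean(control)

gap_self = age_gap(self_selected)
gap_rand = age_gap(randomised)

print(f"age gap under self-selection: {gap_self:.1f} years")
print(f"age gap under randomisation:  {gap_rand:.1f} years")
```

Under self-selection the treated group is much older, so any outcome difference could be an age effect. Under randomisation the groups have nearly the same mean age, even though age was never used in the assignment.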
What exactly is a confounder, and how do I confirm one?
A confounder is a variable Z linked to both the exposure X and the outcome Y. Confirm it in two steps: is Z plausibly linked to X, and is Z also plausibly linked to Y by a route other than through X? If both hold, Z is a confounder and the raw X-Y association is contaminated. You fix it by design (randomise) or by analysis (stratify or adjust for Z, then read the trend within levels of Z).
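The "stratify, then read the trend within levels of Z" step can be sketched with a tiny table. All counts below are hypothetical, chosen so that severity (Z) is linked both to who gets treated (X) and to recovery (Y); the sketch compares the pooled comparison with the comparison inside each severity level.

```python
# (recovered, total) for each (severity, group); counts are hypothetical
counts = {
    ("mild", "treated"): (90, 100),
    ("mild", "untreated"): (800, 1000),
    ("severe", "treated"): (300, 1000),
    ("severe", "untreated"): (20, 100),
}

def rate(recovered, total):
    """Recovery rate as a proportion."""
    return recovered / total

def pooled_rate(group):
    """Recovery rate for a group after collapsing over severity."""
    recovered = sum(r for (sev, g), (r, t) in counts.items() if g == group)
    total = sum(t for (sev, g), (r, t) in counts.items() if g == group)
    return rate(recovered, total)

# Treated minus untreated recovery rate within each level of Z
within = {
    sev: rate(*counts[(sev, "treated")]) - rate(*counts[(sev, "untreated")])
    for sev in ("mild", "severe")
}

# The same comparison with Z ignored
pooled = pooled_rate("treated") - pooled_rate("untreated")

for sev, diff in within.items():
    print(f"{sev:>6}: treated minus untreated = {diff:+.3f}")
print(f"pooled: treated minus untreated = {pooled:+.3f}")
```

Within each severity level the treated group does better, yet the pooled comparison points the other way, because most treated units are severe cases. That reversal is Simpson's paradox, and it is why the within-stratum trend is the one to report.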
Does a huge sample make a survey trustworthy?
No. This is the most examined trap. Sample size cures variance (chance error), not bias. The 1936 Literary Digest poll collected millions of responses, drawn largely from car and telephone owners (who skewed wealthy), and confidently called the election for the wrong candidate. If a question offers "the sample was large" as reassurance, that is usually the bait: ask how the sample was drawn, not how big it is.
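The claim that size cures variance but not bias can be checked with a short simulation. This is a sketch under made-up assumptions: a true support of 40% in the population and a skewed sampling frame in which support is 55%. Neither figure comes from any real poll.

```python
import random
import statistics

rng = random.Random(42)
TRUE_P = 0.40    # hypothetical true support in the whole population
BIASED_P = 0.55  # hypothetical support inside a skewed sampling frame

def poll(p, n):
    """Proportion of 'yes' among n respondents who each say yes with probability p."""
    return sum(rng.random() < p for _ in range(n)) / n

def summarise(p, n, reps=100):
    """Mean and standard deviation of the estimate over repeated polls."""
    estimates = [poll(p, n) for _ in range(reps)]
    return statistics.mean(estimates), statistics.stdev(estimates)

results = {}
for n in (100, 10_000):
    results[("fair", n)] = summarise(TRUE_P, n)
    results[("biased", n)] = summarise(BIASED_P, n)

for (frame, n), (mean, sd) in results.items():
    print(f"{frame:>6} n={n:>6}: mean estimate {mean:.3f}, spread {sd:.4f}")
```

Going from 100 to 10,000 respondents shrinks the spread of the estimate in both frames, but the biased frame keeps centring on 0.55 rather than 0.40. A bigger biased sample only gives a more precise wrong answer.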
Why does data type matter before I do any statistics?
Because type dictates the legal summary, plot and test: you cannot take a mean of a category or draw a histogram of a colour. A whole answer can be wrong from line one if you pick a histogram for a nominal variable. Watch the coded-category trap: a variable stored as 1 = control, 2 = drug A is still categorical, and averaging it ("mean treatment = 2.1") is nonsense.
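The coded-category trap is easy to reproduce. In the sketch below the ten codes are invented, and the third level (3 = drug B) is an assumption added so the codes can average to 2.1; the software will happily compute the mean, which is exactly the problem.

```python
from collections import Counter
import statistics

# Hypothetical coding; the "drug B" level is assumed for illustration
LABELS = {1: "control", 2: "drug A", 3: "drug B"}
codes = [1, 2, 3, 2, 2, 3, 1, 2, 3, 2]

# Runs without error, but the number has no meaning for a nominal variable
meaningless = statistics.mean(codes)

# The legal summary for a categorical variable: counts per category and the mode
counts = Counter(LABELS[c] for c in codes)
mode_label = counts.most_common(1)[0][0]

print("mean of codes (nonsense):", meaningless)
print("counts per treatment:", dict(counts))
print("most common treatment:", mode_label)
```

Translating the codes back to labels before summarising is a simple guard: a mean of "control" and "drug A" cannot even be attempted, so the mistake is caught at line one.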
Exam move
Build one reflex and use it on every "data study" prompt: design → conclusion → bias. First name each variable's type (it fixes the legal summary and plot). Then ask who decided who got the treatment: nature or self-selection → observational → association only; the investigator randomised → causation licensed. Then hunt for a confounder Z linked to both X and Y, and name any selection, non-response or measurement bias that a larger sample would not cure. Three or four crisp sentences earn full marks here. These are cheap marks for a disciplined student, and they are exactly the ones rushed students drop early in the exam.