University of Sydney · S1 2026 · FACULTY OF SCIENCE

DATA1001 · Foundations Of Data Science

- one subject, every graph, every model, every mark
50% final exam · hurdle · 14 Chapters · 4-page Bible
Our own words - no uploaded lecturer files
Built to mirror S1 2026 · updated this semester
Chapter 1 of 7 · DATA1001

Study Design

Before a single number is computed, one question fixes what you are allowed to conclude: how were the data produced? The same association — coffee drinkers live longer, helmet-wearers crash less — supports a causal claim from a randomised experiment but only an associational claim from an observational study. This is the single most heavily examined critique in DATA1001, and it costs nothing to get right except discipline. The chapter runs four moves: name each variable's type (it dictates the legal summary and plot), name the study design (it dictates whether you may say causes), spot the confounder (the lurking third variable that fakes or even reverses a link), and name the bias — selection, non-response or measurement — that a big sample will not fix. Get the design right and the rest of the course is reading the answer correctly.

In this chapter

What this chapter covers

  • 01 Data types: quantitative (discrete / continuous) vs categorical (nominal / ordinal)
  • 02 Observational vs randomised, and why the design drives the conclusion
  • 03 Confounding and Simpson's paradox: when a third variable lies
  • 04 Bias: selection, non-response, measurement; size cures variance, not bias
  • 05 Population vs sample · parameter vs statistic · simple random sampling
Worked example · free

Worked example: read the design, name the limit

Q [5 marks]. A company reports: "In our website survey, 72% of visitors support the new fee, so most of our customers do." The survey is observational and its respondents are self-selected. (a) State the study design and whether a causal claim is intended. (b) Name the biases that threaten the inference to "customers". (c) Would a larger website sample fix the problem?
  • +2 (a) Design: observational with self-selected respondents. It is not an experiment, so no causal claim is intended, but inference to all "customers" is intended.
  • +1 (b) Selection bias: the sample is website visitors who chose to answer, so customers who never visit the site cannot be counted, and people who feel strongly (either way) over-respond.
  • +1 (b) Non-response bias: visitors who saw the survey but did not answer are missing and may differ systematically from those who did.
  • +1 (c) No: a larger website sample would not fix this, because bias is a fixed offset that more data cannot remove. It needs a probability sample of the customer list.
The survey is observational and self-selected, so 72% describes respondents, not customers; selection and non-response bias threaten the inference, and a larger website sample would not fix it — only a probability sample of the customer list would. Size cures variance, not bias.
Sia tip — The mark-winning move on any "data study" prompt is the three-line reflex: design → conclusion → bias. Observational or randomised? Any confounder linked to both X and Y? Any selection / non-response / measurement bias the sample size won't cure?
Glossary

Key terms

Observational study
A study where the investigator only observes: the units choose their own exposure, or nature assigns it. Confounding is uncontrolled, so the only legal verb is "is associated with", never "causes".
Randomised controlled experiment
A study where the investigator assigns the treatment at random. Randomisation balances confounders, known and unknown, across the groups on average, so a difference in outcome beyond what chance would produce can be attributed to the treatment.
Confounder
A variable linked to both the exposure and the outcome, by a route other than through the exposure. It makes the raw association untrustworthy; in its sharpest form (Simpson's paradox) it can reverse the trend when subgroups are pooled.
Bias
A systematic push away from the truth that more data does not remove — a big biased sample is still biased. The three types to name on sight are selection, non-response and measurement bias.
Simple random sample (SRS)
A sample in which every unit, and every set of n units, is equally likely to be chosen. Only probability sampling supports valid inference; convenience and quota samples do not.
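The parameter vs statistic distinction and the SRS definition can be seen in a few lines of Python. This is a minimal sketch: the population of weekly study hours is invented for illustration, and the seed and sample size are arbitrary choices.

```python
import random
from statistics import mean

# Hypothetical population: weekly study hours for 1,000 students (made-up values).
rng = random.Random(1001)
population = [rng.randint(0, 30) for _ in range(1000)]

# Parameter: a fixed number describing the whole population.
parameter = mean(population)

# Simple random sample: random.sample draws n distinct units without replacement,
# so every set of n units is equally likely to be chosen.
n = 50
sample = rng.sample(population, n)

# Statistic: the same summary computed from the sample only.
statistic = mean(sample)

print(f"parameter (population mean) = {parameter:.2f}")
print(f"statistic (sample mean)     = {statistic:.2f}")
```

Re-running with a different seed changes the statistic but not the idea: the parameter is fixed for a given population, while the statistic varies from sample to sample.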
FAQ

Study Design FAQ

When can I say one thing "causes" another in DATA1001?

Only when the data come from a randomised controlled experiment. Randomisation is the one device that balances confounders you didn't even think to measure, so a difference in outcome can be attributed to the treatment. From an observational study — no matter how large or how strong the association — the only legal verb is "is associated with". This observational-vs-randomised distinction is the most examined critique in the subject (LO9).
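To see why randomisation licenses "causes", it can help to simulate a treatment that does nothing at all. The sketch below is a toy model under stated assumptions: Z is a hypothetical confounder (say, baseline health), the outcome depends on Z only, and the function name `mean_difference` is made up for this example.

```python
import random
from statistics import mean

rng = random.Random(9)

def mean_difference(randomised, n=20000):
    """Return mean outcome of treated units minus mean outcome of untreated units.

    The treatment has NO real effect in this toy model: the outcome depends only on Z.
    """
    treated, untreated = [], []
    for _ in range(n):
        z = rng.random()  # hypothetical confounder in [0, 1]
        if randomised:
            takes_treatment = rng.random() < 0.5  # coin flip, unrelated to z
        else:
            takes_treatment = rng.random() < z    # high-z units self-select into treatment
        outcome = 10 * z + rng.gauss(0, 1)        # driven by z, not by the treatment
        (treated if takes_treatment else untreated).append(outcome)
    return mean(treated) - mean(untreated)

observational_gap = mean_difference(randomised=False)
randomised_gap = mean_difference(randomised=True)

print(f"self-selected groups: gap = {observational_gap:.2f}")
print(f"randomised groups:    gap = {randomised_gap:.2f}")
```

Under self-selection the groups differ on Z, so a sizeable gap appears even though the treatment is inert. Under the coin flip, Z is balanced on average and the gap shrinks towards zero.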

What exactly is a confounder, and how do I confirm one?

A confounder is a variable Z linked to both the exposure X and the outcome Y. Confirm it in two steps: is Z plausibly linked to X, and is Z also plausibly linked to Y by a route other than through X? If both hold, Z is a confounder and the raw X-Y association is contaminated. You fix it by design (randomise) or by analysis (stratify or adjust for Z, then read the trend within levels of Z).
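Stratifying by Z can be sketched with a small table of counts. The numbers below are invented for illustration (X is a treatment, Z is severity, Y is recovery), chosen so that the pooled comparison points the opposite way to the comparison within each level of Z.

```python
# Hypothetical counts (invented for illustration): (recovered, total),
# keyed by (treatment X, severity Z).
counts = {
    ("new", "mild"):   (90, 100),
    ("new", "severe"): (300, 500),
    ("old", "mild"):   (400, 500),
    ("old", "severe"): (50, 100),
}

def stratum_rate(treatment, severity):
    """Recovery rate within one level of Z."""
    recovered, total = counts[(treatment, severity)]
    return recovered / total

def pooled_rate(treatment):
    """Recovery rate ignoring Z (all severities lumped together)."""
    recovered = sum(r for (x, _z), (r, _t) in counts.items() if x == treatment)
    total = sum(t for (x, _z), (_r, t) in counts.items() if x == treatment)
    return recovered / total

for severity in ("mild", "severe"):
    print(severity, stratum_rate("new", severity), stratum_rate("old", severity))
print("pooled", pooled_rate("new"), pooled_rate("old"))
```

Here the new treatment was mostly given to severe cases (Z linked to X), and severe cases recover less often (Z linked to Y). Within each severity level the new treatment has the higher recovery rate, yet the pooled rates favour the old one.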

Does a huge sample make a survey trustworthy?

No. This is the most examined trap. Sample size cures variance (chance error), not bias. The 1936 Literary Digest poll collected millions of responses, but from lists dominated by car and telephone owners (the wealthy), and confidently predicted the wrong winner. If a question offers "the sample was large" as reassurance, that is usually the bait: ask how the sample was drawn, not how big it is.
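A short simulation can make "size cures variance, not bias" concrete. This is a toy electorate, not the 1936 data: the 20% wealthy share and the 40% / 60% support rates are assumptions chosen for illustration, and the function names are hypothetical.

```python
import random

rng = random.Random(1936)

# Assumed toy electorate: 20% of voters are wealthy; 40% of wealthy voters and
# 60% of other voters support candidate A. True support = 0.2*0.4 + 0.8*0.6 = 0.56.
TRUE_SUPPORT = 0.2 * 0.4 + 0.8 * 0.6

def draw_voter():
    wealthy = rng.random() < 0.20
    supports_a = rng.random() < (0.40 if wealthy else 0.60)
    return wealthy, supports_a

def biased_poll(n):
    """Poll that only ever reaches wealthy voters (selection bias)."""
    votes = []
    while len(votes) < n:
        wealthy, supports_a = draw_voter()
        if wealthy:
            votes.append(int(supports_a))
    return sum(votes) / n

def random_poll(n):
    """Probability sample drawn from the whole electorate."""
    return sum(int(draw_voter()[1]) for _ in range(n)) / n

big_biased = biased_poll(50000)
small_random = random_poll(2000)

print(f"true support           = {TRUE_SUPPORT:.2f}")
print(f"biased poll, n=50000   = {big_biased:.3f}")
print(f"random poll, n=2000    = {small_random:.3f}")
```

The large biased poll settles very precisely on the wrong number (support among the wealthy), while the much smaller probability sample lands near the true value.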

Why does data type matter before I do any statistics?

Because type dictates the legal summary, plot and test: you cannot take a mean of a category or draw a histogram of a colour. A whole answer can be wrong from line one if you pick a histogram for a nominal variable. Watch the coded-category trap: a variable stored as 1 = control, 2 = drug A is still categorical, and averaging it ("mean treatment = 1.6") is nonsense.
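The coded-category trap is easy to reproduce. In this sketch the eight treatment codes are made up for illustration; the point is that software will average the codes without complaint, while the legal summary for a categorical variable is a table of counts or proportions.

```python
from collections import Counter
from statistics import mean

# Hypothetical treatment column stored as integer codes (1 = control, 2 = drug A).
treatment_codes = [1, 2, 2, 1, 2, 1, 2, 2]
labels = {1: "control", 2: "drug A"}

# Legal summary for a categorical variable: counts and proportions per category.
treatment = [labels[code] for code in treatment_codes]
frequency = Counter(treatment)
proportions = {name: count / len(treatment) for name, count in frequency.items()}

# Python will happily average the codes, but the result describes no treatment.
meaningless_mean = mean(treatment_codes)

print(frequency)
print(proportions)
print(meaningless_mean)
```

Decoding the integers into labels before summarising is one simple guard: a list of strings cannot be averaged by accident.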

Study strategy

Exam move

Build one reflex and use it on every "data study" prompt: design → conclusion → bias. First name each variable's type (it fixes the legal summary and plot). Then ask who decided which units got the treatment: nature or self-selection → observational → association only; the investigator randomised → causation licensed. Then hunt for a confounder Z linked to both X and Y, and name any selection, non-response or measurement bias that a larger sample would not cure. Three or four crisp sentences earn full marks here. The marks are cheap if you are disciplined, and this is where rushed students drop them early in the exam.
