DATA1001 · Foundations Of Data Science
Sampling Distributions
This chapter is the hinge between a single sample and a statement about the population. You want a parameter (the true mean μ or proportion p) but all you have is a statistic (x̄ or p̂) computed from one sample. The central limit theorem is the engine: for large n the sampling distribution of the mean (or sum) is approximately Normal regardless of the population's own shape, and for the mean it is centred on the parameter and narrows as n grows. That lets you attach a margin of error and build a confidence interval: estimate ± z×SE. The subtlety the exam tests hardest is what "95% confident" actually means: it is a statement about the long-run reliability of the procedure, not the probability that a particular interval contains the parameter. The chapter also covers the bias types that no sample size cures, and the bootstrap as a resampling way to get an SE when a formula is awkward.
What this chapter covers
- 01 Parameter vs statistic; the sampling frame
- 02 The central limit theorem: the sampling distribution of the mean
- 03 Bias types revisited (size cures variance, not bias)
- 04 Confidence intervals: estimate ± z×SE
- 05 What '95% confident' really means; the bootstrap
Worked example: a confidence interval for a proportion
- (a) [+1] p̂ = 220/400 = 0.55.
- (a) [+1] SE = √(p̂(1−p̂)/n) = √(0.55×0.45/400) ≈ 0.0249.
- (b) [+2] 95% CI = p̂ ± 1.96×SE = 0.55 ± 1.96×0.0249 = 0.55 ± 0.049 = (0.501, 0.599).
- (c) [+2] If we repeated this sampling procedure many times, about 95% of the intervals built this way would contain the true proportion.
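The arithmetic in parts (a) and (b) can be checked with a few lines of Python. This is a minimal sketch: the function name `proportion_ci` is an illustrative choice, not part of the course material, and it uses the same Normal-approximation formula as the worked example.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Return (p_hat, se, lower, upper) using estimate ± z×SE."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # SE of a sample proportion
    margin = z * se                          # margin of error
    return p_hat, se, p_hat - margin, p_hat + margin

p_hat, se, lower, upper = proportion_ci(220, 400)
print(f"p_hat = {p_hat:.2f}, SE = {se:.4f}, 95% CI = ({lower:.3f}, {upper:.3f})")
# prints: p_hat = 0.55, SE = 0.0249, 95% CI = (0.501, 0.599)
```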
Key terms
- Parameter
- A fixed, unknown number describing the whole population — the true mean μ or proportion p. Inference is the business of estimating a parameter from a statistic and quantifying how far off the estimate might be.
- Statistic
- A number computed from the sample — the sample mean x̄ or proportion p̂ — used to estimate the corresponding parameter. It varies from sample to sample, and that variation is the sampling distribution.
- Central limit theorem (CLT)
- The result that, for independent draws from a population with finite variance, the sampling distribution of the mean (or sum) is approximately Normal for large n, whatever the shape of the population. The distribution of the mean is centred on the parameter, with SE shrinking like 1/√n. It is what lets us use the Normal curve for inference.
- Confidence interval
- A range built as estimate ± z×SE that is designed to capture the parameter a stated proportion of the time (e.g. 95% with z = 1.96). The margin of error is its half-width, z×SE; it narrows as n grows.
- Bootstrap
- A resampling method: draw new samples (with replacement) from the data itself, recompute the statistic each time, and use the spread of those values as the standard error. It estimates an SE empirically when a formula is awkward or unavailable.
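To see what "SE shrinking like 1/√n" implies for the margin of error (z×SE), the sketch below holds p̂ fixed at 0.55 and quadruples n twice. The helper name `margin_of_error` and the chosen sample sizes are assumptions made for illustration only.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Margin of error z×SE for a sample proportion."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

for n in (100, 400, 1600):
    print(f"n={n:>4}: margin of error = {margin_of_error(0.55, n):.4f}")
# prints:
# n= 100: margin of error = 0.0975
# n= 400: margin of error = 0.0488
# n=1600: margin of error = 0.0244
```

Each fourfold increase in n halves the margin of error, so doubling precision costs four times the data.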
Sampling Distributions FAQ
What's the difference between a parameter and a statistic?
A parameter is a fixed but unknown number describing the whole population (the true mean μ or proportion p); a statistic is a number you compute from your sample (x̄ or p̂) to estimate it. You never see the parameter directly — inference is using the statistic, plus its standard error, to say something credible about the parameter.
What does the central limit theorem actually let me do?
It lets you treat the sampling distribution of a mean or proportion as approximately Normal for large n, no matter how skewed the population is. That single fact is why you can attach a Normal-based margin of error, build confidence intervals and run z-tests. The catch is "large enough n": heavy skew or small samples need more data before the Normal approximation is safe.
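A small simulation can make this concrete. The sketch below assumes an Exponential(1) population (strongly right-skewed, with mean 1 and SD 1) purely as an example; the seed, sample sizes and number of repetitions are arbitrary choices. It draws many samples of each size and compares the spread of the sample means with the theoretical 1/√n.

```python
import math
import random
import statistics

def sample_means(n, reps, rng):
    """Means of `reps` independent samples of size n from Exponential(1)."""
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(reps)]

rng = random.Random(1001)  # fixed seed so the run is repeatable
for n in (5, 30, 200):
    means = sample_means(n, reps=2000, rng=rng)
    centre = statistics.fmean(means)   # should sit near the population mean, 1
    spread = statistics.stdev(means)   # empirical SE of the mean
    print(f"n={n:>3}: mean of x-bar = {centre:.3f}, SD of x-bar = {spread:.3f}, "
          f"theory 1/sqrt(n) = {1 / math.sqrt(n):.3f}")
```

Plotting a histogram of `means` for each n would show the skew fading as n increases, which is the "large enough n" caveat in action.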
What does '95% confidence' really mean?
It is a statement about the procedure, not about one interval. If you repeated the whole sampling-and-interval process many times, about 95% of the intervals you built would contain the true parameter. For a specific computed interval, the parameter is either in it or not — there is no 95% probability attached to that one interval. Saying otherwise is the most penalised CI error.
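The long-run reading can be illustrated by simulation, where (unlike in real life) the true parameter is known. The sketch below assumes a true proportion of 0.55 and samples of size 400, values chosen only for illustration. It builds a 95% interval from each simulated sample and counts how often the interval contains the true value.

```python
import math
import random

def coverage(true_p, n, reps, rng, z=1.96):
    """Fraction of simulated intervals p_hat ± z×SE that contain true_p."""
    hits = 0
    for _ in range(reps):
        successes = sum(rng.random() < true_p for _ in range(n))
        p_hat = successes / n
        se = math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - z * se <= true_p <= p_hat + z * se:
            hits += 1  # this particular interval captured the parameter
    return hits / reps

rng = random.Random(42)  # fixed seed so the run is repeatable
rate = coverage(true_p=0.55, n=400, reps=2000, rng=rng)
print(f"Share of intervals containing the true p: {rate:.3f}")
```

Each individual interval either hits or misses; the figure near 0.95 describes the collection of intervals, not any one of them.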
When would I use the bootstrap instead of a formula?
When a clean SE formula is awkward, unavailable, or relies on assumptions you can't justify — for example the SE of a median or a complicated statistic. You resample (with replacement) from your own data many times, recompute the statistic each time, and read the SE off the spread of those bootstrap values. It is a general-purpose, computer-based route to the same margin-of-error reasoning.
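The resampling recipe just described can be sketched as follows. The data here are simulated stand-ins (100 values from a skewed distribution), and `bootstrap_se` is a hypothetical helper name; the number of resamples, 2000, is an arbitrary but typical choice.

```python
import random
import statistics

def bootstrap_se(data, stat, reps, rng):
    """Estimate the SE of `stat` as the SD of its values over bootstrap resamples."""
    n = len(data)
    # Each resample draws n values from the data with replacement.
    values = [stat(rng.choices(data, k=n)) for _ in range(reps)]
    return statistics.stdev(values)

rng = random.Random(7)  # fixed seed so the run is repeatable
data = [rng.expovariate(1.0) for _ in range(100)]  # made-up skewed sample
se_median = bootstrap_se(data, statistics.median, reps=2000, rng=rng)
print(f"sample median = {statistics.median(data):.3f}, bootstrap SE = {se_median:.3f}")
```

Passing a different function as `stat` (for example `statistics.fmean`) gives a bootstrap SE for that statistic with no change to the rest of the code.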
Exam move
Keep the parameter-statistic distinction front of mind: you estimate a fixed unknown parameter with a variable statistic, and the central limit theorem tells you that statistic is approximately Normal around the parameter with SE shrinking like 1/√n. For confidence intervals, drill the mechanics (estimate ± z×SE) but spend most of your effort on the interpretation — "95% of intervals built this way capture the parameter" — because that wording is examined relentlessly and the probability-of-one-interval phrasing loses the mark. Remember that bias is a fixed offset no sample size removes, and keep the bootstrap in your toolkit for when a formula SE is awkward.
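The point about bias can also be seen in a simulation. The sketch below invents a population and a flawed sampling frame that can never select units with values below 45; every number in it is an assumption chosen for illustration. As n grows the SE shrinks, but the gap between the estimate and the true mean stays put.

```python
import random
import statistics

rng = random.Random(2024)  # fixed seed so the run is repeatable
population = [rng.gauss(50, 10) for _ in range(100_000)]
true_mean = statistics.fmean(population)

# Hypothetical flawed sampling frame: units below 45 are never reachable.
frame = [x for x in population if x >= 45]

for n in (100, 1_000, 10_000):
    sample = rng.sample(frame, n)            # simple random sample from the frame
    x_bar = statistics.fmean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    print(f"n={n:>6}: x-bar = {x_bar:.2f}, SE = {se:.3f}, "
          f"offset from true mean = {x_bar - true_mean:.2f}")
```

A larger sample from the same frame gives a narrower interval around the wrong value, which is why bias has to be fixed in the design rather than by collecting more data.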