DATA1001 · Foundations Of Data Science
Probability and the Box Model
Probability is the engine room of inference, and DATA1001 keeps it concrete with Freedman's box model: imagine the chance process as drawing tickets from a box. Two rules run everything: add probabilities for "or" (mutually exclusive events) and multiply for "and" (independent events), checking each time whether you draw with or without replacement. The box model then gives the two quantities every later test needs. For the sum of n draws made with replacement: EV = n×(box mean) and SE = √n×(box SD). For the average: EV = box mean and SE = (box SD)/√n. Crucially, the SE grows like √n for sums but shrinks like 1/√n for averages; this is why bigger samples give more precise estimates, and it is the law of large numbers in action. These are the EV and SE that the hypothesis-testing engine compares the observed value (OV) against.
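The four formulas above can be sketched in a few lines of Python. This is a minimal illustration, not course-supplied code: the function names and the example box [0, 2, 3, 4, 6] are hypothetical, and the box SD is taken as the population SD (divide by the number of tickets), with draws assumed to be made with replacement.

```python
import math

def box_mean_sd(tickets):
    """Return (box mean, box SD), using the population SD (divide by N)."""
    n_tickets = len(tickets)
    mean = sum(tickets) / n_tickets
    variance = sum((t - mean) ** 2 for t in tickets) / n_tickets
    return mean, math.sqrt(variance)

def ev_se_sum(tickets, n):
    """EV and SE for the SUM of n draws made with replacement."""
    mean, sd = box_mean_sd(tickets)
    return n * mean, math.sqrt(n) * sd

def ev_se_average(tickets, n):
    """EV and SE for the AVERAGE of n draws made with replacement."""
    mean, sd = box_mean_sd(tickets)
    return mean, sd / math.sqrt(n)

# Hypothetical box, chosen so the arithmetic is easy: mean 3, SD 2.
box = [0, 2, 3, 4, 6]
ev_sum, se_sum = ev_se_sum(box, 25)      # EV 75, SE 10
ev_avg, se_avg = ev_se_average(box, 25)  # EV 3,  SE 0.4
print(ev_sum, se_sum, ev_avg, se_avg)
```

Note that the sum and the average share one box mean and one box SD; only the factor of n or √n differs.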
What this chapter covers
- 01 Probability rules: add for "or", multiply for "and"
- 02 With vs without replacement; independence
- 03 The binomial idea
- 04 The box model: tickets, draws, EV and SE for the sum and the average
- 05 Standard-error reasoning and the law of large numbers
Worked example: EV and SE from a box
- [+1] (a) Box mean = (18×(+1) + 20×(−1))/38 = −2/38 ≈ −0.0526.
- [+1] (a) Box SD: the tickets are +1 and −1 in fractions 18/38 and 20/38. Every ticket squared is 1, so SD = √(1 − (−0.0526)²) ≈ 0.999 (essentially 1).
- [+1] (b) EV of the sum = n×(box mean) = 100×(−0.0526) ≈ −$5.26.
- [+1] (b) SE of the sum = √n×(box SD) = √100×0.999 ≈ 9.99, i.e. about $10.
- [+2] (c) Approx 95% range = EV ± 2×SE = −5.26 ± 20, i.e. about −$25 to +$15.
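The working above can be checked numerically. The sketch below assumes the box that the arithmetic implies (18 tickets of +1 and 20 tickets of −1, with n = 100 draws made with replacement); the variable names are illustrative only.

```python
import math

# Box implied by the working: 18 tickets of +1 and 20 tickets of -1.
box = [+1] * 18 + [-1] * 20
n = 100  # number of draws, assumed to be made with replacement

box_mean = sum(box) / len(box)
# Population SD of the box (divide by the number of tickets).
box_sd = math.sqrt(sum((t - box_mean) ** 2 for t in box) / len(box))

ev_sum = n * box_mean              # EV of the sum
se_sum = math.sqrt(n) * box_sd     # SE of the sum
low, high = ev_sum - 2 * se_sum, ev_sum + 2 * se_sum  # EV +/- 2 SE

print(f"box mean = {box_mean:.4f}, box SD = {box_sd:.4f}")
print(f"EV of sum = {ev_sum:.2f}, SE of sum = {se_sum:.2f}")
print(f"approx 95% range: {low:.2f} to {high:.2f}")
```

The unrounded endpoints sit slightly off the hand-rounded −$25 and +$15 because the worked answer rounds the SE to $10 before doubling it.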
Key terms
- Box model
- Freedman's device for any chance process: tickets in a box, drawn n times. It converts a real process into a box mean and box SD, from which the EV and SE of the sum or average follow mechanically — the foundation for every later inference.
- Expected value (EV)
- The long-run average outcome. For the sum of n draws, EV = n×(box mean); for the average, EV = box mean. It is the centre that the OV (observed value) is compared against in the inference engine.
- Standard error (SE)
- The SD of a chance quantity — how much a sum or average wobbles from repetition to repetition. For the sum, SE = √n×(box SD); for the average, SE = (box SD)/√n. The average's SE shrinks like 1/√n, which is why bigger samples are more precise.
- Independence
- Two events are independent when one occurring does not change the probability of the other; then P(A and B) = P(A)×P(B). Drawing with replacement keeps draws independent; drawing without replacement makes them dependent.
- Law of large numbers
- As the number of draws grows, the observed average converges to the box mean (the EV), because the average's SE shrinks like 1/√n. It is why long-run frequencies stabilise and why larger samples estimate parameters more precisely.
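A small simulation can make the law of large numbers visible. This is a sketch under stated assumptions: it reuses the +1/−1 box from the worked example, draws with replacement using Python's standard `random` module, and fixes an arbitrary seed purely for reproducibility. The sample sizes are chosen for illustration.

```python
import math
import random

box = [+1] * 18 + [-1] * 20          # box from the worked example
box_mean = sum(box) / len(box)
box_sd = math.sqrt(sum((t - box_mean) ** 2 for t in box) / len(box))

rng = random.Random(1001)            # arbitrary seed (assumption)

averages = {}
for n in (100, 10_000, 250_000):
    draws = rng.choices(box, k=n)    # n draws WITH replacement
    averages[n] = sum(draws) / n
    se_avg = box_sd / math.sqrt(n)   # theoretical SE of the average
    print(f"n={n:>7}  observed average={averages[n]:+.4f}  "
          f"box mean={box_mean:+.4f}  SE of average={se_avg:.4f}")
```

As n increases, the printed SE of the average falls, and the observed average should typically land within a couple of SEs of the box mean.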
Probability and the Box Model FAQ
When do I add probabilities and when do I multiply?
Add for "or" with mutually exclusive events: P(A or B) = P(A) + P(B). Multiply for "and" with independent events: P(A and B) = P(A)×P(B). The catch is replacement: drawing without replacement changes the probabilities on later draws, so the draws are no longer independent. You still multiply, but each later fraction must be updated for the tickets already removed.
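The two rules, and the effect of replacement, can be illustrated with exact fractions. The box here (3 red, 2 blue and 5 green tickets) is a hypothetical example, not one from the course materials.

```python
from fractions import Fraction

# Hypothetical box: 3 red, 2 blue, 5 green tickets.
red, blue, green = 3, 2, 5
total = red + blue + green

# "Or" with mutually exclusive events on ONE draw: add.
p_red = Fraction(red, total)
p_blue = Fraction(blue, total)
p_red_or_blue = p_red + p_blue                      # 3/10 + 2/10 = 1/2

# "And" across TWO draws with replacement: draws are independent, so multiply.
p_two_reds_with = p_red * p_red                     # 3/10 * 3/10 = 9/100

# Without replacement: still multiply, but update the second fraction.
p_two_reds_without = Fraction(red, total) * Fraction(red - 1, total - 1)  # 3/10 * 2/9 = 1/15

print(p_red_or_blue, p_two_reds_with, p_two_reds_without)
```

Using `Fraction` keeps the answers exact, which makes it easy to compare them with hand working.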
What is the box model and why is it so central?
The box model represents any chance process as tickets in a box that you draw n times. It is central because it gives you, mechanically, the EV and SE of the sum and the average — and those are exactly the ingredients the hypothesis-testing engine needs. Learn to build the right box (what's on the tickets, in what proportions) and the rest of the inference course is plugging numbers into formulas.
Why does the standard error sometimes grow and sometimes shrink with n?
Both happen, for different quantities. The SE of the sum is √n×(box SD), which grows with n: totals get more variable in absolute terms. The SE of the average is (box SD)/√n, which shrinks with n: averages get more precise. The shrinking SE of the average is the law of large numbers at work and the reason larger samples give tighter estimates.
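A short table shows the two behaviours side by side. The box SD of 1 and the sample sizes below are assumptions chosen to keep the numbers round.

```python
import math

box_sd = 1.0  # hypothetical box SD, chosen for round numbers

rows = []
for n in (25, 100, 400):
    se_sum = math.sqrt(n) * box_sd   # grows like sqrt(n)
    se_avg = box_sd / math.sqrt(n)   # shrinks like 1/sqrt(n)
    rows.append((n, se_sum, se_avg))
    print(f"n={n:>4}  SE of sum={se_sum:6.2f}  SE of average={se_avg:6.3f}")
```

Each time n is multiplied by 4, the SE of the sum doubles and the SE of the average halves.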
How do EV and SE connect to hypothesis testing?
Directly. The (OV−EV)/SE engine that runs every test in the next chapter uses exactly these quantities: OV is the observed statistic, EV is what you'd expect if the null were true (computed from the null box: n×(box mean) for a sum, the box mean for an average), and SE is the matching box-model standard error. So mastering the box model here is mastering the denominator and centre of every test you'll meet later.
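As a rough sketch of how the pieces fit together, the snippet below computes (OV−EV)/SE for a sum. It assumes the null box is the +1/−1 box from the worked example with n = 100 draws, and the observed value of +20 is invented purely for illustration; it is not a result from the course.

```python
import math

# Null box assumed from the worked example: 18 tickets of +1, 20 of -1.
box = [+1] * 18 + [-1] * 20
n = 100

box_mean = sum(box) / len(box)
box_sd = math.sqrt(sum((t - box_mean) ** 2 for t in box) / len(box))

ev = n * box_mean             # EV of the sum under the null
se = math.sqrt(n) * box_sd    # SE of the sum under the null

ov = 20                       # HYPOTHETICAL observed sum (made up)
z = (ov - ev) / se            # how many SEs the OV lies from the EV

print(f"EV = {ev:.2f}, SE = {se:.2f}, (OV - EV)/SE = {z:.2f}")
```

The only test-specific input is the OV; the EV and SE come straight from the box.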
Exam move
Get fluent at building the box: decide what is written on the tickets and in what proportions, then read off the box mean and box SD. Memorise the two pairs — sum: EV = n×mean, SE = √n×SD; average: EV = mean, SE = SD/√n — and keep clear that the sum's SE grows while the average's SE shrinks with n. Use "add for or, multiply for and", and always check replacement before assuming independence. Because these EV and SE values become the EV and SE of the testing engine, practise them until they are automatic; the back half of the course rewards the box model heavily.