ECMT1010 · Introduction To Economic Statistics
Describing Data: Centre, Spread & Shape
Week 2 is the toolkit for summarising one or two variables: centre (mean vs median and resistance to outliers), spread (SD, range, IQR and the five-number summary), shape (symmetric vs skewed histograms), location (z-scores) and association (the correlation r). It is examined as MCQ in the Week-7 test and as short-answer calculation — compute x̄ and s, build a five-number summary, flag outliers with the 1.5×IQR rule, and interpret a z-score and an r in plain English.
What this chapter covers
- 1. Categorical vs quantitative variables, and choosing the right summary
- 2. Histograms and shape: symmetric vs left/right-skewed; how skew pulls the mean
- 3. Centre: mean x̄ vs median; the median is resistant to outliers, the mean is not
- 4. Spread: standard deviation s, range, IQR and the five-number summary
- 5. Boxplots and the 1.5×IQR rule for flagging outliers
- 6. Standardisation: the z-score zᵢ = (xᵢ − x̄)/s as 'distance from the mean in SDs'
- 7. The 95% rule: for bell-shaped data ~95% of values lie within x̄ ± 2s
- 8. Correlation r: unit-free, −1 ≤ r ≤ 1, measures only linear association and is not resistant
Mean, standard deviation and a z-score
- (1 mark) Compute the mean: x̄ = (18 + 26 + 22 + 14 + 33 + 21)/6 = 134/6 ≈ 22.33 ($00s).
- (2 marks) Find the deviations from the mean: −4.33, 3.67, −0.33, −8.33, 10.67, −1.33; square and sum them: 18.78 + 13.44 + 0.11 + 69.44 + 113.78 + 1.78 ≈ 217.33.
- (2 marks) Use the sample divisor (n − 1) = 5: s² = 217.33/5 ≈ 43.47, so s = √43.47 ≈ 6.59 ($00s).
- (1 mark) Standardise the best day (33): z = (33 − 22.33)/6.59 ≈ 10.67/6.59 ≈ +1.62, so the best day is about 1.6 SD above the mean.
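The arithmetic in these steps can be checked with a short Python sketch. It uses only the standard-library `statistics` module; the variable names (`sales`, `z_best`, `z_scores`) are illustrative assumptions, not part of the original question.

```python
import statistics

# The six observations from the worked example (units: $00s).
sales = [18, 26, 22, 14, 33, 21]

x_bar = statistics.mean(sales)   # sum of the values divided by n
s = statistics.stdev(sales)      # sample SD, which uses the n - 1 divisor

# z-score of the best day: its distance from the mean, measured in SDs.
z_best = (max(sales) - x_bar) / s

# Standardising every value gives a dataset with mean 0 and SD 1.
z_scores = [(x - x_bar) / s for x in sales]

print(f"mean = {x_bar:.2f}, s = {s:.2f}, z(33) = {z_best:+.2f}")
# prints: mean = 22.33, s = 6.59, z(33) = +1.62
```

In a hand calculation you round at each step, so your last decimal place may differ slightly from software that carries full precision.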
Key terms
- Mean vs median
- The mean is the arithmetic average (Σxᵢ)/n; the median is the middle of the ordered data. The median is resistant to outliers, while the mean is pulled toward the long tail of a skewed distribution.
- Standard deviation (s)
- A measure of spread, s = √[Σ(xᵢ − x̄)²/(n − 1)] for a sample. It is roughly the typical distance of a value from the mean, and it is measured in the same units as the data.
- Five-number summary & IQR
- The five-number summary is {min, Q1, median, Q3, max}; the interquartile range IQR = Q3 − Q1 captures the middle 50% of the data and is the basis of the boxplot and the outlier rule.
- 1.5×IQR outlier rule
- A value is flagged as an outlier if it is below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. It is the standard rule used to draw whiskers and points on a boxplot.
- z-score
- The standardised value zᵢ = (xᵢ − x̄)/s, the distance of a value from the mean measured in standard deviations. Standardising a dataset gives it mean 0 and SD 1, making different variables comparable.
- Correlation (r)
- A unit-free measure of the strength and direction of a linear association, with −1 ≤ r ≤ 1. It is symmetric in x and y and is not resistant to outliers. r = 0 means only that there is no linear association; it does not rule out a non-linear relationship, and r on its own says nothing about causation.
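The five-number summary and the 1.5×IQR rule from the key terms can be sketched in Python as follows. One assumption to note: textbooks and software define quartiles in slightly different ways, and this sketch takes Q1 and Q3 as the medians of the lower and upper halves of the ordered data, so check which convention your course uses. The function names and the extra value of 60 are hypothetical, added only for illustration.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    leaving the overall median out of both halves when n is odd.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 else data[half:]
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])

def iqr_outliers(values):
    """Return the values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low_fence or x > high_fence]

sales = [18, 26, 22, 14, 33, 21]        # data from the worked example
summary = five_number_summary(sales)     # (14, 18, 21.5, 26, 33)
flagged = iqr_outliers(sales)            # [] because the fences are 6 and 38

# Hypothetical extra day of 60, added only to show the rule firing.
flagged_with_extra = iqr_outliers(sales + [60])   # [60]
```

For the six original values the IQR is 26 − 18 = 8, so nothing falls outside the fences; adding the hypothetical 60 moves Q3 to 33 and the upper fence to 55.5, which is why 60 is flagged.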
Describing Data: Centre, Spread & Shape FAQ
When should I report the median instead of the mean?
Use the median when the data are skewed or contain outliers, because it is resistant: it depends only on the middle of the ordered data, not on how extreme the tail values are. The mean is pulled toward the long tail, so for right-skewed data like incomes or house prices the mean overstates the 'typical' value. For roughly symmetric data the mean and median are close and the mean is fine.
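A quick way to see resistance in action is to add one extreme value and watch what moves. The sketch below reuses the six values from the worked example; the extra value of 330 is a hypothetical outlier chosen purely for illustration.

```python
import statistics

sales = [18, 26, 22, 14, 33, 21]   # data from the worked example ($00s)
with_outlier = sales + [330]        # hypothetical extreme value

mean_before = statistics.mean(sales)
median_before = statistics.median(sales)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(f"without outlier: mean = {mean_before:.2f}, median = {median_before}")
print(f"with outlier:    mean = {mean_after:.2f}, median = {median_after}")
# prints:
# without outlier: mean = 22.33, median = 21.5
# with outlier:    mean = 66.29, median = 22
```

One added value roughly triples the mean but shifts the median by only 0.5, which is the sense in which the median is resistant and the mean is not.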
Why do I divide by n − 1 and not n for the standard deviation?
Because you are computing a sample standard deviation that estimates the population SD. Deviations measured from the sample mean are, on average, smaller than deviations from the true mean, so dividing by n would underestimate the spread. Dividing by (n − 1) corrects this and makes the sample variance s² an unbiased estimator of the population variance. In ECMT1010 you almost always have a sample, so use (n − 1).
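The effect of the divisor can be checked exactly on a toy case. The sketch below is an illustration under stated assumptions, not a proof: it takes a hypothetical population {1, 2, 3}, lists every ordered sample of size 2 drawn with replacement, and averages the sample variance under each divisor. The helper name `variance` is made up for this example.

```python
from fractions import Fraction
from itertools import product

def variance(values, divisor):
    """Sum of squared deviations from the mean, divided by `divisor`."""
    mean = Fraction(sum(values), len(values))
    return sum((x - mean) ** 2 for x in values) / divisor

population = [1, 2, 3]                              # hypothetical toy population
true_var = variance(population, len(population))    # population variance

# Every ordered sample of size 2 drawn with replacement (9 in total).
samples = list(product(population, repeat=2))

# Average sample variance using the (n - 1) = 1 divisor, then the n = 2 divisor.
avg_n_minus_1 = sum(variance(smp, 1) for smp in samples) / len(samples)
avg_n = sum(variance(smp, 2) for smp in samples) / len(samples)

print(true_var, avg_n_minus_1, avg_n)   # prints: 2/3 2/3 1/3
```

Averaged over all possible samples, the (n − 1) version matches the population variance of 2/3, while the n version comes out at 1/3 and so underestimates the spread.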
What does the 95% rule say and when does it apply?
For roughly bell-shaped (symmetric, unimodal) data, about 68% of values lie within one SD of the mean, about 95% within two SDs (x̄ ± 2s), and about 99.7% within three SDs. It lets you judge quickly whether a value is unusual: anything more than 2s from the mean falls in roughly the outer 5% of the data. It only applies to bell-shaped data, so check the histogram first.
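You can see the rule at work on simulated bell-shaped data. The sketch below is only an illustration: the data are generated with Python's `random` module, and the mean of 50, SD of 10, sample size and seed are arbitrary assumptions, as is the helper name `share_within`.

```python
import random
import statistics

def share_within(values, k):
    """Fraction of values lying within k sample SDs of the sample mean."""
    x_bar = statistics.mean(values)
    s = statistics.stdev(values)
    inside = sum(1 for x in values if abs(x - x_bar) <= k * s)
    return inside / len(values)

rng = random.Random(0)   # fixed seed so the run is repeatable
# Simulated bell-shaped data; the mean 50 and SD 10 are arbitrary choices.
bell = [rng.gauss(50, 10) for _ in range(10_000)]

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(bell, k):.3f}")
```

The three printed shares should land close to 0.68, 0.95 and 0.997. The same function applied to strongly skewed data would generally not reproduce those figures, which is why the histogram check comes first.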
What does r = 0 actually mean?
It means there is no linear association between the two variables — but there could still be a strong non-linear (for example U-shaped) relationship that r cannot detect. Always plot the scatterplot first. Also remember r is not resistant, so a single outlier can drag it up or down, and a strong r still never proves causation.
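A small sketch makes these points concrete. All three datasets below are hypothetical, chosen so the answers are easy to verify by hand, and `correlation` is a helper written for this example rather than a library function.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r computed from deviations about the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]

# Hypothetical U-shape, y = x^2: y is completely determined by x, yet r = 0.
u_shape = [x ** 2 for x in xs]
r_u = correlation(xs, u_shape)

# Hypothetical straight line, y = 2x + 1: a perfect linear association, r = 1.
line = [2 * x + 1 for x in xs]
r_line = correlation(xs, line)

# Not resistant: one extreme point (10, -50) added to the U-shape
# drags r from 0 to a strongly negative value.
r_outlier = correlation(xs + [10], u_shape + [-50])

print(f"U-shape r = {r_u:.2f}, line r = {r_line:.2f}, with outlier r = {r_outlier:.2f}")
```

Swapping the arguments, as in `correlation(line, xs)`, returns the same value, which is the symmetry property of r.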
Exam move
Build a fixed routine for any single dataset: order the values, write the five-number summary, compute x̄ and s (with the n − 1 divisor), then apply the 1.5×IQR rule before you trust the mean. Practise reading shape off a histogram and predicting whether the mean sits above or below the median from the direction of the skew, because examiners love this conceptual MCQ. Learn the z-score as a portable 'how unusual' ruler and the 95% rule as its companion. For association questions, internalise the five facts about r (unit-free, bounded by ±1, symmetric, not resistant, linear-only) and always pair an r value with a one-sentence plain-English interpretation, because the interpretation earns marks a bare number does not.