QBUS5001 · Foundation In Data Analytics For Business
Descriptive Statistics & Association
Module 1 builds the descriptive toolkit: measures of centre (mean, median, mode), measures of spread (range, variance, standard deviation, IQR, coefficient of variation) and measures of association (covariance and correlation). The headline idea is that covariance tells you the direction of a linear relationship but its size is scale-dependent, whereas correlation standardises it onto [−1, 1] so you can read off strength.
These statistics are the raw materials for everything later: standard error uses the SD, regression uses covariance, and confidence intervals quote the sample mean. Excel functions CORREL and COVARIANCE.S do the arithmetic.
What this chapter covers
- 01 Mean: population μ vs sample x̄
- 02 Median, mode and when each is preferred
- 03 Variance and standard deviation (note the n−1 divisor for samples)
- 04 Range and interquartile range (IQR = Q₃ − Q₁)
- 05 Coefficient of variation: CV = s/x̄ for unit-free comparison
- 06 Sample covariance and its direction-only meaning
- 07 Sample correlation r and the [−1, 1] scale
- 08 Scale-dependence: why correlation standardises covariance
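As a rough illustration of items 01 to 05, the sketch below computes each measure with Python's standard `statistics` module. The dataset is hypothetical and chosen only so the numbers are easy to check by hand. Quartile conventions differ between tools, so Q₁ and Q₃ here may not match a given Excel quartile function exactly.

```python
import statistics

# Hypothetical sample (e.g. daily sales in units); values are illustrative only.
data = [12, 15, 15, 18, 20, 22, 45]

mean = statistics.mean(data)          # sample mean x̄
median = statistics.median(data)      # middle value; barely moved by the outlier 45
mode = statistics.mode(data)          # most frequent value
data_range = max(data) - min(data)    # range = max - min
variance = statistics.variance(data)  # sample variance s², n-1 divisor
sd = statistics.stdev(data)           # sample standard deviation s

# Quartiles using the module's default ("exclusive") method.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1                         # IQR = Q3 - Q1

cv = sd / mean                        # coefficient of variation, unit-free

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={data_range}, s^2={variance:.2f}, s={sd:.2f}")
print(f"IQR={iqr}, CV={cv:.3f}")
```

The single large value (45) pulls the mean above the median, which is the usual reason to prefer the median for skewed data.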
Standard deviation, CV and correlation reasoning
- [1 mark] (a) Compute the coefficient of variation for each route: CV(A) = 6/30 = 0.20 (20%).
- [1 mark] CV(B) = 10/80 = 0.125 (12.5%).
- [1 mark] Compare: although Route B has the larger absolute SD (10 > 6), Route A is relatively more variable because its CV (20%) exceeds B's (12.5%).
- [1 mark] (b) Covariance only signals direction (here, positive: longer distance tends to mean longer time) and its magnitude depends on the units (minutes × km).
- [1 mark] Correlation r divides covariance by the product of the two SDs, giving a unit-free number in [−1, 1] that measures the strength of the linear relationship comparably across datasets.
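The CV arithmetic in the marking steps above can be checked with a few lines of Python. This is a minimal sketch: the helper name `coefficient_of_variation` is made up for illustration, and the inputs are the SD and mean values quoted in the solution (6 and 30 for Route A, 10 and 80 for Route B).

```python
def coefficient_of_variation(sd, mean):
    """Return CV = s / x̄ as a proportion (hypothetical helper)."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean

cv_a = coefficient_of_variation(6, 30)   # Route A
cv_b = coefficient_of_variation(10, 80)  # Route B

# The route with the larger CV is relatively more variable,
# even if its absolute SD is smaller.
more_variable = "A" if cv_a > cv_b else "B"

print(f"CV(A) = {cv_a:.1%}, CV(B) = {cv_b:.1%}")
print(f"Relatively more variable: Route {more_variable}")
```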
Key terms
- Coefficient of variation (CV): CV = s/x̄ (or σ/μ), a unit-free measure of relative spread used to compare variability across datasets with different means or units.
- Sample covariance: s(X,Y) = (1/(n−1))Σ(xᵢ−x̄)(yᵢ−ȳ), measuring the direction of a linear relationship; its magnitude depends on the variables' units.
- Sample correlation (r): r = s(X,Y)/(s_x · s_y), a standardised covariance lying in [−1, 1]: near +1 strong positive, near −1 strong negative, near 0 no linear relationship.
- Interquartile range (IQR): IQR = Q₃ − Q₁, the spread of the middle 50% of the data; robust to outliers and used to flag them in a box-and-whisker plot.
- Sample variance (s²): s² = (1/(n−1))Σ(xᵢ−x̄)²; the n−1 divisor (Bessel's correction) makes it an unbiased estimator of the population variance σ².
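The covariance and correlation definitions in the key terms can be written out directly from their formulas. The sketch below is one way to do it, assuming a small hypothetical distance/time dataset; the function names are illustrative, not taken from the course materials.

```python
import math

def sample_covariance(x, y):
    """s(X,Y) = (1/(n-1)) * sum((x_i - x̄)(y_i - ȳ))."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

def sample_correlation(x, y):
    """r = s(X,Y) / (s_x * s_y); the variance of X is its covariance with itself."""
    s_x = math.sqrt(sample_covariance(x, x))
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)

# Hypothetical data: trip distance (km) and travel time (minutes).
distance_km = [2, 4, 6, 8, 10]
time_min = [9, 14, 22, 25, 35]

s_xy = sample_covariance(distance_km, time_min)
r = sample_correlation(distance_km, time_min)

print(f"covariance = {s_xy:.2f} (units: km x minutes)")
print(f"correlation = {r:.3f} (unit-free)")
```

The positive covariance gives the direction; only the correlation, being close to 1, says the linear relationship is strong.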
Descriptive Statistics & Association FAQ
Why divide by n−1 instead of n for the sample variance?
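Because deviations are measured from the sample mean x̄, which is itself estimated from the data, the sum of squared deviations tends to be smaller than it would be around the true mean μ. Dividing by n−1 (the degrees of freedom) rather than n corrects this downward bias and makes s² an unbiased estimator of σ². Population variance, where μ is known, divides by N.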
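One way to see the bias without simulation is to enumerate every possible sample from a tiny population and average the two candidate estimators. The sketch below assumes a hypothetical population {1, 2, 3} and samples of size 2 drawn with replacement (so the draws are independent); it is an illustration, not a proof.

```python
from itertools import product

# Hypothetical population; its variance uses the N divisor because μ is known.
population = [1, 2, 3]
N = len(population)
mu = sum(population) / N
sigma2 = sum((v - mu) ** 2 for v in population) / N

# All ordered samples of size n drawn with replacement (9 of them).
n = 2
samples = list(product(population, repeat=n))

def sum_sq_dev(sample):
    """Sum of squared deviations from the sample's own mean."""
    x_bar = sum(sample) / len(sample)
    return sum((v - x_bar) ** 2 for v in sample)

# Average each estimator over every equally likely sample.
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(f"population variance     = {sigma2:.4f}")
print(f"average with divisor n  = {avg_div_n:.4f}")
print(f"average with divisor n-1 = {avg_div_n_minus_1:.4f}")
```

Averaged over all samples, the n divisor undershoots σ² while the n−1 divisor matches it.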
When should I report correlation rather than covariance?
Whenever you want to judge the strength of a relationship or compare relationships across different variable pairs. Covariance is scale-dependent, so its size cannot be interpreted on its own; correlation is unit-free and bounded in [−1, 1].
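Scale-dependence is easy to demonstrate by changing units. In the sketch below, a hypothetical distance variable is recorded in kilometres and then in metres: the covariance grows by a factor of 1000 while the correlation does not move. The helper names `cov` and `corr` are illustrative.

```python
import statistics

def cov(x, y):
    """Sample covariance with the n-1 divisor."""
    x_bar, y_bar = statistics.mean(x), statistics.mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)

def corr(x, y):
    """Sample correlation: covariance divided by the product of the SDs."""
    return cov(x, y) / (statistics.stdev(x) * statistics.stdev(y))

# Hypothetical data: the same distances in two different units.
distance_km = [2, 4, 6, 8, 10]
distance_m = [1000 * d for d in distance_km]
time_min = [9, 14, 22, 25, 35]

cov_km, cov_m = cov(distance_km, time_min), cov(distance_m, time_min)
corr_km, corr_m = corr(distance_km, time_min), corr(distance_m, time_min)

print(f"covariance:  {cov_km:.1f} (km) vs {cov_m:.1f} (m)")
print(f"correlation: {corr_km:.4f} (km) vs {corr_m:.4f} (m)")
```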
Does a correlation near zero mean the variables are unrelated?
It means no linear relationship. Two variables can have r near 0 yet a strong non-linear (e.g. U-shaped) relationship, which a scatter plot would reveal. Correlation only captures the linear component.
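A small constructed example makes the point concrete. Below, y is determined exactly by x through y = x², yet the sample correlation is zero because the x values are symmetric about 0. The data are hypothetical and picked to make the cancellation exact.

```python
import statistics

# Hypothetical U-shaped data: y depends perfectly on x, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

x_bar = statistics.mean(x)
y_bar = statistics.mean(y)

# Positive and negative cross-products cancel because x is symmetric about 0.
cov_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)
r = cov_xy / (statistics.stdev(x) * statistics.stdev(y))

print(f"r = {r:.3f}")
```

A scatter plot of these points would show the U shape immediately, even though r reports no linear association.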
Exam move
Compute every descriptive measure by hand on one small dataset, then reproduce it with Excel (AVERAGE, STDEV.S, CORREL, COVARIANCE.S) so you trust both routes under exam time pressure. Internalise the slogan "covariance = direction, correlation = strength", because regression in Module 10 reuses exactly the covariance-over-variance structure for the slope.
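The covariance-over-variance structure mentioned above can be sketched for simple linear regression, where the least-squares slope is b₁ = s(X,Y)/s_x² and the intercept is b₀ = ȳ − b₁x̄. The data and variable names below are hypothetical, and the notation used in Module 10 may differ.

```python
import statistics

# Hypothetical paired data, small enough to verify by hand.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar = statistics.mean(x)
y_bar = statistics.mean(y)

# Sample covariance and sample variance of x, both with the n-1 divisor.
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)
s_xx = statistics.variance(x)

slope = s_xy / s_xx                  # b1 = covariance over variance
intercept = y_bar - slope * x_bar    # b0 = ȳ - b1 * x̄

# Equivalent form: the slope is the correlation rescaled by s_y / s_x.
r = s_xy / (statistics.stdev(x) * statistics.stdev(y))
slope_from_r = r * statistics.stdev(y) / statistics.stdev(x)

print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

Because the n−1 divisors cancel in the ratio, the slope depends only on the sums of cross-products and squared deviations.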