COMP20008 · Elements of Data Processing
Correlation, Entropy & Mutual Information
This chapter is about measuring how two variables relate — and why the choice of measure matters. Pearson r captures only linear association; entropy and conditional entropy quantify uncertainty in bits; and mutual information (rescaled to NMI) captures any dependence, including non-linear. It also covers PCA, which projects onto directions of maximum variance — with the warning that high explained variance is not the same as relevance. It is examined as critique + small calculation: the 2025 exam's Q4 asks for a high-NMI-but-zero-Pearson example (the y = x² archetype) and Q9 critiques discarding low-variance PCA components.
What this chapter covers
- 1. Pearson r: measures linear association only, range −1 ≤ r ≤ 1; blind to curved/non-monotone relations
- 2. Entropy H(X) = − Σ pᵢ log₂ pᵢ: bits of uncertainty; max when uniform, 0 when deterministic
- 3. Conditional entropy H(Y | X): uncertainty in Y once X is known
- 4. Mutual information MI(X, Y) = H(Y) − H(Y | X): reduction in uncertainty; captures any dependence; 0 iff independent
- 5. NMI = MI / min(H(X), H(Y)): rescales MI to 0 ≤ NMI ≤ 1
- 6. Pearson vs MI/NMI: y = x² has near-zero Pearson but high NMI
- 7. PCA: project onto orthogonal directions of maximum variance; PC1 captures the most variance
- 8. The PCA caution: high explained variance ≠ relevance to the target; scale features first
Near-zero Pearson but high NMI (mirrors 2025 Q4)
- (1 mark) Let x take the values −3, −2, −1, 0, 1, 2, 3 and let y = x². As x increases, y first falls and then rises symmetrically, so the positive and negative contributions to the covariance cancel and Pearson r = 0.
- (1 mark) But y is completely determined by x: knowing x removes all uncertainty about y, so H(Y | X) = 0 and MI = H(Y). Treating each distinct value as a category, this gives NMI = 1. There is strong dependence even though there is no linear correlation.
- (1 mark) Limitations: Pearson captures only linear association, so it misses curved or non-monotone structure; MI/NMI captures any dependence but needs binning/discretisation for continuous data and gives no sign or direction of the relationship.
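The worked answer above can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper names (`pearson`, `entropy`, `conditional_entropy`) are illustrative, and each distinct value of x and y is treated as its own category, which is an assumption that suits this small discrete example.

```python
from collections import Counter, defaultdict
from math import log2, sqrt

def pearson(xs, ys):
    """Pearson r: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def entropy(values):
    """Entropy in bits of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def conditional_entropy(xs, ys):
    """H(Y|X): entropy of Y inside each X group, weighted by group size."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[x].append(y)
    n = len(xs)
    return sum(len(g) / n * entropy(g) for g in groups.values())

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson(xs, ys)
h_x, h_y = entropy(xs), entropy(ys)
mi = h_y - conditional_entropy(xs, ys)   # MI = H(Y) - H(Y|X)
nmi = mi / min(h_x, h_y)                 # NMI = MI / min(H(X), H(Y))

print(f"Pearson r = {r:.3f}, MI = {mi:.3f} bits, NMI = {nmi:.3f}")
```

Because every x value maps to exactly one y value, each group inside `conditional_entropy` contains a single outcome, so H(Y | X) comes out as 0 and MI equals H(Y).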
Key terms
- Pearson correlation r
- Measures the strength and direction of the linear relationship between two numeric variables, ranging −1 to +1. It is blind to non-linear or non-monotone structure, so a strong curved relationship can give r near 0.
- Entropy H(X)
- The uncertainty of a categorical variable in bits, H(X) = − Σ pᵢ log₂ pᵢ. It is maximal when the categories are equally likely (uniform) and 0 when the outcome is certain (deterministic).
- Conditional entropy H(Y | X)
- The remaining uncertainty about Y once X is known, averaged over the values of X. If X fully determines Y then H(Y | X) = 0; if X tells you nothing then H(Y | X) = H(Y).
- Mutual information MI(X, Y)
- The reduction in uncertainty about Y from knowing X, MI = H(Y) − H(Y | X). It captures any dependence, including non-linear, and is 0 if and only if X and Y are independent.
- Normalised mutual information (NMI)
- MI rescaled to the range 0–1 by dividing by min(H(X), H(Y)). NMI = 1 means one variable fully determines the other; NMI = 0 means independence. Because the scale is fixed, NMI values can be compared across variable pairs.
- PCA (Principal Components Analysis)
- A linear projection onto orthogonal directions of maximum variance: PC1 captures the most variance, PC2 the next, and so on. Used for visualisation and compression, but high variance on a component does not guarantee it is relevant to a target.
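The entropy, conditional entropy, MI and NMI definitions above can be applied to a small joint table, which is the kind of calculation the chapter describes. The sketch below uses a hypothetical 2×2 table of counts (the values "a", "b", "yes", "no" and the counts themselves are invented for illustration) and only the standard library.

```python
from math import log2

def entropy(probs):
    """H = -sum p * log2(p) in bits; zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical joint counts: outer keys are values of X, inner keys values of Y.
counts = {
    "a": {"yes": 4, "no": 0},
    "b": {"yes": 2, "no": 2},
}
y_values = ["yes", "no"]
total = sum(sum(row.values()) for row in counts.values())

# Marginal distributions of X and Y.
p_x = {x: sum(row.values()) / total for x, row in counts.items()}
p_y = {y: sum(row[y] for row in counts.values()) / total for y in y_values}

h_x = entropy(p_x.values())
h_y = entropy(p_y.values())

# H(Y|X): entropy of Y within each X row, averaged with weights P(X = x).
h_y_given_x = sum(
    p_x[x] * entropy([c / sum(row.values()) for c in row.values()])
    for x, row in counts.items()
)

mi = h_y - h_y_given_x
nmi = mi / min(h_x, h_y)

print(f"H(X)={h_x:.3f}  H(Y)={h_y:.3f}  H(Y|X)={h_y_given_x:.3f}")
print(f"MI={mi:.3f}  NMI={nmi:.3f}")
```

In this table, X = "a" pins Y down completely (row entropy 0) while X = "b" leaves Y at a coin flip (row entropy 1 bit), so the weighted average H(Y | X) sits between the two extremes described in the definition.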
Correlation, Entropy & Mutual Information FAQ
How can two variables have zero correlation but be strongly related?
Pearson r only measures linear association, so a relationship that is strong but curved or symmetric can give r ≈ 0. The classic example is y = x² over a symmetric range of x: y is fully determined by x, yet as x rises y falls then rises, cancelling the linear trend so r ≈ 0. Mutual information captures this dependence and is high, which is why the exam contrasts the two measures.
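The cancellation can be written out by hand. The following worked check assumes the illustrative symmetric range x = −3, −2, …, 3 with y = x²; it is a calculation for that specific data set rather than a general proof.

```latex
\begin{align*}
\bar{x} &= \frac{1}{7}\sum_{i} x_i = 0,
\qquad
\bar{y} = \frac{1}{7}\sum_{i} x_i^{2} = \frac{28}{7} = 4 \\[4pt]
\sum_{i} (x_i - \bar{x})(y_i - \bar{y})
  &= \sum_{i} x_i \,(x_i^{2} - 4)
   = \sum_{i} x_i^{3} - 4\sum_{i} x_i
   = 0 - 0 = 0 \\[4pt]
r &= \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}
          {\sqrt{\sum_{i} (x_i - \bar{x})^{2}}\,\sqrt{\sum_{i} (y_i - \bar{y})^{2}}}
   = 0
\end{align*}
```

Both sums vanish because x and x³ are odd functions summed over a range that is symmetric about zero.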
When should I use mutual information instead of Pearson r?
Use mutual information (or NMI) when the relationship may be non-linear or the variables are categorical, because MI detects any kind of dependence and is 0 only under independence. Use Pearson when you specifically want the strength and sign of a linear relationship between numeric variables. The trade-off: MI has no sign/direction and needs binning for continuous data, while Pearson is simple and signed but linear-only.
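The binning requirement can be made concrete with a small sketch. It assumes equal-width bins (one common choice among several), a hypothetical evenly spaced sample of x in [−1, 1] with y = x², and the standard identity MI = H(X) + H(Y) − H(X, Y), which is equivalent to H(Y) − H(Y | X). The function names are illustrative.

```python
from collections import Counter
from math import log2

def equal_width_bins(values, k):
    """Map each value to a bin index 0..k-1 using k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # min(..., k - 1) places the maximum value in the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def entropy(labels):
    """Entropy in bits of the empirical distribution of `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def binned_nmi(xs, ys, k):
    """Estimate NMI after discretising both variables into k bins each."""
    bx, by = equal_width_bins(xs, k), equal_width_bins(ys, k)
    h_x, h_y = entropy(bx), entropy(by)
    h_xy = entropy(list(zip(bx, by)))   # joint entropy of the bin pairs
    mi = h_x + h_y - h_xy
    return mi / min(h_x, h_y)

# Hypothetical continuous sample: 201 evenly spaced points, y = x squared.
xs = [-1 + 2 * i / 200 for i in range(201)]
ys = [x * x for x in xs]

for k in (4, 8, 16):
    print(f"k={k:2d} bins -> NMI estimate {binned_nmi(xs, ys, k):.3f}")
```

The estimate changes with the number of bins, which is the practical cost of using MI on continuous data: the bin count is a modelling choice, and the reported value should be read alongside it.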
Why is high explained variance in PCA not the same as relevance?
PCA picks directions purely by how much variance they capture, with no reference to any target variable. A component with 95% of the variance might be irrelevant to the thing you want to predict, while a low-variance direction could be the one that separates the classes. So discarding low-variance components can throw away the most class-discriminative information. The 2025 Q9 critique is exactly this: variance ≠ relevance, and you should also scale features first because PCA is variance-sensitive.
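A small constructed example may help make the critique concrete. The sketch below uses hypothetical 2-D data in which two features share a large common component and the class label only shifts their difference; it computes the principal axes from the closed-form eigen-decomposition of the 2×2 covariance matrix rather than calling a library, so every name here is illustrative.

```python
from math import sqrt

# Hypothetical data: both features follow a shared value t, and the class
# label only changes the small difference between the two features.
t_values = [-10, -6, -2, 2, 6, 10]
class0 = [(t - 0.5, t + 0.5) for t in t_values]
class1 = [(t + 0.5, t - 0.5) for t in t_values]
points = class0 + class1

def covariance_2d(pts):
    """Sample covariance entries (var_x, cov_xy, var_y) of 2-D points."""
    n = len(pts)
    mx = sum(p[0] for p in pts) / n
    my = sum(p[1] for p in pts) / n
    var_x = sum((p[0] - mx) ** 2 for p in pts) / (n - 1)
    var_y = sum((p[1] - my) ** 2 for p in pts) / (n - 1)
    cov_xy = sum((p[0] - mx) * (p[1] - my) for p in pts) / (n - 1)
    return var_x, cov_xy, var_y

def principal_axes(a, b, c):
    """Eigenvalues (largest first) and unit eigenvectors of [[a, b], [b, c]]."""
    radius = sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1, lam2 = (a + c) / 2 + radius, (a + c) / 2 - radius
    if radius == 0:                      # same variance in every direction
        v1 = (1.0, 0.0)
    else:
        raw = (lam1 - c, b) if a >= c else (b, lam1 - a)
        norm = sqrt(raw[0] ** 2 + raw[1] ** 2)
        v1 = (raw[0] / norm, raw[1] / norm)
    v2 = (-v1[1], v1[0])                 # orthogonal to v1
    return (lam1, v1), (lam2, v2)

def mean_projection(pts, v):
    """Average coordinate of the points along direction v."""
    return sum(p[0] * v[0] + p[1] * v[1] for p in pts) / len(pts)

(lam1, pc1), (lam2, pc2) = principal_axes(*covariance_2d(points))
explained_pc1 = lam1 / (lam1 + lam2)
gap_pc1 = abs(mean_projection(class0, pc1) - mean_projection(class1, pc1))
gap_pc2 = abs(mean_projection(class0, pc2) - mean_projection(class1, pc2))

print(f"PC1 explains {explained_pc1:.1%} of the variance")
print(f"class-mean gap on PC1: {gap_pc1:.3f}, on PC2: {gap_pc2:.3f}")
```

Both features have the same variance here, so standardising them would not change the picture: PC1 still follows the shared component and carries almost all the variance, while the class difference lives entirely on PC2. Keeping only PC1 in this constructed case would discard the one direction that separates the classes.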
How is this chapter examined in COMP20008?
As critique plus small calculation: give and justify a high-NMI/zero-Pearson example (2025 Q4), compute or reason about entropy/conditional entropy/MI from a small table, or critique a PCA decision such as keeping only PC1 because it has 95% variance (2025 Q9). The marks reward naming the right limitation — Pearson is linear-only, MI lacks direction, PCA variance ≠ relevance — rather than heavy derivation, since the formula sheet provides the formulas.
Exam move
Hold the measures in a clear hierarchy: Pearson = linear only (signed, simple); MI/NMI = any dependence (unsigned, needs binning). Memorise the y = x² archetype as your zero-Pearson/high-NMI example and the two limitations that go with it, because that pairing is almost always the answer. Be comfortable computing entropy H(X) = − Σ pᵢ log₂ pᵢ from category proportions and reasoning that H(Y|X) = 0 when X determines Y, since the formula sheet gives the forms but you must apply them. For PCA, rehearse the single sharpest critique — high variance ≠ relevance, so do not discard a low-variance but discriminative direction, and scale first. These critiques are reliable "evaluate" marks across the whole exam.
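As a rehearsal of the entropy step, here is one hand-worked instance. The category proportions 1/2, 1/4, 1/4 are hypothetical, chosen so that the logarithms are whole numbers.

```latex
\begin{align*}
H(X) &= -\sum_{i} p_i \log_2 p_i \\
     &= -\left( \tfrac{1}{2}\log_2 \tfrac{1}{2}
              + \tfrac{1}{4}\log_2 \tfrac{1}{4}
              + \tfrac{1}{4}\log_2 \tfrac{1}{4} \right) \\
     &= \tfrac{1}{2}(1) + \tfrac{1}{4}(2) + \tfrac{1}{4}(2)
      = 1.5 \text{ bits}
\end{align*}
```

This sits below the uniform three-category maximum of log₂ 3 ≈ 1.585 bits, consistent with entropy being largest when the categories are equally likely.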