University of Melbourne · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

COMP20008 · Elements Of Data Processing

50% final exam · hurdle · 14 Chapters · 7-page Bible

Our own words - no uploaded lecturer files

Built to mirror S1 2026 · updated this semester

Chapter 8 of 12 · COMP20008

Correlation, Entropy & Mutual Information

This chapter is about measuring how two variables relate — and why the choice of measure matters. Pearson r captures only linear association; entropy and conditional entropy quantify uncertainty in bits; and mutual information (rescaled to NMI) captures any dependence, including non-linear. It also covers PCA, which projects onto directions of maximum variance — with the warning that high explained variance is not the same as relevance. It is examined as critique + small calculation: the 2025 exam's Q4 asks for a high-NMI-but-zero-Pearson example (the y = x² archetype) and Q9 critiques discarding low-variance PCA components.

In this chapter

What this chapter covers

1. Pearson r: measures linear association only, range −1 ≤ r ≤ 1; blind to curved/non-monotone relations
2. Entropy H(X) = − Σ pᵢ log₂ pᵢ: bits of uncertainty; max when uniform, 0 when deterministic
3. Conditional entropy H(Y | X): uncertainty in Y once X is known
4. Mutual information MI(X, Y) = H(Y) − H(Y | X): reduction in uncertainty; captures any dependence; 0 iff independent
5. NMI = MI / min(H(X), H(Y)): rescales MI to 0 ≤ NMI ≤ 1
6. Pearson vs MI/NMI: y = x² over a symmetric range of x has near-zero Pearson but high NMI
7. PCA: project onto orthogonal directions of maximum variance; PC1 captures the most variance
8. The PCA caution: high explained variance ≠ relevance to the target; scale features first

Worked example · free

Near-zero Pearson but high NMI (mirrors 2025 Q4)

Q [3 marks]. Give an example of two variables with near-zero Pearson r but high mutual information (NMI), explain why each measure gives that value, and state one limitation of each measure.

[1 mark] Let x take the values −3, −2, −1, 0, 1, 2, 3 and let y = x². As x increases, y first falls then rises symmetrically, so the negative and positive contributions to the covariance cancel and Pearson r = 0 (on noisy real data, r ≈ 0).
[1 mark] But y is completely determined by x: knowing x removes all uncertainty about y, so H(Y | X) = 0 and MI = H(Y), which puts NMI at its maximum of 1. There is strong dependence even though there is no linear correlation.
[1 mark] Limitations: Pearson captures only linear association, so it misses curved or non-monotone structure; MI/NMI captures any dependence but needs binning/discretisation for continuous data and gives no sign or direction of the relationship.

y = x² over a symmetric range gives Pearson r ≈ 0 yet high NMI; Pearson misses non-linear structure, while MI/NMI detects it but needs binning and has no sign.

Sia tip — "High dependence, zero linear correlation" is the y = x² archetype the examiners reuse — memorise it as your go-to example, and always pair it with the two limitations (Pearson = linear-only; MI = no direction, needs binning).
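To check the worked example numerically, here is a minimal sketch in plain Python. It treats the seven x values and their squares as categories, so no binning is needed here. The helper names (`entropy`, `mutual_information`, `nmi`, `pearson`) are our own illustrations, not functions from the subject's materials.

```python
import math
from collections import Counter

def entropy(values):
    """Entropy in bits of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    # MI = H(X) + H(Y) - H(X, Y), which equals H(Y) - H(Y | X)
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def nmi(xs, ys):
    # Normalise by the smaller of the two entropies, as in this chapter
    return mutual_information(xs, ys) / min(entropy(xs), entropy(ys))

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]

print(pearson(x, y))          # 0.0: the covariance terms cancel exactly
print(round(entropy(x), 3))   # 2.807 bits (seven equally likely values)
print(round(entropy(y), 3))   # 1.95 bits (0 once; 1, 4, 9 twice each)
print(round(nmi(x, y), 6))    # 1.0: y is fully determined by x
```

Because H(Y | X) = 0, MI equals H(Y), and H(Y) is the smaller entropy, so the ratio reaches 1.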

Glossary

Key terms

Pearson correlation r: Measures the strength and direction of the linear relationship between two numeric variables, ranging −1 to +1. It is blind to non-linear or non-monotone structure, so a strong curved relationship can give r near 0.
Entropy H(X): The uncertainty of a categorical variable in bits, H(X) = − Σ pᵢ log₂ pᵢ. It is maximal when the categories are equally likely (uniform) and 0 when the outcome is certain (deterministic).
Conditional entropy H(Y | X): The remaining uncertainty about Y once X is known, averaged over the values of X. If X fully determines Y then H(Y | X) = 0; if X tells you nothing then H(Y | X) = H(Y).
Mutual information MI(X, Y): The reduction in uncertainty about Y from knowing X, MI = H(Y) − H(Y | X). It captures any dependence, including non-linear, and is 0 if and only if X and Y are independent.
Normalised mutual information (NMI): MI rescaled to the range 0–1 by dividing by min(H(X), H(Y)). NMI = 1 means the lower-entropy variable is fully determined by the other; NMI = 0 means independence. The fixed 0–1 scale is what makes NMI comparable across variable pairs.
PCA (Principal Components Analysis): A linear projection onto orthogonal directions of maximum variance: PC1 captures the most variance, PC2 the next, and so on. Used for visualisation and compression, but high variance on a component does not guarantee it is relevant to a target.
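The PCA definition above can be sketched with NumPy by taking the eigenvectors of the covariance matrix. This is an illustration under our own assumptions, not the subject's reference code. The function name `pca`, the synthetic features `a` and `b`, and their scales are made up to show why scaling matters.

```python
import numpy as np

def pca(X):
    """Return (explained_variance_ratio, components); components are columns."""
    Xc = X - X.mean(axis=0)                       # centre each feature
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]             # largest variance first
    return eigvals[order] / eigvals.sum(), eigvecs[:, order]

rng = np.random.default_rng(0)
z = rng.normal(size=1000)                         # shared underlying signal
a = 100.0 * (z + 0.5 * rng.normal(size=1000))     # hypothetical feature, large units
b = 1.0 * (z + 0.5 * rng.normal(size=1000))       # hypothetical feature, small units
X = np.column_stack([a, b])

ratio_raw, comps_raw = pca(X)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)      # standardise before PCA
ratio_std, comps_std = pca(X_std)

print("raw PC1 loadings:   ", np.round(comps_raw[:, 0], 3))
print("scaled PC1 loadings:", np.round(comps_std[:, 0], 3))
print("PC1 share raw / scaled:", np.round(ratio_raw[0], 4), np.round(ratio_std[0], 2))
```

On the raw data PC1 is almost entirely the large-unit feature. After standardising, both features load equally on PC1, which is the reason for scaling first.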

FAQ

Correlation, Entropy & Mutual Information FAQ

How can two variables have zero correlation but be strongly related?

Pearson r only measures linear association, so a relationship that is strong but curved or symmetric can give r ≈ 0. The classic example is y = x² over a symmetric range of x: y is fully determined by x, yet as x rises y falls then rises, cancelling the linear trend so r ≈ 0. Mutual information captures this dependence and is high, which is why the exam contrasts the two measures.
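The cancellation can be checked by hand. Taking x = −3, −2, …, 3 as an illustrative symmetric range with y = x²:

```latex
\[
\bar{x} = \frac{1}{7}\sum_{i} x_i = 0,
\qquad
\bar{y} = \frac{1}{7}\sum_{i} x_i^{2} = \frac{28}{7} = 4
\]
\[
\sum_{i} (x_i - \bar{x})(y_i - \bar{y})
  = \sum_{i} x_i^{3} - 4\sum_{i} x_i
  = 0 - 0 = 0
\]
\[
r = \frac{\sum_{i} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i} (x_i - \bar{x})^{2}\,\sum_{i} (y_i - \bar{y})^{2}}}
  = 0
\]
```

Both sums vanish because every negative x is matched by a positive x of the same size, so odd powers cancel.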

When should I use mutual information instead of Pearson r?

Use mutual information (or NMI) when the relationship may be non-linear or the variables are categorical, because MI detects any kind of dependence and is 0 only under independence. Use Pearson when you specifically want the strength and sign of a linear relationship between numeric variables. The trade-off: MI has no sign/direction and needs binning for continuous data, while Pearson is simple and signed but linear-only.
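The binning caveat is easiest to see on continuous data. The sketch below estimates NMI from a 2-D histogram. The function name `binned_nmi`, the equal-width bins, the bin counts (10 and 3) and the 5000 uniform samples are all our own assumptions for illustration, and the subject may prescribe a different discretisation.

```python
import numpy as np

def entropy_bits(p):
    """Entropy in bits of a probability array, ignoring zero cells."""
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def binned_nmi(x, y, bins=10):
    # Discretise both variables with equal-width bins, then use
    # MI = H(X) + H(Y) - H(X, Y) on the binned joint distribution.
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    joint = counts / counts.sum()
    hx = entropy_bits(joint.sum(axis=1))
    hy = entropy_bits(joint.sum(axis=0))
    hxy = entropy_bits(joint.ravel())
    return (hx + hy - hxy) / min(hx, hy)

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=5000)
y = x ** 2                      # fully determined by x, but not linearly

r = float(np.corrcoef(x, y)[0, 1])
nmi_10 = binned_nmi(x, y, bins=10)
nmi_3 = binned_nmi(x, y, bins=3)

print(f"Pearson r       = {r:.3f}")
print(f"NMI with 10 bins = {nmi_10:.2f}")
print(f"NMI with 3 bins  = {nmi_3:.2f}")
```

Pearson stays close to zero, while the NMI estimate is clearly positive but changes with the number of bins. It also falls short of 1, even though y is an exact function of x, because each x bin spreads over several y bins.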

Why is high explained variance in PCA not the same as relevance?

PCA picks directions purely by how much variance they capture, with no reference to any target variable. A component with 95% of the variance might be irrelevant to the thing you want to predict, while a low-variance direction could be the one that separates the classes. So discarding low-variance components can throw away the most class-discriminative information. The 2025 Q9 critique is exactly this: variance ≠ relevance, and you should also scale features first because PCA is variance-sensitive.
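A small synthetic experiment illustrates the critique. Everything here is an assumption made up for the demonstration: five strongly correlated features that carry no class information, one feature that separates two classes, and a helper `class_gap` measuring the distance between class means in standard-deviation units. Features are standardised first, so the effect is not just a units problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
labels = np.repeat([0, 1], n)                     # two classes, n points each

# Five near-duplicate features driven by one latent factor unrelated to the class
z = rng.normal(size=2 * n)
noise_feats = np.column_stack([z + 0.2 * rng.normal(size=2 * n) for _ in range(5)])
# One feature whose mean differs between the classes
signal = np.where(labels == 1, 1.0, -1.0) + 0.3 * rng.normal(size=2 * n)

X = np.column_stack([noise_feats, signal])
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardise every feature

eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]                 # largest variance first
explained = eigvals[order] / eigvals.sum()
scores = X @ eigvecs[:, order]                    # data projected onto the PCs

def class_gap(col):
    # Distance between the two class means, in units of the column's std
    return abs(col[labels == 1].mean() - col[labels == 0].mean()) / col.std()

gap_pc1 = class_gap(scores[:, 0])
gap_pc2 = class_gap(scores[:, 1])
print("explained variance:", np.round(explained[:2], 2))
print("class gap on PC1:", round(gap_pc1, 2), " on PC2:", round(gap_pc2, 2))
```

PC1 takes most of the variance because the redundant features move together, yet the classes barely differ along it. The separation sits on PC2, which a "keep only PC1" rule would discard.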

How is this chapter examined in COMP20008?

As critique plus small calculation: give and justify a high-NMI/zero-Pearson example (2025 Q4), compute or reason about entropy/conditional entropy/MI from a small table, or critique a PCA decision such as keeping only PC1 because it has 95% variance (2025 Q9). The marks reward naming the right limitation — Pearson is linear-only, MI lacks direction, PCA variance ≠ relevance — rather than heavy derivation, since the formula sheet provides the formulas.

Study strategy

Exam move

Hold the measures in a clear hierarchy: Pearson = linear only (signed, simple); MI/NMI = any dependence (unsigned, needs binning). Memorise the y = x² archetype as your zero-Pearson/high-NMI example and the two limitations that go with it, because that pairing is almost always the answer. Be comfortable computing entropy H(X) = − Σ pᵢ log₂ pᵢ from category proportions and reasoning that H(Y|X) = 0 when X determines Y, since the formula sheet gives the forms but you must apply them. For PCA, rehearse the single sharpest critique — high variance ≠ relevance, so do not discard a low-variance but discriminative direction, and scale first. These critiques are reliable "evaluate" marks across the whole exam.
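For practice with the "small calculation" style, here is a sketch that works through entropy, conditional entropy, MI and NMI from a tiny table. The counts are hypothetical, invented for this example and not taken from any past paper.

```python
import math

# Hypothetical joint counts: rows are values of X, columns are values of Y
counts = {
    "a": {"yes": 4, "no": 0},   # when X = a, Y is always "yes"
    "b": {"yes": 2, "no": 2},   # when X = b, Y is a coin flip
}

def entropy(probs):
    """Entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

n = sum(sum(row.values()) for row in counts.values())
p_x = {x: sum(row.values()) / n for x, row in counts.items()}
p_y = {y: sum(row[y] for row in counts.values()) / n for y in ("yes", "no")}

h_x = entropy(p_x.values())
h_y = entropy(p_y.values())
# H(Y | X) = sum over x of p(x) * H(Y | X = x)
h_y_given_x = sum(
    p_x[x] * entropy(c / sum(row.values()) for c in row.values())
    for x, row in counts.items()
)
mi = h_y - h_y_given_x
nmi = mi / min(h_x, h_y)

print(f"H(X)   = {h_x:.3f}")          # 1.000
print(f"H(Y)   = {h_y:.3f}")          # 0.811
print(f"H(Y|X) = {h_y_given_x:.3f}")  # 0.500
print(f"MI     = {mi:.3f}")           # 0.311
print(f"NMI    = {nmi:.3f}")          # 0.384
```

The row for X = a contributes zero conditional entropy because Y is certain there. All the remaining uncertainty comes from the 50/50 row for X = b, weighted by p(X = b) = 0.5.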
