COMP5318 · Machine Learning and Data Mining
Decision Trees & Ensembles
This Week 5 topic of COMP5318 Machine Learning and Data Mining at the University of Sydney builds a classifier you can read as if-then rules: a decision tree that splits, at each node, on the feature with the highest information gain — the largest drop in entropy. It then controls overfitting with pruning and scales a single tree into a strong ensemble: bagging and random forests average independent trees to cut variance, while boosting trains trees in sequence to cut bias. Entropy and information-gain calculations are recurring short numeric problems in the closed-book final.
What this chapter covers
- 01Compute the entropy H(S) = −Σ pᵢ log₂ pᵢ of a labelled node in bits, using 0·log₂(0) = 0
- 02Read the entropy scale: 0 at a pure node, a maximum of 1 bit at a two-class 50/50 split
- 03Calculate the information gain of an attribute as parent entropy minus weighted child entropy
- 04Apply the split rule: choose the attribute with the highest information gain
- 05Recognise that information gain is biased toward many-valued (ID-like) attributes, and how gain ratio fixes it
- 06Handle numeric features by binary thresholds, and know the Gini index as an alternative impurity
- 07Contrast pre-pruning (early stopping) with post-pruning (grow then cut) and why a validation set is used
- 08Explain bagging: bootstrap samples, parallel independent trees, majority vote, variance reduction
- 09See how a random forest adds a random feature subset at each split to decorrelate trees
- 10Contrast boosting (sequential, reweighted, bias-reducing) with bagging (parallel, variance-reducing)
Entropy, information gain, and choosing the split
- +1Parent entropy: p(+) = 6/10 = 0.6, p(−) = 4/10 = 0.4. H(S) = −0.6·log₂(0.6) − 0.4·log₂(0.4) = 0.6(0.737) + 0.4(1.322) = 0.442 + 0.529 = 0.971 bits.
- +1Region children: North (4+,1−) has H = −0.8·log₂(0.8) − 0.2·log₂(0.2) = 0.258 + 0.464 = 0.722. South (2+,3−) is a 0.4/0.6 split, so H = 0.971 (same as the parent).
- +1Gain(Region): weighted child entropy = (5/10)(0.722) + (5/10)(0.971) = 0.361 + 0.486 = 0.846. Gain = 0.971 − 0.846 = 0.125 bits.
- +1Member children: Member=yes (5+,0−) is pure, so H = 0. Member=no (1+,4−) has H = −0.2·log₂(0.2) − 0.8·log₂(0.8) = 0.464 + 0.258 = 0.722.
- +1Gain(Member): weighted child entropy = (5/10)(0) + (5/10)(0.722) = 0 + 0.361 = 0.361. Gain = 0.971 − 0.361 = 0.610 bits.
- +1Decide: Gain(Member) = 0.610 > Gain(Region) = 0.125, so the tree splits on Member — it produces a pure branch and removes far more uncertainty.
Key terms
- Entropy
- H(S) = −Σ pᵢ log₂ pᵢ, the mix of class labels in a node measured in bits; 0 at a pure node, a maximum of 1 bit at a two-class 50/50 split, with the convention 0·log₂(0) = 0.
- Information gain
- Gain(S,A) = H(S) − Σₖ (|Sₖ|/|S|)·H(Sₖ), the drop in entropy from splitting on attribute A; the tree splits on the attribute with the highest gain.
- High-cardinality bias
- Information gain favours attributes with many distinct values (e.g. an ID column), which can split data into pure singletons that memorise rather than generalise; gain ratio penalises many-valued splits to counter it.
- Gini index
- Gini(S) = 1 − Σ pᵢ², an alternative node impurity used by CART-style trees; 0 at a pure node and 0.5 at a two-class 50/50 split, choosing almost the same splits as entropy without a logarithm.
- Pruning
- Reducing tree size to prevent overfitting: pre-pruning stops growth early, while post-pruning grows the full tree then removes unhelpful branches using a validation set (often preferred).
- Bagging
- Bootstrap AGGregatING: train many trees in parallel on bootstrap samples (sampled with replacement) and combine them by majority vote or averaging, which reduces variance.
- Random forest
- Bagging of decision trees with an added twist: at each split only a random subset of k < m features is considered, decorrelating the trees for a bigger variance reduction.
- Boosting
- Training learners sequentially, each reweighting the examples the previous ones got wrong, then combining them with weights; it reduces bias but is sensitive to noisy data.
Decision Trees & Ensembles FAQ
How do I decide which attribute a decision tree splits on?
Compute the information gain of every candidate attribute at that node and split on the one with the highest gain. Information gain is the parent entropy minus the weighted average of the child entropies, where each child is weighted by its share of the examples (|Sₖ|/|S|), not a plain average. Because a pure child has entropy 0, an attribute that produces a pure branch tends to win. Watch the bias: information gain favours many-valued, ID-like attributes that split data into near-unique singletons — those look perfect but only memorise the training set, which is why gain ratio or dropping ID columns is used.
What is the difference between bagging and boosting?
Bagging trains many trees in parallel and independently, each on a bootstrap sample (sampled with replacement), then combines them by an equal-weight majority vote; it mainly reduces variance and is robust to noise. A random forest is a bagging method that also picks a random feature subset at each split. Boosting instead trains learners in sequence, each one reweighting the examples the previous learners misclassified, and combines them with weights; it mainly reduces bias but is more sensitive to noisy or mislabelled points. A common exam slip is calling a random forest a boosting method — it is bagging.
Can AI help me with decision trees and ensembles in COMP5318?
Yes, for learning the method. Sia can explain each step — how entropy is calculated, why you take the weighted average of child entropies, how a random forest decorrelates trees — and walk you through a practice information-gain problem with your own numbers so you can check your reasoning and rehearse the derivation by hand. Keep AI to understanding and revision rather than obtaining answers to submitted assignments or the closed-book exam, and acknowledge any AI tools you use in assessable work as the unit requires; the exam allows only a non-programmable calculator, so the goal is to be able to do the calculation yourself.
Studying with AI? Sia — free AI machine learning tutor works through COMP5318 step by step.
Exam move
Make the entropy-to-information-gain calculation automatic, because it is the reliable numeric mark here. Practise on small nodes: write p for each class, compute H(S) = −Σ pᵢ log₂ pᵢ in bits, then for each candidate attribute compute the weighted average of the child entropies and subtract it from the parent entropy — and remember to weight each child by its share of the examples. Memorise a few anchor values (pure = 0, 50/50 = 1 bit, an 0.8/0.2 split ≈ 0.722, an 0.6/0.4 split ≈ 0.971) so you can check your work quickly with a non-programmable calculator. For the conceptual marks, lock down the contrasts the exam frames: entropy versus Gini (max 1 bit versus 0.5), pre- versus post-pruning (and that both use a validation set, never the test set), and bagging versus boosting (parallel and variance-reducing versus sequential and bias-reducing, with random forest a bagging method). The final exam is 2 hours, closed book, with a non-programmable calculator only, so budget roughly one minute per mark and always show the formula before the number. Keep AI tutoring to rehearsing the method, and confirm the exact exam date on the Canvas exam timetable.