University of Sydney · FACULTY OF COMPUTER SCIENCE

COMP5318 · Machine Learning and Data Mining

- one subject, every graph, every model, every mark
Computer Science14 Chapters8-page Bible
Our own words - no uploaded lecturer files
Updated for this semester
Chapter 5 of 11 · COMP5318

Decision Trees & Ensembles

This Week 5 topic of COMP5318 Machine Learning and Data Mining at the University of Sydney builds a classifier you can read as if-then rules: a decision tree that splits, at each node, on the feature with the highest information gain — the largest drop in entropy. It then controls overfitting with pruning and scales a single tree into a strong ensemble: bagging and random forests average independent trees to cut variance, while boosting trains trees in sequence to cut bias. Entropy and information-gain calculations are recurring short numeric problems in the closed-book final.

In this chapter

What this chapter covers

  • 01Compute the entropy H(S) = −Σ pᵢ log₂ pᵢ of a labelled node in bits, using 0·log₂(0) = 0
  • 02Read the entropy scale: 0 at a pure node, a maximum of 1 bit at a two-class 50/50 split
  • 03Calculate the information gain of an attribute as parent entropy minus weighted child entropy
  • 04Apply the split rule: choose the attribute with the highest information gain
  • 05Recognise that information gain is biased toward many-valued (ID-like) attributes, and how gain ratio fixes it
  • 06Handle numeric features by binary thresholds, and know the Gini index as an alternative impurity
  • 07Contrast pre-pruning (early stopping) with post-pruning (grow then cut) and why a validation set is used
  • 08Explain bagging: bootstrap samples, parallel independent trees, majority vote, variance reduction
  • 09See how a random forest adds a random feature subset at each split to decorrelate trees
  • 10Contrast boosting (sequential, reweighted, bias-reducing) with bagging (parallel, variance-reducing)
Worked example · free

Entropy, information gain, and choosing the split

Q [6 marks]. A decision-tree node holds 10 customers labelled Buy (+) or No-buy (−): 6 are Buy and 4 are No-buy. Two binary attributes are candidates. Region splits them into North (4 Buy, 1 No-buy) and South (2 Buy, 3 No-buy). Member splits them into Member=yes (5 Buy, 0 No-buy) and Member=no (1 Buy, 4 No-buy). Compute the parent entropy and the information gain of each attribute, and say which attribute the tree should split on.
  • +1Parent entropy: p(+) = 6/10 = 0.6, p(−) = 4/10 = 0.4. H(S) = −0.6·log₂(0.6) − 0.4·log₂(0.4) = 0.6(0.737) + 0.4(1.322) = 0.442 + 0.529 = 0.971 bits.
  • +1Region children: North (4+,1−) has H = −0.8·log₂(0.8) − 0.2·log₂(0.2) = 0.258 + 0.464 = 0.722. South (2+,3−) is a 0.4/0.6 split, so H = 0.971 (same as the parent).
  • +1Gain(Region): weighted child entropy = (5/10)(0.722) + (5/10)(0.971) = 0.361 + 0.486 = 0.846. Gain = 0.971 − 0.846 = 0.125 bits.
  • +1Member children: Member=yes (5+,0−) is pure, so H = 0. Member=no (1+,4−) has H = −0.2·log₂(0.2) − 0.8·log₂(0.8) = 0.464 + 0.258 = 0.722.
  • +1Gain(Member): weighted child entropy = (5/10)(0) + (5/10)(0.722) = 0 + 0.361 = 0.361. Gain = 0.971 − 0.361 = 0.610 bits.
  • +1Decide: Gain(Member) = 0.610 > Gain(Region) = 0.125, so the tree splits on Member — it produces a pure branch and removes far more uncertainty.
The parent entropy is 0.971 bits; Gain(Region) = 0.125 bits and Gain(Member) = 0.610 bits, so the tree splits on Member (the highest information gain). Check: both gains are ≥ 0, and the attribute with a pure child (Member=yes) gives the larger gain, exactly as expected — a pure branch has entropy 0, which drives the weighted term down and the gain up.
Sia tip — Always use the weighted average of the child entropies (each multiplied by its share of the examples), not a plain average — forgetting the weights is the most common lost mark. Learn the two entropies that keep reappearing: an 0.8/0.2 split ≈ 0.722 bits and a 0.6/0.4 split ≈ 0.971 bits, with 0 for a pure node and 1 for a 50/50 node, so you can sanity-check most answers without redoing every logarithm.
Glossary

Key terms

Entropy
H(S) = −Σ pᵢ log₂ pᵢ, the mix of class labels in a node measured in bits; 0 at a pure node, a maximum of 1 bit at a two-class 50/50 split, with the convention 0·log₂(0) = 0.
Information gain
Gain(S,A) = H(S) − Σₖ (|Sₖ|/|S|)·H(Sₖ), the drop in entropy from splitting on attribute A; the tree splits on the attribute with the highest gain.
High-cardinality bias
Information gain favours attributes with many distinct values (e.g. an ID column), which can split data into pure singletons that memorise rather than generalise; gain ratio penalises many-valued splits to counter it.
Gini index
Gini(S) = 1 − Σ pᵢ², an alternative node impurity used by CART-style trees; 0 at a pure node and 0.5 at a two-class 50/50 split, choosing almost the same splits as entropy without a logarithm.
Pruning
Reducing tree size to prevent overfitting: pre-pruning stops growth early, while post-pruning grows the full tree then removes unhelpful branches using a validation set (often preferred).
Bagging
Bootstrap AGGregatING: train many trees in parallel on bootstrap samples (sampled with replacement) and combine them by majority vote or averaging, which reduces variance.
Random forest
Bagging of decision trees with an added twist: at each split only a random subset of k < m features is considered, decorrelating the trees for a bigger variance reduction.
Boosting
Training learners sequentially, each reweighting the examples the previous ones got wrong, then combining them with weights; it reduces bias but is sensitive to noisy data.
FAQ

Decision Trees & Ensembles FAQ

How do I decide which attribute a decision tree splits on?

Compute the information gain of every candidate attribute at that node and split on the one with the highest gain. Information gain is the parent entropy minus the weighted average of the child entropies, where each child is weighted by its share of the examples (|Sₖ|/|S|), not a plain average. Because a pure child has entropy 0, an attribute that produces a pure branch tends to win. Watch the bias: information gain favours many-valued, ID-like attributes that split data into near-unique singletons — those look perfect but only memorise the training set, which is why gain ratio or dropping ID columns is used.

What is the difference between bagging and boosting?

Bagging trains many trees in parallel and independently, each on a bootstrap sample (sampled with replacement), then combines them by an equal-weight majority vote; it mainly reduces variance and is robust to noise. A random forest is a bagging method that also picks a random feature subset at each split. Boosting instead trains learners in sequence, each one reweighting the examples the previous learners misclassified, and combines them with weights; it mainly reduces bias but is more sensitive to noisy or mislabelled points. A common exam slip is calling a random forest a boosting method — it is bagging.

Can AI help me with decision trees and ensembles in COMP5318?

Yes, for learning the method. Sia can explain each step — how entropy is calculated, why you take the weighted average of child entropies, how a random forest decorrelates trees — and walk you through a practice information-gain problem with your own numbers so you can check your reasoning and rehearse the derivation by hand. Keep AI to understanding and revision rather than obtaining answers to submitted assignments or the closed-book exam, and acknowledge any AI tools you use in assessable work as the unit requires; the exam allows only a non-programmable calculator, so the goal is to be able to do the calculation yourself.

Studying with AI? Sia — free AI machine learning tutor works through COMP5318 step by step.

Study strategy

Exam move

Make the entropy-to-information-gain calculation automatic, because it is the reliable numeric mark here. Practise on small nodes: write p for each class, compute H(S) = −Σ pᵢ log₂ pᵢ in bits, then for each candidate attribute compute the weighted average of the child entropies and subtract it from the parent entropy — and remember to weight each child by its share of the examples. Memorise a few anchor values (pure = 0, 50/50 = 1 bit, an 0.8/0.2 split ≈ 0.722, an 0.6/0.4 split ≈ 0.971) so you can check your work quickly with a non-programmable calculator. For the conceptual marks, lock down the contrasts the exam frames: entropy versus Gini (max 1 bit versus 0.5), pre- versus post-pruning (and that both use a validation set, never the test set), and bagging versus boosting (parallel and variance-reducing versus sequential and bias-reducing, with random forest a bagging method). The final exam is 2 hours, closed book, with a non-programmable calculator only, so budget roughly one minute per mark and always show the formula before the number. Keep AI tutoring to rehearsing the method, and confirm the exact exam date on the Canvas exam timetable.

A+Everything unlocked
Unlocks this Bible + all 25 of your University of Sydney subjects - and 1,000+ Bibles across every Australian university.
Sia - your COMP5318 tutor, unlimited, worked the way the exam marks it
The full 8-page Bible + practice bank with worked solutions
Chrome extension - sync your LMS so Sia knows your deadlines
Bilingual EN / Chinese on every Bible and every Sia answer
$25/ month
30-day money-back · cancel in one tap · how it works
Unlock the full COMP5318 Bible + 25 University of Sydney subjects解锁完整 COMP5318 Bible + University of Sydney 25 门科目
$25/mo