University of Sydney · FACULTY OF COMPUTER SCIENCE

COMP5318 · Machine Learning and Data Mining

Q: How do I decide which attribute a decision tree splits on?

Compute the information gain of every candidate attribute at that node and split on the one with the highest gain. Information gain is the parent entropy minus the weighted average of the child entropies, where each child is weighted by its share of the examples (|Sₖ|/|S|), not a plain average. Because a pure child has entropy 0, an attribute that produces a pure branch tends to win. Watch the bias: information gain favours many-valued, ID-like attributes that split data into near-unique singletons — those look perfect but only memorise the training set, which is why gain ratio or dropping ID columns is used.

Q: What is the difference between bagging and boosting?

Bagging trains many trees in parallel and independently, each on a bootstrap sample (sampled with replacement), then combines them by an equal-weight majority vote; it mainly reduces variance and is robust to noise. A random forest is a bagging method that also picks a random feature subset at each split. Boosting instead trains learners in sequence, each one reweighting the examples the previous learners misclassified, and combines them with weights; it mainly reduces bias but is more sensitive to noisy or mislabelled points. A common exam slip is calling a random forest a boosting method — it is bagging.

Q: Can AI help me with decision trees and ensembles in COMP5318?

Yes, for learning the method. Sia can explain each step — how entropy is calculated, why you take the weighted average of child entropies, how a random forest decorrelates trees — and walk you through a practice information-gain problem with your own numbers so you can check your reasoning and rehearse the derivation by hand. Keep AI to understanding and revision rather than obtaining answers to submitted assignments or the closed-book exam, and acknowledge any AI tools you use in assessable work as the unit requires; the exam allows only a non-programmable calculator, so the goal is to be able to do the calculation yourself.

- one subject, every graph, every model, every mark

Computer Science14 Chapters8-page Bible

Our own words - no uploaded lecturer files

Updated for this semester

Chapter 5 of 11 · COMP5318

Decision Trees & Ensembles

This Week 5 topic of COMP5318 Machine Learning and Data Mining at the University of Sydney builds a classifier you can read as if-then rules: a decision tree that splits, at each node, on the feature with the highest information gain — the largest drop in entropy. It then controls overfitting with pruning and scales a single tree into a strong ensemble: bagging and random forests average independent trees to cut variance, while boosting trains trees in sequence to cut bias. Entropy and information-gain calculations are recurring short numeric problems in the closed-book final.

In this chapter

What this chapter covers

01Compute the entropy H(S) = −Σ pᵢ log₂ pᵢ of a labelled node in bits, using 0·log₂(0) = 0
02Read the entropy scale: 0 at a pure node, a maximum of 1 bit at a two-class 50/50 split
03Calculate the information gain of an attribute as parent entropy minus weighted child entropy
04Apply the split rule: choose the attribute with the highest information gain
05Recognise that information gain is biased toward many-valued (ID-like) attributes, and how gain ratio fixes it
06Handle numeric features by binary thresholds, and know the Gini index as an alternative impurity
07Contrast pre-pruning (early stopping) with post-pruning (grow then cut) and why a validation set is used
08Explain bagging: bootstrap samples, parallel independent trees, majority vote, variance reduction
09See how a random forest adds a random feature subset at each split to decorrelate trees
10Contrast boosting (sequential, reweighted, bias-reducing) with bagging (parallel, variance-reducing)

Worked example · free

Entropy, information gain, and choosing the split

Q [6 marks]. A decision-tree node holds 10 customers labelled Buy (+) or No-buy (−): 6 are Buy and 4 are No-buy. Two binary attributes are candidates. Region splits them into North (4 Buy, 1 No-buy) and South (2 Buy, 3 No-buy). Member splits them into Member=yes (5 Buy, 0 No-buy) and Member=no (1 Buy, 4 No-buy). Compute the parent entropy and the information gain of each attribute, and say which attribute the tree should split on.

+1Parent entropy: p(+) = 6/10 = 0.6, p(−) = 4/10 = 0.4. H(S) = −0.6·log₂(0.6) − 0.4·log₂(0.4) = 0.6(0.737) + 0.4(1.322) = 0.442 + 0.529 = 0.971 bits.
+1Region children: North (4+,1−) has H = −0.8·log₂(0.8) − 0.2·log₂(0.2) = 0.258 + 0.464 = 0.722. South (2+,3−) is a 0.4/0.6 split, so H = 0.971 (same as the parent).
+1Gain(Region): weighted child entropy = (5/10)(0.722) + (5/10)(0.971) = 0.361 + 0.486 = 0.846. Gain = 0.971 − 0.846 = 0.125 bits.
+1Member children: Member=yes (5+,0−) is pure, so H = 0. Member=no (1+,4−) has H = −0.2·log₂(0.2) − 0.8·log₂(0.8) = 0.464 + 0.258 = 0.722.
+1Gain(Member): weighted child entropy = (5/10)(0) + (5/10)(0.722) = 0 + 0.361 = 0.361. Gain = 0.971 − 0.361 = 0.610 bits.
+1Decide: Gain(Member) = 0.610 > Gain(Region) = 0.125, so the tree splits on Member — it produces a pure branch and removes far more uncertainty.

The parent entropy is 0.971 bits; Gain(Region) = 0.125 bits and Gain(Member) = 0.610 bits, so the tree splits on Member (the highest information gain). Check: both gains are ≥ 0, and the attribute with a pure child (Member=yes) gives the larger gain, exactly as expected — a pure branch has entropy 0, which drives the weighted term down and the gain up.

Sia tip — Always use the weighted average of the child entropies (each multiplied by its share of the examples), not a plain average — forgetting the weights is the most common lost mark. Learn the two entropies that keep reappearing: an 0.8/0.2 split ≈ 0.722 bits and a 0.6/0.4 split ≈ 0.971 bits, with 0 for a pure node and 1 for a 50/50 node, so you can sanity-check most answers without redoing every logarithm.

Glossary

Key terms

Entropy: H(S) = −Σ pᵢ log₂ pᵢ, the mix of class labels in a node measured in bits; 0 at a pure node, a maximum of 1 bit at a two-class 50/50 split, with the convention 0·log₂(0) = 0.
Information gain: Gain(S,A) = H(S) − Σₖ (|Sₖ|/|S|)·H(Sₖ), the drop in entropy from splitting on attribute A; the tree splits on the attribute with the highest gain.
High-cardinality bias: Information gain favours attributes with many distinct values (e.g. an ID column), which can split data into pure singletons that memorise rather than generalise; gain ratio penalises many-valued splits to counter it.
Gini index: Gini(S) = 1 − Σ pᵢ², an alternative node impurity used by CART-style trees; 0 at a pure node and 0.5 at a two-class 50/50 split, choosing almost the same splits as entropy without a logarithm.
Pruning: Reducing tree size to prevent overfitting: pre-pruning stops growth early, while post-pruning grows the full tree then removes unhelpful branches using a validation set (often preferred).
Bagging: Bootstrap AGGregatING: train many trees in parallel on bootstrap samples (sampled with replacement) and combine them by majority vote or averaging, which reduces variance.
Random forest: Bagging of decision trees with an added twist: at each split only a random subset of k < m features is considered, decorrelating the trees for a bigger variance reduction.
Boosting: Training learners sequentially, each reweighting the examples the previous ones got wrong, then combining them with weights; it reduces bias but is sensitive to noisy data.

FAQ

Decision Trees & Ensembles FAQ

How do I decide which attribute a decision tree splits on?

Compute the information gain of every candidate attribute at that node and split on the one with the highest gain. Information gain is the parent entropy minus the weighted average of the child entropies, where each child is weighted by its share of the examples (|Sₖ|/|S|), not a plain average. Because a pure child has entropy 0, an attribute that produces a pure branch tends to win. Watch the bias: information gain favours many-valued, ID-like attributes that split data into near-unique singletons — those look perfect but only memorise the training set, which is why gain ratio or dropping ID columns is used.

What is the difference between bagging and boosting?

Bagging trains many trees in parallel and independently, each on a bootstrap sample (sampled with replacement), then combines them by an equal-weight majority vote; it mainly reduces variance and is robust to noise. A random forest is a bagging method that also picks a random feature subset at each split. Boosting instead trains learners in sequence, each one reweighting the examples the previous learners misclassified, and combines them with weights; it mainly reduces bias but is more sensitive to noisy or mislabelled points. A common exam slip is calling a random forest a boosting method — it is bagging.

Can AI help me with decision trees and ensembles in COMP5318?

Yes, for learning the method. Sia can explain each step — how entropy is calculated, why you take the weighted average of child entropies, how a random forest decorrelates trees — and walk you through a practice information-gain problem with your own numbers so you can check your reasoning and rehearse the derivation by hand. Keep AI to understanding and revision rather than obtaining answers to submitted assignments or the closed-book exam, and acknowledge any AI tools you use in assessable work as the unit requires; the exam allows only a non-programmable calculator, so the goal is to be able to do the calculation yourself.

Studying with AI? Sia — free AI machine learning tutor works through COMP5318 step by step.

Study strategy

Exam move

Make the entropy-to-information-gain calculation automatic, because it is the reliable numeric mark here. Practise on small nodes: write p for each class, compute H(S) = −Σ pᵢ log₂ pᵢ in bits, then for each candidate attribute compute the weighted average of the child entropies and subtract it from the parent entropy — and remember to weight each child by its share of the examples. Memorise a few anchor values (pure = 0, 50/50 = 1 bit, an 0.8/0.2 split ≈ 0.722, an 0.6/0.4 split ≈ 0.971) so you can check your work quickly with a non-programmable calculator. For the conceptual marks, lock down the contrasts the exam frames: entropy versus Gini (max 1 bit versus 0.5), pre- versus post-pruning (and that both use a validation set, never the test set), and bagging versus boosting (parallel and variance-reducing versus sequential and bias-reducing, with random forest a bagging method). The final exam is 2 hours, closed book, with a non-programmable calculator only, so budget roughly one minute per mark and always show the formula before the number. Keep AI tutoring to rehearsing the method, and confirm the exact exam date on the Canvas exam timetable.

Keep going — explore the course

A+Everything unlocked

Unlocks this Bible + all 25 of your University of Sydney subjects - and 1,000+ Bibles across every Australian university.

Sia - your COMP5318 tutor, unlimited, worked the way the exam marks it

The full 8-page Bible + practice bank with worked solutions

Chrome extension - sync your LMS so Sia knows your deadlines

Bilingual EN / Chinese on every Bible and every Sia answer

$25/ month

30-day money-back · cancel in one tap · how it works