University of Melbourne · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

COMP20008 · Elements Of Data Processing

- one subject, every graph, every model, every mark

50% final exam · hurdle · 14 Chapters · 8-page Bible

Our own words - no uploaded lecturer files

Built to mirror S1 2026 · updated this semester

Chapter 5 of 12 · COMP20008

Bag-of-Words, TF-IDF & Visualisation

Once text is tokenised, Bag-of-Words turns it into count vectors (order discarded) that you can compare with Euclidean distance, and TF-IDF reweights those counts so that distinctive words, not ubiquitous ones, drive similarity. This chapter also covers visualisation: chart-type choice and the binning/aggregation effect that can exaggerate a trend. It is examined as small by-hand computation plus critique: the 2024 exam's Q4 builds BoW vectors and computes Euclidean distance, while the 2025 exam's Q1 and Q3 probe binning and why pure TF clusters poorly.

In this chapter

What this chapter covers

1. Bag-of-Words: a document as a vector of word counts over a fixed vocabulary; order discarded
2. Windowed co-occurrence vectors: a word represented by counts of context words within a window
3. Euclidean distance between two BoW vectors: d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )
4. BoW positives (simple, interpretable, fixed-length, language-agnostic) vs negatives (loses order, sparse, common words dominate)
5. TF vs TF-IDF: down-weight words common across all documents, up-weight distinctive ones
6. Why pure TF clusters topics poorly and IDF fixes it
7. Chart-type choice: bar, line, scatter, boxplot, heatmap
8. The binning/aggregation effect: equal-frequency binned averages can smooth and exaggerate a trend

Worked example

Why pure TF fails for topic clustering and TF-IDF fixes it (mirrors 2025 Q3c–d)

Q [4 marks]. You cluster cooking blog posts by topic using raw term-frequency (TF) Bag-of-Words vectors and get poor clusters. Explain why pure TF is ineffective here, and name the scoring change that fixes it (no formula required).

[1 mark] Pure TF lets ubiquitous words dominate every vector: words like "the", "recipe", "cup" and "minutes" appear in essentially every post, so they get large counts in all documents.
[1 mark] Because those high-frequency, low-information words swamp the distinctive topic words ("sourdough", "tempering", "marinade"), every document ends up looking similar regardless of its actual topic, so clusters are poor.
[1 mark] The fix is to weight term frequency by inverse document frequency (TF-IDF), which down-weights words that appear in many documents and up-weights words that appear in few.
[1 mark] After TF-IDF the distinctive, topic-specific words carry the most weight and drive the similarity, so posts about the same dish cluster together. "Distinguishes topics" is the role IDF plays.

Pure TF is dominated by common words that appear everywhere, so all documents look alike; TF-IDF reweights by inverse document frequency so distinctive words dominate and topics separate.
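The effect described above can be sketched in a few lines of Python. The four toy posts, their names and their word counts are invented for illustration and do not come from the exam; the IDF used is the plain log(N / df) form from the glossary, with natural log assumed.

```python
import math
from collections import Counter

# Hypothetical toy corpus: two sourdough posts and two marinade posts.
# The "long" posts repeat the filler word "the" many times.
posts = {
    "sourdough_long": "the the the the the recipe sourdough",
    "sourdough_short": "the recipe sourdough",
    "marinade_long": "the the the the the recipe marinade",
    "marinade_short": "the recipe marinade",
}

tokens = {name: text.split() for name, text in posts.items()}
vocabulary = sorted({w for words in tokens.values() for w in words})
n_docs = len(tokens)

# Document frequency: in how many posts does each word occur?
df = {w: sum(w in words for words in tokens.values()) for w in vocabulary}
# idf(t) = log(N / df(t)); a word found in every post gets weight 0.
idf = {w: math.log(n_docs / df[w]) for w in vocabulary}

def tf_vector(name):
    counts = Counter(tokens[name])
    return [counts[w] for w in vocabulary]

def tfidf_vector(name):
    return [tf * idf[w] for tf, w in zip(tf_vector(name), vocabulary)]

def distance(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Same topic (sourdough vs sourdough) and different topic (sourdough vs marinade).
tf_same = distance(tf_vector("sourdough_long"), tf_vector("sourdough_short"))
tf_cross = distance(tf_vector("sourdough_long"), tf_vector("marinade_long"))
tfidf_same = distance(tfidf_vector("sourdough_long"), tfidf_vector("sourdough_short"))
tfidf_cross = distance(tfidf_vector("sourdough_long"), tfidf_vector("marinade_long"))

print(round(tf_same, 3), round(tf_cross, 3))        # 4.0 1.414
print(round(tfidf_same, 3), round(tfidf_cross, 3))  # 0.0 0.98
```

With raw TF the long sourdough post sits closer to the long marinade post than to the other sourdough post, because the count of "the" dominates. After IDF weighting the ubiquitous words contribute nothing and the two sourdough posts coincide.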

Sia tip — The keyword the marker wants for IDF's role is "distinguishes topics" (down-weights common words, up-weights distinctive ones) — state it explicitly rather than only giving the formula.

Glossary

Key terms

Bag of Words (BoW): Represents a document as a vector of word counts over a fixed vocabulary, discarding word order. It is simple, interpretable, fixed-length and language-agnostic, but loses order/semantics, is sparse and high-dimensional, and lets common words dominate.
Windowed co-occurrence vector: A way to represent a single word by counting the context words that appear within a window of n words on either side of each occurrence. Two words are then compared by the distance between their context vectors.
Euclidean distance: The straight-line distance between two vectors, d(x, y) = √( Σᵢ (xᵢ − yᵢ)² ). It is the standard way to compare BoW/word vectors; you align the vectors on a shared vocabulary (padding with 0) before subtracting.
Term frequency (TF): The raw count (or normalised frequency) of a term in a document. Used alone it overweights common words that appear everywhere, which is why it clusters topics poorly.
TF-IDF: Term frequency weighted by inverse document frequency, idf(t) = log(N / df(t)). It down-weights words common across all documents and up-weights distinctive ones, so topic-specific words drive similarity and clustering.
Binning / aggregation effect: Replacing raw scatter points with averages over equal-frequency bins. Binning smooths the data and can strengthen or exaggerate an apparent trend while hiding spread and variance, which can mislead later analysis.
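As a rough illustration of the windowed co-occurrence entry above, the sketch below counts context words around each occurrence of a target word. The sentence, the function name `context_vector` and the window size of 2 are assumptions made for the example; conventions such as whether the target word itself may be counted as context can differ, so check the one your lectures use.

```python
from collections import Counter

def context_vector(tokens, target, window):
    """Count the words within `window` positions either side of each `target`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != target:
            continue
        start = max(0, i - window)  # do not run off the left edge
        context = tokens[start:i] + tokens[i + 1:i + 1 + window]
        counts.update(context)
    return counts

# Hypothetical sentence, already tokenised on whitespace.
tokens = "knead the dough then rest the dough then bake the bread".split()

dough = context_vector(tokens, "dough", window=2)
bread = context_vector(tokens, "bread", window=2)

print(dict(dough))  # {'knead': 1, 'the': 2, 'then': 2, 'rest': 2, 'bake': 1}
print(dict(bread))  # {'bake': 1, 'the': 1}
```

To compare "dough" and "bread", align the two context vectors on their union vocabulary and take the Euclidean distance, exactly as for document vectors.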

FAQ

Bag-of-Words, TF-IDF & Visualisation FAQ

What information does Bag-of-Words throw away, and does it matter?

It discards word order, so "dog bites man" and "man bites dog" give the same vector, and it loses most semantics. It is also sparse and high-dimensional, and common words can dominate. Whether that matters depends on the task: for coarse topic similarity BoW (especially with TF-IDF) works well; for anything needing order or meaning (sentiment with negation, syntax) you need richer representations or n-grams.
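A minimal sketch of the order-loss point, using the two sentences from the answer above. The helper name `bow_vector` is made up for this example, and tokenisation is a simple lowercase whitespace split, which is an assumption rather than the subject's prescribed preprocessing.

```python
from collections import Counter

def bow_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in `text`; order is ignored."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# One fixed vocabulary shared by both documents (sorted for a stable column order).
vocabulary = sorted({"dog", "bites", "man"})

a = bow_vector("dog bites man", vocabulary)
b = bow_vector("man bites dog", vocabulary)

print(vocabulary)  # ['bites', 'dog', 'man']
print(a, b)        # [1, 1, 1] [1, 1, 1]
print(a == b)      # True
```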

How do I compute Euclidean distance between two word vectors?

First align both vectors on the union of their vocabularies, padding any missing word with a count of 0 — this is the step students most often get wrong. Then apply d(x, y) = √( Σᵢ (xᵢ − yᵢ)² ): subtract componentwise, square each difference, sum, and take the square root. The exam lets you leave the answer as a root, so √6 is an acceptable final answer.
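The routine above can be sketched as follows. The two short documents are invented so that the answer comes out as √6; `Counter` returns 0 for any word it has not seen, which does the zero-padding for you.

```python
import math
from collections import Counter

# Hypothetical documents, chosen only to give a tidy answer.
doc_a = "bread bread salt"
doc_b = "salt dough water"

counts_a = Counter(doc_a.split())
counts_b = Counter(doc_b.split())

# Step 1: align on the union vocabulary; missing words are padded with 0.
vocabulary = sorted(set(counts_a) | set(counts_b))
x = [counts_a[w] for w in vocabulary]
y = [counts_b[w] for w in vocabulary]

# Step 2: subtract componentwise, square, sum, then take the root.
squared_sum = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
d = math.sqrt(squared_sum)

print(vocabulary)   # ['bread', 'dough', 'salt', 'water']
print(x, y)         # [2, 0, 1, 0] [0, 1, 1, 1]
print(squared_sum)  # 6, so d = sqrt(6)
```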

Why is TF-IDF better than raw TF for clustering documents by topic?

Raw TF gives high weight to words that appear in every document ("the", "recipe"), so all documents look similar and clusters blur. TF-IDF multiplies term frequency by inverse document frequency, which down-weights those ubiquitous words and up-weights words that appear in only a few documents. The distinctive, topic-defining words then dominate the vector, so documents on the same topic become close and clusters separate cleanly.
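A small worked calculation can make the weighting concrete. The numbers are hypothetical: a corpus of N = 4 posts in which "recipe" appears in all four and "sourdough" in only one, with "sourdough" occurring 3 times in that post d. Natural log is assumed; the base only rescales the weights.

```latex
\begin{aligned}
\mathrm{idf}(t) &= \log\frac{N}{\mathrm{df}(t)} \\
\mathrm{idf}(\text{recipe}) &= \log\frac{4}{4} = 0 \\
\mathrm{idf}(\text{sourdough}) &= \log\frac{4}{1} \approx 1.39 \\
\text{tf-idf}(\text{sourdough}, d) &= 3 \times \log 4 \approx 4.16
\end{aligned}
```

However often "recipe" occurs, its TF-IDF weight is 0, so it cannot pull documents together; "sourdough" keeps a large weight and separates the topic.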

How can binning a scatter plot mislead?

Plotting raw points shows both the trend and the spread, but replacing them with averages over equal-frequency bins smooths the noise and can make a weak or noisy relationship look like a strong, clean trend. The variance is hidden, so a reader over-trusts the pattern. The exam (2025 Q1) wants you to recognise that binning is an aggregation choice that can exaggerate the apparent relationship and conceal uncertainty.
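One way to see the effect numerically is to compare the correlation of the raw points with the correlation of the bin averages. The sketch below uses synthetic data (a weak linear trend plus heavy noise, fixed seed), so the figures are illustrative only; the function names and the choice of 10 bins are assumptions, and the binning helper assumes the number of points divides evenly into the bins.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def equal_frequency_bin_means(xs, ys, n_bins):
    """Sort by x, cut into bins holding equal numbers of points, average each bin."""
    pairs = sorted(zip(xs, ys))
    size = len(pairs) // n_bins
    bin_x, bin_y = [], []
    for b in range(n_bins):
        chunk = pairs[b * size:(b + 1) * size]
        bin_x.append(sum(x for x, _ in chunk) / len(chunk))
        bin_y.append(sum(y for _, y in chunk) / len(chunk))
    return bin_x, bin_y

# Synthetic data: weak upward trend buried in noise.
rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [0.5 * x + rng.gauss(0, 3) for x in xs]

bin_x, bin_y = equal_frequency_bin_means(xs, ys, n_bins=10)

r_raw = pearson(xs, ys)
r_binned = pearson(bin_x, bin_y)
print("raw r:", round(r_raw, 2))
print("binned r:", round(r_binned, 2))
```

The binned correlation comes out noticeably higher than the raw one, because averaging within each bin cancels most of the noise. The underlying relationship has not changed; only the spread has been hidden.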

Study strategy

Exam move

Make the Euclidean-distance routine automatic: build the union vocabulary, pad missing words with 0, then subtract, square, sum and take the root, and remember you can leave a square root as the answer. Be able to construct a BoW document vector and a windowed word vector by hand, since both appear in the exam. For TF-IDF, memorise the one-line story (common words swamp pure TF; IDF down-weights them so distinctive words distinguish topics) and lead with the phrase "distinguishes topics". Keep BoW's positives and negatives as a ready list for any "evaluate this representation" question. Finally, for visualisation, rehearse the binning critique: binning smooths and can exaggerate a trend while hiding spread, a point that frequently carries the "critically evaluate" mark.
