University of Melbourne · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

COMP20008 · Elements Of Data Processing

50% final exam · hurdle · 14 Chapters · 6-page Bible
Chapter 4 of 12 · COMP20008

Text Pre-processing & Lemmatisation

Raw text has hidden structure, and turning it into clean tokens is the step before any Bag-of-Words or TF-IDF representation. This chapter walks the standard four-stage pipeline — sentence splitting, tokenisation & normalisation, lemmatisation (vs stemming), and stop-word removal / n-grams — and explains why each step matters. It is examined as short-answer + applied reasoning: the 2024 exam's Q4 lemmatises a passage and asks how the resulting word vectors change and why you would lemmatise at all.

In this chapter

What this chapter covers

  1. Why text has hidden structure and needs pre-processing before modelling
  2. Stage 1: sentence splitting at . ? ! with abbreviation and number exceptions
  3. Stage 2: tokenisation & normalisation (lowercase, strip punctuation)
  4. Stage 3: lemmatisation, mapping words to a dictionary base form ("cats" → "cat", "chased" → "chase")
  5. Lemmatisation vs stemming: dictionary base form vs crude suffix stripping
  6. Stage 4: stop-word removal and n-grams
  7. Effect on representation: smaller, less-sparse vocabulary and better similarity
Worked example

Lemmatisation changes the word vectors (mirrors 2024 Q4c–d)

Q [3 marks]. Take the text "The fox saw the hen and chased it. The hen turned and chased the fox." After lemmatisation, explain how the context vector for "hen" changes, and give one reason you would lemmatise before building Bag-of-Words vectors.
  • 1 mark: Lemmatise the surface forms to their base forms: "chased" → "chase", "turned" → "turn", "saw" → "see". Any counts that were split across different surface forms of the same word collapse onto a single lemma.
  • 1 mark: The "hen" context vector becomes chase=2, it=1, the=1, turn=1, and=1. Both occurrences were already "chased" in this passage, so the count of 2 is relabelled as "chase" rather than newly merged; the gain appears once other text uses "chase" or "chasing", because those occurrences now fall in the same dimension.
  • 1 mark: Why lemmatise: it collapses inflected forms onto one base form, which shrinks and de-sparsifies the vocabulary and lets two documents that use "chase / chased / chasing" be recognised as similar instead of as three unrelated words.
After lemmatisation the "hen" vector is keyed by base forms (e.g. chase=2), so inflected variants of a word share one dimension; you lemmatise to reduce vocabulary size and sparsity and to improve document similarity.
Sia tip — Keep lemmatisation and stemming straight: lemmatisation returns a valid dictionary word ("better" → "good"), while stemming just chops suffixes and may return a non-word ("studies" → "studi"). Naming that distinction is a reliable mark.
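
To check a hand-worked context vector like the one in the worked example above, a short script can help. This is a minimal sketch under loud assumptions: the context window is taken to be every other word in the same sentence, and `LEMMAS` is a hand-written toy table, not a real dictionary. A marking scheme that uses a narrower window will give different counts, so treat the output as an illustration of the method rather than a model answer.

```python
import re
from collections import Counter

TEXT = "The fox saw the hen and chased it. The hen turned and chased the fox."

# Toy lemma table covering only the inflected words in this passage
# (an assumption for illustration, not a real dictionary).
LEMMAS = {"saw": "see", "chased": "chase", "turned": "turn"}


def sentences(text):
    # A naive split is safe here: the passage has no abbreviations or decimals.
    return [s for s in re.split(r"[.?!]\s*", text) if s]


def context_vector(text, target, lemmatise=False):
    counts = Counter()
    for sentence in sentences(text):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        if lemmatise:
            tokens = [LEMMAS.get(t, t) for t in tokens]
        if target in tokens:
            # Assumed window: every other token in the same sentence.
            counts.update(t for t in tokens if t != target)
    return counts


surface = context_vector(TEXT, "hen")
lemma = context_vector(TEXT, "hen", lemmatise=True)
print(surface)
print(lemma)
```

Under this window the two vectors have the same number of dimensions: the keys change ("chased" becomes "chase", "saw" becomes "see", "turned" becomes "turn") but no counts merge, because the passage never uses two different forms of the same word.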
Glossary

Key terms

Sentence splitting
The first text stage: dividing a passage into sentences at . ? ! while handling exceptions where those marks do not end a sentence — abbreviations ("Dr.", "e.g.") and decimals ("3.14") — so the boundaries are correct before tokenising.
Tokenisation & normalisation
Breaking each sentence into tokens (usually words) and normalising them — lowercasing and stripping punctuation — so "The", "the" and "the," all map to the same token before counting.
Lemmatisation
Mapping a word to its dictionary base form (lemma) using grammatical knowledge: "cats" → "cat", "chased" → "chase", "better" → "good". It always yields a valid word and collapses inflections onto one form.
Stemming
A cruder alternative to lemmatisation that strips suffixes by rule ("studies" → "studi", "running" → "run"). It is faster but can produce non-words and over- or under-merge, so lemmatisation is preferred when accuracy matters.
Stop-word removal
Dropping very common, low-information words ("the", "is", "and") that appear in nearly every document and so add noise rather than signal to a Bag-of-Words representation.
n-grams
Tokens made of n consecutive words ("data processing" as a bigram) rather than single words. They capture short phrases and local order that a unigram Bag-of-Words discards, at the cost of a larger vocabulary.
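
The tokenisation, stop-word and n-gram entries above can be sketched in a few lines of plain Python. The stop list and the function names here are illustrative assumptions; real stop lists are much longer, and a library tokeniser handles cases (contractions, hyphens) that this regular expression does not.

```python
import re

# Tiny illustrative stop list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "is", "and", "a", "of", "it"}


def tokenise(sentence):
    # Lowercase, then keep runs of letters/digits, which drops punctuation.
    return re.findall(r"[a-z0-9]+", sentence.lower())


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    # Slide a window of length n across the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = tokenise("The hen, the fox and the Data Processing lecture.")
content = remove_stop_words(tokens)
print(tokens)
print(content)
print(ngrams(content, 2))
```

Building bigrams after stop-word removal is a design choice: it keeps the vocabulary small, but it can pair words that were not adjacent in the original sentence ("fox data" here). Building n-grams first and filtering afterwards avoids that at the cost of more distinct terms.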
FAQ

Text Pre-processing & Lemmatisation FAQ

What is the difference between lemmatisation and stemming?

Both reduce inflected words to a common form, but lemmatisation uses dictionary and grammatical knowledge to return a valid base word ("chased" → "chase", "better" → "good"), while stemming just strips suffixes by rule and may return a non-word ("studies" → "studi"). Lemmatisation is more accurate and is preferred; stemming is faster and cruder. The exam frequently wants this exact contrast.
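
The contrast can be made concrete with two toy functions. Both are assumptions written for illustration: `LEMMA_TABLE` is a hand-made stand-in for a real dictionary, and `SUFFIX_RULES` is a deliberately crude rule list, not the Porter algorithm or any library's stemmer.

```python
# Toy lemma dictionary: a hypothetical stand-in for a real lexicon.
LEMMA_TABLE = {
    "chased": "chase",
    "studies": "study",
    "better": "good",
    "mice": "mouse",
    "cats": "cat",
}

# Crude suffix rules, tried in order. This is NOT the Porter algorithm.
SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")]


def lemmatise(word):
    # Dictionary lookup; unknown words are returned unchanged.
    return LEMMA_TABLE.get(word, word)


def stem(word):
    for suffix, replacement in SUFFIX_RULES:
        # Require at least three letters to remain so short words are not mangled.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["studies", "chased", "better", "mice", "cats"]:
    print(f"{w:8} lemma={lemmatise(w):6} stem={stem(w)}")
```

The lookup returns real words every time, including irregular forms such as "better" and "mice" that no suffix rule can reach. The rule-based stemmer returns non-words such as "studi" and leaves the irregular forms untouched, which is the over- and under-merging the answer above describes.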

Why does lemmatisation improve a Bag-of-Words representation?

Because it collapses inflected forms ("chase / chased / chasing") onto a single lemma, so the vocabulary is smaller and less sparse, and counts that were spread across surface forms are aggregated. Two documents that talk about the same thing in different tenses are then recognised as similar instead of appearing to use unrelated words, which makes downstream clustering and similarity work better.
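
A small numeric check makes the similarity claim visible. This is a sketch, assuming a hand-written lemma table (`LEMMAS`) and two made-up three-word documents; cosine similarity is used as one common similarity measure, not necessarily the one the subject examines.

```python
import math
from collections import Counter

# Toy lemma table (hypothetical; a real lemmatiser would use a dictionary).
LEMMAS = {"chased": "chase", "chasing": "chase", "hens": "hen"}


def bag_of_words(text, lemmatise=False):
    tokens = text.lower().split()
    if lemmatise:
        tokens = [LEMMAS.get(t, t) for t in tokens]
    return Counter(tokens)


def cosine(a, b):
    # Dot product over shared terms divided by the product of vector lengths.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


doc1 = "fox chased hen"
doc2 = "fox chasing hens"

raw = cosine(bag_of_words(doc1), bag_of_words(doc2))
lem = cosine(bag_of_words(doc1, True), bag_of_words(doc2, True))
vocab_raw = set(bag_of_words(doc1)) | set(bag_of_words(doc2))
vocab_lem = set(bag_of_words(doc1, True)) | set(bag_of_words(doc2, True))
print(raw, len(vocab_raw))
print(lem, len(vocab_lem))
```

On surface forms the documents share only "fox", so the similarity is about 0.33 over a five-word vocabulary. After lemmatisation both documents become the same bag of three lemmas, so the similarity rises to 1.0 and the vocabulary shrinks to three.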

Why is sentence splitting harder than splitting on full stops?

Because a full stop does not always end a sentence: abbreviations ("Dr.", "U.S.A.", "e.g.") and decimal numbers ("3.14") contain dots that are not sentence boundaries. A naive split on every . ? ! would over-segment the text, so the splitter needs exception handling to recognise these cases — which is why it is treated as a distinct pre-processing stage.
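
The difference between a naive split and an exception-aware one can be sketched as follows. The abbreviation list is a small illustrative assumption, and the rule "a boundary mark must be followed by whitespace or the end of the text" is one simple way to skip decimals, not the only one.

```python
import re

# Small illustrative abbreviation list; a real splitter needs a far longer one.
ABBREVIATIONS = {"dr.", "mr.", "e.g.", "i.e.", "u.s.a."}


def naive_split(text):
    # Break at every . ? ! regardless of context.
    return [p.strip() for p in re.split(r"[.?!]", text) if p.strip()]


def split_sentences(text):
    sentences, start = [], 0
    # Candidate boundary: . ? or ! followed by whitespace or end of text.
    # The dot in "3.14" is followed by a digit, so it is never a candidate.
    for match in re.finditer(r"[.?!](?=\s|$)", text):
        end = match.end()
        last_word = text[start:end].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue  # the dot belongs to an abbreviation, keep scanning
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())  # trailing text with no end mark
    return sentences


text = "Dr. Lee measured 3.14 cm, e.g. with a ruler. Was it right? Yes!"
print(naive_split(text))
print(split_sentences(text))
```

The naive version cuts this text into seven fragments, while the exception-aware version returns the three intended sentences. One known gap in this sketch: a sentence that genuinely ends with a listed abbreviation is not split there, which is part of why real splitters are more elaborate.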

How is text pre-processing examined in COMP20008?

As applied short-answer, mirroring 2024 Q4: you are given a short passage and asked to lemmatise it, show how the resulting word/context vectors change, and justify a step (why lemmatise, why remove stop words). Marks reward correctly applying a stage by hand and explaining its effect on the vocabulary and similarity, plus knowing the lemmatisation-vs-stemming distinction.

Study strategy

Exam move

Commit the four-stage pipeline to memory as an ordered list — sentence split → tokenise/normalise → lemmatise → stop-words/n-grams — and be ready to apply any one stage to a short passage by hand, since that is exactly what the exam asks. Drill lemmatisation on real inflections (chased→chase, better→good, mice→mouse) and always pair it with the stemming contrast, because that one-line distinction is a near-guaranteed mark. For every stage, learn its purpose in one sentence (sentence splitting needs exception handling; normalisation merges case/punctuation; lemmatisation de-sparsifies the vocabulary; stop-word removal drops noise; n-grams recapture short phrases). This chapter feeds directly into Bag-of-Words and TF-IDF, so treat clean tokens as the prerequisite for the next chapter.
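
For revision, the ordered pipeline can be written out as one function per stage. This is a minimal sketch, assuming toy lookup tables (`ABBREVIATIONS`, `LEMMAS`, `STOP_WORDS`) sized for the chapter's fox-and-hen passage; the function names are hypothetical and a real system would use library components for each stage.

```python
import re

ABBREVIATIONS = {"dr.", "e.g."}                                # illustrative only
LEMMAS = {"saw": "see", "chased": "chase", "turned": "turn"}   # toy lemma table
STOP_WORDS = {"the", "and", "it"}                              # toy stop list


def split_sentences(text):
    # Stage 1: break after . ? ! plus whitespace, then re-join abbreviation breaks.
    sentences = []
    for piece in re.split(r"(?<=[.?!])\s+", text.strip()):
        if sentences and sentences[-1].split()[-1].lower() in ABBREVIATIONS:
            sentences[-1] += " " + piece
        else:
            sentences.append(piece)
    return sentences


def tokenise(sentence):
    # Stage 2: lowercase and keep only runs of letters/digits.
    return re.findall(r"[a-z0-9]+", sentence.lower())


def lemmatise(tokens):
    # Stage 3: dictionary lookup; unknown tokens pass through unchanged.
    return [LEMMAS.get(t, t) for t in tokens]


def remove_stop_words(tokens):
    # Stage 4a: drop very common, low-information words.
    return [t for t in tokens if t not in STOP_WORDS]


def bigrams(tokens):
    # Stage 4b: pair each token with its right-hand neighbour.
    return list(zip(tokens, tokens[1:]))


def preprocess(text):
    return [remove_stop_words(lemmatise(tokenise(s))) for s in split_sentences(text)]


text = "The fox saw the hen and chased it. The hen turned and chased the fox."
for tokens in preprocess(text):
    print(tokens, bigrams(tokens))
```

Running each stage by hand on the same passage and comparing against the printed tokens is a quick way to practise the ordered list before the exam.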
