COMP20008 · Elements Of Data Processing
Text Pre-processing & Lemmatisation
Raw text has hidden structure, and turning it into clean tokens is the step before any Bag-of-Words or TF-IDF representation. This chapter walks the standard four-stage pipeline — sentence splitting, tokenisation & normalisation, lemmatisation (vs stemming), and stop-word removal / n-grams — and explains why each step matters. It is examined as short-answer + applied reasoning: the 2024 exam's Q4 lemmatises a passage and asks how the resulting word vectors change and why you would lemmatise at all.
What this chapter covers
- 1. Why text has hidden structure and needs pre-processing before modelling
- 2. Stage 1: sentence splitting at . ? ! with abbreviation and number exceptions
- 3. Stage 2: tokenisation & normalisation (lowercase, strip punctuation)
- 4. Stage 3: lemmatisation, mapping words to a dictionary base form ("cats" → "cat", "chased" → "chase")
- 5. Lemmatisation vs stemming: dictionary base form vs crude suffix stripping
- 6. Stage 4: stop-word removal and n-grams
- 7. Effect on representation: smaller, less-sparse vocabulary and better similarity
Lemmatisation changes the word vectors (mirrors 2024 Q4c–d)
- (1 mark) Lemmatise the surface forms to their base forms: "chased" → "chase", "turned" → "turn", "saw" → "see". Counts that were split across different surface forms now collapse onto a single lemma.
- (1 mark) The "hen" context vector becomes chase=2, it=1, the=1, turn=1, and=1. "chase" now aggregates both occurrences that were previously "chased", so the vector is denser and cleaner than the surface-form version.
- (1 mark) Why lemmatise: it collapses inflected forms onto one base form, which shrinks and de-sparsifies the vocabulary and lets two documents that use "chase / chased / chasing" be recognised as similar instead of as three unrelated words.
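The collapse of counts described above can be sketched in a few lines. This is only an illustration: the token list is made up (it is not the exam passage), and `LEMMAS` is a hypothetical lookup table standing in for a real lemmatiser, which would use a dictionary and part-of-speech information.

```python
from collections import Counter

# Hypothetical lemma lookup (assumption); not a real lemmatiser.
LEMMAS = {"chased": "chase", "chases": "chase", "turned": "turn", "saw": "see"}

def lemmatise(tokens):
    """Replace each token with its lemma if the table knows it, else keep it."""
    return [LEMMAS.get(t, t) for t in tokens]

# Made-up context tokens for illustration only.
context = ["chased", "the", "chases", "it", "turned", "and"]

surface_counts = Counter(context)            # one entry per surface form
lemma_counts = Counter(lemmatise(context))   # inflections share one entry

print(surface_counts)
print(lemma_counts)
```

In this toy input the two surface forms "chased" and "chases" each count once before lemmatisation and share a single "chase" entry with count 2 afterwards, so the vector has fewer dimensions.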
Key terms
- Sentence splitting
- The first text stage: dividing a passage into sentences at . ? ! while handling exceptions where those marks do not end a sentence — abbreviations ("Dr.", "e.g.") and decimals ("3.14") — so the boundaries are correct before tokenising.
- Tokenisation & normalisation
- Breaking each sentence into tokens (usually words) and normalising them — lowercasing and stripping punctuation — so "The", "the" and "the," all map to the same token before counting.
- Lemmatisation
- Mapping a word to its dictionary base form (lemma) using grammatical knowledge: "cats" → "cat", "chased" → "chase", "better" → "good". It always yields a valid word and collapses inflections onto one form.
- Stemming
- A cruder alternative to lemmatisation that strips suffixes by rule ("studies" → "studi", "running" → "run"). It is faster but can produce non-words and over- or under-merge, so lemmatisation is preferred when accuracy matters.
- Stop-word removal
- Dropping very common, low-information words ("the", "is", "and") that appear in nearly every document and so add noise rather than signal to a Bag-of-Words representation.
- n-grams
- Tokens made of n consecutive words ("data processing" as a bigram) rather than single words. They capture short phrases and local order that a unigram Bag-of-Words discards, at the cost of a larger vocabulary.
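As a rough sketch of three of the terms above (tokenisation & normalisation, stop-word removal, n-grams), the following assumes a very small illustrative stop list and a simple regular-expression tokeniser. The function names and the example sentence are hypothetical, and real toolkits use longer stop lists and more careful tokenisers.

```python
import re

# Tiny illustrative stop list (assumption); real lists are much longer.
STOP_WORDS = {"the", "is", "and", "a", "of"}

def tokenise(sentence):
    """Lowercase, then keep runs of letters, digits or apostrophes (drops punctuation)."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop list."""
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenise("The cat chased the hen, and the hen ran.")
content = remove_stop_words(tokens)
bigrams = ngrams(tokens, 2)

print(tokens)
print(content)
print(bigrams)
```

Note that "The" and "the" become the same token and "hen," loses its comma, which is the merging effect normalisation is meant to achieve. The bigram list is longer than the set of distinct unigrams, which illustrates the vocabulary cost of n-grams.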
Text Pre-processing & Lemmatisation FAQ
What is the difference between lemmatisation and stemming?
Both reduce inflected words to a common form, but lemmatisation uses dictionary and grammatical knowledge to return a valid base word ("chased" → "chase", "better" → "good"), while stemming just strips suffixes by rule and may return a non-word ("studies" → "studi"). Lemmatisation is more accurate and is preferred; stemming is faster and cruder. The exam frequently wants this exact contrast.
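The contrast can be made concrete with a toy comparison. Both functions below are assumptions for illustration: `LEMMA_DICT` is a hypothetical hand-written dictionary, and `crude_stem` is a deliberately simple suffix stripper, not the Porter algorithm or any library's stemmer.

```python
# Hypothetical dictionary standing in for a real lemmatiser's lexicon (assumption).
LEMMA_DICT = {"chased": "chase", "better": "good", "studies": "study", "mice": "mouse"}

def lemmatise(word):
    """Dictionary lookup: returns a valid base word when the entry is known."""
    return LEMMA_DICT.get(word, word)

def crude_stem(word):
    """Strip the first matching suffix by rule, with no dictionary check."""
    for suffix, replacement in (("ies", "i"), ("ing", ""), ("ed", ""), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["studies", "chased", "better", "mice"]:
    print(w, "lemma:", lemmatise(w), "stem:", crude_stem(w))
```

With these toy rules, "studies" stems to the non-word "studi" while the lookup returns "study", and the irregular forms "better" and "mice" pass through the stemmer unchanged because no suffix rule applies to them.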
Why does lemmatisation improve a Bag-of-Words representation?
Because it collapses inflected forms ("chase / chased / chasing") onto a single lemma, so the vocabulary is smaller and less sparse, and counts that were spread across surface forms are aggregated. Two documents that talk about the same thing in different tenses are then recognised as similar instead of appearing to use unrelated words, which makes downstream clustering and similarity work better.
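One way to see the similarity effect is to compare cosine similarity between two Bag-of-Words vectors before and after lemmatisation. The sketch below assumes two made-up three-word documents and a hypothetical `LEMMAS` lookup table, so it shows the mechanism rather than real data.

```python
import math
from collections import Counter

# Hypothetical lemma lookup (assumption); a real lemmatiser uses a dictionary.
LEMMAS = {"chased": "chase", "chasing": "chase", "chases": "chase"}

def bag_of_words(tokens, lemmatise=False):
    """Count tokens, optionally mapping each one to its lemma first."""
    if lemmatise:
        tokens = [LEMMAS.get(t, t) for t in tokens]
    return Counter(tokens)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up documents that differ only in tense.
doc1 = ["dog", "chased", "cat"]
doc2 = ["dog", "chasing", "cat"]

before = cosine(bag_of_words(doc1), bag_of_words(doc2))
after = cosine(bag_of_words(doc1, lemmatise=True), bag_of_words(doc2, lemmatise=True))

print(round(before, 3), round(after, 3))
```

Before lemmatisation the two documents share only two of their three dimensions, so the similarity is about 0.67. After lemmatisation the vectors are identical and the similarity is 1.0, up to floating-point rounding.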
Why is sentence splitting harder than splitting on full stops?
Because a full stop does not always end a sentence: abbreviations ("Dr.", "U.S.A.", "e.g.") and decimal numbers ("3.14") contain dots that are not sentence boundaries. A naive split on every . ? ! would over-segment the text, so the splitter needs exception handling to recognise these cases — which is why it is treated as a distinct pre-processing stage.
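The difference between a naive split and one with exception handling can be sketched as follows. The abbreviation list and the example text are assumptions chosen for illustration; a production splitter needs a much larger list and still has edge cases, such as an abbreviation that genuinely ends a sentence.

```python
import re

# Illustrative abbreviation list (assumption); real splitters need far more entries.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "u.s.a."}

def split_sentences(text):
    """Split on whitespace-delimited words ending in . ? ! unless the word is a known abbreviation."""
    sentences, current = [], []
    for word in text.split():
        current.append(word)
        ends_with_mark = word[-1] in ".?!"
        is_abbreviation = word.lower() in ABBREVIATIONS
        if ends_with_mark and not is_abbreviation:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that has no closing mark
        sentences.append(" ".join(current))
    return sentences

text = "Dr. Smith measured 3.14 metres. Was it right? Yes!"

# Naive approach: cut at every . ? ! regardless of context.
naive = [piece for piece in re.split(r"[.?!]", text) if piece.strip()]
careful = split_sentences(text)

print(len(naive), naive)
print(len(careful), careful)
```

Here the naive split cuts inside "Dr." and "3.14" and produces five fragments, while the exception-aware version returns three sentences. The decimal survives because the check looks only at the last character of each whitespace-delimited word, and "3.14" ends in a digit.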
How is text pre-processing examined in COMP20008?
As applied short-answer, mirroring 2024 Q4: you are given a short passage and asked to lemmatise it, show how the resulting word/context vectors change, and justify a step (why lemmatise, why remove stop words). Marks reward correctly applying a stage by hand and explaining its effect on the vocabulary and similarity, plus knowing the lemmatisation-vs-stemming distinction.
Exam move
Commit the four-stage pipeline to memory as an ordered list — sentence split → tokenise/normalise → lemmatise → stop-words/n-grams — and be ready to apply any one stage to a short passage by hand, since that is exactly what the exam asks. Drill lemmatisation on real inflections (chased→chase, better→good, mice→mouse) and always pair it with the stemming contrast, because that one-line distinction is a near-guaranteed mark. For every stage, learn its purpose in one sentence (sentence splitting needs exception handling; normalisation merges case/punctuation; lemmatisation de-sparsifies the vocabulary; stop-word removal drops noise; n-grams recapture short phrases). This chapter feeds directly into Bag-of-Words and TF-IDF, so treat clean tokens as the prerequisite for the next chapter.