University of Melbourne · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

COMP20008 · Elements Of Data Processing

- one subject, every graph, every model, every mark

50% final exam · hurdle · 12 Chapters · 105-page Bible

Our own words - no uploaded lecturer files

Built to mirror S1 2026 · updated this semester

The Complete Exam Bible · S1 2026

Elements of Data Processing

— one subject, every step of the data pipeline, every exam mark

COMP20008 Elements of Data Processing is the University of Melbourne's second-year data-science core — an end-to-end tour of the data pipeline: getting data in (types, formats, quality, crawling and scraping), cleaning and enhancing it (imputation, scaling, text pre-processing, Bag-of-Words and TF-IDF), interpreting it (clustering, PCA, correlation and mutual information, classification and regression, evaluation and feature selection), and using it responsibly (visualisation, ethics, intellectual property, k-anonymity and differential privacy, and modern LLMs, prompting and RAG).

It is assessed 50% by continuous project work and 50% by a closed-book 2-hour written exam, with a dual hurdle — you must score at least 20/50 on each half. The continuous half is currently delivered as an individual assignment and a group assignment, each with an oral assessment, plus weekly ungraded quizzes (A1/A2 sub-weights subject to confirmation). The exam supplies a formula sheet on page 1, so marks come from picking the right method, doing one clean by-hand calculation, and critically evaluating the trade-offs — not memorising formulas.


Contents · the whole subject, one map

What COMP20008 covers

The whole subject → one exam-ready map. Each topic links to its free chapter guide.

01 · The Data Pipeline & Data Types · Weeks 1–2. The five-stage data pipeline; numerical vs non-numerical vs could-be-either; continuous/discrete; structured/semi-structured/unstructured; levels of measurement; JSON, XML and HTML.
02 · Data Cleaning, Imputation & Scaling · Week 2. The six data-quality dimensions; missing/inconsistent/duplicate/outlier problems; imputation (drop, mean/median/mode, model-based); min-max & z-score scaling; sampling.
03 · Web Crawling & Scraping · Week 3. Crawling (link-following frontier, coverage, robots.txt) vs scraping (targeted field extraction); the seed-URL → fetch → parse → BeautifulSoup → clean-matrix pipeline; the visited-set flaw.
04 · Text Pre-processing & Lemmatisation · Week 3. Sentence splitting with exceptions; tokenisation & normalisation; lemmatisation vs stemming; stop-word removal; n-grams; why text has hidden structure.
05 · Bag-of-Words, TF-IDF & Visualisation · Week 4. BoW document/word vectors (order discarded); windowed co-occurrence vectors; Euclidean distance; TF vs TF-IDF for distinguishing topics; chart-type choice and the binning effect.
06 · Clustering: k-means, VAT & the Elbow · Week 5. k-means (assign/recompute, SSE/inertia); the elbow method (advantage/limitation); outlier sensitivity and mitigation (k-medoids, robust/density methods); the VAT reordered distance matrix.
07 · Hierarchical Clustering & Dendrograms · Weeks 5–6. Agglomerative clustering; single vs complete vs average linkage; building dendrograms (y-axis = merge distance); cutting the tree for k clusters; NMI for comparing clusterings.
08 · Correlation, Entropy & Mutual Information · Weeks 5–6. Pearson r (linear only); entropy & conditional entropy (log₂); mutual information & NMI; PCA (principal axes, variance explained, the 95%-on-PC1 caution); high-NMI-but-zero-Pearson (y = x²).
09 · Supervised Learning: k-NN & Decision Trees · Week 7. Classification vs regression; train/test; k-NN (distance, k, scaling, cold-start/Jaccard pitfalls); decision trees (information gain/Gini, depth); high-cardinality split overfitting.
10 · Regression, Evaluation & Feature Selection · Week 8. Linear regression (least squares, slope, R²); experimental design; k-fold cross-validation; confusion matrix to accuracy/precision/recall/F1; the precision-recall trade-off; feature selection; data leakage.
11 · Ethics, IP & Privacy · Week 9. Intellectual property (patent/copyright/trademark, licensing, GDPR); privacy levels; k-anonymity & quasi-identifiers; l-diversity vs the homogeneity attack; global differential privacy; bias & fairness.
12 · LLMs, Prompting, RAG & Verifiable Data · Weeks 10–12. What makes an LLM generative (embeddings, hallucination); prompts vs fine-tuning; the AUTOMAT prompt schema; the RAG store and content-partitioning; blockchain tamper-evidence and digital-signature verification.

Assessment

How COMP20008 is assessed

Component	Weight	Format
Continuous / project assessment (data-processing & analysis projects across the semester, ~60 hrs; submissions ~week 5 and ~week 11) · hurdle	50%	Project work applied to real datasets; in the current offering delivered as Assignment 1 (individual: coding & analysis + interactive oral via Zoom + critical-thinking submission) and Assignment 2 (group: report + slides + in-person group oral), plus weekly ungraded formative quizzes. A1/A2 sub-weights subject to confirmation
End-of-semester examination · hurdle	50%	Closed-book written exam, 2 hours, during the exam period; a formula sheet is printed on page 1; ~11–12 questions of 2–4 marks each — short-answer + small by-hand calculation + critical-evaluation across the whole syllabus

Worked example · free

Bag-of-Words word vectors + Euclidean distance (closed-book, formula-sheet style)

Q [4 marks]. In the text "The fox saw the hen and chased it. The hen turned and chased the fox.", build Bag-of-Words context vectors for "fox" and "hen" using a window of 2 words either side (excluding the target itself), then compute the Euclidean distance between them. You may leave the answer as a square root.

1 mark · Collect the context, keeping each window inside its own sentence. "fox" occurs twice, with contexts (the, saw, the) and (chased, the); "hen" occurs twice, with contexts (saw, the, and, chased) and (the, turned, and). So fox: the=3, saw=1, chased=1; hen: the=2, saw=1, chased=1, and=2, turned=1. (If the window were allowed to cross the full stop, "it" would join hen's second context and the distance would become √7, so state your boundary assumption.)
1 mark · Align on the shared vocabulary {the, saw, chased, and, turned} and pad missing words with 0: fox = (the 3, saw 1, chased 1, and 0, turned 0); hen = (the 2, saw 1, chased 1, and 2, turned 1).
1 mark · Apply the Euclidean-distance formula d(x, y) = √( Σᵢ (xᵢ − yᵢ)² ): d = √( (3−2)² + (1−1)² + (1−1)² + (0−2)² + (0−1)² ).
1 mark · Evaluate the sum of squares: d = √(1 + 0 + 0 + 4 + 1) = √6.

d(fox, hen) = √6 — you may leave the answer as a square root; the exam states you do not have to reduce a root to a decimal.

Sia tip — Build the union vocabulary first and pad missing words with 0 before subtracting — the most common mark-loser is forgetting a word that appears for only one of the two targets, which throws every later squared difference off.
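The by-hand steps can be cross-checked with a short script. This is a minimal sketch, not the subject's reference solution: the function names are hypothetical, and it assumes lower-casing, letters-only tokens, and windows that stop at sentence boundaries (full stops).

```python
import re
from collections import Counter

TEXT = "The fox saw the hen and chased it. The hen turned and chased the fox."


def context_vector(text, target, window=2):
    """Count the words within `window` positions of each `target` occurrence.

    Assumption: a window never crosses a full stop.
    """
    counts = Counter()
    for sentence in text.lower().split("."):
        tokens = re.findall(r"[a-z]+", sentence)
        for i, token in enumerate(tokens):
            if token != target:
                continue
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts


def squared_euclidean(a, b):
    """Sum of squared differences over the union vocabulary (missing words count as 0)."""
    vocab = set(a) | set(b)
    return sum((a[word] - b[word]) ** 2 for word in vocab)


fox = context_vector(TEXT, "fox")
hen = context_vector(TEXT, "hen")
d_squared = squared_euclidean(fox, hen)
print(f"d(fox, hen) = sqrt({d_squared})")  # prints: d(fox, hen) = sqrt(6)
```

A `Counter` returns 0 for any word it has not seen, which is the "pad with 0" step done automatically; iterating over the union vocabulary is what stops a one-sided word from being dropped.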

Glossary

Key terms

The data pipeline: The spine of the whole subject: Data Input → Data Cleanup & Enhancement → Data Interpretation (models) → Data Output (visualisation) → Data Use (ethics). Every week re-anchors to one of these five stages, so naming the stage a question lives in is half the answer.
Bag of Words (BoW): Represents a document (or a word's context) as a vector of counts over a fixed vocabulary; word order is discarded. Simple, interpretable, fixed-length and language-agnostic, but loses order/semantics and is sparse and high-dimensional, with common words dominating.
TF-IDF: Term frequency weighted by inverse document frequency, tf·idf with idf(t) = log(N / df(t)). It down-weights words common across all documents and up-weights distinctive ones, so topic-specific words drive similarity — which is why it clusters documents by topic far better than raw TF.
Mutual information & NMI: Mutual information MI(X,Y) = H(Y) − H(Y|X) measures the reduction in uncertainty about Y from knowing X; it captures any dependence (including non-linear) and is 0 only if X and Y are independent. NMI rescales MI to 0–1 by dividing by min(H(X), H(Y)).
k-anonymity vs l-diversity: k-anonymity generalises/suppresses quasi-identifiers so every record is indistinguishable from at least k−1 others — it hides who. l-diversity additionally requires at least l distinct sensitive values within each group — it hides what, defending against the homogeneity attack that defeats k-anonymity alone.
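To make the TF-IDF entry above concrete, here is a minimal sketch of tf·idf with idf(t) = log(N / df(t)). The three-document corpus is invented for illustration, and two choices are assumptions rather than anything the subject prescribes: tf is the raw count, and the log is natural (a different base only rescales every weight).

```python
import math
from collections import Counter

# Hypothetical corpus, made up for this example.
docs = {
    "d1": "the cat sat on the mat",
    "d2": "the dog sat on the log",
    "d3": "the cat chased the dog",
}

tokenised = {name: text.split() for name, text in docs.items()}
n_docs = len(tokenised)

# df(t): the number of documents that contain term t at least once.
df = Counter()
for tokens in tokenised.values():
    df.update(set(tokens))


def tf_idf(tokens):
    """Return {term: tf * log(N / df)} for one document (raw-count tf assumed)."""
    tf = Counter(tokens)
    return {term: tf[term] * math.log(n_docs / df[term]) for term in tf}


weights = {name: tf_idf(tokens) for name, tokens in tokenised.items()}
for term in ("the", "mat"):
    print(term, round(weights["d1"][term], 3))
```

"the" appears in every document, so its idf is log(3/3) = 0 and its weight vanishes despite a term frequency of 2, while "mat" appears in only one document and keeps a weight of log 3. That is the down-weighting of common words described in the glossary.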

FAQ

COMP20008 FAQ

Is COMP20008 hard?

It is broad rather than deep. The challenge is the unusually wide surface — data types and quality, cleaning, web scraping, text and Bag-of-Words, clustering and PCA, correlation and mutual information, classification and regression, evaluation, privacy, LLMs/RAG and blockchain — almost all examined as short-answer plus a tiny by-hand calculation plus "critically evaluate", not coding. With a formula sheet provided, the difficulty is breadth and method-selection, not memorising maths. Students who keep up week-to-week and practise the recurring question types find the exam very predictable.

Is the COMP20008 exam a hurdle?

Yes. The handbook sets a dual hurdle: you must score at least 20/50 on the continuous project assessment and at least 20/50 on the end-of-semester exam, on top of passing overall. The headline split is 50% continuous / 50% exam, so the exam is both the largest single stake and a barrier you cannot afford to fail.

Do I need to memorise the formulas?

No. A formula sheet is printed on page 1 of the exam — Euclidean distance, Pearson r, entropy and conditional entropy, mutual information and NMI, accuracy/precision/recall/F1, Jaccard and cosine. Marks come from choosing the right method, substituting the numbers correctly with one clean by-hand calculation, and explaining the trade-off. The exam even notes you don't have to reduce a square root to a decimal — answers are expressions, not memorised numbers.
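As an illustration of "substitute the numbers correctly", the four confusion-matrix metrics can be sketched as follows. The counts are hypothetical, chosen only so the arithmetic is easy to follow by hand, and the function name is made up for this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of everything predicted positive, how much was right
    recall = tp / (tp + fn)     # of everything actually positive, how much was found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 40 true positives, 10 false positives,
# 20 false negatives, 30 true negatives.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts accuracy is 70/100, precision is 40/50 and recall is 40/60, so precision and recall disagree even though accuracy looks reasonable; which one to optimise depends on whether false positives or false negatives are more costly.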

Is COMP20008 a coding exam?

Not in the written exam. Coding (pandas, BeautifulSoup, scikit-learn) is assessed through the continuous project half and the workshops. The closed-book exam is short-answer, small by-hand computation and critical evaluation — for example classifying data types, building a BoW vector, reading a dendrogram, or naming the two steps of differential privacy — so practise reasoning and calculation by hand, not writing code under exam conditions.

What does the continuous assessment involve?

In the current instance the 50% continuous half is two assignments on real datasets. Assignment 1 is individual — a coding-and-analysis submission, an interactive oral assessment over Zoom, and a critical-thinking submission. Assignment 2 is a group project — a report, slides and an in-person group oral (content marked at the group level, communication individually). There are also weekly ungraded quizzes for formative practice. The exact sub-weights inside the 50% are subject to confirmation against your unit outline.

Study strategy

How to study for the exam

Treat COMP20008 as a breadth subject where method-selection is the skill: the formula sheet is given, so your edge is picking the right tool fast and explaining the trade-off.
(1) Anchor every topic to the five-stage data pipeline (input → cleanup → interpretation → output → use) so you always know which stage a question lives in.
(2) Drill the recurring exam patterns directly, because they repeat almost verbatim year to year: classify data types, name two data-quality problems, crawling vs scraping, BoW vectors + Euclidean distance, TF vs TF-IDF, single/complete-linkage dendrograms, the elbow + outliers, Pearson-vs-NMI (the y = x² archetype), confusion-matrix metrics + which to optimise, data leakage, the k-anonymity/l-diversity homogeneity attack, the differential-privacy two-step, AUTOMAT, RAG, and blockchain verification.
(3) For every calculation write the formula in symbols, substitute the numbers, then state the answer with one line of interpretation; examiners reward the reasoning chain and the "so what".
(4) Practise the "critically evaluate" muscle: each method has a standard limitation (Pearson misses non-linearity, the elbow is subjective, k-means is outlier-sensitive, high PCA variance ≠ relevance, leakage gives MSE 0), and naming it earns the mark.
(5) Keep up with the workshops and assignments: the by-hand skills they build are exactly what the closed-book exam tests, and the dual 20/50 hurdle means neither half can be neglected.
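The Pearson-vs-NMI pattern (the y = x² archetype) is easy to verify numerically. The sketch below uses a hypothetical five-point dataset and treats each value as a discrete category when computing entropy; it follows the glossary definitions MI = H(Y) − H(Y|X) and NMI = MI / min(H(X), H(Y)). The helper names are made up for this example.

```python
import math
from collections import Counter, defaultdict


def entropy(values):
    """Shannon entropy in bits, treating each distinct value as a category."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(y, x):
    """H(Y|X): entropy of y within each x group, weighted by the group's share."""
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    n = len(x)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def pearson(x, y):
    """Pearson correlation coefficient r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical data: y is fully determined by x, but not linearly.
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]

r = pearson(x, y)
mi = entropy(y) - conditional_entropy(y, x)
nmi = mi / min(entropy(x), entropy(y))
print(f"Pearson r = {r:.3f}, MI = {mi:.3f} bits, NMI = {nmi:.3f}")
```

Because the data are symmetric about x = 0, the positive and negative products cancel and r comes out as zero, while knowing x removes all uncertainty about y, so H(Y|X) = 0 and NMI reaches 1. That contrast is the one-line interpretation a "critically evaluate" question is looking for.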
