COMP20008 · Elements Of Data Processing
COMP20008 Elements of Data Processing is the University of Melbourne's second-year data-science core — an end-to-end tour of the data pipeline: getting data in (types, formats, quality, crawling and scraping), cleaning and enhancing it (imputation, scaling, text pre-processing, Bag-of-Words and TF-IDF), interpreting it (clustering, PCA, correlation and mutual information, classification and regression, evaluation and feature selection), and using it responsibly (visualisation, ethics, intellectual property, k-anonymity and differential privacy, and modern LLMs, prompting and RAG).
It is assessed 50% by continuous project work and 50% by a closed-book 2-hour written exam, with a dual hurdle — you must score at least 20/50 on each half. The continuous half is currently delivered as an individual assignment and a group assignment, each with an oral assessment, plus weekly ungraded quizzes (A1/A2 sub-weights subject to confirmation). The exam supplies a formula sheet on page 1, so marks come from picking the right method, doing one clean by-hand calculation, and critically evaluating the trade-offs — not memorising formulas.
How COMP20008 is assessed
| Component | Weight | Format |
|---|---|---|
| Continuous / project assessment (data-processing & analysis projects across the semester, ~60 hrs; submissions ~week 5 and ~week 11) · hurdle | 50% | Project work applied to real datasets; in the current offering delivered as Assignment 1 (individual: coding & analysis + interactive oral via Zoom + critical-thinking submission) and Assignment 2 (group: report + slides + in-person group oral), plus weekly ungraded formative quizzes. A1/A2 sub-weights subject to confirmation |
| End-of-semester examination · hurdle | 50% | Closed-book written exam, 2 hours, during the exam period; a formula sheet is printed on page 1; ~11–12 questions of 2–4 marks each — short-answer + small by-hand calculation + critical-evaluation across the whole syllabus |
Bag-of-Words word vectors + Euclidean distance (closed-book, formula-sheet style)
- 1 mark: Collect the context. "fox" occurs twice, with contexts {the, saw} and {chased, the}; "hen" occurs twice, with contexts {chased, it, the, turned} and {and, chased}. So fox: the=2, saw=1, chased=1; hen: chased=2, it=1, the=1, turned=1, and=1.
- 1 mark: Align on the shared vocabulary {the, saw, chased, it, turned, and} and pad missing words with 0: fox = (the 2, saw 1, chased 1, it 0, turned 0, and 0); hen = (the 1, saw 0, chased 2, it 1, turned 1, and 1).
- 1 mark: Apply the Euclidean-distance formula d(x, y) = √( Σᵢ (xᵢ − yᵢ)² ): d = √( (2−1)² + (1−0)² + (1−2)² + (0−1)² + (0−1)² + (0−1)² ).
- 1 mark: Evaluate the sum of squares: each squared difference is 1, so d = √(1 + 1 + 1 + 1 + 1 + 1) = √6.
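The four steps above can be checked with a short sketch. It starts from the context sets listed in the first step (the original sentence is not reproduced here, so the contexts are copied as given) and the variable names are illustrative only.

```python
from collections import Counter
from math import sqrt

# Context words for each occurrence, copied from the first step of the solution.
fox_contexts = [["the", "saw"], ["chased", "the"]]
hen_contexts = [["chased", "it", "the", "turned"], ["and", "chased"]]


def context_counts(contexts):
    """Sum context-word counts over every occurrence of the target word."""
    counts = Counter()
    for window in contexts:
        counts.update(window)
    return counts


fox = context_counts(fox_contexts)
hen = context_counts(hen_contexts)

# Shared vocabulary, in the order used in the second step.
vocab = ["the", "saw", "chased", "it", "turned", "and"]
fox_vec = [fox[w] for w in vocab]  # a Counter returns 0 for missing words
hen_vec = [hen[w] for w in vocab]

# Euclidean distance: square root of the summed squared differences.
squared_sum = sum((x - y) ** 2 for x, y in zip(fox_vec, hen_vec))
distance = sqrt(squared_sum)

print(fox_vec, hen_vec, squared_sum, distance)
```

The sum of squares comes out as 6, so the distance is √6, matching the by-hand answer. In the exam the expression √6 is the expected form; the decimal is only a sanity check.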
Key terms
- The data pipeline: the spine of the whole subject. Data Input → Data Cleanup & Enhancement → Data Interpretation (models) → Data Output (visualisation) → Data Use (ethics). Every week re-anchors to one of these five stages, so naming the stage a question lives in is half the answer.
- Bag of Words (BoW): represents a document (or a word's context) as a vector of counts over a fixed vocabulary; word order is discarded. Simple, interpretable, fixed-length and language-agnostic, but loses order/semantics and is sparse and high-dimensional, with common words dominating.
- TF-IDF: term frequency weighted by inverse document frequency, tf·idf with idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. It down-weights words common across all documents and up-weights distinctive ones, so topic-specific words drive similarity, which is why it typically clusters documents by topic better than raw TF.
- Mutual information & NMI: mutual information MI(X,Y) = H(Y) − H(Y|X) measures the reduction in uncertainty about Y from knowing X; it captures any dependence (including non-linear) and is 0 exactly when X and Y are independent. NMI rescales MI to the range 0–1 by dividing by min(H(X), H(Y)).
- k-anonymity vs l-diversity: k-anonymity generalises/suppresses quasi-identifiers so every record is indistinguishable from at least k−1 others; it hides who. l-diversity additionally requires at least l distinct sensitive values within each group; it hides what, defending against the homogeneity attack that defeats k-anonymity alone.
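The k-anonymity and l-diversity definitions can be made concrete with a minimal sketch. The table below is hypothetical (invented age bands, postcode prefixes and diagnoses), and the helper names are illustrative, not taken from the subject materials. The last column is treated as the sensitive attribute and the others as quasi-identifiers.

```python
from collections import defaultdict

# Hypothetical, already-generalised records:
# (age band, postcode prefix, diagnosis). Invented for illustration only.
records = [
    ("20-29", "30**", "flu"),
    ("20-29", "30**", "flu"),
    ("20-29", "30**", "flu"),
    ("30-39", "31**", "asthma"),
    ("30-39", "31**", "flu"),
    ("30-39", "31**", "diabetes"),
]


def group_by_quasi_identifiers(rows):
    """Map each quasi-identifier combination to its list of sensitive values."""
    groups = defaultdict(list)
    for *quasi, sensitive in rows:
        groups[tuple(quasi)].append(sensitive)
    return groups


def k_anonymity(rows):
    """Largest k the table satisfies: the size of its smallest group."""
    return min(len(v) for v in group_by_quasi_identifiers(rows).values())


def l_diversity(rows):
    """Largest l the table satisfies: fewest distinct sensitive values in a group."""
    return min(len(set(v)) for v in group_by_quasi_identifiers(rows).values())


k_value = k_anonymity(records)
l_value = l_diversity(records)
print(k_value, l_value)
```

This table is 3-anonymous but only 1-diverse: everyone in the first group has the same diagnosis, so knowing someone is in that group reveals their sensitive value. That is the homogeneity attack in miniature.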
COMP20008 FAQ
Is COMP20008 hard?
It is broad rather than deep. The challenge is the unusually wide surface — data types and quality, cleaning, web scraping, text and Bag-of-Words, clustering and PCA, correlation and mutual information, classification and regression, evaluation, privacy, LLMs/RAG and blockchain — almost all examined as short-answer plus a tiny by-hand calculation plus "critically evaluate", not coding. With a formula sheet provided, the difficulty is breadth and method-selection, not memorising maths. Students who keep up week-to-week and practise the recurring question types find the exam very predictable.
Is the COMP20008 exam a hurdle?
Yes. The handbook sets a dual hurdle: you must score at least 20/50 on the continuous project assessment and at least 20/50 on the end-of-semester exam, on top of passing overall. The headline split is 50% continuous / 50% exam, so the exam is both the largest single stake and a barrier you cannot afford to fail.
Do I need to memorise the formulas?
No. A formula sheet is printed on page 1 of the exam — Euclidean distance, Pearson r, entropy and conditional entropy, mutual information and NMI, accuracy/precision/recall/F1, Jaccard and cosine. Marks come from choosing the right method, substituting the numbers correctly with one clean by-hand calculation, and explaining the trade-off. The exam even notes you don't have to reduce a square root to a decimal — answers are expressions, not memorised numbers.
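As a sketch of the "substitute the numbers" step for the confusion-matrix metrics on the formula sheet, the snippet below uses hypothetical counts (invented for illustration, not from any past paper) and the standard definitions of accuracy, precision, recall and F1.

```python
# Hypothetical confusion-matrix counts, invented for illustration only.
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + fp + fn + tn)  # fraction of all predictions that are correct
precision = tp / (tp + fp)                  # of predicted positives, how many are real
recall = tp / (tp + fn)                     # of real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

By hand, the same substitution would be written as fractions (for example precision = 40/50), which is the form the exam accepts.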
Is COMP20008 a coding exam?
Not in the written exam. Coding (pandas, BeautifulSoup, scikit-learn) is assessed through the continuous project half and the workshops. The closed-book exam is short-answer, small by-hand computation and critical evaluation — for example classifying data types, building a BoW vector, reading a dendrogram, or naming the two steps of differential privacy — so practise reasoning and calculation by hand, not writing code under exam conditions.
What does the continuous assessment involve?
In the current instance the 50% continuous half is two assignments on real datasets. Assignment 1 is individual — a coding-and-analysis submission, an interactive oral assessment over Zoom, and a critical-thinking submission. Assignment 2 is a group project — a report, slides and an in-person group oral (content marked at the group level, communication individually). There are also weekly ungraded quizzes for formative practice. The exact sub-weights inside the 50% are subject to confirmation against your unit outline.
How to study for the exam
Treat COMP20008 as a breadth subject where method-selection is the skill: the formula sheet is given, so your edge is picking the right tool fast and explaining the trade-off.
(1) Anchor every topic to the five-stage data pipeline (input → cleanup → interpretation → output → use) so you always know which stage a question lives in.
(2) Drill the recurring exam patterns directly, since they repeat almost verbatim year to year: classify data types, name two data-quality problems, crawling vs scraping, BoW vectors + Euclidean distance, TF vs TF-IDF, single/complete-linkage dendrograms, the elbow + outliers, Pearson-vs-NMI (the y = x² archetype), confusion-matrix metrics + which to optimise, data leakage, the k-anonymity/l-diversity homogeneity attack, the differential-privacy two-step, AUTOMAT, RAG, and blockchain verification.
(3) For every calculation write the formula in symbols, substitute the numbers, then state the answer with one line of interpretation; examiners reward the reasoning chain and the "so what".
(4) Practise the "critically evaluate" muscle: each method has a standard limitation (Pearson misses non-linearity, the elbow is subjective, k-means is outlier-sensitive, high PCA variance ≠ relevance, leakage gives MSE 0), and naming it earns the mark.
(5) Keep up with the workshops and assignments: the by-hand skills they build are exactly what the closed-book exam tests, and the dual 20/50 hurdle means neither half can be neglected.
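The Pearson-vs-NMI pattern on y = x² can be reproduced with a minimal sketch. It assumes five symmetric integer x values (chosen here for illustration), treats each distinct value as its own category rather than binning, and normalises by min(H(X), H(Y)). The function names are illustrative.

```python
from collections import Counter
from math import log2, sqrt


def entropy(values):
    """Shannon entropy H(V) in bits, from empirical frequencies."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(y, x):
    """H(Y|X) = sum over x-values of P(x) * H(Y | X = x)."""
    n = len(x)
    total = 0.0
    for x_value, count in Counter(x).items():
        y_given_x = [yi for xi, yi in zip(x, y) if xi == x_value]
        total += (count / n) * entropy(y_given_x)
    return total


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Assumed example data: symmetric x, with y fully determined by x.
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]

r = pearson_r(x, y)
mi = entropy(y) - conditional_entropy(y, x)   # MI(X,Y) = H(Y) - H(Y|X)
nmi = mi / min(entropy(x), entropy(y))

print(r, mi, nmi)
```

Here r is 0 because the positive and negative halves of the parabola cancel, while NMI is 1 because knowing x removes all uncertainty about y. That contrast is the one-line interpretation the exam pattern is after.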