University of Melbourne · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

COMP20008 · Elements Of Data Processing

- one subject, every graph, every model, every mark
50% final exam · hurdle · 14 Chapters · 5-page Bible
Our own words - no uploaded lecturer files
Built to mirror S1 2026 · updated this semester
Chapter 2 of 12 · COMP20008

Data Cleaning, Imputation & Scaling

Real data is dirty, so before any model runs you must audit quality and fix it. This chapter covers the six data-quality dimensions (accuracy, completeness, consistency, timeliness, believability, interpretability), the common problems they flag (missing, inconsistent, duplicate, outlier values), and the three repair threads: imputation of missing data, scaling/normalisation before distance-based methods, and sampling. It is examined as short-answer: the 2024 exam's Q2 asks you to name two data-quality problems with an example each, and scaling underpins later k-NN/k-means/PCA questions.

In this chapter

What this chapter covers

  • 1. The six data-quality dimensions: accuracy, completeness, consistency, timeliness, believability, interpretability
  • 2. Common problems: missing values, inconsistent representation, duplicates, outliers, wrong/implausible values
  • 3. Resolving inconsistencies: standardise naming, formats and equivalency; de-duplicate
  • 4. Imputation: drop rows/cols; mean/median/mode; forward/backward fill; model-based (regression / k-NN)
  • 5. The imputation trade-off: bias vs information loss
  • 6. Min-max normalisation: x' = (x − min) / (max − min) → [0, 1]
  • 7. z-score standardisation: z = (x − x̄) / s
  • 8. Sampling: random, stratified, systematic; why sample, and the bias risks
Worked example

Name two data-quality problems with an example each (mirrors 2024 Q2)

Q [2 marks]. Name TWO common data-quality problems, describe each in one line, and give a concrete example of how it would appear in a dataset.
  • [1 mark] Inconsistent representation: the same fact recorded in different formats within one column. Example: a date column holding "13/02/19", "13th Feb 2019" and "2019-02-13", which must be standardised to one format before it can be parsed or sorted.
  • [1 mark] Missing values: fields that were not recorded or are unavailable. Example: a "date of birth" column with blank cells or the placeholder "20 years ago", which breaks any age computation, so the value must be imputed or the row removed.
Answer: inconsistent representation and missing values (other valid answers: duplicates, outliers, wrong/implausible values).
Sia tip — Use the six quality dimensions as a ready checklist to generate problems from — and always pair each named problem with a concrete, dataset-level example, because the marker is rewarding that you can spot it in practice, not just recite the term.
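The date example in the worked answer can be made concrete with a short sketch. It assumes day-first dates and a fixed list of candidate formats (both are illustrative assumptions, not part of the exam question), and the helper name standardise_date is hypothetical. Anything that matches no format, such as the "20 years ago" placeholder, is returned as None so it can be treated as missing.

```python
import re
from datetime import datetime

# Candidate formats, tried in order. Assumption: dates are day-first.
FORMATS = ["%d/%m/%y", "%d %b %Y", "%Y-%m-%d"]

def standardise_date(raw):
    """Return the date as ISO 'YYYY-MM-DD', or None if no format matches."""
    # Strip ordinal suffixes so "13th Feb 2019" becomes "13 Feb 2019".
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", raw.strip())
    for fmt in FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # this format did not fit, try the next one
    return None

raw_dates = ["13/02/19", "13th Feb 2019", "2019-02-13", "20 years ago"]
clean_dates = [standardise_date(d) for d in raw_dates]
print(clean_dates)
# ['2019-02-13', '2019-02-13', '2019-02-13', None]
```

Once every entry shares one format the column can be sorted and parsed, and the None entry becomes an explicit missing value to impute or drop.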
Glossary

Key terms

Six data-quality dimensions
The pre-ingest checklist: accuracy (values are correct), completeness (no missing data), consistency (no contradictions across records/formats), timeliness (data is current enough), believability (the data is trusted), interpretability (it is understandable). Most quality problems map onto one of these.
Imputation
Filling missing values rather than dropping them. Options range from simple (mean/median/mode, forward/backward fill) to model-based (predict the missing value with regression or k-NN). The trade-off is that imputing introduces some bias, while dropping rows loses information.
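The simple imputation options above can be sketched in a few lines of plain Python. The ages column is made up for illustration, None stands in for a missing cell, and the helper names are hypothetical; model-based imputation (regression or k-NN) is not shown.

```python
from statistics import mean, median, mode

def impute_constant(values, fill):
    """Replace every None with one constant (the mean, median or mode)."""
    return [fill if v is None else v for v in values]

def forward_fill(values):
    """Replace each None with the most recent observed value before it.
    A None at the very start has nothing to copy and stays None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

ages = [20, None, 22, 22, None, 60]            # hypothetical column
observed = [v for v in ages if v is not None]  # [20, 22, 22, 60]

mean_filled = impute_constant(ages, mean(observed))      # fills with 31
median_filled = impute_constant(ages, median(observed))  # fills with 22.0
mode_filled = impute_constant(ages, mode(observed))      # fills with 22
ffilled = forward_fill(ages)

print(mean_filled)  # [20, 31, 22, 22, 31, 60]
print(ffilled)      # [20, 20, 22, 22, 22, 60]
```

Note how the single large value (60) pulls the mean fill up to 31 while the median fill stays at 22, which is one reason the median is often preferred for skewed columns.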
Min-max normalisation
Rescales a feature to the range [0, 1] with x' = (x − min) / (max − min). It preserves the shape of the distribution but is sensitive to outliers (one extreme value stretches the range and compresses everything else).
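A minimal sketch of the formula, using made-up marks, shows both the rescaling and the outlier sensitivity. The function name and the choice to raise an error when all values are equal (the denominator would be zero) are assumptions for illustration.

```python
def min_max(values):
    """Rescale values to [0, 1] using x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max is undefined when all values are equal")
    return [(x - lo) / (hi - lo) for x in values]

scores = [50, 60, 70, 80, 90]            # hypothetical marks
scaled = min_max(scores)
print(scaled)                            # [0.0, 0.25, 0.5, 0.75, 1.0]

with_outlier = [50, 60, 70, 80, 1050]    # one extreme value
scaled_outlier = min_max(with_outlier)
print(scaled_outlier)                    # [0.0, 0.01, 0.02, 0.03, 1.0]
```

In the second list the four ordinary values are compressed into the bottom 3% of the range, which is the outlier weakness described above.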
z-score standardisation
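Recentres and rescales a feature to mean 0 and standard deviation 1 with z = (x − x̄) / s, where x̄ is the feature's mean and s its standard deviation. It is less squashed by outliers than min-max (though an outlier still shifts x̄ and inflates s) and is the usual choice before scale-sensitive methods like k-NN, k-means and PCA.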
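The z-score formula can be checked the same way. This sketch assumes s is the sample standard deviation (dividing by n − 1); if your lecture notes divide by n, swap stdev for pstdev. The heights are invented for illustration.

```python
from statistics import mean, stdev

def z_score(values):
    """Standardise values with z = (x - mean) / s.
    Assumption: s is the sample standard deviation (n - 1 denominator)."""
    m, s = mean(values), stdev(values)
    if s == 0:
        raise ValueError("z-score is undefined when all values are equal")
    return [(x - m) / s for x in values]

heights = [150, 160, 170, 180, 190]   # hypothetical heights in cm
z = z_score(heights)
print([round(v, 3) for v in z])       # [-1.265, -0.632, 0.0, 0.632, 1.265]
```

The standardised values have mean 0 and standard deviation 1 by construction, and unlike min-max they are not confined to a fixed interval.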
Outlier
A value far from the bulk of the data. Outliers distort means, inflate variance, and (because they sit far away in feature space) drag centroids in k-means and dominate distance-based methods, so they must be detected and handled before modelling.
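One common way to detect outliers is the 1.5 × IQR rule. The chapter text does not prescribe a specific detection method, so treat the rule, the threshold k = 1.5 and the function name below as assumptions for illustration. The sketch also shows how a single extreme value distorts the mean while barely moving the median.

```python
from statistics import mean, median, quantiles

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR].
    Assumption: the 1.5 * IQR rule, which is one convention among several."""
    q1, _, q3 = quantiles(values, n=4)   # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

marks = [48, 50, 51, 52, 53, 55, 300]    # hypothetical data, 300 is a typo
flagged = iqr_outliers(marks)
print(flagged)                           # [300]

print(mean(marks), median(marks))        # 87 52
kept = [x for x in marks if x not in flagged]
print(mean(kept))                        # 51.5
```

Whether a flagged value is removed, corrected or kept depends on whether it is an error or a genuine extreme, so detection is only the first step.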
Sampling
Selecting a subset of records to analyse: random (every record equally likely), stratified (sample within subgroups to preserve their proportions), systematic (every k-th record). Sampling cuts cost and makes large datasets manageable, but can introduce bias if the scheme is not representative.
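The three schemes can be sketched with the standard library alone. The student records, the 80/20 split between "UG" and "PG", the fixed seeds and all function names are assumptions chosen so the behaviour is easy to see.

```python
import random
from collections import Counter, defaultdict

def simple_random_sample(records, n, seed=0):
    """Random sampling: every record is equally likely to be chosen."""
    return random.Random(seed).sample(records, n)

def systematic_sample(records, k, start=0):
    """Systematic sampling: every k-th record, beginning at index start."""
    return records[start::k]

def stratified_sample(records, key, fraction, seed=0):
    """Stratified sampling: draw the same fraction from each subgroup,
    so subgroup proportions in the sample match the full dataset."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[key(record)].append(record)
    sample = []
    for members in strata.values():
        n = max(1, round(len(members) * fraction))  # at least one per group
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical dataset: 80 undergraduates followed by 20 postgraduates.
students = [("UG", i) for i in range(80)] + [("PG", i) for i in range(20)]

rand = simple_random_sample(students, 10)
syst = systematic_sample(students, k=10)
strat = stratified_sample(students, key=lambda r: r[0], fraction=0.1)

print(Counter(level for level, _ in strat))  # Counter({'UG': 8, 'PG': 2})
```

The stratified sample reproduces the 80/20 split exactly, while a simple random sample of 10 may over- or under-represent the smaller group by chance. Systematic sampling is cheap but can be biased when the record order has a repeating pattern that lines up with k.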
FAQ

Data Cleaning, Imputation & Scaling FAQ

When should I drop rows versus impute missing values?

Drop when missingness is rare and looks random, so removing a few rows barely affects the dataset; impute when too many rows would be lost or when the values are systematically missing in a way you can model. The trade-off is the heart of the answer: dropping loses information and can bias the sample, while imputing fabricates plausible values and can dampen real variance. State the trade-off, not just the method.
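The trade-off can be seen numerically in a small sketch. The income figures are invented, None marks a missing cell, and sample variance is used as the measure of spread (an assumption; any spread measure shows the same effect).

```python
from statistics import mean, variance

incomes = [30, 40, None, 60, None, 70]   # hypothetical, in $1000s

# Option 1: drop rows with a missing value (information loss).
dropped = [v for v in incomes if v is not None]

# Option 2: fill each gap with the observed mean (fabricated values).
fill = mean(dropped)                      # 50
imputed = [fill if v is None else v for v in incomes]

print(len(dropped), len(imputed))         # 4 6
print(round(variance(dropped), 1))        # 333.3
print(variance(imputed))                  # 200
```

Dropping keeps only 4 of the 6 rows. Mean imputation keeps all 6, but because every filled value sits exactly at the mean, the variance falls from about 333 to 200, which is the dampening of real variance mentioned above.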

Why do I need to scale features before k-NN, k-means or PCA?

Because all three are sensitive to feature scale: k-NN and k-means compare points by distance, and PCA looks for the directions of greatest variance. An unscaled feature with a large numeric range (say income in dollars) will dominate a small-range feature (say age in years) purely because of its units. Scaling (min-max to [0,1] or z-score to mean 0, sd 1) puts every feature on a comparable footing so the result reflects the data, not the measurement units.
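The income-versus-age effect is easy to reproduce. In this sketch the four people and their values are invented, Euclidean distance is assumed as the distance measure, and min-max is applied per column; the helper names are hypothetical.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length feature tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_max_columns(rows):
    """Apply x' = (x - min) / (max - min) to each column independently."""
    cols = list(zip(*rows))
    scaled = [[(x - min(c)) / (max(c) - min(c)) for x in c] for c in cols]
    return [tuple(r) for r in zip(*scaled)]

# Hypothetical people as (income in dollars, age in years).
alice = (50_000, 25)
bob = (50_500, 65)     # similar income, 40 years older
carol = (52_000, 26)   # one year older, $2,000 more income
dave = (120_000, 45)   # included only to widen the income range

# Unscaled: income differences swamp age differences.
raw_bob = euclidean(alice, bob)       # about 501.6
raw_carol = euclidean(alice, carol)   # about 2000.0
print(raw_bob < raw_carol)            # True: Bob looks nearer to Alice

# Scaled: both features now contribute on a comparable footing.
s_alice, s_bob, s_carol, _ = min_max_columns([alice, bob, carol, dave])
print(euclidean(s_alice, s_carol) < euclidean(s_alice, s_bob))  # True
```

Before scaling, Alice's nearest neighbour is Bob even though he is 40 years older, simply because dollars are large numbers. After scaling, Carol is the nearest neighbour.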

Min-max or z-score — which scaling should I use?

Use min-max when you need values bounded in [0,1] and the data has no extreme outliers, since one outlier stretches the range and squashes every other value. Use z-score when the data has outliers or you want a mean-0, sd-1 feature for distance methods: it is less distorted by extremes than min-max, although a large outlier still shifts the mean and inflates s. Either way you scale; the exam wants you to know the formula and the trade-off, not just one name.

How is data cleaning examined in COMP20008?

As compact short-answer questions, exactly like the 2024 Q2: name data-quality problems and give an example, describe how you would resolve an inconsistency, or state and apply a scaling formula. The scaling formulas also resurface inside later clustering, k-NN and PCA questions, so knowing min-max and z-score by heart pays off twice.

Study strategy

Exam move

Learn the six quality dimensions as a single mnemonic so you can generate problems on demand: almost every cleaning question is answered by naming a dimension and giving a dataset example. Drill the two scaling formulas (min-max and z-score) until you can write and apply them in seconds, and pair each with its weakness (min-max is outlier-sensitive; z-score is less so, but its mean and s still shift), because that weakness is the "critically evaluate" mark. For imputation, never just name a method: always state the bias-vs-information-loss trade-off, since that is what distinguishes a 2-mark answer from a 1-mark one. Finally, remember scaling is a prerequisite for the clustering, k-NN and PCA chapters, so treat it as a reusable tool rather than a one-off topic.
