COMP20008 · Elements Of Data Processing
Data Cleaning, Imputation & Scaling
Real data is dirty, so before any model runs you must audit its quality and fix what you find. This chapter covers the six data-quality dimensions (accuracy, completeness, consistency, timeliness, believability, interpretability), the common problems they flag (missing, inconsistent, duplicate, outlier values), and the three repair threads: imputation of missing data, scaling/normalisation before scale-sensitive methods, and sampling. It is examined as short-answer questions: the 2024 exam's Q2 asks you to name two data-quality problems with an example each, and scaling underpins later k-NN/k-means/PCA questions.
What this chapter covers
- 1. The six data-quality dimensions: accuracy, completeness, consistency, timeliness, believability, interpretability
- 2. Common problems: missing values, inconsistent representation, duplicates, outliers, wrong/implausible values
- 3. Resolving inconsistencies: standardise naming, formats and equivalency; de-duplicate
- 4. Imputation: drop rows/cols; mean/median/mode; forward/backward fill; model-based (regression / k-NN)
- 5. The imputation trade-off: bias vs information loss
- 6. Min-max normalisation: x' = (x − min) / (max − min) → [0, 1]
- 7. z-score standardisation: z = (x − x̄) / s
- 8. Sampling: random, stratified, systematic; why sample, and the bias risks
Name two data-quality problems with an example each (mirrors 2024 Q2)
- (1 mark) Inconsistent representation: the same fact recorded in different formats within one column. Example: a date column holding "13/02/19", "13th Feb 2019" and "2019-02-13", which must be standardised to one format before it can be parsed or sorted.
- (1 mark) Missing values: fields that were not recorded or are unavailable. Example: a "date of birth" column with blank cells or the placeholder "20 years ago", which breaks any age computation, so the value must be imputed or the row removed.
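The date example in the model answer can be sketched in Python. This is a minimal illustration, not a prescribed method: the function name `standardise_date` and the list `KNOWN_FORMATS` are hypothetical, and the sketch assumes the column only ever contains the three formats quoted above. Anything it cannot parse is returned as `None`, i.e. treated as a missing value to be handled separately.

```python
import re
from datetime import datetime

# Assumed set of formats present in the column; extend it for real data.
KNOWN_FORMATS = ("%d/%m/%y", "%d %b %Y", "%Y-%m-%d")

def standardise_date(raw):
    """Return the date as ISO 'YYYY-MM-DD', or None if it cannot be parsed."""
    if raw is None:
        return None
    # Drop ordinal suffixes such as '13th' -> '13'.
    text = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", raw.strip())
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None  # unparseable entries are flagged as missing

raw_dates = ["13/02/19", "13th Feb 2019", "2019-02-13", "", "20 years ago"]
cleaned = [standardise_date(d) for d in raw_dates]
print(cleaned)
```

The three genuine dates collapse to one representation, while the blank cell and the "20 years ago" placeholder surface as `None`, which makes the missing-value problem explicit instead of hiding it inside a text column.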
Key terms
- Six data-quality dimensions
- The pre-ingest checklist: accuracy (values are correct), completeness (no missing data), consistency (no contradictions across records/formats), timeliness (data is current enough), believability (the data is trusted), interpretability (it is understandable). Most quality problems map onto one of these.
- Imputation
- Filling missing values rather than dropping them. Options range from simple (mean/median/mode, forward/backward fill) to model-based (predict the missing value with regression or k-NN). The trade-off is that imputing introduces some bias, while dropping rows loses information.
- Min-max normalisation
- Rescales a feature to the range [0, 1] with x' = (x − min) / (max − min). It preserves the shape of the distribution but is sensitive to outliers (one extreme value stretches the range and compresses everything else).
- z-score standardisation
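- Recentres and rescales a feature to mean 0 and standard deviation 1 with z = (x − x̄) / s. Because the mean and standard deviation depend on every value rather than only the two extremes, a single outlier usually distorts it less than min-max, though it is not immune. It is the usual choice before scale-sensitive methods like k-NN, k-means and PCA.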
- Outlier
- A value far from the bulk of the data. Outliers distort means, inflate variance, and (because they sit far away in feature space) drag centroids in k-means and dominate distance-based methods, so they must be detected and handled before modelling.
- Sampling
- Selecting a subset of records to analyse: random (every record equally likely), stratified (sample within subgroups to preserve their proportions), systematic (every k-th record). Sampling reduces cost and makes large datasets tractable, but it can introduce bias if the scheme is not representative.
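The three sampling schemes in the key terms can be sketched with the standard library. The helper names (`random_sample`, `systematic_sample`, `stratified_sample`) and the toy records are assumptions made for illustration; the stratified version assumes proportional allocation with at least one record per subgroup.

```python
import random
from collections import defaultdict

def random_sample(records, n, seed=0):
    """Simple random sample: every record is equally likely to be chosen."""
    return random.Random(seed).sample(records, n)

def systematic_sample(records, k, start=0):
    """Systematic sample: every k-th record, beginning at index `start`."""
    return records[start::k]

def stratified_sample(records, key, fraction, seed=0):
    """Sample the same fraction inside each subgroup to preserve proportions."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    sample = []
    for members in groups.values():
        size = max(1, round(len(members) * fraction))  # at least one per group
        sample.extend(rng.sample(members, size))
    return sample

# Toy data: 80 records in group A and 20 in group B.
records = [{"id": i, "group": "A" if i < 80 else "B"} for i in range(100)]

simple = random_sample(records, 10)
systematic = systematic_sample(records, 10)
stratified = stratified_sample(records, lambda r: r["group"], 0.1)
```

A simple random sample of 10 may over- or under-represent group B by chance, whereas the stratified sample keeps the 80/20 split by construction. The systematic sample is only safe if the record order carries no pattern with period k.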
Data Cleaning, Imputation & Scaling FAQ
When should I drop rows versus impute missing values?
Drop when missingness is rare and looks random, so removing a few rows barely affects the dataset; impute when too many rows would be lost or when the values are systematically missing in a way you can model. The trade-off is the heart of the answer: dropping loses information and can bias the sample, while imputing fabricates plausible values and can dampen real variance. State the trade-off, not just the method.
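The options in this answer can be compared on a toy column. This is a sketch under stated assumptions: the `ages` list is invented, missing entries are represented as `None`, and `forward_fill` is a hypothetical helper rather than a library function.

```python
from statistics import mean, median, stdev

ages = [22, None, 25, 31, None, 90]  # None marks a missing value
observed = [a for a in ages if a is not None]

# Option 1: drop the incomplete entries (loses two of six records).
dropped = observed

# Option 2: fill with a single summary statistic of the observed values.
mean_filled = [a if a is not None else mean(observed) for a in ages]
median_filled = [a if a is not None else median(observed) for a in ages]

def forward_fill(values):
    """Carry the last observed value forward; a leading gap stays None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled

ffilled = forward_fill(ages)

# Mean imputation keeps the mean but shrinks the spread.
print(stdev(observed), stdev(mean_filled))
```

Here the mean (42) is pulled up by the value 90 while the median (28) is not, so the two fills disagree. The standard deviation of the mean-filled column is smaller than that of the observed values, which is the "dampened variance" side of the trade-off.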
Why do I need to scale features before k-NN, k-means or PCA?
Because all three are sensitive to scale: k-NN and k-means compare points by distance and PCA looks for directions of greatest variance, so an unscaled feature with a large numeric range (say income in dollars) will dominate a small-range feature (say age in years) purely because of its units. Scaling (min-max to [0,1] or z-score to mean 0, sd 1) puts every feature on a comparable footing so the result reflects the data, not the measurement units.
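The income-versus-age effect can be seen in a small nearest-neighbour sketch. The four people and the helper names (`min_max_columns`, `nearest`) are invented for illustration, and the code assumes no feature is constant (otherwise max − min would be zero). It requires Python 3.8+ for `math.dist`.

```python
import math

# (income in dollars, age in years); values are made up for the example.
people = {
    "a": (50_000, 25),
    "b": (50_500, 65),
    "c": (51_000, 26),
    "d": (90_000, 70),
}

def min_max_columns(rows):
    """Min-max scale each feature (column) to [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((v - l) / (h - l) for v, l, h in zip(row, lo, hi))
            for row in rows]

def nearest(target, points):
    """Name of the point with the smallest Euclidean distance to `target`."""
    others = {name: p for name, p in points.items() if name != target}
    return min(others, key=lambda name: math.dist(points[target], others[name]))

scaled = dict(zip(people, min_max_columns(list(people.values()))))

raw_nearest = nearest("a", people)     # decided almost entirely by income
scaled_nearest = nearest("a", scaled)  # both features now contribute
print(raw_nearest, scaled_nearest)
```

On the raw values the nearest neighbour of `a` is `b`, a person 40 years older, because a $500 income gap outweighs everything else. After scaling, the nearest neighbour is `c`, who is one year older and earns $1,000 more.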
Min-max or z-score — which scaling should I use?
Use min-max when you need values bounded in [0,1] and the data has no extreme outliers, since one outlier stretches the range and squashes every other value. Use z-score when the data has outliers or you want a mean-0, sd-1 feature for distance methods: the mean and standard deviation depend on all values rather than only the two extremes, so a single outlier usually distorts the result less, although it still shifts both statistics. Either way you scale; the exam wants you to know the formula and the trade-off, not just one name.
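Both formulas can be applied to the same small column to see the trade-off. This is a sketch with invented data; it assumes s is the sample standard deviation (n − 1 in the denominator), whereas some libraries default to the population version, and it assumes the column is not constant.

```python
from statistics import mean, stdev

def min_max(values):
    """x' = (x - min) / (max - min), giving values in [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """z = (x - mean) / s, using the sample standard deviation (assumption)."""
    centre, spread = mean(values), stdev(values)
    return [(v - centre) / spread for v in values]

values = [10, 12, 14, 16, 100]  # 100 is a deliberate outlier

scaled = min_max(values)
standardised = z_score(values)

print([round(v, 3) for v in scaled])
print([round(v, 3) for v in standardised])
```

Under min-max the four ordinary values are packed into the bottom few percent of [0, 1] because the outlier sets the maximum. With only five values the outlier also inflates s, so the z-scores of the ordinary values sit close together too; the z-score's advantage shows more clearly on larger samples, where one extreme value has less influence on the mean and standard deviation.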
How is data cleaning examined in COMP20008?
As compact short-answer questions, exactly like the 2024 Q2: name data-quality problems and give an example, describe how you would resolve an inconsistency, or state and apply a scaling formula. The scaling formulas also resurface inside later clustering, k-NN and PCA questions, so knowing min-max and z-score by heart pays off twice.
Exam move
Learn the six quality dimensions as a single mnemonic so you can generate problems on demand: almost every cleaning question is answered by naming a dimension and giving a dataset example. Drill the two scaling formulas (min-max and z-score) until you can write and apply them in seconds, and pair each with its weakness (min-max is outlier-sensitive; z-score is usually less distorted by a single outlier but is unbounded and still affected by it), because that weakness is the "critically evaluate" mark. For imputation, never just name a method; always state the bias-vs-information-loss trade-off, since that is what distinguishes a 2-mark answer from a 1-mark one. Finally, remember scaling is a prerequisite for the clustering, k-NN and PCA chapters, so treat it as a reusable tool rather than a one-off topic.