COMP20008 · Elements Of Data Processing
The Data Pipeline & Data Types
The data pipeline (Input → Cleanup & Enhancement → Interpretation → Output → Use) is the spine of the whole subject, and every later week re-anchors to one of its five stages. This chapter pins down the week 1 vocabulary that the exam tests: numerical vs non-numerical vs could-be-either, continuous vs discrete, the four levels of measurement, and the structured / semi-structured / unstructured split (JSON, XML, HTML). It is examined as short-answer classification: the 2024 exam's Q1 is a direct "classify each variable and justify" table.
What this chapter covers
- 1. The five-stage data pipeline: Input → Cleanup & Enhancement → Interpretation → Output → Use
- 2. Numerical vs non-numerical vs could-be-either, and why "looks like digits" ≠ numerical
- 3. Continuous vs discrete numerical data
- 4. Levels of measurement: nominal, ordinal, interval, ratio
- 5. Structured vs semi-structured vs unstructured data
- 6. Semi-structured formats: JSON, XML and HTML
- 7. Where COMP20008 sits: below COMP30027 Machine Learning and INFO20003 Database Systems
Classify data types (mirrors the 2024 Q1 classification table)
- (1 mark) Postcode "3010" → Non-numerical. It is a categorical identifier; doing arithmetic on it is meaningless even though it looks like a number (the "average postcode" has no sense).
- (1 mark) Temperature in °C → Numerical. It is continuous on an interval scale: differences are meaningful (a 5° rise is the same gap anywhere on the scale).
- (1 mark) T-shirt size {S, M, L} → Could-be-either. It is a category, but it has a natural order, so it can be treated as ordinal and coded 1/2/3 when a model needs numbers.
- (1 mark) Phone number → Non-numerical. It is an identifier: leading zeros matter and arithmetic is meaningless, so it is categorical despite being all digits.
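The 1/2/3 coding of T-shirt sizes mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the subject's own code: the names `SIZE_CODES`, `encode_sizes` and the sample `orders` list are hypothetical.

```python
# Hypothetical ordinal coding for T-shirt sizes (names and data are illustrative).
SIZE_CODES = {"S": 1, "M": 2, "L": 3}


def encode_sizes(sizes):
    """Map each size label to its ordinal code."""
    return [SIZE_CODES[size] for size in sizes]


orders = ["M", "S", "L", "M"]
codes = encode_sizes(orders)

# The order is meaningful, so sorting by code puts S before M before L.
sorted_orders = sorted(orders, key=SIZE_CODES.get)

print(codes)          # [2, 1, 3, 2]
print(sorted_orders)  # ['S', 'M', 'M', 'L']
```

The codes preserve order only. The gap between S and M is not guaranteed to equal the gap between M and L, which is why the variable is could-be-either rather than plainly numerical.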
Key terms
- The data pipeline: The five-stage flow that structures the whole subject: Data Input → Data Cleanup & Enhancement → Data Interpretation (models) → Data Output (visualisation) → Data Use (ethics). Naming the stage a question belongs to is the first step in answering it.
- Numerical vs non-numerical vs could-be-either: Numerical data supports meaningful arithmetic (temperature, income); non-numerical data is categorical/identifier (postcode, phone number); could-be-either is ordinal data with a natural order that can be coded numerically (T-shirt size, satisfaction rating).
- Continuous vs discrete: Continuous data can take any value in a range (height, time); discrete data takes separate, countable values (number of children, page count). Both are numerical, but the distinction affects which charts and distributions apply.
- Levels of measurement: Nominal (labels, no order), ordinal (ordered categories, no fixed gaps), interval (ordered with equal gaps but no true zero, e.g. °C), ratio (interval with a true zero, e.g. mass). The level limits which operations and statistics are valid.
- Structured / semi-structured / unstructured: Structured data lives in fixed tables (CSV, relational rows); semi-structured data carries tags or keys but no rigid schema (JSON, XML, HTML); unstructured data has no inherent format (free text, images). The format dictates how you must parse and clean it.
- JSON / XML / HTML: The three semi-structured formats the subject works with: JSON uses nested key–value objects and arrays; XML uses nested tagged elements; HTML is tag-based markup whose DOM tree is the target of web scraping.
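To make the point about HTML's tag tree concrete, here is a small sketch that walks a page's tags and collects link targets. It uses only Python's standard-library `html.parser`; the workshops may well use a different scraping library, and the `LinkCollector` class and the sample `page` string are assumptions made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag encountered while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


# Hypothetical page: tags nest, but no table schema is enforced.
page = (
    "<html><body><h1>Subjects</h1><ul>"
    '<li><a href="/comp20008">COMP20008</a></li>'
    '<li><a href="/info20003">INFO20003</a></li>'
    "</ul></body></html>"
)

collector = LinkCollector()
collector.feed(page)
links = collector.links
print(links)  # ['/comp20008', '/info20003']
```

The parser recovers structure from the tags themselves, which is what makes HTML semi-structured: the data describes its own hierarchy rather than fitting fixed columns.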
The Data Pipeline & Data Types FAQ
Why does the same value sometimes count as numerical and sometimes not?
Because the deciding test is whether arithmetic on the value is meaningful, not whether it is written with digits. "3010" as a postcode is a label — averaging postcodes is nonsense — so it is non-numerical; "3010" as a count of items supports addition and averaging, so it is numerical. Always ask what the number represents before you classify it.
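A short sketch shows the same digits behaving differently depending on what they represent. The values are made up for illustration, and storing postcodes as strings is a suggested convention rather than something the subject prescribes.

```python
from statistics import mean

# Illustrative values only: similar digits, two different meanings.
postcodes = ["3010", "3052", "0800"]  # labels, so stored as strings
item_counts = [3010, 3052, 800]       # counts, so stored as integers

# Arithmetic on counts is meaningful: this is a real average quantity.
average_count = mean(item_counts)

# Converting a postcode to an integer silently drops the leading zero,
# a sign that the value was a label all along.
round_trip = str(int(postcodes[2]))

print(average_count)  # about 2287.3
print(round_trip)     # 800, no longer the original label "0800"
```

Keeping identifiers as strings makes the classification visible in the data itself and stops accidental averaging or loss of leading zeros.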
What is the difference between interval and ratio data?
Both are numerical with equal gaps between values, but ratio data has a true, meaningful zero and interval data does not. Temperature in °C is interval (0 °C is not "no temperature", so 20 °C is not "twice as hot" as 10 °C); mass in kg is ratio (0 kg means none, and 20 kg really is twice 10 kg). The presence of a true zero is the test.
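The "twice as hot" claim can be checked with a little arithmetic. The sketch below converts the same two temperatures to kelvin, a scale that does have a true zero; kelvin is not mentioned in the chapter and is brought in here only to illustrate the test.

```python
def celsius_to_kelvin(celsius):
    """Convert °C to kelvin, a temperature scale with a true zero."""
    return celsius + 273.15


# Interval scale: the ratio of two °C readings depends on where zero was placed.
naive_ratio = 20 / 10
kelvin_ratio = celsius_to_kelvin(20) / celsius_to_kelvin(10)

# Differences survive the change of scale, which is what "equal gaps" means.
diff_c = 20 - 10
diff_k = celsius_to_kelvin(20) - celsius_to_kelvin(10)

# Ratio scale: mass has a true zero, so 20 kg really is twice 10 kg.
mass_ratio = 20 / 10

print(naive_ratio)             # 2.0, but not physically meaningful
print(round(kelvin_ratio, 3))  # 1.035
print(round(diff_k, 6))        # 10.0
print(mass_ratio)              # 2.0
```

On the scale with a true zero, 20 °C is only about 3.5% "hotter" than 10 °C, while the 10-degree difference is unchanged. Ratios need a true zero; differences only need equal gaps.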
How do JSON and XML differ for the same data?
Both are semi-structured and can hold the same nested information, but JSON uses lightweight key–value objects and arrays (favoured in web APIs and pandas workflows), whereas XML wraps every value in opening and closing tags and supports attributes and namespaces. In the workshops you parse both; the concept to carry into the exam is that each is hierarchical, self-describing and lacks a rigid table schema.
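As a rough illustration, the same record can be written in both formats and parsed with Python's standard library (`json` and `xml.etree.ElementTree`). The student record, its field names and the choice to put `id` in an XML attribute are all invented for this example.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record expressed as JSON: nested objects and an array.
json_text = """
{"student": {"id": "s001", "name": "Ada",
             "subjects": ["COMP20008", "INFO20003"]}}
"""

# The same record as XML: nested tagged elements plus an attribute.
xml_text = """
<student id="s001">
  <name>Ada</name>
  <subjects>
    <subject>COMP20008</subject>
    <subject>INFO20003</subject>
  </subjects>
</student>
"""

# JSON parses straight into dicts and lists.
record = json.loads(json_text)["student"]
json_subjects = record["subjects"]

# XML parses into an element tree that is navigated by tag name.
root = ET.fromstring(xml_text.strip())
xml_id = root.get("id")
xml_name = root.findtext("name")
xml_subjects = [subject.text for subject in root.iter("subject")]

print(json_subjects == xml_subjects)  # True
print(record["id"] == xml_id)         # True
```

Both parses recover the same hierarchy from the text alone, which is the "hierarchical, self-describing" property the answer above asks you to carry into the exam.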
How is this examined in COMP20008?
Almost always as a short-answer classification table, exactly like the 2024 Q1: you are given several variables and must label each as numerical / non-numerical / could-be-either (and sometimes the level of measurement) with a one-line reason. The reason is where the marks sit, so always justify with the "is arithmetic meaningful?" or "is there a true zero / natural order?" test rather than just stating the label.
Exam move
Memorise the five-stage pipeline as a fixed frame and label every topic in the subject by the stage it serves; this turns a sprawling syllabus into one map. For data types, build a tiny decision tree you can run in seconds. First ask: is arithmetic meaningful? If yes, the value is numerical, and you then decide whether it is continuous or discrete. If no, ask whether the categories have a natural order: if they do, it is could-be-either; if not, it is non-numerical. Practise the classification table on mixed examples (IDs, dates, ratings, money, codes) until the justification is automatic, because the exam reuses this Q1 pattern almost every year. Finally, be able to give one concrete example of JSON, XML and HTML and say in a sentence why each is semi-structured rather than structured.
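The decision tree for data types might be written out as a tiny function. This is one possible reading of the tree, not an official marking scheme: the function name `classify` and its three yes/no parameters are hypothetical, and the person answering still has to judge each question.

```python
def classify(arithmetic_meaningful, naturally_ordered=False, countable=False):
    """Run the data-type decision tree on three yes/no judgements."""
    if arithmetic_meaningful:
        # Numerical: separate countable values are discrete, otherwise continuous.
        kind = "discrete" if countable else "continuous"
        return f"numerical ({kind})"
    if naturally_ordered:
        # Ordered categories can be coded numerically when needed.
        return "could-be-either"
    return "non-numerical"


# The examples below reuse variables from this chapter.
print(classify(arithmetic_meaningful=False))                         # postcode
print(classify(arithmetic_meaningful=True))                          # temperature in °C
print(classify(arithmetic_meaningful=False, naturally_ordered=True)) # T-shirt size
print(classify(arithmetic_meaningful=True, countable=True))          # number of children
```

The function returns only the label. In the exam the one-line justification carries the marks, so each branch taken should be stated in words.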