University of Melbourne · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

COMP20008 · Elements Of Data Processing

- one subject, every graph, every model, every mark
50% final exam · hurdle · 14 Chapters · 6-page Bible
Our own words - no uploaded lecturer files
Built to mirror S1 2026 · updated this semester
Chapter 1 of 12 · COMP20008

The Data Pipeline & Data Types

The data pipeline — Input → Cleanup & Enhancement → Interpretation → Output → Use — is the spine of the whole subject, and every later week re-anchors to one of its five stages. This chapter pins down the vocabulary the exam tests in week 1: numerical vs non-numerical vs could-be-either, continuous vs discrete, the four levels of measurement, and the structured / semi-structured / unstructured split (JSON, XML, HTML). It is examined as short-answer classification — the 2024 exam's Q1 is a direct "classify each variable and justify" table.

In this chapter

What this chapter covers

  • 1. The five-stage data pipeline: Input → Cleanup & Enhancement → Interpretation → Output → Use
  • 2. Numerical vs non-numerical vs could-be-either, and why "looks like digits" ≠ numerical
  • 3. Continuous vs discrete numerical data
  • 4. Levels of measurement: nominal, ordinal, interval, ratio
  • 5. Structured vs semi-structured vs unstructured data
  • 6. Semi-structured formats: JSON, XML and HTML
  • 7. Where COMP20008 sits: below COMP30027 Machine Learning and INFO20003 Database Systems
Worked example · free

Classify data types (mirrors the 2024 Q1 classification table)

Q [4 marks]. Classify each of the following as Numerical, Non-numerical, or Could-be-either, and give the reason: (a) a postcode "3010", (b) temperature in °C, (c) T-shirt size {S, M, L}, (d) a phone number.
  • [1 mark] Postcode "3010" → Non-numerical. It is a categorical identifier; doing arithmetic on it is meaningless even though it looks like a number (the "average postcode" has no sense).
  • [1 mark] Temperature in °C → Numerical. It is continuous on an interval scale: differences are meaningful (a 5° rise is the same gap anywhere on the scale).
  • [1 mark] T-shirt size {S, M, L} → Could-be-either. It is a category, but it has a natural order, so it can be treated as ordinal and coded 1/2/3 when a model needs numbers.
  • [1 mark] Phone number → Non-numerical. It is an identifier: leading zeros matter and arithmetic is meaningless, so it is categorical despite being all digits.
(a) Non-numerical, (b) Numerical, (c) Could-be-either, (d) Non-numerical.
Sia tip — The decision rule the markers want is "is arithmetic meaningful?", not "does it look like a number?" — apply that test to every item and the justification writes itself.
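A minimal Python sketch of why the test is "is arithmetic meaningful?" rather than "is it all digits?". The postcodes, phone number and temperatures below are made-up illustrative values, not data from the subject.

```python
from statistics import mean

# Hypothetical sample values, chosen only for illustration.
postcodes = ["3010", "3053", "0800"]   # stored as text labels
phone = "0412345678"                   # made-up phone number

# Converting an identifier to an integer silently drops the leading zero,
# so the converted value no longer matches the original identifier.
phone_as_int = int(phone)
print(phone_as_int)                    # 412345678
print(str(phone_as_int) == phone)      # False

# Arithmetic runs without error, but the result means nothing:
# the mean of three postcodes is not a place.
fake_average = mean(int(p) for p in postcodes)
print(fake_average)

# Temperature is genuinely numerical: its mean is a meaningful quantity.
temps_c = [18.0, 21.0, 21.0]
mean_temp = mean(temps_c)
print(mean_temp)                       # 20.0
```

Python computes both means without complaint, which is the point: the language cannot tell you whether arithmetic is meaningful, so the classification has to come from what the value represents.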
Glossary

Key terms

The data pipeline
The five-stage flow that structures the whole subject: Data Input → Data Cleanup & Enhancement → Data Interpretation (models) → Data Output (visualisation) → Data Use (ethics). Naming the stage a question belongs to is the first step in answering it.
Numerical vs non-numerical vs could-be-either
Numerical data supports meaningful arithmetic (temperature, income); non-numerical data is a category label or identifier on which arithmetic is meaningless (postcode, phone number); could-be-either data is ordered categorical (ordinal) data that can be coded numerically when needed (T-shirt size, satisfaction rating).
Continuous vs discrete
Continuous data can take any value in a range (height, time); discrete data takes separate, countable values (number of children, page count). Both are numerical, but the distinction affects which charts and distributions apply.
Levels of measurement
Nominal (labels, no order), ordinal (ordered categories, no fixed gaps), interval (ordered with equal gaps but no true zero, e.g. °C), ratio (interval with a true zero, e.g. mass). The level limits which operations and statistics are valid.
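One way to hold the four levels in your head is as a ladder where each level inherits everything valid at the levels below it. The sketch below encodes that idea. The specific operation lists follow the conventional textbook summary and are an assumption here, not something taken from the subject's own materials; `valid_operations` is a hypothetical helper name.

```python
# Levels ordered from weakest to strongest.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# Operations that become valid for the first time at each level
# (conventional summary, assumed for illustration).
NEW_OPERATIONS = {
    "nominal": ["equality", "mode"],
    "ordinal": ["ordering", "median"],
    "interval": ["differences", "mean"],
    "ratio": ["ratios"],
}

def valid_operations(level: str) -> list[str]:
    """Return all operations valid at `level`, including inherited ones."""
    idx = LEVELS.index(level)  # raises ValueError for an unknown level
    ops = []
    for lower in LEVELS[: idx + 1]:
        ops.extend(NEW_OPERATIONS[lower])
    return ops

print(valid_operations("ordinal"))            # ['equality', 'mode', 'ordering', 'median']
print("mean" in valid_operations("ordinal"))  # False
```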
Structured / semi-structured / unstructured
Structured data lives in fixed tables (CSV, relational rows); semi-structured data carries tags or keys but no rigid schema (JSON, XML, HTML); unstructured data has no inherent format (free text, images). The format dictates how you must parse and clean it.
JSON / XML / HTML
The three semi-structured formats the subject works with: JSON uses nested key–value objects and arrays; XML uses nested tagged elements; HTML is tag-based markup whose DOM tree is the target of web scraping.
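To make "tag-based markup whose tree is the target of web scraping" concrete, here is a small sketch that walks an HTML fragment and pulls out selected elements. It uses Python's standard-library `html.parser` purely so the example is self-contained; the workshops may well use a different library. The HTML fragment and the `SubjectCollector` class are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment, made up for illustration.
PAGE = """
<html><body>
  <h1>Subjects</h1>
  <ul>
    <li class="subject">COMP20008</li>
    <li class="subject">INFO20003</li>
    <li class="note">Enrolments open</li>
  </ul>
</body></html>
"""

class SubjectCollector(HTMLParser):
    """Collect the text of every <li class="subject"> element."""

    def __init__(self):
        super().__init__()
        self.in_subject = False
        self.subjects = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "li" and ("class", "subject") in attrs:
            self.in_subject = True

    def handle_endtag(self, tag):
        if tag == "li":
            self.in_subject = False

    def handle_data(self, data):
        # Keep only non-blank text found inside a matching <li>.
        if self.in_subject and data.strip():
            self.subjects.append(data.strip())

collector = SubjectCollector()
collector.feed(PAGE)
print(collector.subjects)   # ['COMP20008', 'INFO20003']
```

The tags and the `class` attribute describe the data, but nothing forces every page to follow the same layout, which is why HTML counts as semi-structured rather than structured.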
FAQ

The Data Pipeline & Data Types FAQ

Why does the same value sometimes count as numerical and sometimes not?

Because the deciding test is whether arithmetic on the value is meaningful, not whether it is written with digits. "3010" as a postcode is a label — averaging postcodes is nonsense — so it is non-numerical; "3010" as a count of items supports addition and averaging, so it is numerical. Always ask what the number represents before you classify it.

What is the difference between interval and ratio data?

Both are numerical with equal gaps between values, but ratio data has a true, meaningful zero and interval data does not. Temperature in °C is interval (0 °C is not "no temperature", so 20 °C is not "twice as hot" as 10 °C); mass in kg is ratio (0 kg means none, and 20 kg really is twice 10 kg). The presence of a true zero is the test.
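The "twice as hot" claim can be checked with a few lines of arithmetic. The sketch below brings in the Kelvin scale, which is not mentioned above, because it measures the same quantity with a true zero; the variable names are illustrative.

```python
def celsius_to_kelvin(c: float) -> float:
    """Convert degrees Celsius to kelvin (a scale with a true zero)."""
    return c + 273.15

c_low, c_high = 10.0, 20.0

# Differences survive the change of scale: interval data supports subtraction.
diff_c = c_high - c_low
diff_k = celsius_to_kelvin(c_high) - celsius_to_kelvin(c_low)
print(diff_c, round(diff_k, 6))        # 10.0 10.0

# Ratios do not: the "2x" in Celsius is an artefact of where zero was placed.
naive_ratio = c_high / c_low
kelvin_ratio = celsius_to_kelvin(c_high) / celsius_to_kelvin(c_low)
print(naive_ratio)                     # 2.0
print(round(kelvin_ratio, 3))          # 1.035

# Mass already has a true zero, so its ratio is meaningful as written.
mass_ratio = 20.0 / 10.0
print(mass_ratio)                      # 2.0
```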

How do JSON and XML differ for the same data?

Both are semi-structured and can hold the same nested information, but JSON uses lightweight key–value objects and arrays (favoured in web APIs and pandas workflows), whereas XML wraps every value in opening and closing tags and supports attributes and namespaces. In the workshops you parse both; the concept to carry into the exam is that each is hierarchical, self-describing and lacks a rigid table schema.
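A short sketch of the same record expressed in both formats and read with Python's standard library. The student record, its field names and the choice to store the ID as an XML attribute are all assumptions made for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record in both formats.
JSON_TEXT = '{"student": {"id": "0042", "subjects": ["COMP20008", "INFO20003"]}}'

XML_TEXT = """
<student id="0042">
  <subjects>
    <subject>COMP20008</subject>
    <subject>INFO20003</subject>
  </subjects>
</student>
"""

# JSON: nested key-value objects and arrays map directly to dicts and lists.
record = json.loads(JSON_TEXT)
json_id = record["student"]["id"]
json_subjects = record["student"]["subjects"]

# XML: nested tagged elements; here the id is carried as an attribute.
root = ET.fromstring(XML_TEXT.strip())
xml_id = root.get("id")
xml_subjects = [el.text for el in root.findall("./subjects/subject")]

print(json_id, json_subjects)   # 0042 ['COMP20008', 'INFO20003']
print(xml_id, xml_subjects)     # 0042 ['COMP20008', 'INFO20003']
```

Both parses recover the same hierarchy, and in both the ID is kept as text so its leading zeros survive.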

How is this examined in COMP20008?

Almost always as a short-answer classification table, exactly like the 2024 Q1: you are given several variables and must label each as numerical / non-numerical / could-be-either (and sometimes the level of measurement) with a one-line reason. The reason is where the marks sit, so always justify with the "is arithmetic meaningful?" or "is there a true zero / natural order?" test rather than just stating the label.

Study strategy

Exam move

Memorise the five-stage pipeline as a fixed frame and label every topic in the subject by the stage it serves; it turns a sprawling syllabus into one map. For data types, build a tiny decision tree you can run in seconds. First ask: is arithmetic meaningful? If yes, the data is numerical, and you then decide whether it is continuous or discrete. If no, ask whether the categories have a natural order: an ordered category is could-be-either, and an unordered label or identifier is non-numerical. Practise the classification table on mixed examples (IDs, dates, ratings, money, codes) until the justification is automatic, because the exam reuses this Q1 pattern almost every year. Finally, be able to give one concrete example of JSON, XML and HTML and say in a sentence why each is semi-structured rather than structured.
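The data-type decision tree above can be sketched as a small function. The function name, its parameters and the order in which the questions are asked are assumptions for illustration; you still have to answer each question yourself from what the variable represents.

```python
def classify(arithmetic_meaningful: bool,
             ordered_category: bool = False,
             countable: bool = False) -> str:
    """Run the data-type decision tree on answers you supply."""
    if arithmetic_meaningful:
        # Numerical: next decide discrete (countable values) or continuous.
        return "numerical (discrete)" if countable else "numerical (continuous)"
    if ordered_category:
        # A category with a natural order can be coded 1/2/3 when needed.
        return "could-be-either"
    return "non-numerical"

# The four items from the worked example, plus one discrete case.
print(classify(arithmetic_meaningful=False))                         # non-numerical (postcode)
print(classify(arithmetic_meaningful=True))                          # numerical (continuous) (temperature)
print(classify(arithmetic_meaningful=False, ordered_category=True))  # could-be-either (T-shirt size)
print(classify(arithmetic_meaningful=True, countable=True))          # numerical (discrete) (number of children)
```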
