University of Melbourne · S1 2026 · FACULTY OF SCIENCE

MAST20034 · Critical Thinking With Data

- one subject, every graph, every model, every mark

50% final exam · hurdle · 14 Chapters · 2-page Bible

Our own words - no uploaded lecturer files

Built to mirror S1 2026 · updated this semester

Chapter 10 of 10 · MAST20034

Big Data and Context

Weeks 11–12 close the subject by turning the critical lens on big data and re-centring context as the frame for all data reasoning. The core warning is that volume is not validity: a massive dataset is still biased if it was collected badly, and at scale a new trap appears — statistical significance ≠ practical importance, because with millions of rows even a trivial effect returns a tiny P-value. You meet the pitfalls that grow with the data: algorithmic bias (a model learning society's inequities), proxy variables (a stand-in that smuggles in a protected attribute), spurious correlations (more variables, more coincidences), and privacy erosion — plus a healthy scepticism toward AI in research. The chapter then frames data ethics through five principles and three justice lenses (drawing on data-feminism / data-ethics work), and returns to the universal opener — the who / why / what / how / when context questions — as the move you apply to any data claim. It ends on the figure of the critical data citizen: the whole point of the course. Exam prompts here ask you to critique a ‘big data’ or AI claim and surface its ethical and contextual blind spots.
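The 'more variables, more coincidences' point can be made concrete with a little arithmetic. The sketch below is an illustration, not something taken from the subject materials: it assumes every test is independent and every true effect is zero, and the function name and the 0.05 threshold are chosen only for the example.

```python
def chance_of_a_coincidence(num_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one 'significant' result when nothing is real.

    Assumes independent tests, which real variables rarely are, so treat
    the output as a rough guide rather than an exact figure.
    """
    return 1 - (1 - alpha) ** num_tests


# The more relationships you check, the more likely one looks 'real' by chance.
for m in (1, 10, 100, 1000):
    print(f"{m:>5} tests -> {chance_of_a_coincidence(m):.3f}")
```

Under these assumptions, ten tests already give roughly a 40% chance of at least one coincidental 'finding', which is why a correlation dredged from a wide dataset deserves extra scepticism.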

In this chapter

What this chapter covers

01 · 11.1 Statistical significance ≠ practical importance at scale
02 · 11.2 The four pitfalls that grow with the data: algorithmic bias, proxies, spurious correlation, privacy
03 · 11.3 Data ethics and the three justice lenses
04 · 11.4 The context questions: the universal critique opener
05 · 12.1 The critical data citizen

Worked example · free

Critiquing a ‘big data’ hiring claim, mark by mark

Q [4 marks]. A firm trains an AI on 10 years of its own hiring data to “objectively rank job applicants,” and points to the dataset's size as proof of fairness. In short-answer form, critique the claim using big-data and ethics concepts.

+1 Volume ≠ validity: a large dataset is not automatically fair, because size cannot cure a biased source. The model learns from past hiring, so it inherits its patterns.
+1 Name algorithmic bias: if past hiring favoured some groups, the AI reproduces that preference, laundering historical discrimination as an ‘objective’ score.
+1 Name proxy variables: even with protected attributes removed, proxies (postcode, school, gaps in employment) can smuggle them back in, so ‘objective’ is illusory.
+1 Ethics + fix: invoke a justice lens (who is harmed, who decided) and recommend bias auditing, fairness testing across groups, and human oversight rather than trusting size.

Size is not fairness: trained on historical hiring, the model inherits past discrimination as algorithmic bias, and proxy variables can reintroduce protected attributes even when those fields are dropped — so ‘objective’ is false. An ethics/justice lens asks who is harmed and demands bias auditing and human oversight. The marks are the named concepts (volume≠validity, algorithmic bias, proxies, justice), not a calculation.
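The 'fairness testing across groups' in the last mark can be pictured as a simple comparison of selection rates. This is a minimal sketch under stated assumptions: the audit log, group labels and function names are all invented for illustration, and comparing shortlisting rates is only one of several ways an audit could be run.

```python
from collections import defaultdict


def selection_rates(decisions):
    """Map each group to the share of its applicants the model shortlisted."""
    shortlisted = defaultdict(int)
    total = defaultdict(int)
    for group, was_shortlisted in decisions:
        total[group] += 1
        shortlisted[group] += int(was_shortlisted)
    return {g: shortlisted[g] / total[g] for g in total}


def rate_ratio(rates):
    """Lowest selection rate divided by the highest (1.0 means parity)."""
    return min(rates.values()) / max(rates.values())


# Hypothetical audit log of (group, shortlisted?) pairs, made up for this example.
audit_log = (
    [("group_a", True)] * 60 + [("group_a", False)] * 40
    + [("group_b", True)] * 30 + [("group_b", False)] * 70
)

rates = selection_rates(audit_log)
print(rates)
print(round(rate_ratio(rates), 2))
```

A ratio well below 1.0 does not prove discrimination on its own, but it is the kind of evidence a critique can ask the firm to produce instead of pointing to dataset size.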

Sia tip — Two reflexes win big-data prompts: ‘volume is not validity’ (a big biased dataset is still biased) and ‘at scale, significant ≠ important’. Then layer an ethics/justice lens: who is harmed, who is missing, who decided.

Glossary

Key terms

Volume ≠ validity: The principle that a dataset's size says nothing about whether it was collected well — a huge sample can be just as biased as a small one. Big data does not exempt a claim from the usual design and sampling critique.
Significance ≠ importance (at scale): With very large n, almost any nonzero effect becomes statistically significant (a tiny P-value), even when it is too small to matter. So at scale you must judge the effect size and its practical importance, not just the P-value.
Algorithmic bias: When a model learns and reproduces (or amplifies) inequities present in its training data, producing systematically unfair outputs while appearing neutral and ‘data-driven’.
Proxy variable: A feature that stands in for, and so leaks information about, a sensitive or protected attribute (e.g. postcode for race/income). Removing the protected field does not remove the bias if proxies remain.
Data ethics / justice lens: A framework for asking not just ‘is this accurate?’ but ‘who benefits, who is harmed, who is counted, who decided?’ — drawing on data-feminism and data-ethics work to surface power and equity in data practice.
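To see how a proxy can carry bias after the protected field is dropped, here is a small simulation. Everything in it is an assumption made for illustration: the two groups, the 90% link between group and postcode, the historical hire rates, and the toy 'model' that scores applicants by postcode alone.

```python
import random
from collections import defaultdict


def simulate_history(n=20_000, seed=0):
    """Invented hiring records as (group, postcode, hired) tuples."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        group = rng.choice(["A", "B"])
        home, other = ("north", "south") if group == "A" else ("south", "north")
        # Postcode matches the group's 'home' area 90% of the time.
        postcode = home if rng.random() < 0.9 else other
        # Past hiring favoured group A (0.6 versus 0.3).
        hired = rng.random() < (0.6 if group == "A" else 0.3)
        rows.append((group, postcode, hired))
    return rows


def mean_by(rows, key_index, value_index):
    """Average of one column, grouped by another column."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        sums[row[key_index]] += row[value_index]
        counts[row[key_index]] += 1
    return {k: sums[k] / counts[k] for k in counts}


history = simulate_history()

# Toy 'model': the group column is never used, only postcode.
postcode_score = mean_by(history, 1, 2)

# Score every applicant, then compare average scores by the hidden group.
scored = [(group, postcode_score[postcode]) for group, postcode, _ in history]
score_by_group = mean_by(scored, 0, 1)
gap = score_by_group["A"] - score_by_group["B"]
print(postcode_score, score_by_group, round(gap, 2))
```

Even though the model never sees the group label, group A ends up with noticeably higher average scores, because postcode leaks group membership and the historical hire rates were unequal.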

FAQ

Big Data and Context FAQ

Does more data make an analysis more trustworthy?

Not by itself. Volume is not validity — a large dataset collected with bias is still biased, and at scale even trivial effects look ‘significant’. Big data needs the same design, sampling and ethics scrutiny as any other data, plus attention to effect size.

Why is statistical significance misleading with big data?

Because, for any fixed nonzero effect, the P-value shrinks as n grows. With millions of records almost every difference is significant, including effects far too small to matter. You must report and judge the effect size, not lean on significance alone.
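A quick calculation shows the effect of n on the P-value. The sketch assumes a two-sample z-test with equal group sizes and a known common standard deviation, so the effect is expressed as a standardised difference; the function name and the 0.01 effect size are chosen only for illustration.

```python
import math


def two_sided_p(effect_size: float, n_per_group: int) -> float:
    """Two-sided P-value for a two-sample z-test.

    effect_size is the difference in means divided by the common standard
    deviation (assumed known), and both groups have n_per_group rows.
    """
    z = effect_size * math.sqrt(n_per_group / 2)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and stays accurate for large z.
    return math.erfc(abs(z) / math.sqrt(2))


# The same trivial effect (1% of a standard deviation) at growing sample sizes.
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,} per group -> P = {two_sided_p(0.01, n):.3g}")
```

The effect never changes, only the sample size does, yet the P-value moves from unremarkable to vanishingly small. That is why a big-data critique asks 'how large is the effect?' before 'is it significant?'.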

How do I bring ethics into a data critique?

Apply a justice lens: ask who is represented and who is missing, who benefits and who is harmed, and who decided what to measure. Watch for algorithmic bias and proxy variables that encode protected attributes, and weigh privacy and consent.

Study strategy

Exam move

Close your notes sheet with the two big-data reflexes — volume ≠ validity and, at scale, significant ≠ important — plus the named pitfalls (algorithmic bias, proxy variables, spurious correlation, privacy). Keep the context questions (who/why/what/how/when) as your universal opener, since they work on any prompt including this chapter's. Rehearse layering an ethics/justice lens onto a critique (who is harmed, who is missing, who decided), because the Week 11–12 prompts reward exactly that. This chapter also makes excellent revision: it re-tests sampling, significance and bias in a fresh ‘big data’ disguise.
