MAST20034 · Critical Thinking With Data
Big Data and Context
Weeks 11–12 close the subject by turning the critical lens on big data and re-centring context as the frame for all data reasoning. The core warning is that volume is not validity: a massive dataset is still biased if it was collected badly, and at scale a new trap appears — statistical significance ≠ practical importance, because with millions of rows even a trivial effect returns a tiny P-value. You meet the pitfalls that grow with the data: algorithmic bias (a model learning society's inequities), proxy variables (a stand-in that smuggles in a protected attribute), spurious correlations (more variables, more coincidences), and privacy erosion — plus a healthy scepticism toward AI in research. The chapter then frames data ethics through five principles and three justice lenses (drawing on data-feminism / data-ethics work), and returns to the universal opener — the who / why / what / how / when context questions — as the move you apply to any data claim. It ends on the figure of the critical data citizen: the whole point of the course. Exam prompts here ask you to critique a ‘big data’ or AI claim and surface its ethical and contextual blind spots.
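The "more variables, more coincidences" point about spurious correlations can be put in numbers. The sketch below is illustrative only: it assumes every test is independent, every true effect is zero, and the cut-off is the conventional 0.05. The function names are invented for this example.

```python
def num_pairs(num_variables: int) -> int:
    """Number of distinct pairwise correlations among num_variables columns."""
    return num_variables * (num_variables - 1) // 2


def chance_of_false_alarm(num_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of num_tests independent tests comes out
    'significant' when no real effect exists (assumption: independence)."""
    return 1.0 - (1.0 - alpha) ** num_tests


def expected_false_alarms(num_tests: int, alpha: float = 0.05) -> float:
    """Expected count of 'significant' results when no real effect exists."""
    return num_tests * alpha


if __name__ == "__main__":
    for k in (5, 20, 100):
        m = num_pairs(k)
        print(k, "variables ->", m, "pairs;",
              "P(at least one false alarm) =", round(chance_of_false_alarm(m), 3),
              "; expected false alarms =", round(expected_false_alarms(m), 1))
```

With 100 variables there are 4,950 pairs to compare, so some "significant" correlations are expected by coincidence alone. Finding one is not evidence of a real relationship.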
What this chapter covers
- 11.1 Statistical significance ≠ practical importance at scale
- 11.2 The four pitfalls that grow with the data: algorithmic bias, proxies, spurious correlation, privacy
- 11.3 Data ethics and the three justice lenses
- 11.4 The context questions: the universal critique opener
- 12.1 The critical data citizen
Critiquing a ‘big data’ hiring claim, mark by mark
- +1 Volume ≠ validity: a large dataset is not automatically fair, because size cannot cure a biased source. The model learns from past hiring, so it inherits the patterns of past hiring.
- +1 Name algorithmic bias: if past hiring favoured some groups, the model reproduces that preference, laundering historical discrimination as an ‘objective’ score.
- +1 Name proxy variables: even with protected attributes removed, proxies (postcode, school, gaps in employment) can smuggle them back in, so ‘objective’ is illusory.
- +1 Ethics + fix: invoke a justice lens (who is harmed, who decided) and recommend bias auditing, fairness testing across groups, and human oversight rather than trusting size.
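One simple form of the "fairness testing across groups" recommended above is to compare selection rates between groups. The sketch below is a minimal illustration, not a full audit: the records are made up, the group labels "A" and "B" are placeholders, and the function names are invented for this example.

```python
from collections import defaultdict


def selection_rates(records):
    """records: iterable of (group, selected) pairs.
    Returns {group: fraction of that group that was selected}."""
    totals = defaultdict(int)
    chosen = defaultdict(int)
    for group, selected in records:
        totals[group] += 1
        chosen[group] += int(selected)
    return {group: chosen[group] / totals[group] for group in totals}


def rate_ratio(rates):
    """Lowest selection rate divided by the highest (1.0 means equal rates)."""
    highest = max(rates.values())
    if highest == 0:
        return float("nan")  # nobody was selected, so the ratio is undefined
    return min(rates.values()) / highest


# Made-up screening outcomes: 60 of 100 selected in group A, 30 of 100 in group B.
records = (
    [("A", True)] * 60 + [("A", False)] * 40
    + [("B", True)] * 30 + [("B", False)] * 70
)

if __name__ == "__main__":
    rates = selection_rates(records)
    print(rates, "ratio:", rate_ratio(rates))
```

A ratio well below 1 flags a disparity worth investigating. It does not explain the cause, and what counts as "too low" is a judgement that needs the context questions and human oversight.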
Key terms
- Volume ≠ validity: a dataset's size says nothing about whether it was collected well. A huge sample can be just as biased as a small one, so big data does not exempt a claim from the usual design and sampling critique.
- Significance ≠ importance (at scale): with very large n, almost any nonzero effect becomes statistically significant (a tiny P-value), even when it is too small to matter. At scale, judge the effect size and its practical importance, not just the P-value.
- Algorithmic bias: when a model learns and reproduces (or amplifies) inequities present in its training data, producing systematically unfair outputs while appearing neutral and ‘data-driven’.
- Proxy variable: a feature that stands in for, and so leaks information about, a sensitive or protected attribute (e.g. postcode for race/income). Removing the protected field does not remove the bias if proxies remain.
- Data ethics / justice lens: a framework for asking not just ‘is this accurate?’ but ‘who benefits, who is harmed, who is counted, who decided?’, drawing on data-feminism and data-ethics work to surface power and equity in data practice.
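The proxy-variable idea can be checked directly: if a remaining feature lets you guess the protected attribute much better than chance, deleting the protected column has not removed the information. The sketch below uses made-up rows, with placeholder postcodes "P1" and "P2" and groups "A" and "B". The function names are invented for this example, and accuracy is measured on the same rows used to build the guess, which is good enough for illustration only.

```python
from collections import Counter, defaultdict


def baseline_accuracy(rows):
    """Accuracy of always guessing the overall most common group."""
    groups = Counter(group for _, group in rows)
    return groups.most_common(1)[0][1] / len(rows)


def proxy_accuracy(rows):
    """rows: list of (proxy_value, protected_group) pairs.
    Guess each row's group as the majority group for its proxy value and
    return the fraction guessed correctly."""
    by_proxy = defaultdict(Counter)
    for proxy, group in rows:
        by_proxy[proxy][group] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_proxy.values())
    return correct / len(rows)


# Made-up data: postcode P1 is mostly group A, postcode P2 is mostly group B.
rows = (
    [("P1", "A")] * 90 + [("P1", "B")] * 10
    + [("P2", "A")] * 20 + [("P2", "B")] * 80
)

if __name__ == "__main__":
    print("baseline:", baseline_accuracy(rows))
    print("using postcode:", proxy_accuracy(rows))
```

In this toy data the postcode alone recovers the group far more often than the baseline guess, so a model given postcode can still act on group membership even though the group field was never supplied.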
Big Data and Context FAQ
Does more data make an analysis more trustworthy?
Not by itself. Volume is not validity — a large dataset collected with bias is still biased, and at scale even trivial effects look ‘significant’. Big data needs the same design, sampling and ethics scrutiny as any other data, plus attention to effect size.
Why is statistical significance misleading with big data?
For any fixed nonzero effect, the P-value tends to shrink as n grows, so with millions of records almost any difference is significant, including effects far too small to matter. Report and judge the effect size rather than leaning on significance alone.
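A small calculation shows the effect of n. The sketch below holds the observed difference between two groups fixed at 0.01 standard deviations (a trivial effect) and changes only the sample size. It assumes a two-sample z-test with equal group sizes and a known standard deviation of 1, and the function name is invented for this example.

```python
import math


def two_sided_p(effect_in_sd: float, n_per_group: int) -> float:
    """Two-sided P-value for a two-sample z-test (known SD = 1, equal group
    sizes) when the observed difference in means equals effect_in_sd."""
    standard_error = math.sqrt(2.0 / n_per_group)
    z = effect_in_sd / standard_error
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(f"n per group = {n:>9,}  P-value = {two_sided_p(0.01, n):.3g}")
```

The effect size is identical in every row. Only n changes, yet the P-value moves from clearly non-significant at 100 per group to far below 0.05 at a million per group.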
How do I bring ethics into a data critique?
Apply a justice lens: ask who is represented and who is missing, who benefits and who is harmed, and who decided what to measure. Watch for algorithmic bias and proxy variables that encode protected attributes, and weigh privacy and consent.
Exam move
Close your notes sheet with the two big-data reflexes — volume ≠ validity and, at scale, significant ≠ important — plus the named pitfalls (algorithmic bias, proxy variables, spurious correlation, privacy). Keep the context questions (who/why/what/how/when) as your universal opener, since they work on any prompt including this chapter's. Rehearse layering an ethics/justice lens onto a critique (who is harmed, who is missing, who decided), because the Week 11–12 prompts reward exactly that. This chapter also makes excellent revision: it re-tests sampling, significance and bias in a fresh ‘big data’ disguise.