University of Sydney · S1 2026 · FACULTY OF SCIENCE

DATA1001 · Foundations Of Data Science

50% final exam · hurdle · 14 Chapters · 4-page Bible
Chapter 2 of 7 · DATA1001

Exploratory Data Analysis

Before any model or test, you describe the data: where is it centred, how spread out is it, and what shape does it have? Centre and spread each have two candidate summaries, and the choice between them turns on one idea: resistance, meaning robustness to outliers. The mean and SD are efficient but sensitive; the median and IQR are resistant. The golden rule the exam rewards: skewed or outlier-heavy data → report median and IQR; symmetric and clean → mean and SD. Shape is read from a histogram (which uses area, not height, to show proportion, and reveals skew and modality) and a boxplot (the five-number summary with a mechanical 1.5·IQR outlier rule). The single most useful fact a histogram encodes is the relationship between mean and median: the mean chases the longer tail while the median holds its ground, so you can usually diagnose skew from those two numbers without seeing the plot.

In this chapter

What this chapter covers

  • 01 Centre: mean (sensitive) vs median (resistant)
  • 02 Spread: SD (sensitive) vs IQR (resistant); the n-1 divisor
  • 03 Resistance: the idea that picks the summary
  • 04 The histogram: area = proportion, skew, modality; the mean chases the tail
  • 05 The boxplot: the five-number summary and the 1.5·IQR outlier rule

Worked example: EDA on 11 reaction times

Q [6 marks]. A workshop records the simple-reaction time (ms) of n = 11 students; one was distracted. Sorted: 182, 191, 197, 203, 209, 215, 221, 228, 236, 248, 410. (a) Find the mean and median and say what the gap tells you. (b) Find Q1, Q3 and the IQR, then apply the 1.5·IQR rule to check 410. (c) Which summary should you report?
  • +1 (a) Mean = 2540/11 ≈ 231 ms; median = the 6th of 11 values = 215 ms.
  • +1 (a) Read the gap: mean > median (231 > 215), so the data are right-skewed: the lone 410 ms drags the mean up while the median holds.
  • +1 (b) Q1 = 197, Q3 = 236, so IQR = 236 − 197 = 39 ms.
  • +2 (b) Upper fence = Q3 + 1.5×IQR = 236 + 58.5 = 294.5 ms; since 410 > 294.5, 410 is an outlier and the upper whisker stops at 248.
  • +1 (c) Skewed with an outlier → report the median (215 ms) and IQR (39 ms), which describe the typical student more honestly than the mean and SD.
Mean ≈ 231 ms > median 215 ms, so the data are right-skewed with one outlier (410 ms, by the 1.5·IQR rule); the resistant summary — median 215 ms and IQR 39 ms — describes the typical student more honestly than the outlier-inflated mean and SD.
Sia tip — The exam pays for the in-context sentence: "right-skewed, one outlier (410 ms, likely the distracted student); median and IQR describe the typical student more honestly than the mean and SD the outlier inflates." Shape → outlier → resistant summary → context = full marks.
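The arithmetic in the worked example can be checked with a short script. This is a minimal sketch in Python using only the standard library; the variable names are illustrative, and the choice of `statistics.quantiles` with its default "exclusive" method is an assumption made because it happens to reproduce the hand quartiles (median of each half) for these 11 values. Other software conventions for quartiles can give slightly different Q1 and Q3.

```python
import statistics

# Sorted reaction times (ms) from the worked example.
times = [182, 191, 197, 203, 209, 215, 221, 228, 236, 248, 410]

# (a) Centre: mean vs median.
mean = statistics.mean(times)      # 2540 / 11
median = statistics.median(times)  # 6th of the 11 sorted values

# (b) Quartiles and IQR. The default method="exclusive" matches the
# hand calculation for this data set (an assumption, not a general rule).
q1, _, q3 = statistics.quantiles(times, n=4)
iqr = q3 - q1

# 1.5*IQR fences.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier.
outliers = [t for t in times if t < lower_fence or t > upper_fence]

# The upper whisker ends at the largest value still inside the fence.
upper_whisker = max(t for t in times if t <= upper_fence)

print(f"mean = {mean:.1f} ms, median = {median} ms")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"fences = ({lower_fence}, {upper_fence})")
print(f"outliers = {outliers}, upper whisker = {upper_whisker}")
```

The script only confirms the numbers; the in-context sentence still has to be written by hand.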
Glossary

Key terms

Median
The middle value once the data are ordered (average the two middles when n is even). It cares only about rank, so a runaway outlier barely moves it — it is resistant. Report it for skewed or outlier-heavy data.
Interquartile range (IQR)
The width of the middle 50% of the data, Q3 − Q1. It discards the extreme quarters, so it is resistant. Pair it with the median; the range (max − min) uses only the two most extreme points and is almost never the right spread to report.
Resistance
Robustness of a summary to outliers. The median and IQR are resistant; the mean and SD, built from every value (and squared deviations), are pulled hard by extremes. Resistance is the single idea that decides which summary to report.
Skew
The direction of a distribution's longer tail, named for the tail, not the hump. A long right tail (right-skew) drags the mean above the median; a long left tail pulls it below. For symmetric data the mean and median are approximately equal.
The 1.5·IQR rule
The mechanical outlier fence: a value is an outlier if it falls below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. The boxplot whiskers reach the most extreme points still inside the fences; anything past a fence is plotted as an outlier dot.
FAQ

Exploratory Data Analysis FAQ

When should I report the mean and SD versus the median and IQR?

Match the summary to the shape. Clean, symmetric data → mean and SD (efficient). Skewed or outlier-heavy data → median and IQR (resistant). Keep the pairs together: mean with SD, median with IQR. The classic way to lose marks is reporting the mean income of a right-skewed population, where a few huge values inflate the mean above what a typical person experiences; the median is the honest centre.
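To see resistance in numbers, the sketch below compares both summary pairs on two data sets that differ in a single value. The `clean` list is hypothetical: it is the reaction-time data with the 410 ms value replaced by an invented 251 ms. The helper name `summaries` is also illustrative.

```python
import statistics

def summaries(data):
    """Return (mean, SD, median, IQR) for a list of numbers."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    return (
        statistics.mean(data),
        statistics.stdev(data),   # sample SD (n-1 divisor)
        statistics.median(data),
        q3 - q1,
    )

# Hypothetical data: identical except for the largest value.
clean = [182, 191, 197, 203, 209, 215, 221, 228, 236, 248, 251]
with_outlier = clean[:-1] + [410]

for label, data in [("clean", clean), ("with outlier", with_outlier)]:
    mean, sd, median, iqr = summaries(data)
    print(f"{label:>12}: mean={mean:.1f} SD={sd:.1f} median={median} IQR={iqr}")
```

In this example the median and IQR are identical across the two lists, while the mean and SD both rise once the extreme value is present.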

How do I diagnose skew without seeing the plot?

Compare the mean and the median. The mean chases the longer tail while the median holds its ground, so mean − median > 0 means a right tail (right-skew) and mean − median < 0 means a left tail (left-skew); roughly equal means symmetric. That two-number check is faster than drawing the histogram and is exactly what the exam rewards.
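The two-number check can be written as a small function. This is a sketch under one loud assumption: the source says only "roughly equal", so the `tolerance` parameter (gaps below 5% of the SD count as symmetric) is an invented cut-off for illustration, not a rule from the course. The function name `diagnose_skew` is likewise hypothetical.

```python
import statistics

def diagnose_skew(data, tolerance=0.05):
    """Classify skew from the sign of mean - median.

    `tolerance` is an assumed cut-off: a gap smaller than this fraction
    of the sample SD is treated as "roughly symmetric".
    """
    gap = statistics.mean(data) - statistics.median(data)
    sd = statistics.stdev(data)
    if sd == 0 or abs(gap) <= tolerance * sd:
        return "roughly symmetric"
    # Positive gap: the mean has chased a right tail; negative: a left tail.
    return "right-skewed" if gap > 0 else "left-skewed"

times = [182, 191, 197, 203, 209, 215, 221, 228, 236, 248, 410]
reflected = [600 - t for t in times]  # mirror image, so the tail points left

print(diagnose_skew(times))
print(diagnose_skew(reflected))
print(diagnose_skew([1, 2, 3, 4, 5]))
```

The check is a diagnostic shortcut; where a histogram is available, it remains the direct way to confirm the shape.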

What's the difference between a histogram and a bar chart?

A histogram is for quantitative data: bars touch, the axis is a number line, and it is area (not height) that shows the proportion in an interval — important when bins are unequal width. A bar chart is for categorical data: bars have gaps and the order is arbitrary. Swapping them is a common error, as is reading frequency off the height of an unequal-width histogram.
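The "area, not height" point is easiest to see with unequal bins. The sketch below computes density-scale bar heights so that width × height equals the proportion in each bin. The bin edges are hypothetical, chosen only to make the widths unequal, and `density_heights` is an illustrative helper, not a library function.

```python
def density_heights(data, edges):
    """Return (left, right, proportion, height) for each bin.

    Bins are [left, right), except the last, which also includes its
    right edge. Height is proportion / width, so area = proportion.
    Assumes `edges` is strictly increasing and covers all the data.
    """
    n = len(data)
    rows = []
    for left, right in zip(edges, edges[1:]):
        is_last = right == edges[-1]
        count = sum(
            1 for x in data
            if left <= x < right or (is_last and x == right)
        )
        proportion = count / n
        rows.append((left, right, proportion, proportion / (right - left)))
    return rows

times = [182, 191, 197, 203, 209, 215, 221, 228, 236, 248, 410]
edges = [180, 200, 220, 250, 420]  # hypothetical unequal-width bins

for left, right, proportion, height in density_heights(times, edges):
    print(f"{left}-{right}: proportion={proportion:.3f}, height={height:.5f}")
```

With these bins, the 220-250 interval holds the most values yet has a shorter bar than 180-200, because its width is larger. Reading frequency off the height would get the comparison backwards.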

Why divide by n-1 for the sample SD?

The n-1 divisor (Bessel's correction) gives an unbiased estimate of the population variance from a sample; R's sd() and var() use it automatically. Two classic slips: using the n divisor for a sample SD, and reporting SD for skewed data where the IQR is the honest spread. Note too that a boxplot hides modality — a bimodal distribution can look ordinary — so always pair it with a histogram before declaring the shape.
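The two divisors can be compared directly. The sketch below uses Python rather than R, as an assumption of convenience: in Python's standard library, `statistics.stdev` uses the n-1 divisor (the same convention the text attributes to R's sd()), while `statistics.pstdev` uses n.

```python
import math
import statistics

times = [182, 191, 197, 203, 209, 215, 221, 228, 236, 248, 410]

n = len(times)
mean = sum(times) / n
# Sum of squared deviations from the mean.
ss = sum((x - mean) ** 2 for x in times)

sd_sample = math.sqrt(ss / (n - 1))  # Bessel's correction: divide by n-1
sd_population = math.sqrt(ss / n)    # n divisor: too small for a sample

print(f"n-1 divisor: {sd_sample:.2f} ms  (statistics.stdev: {statistics.stdev(times):.2f})")
print(f"n divisor:   {sd_population:.2f} ms  (statistics.pstdev: {statistics.pstdev(times):.2f})")
```

The n-1 version is always the larger of the two, and the gap shrinks as n grows.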

Study strategy

Exam move

Run a fixed pipeline on every EDA prompt: shape → outliers → summary → sentence. Read the shape from the histogram (skew, modality) or diagnose it from mean versus median; flag outliers with the 1.5·IQR rule; if the data are skewed or outlier-heavy, quote the resistant pair (median and IQR) and say so; then write the one-sentence interpretation in the context of the data. Keep the mean-SD and median-IQR pairs married, remember the n-1 divisor, and never report the mean of skewed data without flagging it. The arithmetic is cheap; the marks live in choosing the right summary and the in-context sentence.
