DATA1001 · Foundations Of Data Science
Exploratory Data Analysis
Before any model or test, you describe the data: where is it centred, how spread out is it, and what shape is it? Centre and spread each have two candidate summaries, and the choice between them turns on one idea: resistance, meaning robustness to outliers. The mean and SD are efficient but sensitive; the median and IQR are resistant. The golden rule the exam rewards: skewed or outlier-heavy data → report median and IQR; symmetric and clean → mean and SD. Shape is read from a histogram (which uses area, not height, to show proportion, and reveals skew and modality) and a boxplot (the five-number summary with a mechanical 1.5·IQR outlier rule). The single most useful fact a histogram encodes is the relationship between mean and median: the mean chases the longer tail while the median holds its ground, so you can usually diagnose skew from two numbers without ever seeing the plot.
What this chapter covers
- 01 Centre: mean (sensitive) vs median (resistant)
- 02 Spread: SD (sensitive) vs IQR (resistant); the n − 1 divisor
- 03 Resistance: the idea that picks the summary
- 04 The histogram: area = proportion, skew, modality; the mean chases the tail
- 05 The boxplot: the five-number summary and the 1.5·IQR outlier rule
Worked example: EDA on 11 reaction times
- +1 (a) Mean = 2540/11 ≈ 231 ms; median = the 6th of the 11 ordered values = 215 ms.
- +1 (a) Read the gap: mean > median (231 > 215), so the data are right-skewed: the lone 410 ms drags the mean up while the median holds.
- +1 (b) Q1 = 197, Q3 = 236, so IQR = 236 − 197 = 39 ms.
- +2 (b) Upper fence = Q3 + 1.5×IQR = 236 + 1.5×39 = 236 + 58.5 = 294.5 ms; since 410 > 294.5, 410 is an outlier and the upper whisker stops at 248, the largest value still inside the fence.
- +1 (c) Skewed with an outlier → report the median (215 ms) and IQR (39 ms), which describe the typical student more honestly than the mean and SD.
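The arithmetic in the worked example can be checked with a short Python sketch. The eleven values below are hypothetical: they were chosen only so that they reproduce the figures quoted above (sum 2540, median 215, Q1 197, Q3 236, whisker end 248, outlier 410) and are not the original data. The sketch also assumes the quartiles are read off as the 3rd and 9th ordered values, which is what `statistics.quantiles` returns with `method="exclusive"` when n = 11; other quartile conventions can give slightly different answers.

```python
from statistics import mean, median, quantiles

# HYPOTHETICAL reaction times (ms), constructed to match the quoted summaries.
times_ms = [180, 190, 197, 202, 208, 215, 222, 232, 236, 248, 410]

# (a) Centre: the mean uses every value, the median only the middle rank.
centre_mean = mean(times_ms)      # 2540 / 11, about 230.9
centre_median = median(times_ms)  # 6th of the 11 ordered values

# (b) Spread and fences. With method="exclusive" and n = 11, Q1 and Q3
# fall exactly on the 3rd and 9th ordered values.
q1, _, q3 = quantiles(times_ms, n=4, method="exclusive")
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Outliers lie outside the fences; whiskers stop at the most extreme
# values that are still inside them.
outliers = [x for x in times_ms if x < lower_fence or x > upper_fence]
inside = [x for x in times_ms if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))

print(f"mean = {centre_mean:.1f}, median = {centre_median}")
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")
print(f"fences = ({lower_fence}, {upper_fence}), outliers = {outliers}")
print(f"whiskers = {whiskers}")
```

Because the mean (about 231) sits above the median (215) and one value lies past the upper fence, the resistant pair is the one to report, as in part (c).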
Key terms
- Median
- The middle value once the data are ordered (average the two middles when n is even). It cares only about rank, so a runaway outlier barely moves it — it is resistant. Report it for skewed or outlier-heavy data.
- Interquartile range (IQR)
- The width of the middle 50% of the data, Q3 − Q1. It discards the extreme quarters, so it is resistant. Pair it with the median; the range (max − min) uses only the two most extreme points and is almost never the right spread to report.
- Resistance
- Robustness of a summary to outliers. The median and IQR are resistant; the mean and SD, built from every value (and squared deviations), are pulled hard by extremes. Resistance is the single idea that decides which summary to report.
- Skew
- The direction of a distribution's longer tail, named for the tail, not the hump. A long right tail (right-skew) drags the mean above the median; a long left tail pulls it below. Perfectly symmetric data have mean equal to median; in a real sample the two are only approximately equal.
- The 1.5·IQR rule
- The mechanical outlier fence: a value is an outlier if it falls below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. The boxplot whiskers reach the most extreme points still inside the fences; anything past a fence is plotted as an outlier dot.
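Resistance is easiest to see by changing one value and watching which summaries move. The sketch below is a minimal illustration in Python on hypothetical data: it replaces the largest value of a clean sample with an extreme one and recomputes all four summaries. The helper name `summaries` and the `method="exclusive"` quartile convention are assumptions made for the example.

```python
from statistics import mean, median, stdev, quantiles

def summaries(values):
    """Return the sensitive pair (mean, SD) and the resistant pair (median, IQR)."""
    q1, _, q3 = quantiles(values, n=4, method="exclusive")
    return {
        "mean": mean(values),
        "sd": stdev(values),      # sample SD, n - 1 divisor
        "median": median(values),
        "iqr": q3 - q1,
    }

# HYPOTHETICAL clean sample, then the same sample with its largest
# value replaced by a runaway outlier.
clean = [180, 190, 197, 202, 208, 215, 222, 232, 236, 248, 255]
contaminated = clean[:-1] + [410]

before = summaries(clean)
after = summaries(contaminated)

for name in ("mean", "sd", "median", "iqr"):
    print(f"{name:>6}: {before[name]:7.1f} -> {after[name]:7.1f}")
```

In this example the median and IQR are unchanged, because the ranks of the middle values are untouched, while the mean and SD both increase.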
Exploratory Data Analysis FAQ
When should I report the mean and SD versus the median and IQR?
Match the summary to the shape. Clean, symmetric data → mean and SD (efficient). Skewed or outlier-heavy data → median and IQR (resistant). Keep the pairs together: mean with SD, median with IQR. The classic way to lose marks is reporting the mean income of a right-skewed population, where a few huge values inflate the mean above what a typical person experiences; the median is the honest centre.
How do I diagnose skew without seeing the plot?
Compare the mean and the median. The mean chases the longer tail while the median holds its ground, so mean − median > 0 points to a right tail (right-skew) and mean − median < 0 points to a left tail (left-skew); roughly equal values suggest symmetry. Treat this as a rule of thumb for typical single-peaked data rather than a guarantee. The two-number check is faster than drawing the histogram and is exactly what the exam rewards.
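The two-number check can be sketched as a small Python function. The text does not say how close "roughly equal" has to be, so the `tolerance` parameter (the gap as a fraction of the sample SD) and its default of 0.1 are assumptions made purely for illustration, as are the function name and the example data.

```python
from statistics import mean, median, stdev

def diagnose_skew(values, tolerance=0.1):
    """Classify skew from the sign of mean - median.

    `tolerance` is a HYPOTHETICAL cut-off: gaps smaller than
    tolerance * SD are treated as "roughly symmetric".
    """
    gap = mean(values) - median(values)
    if abs(gap) <= tolerance * stdev(values):
        return "roughly symmetric"
    # The mean is pulled toward the longer tail.
    return "right-skewed" if gap > 0 else "left-skewed"

# HYPOTHETICAL examples
print(diagnose_skew([180, 190, 197, 202, 208, 215, 222, 232, 236, 248, 410]))
print(diagnose_skew([1, 2, 3, 4, 5]))
print(diagnose_skew([1, 8, 9, 9, 10, 10]))
```

The first sample has a long right tail, the second is symmetric, and the third has a single low value pulling the mean below the median.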
What's the difference between a histogram and a bar chart?
A histogram is for quantitative data: bars touch, the axis is a number line, and it is area (not height) that shows the proportion in an interval — important when bins are unequal width. A bar chart is for categorical data: bars have gaps and the order is arbitrary. Swapping them is a common error, as is reading frequency off the height of an unequal-width histogram.
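To see why area, not height, carries the proportion, it helps to compute the bar heights for unequal-width bins by hand. The Python sketch below uses the standard density scaling, height = proportion / bin width, so that each bar's area equals the proportion of data in its interval. The data, the bin edges, and the function name are hypothetical.

```python
def histogram_density(values, edges):
    """Return one (left, right, count, proportion, density) tuple per bin.

    Bins are [left, right), except the last, which also includes its
    right edge. Density is the bar height: proportion / width.
    """
    n = len(values)
    rows = []
    for i, (left, right) in enumerate(zip(edges, edges[1:])):
        is_last = i == len(edges) - 2
        count = sum(
            1 for x in values
            if left <= x < right or (is_last and x == right)
        )
        proportion = count / n
        density = proportion / (right - left)
        rows.append((left, right, count, proportion, density))
    return rows

# HYPOTHETICAL data and unequal-width bins (50, 50 and 200 ms wide).
times_ms = [180, 190, 197, 202, 208, 215, 222, 232, 236, 248, 410]
edges = [150, 200, 250, 450]

rows = histogram_density(times_ms, edges)
for left, right, count, proportion, density in rows:
    print(f"[{left}, {right}): count={count}, "
          f"proportion={proportion:.3f}, height={density:.5f}")

# The areas (height * width) add up to 1, i.e. 100% of the data.
total_area = sum(d * (r - l) for l, r, _, _, d in rows)
```

Here the first bin holds three times as many values as the last, yet its bar is twelve times taller, because the last bin is four times wider. Reading frequency off the height alone would overstate that difference.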
Why divide by n-1 for the sample SD?
The n-1 divisor (Bessel's correction) gives an unbiased estimate of the population variance from a sample; R's sd() and var() use it automatically. Two classic slips: using the n divisor for a sample SD, and reporting SD for skewed data where the IQR is the honest spread. Note too that a boxplot hides modality — a bimodal distribution can look ordinary — so always pair it with a histogram before declaring the shape.
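The effect of the divisor can be checked directly. The Python sketch below computes the SD from the sum of squared deviations with both divisors and compares the results with the standard library: `statistics.stdev` uses n − 1 (the same convention the text attributes to R's `sd()`), while `statistics.pstdev` uses n. The sample values and the helper name are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev, pstdev

def sd_with_divisor(values, divisor):
    """Square root of (sum of squared deviations from the mean) / divisor."""
    m = mean(values)
    sum_sq = sum((x - m) ** 2 for x in values)
    return sqrt(sum_sq / divisor)

# HYPOTHETICAL sample: mean 5, sum of squared deviations 32.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(sample)

sample_sd = sd_with_divisor(sample, n - 1)   # Bessel's correction
population_sd = sd_with_divisor(sample, n)   # n divisor

print(f"n - 1 divisor: {sample_sd:.3f}  (statistics.stdev:  {stdev(sample):.3f})")
print(f"n divisor:     {population_sd:.3f}  (statistics.pstdev: {pstdev(sample):.3f})")
```

The n − 1 version is always the larger of the two, and the gap shrinks as n grows, so the slip matters most for small samples.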
Exam move
Run a fixed pipeline on every EDA prompt: shape → outliers → summary → sentence. Read the shape from the histogram (skew, modality) or diagnose it from mean versus median; flag outliers with the 1.5·IQR rule; if the data are skewed or outlier-heavy, quote the resistant pair (median and IQR) and say so; then write the one-sentence interpretation in the context of the data. Keep the mean-SD and median-IQR pairs married, remember the n-1 divisor, and never report the mean of skewed data without flagging it. The arithmetic is cheap; the marks live in choosing the right summary and the in-context sentence.
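The pipeline can be sketched as one Python function that walks the four steps in order. This is an illustration under stated assumptions, not a marking scheme: the function name, the `skew_tolerance` cut-off for "roughly symmetric", the `method="exclusive"` quartile convention, and the example data are all hypothetical, and the generated sentence is only a template that still needs the real context of the data.

```python
from statistics import mean, median, stdev, quantiles

def eda_summary(values, unit="", skew_tolerance=0.1):
    """shape -> outliers -> summary -> sentence, as a single pass."""
    # 1. Shape: diagnose skew from mean versus median.
    m, med, sd = mean(values), median(values), stdev(values)
    gap = m - med
    if abs(gap) <= skew_tolerance * sd:   # HYPOTHETICAL cut-off
        shape = "roughly symmetric"
    else:
        shape = "right-skewed" if gap > 0 else "left-skewed"

    # 2. Outliers: the 1.5 * IQR fences.
    q1, _, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    outliers = [x for x in values
                if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

    # 3. Summary: resistant pair if skewed or outlier-heavy,
    #    otherwise mean and SD. The pairs stay together.
    if shape != "roughly symmetric" or outliers:
        centre, spread = f"median {med:g}{unit}", f"IQR {iqr:g}{unit}"
    else:
        centre, spread = f"mean {m:.1f}{unit}", f"SD {sd:.1f}{unit}"

    # 4. Sentence: a template to be rewritten in context.
    return (f"The data are {shape} with {len(outliers)} outlier(s) "
            f"by the 1.5*IQR rule; report {centre} and {spread}.")

# HYPOTHETICAL reaction times (ms)
times_ms = [180, 190, 197, 202, 208, 215, 222, 232, 236, 248, 410]
sentence = eda_summary(times_ms, unit=" ms")
print(sentence)
```

The code only automates the choice of summary; the final in-context sentence still has to say what the numbers mean for the people or things being measured.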