University of Sydney · S1 2026 · FACULTY OF BUSINESS & ECONOMICS

BUSS6002 · Data Science In Business

- one subject, every graph, every model, every mark
50% final exam · hurdle · 14 Chapters · 8-page Bible
Our own words - no uploaded lecturer files
Built to mirror S1 2026 · updated this semester
Chapter 3 of 11 · BUSS6002

Knowledge Discovery & Exploratory Data Analysis

This Week 3 chapter frames the whole knowledge-discovery process (KDDA) with two named process models — CRISP-DM, whose six phases you must recite in order, and the Snail Shell model, defined by its continuous-iteration spiral. Inside those phases sit exploratory data analysis (EDA) and data quality, which is where most of the short-answer marks live.

EDA is taught as an iterative loop on two axes — graphical vs non-graphical and univariate vs multivariate — anchored by the four sample moments (location, scale, symmetry, tails), their robust quantile analogues, the 1.5×IQR box-plot rule, sample correlation, and the three missing-data mechanisms (MCAR, MAR, MNAR). Examinable in both the mid-semester and the final.

In this chapter

What this chapter covers

  • 1. KDDA: discovering valid, non-trivial, previously unknown, actionable patterns
  • 2. CRISP-DM: the six ordered phases from Business Understanding to Deployment
  • 3. Snail Shell model: a KDDA process model built on continuous iteration
  • 4. EDA: the summarise → hypothesise → model loop and its two axes
  • 5. Four sample moments: location, scale (sd), symmetry (skew), tails (kurtosis)
  • 6. Quantiles & the box-plot: five-number summary and 1.5×IQR fences
  • 7. Sample correlation: unit-free linear association in [−1, 1]; cor = 0 ≠ independent
  • 8. Data quality: missing-data mechanisms (MCAR/MAR/MNAR) and outliers
Worked example · free

Five-number summary and 1.5×IQR fences

Q [5 marks]. Eleven checkout totals (A$, already sorted) are 28, 35, 42, 46, 58, 60, 75, 86, 90, 150, 210. Build the five-number summary and determine which point(s) a box-plot would draw as a circle.
  • 1 mark · Median (Q2): with n = 11 the median is the 6th value, so Q2 = 60.
  • 1 mark · Quartiles by the position rule (n+1)/4: Q1 is the 3rd value = 42 and Q3 is the 9th value = 90. Five-number summary = (28, 42, 60, 90, 210).
  • 1 mark · Spread: IQR = Q3 − Q1 = 90 − 42 = 48, so 1.5 × IQR = 72.
  • 1 mark · Fences: lower = Q1 − 72 = −30; upper = Q3 + 72 = 162.
  • 1 mark · Circle the extremes: only values outside (−30, 162) are circled. 210 > 162 so 210 is circled; 150 < 162 stays inside the upper whisker.
Five-number summary = (28, 42, 60, 90, 210); IQR = 48; fences (−30, 162); the single value 210 is drawn as a circle.
Sia tip — A circled point is flagged as extreme by the 1.5×IQR rule — it is NOT automatically an outlier or data error. Investigate the context before deleting; under a Normal about 0.7% of valid points get circled anyway.
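
The hand calculation above can be checked with a short Python sketch. It assumes the (n+1)/4 position rule from the worked example, with linear interpolation when a position is fractional (the interpolation step is our assumption; it is not needed for n = 11). The helper name value_at_position is made up for illustration, and library defaults may use a different quantile rule and return slightly different quartiles.

```python
# Checkout totals from the worked example (A$, already sorted).
totals = [28, 35, 42, 46, 58, 60, 75, 86, 90, 150, 210]


def value_at_position(sorted_values, position):
    """Return the value at a 1-based position.

    Fractional positions are interpolated linearly between neighbours
    (an assumption; the worked example only needs whole positions).
    """
    whole = int(position)
    fraction = position - whole
    lower = sorted_values[whole - 1]
    if fraction == 0:
        return lower
    upper = sorted_values[whole]
    return lower + fraction * (upper - lower)


n = len(totals)
q1 = value_at_position(totals, (n + 1) / 4)        # 3rd value
q2 = value_at_position(totals, (n + 1) / 2)        # 6th value (median)
q3 = value_at_position(totals, 3 * (n + 1) / 4)    # 9th value

five_number_summary = (min(totals), q1, q2, q3, max(totals))

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are the ones a box-plot draws as circles.
circled = [v for v in totals if v < lower_fence or v > upper_fence]

# Whiskers stop at the most extreme values still inside the fences.
inside = [v for v in totals if lower_fence <= v <= upper_fence]
whiskers = (min(inside), max(inside))

print("Five-number summary:", five_number_summary)
print("IQR:", iqr, "fences:", (lower_fence, upper_fence))
print("Circled:", circled, "whiskers end at:", whiskers)
```

Note that the upper whisker ends at 150, the largest value inside the fence, not at the fence value 162 itself.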
Glossary

Key terms

KDDA
Knowledge Discovery via Data Analytics — the whole process of identifying patterns that are valid, non-trivial, previously unknown and interesting/actionable in (often large) data.
CRISP-DM
The canonical knowledge-discovery methodology with six ordered phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment.
Snail Shell model
An alternative KDDA process model (Li, Thomas & Osei-Bryson, 2016) whose signature is its emphasis on continuous iteration and the inter-relationships between the discovery phases — pictured as an inward spiral.
EDA
Exploratory Data Analysis — discovering patterns, spotting anomalies and testing assumptions with summary statistics and graphics, organised on two axes: graphical/non-graphical × univariate/multivariate.
Sample moments
Four numerical summaries of a variable's shape: 1st = location (mean), 2nd = scale/spread (variance; report the sd), 3rd = symmetry (skewness), 4th = tail behaviour (kurtosis; Normal = 3).
1.5×IQR rule
Box-plot fences are [Q1 − 1.5·IQR, Q3 + 1.5·IQR] where IQR = Q3 − Q1; whiskers stop at the most extreme value inside the fences and any point beyond is drawn as a circle.
Sample correlation
cor(x,y) = cov(x,y) / (sd(x)·sd(y)), a unit-free measure of LINEAR association in [−1, 1]; symmetric in x and y. cor = 0 does not imply independence.
MCAR / MAR / MNAR
The three missing-data mechanisms: MCAR (missingness independent of all variables, observed or not), MAR (depends only on observed variables), MNAR (depends on the unobserved missing value itself).
FAQ

Knowledge Discovery & Exploratory Data Analysis FAQ

Do I have to memorise the CRISP-DM phases in order?

Yes. The exam asks you to sequence the six phases or to place a task in the right one. Learn them in order — Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment — and one job each. Exploring data and checking metadata is Data Understanding; cleaning and building derived features is Data Preparation.

What is the difference between the four moments and their quantile analogues?

The moments (mean, variance/sd, skewness, kurtosis) use every value, so a single extreme can swing them. The unit pairs each one with a robust order-statistic analogue: location → median, scale → half-IQR, symmetry → Galton skewness, tails → Moors kurtosis. The analogues resist outliers, which matters when the data is dirty.
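
As a rough illustration of that pairing, the sketch below computes both sets of summaries for the eleven checkout totals from the worked example, then swaps the largest value for a wild one. Several choices here are our assumptions and the unit's notation may differ: skewness and kurtosis are taken as ratios of central moments divided by n, Galton skewness is taken as (Q1 + Q3 − 2·Q2)/(Q3 − Q1), Moors kurtosis as the octile ratio ((O7 − O5) + (O3 − O1))/(O6 − O2), and quantiles use Python's default "exclusive" method, which works on (n+1) positions. The function names are made up for illustration.

```python
from statistics import quantiles, stdev


def moment_summaries(values):
    """Mean, sd, skewness and kurtosis (central moments divided by n)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "sd": stdev(values),              # sample sd, divisor n - 1
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,         # a Normal gives 3 on this scale
    }


def quantile_summaries(values):
    """Robust analogues built from quartiles and octiles."""
    q1, q2, q3 = quantiles(values, n=4)
    o1, o2, o3, o4, o5, o6, o7 = quantiles(values, n=8)
    return {
        "median": q2,
        "half_iqr": (q3 - q1) / 2,
        "galton_skewness": (q1 + q3 - 2 * q2) / (q3 - q1),
        "moors_kurtosis": ((o7 - o5) + (o3 - o1)) / (o6 - o2),
    }


totals = [28, 35, 42, 46, 58, 60, 75, 86, 90, 150, 210]
contaminated = totals[:-1] + [2100]       # one hypothetical data-entry error

for label, data in [("clean", totals), ("contaminated", contaminated)]:
    print(label, moment_summaries(data))
    print(label, quantile_summaries(data))
```

The mean and sd jump once 210 becomes 2100, while the median and half-IQR do not move at all, which is the sense in which the analogues resist outliers.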

Does a correlation of zero mean the two variables are independent?

No. Correlation measures only LINEAR association. A perfect quadratic relationship (y = x² over a symmetric range of x) has a sample correlation near zero even though the variables are perfectly dependent. Always look at the scatter plot before concluding anything.
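
A minimal sketch of that quadratic trap, using the glossary formula cor(x,y) = cov(x,y) / (sd(x)·sd(y)). The seven x values are made up for illustration and chosen to be symmetric around zero; the function name is ours.

```python
from statistics import mean, stdev


def sample_correlation(x, y):
    """cor(x, y) = cov(x, y) / (sd(x) * sd(y)), with divisor n - 1."""
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))


x = [-3, -2, -1, 0, 1, 2, 3]              # symmetric range of x
y_quadratic = [v ** 2 for v in x]         # y is fully determined by x
y_linear = [2 * v + 1 for v in x]         # a straight line, for contrast

r_quadratic = sample_correlation(x, y_quadratic)
r_linear = sample_correlation(x, y_linear)

print("quadratic:", round(r_quadratic, 6))
print("linear:   ", round(r_linear, 6))
```

The quadratic case gives a correlation of zero even though knowing x tells you y exactly, while the straight line gives 1. The positive and negative halves of the parabola cancel in the covariance sum.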

How do I tell MAR from MNAR?

Ask what drives the missingness. If it depends on a value you can see (another column, e.g. older customers skip the income field) it is MAR. If it depends on the missing value itself, which you cannot observe (high earners hide income because it is high), it is MNAR — the dangerous case where naive deletion or mean-imputation biases the result.
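
The difference can be seen in a small simulation. Everything below is synthetic and assumed for illustration only: the income model, the drop probabilities, the age cut-off of 50 and the income cut-off of 70,000 are all made-up numbers, not values from the unit.

```python
import random

random.seed(6002)  # arbitrary seed so the sketch is reproducible

# Synthetic customers: income rises with age, plus noise.
customers = []
for _ in range(20_000):
    age = random.uniform(20, 70)
    income = 20_000 + 1_000 * age + random.gauss(0, 10_000)
    customers.append((age, income))


def mean_income(rows):
    return sum(income for _, income in rows) / len(rows)


def keep(rows, drop_probability):
    """Keep each row unless a random draw falls below its drop probability."""
    return [row for row in rows if random.random() >= drop_probability(row)]


def older(rows):
    return [row for row in rows if row[0] > 50]


# MCAR: every income is equally likely to be missing.
mcar = keep(customers, lambda row: 0.3)
# MAR: missingness depends on age, which we can always see.
mar = keep(customers, lambda row: 0.6 if row[0] > 50 else 0.1)
# MNAR: missingness depends on the income value that goes missing.
mnar = keep(customers, lambda row: 0.6 if row[1] > 70_000 else 0.1)

true_mean = mean_income(customers)
true_older_mean = mean_income(older(customers))

print("true mean:", round(true_mean))
for label, rows in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(label, "overall:", round(mean_income(rows)),
          "| age > 50 only:", round(mean_income(older(rows))))
print("true mean, age > 50 only:", round(true_older_mean))
```

In this setup the MCAR mean stays close to the truth, and both the MAR and MNAR overall means are pulled down. The key contrast is inside the age > 50 group: the MAR mean matches the true group mean, because conditioning on the observed driver removes the bias, while the MNAR mean is still too low.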

Is a point that the box-plot circles automatically an outlier I should delete?

No. A circle just means the point lies beyond a 1.5×IQR fence — it is extreme, not necessarily wrong. Under a Normal about 0.7% of perfectly valid points get circled. Investigate context first; delete only if you diagnose it as a genuine error, and consider reporting results with and without an influential point.
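
The 0.7% figure can be reproduced directly, assuming an exactly Normal population and fences built from the population quartiles (real samples will vary around this).

```python
from statistics import NormalDist

z = NormalDist()                  # standard Normal: mean 0, sd 1

q1 = z.inv_cdf(0.25)              # lower quartile
q3 = z.inv_cdf(0.75)              # upper quartile
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Probability mass beyond either fence is the share of valid points circled.
share_circled = z.cdf(lower_fence) + (1 - z.cdf(upper_fence))

print(f"fences: ({lower_fence:.3f}, {upper_fence:.3f})")
print(f"share circled: {share_circled:.2%}")
```

The fences land at roughly ±2.7 standard deviations, so about 7 in every 1,000 perfectly valid Normal observations fall outside them.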

Study strategy

Exam move

Treat this chapter as two halves that reward different study. The conceptual half — CRISP-DM phase order, the Snail Shell's iteration signature, and MCAR/MAR/MNAR — is pure recall, so build a one-line flashcard for each and practise placing tasks in the right CRISP-DM phase. The quantitative half — the four-moment map, the five-number summary and 1.5×IQR fences, and sample correlation — rewards a couple of fully worked numerical drills by hand so the steps are automatic under time pressure. Finally, rehearse the three signature traps the examiners reuse every year: cor = 0 is not independence, a circled box-plot point is not automatically an error, and MAR turns on observed (not unobserved) variables. Show all working and use the unit's own notation, since notation discipline is graded.
