BUSS6002 · Data Science In Business
Knowledge Discovery & Exploratory Data Analysis
This Week 3 chapter frames the whole knowledge-discovery process (KDDA) with two named process models — CRISP-DM, whose six phases you must recite in order, and the Snail Shell model, defined by its continuous-iteration spiral. Inside those phases sit exploratory data analysis (EDA) and data quality, which is where most of the short-answer marks live.
EDA is taught as an iterative loop on two axes — graphical vs non-graphical and univariate vs multivariate — anchored by the four sample moments (location, scale, symmetry, tails), their robust quantile analogues, the 1.5×IQR box-plot rule, sample correlation, and the three missing-data mechanisms (MCAR, MAR, MNAR). Examinable in both the mid-semester and the final.
What this chapter covers
- 1. KDDA: discovering valid, non-trivial, previously unknown, actionable patterns
- 2. CRISP-DM: the six ordered phases from Business Understanding to Deployment
- 3. Snail Shell model: a KDDA process model built on continuous iteration
- 4. EDA: the summarise → hypothesise → model loop and its two axes
- 5. Four sample moments: location, scale (sd), symmetry (skew), tails (kurtosis)
- 6. Quantiles & the box-plot: five-number summary and 1.5×IQR fences
- 7. Sample correlation: unit-free linear association in [−1, 1]; cor = 0 does not imply independence
- 8. Data quality: missing-data mechanisms (MCAR/MAR/MNAR) and outliers
Five-number summary and 1.5×IQR fences
- (1 mark) Median (Q2): with n = 11 the median is the (n+1)/2 = 6th ordered value, so Q2 = 60.
- (1 mark) Quartiles by the position rule: Q1 sits at position (n+1)/4 = 3, so Q1 = 42; Q3 sits at position 3(n+1)/4 = 9, so Q3 = 90. Five-number summary = (28, 42, 60, 90, 210).
- (1 mark) Spread: IQR = Q3 − Q1 = 90 − 42 = 48, so 1.5 × IQR = 72.
- (1 mark) Fences: lower = Q1 − 72 = 42 − 72 = −30; upper = Q3 + 72 = 90 + 72 = 162.
- (1 mark) Circle the extremes: only values outside the fences (below −30 or above 162) are circled. 210 > 162, so 210 is circled; 150 < 162, so it is not circled and stays inside the upper whisker.
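The arithmetic in these steps can be checked with a short Python sketch. It uses only the figures quoted above (n = 11, Q1 = 42, Q3 = 90, and the two candidate points 150 and 210); the function names are illustrative and not part of the unit's notation.

```python
def quartile_positions(n):
    """1-based positions of Q1, Q2 and Q3 under the (n+1)/4 position rule."""
    return (n + 1) / 4, (n + 1) / 2, 3 * (n + 1) / 4


def iqr_fences(q1, q3, k=1.5):
    """Lower and upper box-plot fences for the given quartiles."""
    step = k * (q3 - q1)
    return q1 - step, q3 + step


def is_circled(value, q1, q3):
    """True if the value lies beyond a fence, so the box-plot draws it as a circle."""
    lower, upper = iqr_fences(q1, q3)
    return value < lower or value > upper


print(quartile_positions(11))   # (3.0, 6.0, 9.0): the 3rd, 6th and 9th ordered values
print(iqr_fences(42, 90))       # (-30.0, 162.0)
print(is_circled(210, 42, 90))  # True
print(is_circled(150, 42, 90))  # False
```

When the positions are not whole numbers, a convention for interpolating between neighbouring values is needed; this sketch does not cover that case.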
Key terms
- KDDA: Knowledge Discovery via Data Analytics, the whole process of identifying patterns that are valid, non-trivial, previously unknown and interesting/actionable in (often large) data.
- CRISP-DM: the canonical knowledge-discovery methodology with six ordered phases: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment.
- Snail Shell model: an alternative KDDA process model (Li, Thomas & Osei-Bryson, 2016) whose signature is its emphasis on continuous iteration and the inter-relationships between the discovery phases, pictured as an inward spiral.
- EDA: Exploratory Data Analysis, meaning discovering patterns, spotting anomalies and testing assumptions with summary statistics and graphics, organised on two axes: graphical/non-graphical × univariate/multivariate.
- Sample moments: four numerical summaries of a variable's shape: 1st = location (mean), 2nd = scale/spread (variance; report the sd), 3rd = symmetry (skewness), 4th = tail behaviour (kurtosis; Normal = 3).
- 1.5×IQR rule: box-plot fences are [Q1 − 1.5·IQR, Q3 + 1.5·IQR] where IQR = Q3 − Q1; whiskers stop at the most extreme value inside the fences and any point beyond is drawn as a circle.
- Sample correlation: cor(x,y) = cov(x,y) / (sd(x)·sd(y)), a unit-free measure of LINEAR association in [−1, 1]; symmetric in x and y. cor = 0 does not imply independence.
- MCAR / MAR / MNAR: the three missing-data mechanisms: MCAR (missingness independent of all variables, observed or unobserved), MAR (missingness depends only on observed variables), MNAR (missingness depends on the unobserved missing value itself).
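The four sample moments listed in the key terms can be computed directly from their definitions. The sketch below is one common convention and an assumption here, not necessarily the unit's own: skewness and kurtosis use central moments with a 1/n divisor, while the sd uses the usual n − 1 divisor. The function names and the five-value sample are hypothetical.

```python
import statistics


def central_moment(xs, k):
    """k-th central moment with a 1/n divisor (assumed convention)."""
    m = statistics.fmean(xs)
    return sum((x - m) ** k for x in xs) / len(xs)


def four_moments(xs):
    """Location, scale, symmetry and tail summaries of a numeric sample."""
    m2 = central_moment(xs, 2)
    return {
        "mean": statistics.fmean(xs),                    # 1st: location
        "sd": statistics.stdev(xs),                      # 2nd: scale (n - 1 divisor)
        "skewness": central_moment(xs, 3) / m2 ** 1.5,   # 3rd: symmetry
        "kurtosis": central_moment(xs, 4) / m2 ** 2,     # 4th: tails (Normal = 3)
    }


# Hypothetical symmetric sample, invented for illustration only.
print(four_moments([1, 2, 3, 4, 5]))
```

For this symmetric sample the skewness is zero and the kurtosis is about 1.7, below the Normal benchmark of 3, which indicates lighter tails.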
Knowledge Discovery & Exploratory Data Analysis FAQ
Do I have to memorise the CRISP-DM phases in order?
Yes. The exam asks you to sequence the six phases or to place a task in the right one. Learn them in order (Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment) together with one typical task for each. Exploring data and checking metadata belongs to Data Understanding; cleaning and building derived features belongs to Data Preparation.
What is the difference between the four moments and their quantile analogues?
The moments (mean, variance/sd, skewness, kurtosis) use every value, so a single extreme can swing them. The unit pairs each one with a robust order-statistic analogue: location → median, scale → half-IQR, symmetry → Galton skewness, tails → Moors kurtosis. The analogues resist outliers, which matters when the data is dirty.
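A minimal sketch of the four robust analogues, assuming the standard textbook forms: Galton skewness = (Q3 + Q1 − 2·Q2) / (Q3 − Q1) and Moors kurtosis = ((O7 − O5) + (O3 − O1)) / (O6 − O2), where O1 to O7 are the octiles. It relies on `statistics.quantiles`, whose default "exclusive" method follows an (n+1)-based position rule. The sample values are hypothetical.

```python
import statistics


def robust_summary(xs):
    """Quantile-based analogues of the four moments."""
    q1, q2, q3 = statistics.quantiles(xs, n=4)  # quartiles
    o = statistics.quantiles(xs, n=8)           # octiles O1..O7 stored as o[0]..o[6]
    return {
        "median": q2,                                    # location
        "half_iqr": (q3 - q1) / 2,                       # scale
        "galton_skew": (q3 + q1 - 2 * q2) / (q3 - q1),   # symmetry
        "moors_kurtosis": ((o[6] - o[4]) + (o[2] - o[0])) / (o[5] - o[1]),  # tails
    }


# Hypothetical sample 1..15, then the same sample with its largest value corrupted.
clean = list(range(1, 16))
dirty = clean[:-1] + [1500]

print(statistics.fmean(clean), statistics.fmean(dirty))  # the mean swings with one bad value
print(robust_summary(clean) == robust_summary(dirty))    # True: the analogues do not move
```

With the quartiles from the worked box-plot example (42, 60, 90), the Galton formula gives (90 + 42 − 120) / 48 = 0.25, a mild right skew.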
Does a correlation of zero mean the two variables are independent?
No. Correlation measures only LINEAR association. A perfect quadratic relationship (y = x² over a symmetric range of x) has a sample correlation near zero even though the variables are perfectly dependent. Always look at the scatter plot before concluding anything.
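The quadratic counter-example is easy to verify numerically. The sketch below implements cor(x,y) = cov(x,y) / (sd(x)·sd(y)) from scratch; the function name `cor` and the seven x-values are illustrative choices.

```python
import statistics


def cor(xs, ys):
    """Sample correlation: cov(x, y) / (sd(x) * sd(y)), using n - 1 divisors."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (statistics.stdev(xs) * statistics.stdev(ys))


x = [-3, -2, -1, 0, 1, 2, 3]      # symmetric range around zero
y_quadratic = [v ** 2 for v in x]  # perfectly dependent on x, but not linearly
y_linear = [2 * v + 1 for v in x]  # perfectly linear in x

print(cor(x, y_quadratic))  # zero: no linear association despite full dependence
print(cor(x, y_linear))     # 1 up to floating-point rounding
```

The positive and negative halves of the parabola cancel in the covariance sum, which is why the scatter plot, not the coefficient, reveals the relationship.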
How do I tell MAR from MNAR?
Ask what drives the missingness. If it depends on a value you can see (another column, e.g. older customers skip the income field) it is MAR. If it depends on the missing value itself, which you cannot observe (high earners hide income because it is high), it is MNAR — the dangerous case where naive deletion or mean-imputation biases the result.
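The distinction can be made concrete with a toy sketch. The records, the age cut-off of 60 and the income cut-off of 100 are all hypothetical and chosen only to mirror the two examples in the answer above. The point is which column the missingness rule reads.

```python
import statistics

# Hypothetical (age, income in $000) records, invented purely for illustration.
RECORDS = [(25, 40), (32, 55), (37, 150), (41, 70),
           (48, 120), (56, 65), (63, 50), (70, 45)]


def missing_mar(age, income):
    """MAR: missingness is driven by age, a column that stays observed."""
    return age > 60


def missing_mnar(age, income):
    """MNAR: missingness is driven by the income value that goes missing."""
    return income > 100


def complete_case_mean(records, is_missing):
    """Mean income after naively deleting every record flagged as missing."""
    kept = [income for age, income in records if not is_missing(age, income)]
    return statistics.fmean(kept)


true_mean = statistics.fmean([income for _, income in RECORDS])
mnar_mean = complete_case_mean(RECORDS, missing_mnar)

# Under MAR the flag can be rebuilt from the observed ages alone.
print([age for age, income in RECORDS if missing_mar(age, income)])
# Under MNAR the high incomes vanish, so naive deletion understates the mean.
print(true_mean, mnar_mean)
```

In real data the MNAR rule is never visible to the analyst, which is what makes the resulting bias hard to detect.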
Is a point that the box-plot circles automatically an outlier I should delete?
No. A circle only means the point lies beyond a 1.5×IQR fence: it is extreme, not necessarily wrong. Under a Normal distribution about 0.7% of perfectly valid points fall outside the fences and get circled. Investigate the context first; delete a point only if you diagnose it as a genuine error, and consider reporting results with and without an influential point.
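The 0.7% figure can be reproduced from the standard Normal distribution. This sketch uses Python's `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()                       # standard Normal: mean 0, sd 1
q1, q3 = z.inv_cdf(0.25), z.inv_cdf(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Probability mass beyond either fence, i.e. the share of valid points circled.
p_circled = z.cdf(lower) + (1 - z.cdf(upper))
print(f"fences at {lower:.3f} and {upper:.3f}; circled share = {100 * p_circled:.2f}%")
```

The fences land roughly 2.7 standard deviations from the mean, so in a clean Normal sample of 1,000 points about seven would be circled by chance alone.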
Exam move
Treat this chapter as two halves that reward different study. The conceptual half — CRISP-DM phase order, the Snail Shell's iteration signature, and MCAR/MAR/MNAR — is pure recall, so build a one-line flashcard for each and practise placing tasks in the right CRISP-DM phase. The quantitative half — the four-moment map, the five-number summary and 1.5×IQR fences, and sample correlation — rewards a couple of fully worked numerical drills by hand so the steps are automatic under time pressure. Finally, rehearse the three signature traps the examiners reuse every year: cor = 0 is not independence, a circled box-plot point is not automatically an error, and MAR turns on observed (not unobserved) variables. Show all working and use the unit's own notation, since notation discipline is graded.