University of Sydney · S1 2026 · FACULTY OF BUSINESS & ECONOMICS

BUSS6002 · Data Science In Business

- one subject, every graph, every model, every mark

50% final exam · hurdle · 14 chapters · 12-page Bible

Our own words - no uploaded lecturer files

Built to mirror S1 2026 · updated this semester

Chapter 7 of 11 · BUSS6002

Feature Engineering & Data Ethics

Week 7 is two examinable halves in one chapter. The data-ethics half is qualitative — memorise the five guiding principles (Ownership, Transparency, Privacy, Intention, Outcomes) and be able to name the one a case breaches, including the often-missed Outcomes / disparate impact. The feature-engineering half is quantitative — transform x and/or y so a linear-in-parameters model can fit a non-linear relationship.

The marquee result is retransformation bias: when you model log(y), the naive prediction exp(xᵀβ̂) is downward biased by Jensen's inequality, and must be corrected with a smearing factor estimated from the residuals.

In this chapter

What this chapter covers

1. Five ethics principles: Ownership, Transparency, Privacy, Intention, Outcomes, kept distinct
2. Disparate impact & PII: unintended group harm is unlawful even with good intent
3. Legal landscape: HiQ v. LinkedIn (public scraping ≠ hacking) and the Robodebt disaster
4. Responsible AI: 'AI reflects the values of its creators', black-box opacity, deep-fakes
5. Feature engineering: re-express variables so a linear model fits a non-linear relationship
6. The transform toolkit: log response, interaction, polynomial, linear basis functions
7. Choosing a response transform: right-skewed positive y ⇒ log (symmetry + variance stabilisation)
8. Retransformation bias & the smearing correction: exp(xᵀβ̂) understates E(y|x) by Jensen's inequality

Worked example

Applied data ethics — an automated hiring screen

Q [4 marks]. A company deploys an AI résumé-screening tool that automatically down-ranks applicants from certain postcodes; the firm cannot clearly explain to rejected candidates how the decision was reached. Several months later it emerges that the postcode signal systematically disadvantaged applicants from a particular ethnic group. Using the course's data-ethics framework, name two principles breached and justify each in one sentence.

[1 mark] Principle 1, Outcomes: even if the intention (efficient screening) was legitimate, the system produced disparate impact, inadvertently and systematically disadvantaging a protected group.
[1 mark] Justify: a postcode proxy correlated with ethnicity caused a discriminatory outcome regardless of intent, which the Civil Rights Act framing treats as unlawful.
[1 mark] Principle 2, Transparency: rejected candidates were not told how the automated decision was made and had no workable way to understand or contest it.
[1 mark] Justify: a fair automated decision process must let subjects understand and challenge how their data drove the outcome; an opaque, uncontestable model fails that duty.

Outcomes (disparate impact on a protected group via a postcode proxy) and Transparency (an opaque, uncontestable automated decision). Name each principle in the course's vocabulary first, then ground it in a case fact.

Sia tip — For applied-ethics short answers, write the named principle before the explanation — generic 'it was unfair' earns no structure marks. Outcomes is the principle students most often forget, and it is usually one of the two the marker wants.

Glossary

Key terms

Outcomes (principle) / disparate impact: Even a well-intentioned system must avoid disparate impact — unintended, unjustified harm to a group. Under the Civil Rights Act framing the unit uses, a discriminatory outcome is unlawful even when the intent was legitimate.
Privacy (principle) & PII: Personally identifiable information (PII) is data that can single out an individual. Consent to collect and analyse PII is not licence to publish it, and 'anonymised' data that can be re-identified is still PII.
Feature engineering: Transforming x and/or y so that a model linear in the parameters β can capture a non-linear relationship the raw features cannot — the model stays linear in β; the features change.
Log transform of the response: Modelling log(y) = xᵀβ + ε instead of y, used for a right-skewed, strictly positive response; it compresses the long upper tail toward symmetry and stabilises the variance.
Interaction & polynomial terms: Interaction adds a product term β₃x₁x₂ so one driver's effect depends on another; polynomial adds x², x³, … to bend the fit. Both stay linear in β, so OLS still applies.
Retransformation (back-transform) bias: When log(y) = xᵀβ + ε is correct, the naive prediction ŷ = exp(xᵀβ̂) is downward biased — it does not recover E(y|x), because E(y|x) = exp(xᵀβ)·E[exp(ε)] and E[exp(ε)] > 1.
Jensen's inequality: For a convex function f, E[f(X)] ≥ f(E[X]), with strict inequality when f is strictly convex and X is not constant. Since exp is strictly convex, E[exp(ε)] > exp(E[ε]) = exp(0) = 1 for mean-zero, non-constant errors. This is the named result behind the retransformation bias.
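Smearing correction: An estimate of the dropped factor E[exp(ε)] from the residuals eᵢ of the log-scale fit: ŷ_corrected = exp(xᵀβ̂)·(1/n)Σexp(eᵢ). When the fit includes an intercept the residuals average to zero, so by Jensen's inequality the factor is always ≥ 1 and only ever scales the naive prediction up.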
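
The retransformation and smearing entries above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the unit's own code: the data are synthetic, the true model and noise level are invented for the demo, and the helper names (fit_log_model, smearing_factor, predict) are hypothetical.

```python
import numpy as np

def fit_log_model(x, y):
    """OLS of log(y) on x with an intercept; returns coefficients and residuals."""
    A = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
    resid = np.log(y) - A @ beta
    return beta, resid

def smearing_factor(resid):
    """(1/n) * sum(exp(e_i)), computed from the log-scale residuals."""
    return float(np.mean(np.exp(resid)))

def predict(x, beta, factor=1.0):
    """exp(x'beta_hat), optionally scaled by a smearing factor."""
    A = np.column_stack([np.ones(len(x)), x])
    return np.exp(A @ beta) * factor

# Synthetic data (assumption): log(y) = 1.0 + 0.5*x + eps, eps ~ N(0, 0.8^2)
rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(0, 2, n)
y = np.exp(1.0 + 0.5 * x + rng.normal(0, 0.8, n))

beta, resid = fit_log_model(x, y)
factor = smearing_factor(resid)
naive = predict(x, beta)              # exp(x'beta_hat), no correction
corrected = predict(x, beta, factor)  # naive prediction scaled up by the factor

print("smearing factor:", round(factor, 3))
print("mean of y:      ", round(float(y.mean()), 3))
print("mean naive:     ", round(float(naive.mean()), 3))
print("mean corrected: ", round(float(corrected.mean()), 3))
```

On data like this the average naive prediction should fall noticeably short of the average observed y, while the corrected average lands much closer.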

FAQ

Feature Engineering & Data Ethics FAQ

How do I keep the five ethics principles straight in the exam?

Run them as a checklist in order — Ownership (was there consent to collect?), Transparency (were subjects told how data is used?), Privacy (was PII exposed?), Intention (why is it being collected?), Outcomes (did it cause group harm?). Eliminating the satisfied principles isolates the breach. Watch for cases where consent was given but data was published — that is Privacy, not Ownership.

Why is exp of the fitted log model a biased prediction?

Because the mean of a transform is not the transform of the mean. With log(y) = xᵀβ + ε, the true conditional mean is E(y|x) = exp(xᵀβ)·E[exp(ε)], and by Jensen's inequality E[exp(ε)] > 1 for mean-zero errors. The naive exp(xᵀβ̂) drops that factor greater than one, so it systematically understates E(y|x).
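
One way to see the size of the dropped factor is to add an assumption the chapter does not make: suppose the errors are normal and independent of x, ε ~ N(0, σ²). Under that assumption alone, E[exp(ε)] has a closed form, sketched below. The value σ = 0.5 is purely illustrative.

```latex
\begin{align*}
\log y &= x^\top\beta + \varepsilon, \qquad E[\varepsilon] = 0 \\
E(y \mid x) &= \exp(x^\top\beta)\, E[\exp(\varepsilon)] \\
E[\exp(\varepsilon)] &= \exp(\sigma^2/2) > 1
  \quad \text{(assumed: } \varepsilon \sim N(0,\sigma^2)\text{)} \\
\sigma = 0.5 &\;\Rightarrow\; E[\exp(\varepsilon)] = \exp(0.125) \approx 1.133
\end{align*}
```

With that illustrative σ, the naive prediction would sit roughly 12% below E(y|x). The smearing factor avoids the normality assumption by averaging exp(eᵢ) over the residuals instead.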

When should I log-transform the response?

When the response is strictly positive and right-skewed (a long upper tail) — spending, income, claim size, house price. The log compresses large values far more than small ones, making the response more symmetric and stabilising the variance so the linear-model assumptions hold. Justify your choice from the histogram shape.
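
As a rough check on the "histogram shape" argument, the sketch below compares sample skewness before and after taking logs. The spending data are synthetic (drawn from a lognormal purely for illustration) and the skewness helper is a hypothetical name, not something from the unit.

```python
import numpy as np

def skewness(a):
    """Sample skewness: third central moment over the cubed standard deviation."""
    a = np.asarray(a, dtype=float)
    d = a - a.mean()
    return float(np.mean(d**3) / np.mean(d**2) ** 1.5)

# Synthetic right-skewed, strictly positive "spending" values (assumption)
rng = np.random.default_rng(1)
spend = rng.lognormal(mean=4.0, sigma=0.9, size=10_000)

skew_raw = skewness(spend)          # large and positive: long upper tail
skew_log = skewness(np.log(spend))  # close to zero: roughly symmetric

print("skewness of y:     ", round(skew_raw, 2))
print("skewness of log(y):", round(skew_log, 2))
```

A strongly positive skewness on the raw scale that drops to near zero on the log scale is the numerical counterpart of the long upper tail being compressed.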

Does feature engineering mean I am no longer doing linear regression?

No. Every transform here — log, interaction, polynomial, basis functions — keeps the model linear in the parameters β, which is exactly why ordinary least squares still solves it. The 'non-linear' part refers to the x–y relationship, not to β. A question asking to capture a non-linear relationship with a linear model is asking for feature engineering.
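
A minimal sketch of that point, assuming synthetic data and NumPy only: the same least-squares solver is applied twice, once to the raw features and once to a design matrix with an added squared term and interaction term. The data-generating coefficients and the helper names (ols, r_squared) are invented for the illustration.

```python
import numpy as np

def ols(A, y):
    """Ordinary least squares: solve min ||A beta - y||^2."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def r_squared(A, y, beta):
    """Share of the variance in y explained by the fitted values."""
    resid = y - A @ beta
    return float(1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2))

# Synthetic non-linear relationship (assumption):
# y = 1 + 2*x1 - 1.5*x1^2 + 0.8*x1*x2 + noise
rng = np.random.default_rng(2)
n = 400
x1 = rng.uniform(-2, 2, n)
x2 = rng.uniform(-2, 2, n)
y = 1.0 + 2.0 * x1 - 1.5 * x1**2 + 0.8 * x1 * x2 + rng.normal(0, 0.3, n)

ones = np.ones(n)
A_raw = np.column_stack([ones, x1, x2])                   # raw features only
A_eng = np.column_stack([ones, x1, x2, x1**2, x1 * x2])   # plus polynomial and interaction

beta_raw = ols(A_raw, y)
beta_eng = ols(A_eng, y)
r2_raw = r_squared(A_raw, y, beta_raw)
r2_eng = r_squared(A_eng, y, beta_eng)

print("R^2 raw features:       ", round(r2_raw, 3))
print("R^2 engineered features:", round(r2_eng, 3))
print("engineered coefficients:", np.round(beta_eng, 2))
```

Only the columns of the design matrix change between the two fits. The solver and the linear-in-β structure are identical, which is why OLS still applies.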

How is this chapter actually examined?

Expect a 1-mark ethics MCQ (identify the breached principle), a 4-mark applied-ethics short answer (name two principles, justify each), a 2-mark feature-transform short answer (which transform and why), and a 1-mark MCQ on the retransformation-bias trap. None of it needs the calculator — it rewards precise vocabulary and clean reasoning.

Study strategy

Exam move

Split your revision in two. For ethics, memorise the five principles as a one-pass checklist and drill applied cases until you can name two principles and justify each in single sentences — the marks are for the labelled principle, not for generic outrage, and Outcomes/disparate impact is the one most often missed. For feature engineering, be able to match a data symptom to a transform (right-skewed positive ⇒ log, curvature ⇒ polynomial, effect-depends-on-another ⇒ interaction) while remembering every transform keeps the model linear in β. Above all, lock in the retransformation result: exp of a fitted log model is downward biased by Jensen's inequality, and the smearing factor (1/n)Σexp(eᵢ) corrects it — be ready to both name Jensen and run the four-step correction with numbers.
