University of Melbourne · S1 2026 · FACULTY OF INFORMATION TECHNOLOGY

COMP20008 · Elements Of Data Processing

- one subject, every graph, every model, every mark
50% final exam · hurdle · 14 Chapters · 7-page Bible
Our own words - no uploaded lecturer files
Built to mirror S1 2026 · updated this semester
Chapter 11 of 12 · COMP20008

Ethics, IP & Privacy

The "data use" stage of the pipeline asks: just because you can process data, should you? This chapter covers intellectual property (patent vs copyright vs trademark, licensing, GDPR), the three privacy levels, k-anonymity and its weakness the homogeneity attack, the l-diversity fix, global differential privacy (query then add mean-0 noise), and bias & fairness. It is examined as definition + applied reasoning: the 2024 exam's Q11 walks the two-step differential-privacy procedure and the 2025 exam's Q11 sets up the k-anonymity homogeneity attack.

In this chapter

What this chapter covers

  • 1. Intellectual property: copyright (expression, automatic), patent (inventions, registered), trademark (brand identifiers)
  • 2. Control mechanisms: licensing, copyright, patents, trademarks, plus GDPR for personal data
  • 3. Privacy levels: self, local (alter released record), global (alter answers to queries)
  • 4. k-anonymity: generalise/suppress quasi-identifiers so each record matches ≥ k−1 others
  • 5. Quasi-identifiers (age, postcode, employment) vs the sensitive attribute
  • 6. l-diversity: each group has ≥ l distinct sensitive values, which defends against the homogeneity attack
  • 7. Global differential privacy: (1) query the true data, (2) add mean-0 noise before releasing
  • 8. Global sensitivity and the accuracy ↔ privacy trade-off; bias & fairness across the ML lifecycle
Worked example

k-anonymity homogeneity attack and the l-diversity fix (mirrors 2025 Q11)

Q [4 marks]. A released table is 2-anonymous on the quasi-identifiers {age-band, postcode}. One group of two records, both {age 50–59, postcode 30**}, share the same sensitive value "Condition = Diabetes". Identify the privacy problem and state a fix.
  • [1 mark] The table is 2-anonymous: each record matches at least one other on the quasi-identifiers, so re-identifying an individual by quasi-identifiers alone is blocked.
  • [1 mark] But both records in that group share the same sensitive value, so an attacker who knows their target is in that group learns "Diabetes" with certainty. This is a homogeneity attack: k-anonymity alone does not protect the sensitive attribute.
  • [1 mark] Fix: enforce l-diversity, i.e. ensure every quasi-identifier group contains at least l distinct sensitive values (e.g. by generalising or merging groups, or suppressing records) so the diabetes group also contains other conditions.
  • [1 mark] Trade-off: more generalisation or suppression to achieve l-diversity reduces the utility of the released data, so privacy and usefulness must be balanced.
It is a homogeneity attack: the 2-anonymous group is not diverse, so its shared sensitive value leaks. The fix is l-diversity (≥ l distinct sensitive values per group), at the cost of some data utility.
Sia tip — Say both layers explicitly: k-anonymity hides who (re-identification by quasi-identifiers), l-diversity hides what (the sensitive value) — naming both is what distinguishes the full-mark answer.
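The diagnosis in this worked example can be sketched in a few lines of Python. This is a minimal illustration, not the subject's reference code: the function names, the column names, and the second group of records are assumptions made up for the demo (only the first group comes from the question).

```python
from collections import defaultdict


def group_by_quasi_identifiers(records, quasi_identifiers):
    """Collect records into groups that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[q] for q in quasi_identifiers)
        groups[key].append(rec)
    return dict(groups)


def k_of(groups):
    """Largest k for which the table is k-anonymous: the smallest group size."""
    return min(len(g) for g in groups.values())


def l_of(groups, sensitive):
    """Largest l for which the table is l-diverse (distinct-values version)."""
    return min(len({r[sensitive] for r in g}) for g in groups.values())


def homogeneous_groups(groups, sensitive):
    """Groups whose members all share one sensitive value (homogeneity attack)."""
    return [key for key, g in groups.items()
            if len({r[sensitive] for r in g}) == 1]


# First two rows are the group from the question; the last two are hypothetical.
released = [
    {"age_band": "50-59", "postcode": "30**", "condition": "Diabetes"},
    {"age_band": "50-59", "postcode": "30**", "condition": "Diabetes"},
    {"age_band": "20-29", "postcode": "31**", "condition": "Asthma"},
    {"age_band": "20-29", "postcode": "31**", "condition": "Flu"},
]

groups = group_by_quasi_identifiers(released, ["age_band", "postcode"])
k = k_of(groups)                                # 2: the table is 2-anonymous
l = l_of(groups, "condition")                   # 1: it is not even 2-diverse
leaky = homogeneous_groups(groups, "condition")  # the {50-59, 30**} group
```

Here k = 2 but l = 1, which is exactly the gap the question targets: the table passes the k-anonymity check while one group still leaks its sensitive value.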
Glossary

Key terms

Copyright / patent / trademark
Copyright protects the expression of works (text, art, code) and is automatic on creation; a patent protects novel inventions or processes and must be registered and is time-limited; a trademark protects brand identifiers (names, logos). They are the main intellectual-property control mechanisms, alongside licensing and GDPR for personal data.
Privacy levels
Self (the individual protects their own data), local (the data owner alters fields in each released record before publishing), and global (the data owner alters the answers it returns to queries). They describe where in the data-release process protection is applied.
k-anonymity
Generalising or suppressing quasi-identifiers so that every record is indistinguishable from at least k−1 others on the quasi-identifier set, blocking re-identification by those attributes. It protects who you are but not necessarily your sensitive attribute.
Quasi-identifier
An attribute that is not a direct identifier but, combined with others, can re-identify someone (e.g. age, postcode, employment). k-anonymity generalises quasi-identifiers; the sensitive attribute (e.g. medical condition) is the value being protected.
l-diversity
A strengthening of k-anonymity requiring that within every quasi-identifier group the sensitive attribute takes at least l distinct values. It defends against the homogeneity attack, where a k-anonymous group that all shares one sensitive value leaks it.
Global differential privacy
A query-answering scheme: (1) compute the true answer to the query, (2) add random mean-0 noise before releasing it, so the presence or absence of any single individual is masked. The noise scale is governed by the global sensitivity, trading accuracy against privacy.
FAQ

Ethics, IP & Privacy FAQ

What is the difference between copyright, a patent and a trademark?

Copyright protects the expression of creative works (text, art, music, code) and arises automatically when the work is created. A patent protects a novel invention or process, must be applied for and granted, and lasts a limited term. A trademark protects identifiers of a brand: names, logos, slogans. They cover different things, so the same product can involve all three (patented mechanism, copyrighted manual, trademarked name). Licensing and, for personal data, GDPR are further control mechanisms that sit alongside these three.

Why isn't k-anonymity enough on its own?

k-anonymity blocks re-identification by quasi-identifiers — each record looks like at least k−1 others on attributes like age and postcode — but it says nothing about the sensitive value. If every member of a k-anonymous group happens to share the same sensitive value, an attacker who locates their target in that group learns it with certainty. This is the homogeneity attack, and the fix is l-diversity: require at least l distinct sensitive values per group, so the group no longer leaks a single answer.
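The generalisation step that produces a k-anonymous table can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the 10-year age bands, the two-digit postcode mask, and the four raw records are all hypothetical choices made for the demo.

```python
from collections import Counter


def generalise_age(age, width=10):
    """Replace an exact age with a band, e.g. 52 becomes '50-59'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def generalise_postcode(postcode, keep=2):
    """Keep the leading digits and mask the rest, e.g. '3053' becomes '30**'."""
    return postcode[:keep] + "*" * (len(postcode) - keep)


def smallest_group(rows):
    """Size of the smallest set of rows with identical quasi-identifiers."""
    return min(Counter(rows).values())


# Hypothetical raw quasi-identifiers: (age, postcode).
raw = [(52, "3053"), (57, "3010"), (23, "3121"), (28, "3141")]

generalised = [(generalise_age(a), generalise_postcode(p)) for a, p in raw]

k_before = smallest_group(raw)          # every raw record is unique
k_after = smallest_group(generalised)   # each record now matches one other
```

Before generalisation every record is unique on (age, postcode); afterwards each record shares its quasi-identifiers with one other, so the table is 2-anonymous. Nothing in this step looks at the sensitive attribute, which is why a separate l-diversity check is still needed.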

What are the two steps of global differential privacy?

Step 1: query the true data to compute the real answer to the requested statistic (e.g. the count of respondents over 65). Step 2: add random mean-0 noise to that true answer and release the noisy result. Because the released value varies but averages to the truth, an attacker cannot tell from one answer whether any single individual is in or out of the dataset. The amount of noise is set by the global sensitivity (the maximum effect one record can have) and the privacy budget — more noise means more privacy but less accuracy.
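The two steps can be sketched in Python as below. The text only requires mean-0 noise; drawing it from a Laplace distribution with scale sensitivity / epsilon is an assumption here (it is a common choice, but the subject may present a different distribution). The function names, the `epsilon` parameter name for the privacy budget, and the sample ages are likewise hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Mean-0 Laplace noise, built as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def true_count(ages, threshold=65):
    """Step 1: the real answer, here the number of respondents over 65."""
    return sum(1 for a in ages if a > threshold)


def private_count(ages, epsilon, rng, threshold=65):
    """Step 2: add mean-0 noise to the true answer and release the result."""
    # Adding or removing one person changes a count by at most 1,
    # so the global sensitivity of this query is 1.
    sensitivity = 1
    noise = laplace_noise(sensitivity / epsilon, rng)  # assumed Laplace scale
    return true_count(ages, threshold) + noise


rng = random.Random(0)                  # seeded only to make the demo repeatable
ages = [70, 34, 66, 81, 59, 45, 68]     # hypothetical respondents

exact = true_count(ages)                # 4 respondents are over 65
releases = [private_count(ages, epsilon=0.5, rng=rng) for _ in range(20000)]
average = sum(releases) / len(releases)  # close to the true count
```

Any single release can be well off the true count, but the average over many releases sits close to it, which is what "mean-0" buys. Lowering `epsilon` widens the noise (more privacy, less accuracy); raising it does the opposite.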

How is the ethics/privacy material examined in COMP20008?

As definitions and applied reasoning, mirroring 2024 Q11 (state the two differential-privacy steps and the guarantee) and 2025 Q11 (diagnose the k-anonymity homogeneity attack and propose l-diversity). The marks reward precise terminology — quasi-identifier vs sensitive attribute, k-anonymity vs l-diversity, the query-then-add-noise skeleton, global sensitivity — and naming the accuracy-vs-privacy or privacy-vs-utility trade-off, plus the IP definitions.

Study strategy

Exam move

This chapter rewards crisp definitions, so build flashcards for the IP trio (copyright = expression/automatic; patent = invention/registered; trademark = brand identifier) and the privacy levels (self/local/global). Memorise the differential-privacy two-step as a skeleton you can write in seconds (query the true value, add mean-0 noise, release), then add the guarantee (masks any one individual) and the trade-off (more noise = more privacy, less accuracy; noise scale set by global sensitivity), because it is the single highest-frequency privacy answer. For k-anonymity, always pair it with its weakness and fix: k-anonymity hides who, the homogeneity attack still leaks the sensitive value when a group is uniform, and l-diversity (≥ l distinct sensitive values per group) hides what. Keep the privacy-vs-utility trade-off in every privacy answer, and be ready to name a bias/fairness source across the ML lifecycle.
