COMP20008 · Elements Of Data Processing
Ethics, IP & Privacy
The "data use" stage of the pipeline asks: just because you can process data, should you? This chapter covers intellectual property (patent vs copyright vs trademark, and licensing), GDPR for personal data, the three privacy levels, k-anonymity and its weakness (the homogeneity attack), the l-diversity fix, global differential privacy (query, then add mean-0 noise), and bias & fairness. It is examined as definition plus applied reasoning: the 2024 exam's Q11 walks through the two-step differential-privacy procedure, and the 2025 exam's Q11 sets up the k-anonymity homogeneity attack.
What this chapter covers
- 1. Intellectual property: copyright (expression, automatic), patent (inventions, registered), trademark (brand identifiers)
- 2. Control mechanisms: licensing, copyright, patents, trademarks, plus GDPR for personal data
- 3. Privacy levels: self, local (alter released record), global (alter answers to queries)
- 4. k-anonymity: generalise/suppress quasi-identifiers so each record matches ≥ k−1 others
- 5. Quasi-identifiers (age, postcode, employment) vs the sensitive attribute
- 6. l-diversity: each group has ≥ l distinct sensitive values, which defends against the homogeneity attack
- 7. Global differential privacy: (1) query the true data, (2) add mean-0 noise before releasing
- 8. Global sensitivity and the accuracy ↔ privacy trade-off; bias & fairness across the ML lifecycle
k-anonymity homogeneity attack and the l-diversity fix (mirrors 2025 Q11)
- (1 mark) The table is 2-anonymous: each record matches at least one other on the quasi-identifiers, so re-identifying an individual by quasi-identifiers alone is blocked.
- (1 mark) But both records in that group share the same sensitive value, so an attacker who knows their target is in that group learns "Diabetes" with certainty. This is a homogeneity attack: k-anonymity alone does not protect the sensitive attribute.
- (1 mark) Fix: enforce l-diversity, i.e. ensure every quasi-identifier group contains at least l distinct sensitive values (e.g. by generalising or merging groups, or suppressing records) so the diabetes group also contains other conditions.
- (1 mark) Trade-off: more generalisation or suppression to achieve l-diversity reduces the utility of the released data, so privacy and usefulness must be balanced.
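The diagnosis in the marking points above can be checked mechanically. The sketch below uses a hypothetical four-row table (it is not the exam's table, which is not reproduced here), and the helper names `k_of`, `l_of` and `homogeneous_groups` are illustrative only.

```python
from collections import defaultdict

# Hypothetical released table (an assumption, not the exam's table).
# Each row is (age band, postcode prefix, condition): the first two
# columns are quasi-identifiers, the last is the sensitive attribute.
released = [
    ("20-29", "30**", "Diabetes"),
    ("20-29", "30**", "Diabetes"),
    ("30-39", "31**", "Asthma"),
    ("30-39", "31**", "Flu"),
]


def group_by_quasi_identifiers(rows):
    """Map each quasi-identifier combination to its list of sensitive values."""
    groups = defaultdict(list)
    for *quasi, sensitive in rows:
        groups[tuple(quasi)].append(sensitive)
    return dict(groups)


def k_of(rows):
    """Largest k for which the table is k-anonymous (smallest group size)."""
    return min(len(v) for v in group_by_quasi_identifiers(rows).values())


def l_of(rows):
    """Largest l for which the table is l-diverse (fewest distinct values in a group)."""
    return min(len(set(v)) for v in group_by_quasi_identifiers(rows).values())


def homogeneous_groups(rows):
    """Groups whose members all share one sensitive value, and that value."""
    return {
        quasi: values[0]
        for quasi, values in group_by_quasi_identifiers(rows).items()
        if len(set(values)) == 1
    }


print("k =", k_of(released))  # k = 2
print("l =", l_of(released))  # l = 1
print("leaking groups:", homogeneous_groups(released))
```

Here the table is 2-anonymous but only 1-diverse, and the group it flags is the one an attacker could exploit.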
Key terms
- Copyright / patent / trademark: Copyright protects the expression of works (text, art, code) and is automatic on creation; a patent protects novel inventions or processes, must be registered, and is time-limited; a trademark protects brand identifiers (names, logos). They are the main intellectual-property control mechanisms, alongside licensing and, for personal data, GDPR.
- Privacy levels: Self (the individual protects their own data), local (the data owner alters fields in each released record before publishing), and global (the data owner alters the answers it returns to queries). They describe where in the data-release process protection is applied.
- k-anonymity: Generalising or suppressing quasi-identifiers so that every record is indistinguishable from at least k−1 others on the quasi-identifier set, blocking re-identification by those attributes. It protects who you are but not necessarily your sensitive attribute.
- Quasi-identifier: An attribute that is not a direct identifier but, combined with others, can re-identify someone (e.g. age, postcode, employment). k-anonymity generalises quasi-identifiers; the sensitive attribute (e.g. medical condition) is the value being protected.
- l-diversity: A strengthening of k-anonymity requiring that within every quasi-identifier group the sensitive attribute takes at least l distinct values. It defends against the homogeneity attack, in which a k-anonymous group whose members all share one sensitive value leaks that value.
- Global differential privacy: A query-answering scheme: (1) compute the true answer to the query, (2) add random mean-0 noise before releasing it, so the presence or absence of any single individual is masked. The noise scale is governed by the global sensitivity, trading accuracy against privacy.
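To make "generalising quasi-identifiers" concrete, here is a minimal sketch. The records, the 10-year age bands and the two-digit postcode prefix are all assumptions chosen for illustration; a real release would pick its generalisation levels to suit the data.

```python
from collections import Counter

# Hypothetical raw records: (age, postcode, employment status).
# All three are treated as quasi-identifiers here.
raw = [
    (23, "3051", "student"),
    (27, "3053", "student"),
    (34, "3121", "employed"),
    (38, "3122", "employed"),
]


def generalise(record):
    """Coarsen quasi-identifiers: age to a 10-year band, postcode to 2 digits."""
    age, postcode, employment = record
    low = (age // 10) * 10
    return (f"{low}-{low + 9}", postcode[:2] + "**", employment)


def smallest_group(records):
    """Size of the smallest group of identical quasi-identifier tuples."""
    return min(Counter(records).values())


generalised = [generalise(r) for r in raw]
print(smallest_group(raw))          # 1: every raw record is unique
print(smallest_group(generalised))  # 2: each record now matches one other
```

The smallest group size is the k the table achieves, so this generalisation takes the hypothetical data from 1-anonymous (no protection) to 2-anonymous, at the cost of less precise ages and postcodes.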
Ethics, IP & Privacy FAQ
What is the difference between copyright, a patent and a trademark?
Copyright protects the expression of creative works (text, art, music, code) and arises automatically when the work is created. A patent protects a novel invention or process, must be applied for and granted, and lasts a limited term. A trademark protects identifiers of a brand, such as names, logos and slogans. They cover different things, so the same product can involve all three (patented mechanism, copyrighted manual, trademarked name). Licensing and, for personal data, GDPR are further control mechanisms that sit alongside them.
Why isn't k-anonymity enough on its own?
k-anonymity blocks re-identification by quasi-identifiers — each record looks like at least k−1 others on attributes like age and postcode — but it says nothing about the sensitive value. If every member of a k-anonymous group happens to share the same sensitive value, an attacker who locates their target in that group learns it with certainty. This is the homogeneity attack, and the fix is l-diversity: require at least l distinct sensitive values per group, so the group no longer leaks a single answer.
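One way to enforce l-diversity is to suppress every group that falls short of l distinct sensitive values. The sketch below assumes a hypothetical six-row table and an illustrative helper `suppress_to_l_diversity`; generalising or merging groups are alternatives that it does not show.

```python
from collections import defaultdict

# Hypothetical 2-anonymous table: (age band, postcode prefix, condition).
table = [
    ("20-29", "30**", "Diabetes"),
    ("20-29", "30**", "Diabetes"),
    ("30-39", "31**", "Asthma"),
    ("30-39", "31**", "Flu"),
    ("40-49", "32**", "Flu"),
    ("40-49", "32**", "Diabetes"),
]


def suppress_to_l_diversity(rows, l):
    """Keep only rows whose quasi-identifier group has >= l distinct sensitive values."""
    distinct = defaultdict(set)
    for *quasi, sensitive in rows:
        distinct[tuple(quasi)].add(sensitive)
    # row[:-1] is the quasi-identifier part of the row.
    return [row for row in rows if len(distinct[row[:-1]]) >= l]


kept = suppress_to_l_diversity(table, l=2)
print(len(table) - len(kept), "rows suppressed out of", len(table))
# 2 rows suppressed out of 6
```

The suppressed rows are the privacy-vs-utility trade-off in miniature: the release no longer leaks the homogeneous group, but it also contains a third less data.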
What are the two steps of global differential privacy?
Step 1: query the true data to compute the real answer to the requested statistic (e.g. the count of respondents over 65). Step 2: add random mean-0 noise to that true answer and release the noisy result. Because the released value varies but averages to the truth, an attacker cannot tell with confidence from one answer whether any single individual is in or out of the dataset. The amount of noise is set by the global sensitivity (the maximum change that adding or removing one record can make to the query answer) and the privacy budget: more noise means more privacy but less accuracy.
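The two steps can be sketched in a few lines. The text above specifies only "mean-0 noise"; this sketch assumes Laplace noise with scale sensitivity / epsilon, which is one common choice, and the ages, the epsilon value and the function names are all hypothetical.

```python
import random

# Hypothetical respondent ages; the query is "how many are over 65?".
ages = [34, 71, 68, 52, 80, 45, 66, 29]


def true_count_over_65(data):
    """Step 1: query the true data."""
    return sum(1 for age in data if age > 65)


def add_laplace_noise(value, sensitivity, epsilon, rng):
    """Step 2: add mean-0 noise with scale sensitivity / epsilon (assumed Laplace)."""
    scale = sensitivity / epsilon
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centred on 0 with that scale.
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return value + noise


rng = random.Random(0)  # seeded only so the sketch is repeatable
truth = true_count_over_65(ages)
# A count changes by at most 1 when one person is added or removed,
# so its global sensitivity is 1.
released = add_laplace_noise(truth, sensitivity=1, epsilon=0.5, rng=rng)
print("true:", truth, "released:", round(released, 2))
```

A smaller epsilon (a tighter privacy budget) gives a larger noise scale, which is the accuracy-vs-privacy trade-off in code form.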
How is the ethics/privacy material examined in COMP20008?
As definitions and applied reasoning, mirroring 2024 Q11 (state the two differential-privacy steps and the guarantee) and 2025 Q11 (diagnose the k-anonymity homogeneity attack and propose l-diversity). The marks reward precise terminology — quasi-identifier vs sensitive attribute, k-anonymity vs l-diversity, the query-then-add-noise skeleton, global sensitivity — and naming the accuracy-vs-privacy or privacy-vs-utility trade-off, plus the IP definitions.
Exam move
This chapter rewards crisp definitions, so build flashcards for the IP trio (copyright = expression/automatic; patent = invention/registered; trademark = brand identifier) and the privacy levels (self/local/global). Memorise the differential-privacy two-step as a skeleton you can write in seconds — query the true value, add mean-0 noise, release — and add the guarantee (masks any one individual) and the trade-off (more noise = more privacy, less accuracy; noise scale set by global sensitivity), because it is the single highest-frequency privacy answer. For k-anonymity, always pair it with its weakness and fix: k-anonymity hides who, the homogeneity attack defeats it, l-diversity (≥ l sensitive values per group) hides what. Keep the privacy-vs-utility trade-off in every privacy answer, and be ready to name a bias/fairness source across the ML lifecycle.