🤖 AI Summary
In 1997 Latanya Sweeney famously reidentified the Massachusetts governor in a supposedly “anonymized” hospital dataset by linking ZIP code, date of birth and gender to a $20 voter list—showing that removing direct identifiers (name, SSN, etc.) isn’t enough when auxiliary data exists. That insight motivated k‑anonymity: a dataset is k‑anonymous if every combination of quasi‑identifier values (the demographic fields an attacker might know) appears in at least k records, so a linkage yields k indistinguishable candidates rather than a single person.
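The definition above can be checked mechanically: group records by their quasi-identifier tuple and find the smallest group. A minimal sketch, where the records, field names, and QI choice are illustrative assumptions rather than anything from the source:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest equivalence class, i.e. the
    largest k for which the table is k-anonymous."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values())

# Hypothetical toy table: the third record is unique on its QI tuple,
# so the table as a whole is only 1-anonymous.
rows = [
    {"zip": "02138", "age": 34, "sex": "F", "dx": "flu"},
    {"zip": "02138", "age": 34, "sex": "F", "dx": "asthma"},
    {"zip": "02139", "age": 41, "sex": "M", "dx": "flu"},
]
print(k_anonymity(rows, ["zip", "age", "sex"]))  # → 1
```

Note that k is a property of the whole table: one unique QI combination drags the entire release down to 1-anonymity, which is why outliers get suppressed.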
Practically, k‑anonymity is achieved by generalization (e.g., turning exact ages into ranges or truncating ZIP codes) and suppression (removing outliers). Generalization can be global (same mapping applied across the table) or local (different mappings per record), trading off ease of analysis versus retained utility. Choosing k is nontrivial—healthcare practice often uses 5–15—but there’s no principled, one‑size‑fits‑all value because risk depends on data value, adversary access, and consequences. The takeaway for AI/ML teams: naïve de‑identification invites linkage attacks; k‑anonymity is a simple, interpretable baseline that reduces reidentification risk but requires careful QI selection, utility/privacy trade‑offs, and further analysis of residual risks.
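The two mechanisms can be combined in one pass: apply a global generalization (the same coarsening for every record), then suppress records whose generalized QI combination still falls below k. A sketch under assumed field names and sample data, not a production anonymizer:

```python
from collections import Counter

def generalize(record):
    """Global generalization: 10-year age bands, ZIP truncated to 3 digits."""
    decade = (record["age"] // 10) * 10
    return {
        **record,
        "age": f"{decade}-{decade + 9}",
        "zip": record["zip"][:3] + "**",
    }

def anonymize(records, quasi_identifiers, k=2):
    generalized = [generalize(r) for r in records]
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in generalized)
    # Suppression: drop outliers still in equivalence classes smaller than k.
    return [r for r in generalized
            if counts[tuple(r[q] for q in quasi_identifiers)] >= k]

rows = [
    {"zip": "02138", "age": 34, "sex": "F", "dx": "flu"},
    {"zip": "02139", "age": 37, "sex": "F", "dx": "asthma"},
    {"zip": "02144", "age": 62, "sex": "M", "dx": "flu"},
]
released = anonymize(rows, ["zip", "age", "sex"])
# The first two rows both generalize to ("021**", "30-39", "F") and survive;
# the lone 62-year-old remains unique and is suppressed.
```

This is the global flavor: every record gets the same mapping, which keeps the released columns uniform and easy to analyze. A local scheme would instead coarsen only the outlying records, retaining more utility at the cost of mixed granularities in the output.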