"Anonymized data" is one of those holy grails, like "healthy ice-cream" or "selectively breakable crypto" — if "anonymized data" is a thing, then companies can monetize their surveillance dossiers on us by selling them to all comers, without putting us at risk or putting themselves in legal jeopardy (to say nothing of the benefits to science and research of being able to do large-scale data analyses and then publish them along with the underlying data for peer review without posing a risk to the people in the data-set, AKA "release and forget").
As the old saying goes: "wanting it badly is not enough." Worse still, legislatures around the world are convinced that because anonymized data would be amazing and profitable and useful, it must therefore be possible, and they've made laws that say, "once you've anonymized this data, you can treat it like it is totally harmless," without ever saying what "anonymization" actually entails.
Enter a research team from Imperial College London and Belgium's Université Catholique de Louvain, whose Nature Communications article "Estimating the success of re-identifications in incomplete datasets using generative models" shows that they can re-identify "99.98 percent of Americans from almost any available data set with as few as 15 attributes." That means that virtually every large-scale, anonymized dataset for sale or circulating for scientific research purposes today is not anonymized at all, and should not be circulating or sold. (Rob discussed this earlier today)
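To see why so few attributes suffice, consider some back-of-envelope arithmetic (the attribute cardinalities below are invented for illustration, not taken from the paper): 15 ordinary demographic fields generate vastly more possible combinations than there are Americans, so most combinations can describe at most one person.

```python
# Back-of-envelope arithmetic, NOT the paper's method: 15 modest demographic
# attributes carve the population into far more "cells" than there are
# Americans, so most cells can hold at most one person.
us_population = 330_000_000

# Invented cardinalities for illustration: sex, ZIP code, birth year,
# marital status, number of children, and a handful of similar fields.
cardinalities = [2, 42_000, 100, 5, 6, 3, 4, 3, 5, 2, 7, 4, 3, 5, 2]

cells = 1
for c in cardinalities:
    cells *= c

print(f"{len(cardinalities)} attributes -> {cells:,} possible combinations")
print(f"US population: {us_population:,}")
print(f"combinations per person: {cells / us_population:,.0f}")
```

Real attributes are correlated, of course (ZIP code predicts income; birth year predicts marital status), so actual populations bunch into far fewer cells than naive multiplication suggests, which is exactly why the researchers fit a generative model to real data rather than just counting combinations.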
The researchers chose to publish their method rather than keep it a secret so that people who maintain these datasets can use it to test whether their anonymization methods actually work (Narrator: They don't).
While rich medical, behavioral, and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing datasets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We here propose a generative copula-based method that can accurately estimate the likelihood of a specific person to be correctly re-identified, even in a heavily incomplete dataset. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes. Our results suggest that even heavily sampled anonymized datasets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.
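The abstract's "likelihood of a specific person to be correctly re-identified" comes down to one quantity: the chance that no one else in the population shares a target record's attribute values. Here is a minimal sketch of that calculation, assuming independent attributes with invented marginal frequencies (the paper's real model fits a copula precisely because attributes are not independent):

```python
import math

# Minimal sketch of the quantity the paper estimates: the probability that a
# record matching your attributes is really you, i.e. that nobody else in the
# population shares them. The paper fits a copula to model correlated
# attributes; for brevity this sketch assumes independence, and every marginal
# frequency below is invented for illustration.

def uniqueness_likelihood(marginal_probs, population):
    """P(no one else matches) = (1 - p_match)^(population - 1)."""
    p_match = math.prod(marginal_probs)  # independence assumption
    return (1.0 - p_match) ** (population - 1)

US_POPULATION = 330_000_000

# Three attributes: ZIP code, exact birth date, sex.
three = [1 / 42_000, 1 / 36_500, 1 / 2]
print(f"3 attributes:  {uniqueness_likelihood(three, US_POPULATION):.1%}")

# A dozen more coarse attributes push uniqueness to near-certainty.
fifteen = three + [1 / 5, 1 / 3, 1 / 4, 1 / 6, 1 / 2, 1 / 7,
                   1 / 3, 1 / 5, 1 / 4, 1 / 2, 1 / 3, 1 / 5]
print(f"15 attributes: {uniqueness_likelihood(fifteen, US_POPULATION):.7%}")
```

Even this toy version echoes the long-standing finding that ZIP code, birth date, and sex alone make the large majority of Americans unique; stacking a dozen more coarse attributes on top is what drives the paper's 99.98% figure.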
Your Data Were ‘Anonymized’? These Scientists Can Still Identify You [Gina Kolata/New York Times]
Estimating the success of re-identifications in incomplete datasets using generative models [Luc Rocher, Julien M. Hendrickx & Yves-Alexandre de Montjoye/Nature Communications]