Anonymized Datasets Aren’t As Anonymous As You Think

By Rishabh Poddar (Co-Founder and CTO) and Ian Chang (UC Berkeley / Univ. of Washington)
2024-10-02
5 min read

Anonymized datasets are widely used across industries, from healthcare to marketing, to unlock valuable insights while preserving privacy. However, these datasets carry a hidden risk: they’re far more vulnerable to deanonymization than we often realize. Traditional methods of anonymizing data—removing HIPAA identifiers, tokenizing PII fields, or “adding noise” to the data—might seem reliable, but they frequently fail to protect privacy fully.

Consider the famous case of the Netflix Prize dataset from 2006. Netflix released an anonymized set of movie ratings to encourage the development of better recommendation algorithms. Yet, by 2009, researchers discovered they could re-identify some users by cross-referencing this data with other publicly available datasets.

Another glaring example is Latanya Sweeney’s 2000 study, in which she showed that linking supposedly anonymous records to public records (like voter registration data) using just ZIP code, birth date, and gender was enough to re-identify individuals.

Security researchers have long been aware of the risks associated with traditional anonymization techniques. Yet, organizations have continued to adopt and deploy these methods to protect their data. Why? 

First, until recently, there simply haven’t been suitable or practical alternatives for ensuring privacy at scale. Techniques like differential privacy and fully homomorphic encryption are promising, but they are often too complex, costly, or impractical for widespread adoption in everyday applications. (More on this later.) 

Second, while the potential for re-identification exists, the barriers to mounting such attacks have historically been high, requiring significant effort and expertise. Those barriers are now falling fast, making the risks of deanonymization more pressing than ever.

Today’s New Threat: Large Language Models (LLMs)

As anonymization techniques evolve, so too do the tools that can breach them. In recent years, we’ve witnessed a dramatic rise in the capabilities of Large Language Models (LLMs), like ChatGPT. These models, trained on vast datasets that include publicly available information, have revolutionized many industries—but they’ve also introduced new privacy concerns. Unlike earlier deanonymization methods, which required technical expertise and effort, LLMs can process and analyze vast amounts of information quickly and automate much of the work. This makes deanonymization not only faster and more efficient, but also accessible to a far wider range of actors, raising the stakes for protecting anonymized data.

To illustrate the magnitude of this threat, we ran a simple experiment using a dataset from the Personal Genome Project (PGP), an initiative where participants voluntarily share their genomic and health data for research purposes.

We downloaded the publicly available PGP Participant Survey [1], which contains profiles of more than 4,000 participants. Each participant is assigned an ID that serves as the reference to their profile. The profiles appear in a de-identified state and do not directly contain the participant’s name or address. The dataset includes partially noised demographic information (e.g., age in 10-year ranges, gender, and ethnicity), as well as medical and genomic information.
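To ground the rest of the walkthrough, here is a minimal sketch of loading and inspecting the survey, assuming it has been exported to CSV. The file name and column names are hypothetical placeholders, not the actual PGP export schema.

```python
# Minimal sketch: load the de-identified survey and inspect the fields used later.
# "pgp_participant_survey.csv" and the column names are hypothetical placeholders.
import pandas as pd

survey = pd.read_csv("pgp_participant_survey.csv")

print(len(survey))  # 4,000+ de-identified participant profiles
print(survey[["participant_id", "age_range", "sex_gender", "race_ethnicity"]].head())
```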

Let’s take one notable participant from the PGP study—Steven Pinker, a cognitive psychologist and public figure—and attempt to re-identify his profile. Using only GPT-4o and publicly available information, such as his Wikipedia biography and a Financial Times article, we were able to match Pinker to his profile in the PGP database. (Note: Like many other PGP participants, Pinker has chosen to be public about his identity and participation in the PGP study.)

We provided GPT with the following biographical information on Pinker as auxiliary data:

Steven Pinker was born in 1954, making him 69 years old. He is male, of Jewish descent, with grandparents from Poland and Moldova. His profession is in academia, specifically cognitive psychology, and he has been a public advocate for open science.

Using this auxiliary information, we prompted GPT to score the 4,000+ participants row by row, rating each match from 1 to 100. If any major discrepancy appeared, such as a profile listing a gender that contradicts what is publicly known, we instructed the model to penalize the score.

How much does the following data match Steven Pinker on a scale of 1 to 100? If there is any definitively wrong descriptor (e.g. the sex/gender is opposite to what is publicly known about Steven Pinker), dock the score by a lot. Give only a numeric score and no explanation.

In this way, we instructed GPT to go through the rows in the dataset and score how closely each row matches Steven Pinker. The goal was to arrive at a list of candidates whose profiles closely matched Pinker’s, using only de-identified data.
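The scoring pass can be sketched in a few lines of Python. This is an illustration of the approach rather than the exact script we ran: it assumes the OpenAI Python client, the gpt-4o model, and the same hypothetical CSV and column names as in the snippet above.

```python
# Sketch of the row-by-row scoring pass. Assumes the OpenAI Python client and
# the hypothetical CSV/column names from the previous snippet.
import pandas as pd
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

AUX = (
    "Steven Pinker was born in 1954, making him 69 years old. He is male, of "
    "Jewish descent, with grandparents from Poland and Moldova. His profession is "
    "in academia, specifically cognitive psychology, and he has been a public "
    "advocate for open science."
)

PROMPT = (
    "How much does the following data match Steven Pinker on a scale of 1 to 100? "
    "If there is any definitively wrong descriptor (e.g. the sex/gender is opposite "
    "to what is publicly known about Steven Pinker), dock the score by a lot. "
    "Give only a numeric score and no explanation.\n\n"
    "Known biography: {aux}\n\nParticipant profile: {profile}"
)

def score_profile(profile: str) -> float:
    """Ask the model how closely one de-identified profile matches the target."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": PROMPT.format(aux=AUX, profile=profile)}],
        temperature=0,
    )
    # The prompt asks for a bare number, so the reply is parsed directly.
    return float(resp.choices[0].message.content.strip())

survey = pd.read_csv("pgp_participant_survey.csv")

# Score every profile, then keep the highest-scoring candidates for inspection.
survey["score"] = [score_profile(row.to_json()) for _, row in survey.iterrows()]
print(survey.nlargest(10, "score")[["participant_id", "score"]])
```

Requesting a numeric-only answer at temperature 0 keeps the output easy to parse, though some run-to-run variability remains.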

We repeated the exercise three times and averaged the results to reduce variability. Perhaps (un)surprisingly, GPT was able to accurately pinpoint Steven Pinker’s profile and single him out with high confidence!

Final Decision: **(/redacted-profile-ID/)** is the best match to Steven Pinker.

Why This Should Concern You

This experiment underscores a sobering reality: deanonymizing data has never been easier. The issue presents a serious concern for organizations handling anonymized enterprise data, such as in finance and healthcare. Sensitive datasets in these industries often include transactional histories, patient health records, or insurance information—data that is anonymized to protect privacy. However, deanonymization methods, when applied to such datasets, can expose individuals or organizations to serious risks. Even seemingly trivial details, when cross-referenced with public information, can lead to exposure of highly sensitive data like financial behavior or health records.

This ability to deanonymize data with relative ease, using widely accessible tools like LLMs, represents a growing threat to data privacy. What once required significant effort and expertise can now be done with automated systems, making re-identification of individuals from supposedly anonymous datasets alarmingly simple.

Tools like GPT are dismantling the manual barriers that once made deanonymization a labor-intensive task. Our experiment only scratches the surface of what’s possible with modern AI.

The Role of Confidential Computing in Addressing Data Privacy Concerns

As deanonymization becomes easier, our perception of data privacy must evolve. LLMs like GPT are blurring the lines between anonymized and identifiable data, raising serious concerns about the security of anonymized datasets. What’s needed is an additional security layer that can enable the sharing of sensitive data without compromising confidentiality.

Confidential Computing offers a solution by enabling the safe sharing and processing of data while keeping it encrypted throughout its lifecycle – not just at rest and in transit, but also during processing (at runtime). As a result, confidential computing makes it possible to process sensitive data and generate insights, while ensuring that the underlying dataset remains protected from exposure at all times.

In today’s world, the label “anonymous” no longer guarantees privacy. It’s time we rethink our approach to data security and embrace encryption-based methods like confidential computing to safeguard sensitive data.

__________________________

[1] https://my.pgp-hms.org/google_surveys/1
