Why Data Anonymization Isn’t Enough in the Age of AI
We are living in a time of unprecedented data generation. According to some estimates, 90% of the world’s data was created in the last two years, and the same sources say that roughly 328.77 million terabytes of data are created each day. To put that into perspective, a single terabyte can hold about 500 hours of movies. From a business perspective, the more data, the better, especially in the age of AI. Yet enterprises are at an impasse because a company’s most valuable data is typically also its most sensitive, such as confidential data about customers or personally identifiable information (PII).
To date, this data has often gone unused because companies lack a universally reliable mechanism to secure it. Now, with the rise of AI, fears of data exposure are growing among enterprises. Attackers are already using prescriptive analytics to compromise current-generation anonymization technologies. Generative AI (GenAI) further lowers the barrier for bad actors to mount attacks, as they can now leverage natural language interfaces to design new strategies and toolkits.
What’s needed is an additional security layer that can enable the sharing of sensitive data without compromising its confidentiality. With a purpose-built data collaboration environment, teams can dramatically simplify anonymization and sharing processes.
Data Security as It Stands
Today, 6 in 10 organizations say they use an enterprise-wide solution to protect their proprietary data. These anonymization technologies range from largely manual efforts at data redaction that can be reverse-engineered, to sophisticated technologies that use encryption, noise, algorithms, and other techniques to enable collaborative use of underlying data sets without revealing identifying characteristics.
But these practices are costly and time-consuming, and they no longer seem to be cutting it. Over two-thirds of IT and security leaders believe the accelerated data growth across their infrastructure is “outpacing their ability to secure data and manage risk.”
The privacy-enhancing technologies businesses use today vary in how difficult they are to implement, and each has underlying flaws, whether prohibitive cost, time requirements, or a lack of verification:
- Synthetic data generation: Creates new data based on the properties of underlying datasets. The easiest way to do this is with GenAI: Data teams can use natural language queries to create and refine new datasets. Alternatively, data, IT, and security teams can randomize data, create generative adversarial networks, or use other approaches to create artificial data for analysis.
- Differential privacy: This technique adds noise to original datasets when they are collected or before they are shared to ensure that underlying information can’t be re-identified. Data, IT, and security teams must analyze the data, decide how much noise to add, and choose where to inject it, such as into query results or into algorithms during training (a minimal noise-adding sketch follows this list).
- Homomorphic encryption: This enables computing on encrypted data. This technology can be challenging for data, IT, and security teams to set up, as it requires specialized cryptographic knowledge, libraries, frameworks, and tools. In addition, data must be encrypted to perform computations and then decrypted to review results.
- Secure multi-party computation: This approach allows multiple parties to use encrypted data sets while protecting each party’s data from other parties’ approved users and adversaries. These groups must jointly set up a protocol to communicate securely, agree on the desired shared computational method, and protect their data, making this approach difficult to realize.
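To make the differential-privacy idea concrete, here is a minimal sketch in Python that adds Laplace noise to a count query before the result is shared. The dataset, the epsilon values, and the query are illustrative assumptions, not a calibrated, production-grade mechanism.

```python
import numpy as np

def noisy_count(values, predicate, epsilon=1.0):
    """Return a differentially private count of rows matching `predicate`.

    A single row changes a count by at most 1, so the sensitivity is 1 and
    Laplace noise with scale sensitivity/epsilon satisfies epsilon-DP.
    """
    true_count = sum(1 for v in values if predicate(v))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Illustrative data: customer ages from a sensitive dataset (made up).
ages = [34, 29, 47, 52, 38, 61, 45, 33, 58, 41]

# Share only the noisy result, never the raw count. A smaller epsilon means
# more noise and stronger privacy, at the cost of accuracy.
print(noisy_count(ages, lambda age: age > 40, epsilon=0.5))
```

The same trade-off drives every deployment decision: the noise scale that protects individuals also degrades the statistic, which is why teams must analyze the data and its intended queries before choosing parameters.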
Each of these technologies provides important data protection for specific use cases. But if human experts can reverse-engineer anonymized data, imagine what powerful algorithms can accomplish.
How AI Is Raising the Stakes on Data Anonymization
Large language models (LLMs) and enterprise proprietary models create greater data exposure threats as development races ahead of guardrails. These risks (discussed in more detail here) arise at the following development stages:
- Pre-training models: LLMs use large datasets that are preprocessed to anonymize or redact PII. However, sensitive data may slip in, given the sheer volume of data and the number of sources used.
- Fine-tuning models: LLMs are retrained on specific data sets to improve their ability to complete certain tasks, such as synopsizing clinician-patient interactions. However, models can learn new patterns and memorize data components, inadvertently revealing sensitive details—even if confidential information is removed. For example, LLMs analyzing interaction notes at scale could leak individual patients’ medical histories.
- Conducting model inference: Enterprise business teams can input sensitive or confidential data, including PII, into models to improve output. As a result, this data could potentially be exposed to others. A study of ChatGPT use found that organizations with 10,000 users entering 660 daily prompts can expect 158 monthly incidents of sensitive data exposure, with source code the leading type of exposed data, followed by regulated data, intellectual property, and passwords and keys. Redacting obvious PII before prompts ever reach a model is one basic mitigation (see the sketch after this list).
- Developing proprietary models: Companies are developing domain-specific models on top of LLMs. The security of these new models depends on the companies’ data privacy practices and their security and legal guardrails. For example, companies should rapidly evolve data governance by setting strict business rules on how data is used, stored, and managed, avoiding issues such as teams abandoning the cloud data stores used for model training and inference.
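As a simple guardrail against the inference-time leakage described above, a team might scrub obvious PII from prompts before they reach a model. The regular expressions and the `redact` helper below are illustrative assumptions, not an exhaustive or production-grade PII filter (real deployments would add named-entity recognition and broader pattern coverage).

```python
import re

# Illustrative patterns for common PII; real filters need far broader
# coverage (names, addresses, account numbers) and ideally an NER model.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(prompt: str) -> str:
    """Replace recognizable PII with typed placeholders before the prompt
    is sent to an external LLM endpoint."""
    for label, pattern in PII_PATTERNS.items():
        prompt = pattern.sub(f"[{label}]", prompt)
    return prompt

raw = "Patient reachable at jdoe@example.com or 555-867-5309, SSN 123-45-6789."
print(redact(raw))
# -> "Patient reachable at [EMAIL] or [PHONE], SSN [SSN]."
```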
Given these risks, enterprises need to take the time to set up strong security and governance for AI models before they stand up experimentation sandboxes or deploy and scale models.
Realizing the Potential of Confidential Computing as the Optimal Data Security Layer
Enterprise leaders should not have to delay opportunities to work with important clients and partners for fear of disclosing sensitive data. Aware of the challenges that come with current data anonymization practices, technology leaders came together to launch the Confidential Computing Consortium to accelerate the development and adoption of new approaches for securing data in use, wherever it is stored and however it is used.
Confidential computing enables companies to secure sensitive data inside trusted execution environments (TEEs) while it is being processed. Companies still need to extend that security with end-to-end encryption covering data at rest, in transit, and in use, because data in use is typically left unencrypted, making it the most vulnerable to exploitation.
With a purpose-built environment, companies can securely share and analyze data while maintaining complete confidentiality. They can augment current enterprise data pipelines with a confidential layer that protects sensitive and regulated data while enabling internal teams and partners to process encrypted data sets. Companies can also guarantee governance of their data with a hardware root of trust that provides verifiable privacy and control.
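As a rough illustration of keeping data encrypted everywhere outside the enclave, the sketch below uses Python’s `cryptography` library to encrypt a record before it leaves the data owner’s environment. The record and the key handling are assumptions for illustration only; key release, remote attestation of the TEE, and in-enclave processing are platform-specific and are noted only in comments.

```python
from cryptography.fernet import Fernet

# The data owner generates and holds the key. In practice the key would be
# released to a TEE only after remote attestation proves the enclave's
# identity and integrity.
key = Fernet.generate_key()
fernet = Fernet(key)

record = b'{"customer_id": 4121, "diagnosis": "hypertension"}'  # illustrative

# The record is encrypted before it is written to storage or sent over the
# wire, so it stays protected at rest and in transit.
ciphertext = fernet.encrypt(record)

# Only a party holding the released key (ideally an attested enclave) can
# recover the plaintext for computation; everywhere else, only ciphertext
# is visible.
plaintext = fernet.decrypt(ciphertext)
assert plaintext == record
```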
With this technology, teams and partners can:
- Enable high-performance analytics and AI on encrypted data using familiar tools: With a trustworthy platform for secure data sharing, business teams can isolate sensitive data in TEEs and perform collaborative, scalable analytics and ML directly on encrypted data using familiar tools such as Apache Spark and notebooks (see the sketch after this list).
- Allow approved internal and external collaborative analytics, AI, and data sharing: Internal and external teams can share encrypted data or blended data sets with set policies, streamlining collaboration while keeping encrypted analytics results specific to each party.
- Scale across enclaves, data sources, and multiple parties: A sensitive data-sharing tool can secure access across enclave clusters and automate cluster orchestration, monitoring, and management across workspaces without operational disruption, simplifying data management responsibilities.
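To illustrate the “familiar tools” point, here is a minimal PySpark aggregation of the kind a team would write in a notebook; the premise is that the same code runs unchanged when the Spark executors are hosted on enclave-backed (TEE) clusters. The dataset, columns, and values are illustrative assumptions, and the enclave wiring itself is handled by the platform rather than by this code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("confidential-analytics").getOrCreate()

# Illustrative sensitive dataset; on a confidential platform this would be
# loaded as ciphertext and decrypted only inside the enclave.
claims = spark.createDataFrame(
    [("east", "cardiology", 1250.0),
     ("east", "oncology", 9800.0),
     ("west", "cardiology", 2100.0)],
    ["region", "specialty", "claim_amount"],
)

# Standard Spark analytics: aggregate spend per region and specialty.
summary = (claims.groupBy("region", "specialty")
                 .agg(F.sum("claim_amount").alias("total_spend"),
                      F.count("*").alias("claim_count")))

summary.show()
```

The point of the sketch is that analysts keep writing ordinary Spark, while the confidential layer determines where that code executes and what the operators of the infrastructure can see.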
Move Ahead of Model Manipulation by Securing Data and Analytics Now
The time is now to lay the foundation for secure, end-to-end confidential computing processes and to close the gaps that contribute to data, analytics, and model exploitation. If enterprises do nothing, they risk losing control over sensitive data and model output, exposing data to employee misuse, and inviting AI-powered attacks from adversaries.
With a secure collaboration platform, companies and partners can share data and analytics, improving their ability to target audiences, personalize interactions, and deliver the exceptional experiences customers expect. They can also rapidly analyze emerging business opportunities and capture the ones that drive the most value for their organizations.
Read more about Opaque’s approach to data security in the age of AI in our white paper, Securing Generative AI in the Enterprise.