Our world is awash with data – over 90% of the Earth’s data was created in the last 4 years, and it feels like business leaders can’t go a month without a critical data security failure crashing through the news. With GDPR, CCPA, and PIPEDA all fresh in recent memory, data security has become paramount.
Safeguarding access to sensitive data is a major part of proper security, but so is protecting the use of that data – which is where appropriate data anonymization techniques become useful. As a business leader, you don't need to know how to design or deploy an anonymization algorithm, but it is imperative that you understand what these techniques are and how to employ them to secure your data and protect your business.
What is Data Anonymization?
Data anonymization is the practice of modifying a dataset so that sensitive information can no longer be identified from the remaining data. The type of sensitive information can vary – it depends on the data and its application. Personally identifiable information (PII) is a standard focus, but the same protections can be applied to corporate financial data, legal documents, health records, and much more.
Data anonymization is destructive by design; that is, it fundamentally alters the content and structure of the data by removing or encrypting the identifiers that connect to sensitive elements. Non-destructive techniques, like data pseudonymization (also called coding), instead replace data with artificial identifiers, often unique IDs or text tokens. It's critical to keep this distinction in mind – pseudonymization doesn't scrub identifying data; it just severs the links between a dataset and the actual identifying information. This makes coded data more useful for organizations that need granular, record-level analysis of the dataset, but it also puts much greater pressure on encryption and security protocols to keep access restricted.
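To make the pseudonymization idea concrete, here is a minimal Python sketch. The customer records, field names, and token format are invented for illustration, and a real deployment would use hardened key-management tooling rather than an in-memory dictionary:

```python
import secrets

def pseudonymize(records, key_field):
    """Swap each record's identifier for an opaque token; return the coded
    records plus the lookup table needed to reverse the process."""
    lookup = {}
    coded = []
    for record in records:
        real_id = record[key_field]
        if real_id not in lookup:
            # Artificial identifier: a random 16-character hex token.
            lookup[real_id] = secrets.token_hex(8)
        coded.append({**record, key_field: lookup[real_id]})
    return coded, lookup

# Hypothetical example data.
customers = [
    {"customer_id": "alice@example.com", "revenue": 120_000},
    {"customer_id": "bob@example.com", "revenue": 95_000},
    {"customer_id": "alice@example.com", "revenue": 87_000},
]
coded, lookup = pseudonymize(customers, "customer_id")
# The same real identifier always maps to the same token, so joins and
# longitudinal analysis still work on the coded data.
```

Note that the returned `lookup` table is exactly the link pseudonymization preserves: anyone who obtains it can re-identify the data, which is why access to it must be tightly restricted.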
Types of Data Anonymization
There are many different types of anonymization techniques available to your company, depending on your data's architecture and the needs of your business. The list below is a brief overview of some of our favorite techniques here at Baromitr; it is by no means exhaustive, nor is it intended as a technical guide. Ideally, this gives you just enough knowledge to get by until you can quiz your DBE or data scientist later!
- Perturbation is a technique which makes small modifications to the original dataset by rounding values and/or adding random noise to the data. A subset of this practice is Character Masking, which is employed to make text strings in your dataset less easily identifiable (e.g., replacing the first n characters of a string with "*").
- Data Swapping involves exchanging attribute values between records in areas where the identifiers have minimal impact on the sensitive information. On large datasets, this is done algorithmically to ensure statistical parity before and after the process.
- Record Suppression is the complete removal of select records from a dataset. It is most often applied to data outliers, which are easier to re-identify than other records.
- Synthetic Data is an advanced technique which uses algorithms to generate a new dataset “inspired by” the original dataset. The synthetic data adheres to the same statistical patterns as the original (means, standard deviations, and more) without containing any real sensitive data.
- K-Anonymization is a process where data attributes are converted from discrete values to ranges or groups, where each group contains at least "k" records. For example, in a dataset that includes annual revenue, k-anonymization would take the discrete revenue figures and convert them into categorical ranges ($0 – $3M, $3M – $10M, etc.) such that at least "k" companies fall into each range.
- L-Diversity is a standard follow-up to k-anonymization. While k-anonymization focuses on the number of records in each bucketed group, l-diversity algorithms focus on the number of distinct sensitive values within each k-bucket. The higher the heterogeneity (l-diversity) in the k-bucket, the lower the chances of disclosing sensitive data.
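To show how simple the core of some of these techniques can be, here is a minimal Python sketch of perturbation and character masking. The revenue figures, noise scale, and masking rules are illustrative assumptions, not a production recipe:

```python
import random

random.seed(42)  # fixed seed so the example is reproducible

def perturb(values, scale=0.05):
    """Perturbation: add small random noise (here up to +/-5%) and round
    to the nearest thousand, so individual values change while the
    overall distribution stays roughly intact."""
    return [round(v * (1 + random.uniform(-scale, scale)), -3) for v in values]

def mask(text, n=4, char="*"):
    """Character masking: hide the first n characters of a string."""
    return char * min(n, len(text)) + text[n:]

revenues = [1_250_000, 980_000, 2_400_000]
print(perturb(revenues))
print(mask("4111-1111-1111-1111"))  # -> ****-1111-1111-1111
```

The `scale` parameter is the knob a data team would tune: more noise means less disclosure risk, but also less analytical precision.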
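Synthetic data generation can also be sketched in a few lines if we assume the original data is roughly normally distributed. Real synthetic-data tools model far richer structure (correlations, categories, outliers), so treat this purely as an illustration of the principle:

```python
import random
import statistics

random.seed(7)  # fixed seed so the example is reproducible

def synthesize(values, n):
    """Generate n synthetic values that follow the same mean and standard
    deviation as the originals. This naive version assumes a normal
    distribution; real generators capture much more of the data's shape."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [random.gauss(mu, sigma) for _ in range(n)]

# Hypothetical metric from five real records.
real = [102.0, 98.5, 110.2, 95.4, 104.1]
fake = synthesize(real, 1000)
# The synthetic sample's mean and spread sit near the originals,
# but none of the real records appear in it.
```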
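Finally, k-anonymity and l-diversity can be sketched together as a single check. The revenue bands mirror the ranges in the example above, and the "sector" field is a made-up stand-in for whatever sensitive attribute your dataset carries:

```python
from collections import defaultdict

# Hypothetical revenue bands for k-anonymization.
BANDS = [
    (0, 3_000_000, "$0 - $3M"),
    (3_000_000, 10_000_000, "$3M - $10M"),
    (10_000_000, float("inf"), "$10M+"),
]

def band(revenue):
    """Convert a discrete revenue figure into its categorical range."""
    for lo, hi, label in BANDS:
        if lo <= revenue < hi:
            return label

def check_k_and_l(records, k, l):
    """Group records by revenue band, then verify every band holds at
    least k records (k-anonymity) and at least l distinct values of the
    sensitive attribute (l-diversity)."""
    groups = defaultdict(list)
    for revenue, sensitive in records:
        groups[band(revenue)].append(sensitive)
    return all(len(g) >= k and len(set(g)) >= l for g in groups.values())

# (revenue, sensitive attribute) pairs — invented for the example.
data = [
    (1_200_000, "sector A"), (2_500_000, "sector B"),
    (4_000_000, "sector A"), (9_000_000, "sector C"),
]
print(check_k_and_l(data, k=2, l=2))  # every band has >=2 records, >=2 sectors
```

A check like this is what a data team would run after bucketing: if it fails, the bands are widened (or outliers suppressed) until every group clears the k and l thresholds.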
Which Method Should I Choose?
With all data anonymization techniques, there's a fundamental tradeoff between the usefulness of the data and the risk of disclosing sensitive information. As an anonymization technique becomes more rigorous, information loss increases and the value of the dataset for business intelligence and analytics decreases. Choose a weaker technique and the analytical value of the dataset increases – but so does the risk of revealing identifying data.
If you plotted the performance of the available techniques against your dataset, the resulting utility curves might resemble the chart below. As a curve moves from left to right, disclosure risk increases (that's bad). As it moves from bottom to top, data usefulness decreases (also bad). In this scenario, Method 2 is the optimal choice because it gives your business the best combination of value and identification risk (assuming your teams optimize your processes to the right point on the curve).
As you can see, data security through encryption and access control is only part of the important job of keeping sensitive data from being identified. As the amount of critical data stored by businesses continues to grow exponentially, the need for proper data protections will grow as well.
Want to know more about how data anonymization plays a role in a robust peer benchmarking platform? Contact us for more information, and stay tuned for future blog posts!