Privacy Preserving Analytics: Privacy-by-Design for Big Data Analytics

Advanced analytics, machine learning and other data science techniques are powerful tools for transformation. However, because “big data” entails large and complex data sets, the privacy risks associated with such endeavors are incredibly high. 

Organizations are legally obligated to protect personally identifiable information (PII) from external threats, and how they use such data is also coming under increased scrutiny. As a result, preserving user privacy has become a key requirement for many web-scale analytics and reporting applications.

Organizations looking to enable the sharing, processing or analysis of personal data without compromising privacy are increasingly adopting privacy preserving data analytics (PPDA) strategies. Rather than a specific tool or technology, PPDA represents a privacy-first approach to delivering data analytics.

Though PPDA first and foremost requires an effective, mathematically robust definition of privacy, it also relies on a combination of data protection systems and technologies - most of which result in data anonymization - to secure data. The following is an overview of some of those approaches.

K-anonymity, l-diversity and t-closeness

Often described as the power of ‘hiding in the crowd,’ k-anonymity rests on the idea that when records with similar attributes are grouped together, identifying information about any one individual in the group is obscured. More precisely, a data set is k-anonymous when every record shares its combination of quasi-identifiers (attributes such as age or ZIP code that could be linked to outside data) with at least k-1 other records. In other words, if an individual’s data is pooled in a larger group, any information in the group could correspond to any single member.

The aim is to prevent ‘re-identification,’ the practice of tracing data back to the individual it is connected to in the real world.
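
As a rough illustration of the k-anonymity idea, the sketch below groups records by their quasi-identifier values and reports the size of the smallest group, which is the k the data set actually achieves. The records, the field names (age, zip, diagnosis) and the helper names k_of and is_k_anonymous are all hypothetical, and the data is assumed to have been generalized already.

```python
from collections import Counter

def k_of(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    the same quasi-identifier values (0 for an empty data set)."""
    if not records:
        return 0
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

def is_k_anonymous(records, quasi_identifiers, k):
    """True when every record hides among at least k-1 look-alikes."""
    return k_of(records, quasi_identifiers) >= k

# Hypothetical records, already generalized (age bucketed, ZIP truncated).
records = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "asthma"},
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "100**", "diagnosis": "diabetes"},
    {"age": "40-49", "zip": "100**", "diagnosis": "flu"},
]

print(k_of(records, ["age", "zip"]))               # size of the smallest group
print(is_k_anonymous(records, ["age", "zip"], 3))  # is the data 3-anonymous?
```

Here the smallest group holds two records, so the data is 2-anonymous but not 3-anonymous.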

K-anonymization is suitable for data with low dimensionality, but it is difficult to group high-dimensional data such as time series, so it may not work for every data science project, especially considering the cost of data minimization. It also falls short against homogeneity attacks, where everyone in a group shares the same sensitive value, and background knowledge attacks, where an attacker uses outside information to narrow down the possibilities.

To help overcome the limitations of k-anonymization, two extensions have been developed: l-diversity and t-closeness. L-diversity requires each equivalence class (a group of records sharing the same quasi-identifier values) to contain at least l well-represented values of the sensitive attribute, which increases entropy and diversity within the group at the price of further reducing the granularity of the data. T-closeness builds on l-diversity by requiring that the distribution of a sensitive attribute in any equivalence class be within a threshold t of the attribute’s distribution in the overall table.
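
The two extensions can be checked in a few lines. The sketch below is only an illustration under stated assumptions: it uses the simplest "distinct" form of l-diversity (counting different sensitive values per equivalence class), and it measures closeness with total variation distance, which is a common stand-in for categorical attributes rather than the only possible metric. The records and function names are hypothetical.

```python
from collections import Counter, defaultdict

def equivalence_classes(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    classes = defaultdict(list)
    for r in records:
        classes[tuple(r[q] for q in quasi_identifiers)].append(r)
    return classes

def distribution(records, sensitive):
    """Relative frequency of each sensitive value in a set of records."""
    counts = Counter(r[sensitive] for r in records)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def distinct_l(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values in any class."""
    classes = equivalence_classes(records, quasi_identifiers)
    return min(len({r[sensitive] for r in group}) for group in classes.values())

def max_distance(records, quasi_identifiers, sensitive):
    """Largest total variation distance between a class and the whole table.
    The table satisfies t-closeness (under this metric) if the result <= t."""
    overall = distribution(records, sensitive)
    worst = 0.0
    for group in equivalence_classes(records, quasi_identifiers).values():
        local = distribution(group, sensitive)
        tvd = 0.5 * sum(abs(local.get(v, 0.0) - p) for v, p in overall.items())
        worst = max(worst, tvd)
    return worst

# Hypothetical table: 3-anonymous, yet the first class is homogeneous.
records = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "100**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "100**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "100**", "diagnosis": "diabetes"},
]

print(distinct_l(records, ["age", "zip"], "diagnosis"))
print(max_distance(records, ["age", "zip"], "diagnosis"))
```

In this table every group has three members, but the first group contains only one diagnosis, so anyone known to be in it is revealed to have the flu. That is the homogeneity problem l-diversity is meant to catch.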

Randomization

Randomization is the process of adding “noise” to the data to hide the actual values of individual records. Even though each record is masked, the aggregate behavior of the data distribution can be approximately reconstructed, because the distribution of the noise is known and its effect can be subtracted out.
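
A minimal sketch of additive randomization follows, assuming zero-mean Gaussian noise whose scale (NOISE_SIGMA, a made-up value) is public. Individual ages are masked, yet the mean survives because the noise averages out, and the variance can be recovered by subtracting the known noise variance. The data is synthetic and the function names are hypothetical.

```python
import random
import statistics

NOISE_SIGMA = 10.0  # assumed, publicly known standard deviation of the noise

def randomize(values, sigma, rng):
    """Mask each value by adding independent zero-mean Gaussian noise."""
    return [v + rng.gauss(0.0, sigma) for v in values]

def estimate_mean_and_variance(noisy, sigma):
    """Recover aggregate statistics from the noisy values alone.
    The noise has mean 0, so the mean carries over; the noise variance
    (sigma squared) is subtracted from the observed variance."""
    mean = statistics.fmean(noisy)
    variance = statistics.pvariance(noisy) - sigma ** 2
    return mean, variance

rng = random.Random(42)  # fixed seed so the sketch is repeatable
true_ages = [rng.randint(20, 60) for _ in range(10_000)]  # synthetic data
noisy_ages = randomize(true_ages, NOISE_SIGMA, rng)

est_mean, est_var = estimate_mean_and_variance(noisy_ages, NOISE_SIGMA)
print(round(statistics.fmean(true_ages), 2), round(est_mean, 2))
print(round(statistics.pvariance(true_ages), 2), round(est_var, 2))
```

The estimates are approximate and improve as the number of records grows; a single noisy value on its own says little about the person behind it.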

The benefits of randomization are that it can be applied during data collection and preprocessing, and that it adds no anonymization overhead. One downside is that it is hard to apply to large datasets, because of time complexity and loss of data utility.

Data distribution

Data distribution is a data protection technique whereby the data is distributed across many sites. In addition to making sensitive data more difficult to access, it also creates a backup and restoration system because if one component of the system is breached or goes down, the rest remain. However, data distribution can increase costs and complexity.

Distribution is typically accomplished in one of two ways:

  • Horizontal distribution of data - records are split across many sites, and every site stores the same attributes.
  • Vertical distribution of data - attributes are split across different sites, each under the custodianship of a different organization.
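
The difference between the two layouts can be sketched in a few lines. The example below is a toy illustration, not a distributed system: the "sites" are plain Python lists, and the records, the id join key and the function names are hypothetical.

```python
def split_horizontally(records, n_sites):
    """Each site receives a subset of the rows, with every attribute."""
    return [records[i::n_sites] for i in range(n_sites)]

def split_vertically(records, key, attribute_groups):
    """Each site receives every row, but only its own attributes
    plus a shared key so the pieces could later be joined."""
    return [
        [{key: r[key], **{a: r[a] for a in attrs}} for r in records]
        for attrs in attribute_groups
    ]

# Hypothetical patient records.
records = [
    {"id": 1, "age": 34, "zip": "02139", "diagnosis": "flu"},
    {"id": 2, "age": 45, "zip": "10001", "diagnosis": "asthma"},
    {"id": 3, "age": 29, "zip": "94105", "diagnosis": "flu"},
    {"id": 4, "age": 52, "zip": "60601", "diagnosis": "diabetes"},
]

horizontal_sites = split_horizontally(records, 2)
vertical_sites = split_vertically(records, "id", [["age", "zip"], ["diagnosis"]])

print(horizontal_sites[0])  # rows 1 and 3, all attributes
print(vertical_sites[1])    # all rows, only id and diagnosis
```

With the vertical split, no single site holds both the demographic attributes and the diagnosis, so a breach of one site exposes less than a breach of the full table would.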

Cryptographic techniques

Cryptographic techniques make information readable by the sender and receiver but unintelligible to anyone else. For example, encryption, one of the most popular cryptographic techniques, obfuscates plaintext data by transforming it into an unreadable, encoded format known as ciphertext. Only those with the correct key can decrypt and read the information.
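
To make the plaintext, ciphertext and key relationship concrete, here is a toy sketch using a one-time pad (XOR with a random key as long as the message). That scheme is chosen only because it needs nothing beyond the Python standard library; it is an assumption for illustration, not a recommendation, and real systems should rely on a vetted library and a standard cipher. The message and function names are hypothetical.

```python
import secrets

def generate_key(length):
    """Create a random key as long as the message (one-time pad)."""
    return secrets.token_bytes(length)

def xor_bytes(data, key):
    """XOR each byte of data with the matching key byte.
    The same operation both encrypts and decrypts."""
    if len(key) != len(data):
        raise ValueError("one-time pad key must match the message length")
    return bytes(d ^ k for d, k in zip(data, key))

plaintext = "patient 1042: diagnosis flu".encode("utf-8")  # made-up message
key = generate_key(len(plaintext))   # shared only by sender and receiver

ciphertext = xor_bytes(plaintext, key)   # unreadable without the key
recovered = xor_bytes(ciphertext, key)   # the key holder gets the text back

print(ciphertext.hex())
print(recovered.decode("utf-8"))
```

A one-time pad key must never be reused and must be as long as the data, which is why practical systems use ciphers with short, reusable keys instead.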

Multidimensional Sensitivity Based Anonymization 

Multidimensional Sensitivity Based Anonymization (MDSBA) is an anonymization algorithm designed to be applied to large data sets with reduced loss of information and predefined quasi-identifiers.

MDSBA builds on two preexisting anonymization algorithms, top-down specialization (TDS) and bottom-up generalization (BUG). Both require continuous iterations with conditional statements, which can mean repeated heavy scans of the full data set, data loss, scalability issues and high computation costs.
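
The cost of iterating can be seen in a deliberately simplified sketch of the bottom-up idea: start from the most specific values, test whether the data is k-anonymous, and widen the buckets only if it is not. This is not the published TDS, BUG or MDSBA algorithm; it handles a single numeric attribute, and the ages, the candidate bucket widths and the function names are all hypothetical.

```python
from collections import Counter

def generalize_age(age, width):
    """Replace an exact age with the bucket of the given width it falls in."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def smallest_group(ages, width):
    """Scan every record and return the size of the smallest bucket."""
    counts = Counter(generalize_age(a, width) for a in ages)
    return min(counts.values())

def bottom_up_generalize(ages, k, widths=(1, 5, 10, 20, 50, 100)):
    """Widen the buckets step by step until every bucket holds at least
    k records. Each step needs a conditional test and a full scan."""
    for passes, width in enumerate(widths, start=1):
        if smallest_group(ages, width) >= k:
            return [generalize_age(a, width) for a in ages], width, passes
    raise ValueError("no candidate width reaches the requested k")

ages = [23, 24, 27, 31, 33, 38, 39, 41, 46, 47]  # made-up, non-negative ages
generalized, width, passes = bottom_up_generalize(ages, k=3)

print(width, passes)
print(generalized)
```

Even this ten-record example needs three full passes before it settles on ten-year buckets; on a large data set with many attributes, that repeated scanning is the overhead the paragraph above describes.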

In a nutshell, MDSBA works by parallelizing data for big data frameworks and reducing the computation overhead of data iteration by providing pre-calculated k-anonymity parameters and pre-determined attributes for anonymization. In addition to protecting the privacy of data, MDSBA provides fine-grained access control across multiple levels of user permissions.