The key to data anonymization: aggregate!
At Tune Insight, our mission is to enable the use of data whilst ensuring its protection, with the aim of transforming the data economy into a more secure and fair system that respects privacy rights. Collaboration is at the heart of our solution: different controllers of sensitive data come together in a decentralized way thanks to the power of homomorphic encryption. Despite its immense potential, this technology alone is not sufficient to achieve the level of data protection we want to provide our customers. For this reason, our cryptographic primitives are complemented by cutting-edge privacy-enhancing technologies (PETs) that enable us to support this vision.
In this post, we explain how the findings of a recently published scientific article, “Anonymization: The imperfect science of using data while preserving privacy”, co-written by Florimond Houssiau, data protection expert at Tune Insight, apply to Tune Insight's work and products. Note that while the article focuses on the centralized trust model (where data is controlled by a single entity), many of its findings apply to the decentralized model adopted by Tune Insight.
Watch the video 'The Key to Anonymization: Aggregating Data' directly on YouTube.
The key to anonymization: aggregating data
The traditional approach to data protection is “de-identification”: a dataset is modified so that no individual can be directly identified, and the resulting dataset is then shared with third parties. However, as the article points out several times, this approach is not suitable for modern datasets because of their dimensionality and the richness of the data they contain. De-identifying such data either leaves re-identification easy or alters the data so heavily that it becomes unusable.
“Traditional record-level de-identification techniques typically do not provide a good privacy-utility trade-off to anonymize data.”
In Tune Insight products, individual data is never shared with other parties. Instead, each Tune Insight instance computes an aggregate result from a large number of records distributed across datasets from different sources. This result is computed under robust encryption, which ensures that no information other than the result itself is revealed during or after the computation. This approach is fully in line with the recommendations of the article.
“Aggregate data […] can offer a better trade-off, but they do not inherently protect against privacy attacks.”
“It is important to emphasize that, in general, releasing only aggregate data substantially reduces the vulnerability to attacks compared to record-level data in practice.”
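To make the mechanism described above concrete, here is a minimal, self-contained sketch of encrypted aggregation using the open-source `phe` (python-paillier) library. It is a deliberate simplification: Tune Insight's products use multiparty homomorphic encryption across instances, which this single-key example does not reproduce.

```python
# Toy encrypted aggregation with additively homomorphic (Paillier)
# encryption via the open-source `phe` library. Illustrative only;
# not Tune Insight's actual scheme or API.
from functools import reduce
from operator import add

from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Each data source encrypts its local value; plaintexts never leave home.
local_values = [12, 7, 23]  # e.g., per-site record counts (made up)
ciphertexts = [public_key.encrypt(v) for v in local_values]

# Anyone holding only the public key can sum the ciphertexts:
# no individual value is revealed during the computation.
encrypted_total = reduce(add, ciphertexts)

# Only the aggregate result is ever decrypted.
print(private_key.decrypt(encrypted_total))  # -> 42
```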
Anonymous aggregates
However, the article points out that aggregating data is not in itself sufficient to ensure data protection. Numerous studies have shown that a motivated third party can extract sensitive information about individuals even from aggregates such as counting queries or machine learning models (a simple example of such an attack is sketched after the excerpt below). This problem is exacerbated in the so-called interactive setting, where analysts dynamically choose which queries are applied to the data.
“In the interactive setting, the adversary can freely define the queries that are answered, and has therefore a lot of flexibility in defining which aggregate information is disclosed. This may allow the adversary to actively exploit vulnerabilities in the system”
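To illustrate, here is a toy “differencing attack” on a made-up dataset: two individually innocuous counting queries that, taken together, reveal one person's sensitive attribute.

```python
# A minimal differencing attack on counting queries (illustrative data).
records = [
    {"name": "Alice", "age": 34, "diagnosis": "positive"},
    {"name": "Bob",   "age": 41, "diagnosis": "negative"},
    {"name": "Carol", "age": 29, "diagnosis": "positive"},
]

def count(predicate):
    """An aggregate counting query over the dataset."""
    return sum(1 for r in records if predicate(r))

# Query 1: how many people are positive?
q1 = count(lambda r: r["diagnosis"] == "positive")
# Query 2: how many people OTHER than the 34-year-old are positive?
q2 = count(lambda r: r["diagnosis"] == "positive" and r["age"] != 34)

# The difference between two aggregates isolates a single individual.
print("Alice is positive:", q1 - q2 == 1)  # -> True
```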
Most Tune Insight projects operate in this interactive setting. Collaborations often involve many researchers who wish to examine different aspects of the same dataset. We are aware of this challenge, and we design each project carefully to ensure that data protection meets the highest requirements in the field. Fortunately, as the article shows, the interactive setting also offers significant security benefits.
“The interactive nature of data query systems allows data curators to implement additional measures that might mitigate the risk that an adversary can successfully execute attacks. These include, for instance, mandatory authentication for the analysts and keeping a log of all queries issued by any analyst to detect possible attack attempts.”
Risk mitigation techniques are built into the core of the Tune Insight solution, including the two examples mentioned in the excerpt above (robust authentication and a tamper-proof log of all interactions with the platform). Any computation launched on a Tune Insight instance must first be reviewed and approved by the appropriate people at that instance, for example an ethics committee, before any data is used. This ensures that everything happening on an instance is carefully checked, accessible only to authorized individuals, and permanently logged for full traceability.
In addition to these instance-level measures, each project can be configured with additional privacy and security policies to prevent information leakage. These include a minimum dataset size (the instance automatically rejects any computation that would run on too few records) and query limits (either on their number or by restricting analysts to a set of pre-approved queries). Together, these policies let users tailor collaborations to their security and privacy expectations.
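As a rough illustration (names and thresholds here are hypothetical, not Tune Insight's actual API), such policies can be thought of as a gatekeeper in front of every query:

```python
# Hypothetical sketch of per-project policies: a minimum result-set
# size and a per-analyst query budget. Illustrative names/thresholds.
MIN_DATASET_SIZE = 10   # reject aggregates over fewer records
MAX_QUERIES = 100       # per-analyst query budget

query_counts: dict[str, int] = {}

def run_query(analyst: str, matching_records: list[float]) -> float:
    query_counts[analyst] = query_counts.get(analyst, 0) + 1
    if query_counts[analyst] > MAX_QUERIES:
        raise PermissionError("query budget exhausted for this analyst")
    if len(matching_records) < MIN_DATASET_SIZE:
        raise ValueError("query touches too few records; rejected")
    return sum(matching_records) / len(matching_records)  # aggregate only
```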
Finally, there are situations where such safeguards are not considered sufficient, especially when the data or its use is particularly sensitive. In such cases, projects can be configured to use differential privacy, a formal, mathematically grounded definition of privacy. Differential privacy often reduces the quality of results, but access to this option unlocks collaborations that otherwise could not take place!
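For intuition, here is the textbook Laplace mechanism applied to a counting query; the `epsilon` parameter controls the privacy-utility trade-off mentioned above (smaller epsilon means stronger privacy but noisier results):

```python
# Minimal differential privacy sketch: Laplace noise calibrated to a
# counting query's sensitivity (1) and the privacy budget epsilon.
import random

def dp_count(true_count: int, epsilon: float) -> float:
    sensitivity = 1.0  # adding/removing one person changes a count by at most 1
    scale = sensitivity / epsilon
    # The difference of two i.i.d. exponentials is a Laplace draw.
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

print(dp_count(42, epsilon=1.0))  # close to 42
print(dp_count(42, epsilon=0.1))  # stronger privacy, much noisier
```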
Synthetic data at Tune Insight
Obtaining the necessary approvals for data access is often the most time-consuming step in launching a project, which can be frustrating for analysts (and stakeholders in general). In addition, developing an analysis pipeline on sensitive data can be problematic or even impossible. Tune Insight avoids these obstacles by using synthetic data: automatically generated datasets that resemble real data.
“Overall, we thus see synthetic data as a very useful tool for testing new systems and for exploratory analysis, but its accuracy strongly depends on the use case and any findings may need to be validated on the real data.”
Some suggest that synthetic data can simply substitute for real data. However, recent research has shown that synthetic data cannot achieve a good privacy-utility trade-off either (its limitations are comparable to those of de-identification). At Tune Insight, in line with the article and modern research, we focus on generating synthetic data for the preliminary phases of a project, with robust privacy guarantees, so that the final analyses can be carried out on the real data.
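A toy example shows why utility depends so strongly on the use case: a naive generator that samples each column independently preserves marginal statistics but destroys the cross-column correlations an analysis may actually need. Real generators are far more sophisticated, but face the same tension.

```python
# Illustrative only: naive per-column synthetic data loses correlations,
# which is one reason findings must be validated on the real data.
import random

# Made-up "real" data with a strong age-income correlation.
real = [(age, age * 1000 + random.gauss(0, 2000)) for age in range(20, 70)]
ages = [r[0] for r in real]
incomes = [r[1] for r in real]

# Naive synthetic data: resample each column independently.
synth = [(random.choice(ages), random.choice(incomes)) for _ in real]
s_ages = [s[0] for s in synth]
s_incomes = [s[1] for s in synth]

def corr(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

print(corr(ages, incomes))      # ~0.99: marginals AND correlation
print(corr(s_ages, s_incomes))  # ~0.0: marginals kept, correlation gone
```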
Confidential Machine Learning
Machine learning is one of the leading applications of modern data. As the article points out, even though the parameters of a machine learning model are a black box computed from many inputs, they can still reveal sensitive information about their training data. We take these risks very seriously, as stated in our AI manifesto. Our hybrid federated learning approach mitigates some of the risks of training models on sensitive data. In situations where privacy is paramount, our approach can also be complemented by differential privacy during training, enabling confidential collaborative AI without your data ever leaving your server.
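As a sketch of the general pattern (not Tune Insight's actual implementation or API), federated learning with differential privacy keeps raw data local: each site clips and noises its own model update, and only aggregated updates ever circulate.

```python
# Generic federated averaging with clipped, noised updates (sketch).
import numpy as np

CLIP = 1.0        # L2 bound on each site's update
NOISE_STD = 0.1   # Gaussian noise scale (assumed; would be calibrated)

def local_update(model: np.ndarray, X: np.ndarray, y: np.ndarray) -> np.ndarray:
    # One gradient step on a local least-squares objective.
    grad = 2 * X.T @ (X @ model - y) / len(y)
    update = -0.1 * grad
    # Clip so no single site's update dominates the aggregate.
    norm = np.linalg.norm(update)
    if norm > CLIP:
        update = update * (CLIP / norm)
    # Add noise BEFORE the update leaves the site's server.
    return update + np.random.normal(0, NOISE_STD, size=update.shape)

def federated_round(model, sites):
    updates = [local_update(model, X, y) for X, y in sites]
    return model + np.mean(updates, axis=0)  # only updates are shared

# Usage on made-up data from three sites:
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
sites = []
for _ in range(3):
    X = rng.normal(size=(100, 2))
    sites.append((X, X @ true_w + rng.normal(scale=0.1, size=100)))

model = np.zeros(2)
for _ in range(50):
    model = federated_round(model, sites)
print(model)  # approaches [2, -1], up to the DP noise
```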
Unity is strength
One of the key conclusions of the article is that analyzing data while ensuring privacy is hard! But we are convinced that Tune Insight is uniquely positioned to deliver solutions that strike an optimal compromise between utility and privacy, because our solution unlocks the power of cooperation. Beyond the power of friendship, cooperation has a very down-to-earth advantage: it increases the raw amount of data that analysts have access to. A recurring observation of the article is that increasing the amount of data both improves the quality of analyses and mitigates privacy risks.
The more data available, the less impact any single record has on the final result.
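A quick back-of-the-envelope check of this claim, for a mean over values bounded in [0, 1]:

```python
# Changing one record bounded in [0, 1] moves the mean by at most 1/n,
# so a single record's influence shrinks as the dataset grows.
for n in (10, 100, 1_000, 10_000):
    print(f"n={n:>6}: one record can move the mean by at most {1 / n}")
```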