What is synthetic data?
Synthetic data is data that has been artificially generated from a model trained to reproduce the characteristics and structure of the original data.
The goal is for the synthetic data to be similar enough to the original that statistical analyses yield equally usable results. This requires quality control: for example, comparing the value distributions of the most relevant variables to ensure they match, and checking that two characteristics that were co-dependent in the original data remain so in the synthetic data.
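As a concrete illustration of such quality control, the sketch below compares per-variable distributions and pairwise correlations between an original table and its synthetic counterpart. It is a minimal, hypothetical example (the significance threshold and the use of pandas and SciPy are assumptions, not a description of any specific pipeline):

```python
# A minimal sketch of synthetic-data quality control; thresholds are assumptions.
import pandas as pd
from scipy.stats import ks_2samp

def compare_tables(original: pd.DataFrame, synthetic: pd.DataFrame) -> None:
    # 1. Per-variable check: do the value distributions still match?
    for col in original.select_dtypes("number").columns:
        stat, p_value = ks_2samp(original[col], synthetic[col])
        verdict = "OK" if p_value > 0.05 else "DIFFERS"  # 0.05 is an arbitrary cutoff
        print(f"{col}: KS={stat:.3f}, p={p_value:.3f} -> {verdict}")

    # 2. Co-dependence check: did pairwise correlations survive generation?
    drift = (original.corr(numeric_only=True)
             - synthetic.corr(numeric_only=True)).abs()
    print("Largest correlation drift:", drift.max().max())
```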
The need to generate data similar to the original arises from the desire to feed models a larger number of samples and thus avoid the accuracy problems caused by training on too little data. Real data is often limited and difficult to obtain, especially if it has to be produced by humans, so generating synthetic data is faster, more flexible, and more scalable.
For training a model, the nature of the data does not matter as long as its intrinsic characteristics and patterns are preserved. These characteristics, which make up the “essence” of the data, are its quality, balance, and bias. Real data, besides being limited and difficult to obtain, is very susceptible to errors, imperfections, and bias, so using synthetic data can improve the quality of the model.
There are multiple ways to generate synthetic data, from decision trees to deep learning. The best-known example is the Generative Adversarial Network (GAN), a relatively recent architecture commonly used in the field of image generation. Some examples of its application are transforming a photograph into a painting in the style of Monet, creating images of people who do not exist, or turning a horse into a zebra.
This method is not only effective for generating images but is also a good way to generate synthetic text while preserving the intrinsic characteristics of the data.
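To make the GAN idea concrete, here is a schematic training loop for tabular data; a minimal sketch in PyTorch, where the layer sizes, noise dimension, and learning rates are illustrative assumptions rather than a production recipe:

```python
# A schematic GAN for tabular synthetic data; sizes and hyperparameters are assumptions.
import torch
import torch.nn as nn

n_features, noise_dim = 8, 16
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, n_features))
D = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 1))

loss = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

def train_step(real_batch: torch.Tensor) -> None:
    batch = real_batch.size(0)
    fake = G(torch.randn(batch, noise_dim))

    # Discriminator: learn to tell real records apart from generated ones.
    opt_d.zero_grad()
    d_loss = (loss(D(real_batch), torch.ones(batch, 1))
              + loss(D(fake.detach()), torch.zeros(batch, 1)))
    d_loss.backward()
    opt_d.step()

    # Generator: learn to fool the discriminator.
    opt_g.zero_grad()
    g_loss = loss(D(fake), torch.ones(batch, 1))
    g_loss.backward()
    opt_g.step()
```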
What is anonymized data?
Data anonymization is a procedure that removes or modifies personally identifiable information; in other words, anonymized data cannot be associated with any natural person. Anonymizing a file means replacing the original data with a replacement pattern.
In recent years, great technological advances have allowed us to share information and evolve as a society, but they have also left us more exposed to increasingly sophisticated attacks. Depending on its nature, data commonly contains sensitive information, which increases the risk that a cybersecurity attack could link that information to real people.
Although the concept of sensitive information is ambiguous, in 2018 the European Union’s General Data Protection Regulation (GDPR) came into application, defining and delimiting the data understood as sensitive in order to protect the privacy of individuals and subject such information to data protection rules.
Sensitive data includes, for example, a person’s name, gender, credit card details, telephone number, and passwords. This is data that identifies a natural person and must therefore be anonymized.
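As a crude illustration of what replacing such data with a pattern can look like, the snippet below masks a few PII types with labels; the regular expressions are simplified assumptions and would not cover every real-world format (names, for instance, require NER models rather than regexes):

```python
# A simplified sketch of label-based PII masking; patterns are illustrative only.
import re

PATTERNS = {
    "CREDIT_CARD": r"\b(?:\d[ -]?){13,16}\b",
    "PHONE": r"\+?\d{2,3}(?:[ .-]?\d{3}){2,3}\b",
    "EMAIL": r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",
}

def mask(text: str) -> str:
    # Replace each detected pattern with its category label.
    for label, pattern in PATTERNS.items():
        text = re.sub(pattern, f"[{label}]", text)
    return text

print(mask("Call me on +34 600 123 456 or write to jane@example.com"))
# -> "Call me on [PHONE] or write to [EMAIL]"
```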
There are different anonymization techniques, the best known of which are permutation, randomization, and generalization. There is also a related technique called pseudonymization, which the EU defines as processing personal data so that it can no longer be attributed to a specific natural person without the use of additional information (see Article 4(5) of the GDPR). This definition includes some encryption elements that do not correspond to the definition most commonly used at Pangeanic.
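To illustrate the three classic techniques on a toy table (the column names, noise scale, and age bins below are assumptions made purely for the example):

```python
# Toy illustrations of permutation, randomization, and generalization.
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"name": ["Ana", "Luis", "Marta"],
                   "salary": [31000, 54000, 47000],
                   "age": [27, 44, 38]})

# Permutation: shuffle a column so values no longer line up with their rows.
df["salary"] = rng.permutation(df["salary"].to_numpy())

# Randomization: add noise so exact values cannot be recovered.
df["salary"] = df["salary"] + rng.normal(0, 1000, len(df)).round()

# Generalization: replace precise values with coarser categories.
df["age"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                   labels=["<30", "30-50", ">50"])
print(df)
```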
In this text, we will use pseudonymization to mean the anonymization that occurs when private data is replaced with similar, realistic data, allowing the text to be read in sequence without labels or crossed-out sections hindering its understanding. This method does not use encryption techniques, and the replacement data can be generated synthetically, drawn from dictionaries, or produced by algorithms whose output follows an exact pattern, as is the case with dates.
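A minimal sketch of this kind of pseudonymization might draw name replacements from a dictionary and regenerate dates algorithmically so that the exact format is preserved; the name list, date format, and helper names below are hypothetical:

```python
# A sketch of readable pseudonymization: realistic stand-ins, no labels or redactions.
import random
import re

REPLACEMENT_NAMES = ["Laura Pérez", "Carlos Ruiz", "Elena Gómez"]  # illustrative dictionary

def pseudonymize(text: str, detected_names: list[str]) -> str:
    # Swap each detected name for a realistic stand-in (detection itself would
    # come from an NER model; here the names are simply given as input).
    for name in detected_names:
        text = text.replace(name, random.choice(REPLACEMENT_NAMES))

    # Dates follow an exact pattern, so they can be regenerated algorithmically.
    def new_date(_match: re.Match) -> str:
        return f"{random.randint(1, 28):02d}/{random.randint(1, 12):02d}/2021"

    return re.sub(r"\b\d{2}/\d{2}/\d{4}\b", new_date, text)

print(pseudonymize("John Smith signed on 14/03/2020.", ["John Smith"]))
# e.g. -> "Elena Gómez signed on 07/11/2021."
```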
Related content: Compliance with Pseudonymization According to the GDPR
A comparative analysis between synthetic data and anonymized data
Main differences
The main difference between synthetic and anonymized data lies in their vulnerability. Data privacy is not only a concern for customers; it is also essential for complying with data protection policies.
As explained in the previous sections, the concepts of synthetic and anonymized data are linked, as one way to obtain anonymized data is to use the same techniques as when generating synthetic data, but with the purpose of protecting sensitive information when sharing it with third parties within the framework of privacy protection.
When should each type of data be used?
Most of the techniques used for data anonymization today are actually nothing more than pseudonymization methods. According to the GDPR’s definition of pseudonymization discussed above, because the information can still be attributed to an individual through the use of additional information, it must be considered information about an identifiable natural person; therefore, pseudonymized data is not anonymous. In this sense, if tools and models good enough to avoid mere pseudonymization are available, anonymized data is the best option. If, on the other hand, additional data and information could be used to complete or reveal the sensitive data, then synthetic data is the best option.
The advantages and disadvantages
The main advantage of synthetic data is that it is a way to optimize and enrich a dataset, generating more samples with the same characteristics as the original data.
On the other hand, the main disadvantage of synthetic data is that the privacy of the resulting data must still be ensured: it must not match any real person’s information. A privacy assurance evaluation must be performed to assess the extent to which data subjects can be identified in the synthetic data and how much new information about those data subjects would be revealed after a successful identification.
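A rough way to approximate part of such an evaluation is to flag synthetic records that lie suspiciously close to real ones; in the sketch below the distance threshold is an arbitrary assumption, and a real privacy assessment would be considerably more thorough:

```python
# A rough sketch of a privacy check: flag synthetic rows too close to real rows.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def too_close(real: np.ndarray, synthetic: np.ndarray,
              threshold: float = 0.05) -> np.ndarray:
    # Normalize columns so no single feature dominates the distance.
    lo, hi = real.min(axis=0), real.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)
    real_n, synth_n = (real - lo) / scale, (synthetic - lo) / scale

    # Distance from each synthetic record to its nearest real record.
    nn = NearestNeighbors(n_neighbors=1).fit(real_n)
    distances, _ = nn.kneighbors(synth_n)
    return distances.ravel() < threshold  # True = potential privacy leak
```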
Another disadvantage of synthetic data, a consequence of the first, is the fear of sharing insufficiently anonymized data with third parties and thereby putting customer or employee privacy at risk. Finally, the data may lose coherence and become less meaningful: some synthetic data generation techniques are aggressive enough to remove more information than necessary, and meaning is lost with it.
Anonymized data has the main advantage of mitigating the risks of sharing sensitive data with third parties, thus complying with the regulations established by the GDPR. It is a way to ensure data security and compliance with privacy policies while reducing exposure to potential cybersecurity attacks. Pseudonymization, in turn, also makes it possible to keep documents and data sources in a readable state similar to the original, even making the masking imperceptible. Data masked with this technique can go into production processes immediately and be useful to third parties such as researchers or external auditors. Finally, using data anonymization shows that the company understands the importance of protecting data, which builds customer confidence and business security.
More information: Anonymizing Databases: Tools and Techniques
Anonymization can be a fairly reliable way to secure data and combine it with other aspects of data management, but it has some disadvantages as well. One of the less obvious ones is that asking users for permission to handle and perform operations on their data is time-consuming.
Conclusion: anonymized or synthetic data?
Institutions or companies whose data is needed for processes involving human actors, which could pose a risk to the original data, can use anonymized data as a viable and very efficient option for maintaining their processes with third parties without assuming risks. Synthetic data may, in some cases, alter the underlying patterns in the data that are the fundamental interest of the research or use of those data sources, such as demographic studies or health studies of high-incidence diseases. Anonymized data, by the nature of the method, retains all the non-sensitive patterns of the individuals, and private data cannot be inferred from it unless additional information is available.
Having reviewed and differentiated both concepts and their main advantages and disadvantages, we can conclude that the best way to ensure data privacy is to use anonymized data. It ensures the protection of sensitive data, complies with the GDPR, and better preserves the consistency and meaning of the text.
Since 2020, Pangeanic has been leading the MAPA (Multilingual Anonymization toolkit for Public Administrations) project, supported by the European Union’s CEF (Connecting Europe Facility) program, as well as the NTEU (Neural Translation for the EU) project. MAPA’s objective is to develop a multilingual anonymization tool, based on named entity recognition (NER) and applicable to all EU languages.
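As an illustration of the general NER-based approach (this is not the MAPA toolkit itself), a few lines of spaCy can already replace detected entities with category labels; the model name and entity types below are assumptions for the example:

```python
# A minimal NER-based masking sketch using spaCy (not the MAPA toolkit itself).
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

def ner_mask(text: str) -> str:
    doc = nlp(text)
    # Replace entities from the end so character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in {"PERSON", "GPE", "ORG", "DATE"}:
            text = text[:ent.start_char] + f"[{ent.label_}]" + text[ent.end_char:]
    return text

print(ner_mask("Angela Merkel met officials in Berlin on 3 May 2019."))
# -> "[PERSON] met officials in [GPE] on [DATE]."
```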
While Pangeanic recommends anonymization techniques to ensure data privacy, synthetic data generation is a good way to produce data whose intrinsic characteristics and patterns are similar to the original, which can in turn feed the very models that are trained to generate the anonymized data.