
6 personal data anonymization techniques you should know about

Organizations generate and store large amounts of information across their departments, from personal data to purchasing behavior and location details. This information can be very valuable when carrying out research and development projects; however, it is of increasing concern to users, especially on the Internet.

As a consequence, guaranteeing privacy these days requires data anonymization techniques, and sometimes even procedures that eliminate the possibility of reverse engineering the data for retrieval. So much so that, since 2018, the EU's General Data Protection Regulation (GDPR) has made the protection of personal data mandatory for companies and organizations.


Data anonymization techniques

Monolingual and multilingual anonymization techniques help companies and organizations comply with legislation and avoid fines related to data publication and disclosure.

Below, we propose a list of the main anonymization methods and their particular use in each scenario involving sensitive information, such as personal and banking details, passwords or home address data.


1. Data masking

Data masking hides certain parts of the data by replacing them with random characters or other values. Substitution is used to alter key values, so the records remain usable for identification within a system without revealing the individual's identity.

Alteration strategies are implemented, such as character shuffling, encryption, or character or term substitution. For example, a value character can be replaced by a symbol, and a person's name can be replaced by a number.


Tips and recommendations

Data masking ensures that sensitive customer information is not available outside the production environment. One of its most widespread uses is in billing scenarios.

In this case, the card information is masked, changing part of the digits to an X. It should be used if you are looking to protect datasets that will not affect the performance of functions, such as personal identification or payment information.
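As a minimal sketch of the card-masking scenario above, the helper below (a hypothetical function, not a library API) replaces all but the last four digits of a card number with X:

```python
import re

def mask_card_number(card: str, visible: int = 4) -> str:
    """Replace all but the last `visible` digits with 'X' (illustrative only)."""
    digits = re.sub(r"\D", "", card)          # keep digits only
    masked = "X" * (len(digits) - visible) + digits[-visible:]
    # Re-group into blocks of four for readability
    return " ".join(masked[i:i + 4] for i in range(0, len(masked), 4))

print(mask_card_number("4111 1111 1111 1234"))  # XXXX XXXX XXXX 1234
```

In production systems, masking of this kind is typically applied at the database or application layer so that unmasked values never leave the production environment.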


2. Data pseudonymization

While other anonymization techniques, such as data masking, are designed to make the original data difficult to recover, pseudonymization merely reduces the link between personal data and the identity of the individual. It replaces private identifiers with false identifiers or pseudonyms, but maintains a specific identifier that allows access to the original data.

Data pseudonymization maintains the data's statistical accuracy and confidentiality. On the one hand, it complies with ethics and imposed legislation, and on the other, it continues to allow the modified data to be used for studies, research, statistics or other beneficial actions.


Tips and recommendations

Pseudonymization keeps the identification chain intact, so that even after the data has been dissociated, re-identification remains possible. It is commonly used in the health field, where identifying data is separated from health data, preventing sensitive information from being traced directly.

Pseudonymization is useful, for example, for verifying specific and unique problems in a test environment. It is, therefore, often the only solution that allows applications to operate normally while preserving the integrity of test scenarios.
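A minimal sketch of the idea: identifiers are replaced with random tokens, while a private reverse mapping (which in practice would be stored separately under strict access control) preserves the ability to re-identify. The class and token format here are illustrative assumptions:

```python
import secrets

class Pseudonymizer:
    """Replace identifiers with random pseudonyms, keeping a private
    mapping that allows authorized re-identification."""

    def __init__(self):
        self._forward = {}   # identifier -> pseudonym
        self._reverse = {}   # pseudonym -> identifier (keep under access control)

    def pseudonymize(self, identifier: str) -> str:
        if identifier not in self._forward:
            token = "ID-" + secrets.token_hex(4)
            self._forward[identifier] = token
            self._reverse[token] = identifier
        return self._forward[identifier]

    def reidentify(self, pseudonym: str) -> str:
        return self._reverse[pseudonym]

p = Pseudonymizer()
token = p.pseudonymize("Alice Smith")
print(p.reidentify(token))  # Alice Smith
```

Because the reverse table exists, this is pseudonymization rather than full anonymization: whoever controls that table can restore identities.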


3. Data swapping

Also known as data shuffling or permutation, data swapping involves changing the order or position of the elements of an ordered set.

This technique introduces a random distortion into a set of microdata, maintaining the detail and structure of the original information. Its main feature is, therefore, reordering the attribute values so that they are still present, but do not correspond to their original records.


Tips and recommendations

In general, the data swapping approach is implemented by creating pairs of records with similar attributes and then swapping confidential or identifying data values between the pairs.

The process of mixing personal datasets to reorganize them causes them to no longer conform to the original information. It is commonly used in polls, where attributes (columns) that include recognizable values, such as date of birth, are changed.
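A simple sketch of column-level swapping, assuming records are held as dictionaries: the values of one sensitive attribute (here, date of birth) are permuted across rows so they remain present in the dataset but no longer line up with their original records. The function name and record layout are illustrative:

```python
import random

def swap_column(records, column, seed=None):
    """Return a copy of `records` with the values of `column` shuffled,
    breaking the link between each value and its original row."""
    rng = random.Random(seed)
    values = [r[column] for r in records]
    rng.shuffle(values)
    return [{**r, column: v} for r, v in zip(records, values)]

people = [
    {"name": "Ana",  "dob": "1990-03-01"},
    {"name": "Ben",  "dob": "1985-11-20"},
    {"name": "Cara", "dob": "2000-07-15"},
]
swapped = swap_column(people, "dob", seed=1)
```

Note that column-level statistics (counts, distributions) are preserved exactly, which is why the technique suits polls and surveys.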


4. Synthetic data

Although synthetic data is technically not an anonymization tool, it is increasingly used when processing personal data so that its use does not conflict with the law.

Synthetic data refers to datasets created by an algorithm with no relation to existing events or reality. Statistical models powered by artificial intelligence are able to create synthetic prototypes from the original datasets.

The synthetic data method involves building mathematical models based on patterns found in the original dataset. It can rely on statistical measures such as standard deviations, linear regression or medians, or on deep learning models, to produce synthetic results.


Tips and recommendations

Synthetic data offer highly accurate simulation environments, allowing datasets to be used to gain strategic insights on the future of, for example, markets, without putting users' privacy at risk.

They are used to construct artificial datasets instead of modifying or using the original dataset and compromising privacy. Some experts consider this to be simpler than making modifications to the original datasets.
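As a toy illustration of the statistical approach described above (real synthetic-data tools model correlations between columns as well), the sketch below fits a per-column mean and standard deviation to the original rows and then samples entirely new rows from those distributions. All function names and columns are illustrative:

```python
import random
import statistics

def fit_model(rows):
    """Per-column mean and standard deviation from the original data."""
    cols = rows[0].keys()
    return {c: (statistics.mean(r[c] for r in rows),
                statistics.stdev(r[c] for r in rows))
            for c in cols}

def sample_synthetic(model, n, seed=None):
    """Draw `n` new rows from the fitted per-column distributions."""
    rng = random.Random(seed)
    return [{c: rng.gauss(mu, sigma) for c, (mu, sigma) in model.items()}
            for _ in range(n)]

originals = [
    {"age": 30, "income": 40000},
    {"age": 40, "income": 52000},
    {"age": 50, "income": 61000},
]
synthetic = sample_synthetic(fit_model(originals), n=5, seed=0)
```

Because each synthetic row is sampled rather than copied, no row corresponds to a real individual, while aggregate statistics remain broadly representative.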

5. Data perturbation

Data perturbation is a data security technique that adds "noise" to databases, safeguarding the confidentiality of individual records. This method of anonymizing datasets applies to numerical data entries, altering the dataset with a specific value and operation.

This technique slightly changes the initial dataset by using rounding and random noise methods. The amount of perturbation applied must always be proportional to the scale of the values being altered.


Tips and recommendations

Data perturbation can add an amount to all numeric values in your database, or use a given number as the basis of your operation, dividing all numeric values by it.

It is important to carefully select the base used to modify the original values, because if the base is too small, the data will not be sufficiently anonymized, and if it is too large, the data may not be recognized and its value may not be extracted.
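A minimal sketch of the additive-noise variant, assuming simple zero-mean uniform noise plus rounding (the function and parameter names are illustrative). The `noise_scale` parameter embodies the trade-off described above: too small and the data is insufficiently anonymized, too large and its value is lost.

```python
import random

def perturb(values, noise_scale, seed=None, round_to=1):
    """Add zero-mean uniform noise in [-noise_scale, noise_scale] to each
    numeric value, then round; larger noise means stronger privacy but
    lower utility."""
    rng = random.Random(seed)
    return [round(v + rng.uniform(-noise_scale, noise_scale), round_to)
            for v in values]

salaries = [41200.0, 38750.0, 52300.0]
noisy = perturb(salaries, noise_scale=500.0, seed=42)
```

Production-grade variants (e.g. differential privacy) draw noise from calibrated distributions with formal guarantees, but the mechanics are the same: original values are published only after controlled distortion.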


6. Generalization

Data generalization is the process of creating a broader categorization of the data in a database, creating a more general picture of the trends or insights it provides. Generalization involves deliberately excluding some data to make them less identifiable.

Data can be modified within a series of ranges with logical limits. The result is a reduced granularity of the data, making it difficult or even impossible to retrieve the exact values associated with an individual.


Tips and recommendations

The goal is to remove certain identifiers without compromising data accuracy. For example, you can remove or replace the house number of a specific address, but the street name will not be removed.

In certain cases, it is possible to generalize the information by classifying it into groups, as would be the case when replacing the exact ages of individuals in a database with age groups (65-74, 75-84, 85+, etc.).
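The age-banding example above can be sketched as follows, assuming ten-year ranges aligned to boundaries like 65-74 and 75-84 (the function name and bucket scheme are illustrative choices, not a standard):

```python
def generalize_age(age: int) -> str:
    """Map an exact age to a ten-year range such as '65-74' or '75-84',
    reducing granularity so exact values cannot be recovered."""
    low = (age - 5) // 10 * 10 + 5   # ranges start at 5, 15, ..., 65, 75
    return f"{low}-{low + 9}"

print(generalize_age(67))  # 65-74
print(generalize_age(75))  # 75-84
```

The same principle applies to other quasi-identifiers: replacing a full postcode with its first digits, or an exact salary with a band, trades precision for anonymity.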


The advantages and limitations of anonymization techniques


Main advantages

In addition to enabling organizations to comply with regulatory laws (including GDPR), anonymization techniques promote digital transformation in businesses, providing anonymized and protected data that will be used to generate new market value.

No organization can function without a secure and consistent database. These techniques support data governance and help maintain privacy, acting as a barrier against intruders and outside interference.


Main limitations

Absolute anonymization is very difficult to achieve, since guaranteed and irreversible anonymization of a dataset is practically impossible in most cases.

With this in mind, anonymization must at least ensure that any possible re-identification would require so much effort that it would not be feasible for anyone attempting to recover the data.

On the other hand, non-reversible and more stringent forms of data anonymization may restrict the ability to extract meaningful information from the results, so their use in some cases loses value compared to the original version.

It is, therefore, important to study each case and find the right balance between hermetically protecting the user's security and privacy, and maintaining some of the characteristics of the data in a way that will continue to be useful.





