Data anonymization is the process of removing personally identifying information from text. It is a type of information sanitization intended to protect privacy.
Businesses, governments and citizens generate prodigious amounts of data every day, and the race to hoard and control it is intensifying. Hopeful start-ups seek to make their fortunes by analyzing and exploiting data, while governments around the world are harnessing data for effective policy-making. This unparalleled proliferation of data raises new and complex questions about how it should be managed and protected.
The data landscape
Recent years have marked a shift in the way we think about personal information online. High-profile data misuse has made us wary of whom we choose to divulge our information to.
The 2018 Facebook-Cambridge Analytica scandal saw millions of Facebook users' personal data harvested by the consultancy to be used as fuel for political advertising. Users' confidence in Facebook's handling of privacy plunged. Unwanted disclosure of personal information became a 'hot topic', and many rushed to block third-party trackers and to insist that companies delete their information from shadowy databases.
The fact is, trendy advertising firms and creative agencies are not alone in housing our personal information. Banks and other financial institutions, GP practices, insurance companies and travel agents all hold sensitive information about us, each for different reasons. Financial data may be shared across platforms to foster new digital banking experiences, while health data may be shared to enable effective secondary research. Besides, accumulating data en masse becomes all the more attractive when technologies such as machine learning can analyze heaps of input data and quickly return an informed answer.
Knowledge is power
As the Latin aphorism goes, ‘scientia potentia est’, commonly rendered as ‘knowledge is power’. If data is knowledge, and knowledge is power, then harvesting hefty amounts of data and sharing it with others can only bear fruit, right? Well, not quite.
The problem starts when that data is sensitive, confidential, or simply private.
In 2018, the European Union responded to the public mood and growing data privacy concerns when the General Data Protection Regulation (GDPR) became applicable, restricting the sharing of personal information without consent. Companies deemed to have breached EU citizens' privacy quickly came under scrutiny. The Danish cab-hailing service Taxa 4x35 was made an example of how seriously the GDPR requirements for data anonymization were to be taken: it was fined DKK 1.2 million after the Danish data protection agency found the company had kept information after it was no longer necessary for the purpose for which it had been collected.
Importantly, it is not necessarily the collection of data, but the lack of effective anonymization, that leads to privacy violations. As the European Commission itself recognises, ‘data-driven innovation is a key enabler of growth and jobs in Europe’. Optimizing and sharing data effectively can help businesses and governments alike.
Use case for anonymization
Anonymization allows for this heterogeneous data to move freely and safely, while complying with privacy regulations.
The GDPR, for instance, allows companies and public entities to collect and share anonymized data. Text that includes names, dates of birth, addresses and bank account details can be handed over to third parties as long as those personal details are erased. This addresses the issues associated with disclosing information without consent and makes data accessible.
At Pangeanic, we have developed the MAPA Project to lead a Europe-wide anonymization effort and provide public administrations with an open source toolkit for effective and reliable data anonymization. For MAPA, Pangeanic will use cutting-edge Natural Language Processing tools such as Named-Entity Recognition and Classification (NERC) techniques built on deep learning and neural networks, with a focus on the medical and legal fields, where sensitive information must be protected.
MAPA will be trained to detect named entities involving sensitive information, such as names, bank account details and addresses. This way, public administrations and users in general will be able to comply effectively with the GDPR and protect citizens' private details while sharing data.
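To make the idea of detecting and erasing personal details concrete, here is a minimal sketch in Python. It is not MAPA's method: instead of a trained NERC model it uses a few hand-written regular expressions, and the pattern set, the labels and the `redact` function are all hypothetical names chosen for illustration. Simple patterns like these can catch well-structured details such as email addresses, dates and bank account numbers, but not names or postal addresses, which is precisely where a trained model is needed.

```python
import re

# Hypothetical, deliberately simple patterns (assumptions for illustration).
# A NERC model would detect many more entity types from context.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    # IBAN-like: country code, two check digits, then groups of characters.
    "IBAN": re.compile(
        r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"
    ),
    # Numeric dates written as day/month/year or month/day/year.
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = (
    "Contact jane.doe@example.com, born 04/07/1985, "
    "account DK50 0040 0440 1162 43."
)
print(redact(sample))
# prints: Contact [EMAIL], born [DATE], account [IBAN].
```

Replacing each detail with a label rather than deleting it keeps the text readable and tells the recipient what kind of information was removed, without revealing the information itself.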
Data is a powerful resource. The amount of information generated daily through the Internet of Things is mind-boggling. On Twitter alone, around 500,000 tweets are posted per minute. For the world's 4.57 billion active internet users, entering personal data online and sharing it is part of modern life. It is virtually impossible to imagine that no company, organization or public entity holds some sort of information about you in some format.
The right to safeguard personal data should be exercised wherever possible, and to a great extent the right to remain anonymous can be upheld through effective and reliable anonymization measures.