INEA has awarded Pangeanic's consortium almost €1M to develop a multilingual anonymization toolkit based on AI processing of health, life science,...
At the SwissText & KONVENS 2020 conference, held June 23-25, 2020, Pangeanic presented the Multilingual Anonymisation toolkit for Public Administrations (MAPA) project. MAPA is supported by both the European Union's CEF (Connecting Europe Facility) Programme and the NTEU (Neural Translation for the EU) project led by Pangeanic.

The PangeaMT technical team is developing a tool to anonymise text in all of the official languages of the EU. The linguistics team is annotating the corpora used to train the neural models, which predict which spans of text contain personal data so that the data can be de-identified. These models are based on named entity recognition and classification and use multilingual BERT, a pre-trained transformer-based language model, which will enable transfer learning from high-resource languages such as English to low-resource languages such as Maltese.

This tool will help European public administrations share data while protecting privacy and complying with GDPR requirements. The code will be released as open source to support further development of this technology. The project is being carried out with Tilde, CNRS, ELDA, the University of Malta, Vicomtech and SEDIA as partners.

This year the SwissText & KONVENS 2020 conference took place in Zurich, but we attended virtually via videoconferencing. The first day featured several workshops on natural language processing tasks in German, and the other two days consisted of keynote presentations and parallel sessions on speech recognition, biomedicine, pre-trained language models, business, and text and sentiment analytics. Pangeanic's presentation of MAPA was given in the business session on the last day and caught the interest of fellow attendees.
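
To give a rough idea of what the de-identification step described above involves, here is a minimal Python sketch. It is not MAPA's code: the `Entity` type, the `deidentify` function and the hand-written spans are hypothetical, and stand in for the output a named entity recognition model might produce. The sketch also assumes the detected spans do not overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A span flagged as personal data (hypothetical structure, not MAPA's)."""
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    label: str   # entity class, e.g. "PERSON" or "LOCATION"


def deidentify(text: str, entities: list[Entity]) -> str:
    """Replace each detected span with a placeholder such as [PERSON]."""
    result = text
    # Work from right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        result = result[:ent.start] + f"[{ent.label}]" + result[ent.end:]
    return result


# Spans written by hand here; in practice they would come from the NER model.
text = "Maria Borg lives in Valletta."
entities = [Entity(0, 10, "PERSON"), Entity(20, 28, "LOCATION")]
print(deidentify(text, entities))  # prints "[PERSON] lives in [LOCATION]."
```

Replacing spans with their entity class rather than deleting them keeps the sentence readable, which matters when the de-identified text is to be shared or reused.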