Pangeanic introduces anonymization and neural machine translation at META-Forum 2020

Pangeanic was an active partner at the META-Forum 2020 conference, where it introduced two projects it currently leads, both with deliverables to the European Commission's CEF program. The first presentation, on Wednesday 2 December, covered the anonymization software of the MAPA CEF project. The second covered the NTEU CEF project, the largest-ever direct-combination neural engine farm. Due to travel restrictions, the conference was held online, hosted from Berlin, Germany, from 1 to 3 December 2020. It was organized by the European Language Grid (ELG) and included presentations of the ELG pilot projects and of current EU-funded projects in the language technology area, sessions on the state of the art and future work, and reports from language technology companies and industry. The talks allowed the different AI communities in Europe to share their current knowledge and efforts.

Pangeanic gave two presentations:

MAPA presentation

MAPA stands for Multilingual Anonymization for Public Administrations. The goal of this CEF project is to develop an open-source de-identification toolkit for all official European Union languages. Pangeanic designed the concept and presented the proposal to the EC. It now leads the consortium, which includes recognized European data organizations, the Spanish government Data Agency and the French National Research Center, among others.

The MAPA anonymization toolkit will rely on Named Entity Recognition and Classification (NERC) techniques based on recent neural network and deep learning methods. MAPA will manage a large multilingual data collection and annotation activity to provide the training and testing data needed to develop the toolkit, which will be delivered as a Docker container. Data is currently being identified and collected for the 24 official European languages. As part of the project, a connection will be established to eTranslation, the online machine translation service provided by the European Commission, to foster the provision of machine translation services to and by public administrations, with the option of anonymizing the content.

The toolkit, in its basic form, will be publicly available to European public administrations and to the EU institutions themselves. It will also foster the growth of language technologies as a key component of new digital and AI societies, helping to ensure that personal data is anonymized. MAPA is particularly targeted at public administrations in the health and legal domains, as a result of the specific use cases addressed during the development of the project. Partners will be able to customize the package further for particular national or commercial uses.
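As a rough illustration of how recognition-based anonymization works in general, the sketch below finds entity spans in a text and replaces each with a placeholder for its class. It is not MAPA code: the two regular expressions are a hypothetical stand-in for a neural NERC model, and the names `recognize`, `anonymize` and `PATTERNS` are assumptions made for this example only.

```python
import re
from typing import List, NamedTuple


class Entity(NamedTuple):
    start: int
    end: int
    label: str


# Hypothetical stand-in for a neural NERC model: two illustrative patterns.
# A real toolkit would predict spans and classes with a trained model.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def recognize(text: str) -> List[Entity]:
    """Return the detected entity spans, sorted by start offset."""
    found = [
        Entity(m.start(), m.end(), label)
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(text)
    ]
    return sorted(found)


def anonymize(text: str) -> str:
    """Replace every detected span with a placeholder such as [EMAIL]."""
    out, cursor = [], 0
    for ent in recognize(text):
        if ent.start < cursor:  # skip a span that overlaps an earlier one
            continue
        out.append(text[cursor:ent.start])
        out.append(f"[{ent.label}]")
        cursor = ent.end
    out.append(text[cursor:])
    return "".join(out)


print(anonymize("Contact ana@example.org before 12/03/2021."))
```

Keeping recognition and replacement as separate steps means the recognizer can be swapped for a trained model without touching the masking logic.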

NTEU presentation

NTEU stands for Neural Translation for the European Union. Pangeanic leads the project together with KantanMT and Tilde, two leading language technology companies in Europe, and the Spanish government agency SEDIA. The objective is to build a neural engine farm for eTranslation covering all the combinations of the 24 official European languages, without the need to pivot through a high-resourced language. In total, the project is creating 506 near-human-quality neural translation engines across the EU official language combinations. NTEU will provide a capacity service to eTranslation by building a near-human, professional-quality neural engine farm that can be deployed as an infrastructure for machine translation.

Lower-resourced languages are a known challenge, and more effort is required to obtain well-performing engines for them. Techniques that supplement the original data, such as synthetic data generation and transfer learning, are applied. The machine translation output from the engines is evaluated manually, following industry and WMT practices, on an open-source platform created by the consortium. In addition, the NTEU consortium will gather and clean a large data set for all language combinations so that the engines can be retrained with other technologies in the future.

Both Pangeanic presentations raised a lot of interest in the language community, and the MAPA anonymization digital booth was the most popular, as anonymization services are in high demand in both the public and private sectors. The deliverables of the projects will be shared in the ELG and ELRC-SHARE repositories.
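To make the difference between direct and pivot-based translation concrete, the sketch below counts the engines each approach needs for a given number of languages. The function names are hypothetical, and the arithmetic is generic: it assumes one engine per translation direction and does not describe how the project itself arrives at its engine count.

```python
def direct_engines(n_languages: int) -> int:
    """One engine per ordered (source, target) pair of distinct languages."""
    return n_languages * (n_languages - 1)


def pivot_engines(n_languages: int) -> int:
    """One engine into and one out of the pivot for each non-pivot language."""
    return 2 * (n_languages - 1)


for n in (23, 24):
    print(n, direct_engines(n), pivot_engines(n))
```

Under these assumptions the direct count grows quadratically while the pivot count grows linearly, which is why a direct farm is a much larger undertaking. In exchange, a direct engine translates in one hop instead of two, so errors from a first hop into the pivot language are not carried into a second.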