EXPERT: EXPloiting Empirical appRoaches to Translation

As recently published in our news section, Pangeanic will take part as a full member in the EU-funded EXPERT Project.

EXPERT aims to train young researchers to promote the research, development and use of hybrid language translation technologies. In practice, the project seeks to improve translation practices and enhance the productivity of relevant actors in the translation market. In this respect, EXPERT's findings will set an agenda for new skills and jobs by promoting new job profiles based on empirical data from translation professionals (language service providers) and academia.

The project's working assumption is that the true potential of MT remains to be exploited as a result of non-user-friendly interfaces, lack of awareness of translators' feedback, etc. However, Pangeanic has already created and released a web-based tool that can organize material for Machine Translation by domain, maintain it and perform some cleaning routines, which was a key factor in our participation in the project. The web tool can also create engines directly by domain or from TMs (translation memories) and perform several operations on training sets before engine training. Following a revolutionary concept, Machine Translation engines are created or updated depending on domains, and a few clicks can set in motion several actions to provide ready-for-use (S)MT. The web tool already incorporates hybrid features (such as those presented at JTF in Tokyo, 2011), and these will be tested, expanded and improved upon in EXPERT.

Our role within the 4-year project is to concentrate on results-driven testing of hybridization on the 6 official United Nations languages, carrying out a series of experiments on EN/FR/ES/ZH/RU/AR. These will include general pre- and post-processing rules designed to improve machine translation output. For example, some tests will alter training sets and evaluate the impact of reordering in certain language combinations, measuring gains when using purely statistical, syntax-based or factored models.

Pangeanic will focus on the automatic generation of bilingual written texts for multiple language combinations, alignment, segment cleaning and segment selection for bilingual engine building. We will also look at which hybridization techniques need to be incorporated and improved to tackle reordering issues and other linguistic phenomena in unrelated languages. When dealing with language-specific issues, we will also delve into automatic quality metrics and how well these correlate with subjective human judgments of quality.

Using our tool, users can check engine statistics (e.g. BLEU score, number of segments, number of words) and behavior. For example, the engines can be used for translation and can be automatically updated with new or post-edited material, with further retraining possible via Pangeanic's MT API or the web interface. In this way we can measure the impact of new datasets and hybrid techniques on translation quality over time, and the project will benefit from existing, state-of-the-art technologies.
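To give a rough idea of what segment cleaning and basic training-set statistics can involve, here is a minimal Python sketch. It is not Pangeanic's implementation: the names `Segment`, `clean_segments` and `corpus_stats`, and the thresholds `max_ratio` and `max_words`, are hypothetical and chosen only for illustration. It also assumes whitespace-separated words, so a language such as Chinese would need a word segmenter before the length checks make sense.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One aligned source/target pair (hypothetical structure)."""
    source: str
    target: str


def clean_segments(segments, max_ratio=3.0, max_words=100):
    """Keep only segment pairs that pass a few simple heuristics.

    The thresholds are illustrative assumptions, not tuned values.
    """
    seen = set()
    kept = []
    for seg in segments:
        # Collapse runs of whitespace and trim both sides.
        src = " ".join(seg.source.split())
        tgt = " ".join(seg.target.split())
        # Drop pairs where either side is empty.
        if not src or not tgt:
            continue
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop very long segments.
        if n_src > max_words or n_tgt > max_words:
            continue
        # Drop pairs whose lengths differ too much (likely misaligned).
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Drop exact duplicates, ignoring case.
        key = (src.lower(), tgt.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append(Segment(src, tgt))
    return kept


def corpus_stats(segments):
    """Count segments and whitespace-separated words on each side."""
    return {
        "segments": len(segments),
        "source_words": sum(len(s.source.split()) for s in segments),
        "target_words": sum(len(s.target.split()) for s in segments),
    }


if __name__ == "__main__":
    raw = [
        Segment("The  committee met today.", "Le comité s'est réuni aujourd'hui."),
        Segment("The committee met today.", "Le comité s'est réuni aujourd'hui."),
        Segment("", "Bonjour"),
    ]
    print(corpus_stats(clean_segments(raw)))
```

Comparing `corpus_stats` before and after cleaning shows how much material each rule removes, which is one simple way to judge the effect of a change to the training set before retraining an engine.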
