Machine Learning Datasets and Neural Engines available from NTEU Consortium

The NTEU Consortium, which Pangeanic has led since 2019, has completed its massive data upload to ELRC, and its neural engines are now available to European Public Administrations via the European Language Grid. The goal of the NTEU project was to gather and re-use many of the language resources from several European CEF projects to create near-human-quality machine translation engines for Public Administrations in EU Member States. This massive engine-building endeavor covered every combination of EU official languages, from English into Spanish, German or French to low-resource pairs such as Latvian, Finnish or Bulgarian into Greek, Croatian or Maltese. Every engine was tested with MTET (Machine Translation Evaluation Tool), an evaluation tool developed specifically for the project. MTET was used to rank the performance of direct-combination engines (i.e., engines that do not “pivot” through English) against a set of free online engines. Two graders ranked every engine (language combination) in order to normalize human judgement and assess how close the engines’ output was to a reference human translation.

A view of the Machine Translation Evaluation Tool (MTET)

Human graders could leave unclear evaluations unfinished if they needed to stop and come back later, although evaluating segments consecutively, one sentence after another, was preferable. As we can see below, some language combinations (Irish Gaelic into Greek) were a challenge!

Fig. 2 Typical Evaluation Screen

To guarantee final quality, human graders did not know which translation came from the NTEU engines and which came from a generalist online MT provider used as a benchmark. They ranked each translation by moving a slider from right to left, on a scale from 0 to 100. The aim was for graders to assess whether the machine-generated sentence adequately expressed the meaning contained in the source, that is, how close it was to what a human would have written.
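The blind comparison described above can be illustrated with a short sketch. It is not taken from MTET: the function name `blind_pair` and the labels `"nteu"` and `"benchmark"` are hypothetical, and the only idea shown is that the two translations are presented in random order while the mapping back to each system is kept away from the grader.

```python
import random


def blind_pair(nteu_output, benchmark_output, rng=random):
    """Return the two translations in random order plus a hidden answer key.

    Hypothetical helper for illustration only; not part of MTET.
    """
    candidates = [("nteu", nteu_output), ("benchmark", benchmark_output)]
    rng.shuffle(candidates)  # randomize which system appears first
    shown = [text for _, text in candidates]      # what the grader sees
    key = [system for system, _ in candidates]    # stored separately from the grader
    return shown, key


# Example: the grader only receives `shown`; `key` is used later to attribute scores.
shown, key = blind_pair("Translation A", "Translation B", random.Random(0))
```

Passing a seeded `random.Random` instance, as in the example, makes the ordering reproducible for auditing while remaining opaque to the grader.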

Evaluation Criteria

Another challenge was to standardize human criteria. Different people may have different linguistic preferences, which can affect sentence evaluation, so it was important to follow the same scoring guidelines from the beginning. To standardize criteria, Pangeanic, together with the Barcelona SuperComputing Centre, laid out a set of instructions based on proven academic methods to guarantee that all evaluators followed the same scoring methods across languages. Unlike the BLEU-based scoring used for SMT, NMT output needed to be ranked on accuracy, fluency and terminology. These three key items were defined as follows:

Accuracy: the sentence contains the meaning of the original, even though synonyms may have been used.
Fluency: the grammatical correctness of the sentence (gender agreement, plural/singular, case declension, etc.).
Adequacy [Terminology]: the proper use of in-domain terms agreed by the client and the developer for use in production, which may not be standard or general terms (the specific jargon).

When ranking a sentence, the following weights were typically applied:

Accuracy: 33%
Fluency: 33%
Adequacy [Terminology]: 33%
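As a rough illustration of how such weights could combine with per-error deductions, the sketch below starts each criterion at 100, subtracts the grader's deductions, and takes the weighted mean. The function names, the normalization by the weight total (the three listed weights sum to 99) and the floor at 0 are assumptions made for the example, not a description of MTET's internals.

```python
# Hypothetical sketch: the names and the aggregation rule are assumptions,
# not the actual MTET implementation.
WEIGHTS = {"accuracy": 33, "fluency": 33, "terminology": 33}  # percentages from the text


def criterion_score(deductions):
    """Start from 100 and subtract each deduction, never going below 0."""
    return max(0, 100 - sum(deductions))


def sentence_score(deductions_by_criterion, weights=WEIGHTS):
    """Weighted mean of the per-criterion scores, normalized by the weight total."""
    total_weight = sum(weights.values())
    weighted = sum(
        weight * criterion_score(deductions_by_criterion.get(name, []))
        for name, weight in weights.items()
    )
    return weighted / total_weight


# Two accuracy errors (5 and 20 points) and one small fluency error (5 points).
example = {"accuracy": [5, 20], "fluency": [5]}
print(sentence_score(example))  # 90.0
```

Under these assumptions the accuracy score is 75, fluency is 95 and terminology is 100, so the sentence scores 90 overall. A different aggregation rule would give a different figure.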

In general, human graders deducted 5 to 10 points for every serious error, and the final score was the result of applying these deductions. For instance, a grader might find two accuracy errors in a sentence: some information was missing, and unrelated information had been added. The grader then subtracted 5% for the small error and 20% for the serious error from the Accuracy total. If the grader also found a small fluency error, he or she could deduct a further 5%.

“We are very happy this massive effort has crystallized into tangible results for the potential users, the European Public Administrations, which can now run MT privately as an internal infrastructure. These engines can also serve as a benchmarking tool for the wider academic MT community,” said Manuel Herranz, Pangeanic’s CEO.