

30/08/2021

Machine Learning datasets and Neural Engines from the NTEU consortium are now available.

The NTEU consortium, led by Pangeanic since 2019, has successfully completed the bulk data upload to the ELRC, making neural machine translation engines available to European public administrations through the European Language Grid. The NTEU project set out to collect and reuse a wide range of linguistic resources from various European CEF projects in order to develop high-quality, near-human translation engines designed specifically for public administrations in EU Member States. This extensive effort involved creating translation engines for all possible language pair combinations among the EU's official languages, covering directions such as English into Spanish, German, or French, as well as low-resource pairs such as Latvian, Finnish, or Bulgarian into Greek, Croatian, or Maltese.

All translation engines were tested with the project's dedicated evaluation tool, MTET (Machine Translation Evaluation Tool), developed specifically for this purpose. MTET assessed the performance of direct translation engines (i.e., engines that do not use English as a pivot language) by comparing them with a set of freely available online translation engines. Each language combination was evaluated by two independent reviewers to ensure consistency in human judgment and to assess how closely the machine-generated output matched a human reference.

Fig. 1: A view of the Machine Translation Evaluation Tool (MTET)
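
To give a rough sense of the scale of the language-pair coverage described above, the short sketch below simply enumerates the directed translation directions among the 24 official EU languages. The language list, the counts, and the code itself are an arithmetic illustration only; they do not describe the exact engine catalogue delivered by NTEU.

```python
from itertools import permutations

# The 24 official EU languages (ISO 639-1 codes), listed here only to
# illustrate the scale of "all possible language pair combinations".
EU_LANGUAGES = [
    "bg", "cs", "da", "de", "el", "en", "es", "et", "fi", "fr", "ga", "hr",
    "hu", "it", "lt", "lv", "mt", "nl", "pl", "pt", "ro", "sk", "sl", "sv",
]

# All directed translation pairs, e.g. ("en", "es") for English into Spanish.
directed_pairs = list(permutations(EU_LANGUAGES, 2))

# Pairs where neither side is English, i.e. directions that a direct
# (non-pivoting) engine must cover without going through English.
non_english_pairs = [p for p in directed_pairs if "en" not in p]

print(len(directed_pairs))     # 552 directed combinations in total
print(len(non_english_pairs))  # 506 of them involve no English at all
```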

A Look at the MTET Machine Translation Evaluation Tool

Reviewers had the option to leave some assessments incomplete if they found them unclear or needed to pause and resume the task later. However, they were encouraged to evaluate segments consecutively, processing one sentence after another.

As shown below, certain language combinations—such as Irish Gaelic to Greek—proved particularly challenging.


Fig. 2: Typical Evaluation Screen

To ensure the final quality of the assessments, human reviewers were unaware of which results came from the NTEU engines and which originated from a second translation performed by a generalist online MT provider used as a reference.

Each output was rated on a sliding scale ranging from 0 to 100, with the slider moved from right to left. The goal was to assess whether the machine-generated sentence accurately conveyed the meaning of the source language, in other words, how closely it resembled what a human would have written.

Evaluation Criteria

Another challenge was the standardization of human criteria: each individual has different linguistic preferences that can affect how a sentence is evaluated, so it was important to follow the same scoring guidelines from the very beginning. To standardize these criteria, Pangeanic, in collaboration with the Barcelona Supercomputing Center, established a set of instructions based on methods tested in academia, ensuring that all evaluators applied the same scoring method across all languages. Unlike the methods employed with statistical MT (based on BLEU scoring), the neural MT engines were rated on accuracy, fluency, and terminology. These three key elements were defined as follows:

  • Accuracy: whether the sentence conveys the meaning of the original text, even if synonyms have been used.
  • Fluency: the grammatical correctness of the sentence (gender agreement, singular/plural, case declension, etc.).
  • Adequacy [Terminology]: the proper use of terms in the domain agreed upon by the client and the developer for production purposes, which may not be standard or general terms (specific jargon).

When scoring a sentence, the following weightings are generally applied (a sketch of how these weights combine follows the list below):

  • Accuracy: 33%
  • Fluency: 33%
  • Adequacy [Terminology]: 33%
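
As a rough illustration of how these roughly equal weightings could combine the three criterion scores into a single 0-100 rating, here is a minimal sketch. The function name, weights dictionary, and example values are assumptions for illustration, not part of the actual MTET implementation.

```python
# Illustrative only: combining per-criterion scores (0-100) with the
# weightings listed above.
CRITERIA_WEIGHTS = {
    "accuracy": 1 / 3,     # does the sentence convey the source meaning?
    "fluency": 1 / 3,      # grammatical correctness (agreement, declension, ...)
    "terminology": 1 / 3,  # correct use of the agreed domain terminology
}

def combined_score(scores: dict[str, float]) -> float:
    """Combine 0-100 scores for accuracy, fluency, and terminology into one rating."""
    return round(sum(CRITERIA_WEIGHTS[c] * scores[c] for c in CRITERIA_WEIGHTS), 1)

# Example: strong accuracy, good fluency, one terminology slip.
print(combined_score({"accuracy": 95, "fluency": 90, "terminology": 80}))  # 88.3
```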

In general, reviewers applied a score deduction of 5 to 10 points for each minor error, and larger deductions for major ones; the final evaluation was the result of applying these deductions. For example, a reviewer might find two accuracy errors in a sentence (missing information and the addition of unrelated information) and subtract 5 points for the minor error and 20 points for the major one. If the reviewer (evaluator) had also found a minor fluency error, they could deduct a further 5 points.
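
The deduction logic can be summarized in a few lines. The sketch below reproduces the worked example from the text (a minor and a major accuracy error plus a minor fluency error); the penalty values and function name are chosen for illustration rather than taken from MTET.

```python
# Illustrative sketch of the deduction-based scoring described above; the
# penalty values mirror the worked example in the text.
PENALTIES = {"minor": 5, "major": 20}  # points subtracted per error

def sentence_score(errors: list[tuple[str, str]]) -> int:
    """errors: (criterion, severity) pairs, e.g. ("accuracy", "major")."""
    score = 100 - sum(PENALTIES[severity] for _criterion, severity in errors)
    return max(score, 0)  # stay within the 0-100 scale

# The example from the text: a minor and a major accuracy error,
# plus a minor fluency error.
errors_found = [("accuracy", "minor"), ("accuracy", "major"), ("fluency", "minor")]
print(sentence_score(errors_found))  # 100 - 5 - 20 - 5 = 70
```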
"We are very pleased that this enormous effort has resulted in tangible outcomes for potential users, namely European Public Administrations, who can now use MT privately as internal infrastructure. These engines can also serve as a benchmarking tool for the academic MT community in general."

Statements by Manuel Herranz, CEO of Pangeanic.