Techniques for Measuring Machine Translation Quality

Written by Ángela Franco | 07/21/22

Technological progress in the field of machine translation (MT) is indisputable. MT systems can process large volumes of text, transferring content from a source language to a target language in just a few seconds, and they are constantly becoming faster and more precise.

Nowadays, we are surrounded by MT, and it is becoming increasingly indispensable for companies, organizations, scientists, and others sharing information globally.

But alongside this progress, other questions arise: How do we know whether translation quality is being maintained? What techniques do we have for measuring the quality of machine translation?


Why is it important to measure the quality of machine translation?

On the one hand, machine translation is available on a variety of public platforms, within everyone's reach. On the other hand, numerous language service providers offer large-scale translation and the processing of specialized texts.

In this context, it is essential for companies with a global reach to know which solutions offer high-quality MT and which MT system is best suited to the specialized terminology and language they use.

In addition, the machine translation developers themselves need to know the scoring metrics, both automated and human, in order to evaluate the output of MT systems.

Developers evaluate the fluency, adequacy, and accuracy of the system's output to obtain a quality estimate for the MT output. Then what do they do with it? They use the results of the evaluation to modify and optimize the system's algorithms.


Recommended reading:

Everything you need to know about machine translation


How to measure machine translation quality

Estimating the quality of machine translation is vital. This estimate is produced by automated methods and/or techniques applied by human translators.

The choice of MT quality assessment method always depends on what you need to know. If the goal is to find out whether a machine-translated text meets the required quality standard, then a human technique must be applied.

But if the goal is to evaluate an MT system itself, then automated metrics, such as BLEU, should be used.


Human techniques

The techniques applied by human translators (evaluators) in the evaluation of machine translation quality include the following:

  • Evaluation with post-editing. This is the process of editing the machine-translated text. The evaluator is shown both the machine translation and a reference translation produced by a human, and must edit the machine translation until it conveys the same meaning as the reference.

  • Ranking of complete sentences. The evaluator is presented with several translation options for the same source sentence and must rank them (1st, 2nd, etc.) according to translation quality; a short sketch of how such rankings can be aggregated follows this list.

  • Error classification. In this case, evaluators classify the types of errors in a given translation. In addition, they can make any comment they consider pertinent in the case of a complex error.

  • Sentence classification. The evaluator is presented with the source sentence and its reference translation, plus two machine translation options, and must classify the candidates relative to each other (for example, as better, worse, or equal in quality).
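
As a small illustration of how the ranking judgments described above can be aggregated, here is a short Python sketch. The system names and rankings are hypothetical, invented purely for this example:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: for each source sentence, an evaluator ranks the
# candidate MT systems (1 = best). The names are illustrative only.
rankings = [
    {"system_a": 1, "system_b": 3, "system_c": 2},
    {"system_a": 2, "system_b": 1, "system_c": 3},
    {"system_a": 1, "system_b": 2, "system_c": 3},
]

by_system = defaultdict(list)
for judgment in rankings:
    for system, rank in judgment.items():
        by_system[system].append(rank)

# A lower average rank means the evaluators preferred that system more often.
for system, ranks in sorted(by_system.items(), key=lambda kv: mean(kv[1])):
    print(f"{system}: average rank {mean(ranks):.2f}")
```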


Automated metrics

Automatic scoring systems are also used for the evaluation of machine translation. They are indispensable for objective and rapid evaluation. For example:

  • BLEU (bilingual evaluation understudy). This is the most commonly used metric for assessing the quality of MT systems. It compares the n-grams of the machine-translated text with those of a human reference translation and computes a precision-based score between 0 and 1 (often reported on a 0–100 scale).

  • TER (translation error rate). This metric calculates the minimum number of edits required to turn a translation generated by an MT system into the reference translation produced by a human translator.

  • WER (word error rate). This automated method measures the word-level edit distance (insertions, deletions, and substitutions) between the MT output and the reference, normalized by the length of the reference; a minimal sketch follows this list.

  • METEOR (metric for evaluation of translation with explicit ordering). This method works on the basis of unigram precision and recall, matching words not only by exact form but also by stem and synonym.
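
To make the word-level idea concrete, here is a minimal WER sketch in plain Python. It implements the standard edit-distance formulation; it is an illustration, not any particular vendor's implementation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for the Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution and one deletion against a six-word reference: WER = 2/6.
print(wer("the cat sat on the mat", "the cat sits on mat"))  # 0.333...
```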

It is worth mentioning that these metrics measure similarity to a reference translation produced by a human, whether by n-gram overlap or by edit distance, so they are only an estimate of how good the translation actually is.
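
In practice, these scores are rarely computed by hand. As one example, the open-source sacreBLEU library can score a batch of system outputs against references for BLEU, chrF, and TER in a few lines. The library choice and the sentences below are illustrative assumptions, not something the metrics themselves require:

```python
# pip install sacrebleu
import sacrebleu

# System outputs and human reference translations, aligned line by line.
hypotheses = [
    "The cat sat on the mat.",
    "He did not go to school today.",
]
references = [[
    "The cat is sitting on the mat.",
    "He didn't go to school today.",
]]  # the outer list allows multiple sets of references

bleu = sacrebleu.corpus_bleu(hypotheses, references)  # n-gram precision, 0-100
chrf = sacrebleu.corpus_chrf(hypotheses, references)  # character n-gram F-score
ter = sacrebleu.corpus_ter(hypotheses, references)    # edit-based error rate

print(f"BLEU {bleu.score:.1f}  chrF {chrf.score:.1f}  TER {ter.score:.1f}")
```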


Read more:

Ensuring good machine translation using BLEU scoring


Myths and misconceptions about MT

There are many myths and misconceptions about the quality of machine translation.

One of them is that "Google has the best MT systems." It's true that Google has excellent machine translation systems, but they are generic and do not respect data privacy. So for a customized or specialized translation, there are better options.

Another misconception is that the quality scores of MT systems are static. MT systems, provided they are operated by specialized companies, are updated on a regular basis, so any comparison is valid only at a given point in time and for a specific test set.

The idea that the best machine translation system can only be determined by the linguistic quality of the output translation must also be discarded. In fact, linguistic quality is only one of the criteria necessary in the evaluation of an MT system.

The system must also be evaluated according to the requirements of the companies requesting the service. For example, the security and data privacy a system offers should be taken into account, as well as the speed and customization possibilities.


Pangeanic guarantees machine translation quality

At Pangeanic, we guarantee the quality of our machine translation through the use of non-static models, post-editing by expert linguists—essential for the continuous learning of the MT system—and quality assessment, including BLEU, ChrF, and TER metrics.

To apply machine translation quality assessment techniques, a test corpus is automatically translated using the MT system and then compared to a reference translation produced by a native linguist.

This analysis allows us to obtain the necessary metrics to detect possible model failures and modify the algorithms. It is an iterative process for the continuous improvement of the system.

At Pangeanic, we know how important quality and accuracy are in translation. We specialize in different types of translation services and can advise you on the services that best suit your needs.