Ensuring good machine translation using BLEU scoring

Over time, machine translation (MT) systems have become indispensable. Technologies built on sophisticated artificial intelligence algorithms now deliver good translation quality and also allow large volumes of text to be processed in a short time.

However, machine translation quality must be verified. The way to do this is through a mechanism that evaluates the performance of the MT system. Among the most widely used methods is the BLEU score.


What is a BLEU score? 


BLEU, or Bilingual Evaluation Understudy, is a scoring method that evaluates the quality of the translations produced by a natural language processing (NLP) system.

Basically, BLEU compares the text generated by machine translation with the reference translations that have been performed by humans and are deemed to be correct.

During evaluation, BLEU compares each MT sentence with the corresponding sentence in the reference translation. It then calculates a score from the number of matches and the degree of similarity.

This scoring system has a range between 0 and 1. If the match is complete and perfect, then BLEU results in a value equal to 1. If there are no matches at all, then BLEU assigns a score of 0.

Obtaining a result equal to 1 is rare in practice, as it would mean that the machine translation was exactly the same as the one produced by a professional translator.


How is the BLEU score calculated and how does it work? 

At the algorithm level, BLEU computes its score from the n-grams that match between the texts it is comparing. An n-gram, in statistical and computational linguistics, is a contiguous sequence of n elements taken from a text or speech sample. These elements can be words, syllables, letters, etc.; BLEU normally works with words.

BLEU compares the translated text (the candidate translation) with the reference translation and counts the matching n-grams. A raw count would reward a candidate that simply repeats a matching word many times, so BLEU adjusts the count, a process known as modified n-gram precision.
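As a minimal sketch of what an n-gram is, the following Python snippet splits a sentence into word-level n-grams. The function name `ngrams` and the whitespace tokenization are assumptions made for illustration; real evaluation tools apply their own tokenization rules.

```python
def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens as a tuple."""
    # If the text has fewer than n tokens, the range is empty and so is the result.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the cat is on the mat".split()
unigrams = ngrams(tokens, 1)  # six 1-grams, one per word
bigrams = ngrams(tokens, 2)   # five 2-grams, starting with ("the", "cat")
```

A sentence of six words therefore yields six 1-grams, five 2-grams, four 3-grams, and three 4-grams.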

The modified precision of n-grams is based on a simple calculation:

  • For each distinct n-gram in the candidate translation, it finds the largest number of times that n-gram appears in any single reference translation, and assigns this value to the variable m_max.
  • It clips the number of times the n-gram appears in the candidate translation (m_w) so that it does not exceed m_max; that is, it takes the smaller of m_w and m_max.
  • All the clipped counts for the n-grams in the candidate translation are added up.
  • The above sum is divided by the total number of n-grams in the candidate text.
  • The result is the modified precision.
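The steps above can be sketched in Python as follows. This is an illustrative implementation under simple assumptions (pre-tokenized word lists, hypothetical function names), not the code of any particular library. The example uses a deliberately degenerate candidate that repeats one word, which is the case clipping is designed to handle.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous sequence of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram matches divided by the number of candidate n-grams."""
    cand_counts = Counter(ngrams(candidate, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0  # candidate is shorter than n tokens
    ref_counts = [Counter(ngrams(ref, n)) for ref in references]
    clipped = 0
    for gram, m_w in cand_counts.items():
        # m_max: highest count of this n-gram in any single reference
        m_max = max(rc[gram] for rc in ref_counts)
        clipped += min(m_w, m_max)
    return clipped / total


candidate = "the the the the the the the".split()
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
p1 = modified_precision(candidate, references, 1)  # 2/7, about 0.286
```

Without clipping, all seven occurrences of "the" would count as matches and the precision would be 7/7. With clipping, only two are credited, because "the" appears at most twice in a single reference.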

For the final calculation, BLEU uses a formula in which a brevity penalty (BP) is applied to the combined modified precisions. With c as the number of words in the candidate translation and r as the number of words in the reference translation (when there are several references, the reference length closest to c is typically used), this factor is calculated as follows:

  • The brevity penalty is equal to 1 if c is greater than r.
  • The brevity penalty is equal to e^(1 - r/c) if c is less than or equal to r.
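The two cases above translate directly into code. The sketch below assumes the lengths c and r have already been computed; the function name and the handling of an empty candidate (returning 0) are choices made for this illustration.

```python
import math


def brevity_penalty(c, r):
    """Brevity penalty for candidate length c and reference length r."""
    if c == 0:
        return 0.0  # assumption: an empty candidate earns no credit
    if c > r:
        return 1.0
    return math.exp(1 - r / c)


bp_longer = brevity_penalty(10, 8)  # candidate longer than reference: 1.0
bp_equal = brevity_penalty(8, 8)    # equal lengths: e^0 = 1.0
bp_half = brevity_penalty(5, 10)    # half the length: e^(-1), about 0.368
```

The penalty never raises a score. It only lowers the score of candidates that are shorter than the reference, and it does so more sharply the shorter they are.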

Finally, the BLEU score is obtained with the following formula:

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right)$$

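This expression combines the precisions obtained for n-grams of different lengths. The summation starts with single words (n = 1) and continues up to the maximum n-gram length N, which is commonly set to 4. Each w_n is the weight given to n-grams of length n; equal weights, w_n = 1/N, are the usual choice, in which case the exponential term is the geometric mean of the precisions.

p_n is the modified precision for n-grams of length n; so, n = 1 corresponds to p_1, the precision for single words. Because the formula takes the logarithm of each p_n, the score is 0 whenever any p_n is 0.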
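Putting the pieces together, a sentence-level BLEU score can be sketched as below. This is a simplified illustration that assumes equal weights (w_n = 1/N), whitespace-tokenized input, no smoothing, and hypothetical function names. Production tools usually compute BLEU over a whole corpus and handle tokenization and zero counts more carefully.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    cand_counts = Counter(ngrams(candidate, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    ref_counts = [Counter(ngrams(ref, n)) for ref in references]
    clipped = sum(
        min(m_w, max(rc[gram] for rc in ref_counts))
        for gram, m_w in cand_counts.items()
    )
    return clipped / total


def brevity_penalty(c, r):
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1 - r / c)


def bleu(candidate, references, max_n=4):
    """BLEU = BP * exp(sum of w_n * log p_n), with w_n = 1 / max_n."""
    c = len(candidate)
    # Reference length closest to c (ties go to the shorter reference).
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; without smoothing the score is 0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(c, r) * math.exp(log_mean)


reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()

perfect = bleu(reference, [reference])           # identical text: 1.0
bleu_4 = bleu(candidate, [reference])            # no matching 4-gram: 0.0
bleu_2 = bleu(candidate, [reference], max_n=2)   # sqrt(5/6 * 3/5), about 0.707
```

The candidate differs from the reference by a single word, yet its score with N = 4 is 0 because no 4-gram matches. This is why sentence-level scores are often computed with smoothing or a smaller N, and why BLEU is most informative when aggregated over many sentences.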


Machine translation and BLEU scoring 


Machine translation is the process performed by a computer program in order to understand a text and express it in another language, accurately and without human intervention.

Currently, machine translation is performed by artificial intelligence systems capable of processing natural language successfully and producing high-quality translations.

Along with the evolution of machine translation systems, automated methods have also been developed to check the quality and accuracy of machine-generated text, such as BLEU scoring.

Although in some circumstances a translation may require a human review, it is also true that the sophisticated technologies now used for translation, combined with the BLEU method as a quality check, can deliver very good results with far less need for review.

In addition, machine translation and BLEU scoring offer many advantages. Just as a good machine translation is a faster and less costly method than a translation done by professional translators, the BLEU score is a mechanism that can process a large number of texts for evaluation at high speed.

BLEU scoring limitations 


Although BLEU scoring is a fast and reliable method, it also has some limitations:

  • It does not take into account synonymous words.
  • It is less reliable for short sentences than for long ones, because short sentences contain few longer n-grams to match.
  • It does not take into consideration the proper use of grammar.
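The first limitation is easy to see with a small sketch. The helper below computes only the clipped single-word precision against one reference; the function name and the example sentences are invented for illustration.

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Clipped single-word precision against one reference."""
    cand_counts = Counter(candidate)
    ref_counts = Counter(reference)
    clipped = sum(min(count, ref_counts[word])
                  for word, count in cand_counts.items())
    return clipped / len(candidate)


reference = "the film was great".split()
exact = "the film was great".split()
synonym = "the movie was great".split()

p_exact = unigram_precision(exact, reference)      # 1.0
p_synonym = unigram_precision(synonym, reference)  # 0.75
```

Although "movie" is a perfectly acceptable translation here, it is counted as a miss because it does not appear in the reference. Supplying several reference translations reduces this effect but does not remove it.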

To keep very short candidate translations from scoring artificially high, BLEU includes a brevity penalty (BP) in its formula. In addition, it clips n-gram counts so that repeated words are not counted more often than they appear in a reference.

The calculation system and these corrections make BLEU one of the most widely used mechanisms for quality assurance in translation. The BLEU score for NLP is automated, easy to apply, and produces results that correlate closely with the reviews performed by translation professionals.


BLEU, the bilingual evaluation understudy, is widely used to evaluate demanding translations, such as in legal and financial fields, international law, medical documents, etc.

To find out which translation suits your needs, please contact us. At Pangeanic, we provide various natural language processing services, such as machine translation. We believe in the democratization of artificial intelligence, so that everyone has access to information.