Machine translation quality: How to measure it

Many people have asked me how they can reliably measure and benchmark the quality of their translation system (rule-based, example-based or statistical). They have bought commercial rule-based software and are trying it out, building dictionaries and normalization rules, or they are getting their first taste of what it means to work with a Moses engine.

There are two free scoring tools, BLEU and Meteor, which take your system's output as input and give you an idea of how it is scoring. Some people use them to test their system against Google Translate, raw MT output or other texts. You can use them, for example, to check how your system is doing in comparison with the free Google Translate, Systran's online tools, BabelFish, etc. This gives you an idea of your progress as you customize your own tool for a particular application, taking generalist online tools as a baseline.

The tests are not difficult to carry out. All you will need is some help at the installation stage if you are not familiar with Linux and with running a few commands. Once you get used to it, you can run progress checks at will.

BLEU is the industry standard. Sooner or later, most MT systems will show some kind of BLEU score to prove their progress and reliability. However, BLEU has some drawbacks, and when you look at the translated files you may feel that some of its high scores do not actually reflect the same degree of improvement.

We favour Meteor at Pangeanic. It does not only count exact word-for-word matches; it also brings some linguistic information into the tests, giving credit to related word forms. A given BLEU score will usually come out lower in Meteor, although this is only a very rough rule of thumb. We have in fact seen higher Meteor scores at Pangeanic when measuring engines that translate marketing texts or general content. This is because not just exact word occurrences were taken into account, but also other relations (i.e. the whole word family).
Looking at the results from a post-editing point of view, this may be more relevant, because it takes little time to correct a wrong verb tense or a singular that should be a plural. We recommend you initially take 60% of the BLEU score as your productivity target. Once an MT system is up and running, and has been updated and refined over some months, your scores and productivity should go up considerably.
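
To give a feel for what a BLEU-style score measures, here is a minimal sketch in Python. It is a simplified illustration and not the scoring script you would install for real tests: it assumes one reference translation per segment, plain whitespace tokenization and no smoothing, so its numbers will not match those of established tools exactly. The function names (`corpus_bleu`, `productivity_target`) are hypothetical, and the last function simply applies the 60% rule of thumb mentioned above.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU on a 0-100 scale.

    Assumptions: one reference per segment, whitespace tokenization,
    no smoothing. Illustrative only.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # n-grams produced by the system, per order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Counter intersection clips each n-gram count by the
            # number of times it appears in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, a zero precision at any order gives BLEU 0.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions.
    log_precision = sum(
        math.log(m / t) for m, t in zip(matches, totals)
    ) / max_n
    # Brevity penalty: output shorter than the reference is penalized.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


def productivity_target(bleu, factor=0.60):
    """Initial productivity target: 60% of the BLEU score (rule of thumb)."""
    return factor * bleu


if __name__ == "__main__":
    # Made-up sentences, purely to show how the functions are called.
    references = ["the contract must be signed by both parties before friday"]
    own_engine = ["the contract must be signed by both parties by friday"]
    online_tool = ["the contract has to be signed from both parties until friday"]

    for name, output in [("own engine", own_engine), ("online tool", online_tool)]:
        score = corpus_bleu(output, references)
        print(f"{name}: BLEU {score:.1f}, target {productivity_target(score):.1f}")
```

Running the same comparison on a fixed test set after each round of customization is what lets you track progress against a generalist online tool. On single sentences an unsmoothed score like this one easily drops to zero, so use a test set of a reasonable size.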