NMT versus SMT results in Japanese

The Pangeanic neural translation project

The last few months have been extraordinarily busy at Pangeanic. We have focused on applying neural networks to machine translation (neural machine translation, NMT), with tests into 7 languages (Japanese, Russian, Portuguese, French, Italian, German, Spanish). We completed a national R&D project (Cor technology as a platform for translation companies, offering an integrated way of analyzing and managing website translation and document analysis) and integrated the CAT-agnostic translation memory system ActivaTM into Cor and our neural engines. We were also awarded, by the European Union’s CEF (Connecting Europe Facility), the largest digital infrastructure project to build secure connectors to commercial MT vendors and the EU’s own machine translation service (MT@EC) for public administrations across Europe. Leading machine translation developers such as KantanMT, Prompsit, Tilde and our own PangeaMT are joining forces with consulting company Everis to build IADAATPA, a system that will intelligently handle domain adaptation and the selection of the most appropriate engines through secure connectors for public administrations in the EU.

So, time to recap and describe our experience with neural machine translation and how Pangeanic has decided to shift all its efforts into neural networks and leave the statistical approach as a support technology for hybridization.

The Pangeanic neural translation project

We took the training sets of our existing SMT engines as clean data and trained neural engines on exactly the same material. We then ran a parallel human evaluation of the output of each system: the existing statistical machine translation engines and the new engines produced by the neural systems. We are aware that if data cleaning was very important in a statistical system, it is even more so with neural networks. We did not add any additional material because we wanted to be certain that we were comparing exactly the same data trained with two different approaches.

A small percentage of bad or dirty data can have a detrimental effect on SMT systems, but if it is small enough, statistics will take care of it and won’t let it feed through the system (although it can also have a far worse side effect, which is lowering the statistics of certain n-grams across the board).

Visual sample of statistical candidates with best candidate proposed in a statistical machine translation system


We selected the same training data for languages which we knew were performing very well in SMT (French, Spanish, Portuguese) as well as for those known to researchers and practitioners as “the hard lot”: Russian, as an example of a morphologically very rich language, and Japanese, as a language with a radically different grammatical structure where re-ordering (which is what hybrid systems have done) had proven to be the only way to improve.

Japanese neural translation tests

Let’s concentrate first on the neural translation results in Japanese, as they represent the quantum leap in machine translation we have all been waiting for. These results were presented at TAUS Tokyo last April. (See our previous post TAUS Tokyo Summit: improvements in neural machine translation in Japanese are real).

Japanese neural translation engine for the electronics and IT field

We used tokenizer.perl for English tokenization and MeCab for Japanese tokenization.

We used a large training corpus of 4.6 million sentences (nearly 60 million running words in English and 76 million in Japanese). In vocabulary terms, that meant 491,600 English words and 283,800 character-words in Japanese. Yes, our brains are able to “compute” all that and even more, if we add all types of conjugations, verb tenses, cases, etc. For testing purposes, we did what one is supposed to do to avoid inflating percentage scores and took out 2,000 sentences before training started. This is standard practice in all customization: a small sample is held out so that the resulting engine is tested on the kind of text it is likely to encounter but has not seen. Any developer including the test corpus in the training set is likely to achieve very high scores (and will boast about it). But BLEU scores have always been about checking domain engines within an MT system, not about comparing across systems. Among other things, the training sets have always been different, so a corpus containing many repetitions or identical or similar sentences will obviously produce higher scores. We also made sure that no sentences were repeated, and even similar sentences were stripped out of the training corpus in order to achieve as much variety as possible. This may produce lower scores compared to other systems, but the results are cleaner and progress can be monitored very easily. This has been the practice in academic competitions and has ensured good-quality engines over the years.
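To make the hold-out step concrete, here is a minimal Python sketch of how a parallel corpus can be deduplicated and split before training. It is an illustration under stated assumptions, not our production pipeline: the function name `dedupe_and_split`, the normalization used to detect duplicates and the random seed are all hypothetical, and the sketch only removes exact (case-insensitive) duplicates, whereas the near-duplicate filtering described above would need an additional similarity measure.

```python
import random


def dedupe_and_split(pairs, test_size=2000, seed=13):
    """Drop repeated sentence pairs, then hold out a test set.

    pairs: iterable of (source_sentence, target_sentence) tuples.
    Returns (train_pairs, test_pairs); the two lists never overlap.
    """
    seen = set()
    unique = []
    for src, tgt in pairs:
        # Assumption: a pair counts as a repeat when both sides match
        # after trimming whitespace (and lowercasing the source side).
        key = (src.strip().lower(), tgt.strip())
        if key not in seen:
            seen.add(key)
            unique.append((src, tgt))

    if test_size >= len(unique):
        raise ValueError("test_size must be smaller than the deduplicated corpus")

    # Shuffle with a fixed seed so the same test set can be reproduced later.
    rng = random.Random(seed)
    rng.shuffle(unique)
    return unique[test_size:], unique[:test_size]
```

Because the test sentences are removed before training, neither the SMT nor the NMT engine has seen them, so both can be scored on the same held-out set.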

BLEU, the standard automatic metric in SMT, did not detect much difference between the NMT output and the SMT output.

BLEU does not detect the huge difference in perceived quality - WER is a better indicator


However, WER (word error rate) was showing a new and distinct tendency.
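For readers unfamiliar with the metric, WER is the word-level edit distance between the system output and the reference, divided by the number of reference words. The Python sketch below shows the textbook computation; it is not the scoring tool we used, the function name `word_error_rate` is our own choice for illustration, and it assumes both sentences are already tokenized and space-separated (which for Japanese means running a tokenizer first).

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # reference word missing from the output
                curr[j - 1] + 1,     # extra word in the output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one missing word ("the"):
# 2 errors over 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Lower is better, and because every extra or missing word counts as an error, the score can exceed 1.0 when the output is much longer than the reference.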

NMT versus SMT results in Japanese

NMT shows better results in longer sentences in Japanese. SMT seems to be more certain in shorter sentences (trained as a 5-gram system)

This new and distinct tendency is also what we picked up when the output was evaluated by human linguists. We asked Japanese LSP Business Interactive Japan to rank the output conservatively, from A to D: A being human-quality translation, B a very good output that only requires a very small percentage of post-editing, C an average output where some meaning can be extracted but serious post-editing is required, and D a very low-quality translation with no meaning. Interestingly, our trained statistical MT systems performed better than the neural systems in sentences shorter than 10 words. We can assume that statistical systems are more certain in these cases, when they are only dealing with simple sentences with enough n-grams giving evidence of a good matching pattern.

We created an Excel sheet (below) for the human evaluators, with the original English on the left, followed by the reference translation and then the neural translation. Two columns were provided for the rankings, and the statistical output came last.

A table showing original English and Japanese reference translation

Neural-SMT ENJP ranking comparison showing the original English and the reference translation, with the neural ranking to the left and the statistical system to the right
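Once the evaluators have filled in a sheet like this, the A to D rankings can be tallied by sentence length to check the pattern described above. The Python sketch below is only an illustration: the field names (`source`, `nmt`, `smt`), the function `grade_shares` and the three sample rows are invented for the example and are not our evaluation data. The 10-word threshold is the one mentioned in the text.

```python
from collections import Counter

GRADES = ("A", "B", "C", "D")


def grade_shares(rows, grade_key, short_limit=10):
    """Share of each grade for short and long source sentences.

    rows: list of dicts with a "source" sentence and one grade per system.
    grade_key: which system's grade column to tally (e.g. "nmt" or "smt").
    """
    counts = {"short": Counter(), "long": Counter()}
    for row in rows:
        n_words = len(row["source"].split())
        bucket = "short" if n_words < short_limit else "long"
        counts[bucket][row[grade_key]] += 1

    shares = {}
    for bucket, counter in counts.items():
        total = sum(counter.values())
        shares[bucket] = {
            grade: (counter[grade] / total if total else 0.0) for grade in GRADES
        }
    return shares


# Invented sample rows, for illustration only (not real evaluation results).
rows = [
    {"source": "Press the power button.", "nmt": "B", "smt": "A"},
    {"source": "Turn off the device.", "nmt": "A", "smt": "A"},
    {
        "source": "Before connecting the device to the network, "
                  "make sure the firmware is up to date.",
        "nmt": "A",
        "smt": "C",
    },
]

for system in ("nmt", "smt"):
    print(system, grade_shares(rows, system))
```

Splitting on whitespace is a rough word count for the English source; a tokenizer would give slightly different buckets for sentences close to the threshold.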

German, French, Spanish, Portuguese and Russian neural translation results

The most striking improvement came from the human evaluators themselves. The trend pointed to 90% of sentences being classed as A (perfect, naturally flowing translations) or B (containing all the meaning, with only minor post-editing required). The shift is remarkable in all language pairs, including Japanese, moving from an “OK experience” to a remarkable level of acceptance. In fact, only 6% of sentences were classed as D (“incomprehensible / unintelligible”) in Russian, 1% in French and 2% in German. Portuguese was independently evaluated by translation company Jaba Translations.

Human evaluation of neural translation in German, French, Russian

Human evaluation of neural translation in German, French, Spanish, Portuguese, Italian, Russian

This trend is not unique to Pangeanic. Several presenters at TAUS Tokyo pointed to ratings of around 90% for Japanese using off-the-shelf neural systems, compared to carefully crafted hybrid systems. Systran, for one, confirmed that they are focusing only on neural research/artificial intelligence and throwing away years of rule-based, statistical and hybrid efforts.


Systran’s position is meritorious and very forward-thinking. Current papers and some MT providers still resist the fact that, despite all the work we have done over the years, Multimodal Pattern Recognition has gained the upper hand. It was only computing power and the use of GPUs for training that were holding it back. The above article at PangeaMT provides some information about what is changing in the automated translation landscape as we speak, and an example of the first neural papers back in the 1990s, which have guided much of our own R&D.

Neural networks: Are we heading towards the embedment of artificial intelligence in the translation business?

BLEU may not be the best indication of what is happening with the new neural machine translation systems, but it is an indicator. We were aware of other experiments and results by other companies pointing in a similar direction. Still, although the initial results may have made us think that it was of no use, BLEU is a useful indicator. In any case, it was always an indicator of an engine’s behavior, not a true measure of one overall system versus another. (See the Wikipedia article https://en.wikipedia.org/wiki/Evaluation_of_machine_translation).

Machine translation companies and developers face a dilemma, as they have to leave behind their existing research, connectors, plugins and automatic measuring techniques and build new ones. Building connectors and plugins is not so difficult. Changing the core from Moses to a neural system is another matter. NMT produces amazing translations, but it is still pretty much a black box. Our results show that some kind of hybrid system using the best features of an SMT system is highly desirable, and academic research is already moving in that direction, as happened with SMT itself some years ago.

I brought back some useful insights from attending SlatorCon in London. One is that translation buyers are still in dire need of affordable translation solutions that can centralize assets and workflows. Another is that neural MT is taking center stage as the technology that can truly change the game. The most important one, I would say, is that venture capital money is pouring into the translation industry because it sees strong similarities with other industries (advertising, for one) that were disrupted years ago and produced something new.

“There was not a lot of technical innovation in the advertising industry until the late 1990s,” observed Marcus Polke, Investment Director from Acton Capital Partners. “And then came the Internet, which bypassed and marginalized ad agencies as online and offline advertising transformed into a complex landscape.”

Yes, the translation industry is at the peak of the neural networks hype. But looking at the whole picture and at how artificial intelligence (pattern recognition) is being applied in several other areas to produce intelligent reports, trends and data, NMT is here to stay. It will change the game for many, as more content needs to be produced cheaply with post-editing, and at light speed when good machine translation is good enough. Amazon and AliExpress are not investing millions in MT for nothing: they want to reach people in their language with a high degree of accuracy and at a speed human translators cannot match.
