If you run a translation company or translation department, or have some sort of connection with the translation industry, you have noticed without a doubt that MT (or automatic translation) is the flavor of the year in 2010... and will be for many years to come. It has changed, and will keep changing, the way we do things in this industry. Several factors have converged: an unstoppable increase in the globalization of services and support, smaller budgets from buyers, an increase in the international trading of services, and the need for more content, and more multilingual content, in more languages. As of May 2009, there were 487 billion gigabytes of data, increasing 50% a year (Oracle) or doubling every 11 hours (IBM).

There are both exogenous and endogenous reasons why things have reached maturity now and not earlier or later. Among the endogenous factors we may include the fact that the bases already existed, thanks to rule-based technologies (still successfully in use in certain language pairs) and to the coming of age of the CAT tools of the 1990s, which have generated trillions of bilingual sets of data. Nor can we forget the ever-increasing pressure on timely delivery and on cost. However, the key elements of this revolution have been exogenous, not coming from the traditional linguistic community: the maturity and availability of statistical systems as applied to language data (as much as to other areas), which have pushed the boundaries of automatic translation beyond simple formulas or academic articles into tangible software applications. Statistical analysis has brought power to language processing; the availability of massive amounts of bilingual data (already envisaged by Chomsky in the 60's) has also made it possible, as has the emergence of academic and open source initiatives, many funded by the EU or the American government.

I do not want to miss this opportunity to remind us that most of the older (and ubiquitous) rule-based MT providers were born at a time when espionage needed vast amounts of data translated in the form of patents and paperwork, and that the first statistical system received large funding from DARPA, which made Arabic one of its main priorities (the focus of espionage had shifted after the Cold War). Therefore, non-military, non-governmental and free or open source initiatives all receive my praise, and that is one of the reasons (I suspect) why Moses has been so successful and has done so much to bring attention to automatic translation / machine translation. Let us not forget Google's tremendous contribution, even from a very wide, non-specialist scope, dropping rule-based technologies in favor of statistical processes and creating a remarkable translation environment to feed its databases.

A myriad of things have contributed to the new surge in MT. I do not want to leave out fruitful initiatives such as Jaap van der Meer's TDA and TAUS itself, capable of combining collaborative efforts across the Atlantic and across many industries and organizations, with a focus on software (a logical addition), which has enabled data sharing and made data availability a reality. This worthy initiative is also beginning to reach countries producing massive amounts of translation and in need of automation, such as Japan. I will soon speak there on the subject, and on PangeaMT's track record in the customization (I prefer the word "adaptation") of the academic translator Moses into a powerful, useful product from an LSP perspective, at the Japan Translation Federation exhibition on 13th December.
Moses fever has also caught on quickly in Japan, and SourceForge shows over 6,000 downloads worldwide this year alone. Everyone is experimenting. The most popular star, without a doubt, because of its availability and relative ease of use, has been the Moses SMT toolkit, part of the EuroMatrix project (the link will take you to interesting results and tests conducted with several other kits). It is beginning to empower companies to create their own solutions, but many are discovering that implementing an open source solution for MT is not as easy as it seems (even those that come "out of the box"), despite the attractiveness of the powerful words "free" and "open source". DIY'ing MT into one's workflow is also not for the faint-hearted. Due to its popularity and zero cost, Moses has acquired a kind of Messianic status, as the solution for everything, the magic wand that will reduce translation costs upon installation, the solution for producing tens of millions of words instantly. Far from it. As an experienced Moses customizer, I would like to list a few of the advantages and limitations of the system for LSPs and organizations in general, and how much work building around it has taken us here at PangeaMT (a good summary can be viewed online in our recent presentation in Portland).
What Moses can do
- It is absolutely free. Go to SourceForge, search for Moses SMT and download it. (It needs to be installed on a Linux server.)
- It excels at translating close language pairs.
- It provides an excellent environment for testing MT and driving pilots, to actually see how MT works and what is required.
- You can re-use all your bilingual translated assets as training material (see the TMX conversion sketch after this list).
- It requires little power to run: once the system has been trained, it can run even on a run-of-the-mill Linux PC, but remember it is a Linux application with no interface. The training does require high-spec servers.
- It comes with a BLEU score facility to see how well you are doing (a minimal illustration of the metric also follows this list).
- It is a scalable, open program. This means that you can build around it yourself and overcome any limitations by programming your own modules for pre- and post-processing.
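To make the re-use of bilingual assets concrete, here is a minimal sketch of how a TMX translation memory can be flattened into the two aligned plain-text files Moses expects for training, one sentence per line. The file names and language codes (memory.tmx, en-us, es-es) are illustrative, not part of any Moses API, and real TMX files with heavy inline markup will need more careful handling than this.

```python
# A minimal sketch: flatten a TMX file into two aligned plain-text
# files (one sentence per line) for Moses training. Names are
# placeholders; inline TMX tags are kept only as their text content.
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tmx_to_parallel(tmx_path, src_lang, tgt_lang, src_out, tgt_out):
    tree = ET.parse(tmx_path)
    with open(src_out, "w", encoding="utf-8") as fs, \
         open(tgt_out, "w", encoding="utf-8") as ft:
        for tu in tree.iter("tu"):
            texts = {}
            for tuv in tu.iter("tuv"):
                # TMX 1.4 uses xml:lang; older files may use "lang".
                lang = tuv.get(XML_LANG) or tuv.get("lang")
                seg = tuv.find("seg")
                if lang and seg is not None:
                    texts[lang.lower()] = "".join(seg.itertext()).strip()
            if texts.get(src_lang) and texts.get(tgt_lang):
                fs.write(texts[src_lang] + "\n")
                ft.write(texts[tgt_lang] + "\n")

tmx_to_parallel("memory.tmx", "en-us", "es-es", "corpus.en", "corpus.es")
```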
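As for the scoring, Moses ships its own BLEU script; purely to illustrate what the metric computes, the sketch below is a self-contained, single-reference corpus BLEU (clipped n-gram precisions, geometric mean, brevity penalty). It follows the textbook formula and is not the Moses implementation.

```python
# A textbook single-reference corpus BLEU, for illustration only.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu(candidates, references, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        matched = total = 0
        for cand, ref in zip(candidates, references):
            c_toks, r_toks = cand.split(), ref.split()
            c_counts = Counter(ngrams(c_toks, n))
            r_counts = Counter(ngrams(r_toks, n))
            # Clip each n-gram count by its count in the reference.
            matched += sum(min(c, r_counts[g]) for g, c in c_counts.items())
            total += max(len(c_toks) - n + 1, 0)
        precisions.append(matched / total if total else 0.0)
    if min(precisions) == 0.0:
        return 0.0  # any zero precision zeroes the geometric mean
    c_len = sum(len(c.split()) for c in candidates)
    r_len = sum(len(r.split()) for r in references)
    # Brevity penalty punishes translations shorter than the reference.
    bp = 1.0 if c_len > r_len else math.exp(1.0 - r_len / c_len)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(corpus_bleu(["the cat sat on the mat"], ["the cat sat on a mat"]))
```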
What Moses cannot do
- It does not reorganize output, i.e. there are no grammar rules telling the target language where things go. This is one of the reasons why German, Basque or Japanese always get a lower score than the more predictable Romance languages when English is the source, as they split verbal information apart (with English too, to an extent). Agglutinative languages such as Turkish or Finnish clearly do not lend themselves to statistical MT - but, as far as I know, they are not easily dealt with by rule-based systems either, because of their intrinsic characteristics. Only Apertium has had a limited amount of success dealing with Basque.
- Moses only translates from plain text and it only produces plain text. You need to remove all the tags prior to training and before submitting text for translation (see the tag-protection sketch after this list).
- Moses does not translate "out of the box"; it is not a CAT tool and it does not store or update TMs. It requires the training of a) a language model and b) the SMT kit itself (Moses); a compressed view of that pipeline also follows this list.
- Training cannot be done on an ordinary server. Training both the LM and the kit requires a lot of computing power. Typically, you will need a huge server (2-3 are recommended to speed things up) and, of course, a capable programmer.
- Moses does not run on Windows, although we have successfully packaged it in Cygwin on several occasions - this is not the ideal environment, though, and it slows down the process.
- Moses does not include data update features and cannot be updated without retraining. This means that each update with new data requires running the same routine commands as for the full training, with no back-up copy of the previous version. It is hardly a "re-training" but a new, larger version each time.
- There are no terminology or DNT (Do Not Translate) features.
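Because of the plain-text limitation above, every Moses deployment ends up writing pre- and post-processing of its own. The sketch below shows one common workaround, described here as our own illustration rather than a Moses feature: mask each inline tag with an opaque placeholder the decoder should pass through untouched, then restore the tags afterwards. The placeholder scheme (XTAG0X, etc.) is invented for the example, and in practice reordering during decoding can still displace placeholders.

```python
# An illustrative tag-protection scheme for feeding tagged content
# through a plain-text decoder. The placeholder format is our own.
import re

TAG = re.compile(r"<[^>]+>")

def protect_tags(segment):
    """Replace each inline tag with an opaque token; return the
    masked text plus the list of original tags, in order."""
    tags = []
    def mask(match):
        tags.append(match.group(0))
        return "XTAG{}X".format(len(tags) - 1)
    return TAG.sub(mask, segment), tags

def restore_tags(translated, tags):
    """Put the original tags back in place of their placeholders."""
    for i, tag in enumerate(tags):
        translated = translated.replace("XTAG{}X".format(i), tag)
    return translated

masked, tags = protect_tags("Press <b>Start</b> to begin.")
# ... send `masked` through the decoder, then:
# output = restore_tags(decoder_output, tags)
```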
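And to give a flavor of what "training" actually involves, here is a compressed sketch of the standard pipeline, driven from Python for readability. The tool names (SRILM's ngram-count, and train-model.perl and mert-moses.pl from the Moses scripts) are the usual ones, but every path, flag value and file name is a placeholder, and several required options (GIZA++ binaries, working directories, tokenization and corpus cleaning) are omitted for brevity - consult the Moses documentation before attempting a real run.

```python
# A compressed sketch of the Moses training pipeline, under the
# assumptions stated above. All file names and paths are placeholders.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))  # echo each step before running it
    subprocess.check_call(cmd)

# 1. Build a target-side language model (here: a 5-gram with SRILM).
run(["ngram-count", "-order", "5", "-interpolate", "-kndiscount",
     "-text", "corpus.es", "-lm", "es.lm"])

# 2. Train the translation model from the aligned corpus
#    (corpus.en / corpus.es). Word alignment, phrase extraction and
#    scoring all happen inside this one long-running script.
run(["train-model.perl", "--root-dir", "work",
     "--corpus", "corpus", "--f", "en", "--e", "es",
     "--lm", "0:5:es.lm"])

# 3. Tune the feature weights on a held-out set with MERT.
run(["mert-moses.pl", "dev.en", "dev.es",
     "moses", "work/model/moses.ini"])
```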
As you have probably noticed by now, running Moses and putting it to work does not require translators but computational linguists or programmers. We have overcome practically all these limitations with our PangeaMT series. However, the effort should never be underestimated. Many LSPs are currently experimenting and considering whether the effort is worth it at all, given the set-up and running costs. It is, after all, a change from being a service provider to becoming a kind of developer. Some LSPs will do it and create an internal environment which fits their needs; others will prefer CAT tools with an MT interface; others will likely buy some kind of MT software solution to plug into their systems (to minimize self-promotion, I will only mention that we have customized and installed PangeaMT at LSPs too, as you can read on PangeaMT's website). It makes sense if you have many customers in the same vertical (even with different terminology). Most LSPs choose one specialist area or several (legal, patents, automotive, engineering, electronics, software, etc.).

We no longer translate as we did 50, 20 or even 10 years ago. What form will the translation process take 10, 20 or 50 years from now? I envisage MT will play a fundamental part in the process, with data sets being picked to match the job, automated training, and light human post-editing. Frankly, I see every chance of a convergence with speech technologies once we get to an MT-2 stage. We are witnessing the start of a new leap for our industry, one which will affect not only pricing but also output and the relationship between clients and vendors.
Next time you think languages, think Pangeanic - Your Machine Translation Customization Solutions.