Tag Archives: MT

Machine Translation in Short

It is evident that certain documents require a human translator in order to interpret the subtleties of a language. Nevertheless, no matter how skilled a human translator may be, machine translation (also known as automatic translation or MT for short) exceeds the efficiency of a human translator.

Machine translation is generally used for subject-specific cases, and this is where results and productivity rates are spectacularly higher. It allows individuals and companies to tailor their work according to the topic. Consequently, this improves the output and quality of machine translation by cutting down on the number of choices for each word or phrase to be translated.

This form of translation is extremely helpful in areas where formal language is used or phrases are repeated without much variation, such as administrative documents, which do not require the use of colloquial language and expression.

The potential of machine translation has been increasingly explored. In 2009, even President Obama mentioned that “highly precise automatic translation…could reduce the barriers faced in international commerce and collaboration.”

Companies such as Microsoft are pushing this field to the forefront to create the most efficient forms of translation. Simultaneous-translation devices are being explored worldwide, from London to Japan, where large mobile-phone companies like NTT DoCoMo have introduced an apparatus that translates phone calls between English and Japanese, or Chinese and Korean. More about this form of technology can be read in a recent article in The Economist.

Although simultaneous translation seems to be at the height of the translating industry’s innovation, machine translation remains an extremely sought-after technology; Microsoft’s Translator API (application programming interface) alone attracts over 10,000 commercial users. Microsoft’s increasing investment in this field may have to do with the accumulation of information on the Internet and the value of social media: for example, Amazon, Facebook, and Twitter have integrated Microsoft’s Translator Hub into their websites.

Our machine translation division, PangeaMT, has been a leader in developing fast-training and self-updating (DIY SMT) routines since 2011. This allows users to create small engines with their own material (TMX bilingual files) whilst profiting from the language coverage offered by larger engines – with a very rich set of quality features and functionalities.
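
As a rough illustration of the kind of material such an engine starts from, the sketch below reads source/target segment pairs out of a TMX document using only Python's standard library. The sample content, the `read_tmx_pairs` helper and the language codes are assumptions made for illustration; this is not a description of PangeaMT's actual pipeline.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Tiny hypothetical TMX 1.4 sample, for illustration only.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="1.0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Save the file.</seg></tuv>
      <tuv xml:lang="es"><seg>Guarde el archivo.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Close the window.</seg></tuv>
      <tuv xml:lang="es"><seg>Cierre la ventana.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>No translation yet.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs for one language pair.

    Translation units lacking either language are skipped.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Older TMX files may use a plain "lang" attribute instead.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                segments[lang] = "".join(seg.itertext()).strip()
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


if __name__ == "__main__":
    for source, target in read_tmx_pairs(SAMPLE_TMX, "en", "es"):
        print(source, "->", target)
```

Pairs extracted this way are the usual starting point for training or adapting a statistical engine; a real workflow would also need cleaning, deduplication and tokenization, which are left out here.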

Next time you think languages, think Pangeanic
Machine Translation Engines from PangeaMT

follow us on –> Follow manuelhrrnz on Twitter  @Pangeanic   @manuelhrrnz

Pangeanic Christmas Party… All for Translation Automation!!

Let’s change our machine translation and translation automation focus for once and share the happiness of the Christmas period with everyone. All Pangeanic staff work very hard on all types of translation projects and translation consultancy so… it was time to celebrate!

EU reduces translation budget – Machine Translation and Post-editing, one future

by Manuel Herranz

On 21st November 2012, lawmakers approved a report by Stanimir Ilchev, a Bulgarian Liberal MEP, that will change the procedural rules on recording plenary debates. This decision could be a godsend for machine translation and language technology developers as the EU plans to increase translation productivity (or times) by 25% – this being a target in current R&D Language Technology Funding Calls.

Starting from the next plenary, on 10th December, the European Parliament is not going to be required to translate the session into all 23 official languages of the EU. Over the years, this requirement has proved quite costly, and translation can take up to four months. However, a bias towards the English language has been pointed out in many circles and instances. For example, Jean Quatremer, a renowned French political journalist from the French daily Libération, complained about the official press statements containing the Commission’s economic recommendations to member states, published on 30th May 2012. These statements had been eagerly awaited by the press because of the euro debt crisis, but initially were only made available to journalists in English. The translations into other languages followed a few hours later that day. Mr. Quatremer said that the initial monolingual release provided the Anglo-Saxon press with an “incredible competitive advantage” and threw into doubt the institutions’ democratic legitimacy, making his position very clear in a strongly worded blog entry.

From December 2012, the EU legislature will only record proceedings in the original language of the speaker. Nevertheless, the proceedings will still be required to be translated into a particular language if there is a request by a member state. However, in the European Parliament many official press statements are currently published only in English and a very limited number of them are translated into other languages – despite huge efforts and money invested in translation services and, increasingly, in machine translation technology.

“This is one of our struggles – that the press releases and all publications and communications with society (tenders, contracts, etc.) are translated,” said Miguel Angel Martinez Martinez, the Parliament’s Vice-President in charge of multilingualism.

Numbers speak for themselves: 72% of all EU documents are drafted in English, with French coming a distant second with 12%. Only 3% are originally drafted in German. On the other hand, 88% of the users of the Commission’s Europa website speak English. In reality, “providing documents in English, French, German, Spanish and Italian would cover close to 100% of all the EU’s linguistic needs”, said the DG Translation Director-General Lönnroth, speaking at a debate hosted by the Centre for European Policy Studies on 22nd February. The Union “will just have to cope” with increasing linguistic pressures brought on by future enlargements because “no decision-maker would dare to touch the main principles” of the EU’s language policy.

Mr Ilchev rejected proposals to translate the sessions only into English, as it would “appear linguistically unjust”. In the current EU, having 23 official languages means 506 translation and interpreting combinations, said Translation Director-General Lönnroth, a figure which can increase significantly when Croatia and Serbia join, and even Turkey, in the foreseeable future.
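
The 506 figure follows from counting every ordered source/target pair among 23 languages (23 × 22). The short sketch below reproduces it and shows how quickly the count grows; the assumption that each accession adds exactly one new official language is mine, made for illustration only.

```python
def translation_combinations(num_languages):
    """Number of ordered (source, target) pairs among num_languages languages."""
    return num_languages * (num_languages - 1)


if __name__ == "__main__":
    # 23 is the figure quoted in the text; 24 to 26 assume one new
    # official language per additional member state (an illustrative assumption).
    for n in (23, 24, 25, 26):
        print(n, "languages ->", translation_combinations(n), "combinations")
```

Under that assumption, a single additional language already pushes the count from 506 to 552, since every existing language gains a new pairing in both directions.
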
Acknowledging he is not a “language fanatic”, the director-general said he thinks “about how to reduce the workload every day” as it was “not in the taxpayer’s interest” to provide every language combination. Lönnroth said back in February that “it would be easier if everybody accepted that English and French were the main EU languages”. This is (partially) what is going to happen, although Mr. Ilchev gives assurances that the initiative will not harm multilingualism, a principle enshrined in EU treaties: “of course this principle is not in question and everyone can listen to our debates in plenary in their own language” – through interpretation. Some of the EU’s research funding actually goes into technology solutions and research. For example, the SUMMAT project aims at creating an online service for subtitling by machine translation.

NTT DoCoMo prepares Japanese machine translation through Android

Japan is unique in many ways, and this is reflected in its culture and its challenging language. Japanese, controversially classified as an Altaic language, is spoken by around 127 million people. Its intrinsic characteristics make it a challenge for machine translation and other forms of translation automation, although Pangeanic, in collaboration with Toshiba, has reported several advances in hybrid MT (as published by the Asian Association of Machine Translation in 2011 and presented at the Japan Translation Festival; see presentation here).

Making calls to other countries is a challenge for Japanese speakers: locals often don’t have much choice but to learn someone else’s language or hope there’s a Japanese speaker on the other end of the line.

All going well, with NTT DoCoMo’s planned Hanashite Hon’yaku automatic translation service, international calls will be as comfortable as phoning a store in Nagano. As long as a subscriber has at least an Android 2.2 phone or tablet on the carrier’s moperaU or sp-mode plans, the service will automatically convert spoken Japanese to another language, and reverse the process for the reply, whether it’s through an outbound phone call or an in-person conversation.

The service is scheduled to operate from 1st November, when it will translate from Japanese to Chinese, English and Korean. Machine translation from Japanese into European languages (French, German, Italian, Portuguese and Spanish) plus two more Asian languages (Indonesian and Thai) will be added for this application in late November, raising the number of non-Japanese languages to 10, according to NTT DoCoMo’s press release.

If you are not so patient, NTT DoCoMo will provide a holdover on October 11th through Utsushite Hon’yaku, a free Word Lens-like augmented reality translator for Android 2.3 that can convert text to or from Japanese with a glance through a phone camera.

The app will be available free of charge. Users pay call and data charges for phone-to-phone conversations and translation data for screen text and voice readouts. Only data charges apply for face-to-face conversations, since no call is required. Subscription to DOCOMO’s “sp-mode” or “moperaU” connection service is required.

Utsushite Hon’yaku translates short written text between Japanese and either English, Chinese or Korean.

Translation is virtually instantaneous after the device’s camera captures the text. This commercial version of Menu Translator, which DOCOMO is trialing in Japan until October 31, will translate words and phrases not only in menus, but also street signs, signboards and more. Translation from Japanese also is possible, so DOCOMO expects the app to be quite useful for foreign people visiting Japan.

The Utsushite Hon’yaku app will be available free for download (data charges may apply). Usage will not incur any transmission fee since the translation process does not require network connection. It can be used on any smartphone or tablet equipped with an outer camera and running Android 2.3 or higher.

Post-editing of machine translation: the skills and the views of the experts

Ramping up to his role as moderator in the forthcoming Proz post-editing debate on 24th September, Jeff Allen (Engineering Tools Integration Expert) from SAP exchanged views with Pangeanic’s Manuel Herranz as practitioners and implementors of machine translation solutions.

Jeff has been a champion of machine translation and post-editing for decades, with hands-on experience with practically every MT technology: rule-based, knowledge-based and statistical machine translation. His practical approach to language technologies and interest in humanitarian causes led him to deploy a first publicly available Creole MT solution during the Haiti crisis in 2010. (Click here to see a video on how Jeff was able to create a basic machine translation system for aid relief even with little data: The successes and challenges of making low-data languages available in online automatic translation portals and software.)

It looks like MT has become a “must-have” technology for all language companies. Some, like Pangeanic, decided to make use of open-source Moses to develop their own flexible and modular systems. Looking back at the years of development, off-the-shelf solutions and customized solutions, Jeff’s views are clarifying in several ways. “Systems have now become ready for mass consumption. In the public-facing arena, we dealt with non-customized systems for decades and this, in part, gave post-editing and the whole machine translation experience a bad press”.

Regarding the buzz, frenzy and hype about machine translation nowadays, Jeff thinks that “perhaps we started too early marketing it, whatever technology we look at, SMT, rule-based MT, post-editing, building dictionaries … we just could not wait for the market to mature with the need but for a moment in time. Now, the technology has become visible, its strengths and viability can be proven”.

However, the same danger and the same mistake seem to be occurring now with post-editing as once occurred with machine translation, and even with translation memory a decade before: lack of customization and preparation. “Google Translate has become the reference for post-editing, and MS Word the tool. That’s it. And that’s terrible. Both lack the functionalities that can make the whole MT experience successful. The engine is not customized but a generalist (a mistake that builds hopes high with “ready made” systems). There is no chance of pre-processing formats or tags, nor of making my terminology prevail. The same with Word – little can be done to spot errors in consistency, terminology, moving the words around, etc., apart from search and replace. Thus, translators keep referring to the cheap post-editing jobs being offered in marketplaces, which make things sound like “do it as before but cheaper”. The problem is that there are few specialists capable of making systems fully customized. My advice is that the same logic that applies to a translation job offer should apply to a post-editing job offer. Just ask the same questions you would ask an LSP offering you a translation job and you will soon know if the person/company knows what they are doing. If they know the client and the text, and have prepared good TMs and glossaries, etc., the project manager will soon give a clear and quick answer and the information you need. The same with a post-editing job: if the company has trained the engines, done the homework customizing dictionaries and applying terminology and is offering you clear post-editing instructions, then you know they are applying machine translation technology well and the post-editing effort and compensation will be fair.”

In short, and with years of translation industry experience behind them, Jeff Allen and Manuel Herranz have a perspective on translator resistance to the technology. It is not so dissimilar to the resistance towards Translation Memory systems in the 1990s. Eventually, the technological change brought about by machine translation will benefit users and translation consumers as a whole, making translation more and more ubiquitous. What we need are clear guidelines, scoring systems, and more experts… as has happened with TM systems.

7th Machine Translation Marathon 2012: More Open Source Machine Translation

The MosesCore consortium, an EU Project aimed at promoting open source machine translation,  is sponsoring the Machine Translation Marathon 2012. The event will be held at the University of Edinburgh, Scotland, and will take place on 3-8 September 2012.

The Machine Translation (MT) Marathon gathers researchers, developers, students and also users of machine translation.  Published results are often a source of innovation and research among the increasingly avid MT community.

Currently in its 7th edition, the MT Marathon travels to different European universities involved in machine translation development. For example, the 6th Machine Translation Marathon took place in Trento, Italy, at the Fondazione Bruno Kessler (FBK) in September 2011 and was promoted by the Moses EU programmes EuroMatrix and EuroMatrixPlus. The first MT Marathon also took place at the University of Edinburgh seven years ago.

The 7th edition comes back to Edinburgh and is organized  by the Statistical MT group of the School of Informatics of the University of Edinburgh.

This week long event will include:

* Lectures and labs on machine translation, ranging from beginners’
tutorials to showcase talks by leading researchers. Everyone can
learn or strengthen their knowledge.
* Technical talks about open source tools for MT.
* Week-long open source machine translation  hacking projects, led by
experienced developers and researchers.

There are several ways you can participate in the MT Marathon:

* attending lectures and labs: these range from beginners’ tutorials to
showcase talks by leading researchers.
Everybody can learn or strengthen their knowledge!
* attending technical talks about open-source tools for MT
* taking part in open source MT hacking projects, led by experienced
developers and researchers.

You can download a PDF copy of the program by clicking this link.

For Europe, no (new) CAT tool is good enough

by Manuel Herranz

And why should it be? Decisions coming from Brussels tend to be misunderstood, shallowly analyzed or directly criticized whichever way the wind blows. Let us remember 2010’s first ever report on the Size of the Translation Industry in Europe, which gave a very comprehensive view of the current status, country by country, with facts and figures on several areas, even if revenues could only take into account certain activities. It also contained words and forecasts from personalities in the industry. Liking reports is like choosing a favourite colour – everyone has their own preference. Nevertheless, it provided detailed information where there was none.

However, the decision not to award the contract to any CAT tool in the market points to a very clear state of affairs in the language industry: despite massive innovations in computing (from open cloud to internal or managed clouds: Eucalyptus (compatible with Amazon EC2), OpenNebula, the solid Ubuntu Enterprise Cloud and the latest, which I envisage will be a winner, OpenStack), the advent of SaaS models and even great advances in machine translation, no existing tool is exciting enough to justify a 5M€ expenditure of taxpayers’ money.

The story goes like this: the EU’s Directorate General for Translation (DGT) published a Call in early 2010 to replace the existing CAT system (Trados 2007) with more modern technology. It is to be assumed that all the major players in the CAT market will have put in a tender according to specifications. The latter may be more or less to the bidders’ liking, but every administration, so long as it is the repository and granter of public funds, has to administer them wisely. Given the fact that there are some 4,600 staff translators working at the EU, and that the EU is by far the largest producer and consumer of translation services, the backing of one option over others would have set a massive market trend for the years to come.

I had the pleasure of sharing open-source MT solutions as an invited speaker last April in Brussels. I saw first-hand the internal drive to introduce Moses and Apertium as solutions which can set a minimum standard upon which to build a solution (or at least set a trend). I was particularly impressed by the work done internally by the Portuguese Department, which with minimum staff and resources was able to set up a small Moses-based solution that fitted their needs, giving preference to translation domains by choosing translation tables. They also did this following a TMX workflow and TM update with penalisation, which reflects our early stages in MT. I could only congratulate and praise their work. Other presentations from Dr Sharon O’Brien and Dr Andreas Eisele pointed out that translator acceptance is a key point, as productivity increases are nowadays beyond question (whether they are 30%, 50% or 300% still remains the case study for in-domain machine-translation presentations). Progress made internally at the EU was presented and reported at TAUS Barcelona (see previous blog entry for a summary).

Going back to the decision not to upgrade from the existing CAT tool, the message is clear:

  • There is no justifiable leap in quality in existing CAT tools.
  • CAT leveraging, as a technique to make the most from previous translations, has reached its ceiling.
  • There is a lack of on-line help documentation in CAT tools.
  • There is hardly any justification for more CAT tools unless they offer something truly revolutionary. (Now, there are several new tools which do make sense at LSP and corporate level, but not to the extent and cost the Directorate General for Translation required).

Some have seen a dark hand and there has been controversy – finally settled by the chairwoman of the committee explaining why there was no chosen one among the candidates. The explanation was made public in The Tool Kit. There was simply a lack of adherence to the requirements and no true innovation.

Personally, my favourite CAT tool has been Swordfish for a long time: it is nimble, agile, easy-to-use, built on and favouring open standards, compatible with all major formats and it contains good QA features. The latest version even adds a powerful LAN TM collaborative option. However, its useful Google Translate plug-in will probably be hit by Google’s decision to deprecate its free API, something that had been on the cards for some time. At a fraction of the cost of other tools, it gets the job done pretty well. Sadly, Swordfish is not based in Europe and most probably did not enter the tender.

To conclude, I am not only justifying the DGT’s decision not to award a 5M€ contract to any CAT tool provider (not even the latest versions of Trados in the shape of SDL 2009 made it). I am saying that it was the only likely outcome.

Why? Look around at the new machine-translation offerings and how these will become essential, and perhaps part of everyday life, in a matter of years (check a possible future by Andrew Joscelyne where machine-translation engine creation becomes so easy and so common as to be the main work LSPs have to offer). Look at how the gathering of language resources is being automated by initiatives such as Panacea or the work done for lesser-resourced languages by Let’sMT and Tilde in particular.

Any doubts? Think about the advantages of an integrated, stable XLIFF workflow for documentation which not only leverages your existing content from a fast database system (not a “translation memory”), but also uses it to create your own MT eco-system in the background, growing with every new piece of translated content you feed into it. Apologies for the DIY SMT self-promotion :) .

So, how long until we see a really ground-breaking, open-source CAT+MT ecosystem tool?


Multilingual Web in Pisa & Translator MT Awareness at EU

by Manuel Herranz

April was a busy month within the Northern-hemisphere, conference-rich season. Gala took place in Lisbon and, although Pangeanic – PangeaMT could not participate, there were plenty of other specialist venues to choose from.

MultilingualWeb took place in Pisa on 4th and 5th April. The venue was true to its premise to discuss standards and best practices for the Multilingual Web and gathered a good, specialist and multi-disciplined crowd. The mixture was well planned: the sessions opened with a keynote address by Oreste Signore and Kimmo Rossi, as well as a report from Ralf Steinberger about the multimedia news reports service from the JRC (the organisation has made all presentations and videos available, so the blog entry was worth the wait). You can check all presentations here.

Not being an expert on the creation or development of the web, but being a keen open-standards supporter, I found it very enlightening to witness the discussions, as I am sure the localizers session (mostly dealing with interoperability problems faced at localization time) was to creators and developers. It was indeed enlightening to hear Richard Ishida talk about HTML5 and all the novelties it will bring about. Richard is the head of the W3C Internationalization Department, and possibly the only one at the conference truly living between both worlds.

The exchange of ideas and experiences was extremely useful: there were many thinking heads from many different areas. SEO was an area I learnt a lot about thanks to the time I spent with Gustavo Lucardi, from Trusted Translations. His blog-report entry on events on the second day should not be missed by fellow translation professionals. Trusted Translations has set new standards for the rest of us on SEO and Multilingual SEO.

On the first day, David Filip (now at Limerick) provided some food for thought when he said that TMX was dead, now that LISA is no more, and that XLIFF is the future (a more standard and better interchangeable XLIFF 2.0, let’s say). I agree XLIFF is the foreseeable and recommended interchange format, and a lot more information and data could be exchanged if it were properly promoted (I wish it were) and did not become “some kind of property of” a vendor, as happened when SDL created its own version (sdlxliff). However, TMX 1.4b has become the de facto exchange format and it may take some time before we see the back of it, just as MP3 did not kill the CD, simply because so many things are compatible with it, from PC CD-ROM drives to in-car CD players (my reply). Finding out about new initiatives like InterOperability Now! (Sven C. Andrä) and M4Loc was very refreshing for us at PangeaMT, as we base our offering on open standards to ease free information exchange. They show an increasing interest in developing tools and standards that ease the rapid exchange of information, to which machine translation cannot but contribute.

Finally, a word about a busy trip which took me straight to Brussels from Pisa to participate in a video-linked forum (Luxembourg and Strasbourg) aimed at raising translator awareness and acceptance of machine translation at European Institutions. Translators’ fears and mistrust about MT are almost universal (check a recent tirade at Proz about a new price scheme introduced among freelancers by a company which has adopted machine translation purely as a “price down” strategy). If LSPs like to talk about “working together” and “partnering” with their clients long-term, they need to think about how to work better with their translators. Many information sessions/seminars are still required at institutions and organizations, and not just at management level, although Google Translate’s ubiquity has done a lot of preparation work for what is to come in the next few years. At least machine translation is not a black box any more. Translators, as practitioners, not only need good MT output, but also an understanding of what is behind it, how they can make a difference and, above all, a well-planned workflow. Post-editing is fun and those who get used to it never go back to a TM-based workflow. A lot of evangelism is still required during this transitional period, and those closest to the output need to feel in control of things and see for themselves that they can influence MT with their feedback and are more necessary than ever.

It was good to see first-hand that the EU has embraced open-source Moses (and Apertium) and now has a whole team dedicated to developing its own solution. With more than 420 language combinations into and out of complex and challenging languages (from Semitic Maltese to Romance, Germanic, non-related Baltic, and even non-Indo-European Finnish and Hungarian), often under-resourced were it not for the EU as a producer, this is a task worth keeping a close eye on, particularly after the EU’s parting from Systran.

I should particularly praise the Portuguese presentation there, by Hilàrio Fontes and the lessons learnt. The Portuguese Department at the EU has been a leader in MT adoption within the EU and has been able to develop its own Moses customization with limited resources and a lot of hard work and goodwill. I found many things similar to our own development and how existing tools can be used to integrate MT within existing production environments.

The organizers of Multilingual Web in Pisa have kindly made one of my presentations available (below). I have to thank Richard Ishida for the good co-ordination work, as well as the co-hosts, Istituto di Informatica e Telematica and Istituto di Linguistica Computazionale, Consiglio Nazionale delle Ricerche.


Open Standards in Machine Translation

Moses is not the new Messiah

by Manuel Herranz

If you run a translation company or translation department or have some sort of connection with the translation industry, you have noticed without a doubt that MT (or automatic translation) is the flavour of the year in 2010… and will be for many years to come. It has changed and will change the way we do things in this industry. Several factors have contributed: an unstoppable increase in the globalization of services and support, smaller budgets from buyers, an increase in international trading of services and the need for more content and more multilingual content in more languages. As of May 2009, there were 487 billion gigabytes of data which were increasing 50% a year (Oracle) or doubling every 11 hours (IBM).

There are both exogenous and endogenous factors for things to reach maturity level now and not earlier or later. Among the latter factors we may include the fact that the bases did already exist due to rule-based technologies (still successfully in use in certain language pairs) and to the coming of age of 1990s CAT tools that have generated trillions of bilingual sets of data. We cannot forget the ever-increasing pressure on timely delivery and on cost. However, the key elements of this revolution have been exogenous, not coming from the traditional linguistic community: the maturity and availability of statistical systems as applied to language data (as much as other areas), which have pushed the boundaries of automatic translation beyond simple formulas or academic articles into tangible software applications. Statistical analysis has brought power to language processing, and the availability of massive amounts of bilingual data (already envisaged by Chomsky in the 60s) has also made it possible, as has the emergence of academic and open source initiatives, many funded by the EU or the American Government. I do not want to miss this opportunity to remind us that most of the older (and ubiquitous) rule-based MT providers were born at a time when espionage needed vast amounts of data translated in the form of patents and paperwork, and that the first statistical system received large funding from DARPA to make Arabic one of its main priorities (the focus of espionage had shifted after the Cold War).

Therefore, non-military, non-governmental and free or open source initiatives all receive my praise, and that is one of the reasons (I suspect) why Moses has been so successful and has done a lot to bring attention to automatic translation / machine translation. Let us not forget Google’s tremendous contribution, even from a very wide, non-specialist scope, dropping rule-based technologies in favour of statistical processes and creating a remarkable translation environment to feed its databases. A myriad of things have contributed to the new surge in MT. I do not want to leave out fruitful initiatives such as Jaap van der Meer’s TDA and TAUS itself, capable of combining collaborative efforts across the Atlantic and across many industries and organizations but with a focus on software (logical addition), which has enabled data sharing and has made data availability a reality.

This worthy initiative is also beginning to reach countries that produce massive amounts of translation and need automation, such as Japan. I will soon speak there, at the Japan Translation Federation’s exhibition on 13 December, about the subject and about PangeaMT’s experience in customizing (I prefer the word “adaptation”) the academic translator Moses into a powerful, useful product from an LSP perspective. Moses fever has also caught on quickly in Japan, and SourceForge shows over 6,000 downloads worldwide this year alone. Everyone is experimenting.

The most popular star, without a doubt, because of its availability and relative ease of use, has been the Moses SMT toolkit, part of the EuroMatrix project (the link will take you to interesting results and tests conducted with several other kits). It is beginning to empower companies to create their own solutions, but many are discovering that implementing an open source solution for MT is not as easy as it seems (even with those that are “out of the box”), despite the attractiveness of powerful words such as “free” or “open source”. DIY’ing MT into one’s workflow is not for the faint-hearted.

Due to its popularity and zero cost, Moses has acquired a kind of Messianic status as the solution for everything: the magic wand that will reduce translation costs upon installation, the tool that will produce tens of millions of words instantly. Far from it. As an experienced Moses customizer, I would like to list a few of the advantages and limitations of the system for LSPs and organizations in general, and to show how much work building around it has taken us here at PangeaMT (a good summary can be viewed online in our recent presentation in Portland).

What Moses can do

  1. It is absolutely free. Go to SourceForge, search for Moses SMT and download it. (It needs to be installed on a Linux server.)
  2. It excels at translating close language pairs.
  3. It provides an excellent environment for testing MT and driving pilots, to actually see how MT works and what is required.
  4. You can re-use all your bilingual translated assets as training material.
  5. It requires little power to use (once the system has been trained, it can run even on a run-of-the-mill Linux PC, but remember it is a Linux application with no interface). The training does require high-spec servers.
  6. It comes with a BLEU score facility to see how well you are doing.
  7. It is a scalable, open program. This means that you can build around it yourself and overcome any limitations by programming your own modules for pre- and post-processing.
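
To give a feel for what the BLEU score in point 6 measures, here is a minimal sketch in Python. It is not the scoring facility that ships with Moses: it is a simplified, single-reference, sentence-level version with no smoothing, and the function names (`ngrams`, `bleu`) are chosen for this illustration only. Real evaluations are computed over a whole test set, usually with proper tokenization.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: one reference, whitespace tokens, no smoothing."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: an n-gram is credited at most as many times
        # as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: punish output that is shorter than the reference.
    if len(cand) > len(ref):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return penalty * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(bleu("the cat sat on the mat", reference))
    print(bleu("the cat sat on the mat today", reference))
```

An exact match scores 1.0, and any extra or missing word lowers the score, which is why BLEU is useful for tracking progress between trainings but says little about whether a single sentence is acceptable to a human reader.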

What Moses cannot do

  1. It does not reorganize output, i.e. there are no grammar rules telling the target language where things go. This is one of the reasons why German, Basque or Japanese always get a lower score than the more predictable Romance languages when English is the source, as they split verbal information apart (as English does, to an extent). Agglutinating languages such as Turkish or Finnish are clearly not well suited to statistical MT, but as far as I know they are not easily dealt with by rule-based systems either, because of their intrinsic characteristics. Only Apertium has had a limited amount of success dealing with Basque.
  2. Moses only translates from plain text and it only produces plain text. You need to remove all the tags before training and before submitting text for translation.
  3. Moses does not translate “out of the box”, it is not a CAT tool and it neither stores nor updates TMs. It requires the training of
    a) a Language Model
    b) the SMT kit (Moses itself)
  4. Training cannot be done on an ordinary server. Training both the LM and the kit requires a lot of computing power. Typically, you will need a huge server (2-3 recommended) to speed things up and, of course, a capable programmer.
  5. Moses does not run on Windows, although we have successfully packaged it in Cygwin on several occasions. This is not the ideal environment, though, and it slows the process.
  6. Moses does not include data update features and cannot be updated without retraining. This means that each update with new data requires running the same routine commands as for a full training, with no back-up copy of the previous version. It is hardly a “re-training”; it is a new, larger version each time.
  7. There are no terminology or DNT (Do Not Translate) features.
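
Points 2 and 7 are typical of the modules one ends up building around the engine. The sketch below is one possible approach in Python, not a description of how PangeaMT does it: inline tags and Do Not Translate terms are swapped for numbered placeholders before the text goes to the engine and are put back afterwards. The placeholder format (`__PH0__`) and the function names `protect` and `restore` are assumptions made for this example, and an engine may still move or mangle placeholders, so real pipelines need extra checks.

```python
import re

# Anything that looks like an inline tag, e.g. <b>, </b>, <ph id="1"/>.
TAG_PATTERN = r"<[^>]+>"


def protect(text, dnt_terms):
    """Replace tags and DNT terms with placeholders; return text and the stash."""
    stash = []

    def hide(match):
        stash.append(match.group(0))
        return f"__PH{len(stash) - 1}__"

    # Longest terms first so that a short term never splits a longer one.
    terms = sorted(dnt_terms, key=len, reverse=True)
    pattern = re.compile("|".join([TAG_PATTERN] + [re.escape(t) for t in terms]))
    return pattern.sub(hide, text), stash


def restore(text, stash):
    """Put the original tags and DNT terms back in place of the placeholders."""
    for index, original in enumerate(stash):
        text = text.replace(f"__PH{index}__", original)
    return text


if __name__ == "__main__":
    source = "Click <b>Save</b> in PangeaMT now."
    masked, stash = protect(source, ["PangeaMT"])
    print(masked)                  # Click __PH0__Save__PH1__ in __PH2__ now.
    # The masked string is what would be sent to the MT engine.
    print(restore(masked, stash))  # Click <b>Save</b> in PangeaMT now.
```

The same idea extends to terminology: a glossary lookup can run as a further post-processing step once the placeholders have been restored.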

As you have probably noticed by now, running Moses and putting it to work does not require translators but computational linguists or programmers. We have overcome practically all these limitations with our PangeaMT series. However, the effort should never be underestimated. Many LSPs are currently experimenting and wondering whether the effort is worth it at all, given the set-up and running costs. It is, after all, a change from being a service provider to becoming a kind of developer. Some LSPs will do it and create an internal environment that fits their needs, others will prefer CAT tools with an MT interface, and others will likely buy some kind of MT software solution to plug into their system (to minimise self-promotion, I will only mention that we have customized and installed PangeaMT at LSPs, too; you can read about it on PangeaMT’s website).

Building your own MT makes sense if you have many customers in the same vertical (even with different terminology). Most LSPs specialize in a couple of areas (legal, patents, automotive, engineering, electronics, software, etc.).

We no longer translate as we did 50, 20 or even 10 years ago. What form will the translation process take 10, 20 or 50 years from now? I envisage that MT will play a fundamental part in the process, with data sets being picked to match, automated training and light human post-editing. Frankly, I see every chance of a convergence with speech technologies once we get to an MT-2 stage.

We are witnessing the start of a new leap for our industry which will affect not only pricing, but output and the relationship between clients and vendors.

Next time you think languages, think Pangeanic

Follow manuelhrrnz on Twitter

Pangeanic’s participation in TAUS Copenhagen 2010

by Elia Yuste

TAUS has been tracking the exciting experiences of companies pioneering a radically new MT engine training space for the last year or so. Pangeanic is one of the most outstanding cases, and so earlier this year we were presented as the first LSP to create a new business stream with TAUS Data Association (TDA) data. Then PangeaMT, Pangeanic’s technological division geared towards customized MT solutions and consulting, was invited to take part in the proof-of-concept of the TAUS MT Trainer and to present its results at the TAUS Executive Forum in Copenhagen in late May 2010.

The idea behind this MT Trainer, a web-based facility from TAUS TDA that will materialise within the current year, is twofold: first, to foster pro-active adoption of TDA data for MT engine training; and second, to connect MT service commissioners and providers under the TAUS umbrella, whereby the former may submit their data files (reference files for engine training and files for translation) and the latter would turn around the MT output in a short time. The MT Trainer has a counterpart facility called MT Evaluator, which lets the commissioner or client evaluate the uploaded MT output by means of standard metrics-based figures.

To test the viability of this twofold initiative, the so-called MT Trainer pilot was discussed among the selected partners and then launched about two weeks before the Copenhagen meeting. Would it be possible to automate the workflow for MT customization using client data and data from TDA? On the one hand, Adobe, eBay and McAfee were the three prospective MT commissioners seeking trained engines and metrics to measure the quality of output. On the other, Languagelens, PangeaMT and Tilde were the three selected MT companies. All of us were able to turn around customized MT engines in 24 hours or less, and the output was measured for quality using BLEU scores. In the specific case of Pangeanic, the challenges of speed and acceptable quality were met without any problem.

If these two TDA service offerings, the MT Trainer and Evaluator, are well accepted and regularly used by members, they will encourage more data uploads and downloads and reinforce the usefulness and applicability of relevant, domain-specific data sharing for MT training. This should also lead to a much-desired increase in memberships and in overall member pro-activity within TAUS. For Pangeanic it will mean more visibility in the MT arena, quicker access to high-calibre clients, whose content and domain specificities are incidentally already familiar to us, and a controlled workspace in which to offer our MT services.

Apart from the MT Trainer & Evaluator proof-of-concept, the Copenhagen event gave rise to lots of fruitful discussions among MT practitioners and newcomers. In our case, apart from describing the ins and outs of our engine training experience for eBay under the MT Trainer pilot scenario, we engaged in interesting conversations about how PangeaMT has been able to overcome Moses’ shortcomings. Our TMX filter and inline mark-up parser were acclaimed as features that are much needed in our industry and that have made us stand out from the (S)MT crowd.
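
For readers wondering what a TMX filter involves, the following Python sketch shows the basic idea under simplifying assumptions; it is not PangeaMT’s filter. It reads translation units from a TMX document, drops inline mark-up elements such as `bpt`, `ept` and `ph` together with the codes they carry, and returns plain-text sentence pairs of the kind an SMT engine is trained on. The function names `plain_text` and `tmx_to_pairs` are invented for this example, and elements that wrap translatable text (such as `hi`) would need extra handling.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def plain_text(seg):
    """Return the text of a <seg>, skipping inline elements and their codes."""
    parts = [seg.text or ""]
    for child in seg:
        # Keep only what follows each inline element, not what is inside it.
        parts.append(child.tail or "")
    return "".join(parts).strip()


def tmx_to_pairs(tmx_text, src_lang, tgt_lang):
    """Extract (source, target) plain-text pairs for one language pair."""
    src_lang, tgt_lang = src_lang.lower(), tgt_lang.lower()
    pairs = []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            # Older TMX versions use a plain "lang" attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segments[lang] = plain_text(seg)
        # Keep the unit only if both languages are present and non-empty.
        if segments.get(src_lang) and segments.get(tgt_lang):
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs
```

Writing the two sides of each pair to parallel files, one sentence per line, gives the kind of plain-text corpus that training expects.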

Other takeaways from the TAUS Copenhagen event were the convergence of MT, open platforms and contexts of application (e.g. in corporate support), learning more about the experiences of TAUS TDA members, and gathering the collective wisdom that came out of forward-looking table discussions on a number of hot language industry topics. A full report about the event can be found here and also downloaded from the TAUS website.
