MT wars | Pangeanic

Written by Manuel Herranz | 10/03/10
I must thank everyone who contributed to the LinkedIn discussion on whether "ownership" or "customization" of MT by LSPs will be a trend in the near future (http://bit.ly/dkQ7YD). For those who missed it, or who would like to contribute other perspectives on what the future holds for MT and for LSPs and general users of translation services (professional or casual), here is a copy of my LinkedIn post.

Thanks to all for your contributions; there are some interesting perspectives there. My intention when mentioning that LSPs will need to "own" their MT technology was not to say that they will necessarily have to "develop" it. I quote: "4. It also signals that provided you have the technological resources and will, it is essential to own, develop or at least customise your MT solution for your clients. Pressure will mount on the smaller and medium-sized LSP’s..."

Full-scale development is truly outside the scope of most companies, and even the larger ones (Lionbridge, SDL) have not developed their own technology but bought it (Systran and then IBM in Lionbridge's case, Language Weaver in SDL's case).

Where I see the trend is in companies customizing MT for their clients. Lori Thicke's Lexcelera, for example, has done pretty good work with rule-based (RB) engines in many areas, and there were other pioneers with RB at different levels (Jeff Allen, whom I had the pleasure of meeting some months ago, was an MT evangelist even at times when interest in it was negative, and he has done amazing work with rule-based engines).

I do believe, however, that the future of customization requires some kind of statistical system at its core. I can foresee many experimenting with Moses and statistics up to a point but, as happens with most open source software, there are limitations in Moses' results. One soon starts to build around it, before and after, with pre- and post-processing, until you end up with a system which may have Moses at its core but has just as many other things around it - depending, of course, on the language combination, too.
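To make "building around" the engine concrete, here is a minimal Python sketch of a pre- and post-processing wrapper. It is only an illustration, not Pangeanic's pipeline or part of Moses: the names `preprocess`, `postprocess` and `translate_with_wrapping` are hypothetical, and `engine` stands for any function that takes a string and returns its translation. The idea shown is simply to hide inline tags and URLs behind placeholders before the engine sees the text, and to put them back afterwards.

```python
import re
from typing import Callable, List, Tuple

# Inline markup (e.g. <b>, </ph>) and URLs that the engine should pass
# through untouched. The pattern is an assumption for this sketch.
PROTECTED = re.compile(r"</?[A-Za-z][^<>]*>|https?://\S+")


def preprocess(text: str) -> Tuple[str, List[str]]:
    """Normalize whitespace and swap protected spans for placeholders."""
    saved: List[str] = []

    def stash(match: "re.Match") -> str:
        saved.append(match.group(0))
        return f"__PH{len(saved) - 1}__"

    normalized = " ".join(text.split())
    return PROTECTED.sub(stash, normalized), saved


def postprocess(text: str, saved: List[str]) -> str:
    """Restore the original protected spans after translation."""
    for index, original in enumerate(saved):
        text = text.replace(f"__PH{index}__", original)
    return text


def translate_with_wrapping(text: str, engine: Callable[[str], str]) -> str:
    """Run a (hypothetical) engine between the two processing steps."""
    prepared, saved = preprocess(text)
    return postprocess(engine(prepared), saved)
```

In practice the steps on either side of the engine grow with every language combination (tokenization, casing, number and date handling, terminology), which is how one ends up with as much around the engine as in it.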
Furthermore, the often-discussed issue of "data cleaning" is a big grey area, unknown to most companies which begin to experiment with SMT. They simply assume that "their" TMs are clean: that no sentences are wrongly segmented, that there is no "noise", no typos and no code, that in-lines don't matter, and that it does not matter when things like metaphors or phrasal verbs are not translated in a standard way. Pangeanic has found more "unclean" data in material coming from international institutions (a quick look at UN and EU material and sites will show many "versions" - web pages are not necessarily good material). Those who opt for RB as a "ready-made" or "out of the box" solution soon find out that they never thought about how long building dictionaries takes (along with other work: programming, improving sentences to match your style...).

Alon and Kirti are right in the sense that medium and smaller LSPs cannot take on all this work on their own. Pangeanic was a pioneer a few years ago and it has changed the company forever as a provider of language and automation, but the road was never easy in terms of manpower, finances and time invested.

As for the "creativity" mentioned by Alex, this can be programmed and flagged. RB does a good job of it and SMT can prioritize some expressions. I can only speak about SMT and hybridization, as I only use rules for certain "fixed" expressions in a language and for collocations. Nevertheless, it is amazing how many metaphors, etc. a system will get right given a large enough corpus - or by programming those phrasals into the stats.

Anabela: True, GT provides you with a good service but gets hold of your data in return. At least it tells you about it. There has been a noticeable (and measurable) increase in quality in certain languages lately, particularly in Romance languages. I presume they have the largest number of Google Translate users.
Still, as Franz Och declared recently in an interview, it is not a matter of having more powerful machines with more processors: higher processing speeds would not provide better statistical and rule-based results. But that applies to a general system that tries to translate, or "give a gist" of, everything. My prediction: MT will become as ubiquitous as CAT tools, with different levels of skill at each organization - and here "bigger" does not mean "better".
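As a footnote to the data-cleaning point above, here is a rough Python sketch of the kind of checks that are easy to skip when a TM is assumed to be clean. It is an illustration only, not Pangeanic's cleaning process: the function names, the tag pattern and the length-ratio threshold of 3.0 are all assumptions, and real corpora need language-specific rules on top of this.

```python
import re
from typing import Iterable, Iterator, Set, Tuple

# Inline codes such as <b>, </ph> or {1}. The pattern is an assumption.
INLINE_CODE = re.compile(r"</?[A-Za-z][^<>]*>|\{\d+\}")


def strip_inline(segment: str) -> str:
    """Remove inline codes and collapse runs of whitespace."""
    return " ".join(INLINE_CODE.sub("", segment).split())


def clean_pairs(
    pairs: Iterable[Tuple[str, str]], max_ratio: float = 3.0
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) pairs that pass a few basic sanity checks."""
    seen: Set[Tuple[str, str]] = set()
    for source, target in pairs:
        src, tgt = strip_inline(source), strip_inline(target)
        # Drop pairs where one side is empty once the codes are removed.
        if not src or not tgt:
            continue
        # A very uneven length ratio often signals wrong segmentation.
        longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
        if longer / shorter > max_ratio:
            continue
        # Drop exact duplicates so they do not skew the statistics.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        yield src, tgt
```

Checks like these only catch the mechanical problems. Typos and inconsistent translations of metaphors or phrasal verbs still need linguistic review.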
Next time you think languages, think Pangeanic
