Category Archives: MT Evaluator

Final dominance by final technology? The beginning of the MT wars (II)

I must thank all contributors to the LinkedIn discussion on whether “ownership” or “customization” of MT by LSPs will be a trend in the near future (http://bit.ly/dkQ7YD).

For those who missed it, or who would like to contribute other perspectives on what the future holds for MT, for LSPs and for general users of translation services (professional or casual), here is a copy of my LinkedIn post.

Thanks to all for your contributions, some interesting perspectives there. My intention when mentioning that LSPs will need to “own” their MT technology was not to say they will have to “develop” it necessarily. I quote 

“4. It also signals that provided you have the technological resources and will, it is essential to own, develop or at least customise your MT solution for your clients. Pressure will mount on the smaller and medium-sized LSP’s…” 

Full-scale development is truly beyond the reach of most companies, and even the larger ones (Lionbridge, SDL) have not developed their own engines but bought technology instead (Systran and then IBM in Lionbridge’s case, LW in SDL’s case).

Where I see the trend is in companies customizing MT for their clients. Lori Thicke’s Lexcelera, for example, has done pretty good work with RB engines in many areas, and there were other RB pioneers at different levels (Jeff Allen, whom I had the pleasure of meeting some months ago, was an evangelist of MT even at times when interest was negative and has done amazing work with rule-based engines). I do believe, however, that the future of customization requires some kind of statistical system at its core. I can foresee many companies experimenting with Moses and statistics up to a point but, as happens with most open source software, there are limitations in Moses’ results. One soon starts to build around it, with pre- and post-processing before and after, until one ends up with a system that may contain Moses but has just as many other components around it – depending, of course, on the language combination, too.

Furthermore, the often-discussed issue of “data cleaning” is a big grey area, unknown to most companies that begin to experiment with SMT. They simply assume that “their” TMs are clean: that no sentences are wrongly segmented, that there is no “noise”, no typos and no code, and that in-lines don’t matter. They also overlook that things like metaphors or phrasal verbs are not translated in a standard way. Pangeanic has found more “unclean” data coming from international institutions (a quick look at UN and EU material and sites will show many “versions”; web pages are not necessarily good material). Those who opt for RB as a “ready-made” or “out of the box” solution soon find out that they never thought about how long building dictionaries takes (along with other tasks such as programming and adapting sentences to your style…).
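To make the “data cleaning” point more concrete, here is a minimal sketch in Python of the kind of sanity checks one might run over TM segment pairs before SMT training. It is not Pangeanic’s actual filter: the function name `clean_pairs`, the thresholds `max_ratio` and `max_len`, and the crude tag pattern are all assumptions chosen for illustration.

```python
import re

# Assumption: inline mark-up looks like <b>, </b> or <ph id="1"/>.
TAG_RE = re.compile(r"<[^>]+>")


def clean_pairs(pairs, max_ratio=3.0, max_len=100):
    """Return the (source, target) pairs that pass a few basic sanity checks.

    max_ratio and max_len are illustrative thresholds, not recommended values.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        # Strip inline tags and collapse runs of whitespace.
        src = " ".join(TAG_RE.sub(" ", src).split())
        tgt = " ".join(TAG_RE.sub(" ", tgt).split())
        # Drop pairs where one side is empty.
        if not src or not tgt:
            continue
        # Drop untranslated segments (target copied from source).
        if src == tgt:
            continue
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop overlong segments, often a sign of wrong segmentation.
        if n_src > max_len or n_tgt > max_len:
            continue
        # Drop pairs whose lengths differ too much, often misalignments.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        # Drop exact duplicates.
        if (src, tgt) in seen:
            continue
        seen.add((src, tgt))
        kept.append((src, tgt))
    return kept


sample = [
    ("Hello <b>world</b>", "Hola <b>mundo</b>"),
    ("", "nada"),
    ("Same", "Same"),
    ("Hello world", "Hola mundo"),
    ("One", "uno dos tres cuatro cinco"),
]
cleaned = clean_pairs(sample)
```

Checks like these only catch the mechanical problems (empty, duplicated, untranslated or misaligned segments). Typos and inconsistent translations of the same expression need more than simple rules of this kind.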

Alon and Kirti are right in the sense that medium-sized and smaller LSPs cannot take on all this work on their own. Pangeanic was a pioneer a few years ago, and it has changed the company forever as a provider of language and automation, but the road was never easy in terms of manpower, finances and time invested.

As for the “creativity” mentioned by Alex, this can be programmed and flagged. RB does a good job at it and SMT can prioritize some expressions. I can only speak about SMT and hybridization, as I only use rules for certain “fixed” expressions and collocations in a language. Nevertheless, it’s amazing how many metaphors and the like a system will get right given a large enough corpus, or by programming those phrasals into the stats.

Anabela: True, GT provides you with a good service but gets hold of your data in return. At least it tells you about it. There has been a noticeable, measurable increase in quality in certain languages lately, particularly the Romance languages, which I presume have the largest number of Google Translate users. Still, as Franz Och declared recently in an interview, it is not a matter of having more powerful machines with more processors: higher processing speeds would not provide better statistical and rule-based results. But that applies to a general system that tries to translate, or “give the gist” of, everything. My prediction: MT will become as ubiquitous as CAT tools, with different levels of skill at each organization, and here “bigger” does not mean “better”.

Next time you think languages, think Pangeanic

Follow manuelhrrnz on Twitter

Pangeanic’s participation in TAUS Copenhagen 2010

by Elia Yuste

For the last year or so, TAUS has been tracking the exciting experiences of companies pioneering a radically new MT engine training space. Pangeanic is one of the most outstanding cases, and so earlier this year we were announced as the first LSP to create a new business stream with TAUS Data Association (TDA) data. Then PangeaMT, Pangeanic’s technological division geared towards customized MT solutions and consulting, was invited to take part in the proof-of-concept of TAUS MT Trainer and to present its results at the TAUS Executive Forum in Copenhagen in late May 2010.

The idea behind this MT Trainer, a web-based facility from TAUS TDA that will materialise within the current year, is twofold: first, to foster pro-active adoption of TDA data for MT engine training; and second, to connect MT service commissioners and providers under the TAUS umbrella, whereby the former may submit their data files (reference files for engine training and files for translation) and the latter would turn around the MT output in a short time. The MT Trainer has a counterpart facility called MT Evaluator, which lets the commissioner or client evaluate the uploaded MT output by means of standard metrics-based figures.

To test the viability of such a double initiative, the so-called MT Trainer pilot was discussed among the selected partners and then launched about two weeks before the Copenhagen meeting. Would it be possible to automate the workflow for MT customization using client data and data from TDA? On the one hand, Adobe, eBay and McAfee were the three prospective MT commissioners seeking trained engines and metrics to measure the quality of the output. On the other, Languagelens, PangeaMT and Tilde were the three selected MT companies. All of us were able to turn around customized MT engines in 24 hours or less, and the output was measured for quality using BLEU scores. In the specific case of Pangeanic, the challenges of speed and acceptable quality were met without any problem.
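For readers unfamiliar with BLEU, the sketch below shows in Python how such a score can be computed: clipped n-gram precisions for n = 1 to 4 are combined by a geometric mean and multiplied by a brevity penalty. This is a simplified illustration, not the MT Evaluator’s implementation; it assumes a single reference per segment, plain whitespace tokenization and no smoothing, and the function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU in the range 0..1.

    Assumptions: one reference per hypothesis, whitespace tokens, no smoothing.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: an n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    # Without smoothing, any zero precision makes the whole score zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: penalise output that is shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


reference = ["the cat sat on the mat"]
perfect = corpus_bleu(["the cat sat on the mat"], reference)
partial = corpus_bleu(["the cat sat on a mat"], reference)
unrelated = corpus_bleu(["completely different words here now"], reference)
```

An output identical to the reference scores 1, an output sharing no words with it scores 0, and everything else falls in between. Scores are often reported multiplied by 100, and they are only comparable when computed on the same test set with the same tokenization.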

If these two TDA service offerings, the MT Trainer and Evaluator, are well accepted and regularly used by members, this will encourage more data uploads and downloads and reinforce the usefulness and applicability of sharing relevant, domain-specific data for MT training. It should also lead to a much-desired increase in memberships and in overall member pro-activity within TAUS. For Pangeanic it will mean more visibility in the MT arena, quicker access to high-calibre clients, whose content and domain specificities are, incidentally, already familiar to us, and a controlled workspace in which to offer our MT services.

Apart from the MT Trainer & Evaluator proof-of-concept, the Copenhagen event gave rise to lots of fruitful discussions among MT practitioners and newcomers. In our case, besides describing the ins and outs of our engine training experience for eBay under the MT Trainer pilot scenario, we engaged in interesting conversations about how PangeaMT has been able to overcome Moses’ shortcomings. Our TMX filter and inline mark-up parser were acclaimed as features that are much needed in our industry and that have made us stand out from the (S)MT crowd.

Other takeaways from the TAUS Copenhagen event were the convergence of MT, open platforms and contexts of application (e.g. in corporate support), learning more about TAUS TDA member experiences, and gathering the collective wisdom that emerged from future-projecting table discussions on a number of hot language industry topics. A full report about the event can be found here and also downloaded from the TAUS website.
