Comment to SDL’s “Sharing Data between Companies – is it the Holy Grail?”

by Manuel Herranz

Eye-openers about data sharing (or data mixing) abound nowadays. The kick-start for TM leveraging, automation and faster solutions has come from outside our beloved language industry in the shape of

– algorithms that create language (SMT) and their application as a business by players inside and outside the industry (from Google Translate to new MT entrants and offshoots)
– a credit crunch and a financial crisis that are leading companies to rethink the unthinkable

Only a few times (the exceptions) have language professionals joined forces to actually innovate and come up with something really new, mostly crowdsourcing, in translation, in frameworks, in workflows. Never mind: it is rarely the case that busy people have the time to innovate. It usually takes a shot from outside a particular industry to shake the foundations or to force things to change. (Let’s assume, from a positivist point of view, that change is for the better.)

While sharing initiatives will not solve the problems of the industry, they are a good starting point, at least in making data available for a variety of purposes, from academic to business. This has been our case at PangeaMT, which has utilized part of TDA donors’ data to pre-train engines and combinations. The push for language automation needs to be customization, not generalization. Domain-specific engines are bound to improve in accuracy as more data becomes available in the public domain, in club initiatives like TDA or in many other formats.

Sophie’s statement in SDL’s latest blog entry (now deleted):

“You will gain less if you have high volumes of your own, personalized content that has been translated”

is precisely right. Studies and our own findings point to sharing as one of the key factors in increasing the training data and thus obtaining good results. However, this does not mean that any type of sharing, throwing just any data into training, will give you good results. In fact, you may need some kind of data pre-processing and editing so that the feeding corpus “speaks” as intended and thus provides language as you expect. Otherwise, one may only be adding noise.
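To make the pre-processing point concrete, here is a minimal sketch (my own toy illustration, not PangeaMT’s actual pipeline) of the kind of filtering a feeding corpus typically needs before SMT training: dropping empty sides, leftover markup, and segment pairs with suspicious length ratios:

```python
import re

def is_clean_pair(src, tgt, max_ratio=3.0, max_len=100):
    """Heuristic checks on one source/target segment pair."""
    src, tgt = src.strip(), tgt.strip()
    if not src or not tgt:
        return False                        # one side is empty
    if re.search(r"<[^>]+>", src + tgt):
        return False                        # leftover markup/code
    ls, lt = len(src.split()), len(tgt.split())
    if ls > max_len or lt > max_len:
        return False                        # abnormally long segment
    if max(ls, lt) / min(ls, lt) > max_ratio:
        return False                        # suspicious length ratio
    return True

# Toy corpus: only the first pair survives the filters.
pairs = [
    ("Press the power button.", "Pulse el botón de encendido."),
    ("<b>Error</b>", "Error"),   # markup noise
    ("OK", ""),                  # empty target
]
clean = [p for p in pairs if is_clean_pair(*p)]
print(len(clean))  # → 1
```

Real cleaning pipelines go much further (encoding repair, deduplication, terminology checks), but even trivial filters like these remove a surprising share of the noise.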

Data-sharing can be beneficial in many other areas, not just MT development. It can be a way of leveraging, of gathering more terminology, of centralizing assets, etc. As linguists, and through a narrow-minded approach to translation, we have assumed that mixing data was sinful, whereas there is more benefit in working with large datasets than in purity of blood.

Data Cloud from TAUS

I am glad to see the new approach at SDL and wonder if it will permeate the whole company philosophy. For a company anchored in TM and workflow technologies, machine translation and crowdsourcing are a sword of Damocles, more a threat than an opportunity. See Renato Beninatto’s presentation at ATA for an illuminating insight into what the future may bring for 20th-century technology companies. SDL has been including a kind of general MT plug-in in its interfaces since 2007, possibly as a reaction to the Google Translate plug-ins released by smaller tools like OmegaT and Swordfish (which initially provided “whole document” GT translation and was later cut back to a segment-by-segment approach). SDL has a working partnership with Language Weaver, so I wonder about the benefits of buying a piece of software at over 2000€ a piece which, in effect, is going to work as a multiple TM manager (for basic leveraging) and terminology manager with added auto-suggestion features (a handy feature already present in Deja Vu and MultiCorpora), but which may eventually become a post-editing tool for their online MT service. Uhm! Let me think about it….

By the way, the initial comments on Sophie Hurst’s article on SDL’s blog are typical of the general user’s mindset. I have heard scaremongering about the “end of the world” since the release of Trados 3, which was despised by many as “an invention by the agencies to pay less”. I even found resistance to that low-level kind of automation from manufacturers around 2000. Another question is how far we would have gone in accomplishing our real mission as linguists (i.e. as communicators) if the push in the late 90’s had gone to MT (and later on statistical processes) and not to penny-grabbing strict % matching. Let us agree that TM tools, useful as they may be, do not help develop a very imaginative approach to our sector’s real needs…. How many terminology and post-editing tools would we be using nowadays, and how much faster would content be shared?

Next time you think languages, think Pangeanic
Translation Services, Translation Technologies, Machine Translation

6 thoughts on “Comment to SDL’s “Sharing Data between Companies – is it the Holy Grail?”

  1. Kirti

    While in some cases data sharing does provide leverage and improve overall results in SMT systems, I have been involved with several data sharing experiments which show very clearly that MORE DATA IS NOT ALWAYS BETTER.

    Data quality matters, and consistency and cleanliness matter; just sharing data is hardly a holy grail. Unfortunately, much of the data is of poor quality and noisy, and unlikely to provide long-term benefit to SMT.

    This is clear from the tests we did with TAUS member data, and this reverse synergy has also been seen in many other data consolidation experiments we have attempted. See for details.

    The GIGO (Garbage In, Garbage Out) principle rules.

    TM often comes with embedded process data and metadata, and thus becomes much less useful for SMT or even for other TM leverage.
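    To make that concrete, a hedged toy example (mine, not a TDA or TAUS tool): TMX segments carry inline tags and numeric placeholders that have to be stripped or normalized before the text is of any use for SMT training:

```python
import re

def strip_inline_tags(segment):
    """Drop TMX-style inline tags (ph, bpt, ept, it, ut) and numeric
    placeholders like {1}, then collapse the leftover whitespace."""
    text = re.sub(r"</?(?:ph|bpt|ept|it|ut)[^>]*>", " ", segment)
    text = re.sub(r"\{\d+\}", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(strip_inline_tags('Save the <ph x="1"/>{1} file.'))  # → Save the file.
```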

    My sense is that a lot of the new content that is demanding to be translated is different from the kind of TM assets that most companies have. So new kinds of linguistic leverage are needed.

    I expect that consistent terminology will become much more important – that TM will need to be transformed, significantly modified and made more compatible with one’s own foundation data before SMT development, for best results.

    My sense is that data sharing will matter but not quite in the way we envision it today. Corporate TMs are linked to really static content and thus will have much more limited value for new translation challenges.

    Perhaps some of the new Web 2.0 initiatives like WWL, TED, Meedan, The Rosetta Foundation will show what real open sharing looks like. This kind of diversity in the data assets is much more likely to have leverage.

    1. pangeanic

      Following up on the TM/data availability issue.

      We all come from a world where there was not enough data available and we were data-hungry. However, this is no longer the case, as large repositories are available in several places, from EU data sets at OPUS, Europarl and the DGT for law (which is the basis of the TMX used by Carlo at, with many other contributions and sets from the UN, etc.). Of course, there are also TDA and the kits that you get when you train Moses.

      The availability of data will not be an issue fairly soon. The quality of the input data will, for SMT as well as for other types of MT and even straight TM leveraging.

      Let me quote Jost from his latest post:
      “I have come to realize that while large amounts of data are very powerful, they can also be very distracting if they a) originate from a subject matter or client that uses a different terminology or style; b) come from dated or obsolete sources; and c) come from sources with a different quality level.”

    2. pangeanic

      Well, as I recently tweeted, Google insists on the size of the data being relevant. I quote from the Irish Times (today):

      “Norvig was in Ireland this week to give the annual Boyle Lecture at University College Cork. Entitled The Unreasonable Effectiveness of Data – How Billions of Trivial Data Points can Lead to Understanding, it was a look at how and why some computing challenges can be solved by aggregating and analysing patterns in huge amounts of data.

      His title is a play on a famous 1960 mathematics paper by Eugene Wigner, “The Unreasonable Effectiveness of Mathematics in the Natural Sciences”, which observed how basic mathematical formulas rather inexplicably explain everything from the way a snail’s shell spirals to the movement of ripples on a pond.

      “In the physical world, mathematical formulas work well to explain things, but in other areas, not so well,” says Norvig. “So I’m asking, if you observe millions and millions of data points, does that help?”

      And his answer – as might be expected from someone from a company synonymous with massive data crunching – is “yes”. As an example, he points to machine translation – getting computers to translate text from one language to another.

      “There was this idea in the 50s that translating language was just like decoding codes,” he says. Well known linguistic scholar and philosopher Noam Chomsky countered that belief by suggesting language was too complex for machines to approach translation in such a basic way.

      And true enough, machine translation had generally frustrated computer scientists for decades. But, cue today’s cheap availability of both computing power and storage and Google’s expertise at working with bottomless amounts of data.

      “We said, instead, why not get samples of phrases in different languages and match them up.” A few examples – even a few thousand are not enough to enable a computer to translate a document, Norvig says. “But get millions and millions, and it works fairly well.”

      And that is how Google Translate works – by breaking a document into phrases, searching a vast database of millions of similar or identical phrases to find what one generally means, then going to the next phrase.”
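      The phrase-matching idea the article describes can be sketched in deliberately toy form (my own illustration; a real phrase-based system scores millions of weighted phrase pairs and reorders them with a language model):

```python
# Toy phrase table; a real system stores millions of scored phrase pairs.
phrase_table = {
    ("the", "red", "car"): "el coche rojo",
    ("is", "fast"): "es rápido",
}

def translate(tokens, table, max_len=3):
    """Greedy left-to-right longest-match phrase lookup."""
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in table:
                out.append(table[phrase])
                i += n
                break
        else:
            out.append(tokens[i])  # unknown word passes through
            i += 1
    return " ".join(out)

print(translate("the red car is fast".split(), phrase_table))
# → el coche rojo es rápido
```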

      OK, I know Google’s “availability of data” far surpasses the data we may have collected, or even AsiaOnline’s. I see two issues here:
      a) Our results do not show significant improvements with larger sets of data (even “clean TM”, as you quote above). It is not an automatic transfer model. It happens in pre-determined language sets because of the similarity and regularity present in controlled language. It doesn’t happen overall.
      b) Customized developments (not only SMT, but even example-based, as KCSL points out) always score higher than Google on BLEU.

      Another question is whether BLEU should be the absolute rule by which to measure all results, but let’s take it as a neutral rule.
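      For reference, BLEU is essentially the geometric mean of clipped n-gram precisions times a brevity penalty. A simplified, unsmoothed sentence-level sketch (real evaluations use corpus-level BLEU, usually with smoothing):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU: geometric mean of clipped n-gram
    precisions times a brevity penalty (no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        c_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        clipped = sum(min(c, r_ngrams[g]) for g, c in c_ngrams.items())
        if clipped == 0:
            return 0.0          # some n-gram order has no match at all
        log_prec += math.log(clipped / sum(c_ngrams.values())) / max_n
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_prec)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # → 1.0
```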

      I believe GT’s “success” is based on:
      1. Their starting jump-off (they were famous for getting things right before, in other areas, “for free”)
      2. A “Google” fan community, hungry and expectant
      3. The fact that they have managed to provide more combinations than anybody else, even if results are arguable in some
      4. A focus on general and free, and thus not a results-oriented service

      Thanks for the input Kirti 🙂

  2. Manuel

    The post is more about SDL “embracing” MT, but I see your point.

    The idea that by simply adding millions of data points and pressing a button we would get better results was too nice (or naive, in retrospect). Many, including myself, fell for it. I tried to word it carefully in the post every time I approached the assertion, because I am swimming against the current:

    – “While sharing initiatives will not solve the problems of the industry, they are a good starting point,”
    – “The push for language automation needs to be customization, not generalization.”

    We have used a lot of TDA data for our own purposes (basically because it comes in handy on a word/effort ratio for the price of it). We have used it for engine training and for testing. Had it not been for TDA, we would have spent a lot of time gathering data and less time training engines. Web data crawlers work to gather data, but there is always the post-processing. The “cleaning” process or standardization you describe is an effort that is often overlooked, so getting your competitors’ websites to pump up your own development is not always a good idea, particularly if there are conflicting terminologies. Not all data is fit for SMT, much less if there is no prep work. I totally agree with the GIGO principle. Miracles don’t happen as often in the 21st century as they did in the Middle Ages.

    For practical reasons, we concentrate more on the size and volume of the data than on providing a large variety of language combinations. The number of combinations in our tests is lower than your 29 engines and table combinations; our market and needs are different, and we go more for several industry domains into EU languages. I did read AsiaOnline’s report and it was rather surprising at the time, but it is less so as time goes by. In fact, many of our findings coincide. Our conclusions are close to yours, particularly when it comes to preferring smaller sets of controlled data over huge amounts of uncontrolled [dirty] data. There’ll be a Pangeanic white paper on sharing in Q2 2010. I guess not to the liking of some conventional wisdom.

    We have managed to solve some of the points you make about noise and code with different approaches and placeholders. That has not had a significant impact on the tests and results. A second finding has been to post-process combined SMT outputs on initial SMT results (a kind of “pick the best from the best results”, which is new even in academia). We need to study further whether this could have a significant impact on uncontrolled inputs like TDA. It often has a positive impact. Again, I do agree with you that consistency/control and reliable data are far more important than large sets. Numbers do not ensure success. And terminology interfacing/use/normalization or QC, depending on how you approach it, is definitely the key issue. It will become the central issue once SMT really gets off the ground. That has often been overlooked by SMT enthusiasts.
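    The “pick the best from the best results” idea can be sketched, purely as my own toy illustration and not Pangeanic’s actual method: several engines propose candidate translations, and a simple scorer – here, a bigram fluency count trained on a small in-domain corpus – selects one:

```python
from collections import Counter

def train_bigrams(corpus):
    """Count bigrams (with sentence boundaries) over an in-domain corpus."""
    bigrams = Counter()
    for sent in corpus:
        toks = ["<s>"] + sent.split() + ["</s>"]
        bigrams.update(zip(toks, toks[1:]))
    return bigrams

def fluency(sentence, bigrams):
    """Toy fluency score: sum of corpus bigram counts along the sentence."""
    toks = ["<s>"] + sentence.split() + ["</s>"]
    return sum(bigrams[b] for b in zip(toks, toks[1:]))

def pick_best(candidates, bigrams):
    return max(candidates, key=lambda c: fluency(c, bigrams))

corpus = ["press the power button", "hold the power button down"]
model = train_bigrams(corpus)
outputs = ["press power the button", "press the power button"]  # two engines
print(pick_best(outputs, model))  # → press the power button
```

A production system would use a proper language model and confidence features rather than raw counts, but the selection principle is the same.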

    A lot of TDA data contained too many unusable parts (code apart). We got the best results when (hand-)picking sections, deleting and selecting the data, even in large chunks. This was true in areas like “electronics”. Contradicting established wisdom, we got better results in “software” when developing single-company engines than by mixing 3, 4, 5… 9 TMs. In several cases and languages, sharing was detrimental. Apart from the TDA data, our own domain-specific sharing/data gathering has provided good results in automotive, but that was fairly controlled and it did have a dominant set (8M from a whole corpus of 12,5M).

    Self-generation (i.e. editing and adding similar content) always improved the results. But one important point you make, and one I had not thought about, is that a lot of new content out there simply cannot be put through automation with fixed TM data. I do see a future for more monolingual training and for non-company TMs and data sets alien to company assets. Very good point.

    For now, my point is that any MT development has to be customized, or at least domain-specific, and that a “one size fits all” solution is still far on the horizon, at least if we take the Holy Grail to be uncontrolled sharing, or simply sharing, which is what SDL seems to point to regardless.

    May I add globalvoices to your diversity list, in the same spirit?

  3. Kirti

    I think the comment thread here should be very useful to anybody who is interested in sharing data from TDA or another public data sharing forum, as many useful and practical points are made.

    I do agree that it can provide some leverage once the user understands the data and the pitfalls, and takes specific action to minimize these problems.

    I suspect that the quality of data cleaning tools will become more important in time as well, as this will enable much of the public data on the web to be more useful.

