Eye-openers about data sharing (or data mixing) abound nowadays. The kick-start for TM leveraging, automation and faster solutions has come from outside our beloved language industry, in the shape of

- algorithms that create language (SMT) and their commercial application by players inside and outside the industry (from Google Translate to new MT entrants and offshoots)
- a credit crunch and a financial crisis that are leading companies to rethink the unthinkable

On a few exceptional occasions, language professionals have joined forces to actually innovate and come up with something really new, mostly crowdsourcing, in translation, in frameworks, in workflows. Never mind: busy people seldom have the time to innovate. It takes a shot from outside an industry to shake its foundations or to force things to change. (Let's assume, from a positivist point of view, that change is for the better.) While sharing initiatives will not solve the problems of the industry, they are a good starting point, at least to make data available for a variety of purposes, from academic to business. This has been our case at
PangeaMT, which has used part of the TDA donors' data to pre-train engines and combinations. The push for language automation needs to be towards customization, not generalization. Domain-specific engines are bound to improve in accuracy as more data becomes available in the public domain, in club initiatives like TDA or in many other formats. Sophie's statement in SDL's latest blog entry (now deleted): "You will gain less if you have high volumes of your own, personalized content that has been translated"
is precisely right. Studies and our own findings point to sharing as one of the key factors in increasing the amount of training data and thus obtaining good results. However, this does not mean that any type of sharing, or throwing just any data into training, will give you good results. In fact, you may need some data pre-processing and editing so that the training corpus "speaks" as intended and produces the language you expect. Otherwise, you may only be adding noise.
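To give an idea of what such pre-processing might look like, here is a minimal sketch in Python. It is not PangeaMT's actual pipeline: the function name `clean_pairs`, the `max_ratio` threshold and the specific filters (empty segments, untranslated segments, implausible length ratios, duplicates) are assumptions chosen for illustration only.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (source segment, target segment)


def clean_pairs(pairs: Iterable[Pair], max_ratio: float = 3.0) -> List[Pair]:
    """Drop obviously noisy translation units from a shared corpus.

    Hypothetical filters, for illustration only:
    - empty source or target
    - source identical to target (probably untranslated)
    - character-length ratio above max_ratio (probably misaligned)
    - case-insensitive duplicates
    """
    seen = set()
    kept: List[Pair] = []
    for source, target in pairs:
        # Collapse runs of whitespace and trim both sides.
        source = " ".join(source.split())
        target = " ".join(target.split())
        if not source or not target:
            continue
        if source == target:
            continue
        ratio = max(len(source), len(target)) / min(len(source), len(target))
        if ratio > max_ratio:
            continue
        key = (source.lower(), target.lower())
        if key in seen:
            continue
        seen.add(key)
        kept.append((source, target))
    return kept


if __name__ == "__main__":
    sample = [
        ("Hello  world", "Hola mundo"),
        ("Hello world", "Hola mundo"),  # duplicate after normalization
        ("", "x"),                      # empty source
        ("OK", "OK"),                   # untranslated
        ("Hi", "Una frase muy larga que no corresponde"),  # misaligned
    ]
    print(clean_pairs(sample))
```

Real corpora would need more than this (tag handling, language identification, domain checks), and a filter such as "source equals target" would wrongly discard legitimate segments like product names, so the thresholds have to be tuned to the data at hand.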
Data sharing can be beneficial in many other areas, not just MT development. It can be a way of leveraging, of gathering more terminology, of centralizing assets, etc. As linguists, and through a narrow-minded approach to translation, we have assumed that mixing data was sinful, whereas there are more benefits in working with large datasets than in purity of blood.
I am glad to see the new approach at SDL and wonder if this will permeate the whole company philosophy. For a company anchored in TM and workflow technologies, machine translation and crowdsourcing are a sword of Damocles, more a threat than an opportunity. See Renato Beninatto's presentation at ATA for an illuminating insight into what the future may bring for 20th-century technology companies.

SDL has been including a kind of general MT plug-in in its interfaces since 2007, possibly as a reaction to the Google Translate plug-ins released by smaller tools like OmegaT and Swordfish (which initially provided "whole document GT translation" and then was cut back to a segment-by-segment approach). SDL has a working partnership with Language Weaver, and so I wonder about the benefits of buying a piece of software at over 2000€ apiece which, in effect, is going to work as a multiple TM manager (for basic leveraging) and terminology manager, with added auto-suggestion features (a handy feature present in Deja Vu and MultiCorpora), but which, eventually, may become a post-editing tool for their online MT service. Uhm! Let me think about it...

By the way, the initial comments on Sophie Hurst's article on SDL's blog are typical of the general user's mindset. I have heard "end of the world" scaremongering since the release of Trados 3, which was despised by many as "an invention by the agencies to pay less". I even found resistance to that low-level kind of automation from manufacturers around 2000. Another question is how far we would have gone in accomplishing our real mission as linguists (i.e. communicators) if the push in the late 90s had been for MT (and later on statistical processes) and not for penny-grabbing strict % matching.
Let us agree that TM tools, as useful as they may be, do not help to develop a very imaginative approach to our sector's real needs. How many terminology and post-editing tools would we be using nowadays, and how much faster would content be shared?