[embed]https://www.slideshare.net/manuelherranz/loc-world2011-kbiermherranz-8730502[/embed]
I do not know any MT system builders who claim that using unclean data will not affect the output, or who leave that freedom to MT system users without training them first. That is a key differentiator for PangeaMT: we train users so they can have an impact on how their MT will evolve and develop. An initial revision of (at least) part of the material, or of typical chunks of text within the domain, is the first step towards MT engine customization. Below I summarize some key steps for a good DIY SMT implementation, whether on-site or off-site (SaaS):

1. Gather relevant, in-domain material. Your own material is key for the best engine performance. The material you have translated in the past is likely to be similar to the material you will translate in the future. Those expressions, terminology lists, translation memories, HTML files, parallel data, even monolingual texts, will form the basis of your customized engine. However, there may be times when you cannot share all your data. Do not despair: this is the advantage of PangeaMT. Any general, related data will serve for the engine setup. We will train you and show you potential pitfalls with training sets and cleaning.

2. Ask your vendor to analyze the data provided and run cleaning procedures. Your MT vendor should be transparent about "dirty data" and discarded segments, and should present an analysis of the troublesome segments or datasets that should not be used for machine learning. Dirty data does not mean "bad translation"; very often it is "noise" introduced by the translation management tool itself, rendering a segment unusable for machine learning. Explaining rather than translating, or offering bilingual versions, will of course confuse learning patterns. So will bad alignments, or hyphens, quotation marks, commas, semicolons and colons added profusely where they should not be.

Figure: Source same as target

Data cleaning is a key step in the system. We recommend deleting segments rather than trying to "repair" them. Most of the time, repairing is not worth the effort, unless your data is really dirty. A lot of cleaning can be done prior to the material entering the system (see below).

Figure: Untranslated "to" would affect machine translation learning
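To make the cleaning step more concrete, here is a minimal sketch in Python of the "delete rather than repair" approach. It is not PangeaMT's actual cleaning procedure: the function names, the thresholds and the use of a length ratio as a crude proxy for bad alignments are all assumptions chosen for illustration. It discards segment pairs that are empty, that have a source identical to the target, that differ suspiciously in length, or that are swamped with punctuation, and it keeps a reason for each discarded pair so the analysis can be reported back.

```python
import re

# Hypothetical thresholds; tune them for your own language pair and domain.
MAX_LENGTH_RATIO = 3.0   # longest side / shortest side, in characters
MAX_PUNCT_SHARE = 0.3    # share of non-space characters that are punctuation

# Punctuation that tends to be over-inserted in noisy segments.
PUNCT = re.compile(r"[-\"',;:]")


def punct_share(text):
    """Return the fraction of non-space characters that are punctuation."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if PUNCT.match(c)) / len(chars)


def rejection_reason(source, target):
    """Return why a segment pair should be discarded, or None to keep it."""
    src, tgt = source.strip(), target.strip()
    if not src or not tgt:
        return "empty side"
    if src.lower() == tgt.lower():
        return "source same as target"
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    if ratio > MAX_LENGTH_RATIO:
        return "suspicious length ratio"
    if punct_share(src) > MAX_PUNCT_SHARE or punct_share(tgt) > MAX_PUNCT_SHARE:
        return "excessive punctuation"
    return None


def clean(pairs):
    """Split pairs into (kept, discarded); discarded items carry a reason."""
    kept, discarded = [], []
    for source, target in pairs:
        reason = rejection_reason(source, target)
        if reason is None:
            kept.append((source, target))
        else:
            discarded.append((source, target, reason))
    return kept, discarded
```

Calling `clean(pairs)` on a list of `(source, target)` tuples returns the segments to train on and a list of discarded segments with their reasons, which is the kind of transparency about dirty data described above. Nothing is repaired; problem segments are simply dropped.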

Those four steps are basic checkpoints you should bear in mind when moving your organization towards higher automation and adopting MT. Above all, you should weigh the cost of "ownership" against "SaaS" according to your needs and how deep you want to go in MT. Do you wish to position yourself as an authority with fully customized machine translation technology in your language pair or field? PangeaMT will help you. Or do you simply wish to save time and translate faster, without changing tools? Our TMX workflow will help you.
Many tools are fully compatible with PangeaMT, and our philosophy is to engage with tool and platform providers to offer open-standards solutions with no tie-ins. Our SDL plug-in allows you to work with a well-known tool and, at the same time, benefit from owning your engines and using the translation memory to build, customize and re-train the engine(s) for the next jobs. With PangeaMT, you get an instant suggestion from your engine and choose whichever is more relevant: the translation memory match or the suggestion produced by the engine. Post-editing takes a few seconds, whereas translating a sentence from scratch can sometimes take almost a minute.
Because every engine is built with your own material, it is specific to you only and trained to perform and translate in the fields you specialise in and nothing else. Following strict TMX cleaning procedures and engine training methods, customized engines become extremely useful translation tools that aid translators in their everyday tasks. Your future post-edited material can retrain the engine very fast, improving accuracy more and more with every job.
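As an illustration of how translation memory material can be turned into training pairs, the sketch below reads TMX content with Python's standard library and returns the source and target segments for one language pair. It is a simplified example, not PangeaMT's TMX workflow: the function name `read_tmx_pairs` is hypothetical, and the code assumes each translation unit (`tu`) holds one variant (`tuv`) per language, identified by the `xml:lang` attribute.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) segment pairs from TMX content given as a string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.iter("tuv"):
            # Prefer xml:lang; fall back to a plain "lang" attribute if present.
            lang = tuv.get(XML_LANG) or tuv.get("lang") or ""
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() joins all text inside <seg>, including any text
                # carried by inline elements.
                segments[lang.lower()] = "".join(seg.itertext()).strip()
        src = segments.get(source_lang.lower())
        tgt = segments.get(target_lang.lower())
        # Keep only units that have non-empty text in both languages.
        if src and tgt:
            pairs.append((src, tgt))
    return pairs
```

Post-edited jobs exported as TMX could be read this way and added to the material used for the next retraining. Units with a missing or empty side are skipped rather than repaired.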
Next time you think languages, think Pangeanic. Your Machine Translation Customization Solutions.