Tag Archives: Machine translation

Tekom and EXPERT Hybrid MT Winter School

Pangeanic attended two major events during November, promoting its flexible machine translation technologies to translation experts, LSPs and corporate users. We also took part in the first training event of the EU’s Marie Curie EXPERT program, which aims to train young researchers in hybrid machine translation technologies and to link experienced researchers with industry.

Traditional Japanese music at Tekom

TAUS’ Razheb Choudhury – Manuel Herranz (Pangeanic) – Diego Bartolome (Tauyou)

Tekom concentrated largely on terminology management this year. With machine translation now mainstream in most LSPs, or being adopted to some degree, we were happy to see that word of mouth resulting from the practical use of PangeaMT by adopters is the best publicity for a technology developed free of chains. Despite the typical noise from some developers, more and more LSPs are interested in owning a technology that empowers them to grow and become technologically savvy, and that enables them to design better solutions for their clients without being technologically dependent.

Lucia Specia – Post-Editing Presentation

Pangeanic’s technology has been available as a powerful platform to language service providers and corporations for some years. This has led the company to take part in national research projects and, currently, to become a hosting organization where young and experienced researchers will learn and develop novel hybridization skills in the field. The first event within the EXPERT project was held in Birmingham and organized by Wolverhampton University. The project brings academia and industry together, and aims at training the next generation of researchers at leading European institutions, research centres and technology companies. Apart from being an accolade for work already done, the project will allow industrial partners such as Pangeanic to expand machine translation capabilities, language combinations and new hybridization and combination techniques in several areas, making the best of computer-assisted translation and machine translation technologies (example-based, statistical and hybrid approaches), as well as including input from respected figures in post-editing research.

Pangeanic translation technology in the press

Pangeanic’s translation technology developments have often been the focus of international media and think tanks, like TAUS or Localization World, where we have showcased our technologies and use cases. Now, Pangeanic’s efficient translation workflows, built on customized Moses-based machine translation developments, have also attracted local media attention.
The prestigious Spanish newspaper ElMundo.es printed a 3-page report with a full description of the company’s history, star use cases and renowned machine translation applications and developments.

Click here to obtain a free PDF of the article inno14oct.pdf.

A few days ago, Valencia’s regional Finance Minister was interviewed by the online newspaper 20minutos. He cited Pangeanic’s participation in the Valencian Global program and its machine translation technologies as key to creating an “innovation ecosystem” that can generate “highly qualified jobs” thanks to its significant technological component. Pangeanic expects to grow its services and expand its global sales network as a result of taking part in the program.

Pangeanic has also appeared in other digital media, such as notasdeprensa, again within the Valencian Global internationalization framework and entornointeligente, about its business development and coaching with leading entrepreneurship figures like MIT’s Bill Aulet.

Next time you think languages, think Pangeanic
Your Machine Translation Customization Solutions


Four Steps to Understanding DIY Machine Translation Customization

There has been some recent controversy on LinkedIn and in blogs about claims to higher technical levels of engine customization, about what machine translation engine customization and DIY MT are, and about how they are understood.

PangeaMT specializes in custom-built systems which users (typically LSPs and translation and language departments) can later re-train in two different ways:

1. on-site if they have a full system installation (when data privacy is an issue)
2. using our own servers, in SaaS and via our API.

K. Vashee states that “The reality is that running an open source MT solution or using a “upload and pray” solution like that of many DIY MT vendors has become very easy.” This is a gross misunderstanding of what DIY MT is. DIY is about empowering the MT user to take control of the system or at least part of the process, rather than being a passive receiver of MT output that has to be quickly post-edited.

Building an MT engine has become pretty popular (that’s different from easy) and widespread in 2013. Systems are getting better as more and more data becomes available. Yet data is not everything. One of our largest engines at PangeaMT holds more than 190 million words, and other engines contain five or six TMX files with over 300 MB of text data inside each. Some small engines with under 5M words perform very well for the documentation task for which they were built (see our joint presentation with Sybase at Localization World 2011 below).
[slideshare id=8730502&style=border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px&sc=no]

I do not know any MT system builders who claim that using unclean data will not affect the output, or who leave such freedom to MT system users without training them. That is a key differentiator for PangeaMT: we train users so they can have an impact on how their MT will evolve and develop. An initial revision of (at least) part of the material, or of typical chunks of text within the domain, is the first step to MT engine customization. I summarize some key steps for a good DIY SMT implementation, whether on-site or off-site (SaaS):

1. Gather relevant, in-domain material.
Your own material is key for the best engine performance. The material you have translated in the past is likely to be similar to the material you will translate in the future. Those expressions, terminology lists, translation memories, HTML files, parallel data, even monolingual texts, will form the basis of your customized engine.
However, there may be times when you cannot share all your data. Do not despair: this is where PangeaMT has an advantage. Any general, related data will serve the purpose for the engine setup. We will train you and show you potential pitfalls with training sets and cleaning.

2. Ask your vendor to analyze the data provided and run cleaning procedures.
Your MT vendor should be transparent about “dirty data” and discarded segments, and should present an analysis of the troublesome segments or datasets which should not be used for machine learning. Dirty data does not mean “bad translation” but very often “noise” that has been introduced by the translation management tool itself, rendering a segment unusable for machine learning. Explaining rather than translating, or offering bilingual versions, will of course confuse learning patterns. So will profuse punctuation (dashes, quotation marks, commas, semicolons, colons) where it should not be, bad alignments, and segments whose source is the same as the target.

Data cleaning is a key step in the process. We recommend deleting segments rather than trying to “repair” them. Most of the time, repair is not worth the effort, unless your data is really dirty.

A lot of cleaning can be done prior to the material entering the system (see below).

Untranslated “to” would affect machine translation learning

There are more complicated “cleaning” routines which fall outside the scope of this article and involve revising alignments in phrase tables. We will leave that for keen system users.
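The simpler checks from step 2 can be sketched in a few lines of Python. This is only an illustrative sketch, not PangeaMT's actual cleaning filters: the function names, the length-ratio threshold and the punctuation pattern are all assumptions chosen for the example. It deletes suspicious segments rather than repairing them, and keeps the discarded ones so they can be reported.

```python
import re

# Hypothetical heuristic: a run of three or more dashes, quotes, commas,
# semicolons or colons is treated as noise rather than real punctuation.
PUNCT_RUN = re.compile(r'[-–"“”,;:]{3,}')


def is_clean(source, target, max_ratio=3.0):
    """Return True if a segment pair passes some basic noise checks."""
    src, tgt = source.strip(), target.strip()
    if not src or not tgt:
        return False  # one side is empty
    if src.lower() == tgt.lower():
        return False  # source same as target: probably untranslated
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    if ratio > max_ratio:
        return False  # very different lengths: likely a bad alignment
    if PUNCT_RUN.search(src) or PUNCT_RUN.search(tgt):
        return False  # profuse punctuation, often introduced by tools
    return True


def clean_corpus(pairs):
    """Split segment pairs into (kept, discarded) so discards can be reported."""
    kept, discarded = [], []
    for pair in pairs:
        (kept if is_clean(*pair) else discarded).append(pair)
    return kept, discarded


if __name__ == "__main__":
    sample = [
        ("Press the button.", "Pulse el botón."),
        ("Press the button.", "Press the button."),
        ("Open file", "Abrir archivo ;;;: --"),
    ]
    kept, discarded = clean_corpus(sample)
    print(len(kept), "kept,", len(discarded), "discarded")
```

The length-ratio check is crude (very short segments can legitimately differ a lot in length), so the threshold should be tuned per language pair and the discarded list reviewed before anything is thrown away for good.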

3. Perform initial tests (first engines) together with your vendor.
Your vendor may do this and just present you with the final “good” engine, or with a variety of engines depending on your specialization. A habitual training method is to separate 2,000 segments from the training material and then ask the engine to translate those segments, thus obtaining a BLEU score (i.e. a measure of how closely the engine’s output matches the existing reference translations). However, this is neither the only way nor the most efficient one, and BLEU scores cannot be compared across languages, nor even within the same language for different domains. An engine providing a 55% BLEU is no good when asked to translate out-of-purpose material, whereas PangeaMT systems have been reported to provide productivity increases from 50% to 300% in German with small engines scoring 38% BLEU but built for very specific purposes like software documentation or automotive manuals.

Put the engine to the test with previous translations you have not provided, or with similar material.
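To make the hold-out procedure from step 3 concrete, here is a minimal Python sketch. It is written under stated assumptions and is not the scoring used by any particular vendor: `split_holdout` and `corpus_bleu` are hypothetical helper names, tokenization is plain whitespace splitting, and there is a single reference per segment. For real evaluations an established BLEU implementation is preferable.

```python
import math
import random
from collections import Counter


def split_holdout(pairs, holdout_size=2000, seed=13):
    """Shuffle a copy of the segment pairs and set aside a test set."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[holdout_size:], shuffled[:holdout_size]


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100) with one reference per segment."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: an n-gram counts at most as often as in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty: output shorter than the reference is penalized.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)
```

The held-out target segments act as references: train on the first list returned by `split_holdout`, translate the source side of the second, and pass the engine output and the held-out targets to `corpus_bleu`. Because the held-out segments come from the same material as the training set, the score says little about out-of-domain text, which is why testing on material you did not provide remains worthwhile.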

4. Learn about engine re-training and the impact of post-edited material.
How big is your engine? How many words does it contain? What is its BLEU or METEOR score? How many words do you need to retrain your engine? Does your vendor ask for 5% or 10% of the engine size, or does it promise on-the-fly re-training with just one sentence? Even though that sounds pretty good, a 20-word sentence will have little impact on any engine, particularly considering that even “small” MT engines may contain 5 million words.
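The arithmetic behind that last point is easy to check. The short Python sketch below uses only the figures mentioned above (a 20-word sentence, a 5-million-word engine, and 5% or 10% thresholds); the function names are made up for illustration.

```python
def retraining_share(new_words, engine_words):
    """Percentage of the engine size that the new material represents."""
    return 100.0 * new_words / engine_words


def words_needed(engine_words, share_percent):
    """Words of new material needed to reach a given share of the engine."""
    return round(engine_words * share_percent / 100)


if __name__ == "__main__":
    engine_words = 5_000_000
    print(retraining_share(20, engine_words))   # one 20-word sentence
    print(words_needed(engine_words, 5))        # 5% of the engine size
    print(words_needed(engine_words, 10))       # 10% of the engine size
```

A single 20-word sentence is 0.0004% of a 5-million-word engine, whereas a 5% threshold corresponds to 250,000 words of new material and a 10% threshold to 500,000.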

We recommend a route whereby your post-edited material can enter the re-training cycle at any time, and a system where you are in control of both cleaning and re-training. PangeaMT offers both. You can upload new material any time after you complete a translation or finish a post-editing job. The latter is extremely good material and several papers point to benefits of post-edited material in MT engines. You can also schedule or set immediate re-training.

PangeaMT engine control panel

Those four steps are basic checkpoints you should bear in mind when moving your organization towards higher automation and adopting MT. Above all, you should also consider the cost of “ownership” or “SaaS” according to your needs and how deep you want to go in MT. Do you wish to position yourself as an authority with fully customized machine translation technology in your language pair / field? PangeaMT will help you. Or do you simply wish to save time and translate faster, without changing tools? Our TMX workflow will help you.
Many tools are fully compatible with PangeaMT, and our philosophy is to engage with tool and platform providers to offer open-standards solutions with no tie-ins. Our SDL plug-in allows you to work with a well-known tool and, simultaneously, benefit from being the owner of your own engines and use the translation memory to build, customize and re-train the engine(s) for the next jobs. With PangeaMT, you will get an instant suggestion from your engine and choose whatever is more relevant, the translation memory match or the suggestion translated by the engine. Post-editing takes a few seconds, whereas translating sentences from scratch can sometimes take almost a minute.
Because every engine is built with your own material, it is specific to you only and trained to perform and translate in the fields you specialise in and nothing else. Following strict TMX cleaning procedures and engine training methods, customized engines become extremely useful translation tools that aid translators in their everyday tasks. Your future post-edited material can retrain the engine very fast, improving accuracy more and more with every job.


PangeaMT Webinar on Translation Customization and DIY MT

by Manuel Herranz

Machine translation is a hot topic – and will be a hot topic for some years to come. But it is not only a hot topic with a lot of mystifying hype around it.

Elia Yuste, Andi Frank and (I hope) myself came a little bit closer to dispelling doubts about what a customized machine translation engine is and what DIY MT is.

Our Gala Webinar aimed at clarifying some misconceptions about language companies building their own tools and applying them successfully in the language market.

The language industry is a very varied industry, with few technological players which dominate the landscape and a myriad of smaller tools which fit many purposes, large and small. PangeaMT was born as the technological division solving the needs of a translation company. PangeaMT now has a life of its own and it is a well-respected, mature technology.

The presentation is already available on SlideShare.
[slideshare id=26586613&style=border: 1px solid #CCC; border-width: 1px 1px 0; margin-bottom: 5px;&sc=no]

Pangea Machine Translation became the first commercial application of Moses. Year after year, it has expanded on the core to add more functionalities, testing them at its translation department. It launched the now famous DIY MT package back in 2011, now part of many other platforms. PangeaMT now offers API, workflow and a TMX management system to clean and use material for machine translation learning and training.

The webinar went on to show how first-time customization and training data consultancy permeate each PangeaMT development. This also applies to data cleaning and reporting, which can later be automated once client-specific parameters and weights are in place. Pangeanic’s team stressed time and again that the concept behind PangeaMT is independence. Translators feel empowered by having and managing their own engines and by seeing that their post-edited material soon has an impact on engine behavior.

This webinar pointed to machine translation applications and deployment scenarios beyond the usual requesting of machine-translated output in a limited fashion for pre-translation. With PangeaMT, users can create their own ecosystem.

Click here to learn about how to use your bilingual files, glossaries and TMX as assets to build MT engines. Be in control of your domain-specific engines …always!



Understanding Machine Translation Customization and DIY MT

by Manuel Herranz

The same mistake that was made by many translation agencies, translation companies and now language service providers is being made by tough machine translation companies. “My (machine) translation is better than yours”, “my machine translation system works, everybody else’s doesn’t”. Translation companies have learnt that they cannot sell translation services on “translation quality claims” only or “I am better than you because…”, but it seems that some machine translation companies have yet to learn the same lesson. I am referring particularly to those with risky levels of investment or venture capital to repay and without the testing ground of in-house native speakers or a real translation department in which to test their technologies and MT before release. At times, such companies obtained their “high quality clean data” by bombarding Google Translate and applying cleaning cycles which included manual revision by local, non-native graduates. Many LSPs fall for the big marketing campaigns and strong wording; the limelight is always very attractive. Translation Memory technologies are good proof of that.

Bad-mouthing the competition is the worst marketing tool, and one I would not recommend to anybody in sales, marketing or representing a company. Talk about your strengths. Acknowledge what you cannot do, and explain what you can do to solve the problem. If you cannot match some offerings from the competition, saying they do not work is a terrible policy. There are tens of use cases, applications, conferences and presentations to prove that, for example, DIY MT works and is in good health, being used at LSPs, institutions and corporations. As far as I know, automated retraining and Moses packaging are part of at least two EU-funded programmes. As organizations such as Gala provide an excellent platform for machine translation webinars, monopolistic attitudes become more and more aggressive.

But I want to minimize self-promotion. What Kirti Vashee seems to forget in his virulent blog entries is that no company will release a tool that doesn’t work, nor install a product that cannot do what it claims it can. I was an industrial engineer for long enough to learn at least the difference between what works and what doesn’t. When it comes to hardware tools, quality may be easy to spot. When it comes to services (and in machine translation this is clear: “my output”, “my clients”, “my productivity” and “my technological independence”), quality is what works best for me. Claiming that in 2013 MT is so complex that only one company fully understands it is presumptuous, to say the least.

Let me quote some translation agencies (the term Language Service Provider being unknown to the majority of people outside the language industry). They are not big companies, possibly what economists call small and medium-size companies.

Tilde, Apsic, Lexcelera, Pangeanic. I am sure at least four others could make it onto this list. What do these companies have in common? All of them were or are translation companies that have transformed themselves into higher solution providers, either by developing software solutions that solved particular problems in translation or by customizing technology into their processes. With the help of EU funds and a clear vision to fill a market need, Tilde led R&D projects aimed at developing machine translation for less-resourced languages. Automated engine creation and re-training were part of the initial EU-funded project.

Apsic is the developer of one of the best consistency-checking tools (XBench), which is a must for any company wanting to ensure terminology consistency and error-free deliveries over hundreds of files.

Pangeanic has developed a management system on top of Moses which manages training sets and automatically cleans some data, trains engines and creates new engines with a variety of other customizable features.

As MT customizers, we know that initially some settings, parameters, weights and features need to be configured carefully to get a good start. But I do not know of any company in the software business that insists on manual processes and cannot automate what it has to do repetitively.



Translation Companies Rank High

by Manuel Herranz

The translation industry, despite downward pressures (not unlike other industries), seems able to provide revenue and career opportunities in the face of a grim economic outlook. Several translation companies are entrants in the famous Inc. 500-5000, which ranks companies by general revenue growth. This follows recent reports forecasting machine translation industry growth at around 18% and showing the general translation industry growing (although more slowly) in 2012.

There are 15 companies in total within the translation and language services category in the list. There are several conditions to qualify. The Inc. 500-5000 only lists US-based companies over a 3-year period, and the companies must have generated a minimum of $100,000 in 2008 and at least $2M in 2012.

As this is a commercial ranking, companies must be privately held and for-profit, and they must not be divisions of larger companies or subsidiaries of foreign companies. This is the ranking of the top translation companies that made it to the list, taking these conditions into account.

Position  Name                                 Growth  Revenue
204       CWU                                  2062%   $10.2 M
423       Mid Atlantic Professionals TA/SSI    1076%   $9.2 M
823       InDemand Interpreting                550%    $3.9 M
1619      Mango Languages                      243%    $7 M
2514      adaQuest                             141%    $12.7 M
2581      Language Training Center             137%    $3.8 M
2636      Global Language Solutions            133%    $10.5 M
3064      Propio Language Solutions            109%    $2.1 M
3310      CETRA Language Solutions             96%     $5.9 M
3396      CyraCom International                92%     $48.7 M
3676      WeLocalize                           81%     $90.8 M
3839      Certified Languages International    74%     $14.4 M
4082      1-Stop Translation USA               65%     $2.3 M
4248      Universal Language Service           59%     $6.1 M
4287      LinguaLinx                           57%     $4.3 M
4363      TransPerfect                         55%     $341.3 M



Cloud Traffic and Data Will Increase Translation Services

Cisco estimates that global cloud traffic will grow 45% annually until 2016, with translation services growing at around 15% to 20% per year. According to Ian Henderson, CTO of Rubric, a translation and localization company, this means that many new machine translators must enter the industry each year to handle the content.

On the other hand, Raymond Kurzweil, one of the brightest minds in the world, a director of engineering at Google and a futurist known for his predictions about artificial intelligence, predicts that by the year 2029 machines will match human intelligence and perform several feats that seem like science fiction to us nowadays, including human-quality translation.

Current happenings also suggest a strong role for non-human translation, with machine translation (MT) advancing rapidly. Three simultaneous-translation devices have been announced since June 2012, including one by Microsoft that renders live audio translations from the spoken word, respecting the tones and inflexions of the speaker.

Perfect is hard

But perfecting machine translation engines remains one of the toughest challenges in artificial intelligence. For several decades, computer scientists, with the help of armies of linguists, tried rule-based approaches, i.e. teaching machine translation systems the linguistic rules or similarities between two languages (sometimes unrelated languages, like English and Japanese) and including the necessary dictionaries. Progress was extremely slow and suffered several setbacks, like the ALPAC report in 1966.

Technology did not cease to advance, however, and statistical systems, using vast amounts of data, have made it possible to train translation engines fast and efficiently for several domains. See our presentation in Budapest including a short history of machine translation.
[slideshare id=8510213&style=border: 1px solid #CCC; border-width: 1px 1px 0; margin-bottom: 5px;&sc=no]

Click here for the longer version, a recommended review of a lucid article courtesy of Gadget Web Site.

Undoubtedly, content keeps growing, and the demand for translating online data into multiple languages is growing fast, even exponentially. Pangeanic launched its Pangea machine translation project in 2008, reporting real-life implementations at many events, and it is now a successful, customizable software platform capable of re-training itself and creating engines on the fly. The project has gained international recognition and is part of EU-funded projects.

“Human translation and machine translation are kind of like ‘frenemies,’” translation expert Nataly Kelly said. “They live alongside each other, but not without a lot of tension.” Sometimes, machine translations are so atrocious, human translators prefer to start from scratch.

Machine translation companies and their output are becoming more and more ubiquitous every day. And as experts, we know that the aim of the technology is not to replace multilingual humans. Machines (or rather, automatic translation software) cannot fully replace human translators… yet. In fact, human translators often clean up machine translation (post-editing). Thus, the technology becomes an enhancer rather than a replacement.

It is this need for accuracy that keeps the (human translation) business growing. In fact, it is one of the few industries to have grown during the worldwide recession. It is approximately a $34 billion market. The machine translation market is around $200 million, with growth forecasts of around 18.65%.

“Demand for translation is booming because content creation is exploding,” says Kelly. “And since much of that content is created, and demanded, in multiple languages, human translators alone can’t keep up. They need machine translations to improve–and fast.”



Human Translation or Machine Translation – What’s Best for Me?

For some people, using a translation software program to translate a piece of text from one language to the next is enough. It would be naive to believe this always works. We have proven at Pangeanic that this works in applied contexts, when we are dealing with a particular domain, enough clean data and when certain conditions apply. Please refer to many of our presentations since 2009 on the use of applied machine translation to speed translation of documentation in particular.

But as we all know, it takes a lot more than just software. The application of unrestricted, universal machine translation will take some time. In fact, it would not be fair to talk about “machine translation” in general, but rather about language combinations (English/Spanish/French/Portuguese/Scandinavian) in which it is undoubtedly successful, whilst in other languages certain nuances make it less of a success story.

Also, in some types of texts, it is essential to depart greatly from the writer’s meaning in the original language in order to carry over the thoughts and sometimes even emotions behind the words. Metaphors and comparisons often do not machine translate well.

Human Translators Use Computers – Is Machine Translation Not Just Another Tool?

When it comes to human translation, some joke it is one of the oldest professions in the world…
However, technology has focused in the last few decades on the need to help, accelerate and serve international commerce and information transfer (remember President Obama’s call upon coming into office). Particularly since the advent of Statistical Machine Translation and online services, translation has become part of our everyday lives, truly ubiquitous.

The speed at which machine translation happens is a huge advantage in time over human translation. For information purposes only, most users will put up with errors from translation systems so at least they can make sense of some texts. But the real, true advantage of machine translation is as a productivity enhancer of human translation services.

Financially speaking, machine translation is usually more economical than paying for human translation, but the two serve different purposes. Machine translation has proven very useful for gisting (finding out what a text is about). In some applications, it provides very good results, which means savings in time and money as humans post-edit the output. Even though translators’ resistance to becoming post-editors has historically been a stumbling block, younger generations of linguists are becoming more and more used to editing fast machine output in what the experts call a “redefinition of the role of translation as a profession”.

The power of the human brain makes it possible to translate text whilst keeping the essence or context of what is being recorded. This results in a conversion that is much easier to read and understand, but at a much slower pace. If the text does not contain feelings or emotions, then translation systems can often produce good enough output to be acceptable as the technology has improved in comparison with old rule-based systems.

If you are trying to decide whether human or machine translation is the right one for you, then you should consider what type of content you have and what publication time you can afford, so you can make the right choice.



I used to be a Translator, Now I Run Machine Translation (LocWorld London 2013)

by Manuel Herranz

It is only when looking back in time that one realizes how much work has been done, how far we are from where we used to be … what we call progress. That is what happened during our presentation of Pangea machine translation technologies at Localization World.

Our presentation summarized how the mix of Pangea technologies has enabled translation practitioners to empower themselves (see the launch of our DIY SMT in Barcelona, 2011) and be active in machine translation rather than just be passive users or passive post-editors. The platform is mostly based on open source developments to allow flexibility and customization, but it also includes proprietary cleaning filters, translation engine creation and retraining, dataset management and a very powerful set of statistics so users can see improvements every step of the way.

Pangea is the story of a solution designed for translators, for applied language professionals. It is machine translation as a productivity enhancer, and these features are what have made a small internal project grow into a reference technology and a concept (DIY SMT) used worldwide. Currently, the company is also part of the EU-funded EXPERT project (Empirical Approaches to Hybrid MT and Post-Editing).

The story of Pangea DIY (S)MT will continue, applying its concept of flexibility and empowerment to language technologies, letting practitioners utilize their TMs as an engine-training tool, customizing their translation engines even with small data sets.

[slideshare id=24510059&style=border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px&sc=no]




Pangeanic in EU EXPERT Project Evaluating Hybrid Machine Translation

by Manuel Herranz

As recently published in our news section, Pangeanic is taking part in the EU-funded EXPERT Project.

EXPERT: EXPloiting Empirical appRoaches to Translation

EXPERT aims to train young researchers to promote the research, development and use of hybrid language translation technologies. In practice, EXPERT aims at improving translation practices and enhancing the productivity of relevant actors in the translation market. In this respect, EXPERT’s findings will set an agenda for new skills and jobs by promoting new job profiles based on empirical data from translation professionals (language service providers) and academia. The assumption of the project is that the true potential of MT remains to be exploited as a result of non-user-friendly interfaces, lack of awareness of translators’ feedback, etc.

However, Pangeanic already created and released a web-based tool that is able to organize material for Machine Translation by domain, maintain it and perform some cleaning routines, a key factor in our participation in the project. This web tool is also able to directly create engines by domain or by TMs and perform several operations on training sets before engine training. Following a revolutionary concept, Machine Translation engines are created or updated depending on domains, and a few clicks can set in motion several actions to provide ready-for-use (S)MT.

The web tool already incorporates hybrid features (such as those presented at JTF in Tokyo, 2011), and these will be tested, expanded and improved upon in EXPERT.

Our role within the 4-year project is to concentrate on results-driven testing of hybridization on the 6 official United Nations languages, carrying out a series of experiments on EN/FR/ES/ZH/RU/AR. These will include general pre- and post-processing rules designed to improve machine translation output. For example, some tests will alter training sets and evaluate the impact of reordering in certain language combinations, measuring gains when using purely statistical, syntax-based or factored models.

Pangeanic will focus on the automatic generation of bilingual written texts for multiple language combinations, alignment, segment cleaning and segment selection for bilingual engine building. We will also look at which hybridization techniques need to be incorporated and improved to tackle re-ordering issues and other linguistic phenomena in non-related languages. When dealing with language-specific issues, we will also delve into automatic quality metrics and how these can correlate with human, non-objective qualitative appreciations.

Using our tool, users can check engine statistics (e.g. BLEU score, number of segments, number of words) and behavior. For example, the engines can be used for translation and can be automatically updated with new or post-edited material, with further retraining possible via Pangeanic’s MT API or web. In this way we can measure the impact of new datasets and hybrid techniques over time on translation quality and the project will benefit from existing, state-of-the-art technologies.
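As a rough illustration of tracking such statistics across retraining cycles, the Python sketch below records a snapshot after each retraining and reports the change in BLEU from one snapshot to the next. It does not use Pangeanic’s web tool or API: the `EngineSnapshot` class, the `bleu_deltas` function and all the figures in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EngineSnapshot:
    """Engine statistics recorded after one (re)training run."""
    label: str
    segments: int
    words: int
    bleu: float


def bleu_deltas(history):
    """Change in BLEU between each snapshot and the one before it."""
    return [
        (current.label, round(current.bleu - previous.bleu, 2))
        for previous, current in zip(history, history[1:])
    ]


if __name__ == "__main__":
    # Invented figures, for illustration only.
    history = [
        EngineSnapshot("initial engine", 400_000, 5_000_000, 38.0),
        EngineSnapshot("plus post-edited job 1", 420_000, 5_250_000, 39.5),
        EngineSnapshot("plus new dataset", 450_000, 5_600_000, 39.2),
    ]
    for label, delta in bleu_deltas(history):
        print(f"{label}: {delta:+.2f} BLEU")
```

Keeping the test set fixed between snapshots matters here; otherwise a change in score may reflect a different test set rather than the new material.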
