Tag Archives: Machine translation

Microsoft and Skype Translator: Will It Become True?

Today, organizations invest heavily in their research and development departments. This has led to new technological advancements in various fields that make our lives more convenient.

Machine translation is one such technological milestone, achieved after years of research and experimentation. Machine translation, also known as automatic translation, is rapidly gaining popularity among the masses, and its importance has now begun to be realized by various schools of thought. Machine translation is a software-based technology that can translate written text. Historically, it was rule-based (built on the relationship between one language and another). The advent of massive amounts of data and of statistical systems capable of processing relationships between language pairs created the possibility of building translation engines quickly and efficiently. Hybrid systems take the best from both worlds.

So far, various organizations have shown interest in this new discovery of science and are focused on further developing the technology. Microsoft, for instance, has been researching machine translation for more than ten years. In doing so, the researchers at Microsoft have made some incredible innovations in the field, and they continue to do so.

The research team at Microsoft realized the importance of machine translation and how effective it can be in lowering the language barriers that hinder global communication.

The first productive results that the research team at Microsoft delivered were related to Microsoft’s product-support Knowledge Base. Later the technology was modified further, and it can now be accessed by millions all over the world in the form of Bing Translator, a hybrid approach to machine translation with re-training features, a kind of “DIY MT” in which Microsoft can make use of your data. Microsoft joined TAUS several years ago in an effort to obtain more data, and access to donors’ data proved valuable.

After the success of Bing Translator, the next challenging task for Microsoft was to introduce machine translation technology for speech. The job was even more intricate than developing software for text translation. In addition to taking into consideration various nuances among different languages, care also had to be taken in terms of speech pronunciation and utterances. The challenge was to develop machine translation software that is compatible with the pronunciation of various languages and can differentiate between voice utterances of a diversified group of people.

After thorough research, the Microsoft research team succeeded in designing software that solved most of the problems concerning machine translation of speech. The technology, described as real-time translation for Skype, was first unveiled at the Code technology conference. Microsoft also released a video showing the head of its Machine Translation R&D team, Chris Wendt, speaking in German and having his message translated into English and back into German (see the video here).

Chris Wendt using Skype Translator

The video shows a short, limited conversation. However, other examples involving Gurdeep Pall, Corporate Vice President of Skype, showed a conversation that was fairly understandable, although a few anomalies were too obvious to be ignored. It appears that the technology still fails to fully comprehend sentences before translating them, and still needs work to produce more user-friendly results.

However, it cannot be denied that the launch of a machine-based translation service for Skype is a substantive technological milestone towards lowering communication barriers all over the world. Machine translation is gaining popularity due to its diversified and time-efficient approach. It is becoming a utility that we expect to find embedded in products and services, at the press of a button.

The translation service can now be accessed by the masses, as Skype has over 300 million active users all over the world. As many have realized how effective and useful Skype Translator is in carrying out business transactions, it appears that in the coming years this technology is going to see a tremendous increase in its users. In fact, Microsoft has already taken the initiative to carry out further research on machine translation to make the technology more convenient for its users.

3 types of machine translation

If you are a content manager or own a business, you know how much time writing takes. For instance, you can ask one of your writers to produce good content for your blog, say a content-rich article of about 1,000 words. It soon adds up to 10,000 words of valuable content that you need to transfer from one language to another. How fast can you expect the work to be done? Well, it can take days or weeks. This is where modern technology steps in. If you need to produce volumes of work in several languages within a limited time span and on a tight budget, machine translation is what you need. With this technology you will save on two important fronts, money and time, but its deployment must be well planned.

The next few articles in our blog will deal with the importance of planning your machine translation strategy well and incorporating a machine translation workflow into your organization, whether you are a translation company or a translation buyer. We will draw on our experience at Pangeanic as the first LSP in the world to deploy Moses commercially, and on how we grew from there to create PangeaMT and serve custom engines and full machine translation systems.

If you need to meet strict deadlines and need your work translated quickly, human translation services might take more time than expected. Rushing human translators is bound to produce mistakes. Squaring the cost/time/quality triangle and building scalable translation strategies is something that few companies have achieved in international publication. With machine translation, provided you accept that you will get some comprehension errors, you save time. And in language pairs for which it is very difficult to find a translator (imagine translating Japanese into Turkish, as some of our clients requested recently), machine translation is the only option when speed is essential. Machine translation is a tool to speed up translators’ output so they produce more. Popular online translators have made this possible. However, in real-life scenarios, many clients require special formats, very particular expressions and terminology adherence that generalist engines cannot offer. There are many gains in machine translation, but the main benefits always come from building specific, custom engines using the client’s previously translated material and terminology.

Gone are the days in which only large corporations could afford to buy machine translation engines. Pangeanic, via its machine translation division PangeaMT, has offered custom-built MT engines for years to companies and to other Language Service Providers, providing them with a key competitive edge and allowing large projects to be completed on time, quickly and efficiently.

At Pangeanic, we speak about the 3 typical uses, or types, of machine translation we can encounter:

  1. for gisting (simply understanding what something says, with little lifetime value and low expectations by the user). Here, machine translation engines exist prior to human interaction.
  2. for publication (for serious publication work with a higher lifetime value for the document and high quality expectations by the user). Humans are in control of the input with which the engines have been trained, and the engines perform according to their specific needs and domains. Here, machine translation engines are created after human users have decided that it is viable to use MT, and they use it for a purpose.
  3. for human interaction (when humans do not speak each other’s language and voice recognition software converts speech to text, which is then machine translated and converted again into speech).
3 types of machine translation: understanding, publication, human interaction

Everyone has used free online translation systems. Users expect them to be instantaneous and free, and to cover as many language areas as possible. In other words, the service should be like a sheet: plenty of breadth but not much depth. Lower quality or unreliable outputs are acceptable as the service is free.

The second case is what concerns translation professionals: the use of custom-built machine translation engines for a specific purpose. Typically, translation professionals will pay for this as a professional service and tool with its own ROI, as it leads to higher output by professional translators, who save time in typing, reading and understanding, and sometimes in looking up terminology. A well-built translation engine will contain specific terminology that saves post-editors invaluable time, even if they do need to improve the sentence to make it flow and sound human. Thanks to post-edited material, constantly evolving techniques in natural language processing, hybridization, etc., this is the area where machine translation has made the highest impact in professional, quality publication, as an aid and tool for translators to use.

A spin-off case of the above is the use of customized MT with an API to translate web content on the fly, for example making calls from your content management system to a custom-built engine that can serve fast translation of products, short reviews, etc. Opening this type of access to machine translation can open new revenue streams to companies as they can add new services to their clients with the right technological partner.
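To make the "on the fly" idea concrete, here is a minimal sketch of how a content management system might wrap calls to a custom engine. Everything in it is an assumption made for illustration: the `OnTheFlyTranslator` class, the `engine` callable and the `fake_engine` stand-in are hypothetical names and do not describe PangeaMT's actual API. The only point it shows is that short, repeated strings (product names, review snippets) can be cached so the engine is called once per unique segment.

```python
from typing import Callable, Dict, Tuple

# Hypothetical engine signature: (text, source_lang, target_lang) -> translation.
# A real deployment would wrap an HTTP call to a custom-built engine here.
Engine = Callable[[str, str, str], str]


class OnTheFlyTranslator:
    """Caches translations so the CMS only calls the engine for new segments."""

    def __init__(self, engine: Engine) -> None:
        self._engine = engine
        self._cache: Dict[Tuple[str, str, str], str] = {}
        self.engine_calls = 0  # counts real engine requests, for illustration

    def translate(self, text: str, source: str, target: str) -> str:
        key = (text, source, target)
        if key not in self._cache:
            self.engine_calls += 1
            self._cache[key] = self._engine(text, source, target)
        return self._cache[key]


def fake_engine(text: str, source: str, target: str) -> str:
    """Stand-in for a real engine; it only tags the text with the language pair."""
    return f"[{source}->{target}] {text}"


translator = OnTheFlyTranslator(fake_engine)
first = translator.translate("Red running shoes", "en", "es")
second = translator.translate("Red running shoes", "en", "es")  # served from cache
print(first, translator.engine_calls)
```

In practice the cache would likely live in the CMS database or a shared store rather than in memory, but the call pattern stays the same.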

As explained above, the third application of machine translation engines is human-to-human communication when people do not speak each other’s language. Some claims have been made about speech-to-speech translation lately, but mostly in controlled environments. Let us remember that although speech recognition has advanced a lot, the software needs training time to recognize one’s tone, and some accents are better recognized than others. Without prior training, speech recognition can fail. Any such failure is a loss that is then fed into machine translation and into the conversion of text back to speech.
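The chain described here (speech to text, text to translated text, translated text back to speech) can be sketched as three functions applied in sequence. This is only an illustration under stated assumptions: the three toy stages below are hypothetical stand-ins, not real recognition, translation or synthesis software, and the tiny German-English glossary exists only so the example runs. It shows why a recognition error matters: the later stages process it as if it were correct.

```python
from typing import Callable

# Hypothetical stage signatures; real recognizers, MT engines and
# synthesizers would replace the toy stand-ins below.
Recognizer = Callable[[bytes], str]
Translator = Callable[[str], str]
Synthesizer = Callable[[str], bytes]


def speech_to_speech(audio: bytes, recognize: Recognizer,
                     translate: Translator, synthesize: Synthesizer) -> bytes:
    """Chain the three stages: speech -> text -> translated text -> speech."""
    text = recognize(audio)        # a recognition error enters the chain here
    translated = translate(text)   # and is translated as if it were correct
    return synthesize(translated)


def toy_recognize(audio: bytes) -> str:
    return audio.decode("utf-8")   # pretend the audio is already its transcript


def toy_translate(text: str) -> str:
    glossary = {"guten": "good", "morgen": "morning"}
    return " ".join(glossary.get(word, word) for word in text.lower().split())


def toy_synthesize(text: str) -> bytes:
    return text.encode("utf-8")    # pretend these bytes are audio


clean = speech_to_speech(b"Guten Morgen", toy_recognize, toy_translate, toy_synthesize)
noisy = speech_to_speech(b"Guten Borgen", toy_recognize, toy_translate, toy_synthesize)
print(clean)  # the correctly recognized utterance
print(noisy)  # the misrecognized word passes through untranslated
```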

Stay tuned to our blog to find out more about Pangeanic’s applied machine translation strategies and how our technology has provided success stories to larger translation companies, organizations and companies in a variety of sectors.

The Little PangeaMT Engine that could…

A little PangeaMT engine translates BIG!

The story of The Little Engine That Could inspired many children and adults around the world back in the day, and it still does… It was the little blue engine that was the star of the day, not the fancy and extravagant train engines, because she had the right attitude in wanting to help others!
Lessons learned: you don’t have to be big and stylish to do great things… you “just” need to believe and work hard… And be helpful! Others will notice the effort.

I arrived at Pangeanic HQ in sunny Valencia, Spain, in my first week on the job (I started as Market Development Director two weeks ago), and I kind of related that story to Pangeanic and its Machine Translation Engine solutions (this play on words with “engine” seems ideal in this situation). A medium-sized translation company at heart, Pangeanic has developed PangeaMT as a machine translation tool that eases multilingual technical publication, particularly for producers of large volumes such as the automotive manufacturing, engineering and electronics industries or pharma. It is also developing real-time MT for online help and interviews.

The Pangeanic staff at the HQ office is a mix of Spanish, Belgian, German, Italian and French translators, Project Managers and machine translation technical experts. All are passionate and dedicated to helping their customers meet translation and MT needs, and all work towards the same goal: providing down-to-earth yet effective MT and specialized technical translation solutions to Pangeanic’s international customers. What makes Pangeanic different is that they actively transfer their technology to users; they really want to make a difference.

It may seem that the PangeaMT engine is “little and humble” because the customer portal does not exhibit cool designs and colors. Maybe it has been drowned out by louder, more expensive and more sophisticated marketing from other MT solutions companies, and maybe there hasn’t been a lot of hubbub about Pangeanic’s solutions, but at the heart of it all, that little PangeaMT engine really can! I’ve seen it with my own eyes during my first weeks…

Pangeanic is not exactly a small player; it just does its work quietly… The research and development is done in conjunction with Valencia’s Polytechnic, a Computer Science Institute, but Pangeanic also takes part in European research together with leading European universities. They are industrial partners in the EXPERT project and have recently acted as an industrial tester for the EU translation workbench project Casmacat. Pangeanic was one of the key founding members of TAUS and its data-sharing initiative, which provided the organization with much of the corpus for experimentation in the early years. I’ve seen a technologically savvy company with smart solutions in place and with a years-long push to empower MT users (other LSPs and corporations) so they can become expert machine translation users themselves, as well as to show the corporate world the usefulness and benefits of MT technology.

Pangeanic works quietly in the background, happy to help its customers and, in fact, to bend over backwards to assist with any MT or translation issues. Pangeanic empowers users to learn more about how their MT engine works and how they can maximize its use: it opens up its technologies, providing training and access to customization in a way typical “machine translation engine sellers” do not. Pangeanic’s motto, “Bring democracy and affordability to Machine Translation”, is proudly reflected in the everyday work they do. So from my first observations, I can really say that this “little” PangeaMT engine really can… Pangeanic makes it possible… It works hard and believes in what it does.

by Maria Kania-Tasak

TAUS Tokyo Executive Forum 2014 – Machine Translation becomes embedded

Despite years of economic stagnation, a feeling we are so familiar with in Europe, Japan proved at the latest TAUS Summit that many good things can be expected from it when it comes to innovation and the application of machine translation as an embedded feature of services and technologies.

However, the first striking news came from Korea. CSLi’s acquisition of Systran had surprised many (I’m no exception), but the presentation at TAUS explained many of the unknowns. It also provided an insight into what the roadmap may be for the future of machine translation as a driving force in the translation industry. CSLi is Samsung’s machine translation provider for its famous S-Translator app. Its acquisition of a Western expert with vast experience in European languages has opened up many more language pairs and much more expertise to Samsung. This, in turn, provides massive amounts of users’ search and language data to the corporation.

Hunnect’s experience with machine translation engines built without big data was a lesson in hands-on application. Mr Sándor Sojnóczky classed “little” as 8M words within the human science domain. He was able to customize some engines, build on them and obtain real improvements by separating the material into 3 levels. His life sciences engine was based on a first level of general Life Science corpora, a second level based on Medical Devices and Clinical Devices, and a third level which was specific to a product. Despite the success (post-editors producing around 900 words an hour), the general impression from this non-developer is that MT companies hardly provide the world-class customer care that other types of companies provide. In a word, his machine translation vendor got his money and his ideas and quickly moved on. (Users of PangeaMT presented optimum results with a single software engine at 5 million words at past TAUS events and Localization World, following the launch of our User Empowerment in Barcelona 2010, but we will refrain from self-promotion).

Sándor Sojnóczky from Hunnect

Growth at printing companies via machine translation? That was the title of Masanobu Ogata’s presentation from Toppan Printing Co Ltd. Their plan is to offer their translation system to Japanese companies expanding globally for free or at a very low price. The focus is low-cost operations to translate manga, novels, how-to books and other printed Japanese content, digitize it and sell it through Booklive. They will use it to reduce their in-house localization work and make it more efficient. If it is free, it could become the standard system in Japan. Right now the system fills the demand for Asian languages, with a business translation ordering system to be launched in late 2014.

However, apart from in-industry news and developments, the limelight fell on two applications that are making translation a utility. One came from NTT Docomo, which introduced a kind of Google Glass device and a menu translator that can magically return translations over pictures taken with one’s mobile phone. I got news only two days ago that Google had bought a start-up to do exactly that, offering road sign translation as a use case.

The other groundbreaking application came from Mark Seligman at Spoken Translation. Mark introduced a live translator for the medical sector which can understand and translate sentences within domains for certain language pairs, and ran a live transmission over the internet, into German, with one of his associates.

Our presentation on PangeaMT as the ultimate User Empowerment platform with which to experiment and, above all, learn and grow your company’s machine translation strategy was well received and understood, with plenty of Q&A. Buyers of machine translation technology are getting wiser and wiser. They do not want to become passive users of lonely engines with some nice statistics thrown at them.

Explaining the advantage of technological independence rather than becoming an “engine buyer-user”.

Increasingly, the ability to grow and clean one’s system, know the best of Moses and tweak all options, putting maximum customization in the hands of users, is becoming more popular, although some players like to mix up the concepts of DIY and User Empowerment (for self-promotion) in their presentations at industry events.

Dion at Gala showing hamburgers. Dion at Gala showing instant soups.

The conference continued with Jaap explaining the TAUS roadmap for the Human Language Project, a long-term driving force, like the Human Genome Project, intended to disentangle languages. With the idea of MT becoming the lingua franca, the Data Repository with its matrix of languages is an attractive feature to any machine translation enthusiast. Other work includes quality metrics, studies on finding things like annotated data, and a program called FT2MT, which would include automatic selection for optimal model combination, a shift from translation data to a library of models, and a strong accent on evaluation, which must be automated as human evaluation is too costly and lengthy.

Finally, NTT Docomo’s menu translator was the application that won the TAUS Innovation prize. Plenty of things can be expected from Japan again.

NTT Docomo receives the prize

Reflections on Gala Istanbul 2014 – The perspective of a translation company language specialist

The GALA 2014 conference was held in Istanbul, Turkey, from the 23rd to the 26th of March, bringing together experts and key figures of the global translation and interpretation industry. Here’s our summary from the perspective of a translation company language specialist, A. Thömel.

On Monday 24th of March, the conference “kicked off” with a presentation by Fikret Orman, president of Beşiktaş, a Turkish football club from Istanbul with a history spanning more than 100 years. In his keynote address, Mr Orman shared some interesting insights into the management of his team and, in a side remark, made clear what type of player he is himself, mentioning his work schedule of 16 hours a day, 7 days a week. As a welcome gesture, he had brought with him one of the club’s black and white striped shirts for each of the over 300 GALA attendees. They didn’t hesitate to slip the shirts over the imaginary ones they were already wearing for their own teams in their respective leagues (CAT-Tool developers, MT vendors, Management Systems, etc.), leading to a visually striking moment which rightly suggested togetherness. Because that’s what Gala Istanbul was going to be about: getting together and learning from each other in a friendly and professional atmosphere.

While the full schedule of the individual speakers’ sessions can be looked up here, continue reading for some handpicked highlights at a glance:

  • Paul Filking from SDL

Paul Filking from SDL asked the question “Are you truly interoperable?”. His departure point was a typical scenario where a provider starts working with a CAT-Tool, then an accounting system, a content management system, and so on. As business grows, he starts working with other providers who have their own CAT, CMS, etc. That’s when integration between all those systems is needed, but it’s not there. The systems work in parallel but not truly together, so things get increasingly complicated. Here, the much-discussed question of “standards” comes in: are they actually helping with interoperability? While in certain scenarios standards really do help (the picture of maybe 20 different phone chargers is well known but still generates a giggle, as we all know the problem so well), Paul doesn’t believe they do here, because open standards will never be able to keep up with innovation. He instead focuses on APIs and custom extensions built by individual providers to deliver interoperability. With APIs, which function as doors into an application, all the different systems can sort of talk to each other, and one does not need to wait for new standards; one can take advantage of the latest technology trends using the tools one already has. This matters because people love “their” cat(-tool)s and often tend to stay with the one they bought. When Studio 2009 came out, Pangeanic’s Machine Translation solution, PangeaMT, was in its initial phase of commercialization and not yet directly integrated into CAT-tool workflows. Nowadays, everybody using Studio can directly choose PangeaMT via API amongst other providers, as can be seen in this slide of Paul’s presentation:

Slide from Paul’s presentation

Amongst the various scenarios mentioned as being made possible through API integration, one of the most striking was the following: somebody post-editing Machine Translation material by voice recognition, even with his phone!

  • Adam Blau from Blau Consulting

Adam Blau from Blau Consulting, in the section “Reaching Out”, suggested ways to take the “Black Magic” out of hiring localization Business Development Managers. He made clear from the beginning why the right hiring technique is so important: a hiring mistake could, after one year, mean not only 30,000 EUR lost in salary, but also 300,000 EUR in lost opportunities. Adam offered us many of the right questions to ask in order to avoid such a mistake, such as: If the BDMs are supposed to work with small companies, have they worked in one themselves before? During interviews, do they demonstrate the same skills and work habits they will need when it comes to understanding the buyers? Do they show a thirst for industry-specific knowledge? Will your company be able to provide the travel budgets etc. they are used to having? Do they fit your clients and the way they buy? Setting short-term goals is another way to secure your HR investment, and the respective expectations should be made very clear from the beginning.

  • Gustavo Lucardi from Trusted Translations

Gustavo Lucardi from Trusted Translations explained how to obtain new clients online. His first and possibly most important piece of advice set the theme of his presentation: “Play with Google!”. While most of us probably think we already know and use Google well, when the audience was asked who was using Google Webmaster Tools, only 10 hands or so were raised in a rather full conference venue. For Gustavo, this clearly meant an opportunity worth exploring. Google Webmaster Tools and the seemingly more popular Google Analytics are free and enable us, together with Google AdWords, to determine important things such as the conversion rates from Google to our websites, the impressions (the number of times our site appears in search results) and our AdWords investment. The secret is the interaction between all three tools to measure the traffic we receive and make sure it’s not just any traffic, but quality traffic.

One of the various SEO-related points mentioned, also related to traffic and the cost of pay per click, is the bounce rate. A “bounce” happens every time somebody lands on one of your web pages but then doesn’t continue to another page of your site. To give an example, if you landed directly on this blog and later, or right NOW, decide to click on Machine Translation and learn about PangeaMT, our machine translation solution, this site’s bounce rate goes down rather than up, which shows the site generates interest.
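For readers who prefer numbers, the bounce rate can be worked out in a few lines. This is a simplified sketch, assuming that each visit is represented only by the number of pages viewed and that a bounce is a visit with exactly one page view; analytics tools may define and measure it in more detail. The function name and the sample figures are ours, purely for illustration.

```python
from typing import Sequence


def bounce_rate(pages_per_visit: Sequence[int]) -> float:
    """Share of visits that viewed exactly one page (a 'bounce')."""
    if not pages_per_visit:
        return 0.0
    bounces = sum(1 for pages in pages_per_visit if pages == 1)
    return bounces / len(pages_per_visit)


# Four visits: two visitors left after the landing page, two kept reading.
before = bounce_rate([1, 1, 3, 2])

# A fifth visitor lands on the blog and clicks through to a second page,
# so the bounce rate goes down rather than up.
after = bounce_rate([1, 1, 3, 2, 2])

print(f"before: {before:.0%}, after: {after:.0%}")
```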

  • Fabiano Cid from Ccaps and Diego Bartolomé from tauyou

As far as the sessions go, a special mention has to go to Fabiano Cid from Ccaps and Diego Bartolomé from tauyou, who told the audience how they failed to win a 100,000,000-word contract and possibly offered the perfect mix of entertainment and the kind of stimulating, thought-provoking discussion one was expecting from this conference. The retelling of their communication with a Latin American editorial company raised subjects such as the difference between light post-editing and full post-editing.

Those were just some of the points discussed with that client, until the day he disappeared and they never heard from him again.

When Diego and Fabiano asked the audience members what they would have done, various industry professionals offered their interesting points of view.

Many more things about Gala 2014 are worth mentioning: the good food served upstairs (while downstairs some MT developers were proving an obsession with ill-conceived fast food metaphors in their misunderstanding of what User Empowerment in Machine Translation really is), an amazing dinner complete with a traditional belly dancer and the thinnest waiters in the whole of Turkey (if you have been there, you know why), the challenges of speed networking, the amazing organization by Laura Brandon and her great team… and the friendly people from Turkey, like the team of main sponsor Urban Translation Services.

This blog entry started with a football-related note and it will end with another one: Pangeanic was excited to realize that in many ways we and several other companies from Spain will have a “home field advantage” in 2015, as the venue for the next GALA conference was announced to be the beautiful city of Seville. We are already looking forward to meeting you there.

By A.Thömel

All About Partnership – The needs of Translation Companies at JABA-Translations Partner Summit 2014

As a leading translation company and developer of translation technologies, Pangeanic was a sponsor of this gathering, where heads of several translation companies, translation technology vendors, organizations such as ELIA and GALA, and academia met to review the needs of the European translation industry.

In a number of surveys carried out amongst translation companies in recent years, CAT tool training has been identified as a major gap at the academic institutions training future language industry professionals. A collaboration program has been set up to improve this under the aegis of ELIA, headed by Anu Carnegie-Brown from STP Nordic, and the plan is for LSPs and language technology providers to go to academia together to demonstrate how translation is done today. Françoise Bajon, president of ELIA, provided interesting figures from general polls in Europe showing the dramatically different perceptions of employers, students and universities as to how ready graduates are for the commercial world. Among employers and students, 35% thought the trainers had done a good job, whereas in academia around 78% thought the students had been trained well enough. Clearly, the needs of many industries are not being met by what graduates know when they come out of university.

Different training perceptions universities students employers in Europe

CAT tools are one way to ease website localization, for example. However, few companies consider the effort required in internationalization, assuming it will only be a translation of the original text. In fact, most websites use content management systems or publishing programs like WordPress, Drupal and TypePad. CAT tools have developed connectors, but it is impossible to connect to all such systems. Some organizations are working towards a standard, but publishing from CMS systems is still programmed and treated as a monolingual task. The adoption of the best technical processes often also relies on cooperation between the client and the solution provider. Hans Fenstermacher from GALA pointed out that an attitude of partnership in reaching the optimised end result is essential in this area, which may be one of the thorniest and most difficult challenges for the translation industry. MemoQ’s presentation was particularly enlightening as they have worked on file compatibility issues (interoperability), and Plunet’s talk on project management was also very helpful.

Several companies presented their efforts and developments working with larger MLVs like Welocalize (Raymund Prins from Global Textware). Stefan Gentz from Tracom and Jesper Sandberg from STP worked as joint masters of ceremonies throughout the event. Machine translation, and how it can really be made to work at translation companies, was also high on the agenda, with presentations from Harald Elsen of Delta International CITS about customization for IBM Germany, from Lucy Software and Services on the Iberian languages and, of course, on our Pangea Machine Translation system, dealing with how companies increasingly view MT as just another tool, with the argument moving quickly from adoption to either insourcing or outsourcing.

Joaquim Alves addressing the audience

Excellent organization skills, and special thanks to Joaquim Alves and his wife Manuela from JABA-Translations for all the hard work that comes with an event like this, bringing together ELIA and GALA, clients and partner companies in Europe.

Tekom and EXPERT Hybrid MT Winter School

Pangeanic attended two major events during November, promoting its flexible machine translation technologies to translation experts/LSPs and corporate users. We also took part in the first training event of the EU’s Marie Curie EXPERT program, which aims at training young researchers in hybrid machine translation technologies and linking up experienced researchers with industry.

Traditional Japanese music at Tekom

TAUS’ Razheb Choudhury – Manuel Herranz (Pangeanic) – Diego Bartolome (Tauyou)

Tekom concentrated pretty much on terminology management this year. With machine translation now being mainstream in most LSPs, or being adopted to some degree, we were happy to see that word of mouth resulting from the practical use of PangeaMT by adopters is the best publicity for a technology developed free of chains. Despite the typical noise from some developers, more and more LSPs are interested in owning a technology which empowers them to grow and become technologically savvy, as well as enabling them to design better solutions for their clients without being technologically dependent.

Lucia Specia – Post-Editing Presentation

Pangeanic’s technology has been available as a powerful platform to language service providers and corporations for some years. This has led the company to become part of national research projects and, currently, a hosting organization where young and experienced researchers will learn and develop novel hybridization skills in the field. The first event within the EXPERT project was held in Birmingham and organized by Wolverhampton University. The project brings academia and industry together, and aims at training the next generation of researchers at leading European institutions, research centres and technology companies. Apart from being an accolade for work already done, the project will allow industrial partners such as Pangeanic to expand machine translation capabilities, language combinations and new hybridization and combination techniques in several areas, making the best of computer-assisted translation and machine translation technologies (example-based, statistical and hybrid approaches) as well as including input from respected figures in post-editing research.

Pangeanic translation technology in the press

Pangeanic’s translation technology developments have often been the focus of international media and think tanks, like TAUS or Localization World, where we have showcased our technologies and use cases. Now, Pangeanic’s efficient translation workflows using customized Moses-based machine translation developments have also attracted local media attention.
The prestigious Spanish newspaper ElMundo.es printed a 3-page report with a full description of the company’s history, star use cases and applications, and renowned machine translation applications and developments.
pangeanic staff

Click here to obtain a free PDF of the article inno14oct.pdf.

A few days ago, Valencia’s regional Finance Minister was interviewed by the online newspaper 20minutos. He cited Pangeanic’s participation in the Valencian Global program and its machine translation technologies as key to creating an “innovation ecosystem” that can create “highly qualified jobs” due to its significant technological component. Pangeanic expects to grow its services as a result of taking part in the program and to expand its global sales network.

Pangeanic has also appeared in other digital media, such as notasdeprensa, again within the Valencian Global internationalization framework and entornointeligente, about its business development and coaching with leading entrepreneurship figures like MIT’s Bill Aulet.

Next time you think languages, think Pangeanic
Your Machine Translation Customization Solutions

         

Four Steps to Understanding DIY Machine Translation Customization

There has been some recent controversy on LinkedIn and in blogs about claims to higher technical levels of engine customization, about what machine translation engine customization and DIY machine translation customization are, and about how some people understand them.

PangeaMT specializes in custom-built systems which users (typically LSPs and translation departments) can later re-train in two different ways:

1. on-site if they have a full system installation (when data privacy is an issue)
2. using our own servers, in SaaS and via our API.

K. Vashee states that “The reality is that running an open source MT solution or using a “upload and pray” solution like that of many DIY MT vendors has become very easy.” This is a gross misunderstanding of what DIY MT is. DIY is about empowering the MT user to take control of the system or at least part of the process, rather than being a passive receiver of MT output that has to be quickly post-edited.

Building an MT engine has become pretty popular (that’s different from easy) and widespread in 2013. Systems are getting better as more and more data becomes available. Yet, data is not everything. One of our largest engines at PangeaMT holds more than 190 million words, and other engines contain five or six TMX files with over 300 MB of text data in each. Some little engines with under 5M words perform very well for the documentation task for which they have been built (see our joint presentation with Sybase at Localization World 2011 below).

I do not know any MT system builders who claim that using unclean data will not affect the output, or who leave such freedom to MT system users without training them. That is a key differentiator for PangeaMT: we train users so they can have an impact on how their MT will evolve and develop. An initial revision of (at least) part of the material, or of typical chunks of text within the domain, is the first step to MT engine customization. I summarize some key steps for a good DIY SMT implementation, whether on-site or off-site (SaaS):

1. Gather relevant, in-domain material.
Your own material is key for the best engine performance. The material you have translated in the past is likely to be similar to the material you will translate in the future. Those expressions, terminology lists, translation memories, HTML files, parallel data, even monolingual texts, will form the basis of your customized engine.
However, there may be times when you cannot share all your data. Do not despair: this is the advantage of PangeaMT. Any general, related data will serve the purpose of setting up the engine. We will train you and show you potential pitfalls with training sets and cleaning.
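As a rough illustration of how translation memories become training material, the sketch below pulls source/target segment pairs out of a TMX document using only the Python standard library. It is not PangeaMT code: the function name `read_tmx_pairs`, the sample content and the exact language-code matching are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) segment pairs from a TMX document string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Flatten the segment to plain text (assumption: no inline tags to strip).
                segs[lang] = "".join(seg.itertext()).strip()
        # Keep only translation units that have both languages.
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs

SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Press the start button.</seg></tuv>
      <tuv xml:lang="es"><seg>Pulse el botón de inicio.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Close the lid.</seg></tuv>
    </tu>
  </body>
</tmx>"""

pairs = read_tmx_pairs(SAMPLE_TMX, "en", "es")
```

The second translation unit has no Spanish side, so it is skipped. Real memories often carry regional codes such as `en-US`, which this exact-match sketch would need to be extended to handle.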

2. Ask your vendor to analyze the data provided and run cleaning procedures.
Your MT vendor should be transparent about “dirty data” and discarded segments, and present an analysis of the troublesome segments or datasets which should not be used for machine learning. Dirty data does not mean “bad translation” but very often “noise” that has been introduced by the translation management tool itself, rendering a segment unusable for machine learning. Explaining rather than translating, or offering bilingual versions, will of course confuse learning patterns. So will bad alignments, or adding characters such as – ” “ , ; : profusely when they should not be there.

Source same as target

Data cleaning is a key step in the system. We recommend deleting segments rather than trying to “repair” them. Most of the time, it is not worth the time – unless your data is really dirty.
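The "delete rather than repair" approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PangeaMT's cleaning routine: the function names, the length-ratio threshold of 3.0 and the "three or more punctuation marks in a row" rule are hypothetical choices, and a real pipeline would tune them per language pair.

```python
import re
from collections import Counter

# Assumption: three or more of these marks in a row count as tool-introduced noise.
STRAY_PUNCT = re.compile(r'[;:,"“”–-]{3,}')

def rejection_reason(source, target, max_ratio=3.0):
    """Return why a segment pair should be discarded, or None if it looks clean."""
    src, tgt = source.strip(), target.strip()
    if not src or not tgt:
        return "empty side"
    if src == tgt:
        return "source same as target"
    s_len, t_len = len(src.split()), len(tgt.split())
    # A large word-count imbalance often signals a bad alignment.
    if max(s_len, t_len) / min(s_len, t_len) > max_ratio:
        return "suspicious length ratio"
    if STRAY_PUNCT.search(src) or STRAY_PUNCT.search(tgt):
        return "stray punctuation"
    return None

def clean_pairs(pairs):
    """Drop dirty pairs and report how many were discarded for each reason."""
    kept, report = [], Counter()
    for source, target in pairs:
        reason = rejection_reason(source, target)
        if reason is None:
            kept.append((source, target))
        else:
            report[reason] += 1
    return kept, report

sample = [
    ("Open the file.", "Abra el archivo."),
    ("Save", "Save"),
    ("", "Guardar"),
    ("Click OK.", "Haga clic en Aceptar para confirmar todos los cambios realizados en el documento."),
    ("Print the page.", "Imprima ;;:: la página."),
]
kept, report = clean_pairs(sample)
```

Returning a per-reason count alongside the kept pairs is what makes the discards transparent: the user can see what was thrown away and why, instead of only receiving a smaller file.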

A lot of cleaning can be done prior to the material entering the system (see below).

Untranslated “to” would affect machine translation learning

There are more complicated “cleaning” routines which fall outside the scope of this article and involve revising alignments in phrase tables. We will leave that for keen system users.

3. Perform initial tests (first engines) together with your vendor.
Your vendor may do this and just present you with the final “good” engine, or with a variety of engines depending on your specialization. A habitual training method is to separate 2,000 segments from the training material and then ask the engine to translate those segments, thus obtaining a BLEU score (i.e. a measure of how closely the engine’s output matches the existing reference translations). However, this is neither the only way nor the most efficient, and BLEU scores cannot be compared across languages, nor even within the same language for different domains. An engine providing a 55% BLEU is no good when asked to translate out-of-purpose material, whereas PangeaMT systems have been reported to provide productivity increases of 50% to 300% in German with small engines scoring 38% BLEU but built for very specific purposes like software documentation or automotive manuals.

Put the engine to test with previous translations you have not provided or similar material.
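For readers who want to see what the held-out test involves, here is a minimal sketch: it sets aside a test set and computes a simplified corpus-level BLEU (clipped n-gram precision up to 4-grams, one reference per segment, whitespace tokenization, no smoothing). The function names are illustrative assumptions, and scores from this simplification will not match those of established evaluation tools exactly.

```python
import math
import random
from collections import Counter

def split_held_out(pairs, test_size=2000, seed=0):
    """Shuffle reproducibly and return (training pairs, held-out test pairs)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU on a 0-100 scale, one reference per hypothesis."""
    matches, totals = [0] * max_n, [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipping: an n-gram is credited at most as often as it occurs in the reference.
            matches[n - 1] += sum((h_counts & r_counts).values())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty punishes output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

In use, the engine would be trained on the first list returned by `split_held_out`, asked to translate the source side of the second, and its output passed to `corpus_bleu` together with the held-out target side. Because the held-out segments come from the same material as the training data, the score says little about out-of-domain text, which is why testing on material the engine has never seen remains necessary.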

4. Learn about engine re-training and the impact of post-edited material.
How big is your engine? How many words does it contain? What are its BLEU, METEOR and other scores? How many words do I need to retrain my engine? Does my vendor ask for 5% or 10% of the engine size, or does it promise on-the-fly re-training with just one sentence? Even though that sounds pretty good, a 20-word sentence will have little impact on any engine, particularly considering that even the “small” MT engines may contain 5 million words.
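The arithmetic behind that last point is easy to check. The short sketch below uses the figures from the paragraph above (a 20-word sentence, a 5-million-word engine, 5% and 10% re-training batches); the function names are just illustrative.

```python
def share_of_engine(new_words, engine_words):
    """Percentage of the engine's size that the new material represents."""
    return 100.0 * new_words / engine_words

def words_needed(engine_words, percent):
    """Words of new material needed to reach a given percentage of engine size."""
    return engine_words * percent // 100

engine_words = 5_000_000

one_sentence_share = share_of_engine(20, engine_words)  # about 0.0004 percent
batch_5_percent = words_needed(engine_words, 5)         # 250,000 words
batch_10_percent = words_needed(engine_words, 10)       # 500,000 words
```

A single 20-word sentence is roughly four ten-thousandths of one percent of a 5-million-word engine, while a 5% batch means a quarter of a million words of new material.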

We recommend a route whereby your post-edited material can enter the re-training cycle at any time, and a system where you are in control of both cleaning and re-training. PangeaMT offers both. You can upload new material any time after you complete a translation or finish a post-editing job. The latter is extremely good material and several papers point to benefits of post-edited material in MT engines. You can also schedule or set immediate re-training.

PangeaMT engine control panel

Those four steps are basic checkpoints you should bear in mind when moving your organization towards higher automation and adopting MT. Above all, you should also consider the cost of “ownership” or “SaaS” according to your needs and how deep you want to go into MT. Do you wish to position yourself as an authority with fully customized machine translation technology in your language pair or field? PangeaMT will help you. Or do you simply wish to save time and translate faster, without changing tools? Our TMX workflow will help you.
Many tools are fully compatible with PangeaMT, and our philosophy is to engage with tool and platform providers to offer open-standards solutions with no tie-ins. Our SDL plug-in allows you to work with a well-known tool and, simultaneously, benefit from owning your own engines and using the translation memory to build, customize and re-train the engine(s) for the next jobs. With PangeaMT, you will get an instant suggestion from your engine and can choose whichever is more relevant: the translation memory match or the suggestion translated by the engine. Post-editing takes a few seconds, whereas translating sentences from scratch can sometimes take almost a minute.
Because every engine is built with your own material, it is specific to you and trained to perform and translate in the fields in which you specialise and nothing else. Following strict TMX cleaning procedures and engine training methods, customized engines become extremely useful translation tools that aid translators in their everyday tasks. Your future post-edited material can retrain the engine very fast, improving accuracy more and more with every job.
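To make the choice between a translation memory match and an engine suggestion more concrete, here is a small sketch of one way a tool could rank the two. It does not show the SDL plug-in or any PangeaMT API: the `translate` callable is a hypothetical stand-in for the engine, the fuzzy score comes from Python's standard `difflib`, and the 0.75 threshold is an arbitrary assumption.

```python
from difflib import SequenceMatcher

def best_tm_match(source, translation_memory):
    """Return (score, target) for the closest TM source, or (0.0, None) if nothing matches."""
    best_score, best_target = 0.0, None
    for tm_source, tm_target in translation_memory.items():
        # Character-level similarity between 0.0 and 1.0.
        score = SequenceMatcher(None, source, tm_source).ratio()
        if score > best_score:
            best_score, best_target = score, tm_target
    return best_score, best_target

def suggest(source, translation_memory, translate, threshold=0.75):
    """Prefer a close TM match; otherwise fall back to the engine's suggestion."""
    score, tm_target = best_tm_match(source, translation_memory)
    if tm_target is not None and score >= threshold:
        return "TM", tm_target
    return "MT", translate(source)

tm = {"Press the start button.": "Pulse el botón de inicio."}

def fake_engine(text):
    # Hypothetical placeholder: a real engine would return a translation.
    return "MT:" + text

close_match = suggest("Press the start button", tm, fake_engine)
no_match = suggest("Remove the battery cover.", tm, fake_engine)
```

In practice the translator still makes the final call; a ranking like this only decides which suggestion is shown first.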

PangeaMT Webinar on Translation Customization and DIY MT

by Manuel Herranz

Machine translation is a hot topic, and will be a hot topic for some years to come. But there is more to it than a hot topic with a lot of mystifying hype around it.

Elia Yuste, Andi Frank and (I hope) myself came a little bit closer to dispelling doubts about what a customized machine translation engine is and what DIY MT is.

Our GALA webinar aimed at clarifying some misconceptions about language companies building their own tools and applying them successfully in the language market.

The language industry is a very varied industry, with a few technological players that dominate the landscape and a myriad of smaller tools which fit many purposes, large and small. PangeaMT was born as the technological division solving the needs of a translation company. PangeaMT now has a life of its own and is a well-respected, mature technology.

The presentation is already available on SlideShare.

Pangea Machine Translation became the first commercial application of Moses. Year after year, it has expanded on the core to add more functionalities, testing them at its translation department. It launched the now famous DIY MT package back in 2011, now part of many other platforms. PangeaMT now offers an API, a workflow and a TMX management system to clean and use material for machine translation learning and training.

The webinar went on to show how first-time customization and training data consultancy permeate each PangeaMT development. This is also applicable to data cleaning and reporting, which can later be automated once client-specific parameters and weights are in place. Pangeanic’s team stressed time and again that the concept behind PangeaMT is independence. Translators feel empowered by having and managing their own engines and seeing that their post-edited material has an impact on engine behavior pretty soon.

This webinar pointed to machine translation applications and deployment scenarios beyond the usual requesting of machine-translated output in a limited fashion for pre-translation. With PangeaMT, users can create their own ecosystem.

Click here to learn about how to use your bilingual files, glossaries and TMX as assets to build MT engines. Be in control of your domain-specific engines …always!
