Languages that defy machine translation

There are over 7,000 languages in the world. Some lend themselves to machine translation, while others remain a major challenge for it.

Machine translation is the process by which a system receives text in a source language and generates output text in a target language. To do so, it applies linguistic rules or statistical models, which may be simple or complex. Its ultimate goal is to reach full parity with professional human translation.

Current techniques and the existence of abundant bilingual data make it possible to reach this goal in many cases. However, some minority languages still pose a challenge for today's translation technology. Read on to find out more about this issue and its possible solutions using Neural Machine Translation systems.

The main challenges of machine translation

Machine translation techniques have evolved to include several options.

A common technique is rule-based machine translation (RBMT). This uses established rules to convert a source text into a target language. These rules are written by linguists and cover semantic, syntactic and lexical aspects.

This technology's biggest limitation is that it needs several layers of rules to function, and these rules, which tell the machine what to do, must be written by linguistic experts.

Challenges therefore crop up when the languages to be translated use a script other than the Latin alphabet, or have complex syntactic or verbal systems.
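To make the idea of hand-written rules concrete, here is a minimal sketch of a rule-based translator for a handful of English words into Spanish. The lexicon, the word classes and the single reordering rule are all illustrative assumptions; a real RBMT engine needs many more layers of rules than this.

```python
# Toy rule-based translation (English -> Spanish).
# LEXICON, ADJECTIVES and NOUNS are hypothetical, hand-written resources.
LEXICON = {
    "the": "el",
    "red": "rojo",
    "car": "coche",
    "is": "es",
    "big": "grande",
}
ADJECTIVES = {"red", "big"}
NOUNS = {"car"}


def reorder_adjective_noun(tokens):
    """Syntactic rule: English ADJ + NOUN becomes Spanish NOUN + ADJ."""
    out = list(tokens)
    i = 0
    while i < len(out) - 1:
        if out[i] in ADJECTIVES and out[i + 1] in NOUNS:
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair we just swapped
        else:
            i += 1
    return out


def translate_rbmt(sentence):
    """Apply the syntactic rule, then the lexical rule (dictionary lookup)."""
    tokens = sentence.lower().split()
    tokens = reorder_adjective_noun(tokens)
    # Words missing from the lexicon pass through untranslated.
    return " ".join(LEXICON.get(token, token) for token in tokens)


print(translate_rbmt("The red car is big"))
```

Every new word, word class or word-order pattern needs another entry or rule written by a person, which is why this approach scales poorly to languages with rich morphology or little expert coverage.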

The second machine translation technique is statistical machine translation (SMT). This is an efficient method that is still very popular today.

SMT relies on large amounts of bilingual data, from which the system learns statistical correspondences between languages and generates translations. Building such a system requires dedicated training on that data.
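As a rough illustration of how a statistical system learns from data, the sketch below estimates word translation probabilities from a tiny parallel corpus using expectation-maximization, in the style of the classic IBM Model 1 (simplified here, with no NULL word). The three-sentence corpus and the function names are assumptions made for this example; real systems train on millions of segments.

```python
from collections import defaultdict

# Hypothetical toy parallel corpus: (English tokens, Spanish tokens).
PARALLEL_CORPUS = [
    ("the house".split(), "la casa".split()),
    ("the flower".split(), "la flor".split()),
    ("a flower".split(), "una flor".split()),
]


def train_word_translation_probs(pairs, iterations=10):
    """Estimate t(target_word | source_word) with EM over sentence pairs."""
    tgt_vocab = {w for _, tgt in pairs for w in tgt}
    # Start from a uniform guess for every word pair seen in the same sentence pair.
    t = {(w_t, w_s): 1.0 / len(tgt_vocab)
         for src, tgt in pairs for w_s in src for w_t in tgt}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: share each target word among the source words in its sentence.
        for src, tgt in pairs:
            for w_t in tgt:
                norm = sum(t[(w_t, w_s)] for w_s in src)
                for w_s in src:
                    frac = t[(w_t, w_s)] / norm
                    count[(w_t, w_s)] += frac
                    total[w_s] += frac
        # M-step: renormalize the expected counts into probabilities.
        t = {pair: c / total[pair[1]] for pair, c in count.items()}
    return t


def best_translation(t, source_word):
    """Return the most probable target word for a source word."""
    candidates = {w_t: p for (w_t, w_s), p in t.items() if w_s == source_word}
    return max(candidates, key=candidates.get)


probs = train_word_translation_probs(PARALLEL_CORPUS)
for word in ["the", "house", "flower", "a"]:
    print(word, "->", best_translation(probs, word))
```

No one tells the model that "house" means "casa"; it infers this only because the two words keep appearing together. With too few sentence pairs there is simply not enough co-occurrence evidence to learn from.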

Human linguists tend to supervise the machines' work, but limited access to quality data complicates matters. While there are languages where translated material is abundant (English, Spanish, French, German...), the pool of available data is limited when it comes to minority languages.

This is either because there are not as many translations in the first place, or because the translations that do exist are not of high quality. This scarcity in turn increases the demand for such translations.

The most common language combinations are English and Spanish, Spanish and German, Spanish and French, and Italian and Spanish, to name a few. Deviating from these popular languages means more work is needed, not only in terms of data collection, but also in terms of the time required to perform these translations.

Minority languages

For statistical machine translation to be applicable, a language must have sufficient available data to feed the algorithms. Languages such as English and Spanish don’t present a challenge in this sense, as models trained on 50 million segments or more are available.

However, as minority languages (e.g. Burmese or Gujarati) typically have less bilingual data available, a machine translation engine will be limited to producing low-quality results.

How to train a machine to translate minority languages

When bilingual data (a basic component of any translation process) is scarce, special techniques must be applied. This is where neural machine translation (NMT) techniques come into play.

Neural Machine Translation

NMT uses neural networks, trained through machine learning, as the translation algorithm. During training, the model learns millions of parameters that together convert the original text into translated text.

This form of artificial intelligence is loosely inspired by the way the human brain works. The aim is for machines to learn the meaning of words rather than merely memorize words or phrases. This type of automated translation opens the door to handling more complex data and language models.

Today, such systems are trained with millions of pages of text. A goal for the future is to reduce the amount of data needed for this training.

At present, for minority languages, or more precisely, for any language where resources are scarce, neural translation works in the same way as with other languages, although the model used must be trained (created) with special techniques.

These techniques include:

Synthetic bilingual data generation, i.e. bilingual data specifically created to improve the machine translation process. This approach has proven effective in Korean-to-English machine translation, according to a study by Guanghao Xu, Youngjoong Ko and Jungyun Seo of Seoul University.
Increasing the amount of data provided to the machine translation engine by having native linguists create data for each language.
Use of monolingual data.
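
One widely used way to combine synthetic bilingual data with monolingual data is back-translation: genuine sentences in the target language are machine-translated back into the source language, and the resulting pairs are added to the training set. The sketch below shows the idea only. It is not necessarily the method used in the study cited above, and the word-for-word `toy_reverse_translate` function is a hypothetical stand-in for a real target-to-source model.

```python
def back_translate(monolingual_target, reverse_translate):
    """Pair each genuine target sentence with a machine-made source sentence."""
    synthetic_pairs = []
    for target_sentence in monolingual_target:
        source_sentence = reverse_translate(target_sentence)
        if source_sentence.strip():  # drop empty model outputs
            synthetic_pairs.append((source_sentence, target_sentence))
    return synthetic_pairs


# Hypothetical stand-in for a trained English -> Spanish model.
TOY_REVERSE_LEXICON = {"hello": "hola", "world": "mundo"}


def toy_reverse_translate(sentence):
    """Word-for-word lookup; unknown words pass through unchanged."""
    words = sentence.lower().split()
    return " ".join(TOY_REVERSE_LEXICON.get(word, word) for word in words)


# Scarce human-translated data: (Spanish source, English target).
real_pairs = [("hola", "hello")]
# Plentiful monolingual text in the target language.
monolingual_english = ["hello world"]

augmented_pairs = real_pairs + back_translate(monolingual_english, toy_reverse_translate)
print(augmented_pairs)
```

The target side of every synthetic pair is genuine human text, so the model still learns to produce fluent output even when the machine-made source side is imperfect.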

Even without large amounts of translated text, known as parallel data, machine translation engines trained with these techniques can learn the relationships between languages and generate quality translations.

However, neural machine translation systems also face a number of challenges in the coming years, including achieving greater accuracy and faster learning.

Therefore, although NMT systems are indispensable in the automated translation industry today, they still require human intervention, which in many cases is critical.

How Pangeanic's ECO platform works

ECO is Pangeanic’s language services platform that provides a machine or hybrid translation service.

In addition to accurate software and the latest available technologies, Pangeanic has a team of professional native linguists in charge of both training the machines and reviewing the automated results before they reach a client.

Our team holds expert knowledge in artificial intelligence technology, allowing us to adapt to our clients' individual requests, regardless of the type of language or translation difficulty.

ECO is cloud-based, meaning it’s accessible to any user with a browser and Internet access. This intuitive technology allows our users to process texts directly or upload formatted files.

Our extensive resources allow us to automatically translate hundreds of millions of words in record time (thousands of pages per hour), anonymize content, summarize, extract knowledge and key data, and convert unstructured data into structured content.

Additionally, this service is suitable for a range of uses, including e-commerce, international legal communications and other specialized machine translation needs.