

The Importance of Human Parallel Data and Translations in Training MT Systems

It is a rare occurrence to find a spare 30 minutes in Manuel Herranz's busy schedule as Pangeanic's CEO. However, the topic of today's interview holds significant value for audiences who have been exploring Large Language Models, GenAI, and AI in general.

Manuel has been building language technology since 2009, making him an ideal speaker on the subject. One key aspect of the new AI revolution is data, and particularly the importance of human parallel data and translations in the training of MT systems.

Today, Manuel sheds light on the significance of incorporating human expertise and parallel data in training translation systems. We delve into the topic to understand how human data contributes to the development of accurate and reliable translation systems. 

One of the dangers of LLMs is their capacity to produce human-like data. Consequently, the significance and value of genuinely human data cannot be overstated when building datasets. And the best way to obtain high-quality human data to train MT systems is... translations.

Some translation companies/language service providers (LSPs) have established dedicated data departments, either as auxiliary or core business operations. Feeding those systems requires a lot of data – so much that even the major LSPs cannot provide it. This growing need led to the emergence of a new generation of data providers, some of which have secured substantial investments from venture capital firms in recent years. 


+ Aurora:  

Manuel, I am excited to have Pangeanic's CEO here with us today! First, let's discuss the importance of human data and translations in training translation systems.  

This is my first question:  

What is the fundamental role of human data in training machine translation systems? 


- Manuel:  

That’s a good question, and it's a great place to start because the quest for data has been a primary focus for Pangeanic since we first started developing Statistical Machine Translation systems back in 2008-2009. We believed that 5 to 8 million words were a fair amount to train a system. In those days, human-generated data derived from translations was the only source, and it was limited, mostly consisting of publicly available datasets.  

We experimented a lot with open datasets, cleansed them like all new MT players at the time, and learned some essential lessons for the coming years: the importance of data augmentation – which in those days was done mostly manually! 

Good quality human-produced parallel data plays a critical role in the training of machine translation systems. This was the case with statistical systems, and it is even more important with neural machine translation. Human-produced data brings an understanding of the cultural context that is essential for achieving precise and natural translations. 

Indeed, there are instances where a translation requires more than a word-by-word or literal rendition. It often involves capturing the essence of a local expression, which holds immense value.

Systems must learn how expressions naturally occur in each language, encompassing not just a single example but a multitude of instances. This includes familiarizing themselves with set expressions, varied ways of conveying the same idea, domain-specific jargon, and other nuances specific to the language. By incorporating a diverse range of examples, MT systems can better grasp the intricacies and subtleties of language, resulting in more accurate and contextually appropriate translations.  




This data is not easy to create. As I said, we’ve been working on building data sets for almost 15 years and these sets include translations performed by human experts which help the systems learn linguistic patterns and enhance the coherence and quality of machine translations. It is the best data to feed a system if we want our system to sound natural in the language we are translating to. The quality of training data plays a critical role in the accuracy and fluidity of translations generated by AI systems. 


+ Aurora: 

Why is it important to use human translations as references to improve automated translation systems?  


- Manuel: 

"Human translations are an invaluable resource for enhancing automated translation systems."


Human translators possess the ability to grasp and adapt the meaning and contextual nuances of a text, considering cultural and linguistic subtleties. By feeding machine translation systems with this type of data (translations produced by real translators), automated translation systems can learn from the decisions made by human experts, thereby improving quality and accuracy.

But let’s not forget that data cleansing must always take place prior to feeding the system. Cleansing can be more or less strict, according to the developers’ needs. For example, we may decide whether to include tags or not, if sentences end (or not) in a full stop, or if certain “extensions” may (or not) be included, such as for example the translation of foreign names, foreign films, etc. 
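The cleansing rules Manuel describes can be sketched as a simple pair-level filter. This is a minimal illustration, not Pangeanic's actual pipeline; the thresholds, the tag-stripping regex, and the length-ratio heuristic are all assumptions chosen for clarity.

```python
import re

def keep_pair(src, tgt, require_full_stop=False, strip_tags=True,
              max_len_ratio=2.0):
    """Apply simple cleansing rules to one parallel sentence pair.

    Returns the (possibly modified) pair, or None if it should be dropped.
    All thresholds here are illustrative, not a production configuration.
    """
    if strip_tags:
        # Remove inline markup tags, one of the choices Manuel mentions.
        src = re.sub(r"<[^>]+>", "", src).strip()
        tgt = re.sub(r"<[^>]+>", "", tgt).strip()
    if not src or not tgt:
        return None
    # Drop pairs whose character lengths diverge too much (likely misalignments).
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    if ratio > max_len_ratio:
        return None
    # Optionally require both sentences to end in terminal punctuation.
    if require_full_stop and not (src[-1] in ".!?" and tgt[-1] in ".!?"):
        return None
    return src, tgt
```

Stricter or looser variants of each rule can be toggled per project, which is exactly the "more or less strict, according to the developers' needs" point above.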

The use of synthetic data and back-translation has been a popular way to create basic or baseline systems, and although the amount of parallel data can grow exponentially this way, there is always a risk of introducing machine-generated noise and bias into the training dataset. For example, filtering the monolingual set will leave out certain sentences, but it is very difficult to identify sayings and expressions (they are endless!). They will enter the training set as the machine (not a human) has understood them, often quite literally. This is completely undesirable. 
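The back-translation workflow and its noise problem can be sketched as follows. The two `mt_*` callables stand in for trained MT models (hypothetical here), and the word-overlap round-trip check is a deliberately crude noise filter of my own devising: it discards some pairs the machine mangles, but, as Manuel notes, literally rendered idioms will still slip through.

```python
def back_translate(monolingual_tgt, mt_tgt_to_src, mt_src_to_tgt,
                   min_overlap=0.5):
    """Create synthetic (source, target) pairs from monolingual target text.

    mt_tgt_to_src / mt_src_to_tgt are stand-ins for trained MT systems.
    A round-trip overlap check drops the most obviously garbled pairs.
    """
    pairs = []
    for tgt in monolingual_tgt:
        synthetic_src = mt_tgt_to_src(tgt)          # machine-generated source
        round_trip = mt_src_to_tgt(synthetic_src)   # translate it back
        tgt_words = set(tgt.lower().split())
        rt_words = set(round_trip.lower().split())
        overlap = len(tgt_words & rt_words) / max(len(tgt_words), 1)
        if overlap >= min_overlap:                  # crude noise filter
            pairs.append((synthetic_src, tgt))
    return pairs
```

The filter catches gross failures only; it has no way to know that "kick the bucket" came back as a literal kicking of a literal bucket, which is precisely why human review remains necessary.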


+ Aurora:  

What is Pangeanic's approach to collecting and utilizing human data to train its translation systems?  

- Manuel: 

At Pangeanic, our mission is to combine Artificial Intelligence with human ingenuity to extract value from data in a scalable way. We firmly believe in the importance of blending Artificial Intelligence with human translation expertise.  

Human-generated translations have limitations in terms of scalability. We spend the best part of the year producing parallel corpora to create stock. Sometimes, we use post-editing to scale according to clients’ needs and preferences. This methodology is known as "AI-assisted translation," where human translators collaborate with an automated translation system (their own, quite often!).  

Basically, clients provide us with the text they require content for, and our teams translate, recreate, and occasionally generate several versions of the same sentence to have masculine/feminine alternatives, or to create bias-free material. 

We have a team that is constantly gathering material (mostly speech-based) and transcribing it to prepare candidates for human evaluation and approval. We curate data that is not easily crawlable or found in public repositories to guarantee freshness and to add new material that the algorithms have not seen before.  

Another team works with clients on pure translation services as we have done since 2000 (some clients are not worried about the translation memory, and we agree on its re-use in exchange for a discount). A different team works on post-editing, utilizing the system's suggestions to expedite their work, while the system learns from the translators' decisions. Another team runs data augmentation. The whole company is geared towards parallel corpora acquisition, which enables us to continuously enhance our systems and deliver high-quality and consistent translations. 



+ Aurora:  

What are the benefits of having automated translation systems that rely on human data?  


- Manuel: 

Automated translation systems based on human data have numerous benefits. Firstly, they provide more accurate and natural translations, capturing linguistic and cultural nuances more effectively. Imagine sayings in English such as “kick the bucket,” “pull out all the stops,” or “rock the boat.” They cannot be translated literally into any language; you need an alternative idiomatic expression or, depending on the language, simply an explanation of the meaning. Only a trained linguist can work on such projects. You need data at scale. 

If I were to name four benefits of having purely (or mostly) human data in the training of machine translation systems, I would say: 


1. Intention and Context

This is what I mentioned above: human translation data captures the intention of the original text as well as the cultural and linguistic context surrounding it.

By using human translation data, machine translation systems can learn to interpret the entire message, avoiding literal or inconsistent translations. Translation is a highly complex process that is not limited to mere word substitution!  


2. Higher consistency and fluidity through post-editing 

Secondly, let’s not forget that high-quality human data can also come from post-editing.

This is a process by which a human translator reviews and corrects translations generated by an automatic system, often called human-in-the-loop AI. It is a hybrid approach that combines the best of both worlds:

- The fast processing and scalable capability from machine learning.

- The experience and judgment of human translators.

Post-editing data allows the system to learn from errors and improve the consistency and fluidity of translations, gradually refining the model as more corrections are made.
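One way post-edits become a learning signal is by measuring how much the translator had to change, an HTER-style edit ratio. The sketch below is an illustrative implementation under that assumption, not a description of any specific vendor's metric: a word-level Levenshtein distance normalized by the post-edit length, which can rank segments for retraining priority.

```python
def word_edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (wa != wb)))  # substitution
        prev = cur
    return prev[-1]

def post_edit_effort(mt_output, post_edit):
    """HTER-style ratio: word edits needed / post-edit length.

    0.0 means the translator changed nothing; higher values flag
    segments worth feeding back into training first.
    """
    mt, pe = mt_output.split(), post_edit.split()
    return word_edit_distance(mt, pe) / max(len(pe), 1)
```

Segments with high effort scores expose exactly the "errors and incorrect patterns" the system should learn from next.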


3. Adaptation to specific domains and terminology  

Human or post-edited translation data can be collected and curated to suit specific domains and terminology. For example, in medical or legal translation, it is crucial to use precise and specialized terminology. Human translators are experts in these fields and can ensure that technical terms are translated correctly.

By training machine translation systems with human translation or post-edited data specific to these domains, enhanced accuracy is attained while circumventing errors that may arise from a broader approach. 
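Curating domain-specific training data can be as simple as scoring sentences against a terminology list. This is a toy heuristic to illustrate the idea; the term set and threshold below are invented for the example, and real pipelines would use richer signals (language-model perplexity, classifier scores) rather than raw term counts.

```python
# Hypothetical in-domain vocabulary for a medical MT engine.
MEDICAL_TERMS = {"dosage", "contraindication", "hypertension", "placebo"}

def domain_score(sentence, term_set):
    """Fraction of words that belong to the in-domain term set."""
    words = sentence.lower().split()
    return sum(w.strip(".,") in term_set for w in words) / max(len(words), 1)

def select_in_domain(pairs, term_set, threshold=0.05):
    """Keep parallel pairs whose source side shows enough domain terminology."""
    return [(s, t) for s, t in pairs if domain_score(s, term_set) >= threshold]
```

Training on the selected subset is what keeps specialized terms translated with their domain sense rather than the general-language one.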


4. Quality control and correction of systematic errors  

Without any doubt, the use of human translation or post-edited parallel corpora enables a more rigorous quality control in translations generated by machine translation systems.

Human translators can identify systematic errors or incorrect patterns in translations and correct them. Over time, this contributes to improving the consistency and accuracy of machine translation, although I would say that the most important point is that we are building a feedback cycle that allows simultaneous, continuous improvement of the system.  


 Recommended reading: 

Human-in-the-loop (HITL); making the most of human and machine intelligence


+ Aurora:  

How do you envision the future of automated translation systems and the significance of human data in shaping their development?  


- Manuel: 

"In the future, automated translation systems will continue to advance, but the integration of human data in their development will always remain crucial."  


Indeed, new approaches like GenAI and LLMs offer opportunities to generate valuable data. However, I view these methods as alternative ways of utilizing machine translation with post-editing.

Without adequate quality control, human expertise throughout the process, and necessary checks for quality assurance and verification, we cannot consider machine output as a reliable input for another machine. Maintaining a robust cycle of human involvement, quality control, and verification is crucial to ensure the accuracy and reliability of machine-generated translations. 

Human data provides the necessary guidance to ensure accurate, contextually appropriate, and culturally relevant translations.

We are at a juncture where collaboration between Artificial Intelligence and human translators guarantees an optimal and satisfying translation experience for users. Furthermore, there is an increasing emphasis on developing systems that can swiftly adapt to specific scenarios, encompassing precise terminology and desired stylistic elements. This push for adaptability enables the delivery of tailored translations that align with the unique requirements of each context. 

