What are back translation and synthetic data?
Back translation, as used in Machine Translation training, takes monolingual text in the target language and translates it into the source language with an existing translation model. Each machine-generated source sentence is then paired with its authentic target sentence to form synthetic parallel data. More broadly, synthetic data for Machine Translation systems is created by automatically generating new texts from existing monolingual data. Back translation is therefore one method for generating synthetic parallel data, and it offers a solution to the challenge of creating bilingual corpora for low-resource languages or for domains with scarce resources.
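As a rough illustration, the process of turning monolingual target-language text into synthetic sentence pairs can be sketched as follows. This is a minimal sketch, not a production pipeline: `toy_reverse_model` and `TOY_LEXICON` are hypothetical stand-ins for a real target-to-source translation model, and the Spanish-to-English direction is only an assumption made for the example.

```python
from typing import Callable, List, Tuple


def back_translate(
    target_sentences: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each authentic target sentence with a machine-generated source."""
    pairs = []
    for target in target_sentences:
        # The source side is synthetic: it is the output of the reverse model.
        synthetic_source = reverse_translate(target)
        pairs.append((synthetic_source, target))
    return pairs


# Hypothetical word-for-word lexicon standing in for a real reverse MT model.
TOY_LEXICON = {"el": "the", "gato": "cat", "duerme": "sleeps"}


def toy_reverse_model(sentence: str) -> str:
    """Translate word by word; unknown words are copied through unchanged."""
    return " ".join(TOY_LEXICON.get(word, word) for word in sentence.lower().split())


synthetic_corpus = back_translate(["El gato duerme"], toy_reverse_model)
print(synthetic_corpus)
```

Note that only the target side of each pair is human-written. Any mistake made by the reverse model ends up on the source side of the training data.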
While both methods can be effective for improving Machine Translation performance, they also have some known disadvantages.
Machine Translation has revolutionized the way we communicate across languages and has created better business opportunities for small, medium, and large enterprises because of its efficiency and accessibility. The quality of the training data is crucial to the performance of any Machine Learning (ML) system.
Both techniques have been extensively researched and used in academic circles, and there are many publications on the subject.
In this article, we are going to delve into the drawbacks of using back translation and synthetic data in the training of Machine Translation systems.
Main disadvantages of back translation
One of the most widely known drawbacks of back translation is that it introduces errors into the training data. The synthetic side of each sentence pair is machine output, so any mistakes made by the model that produced it, as well as errors already present in the original monolingual text, end up in the back-translated corpus. The MT system trained on that corpus then learns those errors and reproduces them in its own translations.
These errors are known as "artifacts" and can include grammatical errors, unnatural word choices, problems with fluency, and, in the worst cases, complete nonsense.
Main disadvantages of synthetic data
Synthetic data is produced by automatically generating additional text from existing monolingual data. The quality of this generated text can vary: some synthetic data may be of high quality and in-domain, while other synthetic data may be of low quality.
Synthetic data can be repetitive, which makes MT system training less effective. Additionally, synthetic data can introduce biases into the MT system.
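One simple way to check how repetitive a synthetic corpus is could be to measure the share of sentences that repeat an earlier one. The sketch below assumes that two sentences count as duplicates when they match after lowercasing and whitespace normalization. That rule, and the helper names `duplicate_rate` and `deduplicate`, are illustrative choices and not an established standard.

```python
from collections import Counter
from typing import List


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(sentence.lower().split())


def duplicate_rate(sentences: List[str]) -> float:
    """Fraction of sentences that repeat an earlier (normalized) sentence."""
    if not sentences:
        return 0.0
    counts = Counter(normalize(s) for s in sentences)
    duplicates = sum(count - 1 for count in counts.values())
    return duplicates / len(sentences)


def deduplicate(sentences: List[str]) -> List[str]:
    """Keep the first occurrence of each normalized sentence, in order."""
    seen = set()
    unique = []
    for sentence in sentences:
        key = normalize(sentence)
        if key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique


sample = [
    "The bay is closed.",
    "the bay  is closed.",
    "Load the truck at bay 4.",
    "The bay is closed.",
]
print(duplicate_rate(sample))
print(deduplicate(sample))
```

A high duplicate rate suggests that the generator is adding volume without adding new information for the MT system to learn from.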
Synthetic data is often created from just one specific domain or genre of text. This can lead to the MT system learning biases or jargon and expressions that are specific to that domain or genre. These biases or specific jargon can then be propagated to the MT system’s translations of other domains or genres (think of the word “bay” meaning a part of the coast or a loading area, or even a specific area in a supermarket where food and items are stocked).
Consequences of error propagation from back translation and synthetic data
The main consequences of error propagation are:
1. Loss of contextual accuracy
Back translation involves translating a target sentence back into the source language using an existing translation model. This technique aims to generate additional parallel training data. However, during this process, the translated sentences may lose their original contextual accuracy, resulting in distorted or unnatural translations. As mentioned above, the model being trained may then learn incorrect or misleading patterns, leading to inaccurate translations during inference.
2. Increased noise and error propagation
Synthetic data generation creates artificial training examples through rule-based methods or pre-trained language models. It is a powerful technique, but it can inadvertently introduce noise and errors into the training dataset. These inaccuracies can cascade through the model and hamper its capacity to produce accurate translations: because the synthetic data may contain incorrect word choices, grammatical errors, or unnatural sentence structures, the trained system will generate suboptimal translations.
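Some of the most obvious noise can be caught with simple heuristics before training. The sketch below is one possible filter, under stated assumptions: it drops pairs with an empty side, pairs where the target is an untranslated copy of the source, and pairs whose word-count ratio exceeds a threshold. The `max_ratio` value of 2.0 is an arbitrary illustrative default, and checks like these do not detect subtler problems such as wrong word choices.

```python
from typing import List, Tuple


def looks_noisy(source: str, target: str, max_ratio: float = 2.0) -> bool:
    """Flag sentence pairs that fail basic sanity checks."""
    src_len = len(source.split())
    tgt_len = len(target.split())
    if src_len == 0 or tgt_len == 0:
        return True  # one side is empty
    if source.strip().lower() == target.strip().lower():
        return True  # target is an untranslated copy of the source
    ratio = max(src_len, tgt_len) / min(src_len, tgt_len)
    return ratio > max_ratio  # lengths are implausibly different


def filter_pairs(
    pairs: List[Tuple[str, str]], max_ratio: float = 2.0
) -> List[Tuple[str, str]]:
    """Return only the pairs that pass every heuristic check."""
    return [(s, t) for s, t in pairs if not looks_noisy(s, t, max_ratio)]


pairs = [
    ("the cat sleeps", "el gato duerme"),
    ("", "una frase sin origen"),
    ("hello", "hello"),
    ("yes", "sí sí sí sí sí sí"),
]
clean = filter_pairs(pairs)
print(clean)
```

Filters of this kind reduce noise but cannot replace linguistic review, since a fluent but wrong translation passes every check shown here.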
3. Limited coverage of real-world language variations
Back translation and synthetic data often fail to capture the diverse linguistic variations found in real-world translations. When Machine Translation systems are trained mostly or solely on synthetic data, they may struggle when faced with complex structures, long sentences, idiomatic expressions, or regional language variations. Without exposure to the rich nuances of authentic human-quality translations, these systems may produce inaccurate or nonsensical output when confronted with such linguistic complexities.
4. Ethical concerns and bias amplification
As in any Machine Learning project, MT systems are not immune to biases present in their training data. The use of back translation and synthetic data can inadvertently amplify existing biases or introduce new ones. If the original training data contains biased or discriminatory language, these biases can be magnified when generating synthetic data or through the iterative process of back translation. Consequently, the resulting Machine Translation system may propagate biased or offensive translations, perpetuating discrimination and inequality.
5. Increased computational costs
Producing back-translated and synthetic data requires substantial computational resources and time-consuming processes.
Generating high-quality synthetic data often involves either complex rule-based methods or fine-tuning pre-trained language models, which is computationally expensive because of the hardware and processing time involved.
Back translation requires a reverse translation model to be run over the monolingual corpus, and the resulting data adds further training iterations. This increases the overall training time and resource requirements, making it a costly technique for organizations and researchers with limited computational capabilities.
6. Overfitting and generalization issues
Back translation and synthetic data can also lead to overfitting. Overfitting occurs when the model learns the patterns in the training data too well and is unable to generalize to new data. This can happen because synthetic data is not a perfect representation of real-world translations. The model may learn to rely on the patterns in the synthetic data and ignore other important features of real-world translations. As a result, the model may struggle to accurately translate sentences that deviate from the patterns seen in the training data.
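A common way to spot this kind of overfitting is to compare the score a system obtains on its own training data with the score it obtains on held-out, human-translated data. The sketch below uses exact-match rate as a deliberately crude stand-in for metrics such as BLEU or chrF. The function names and the example sentences are hypothetical.

```python
from typing import List


def exact_match_rate(hypotheses: List[str], references: List[str]) -> float:
    """Share of system outputs that are identical to the reference translation."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must have the same length")
    if not references:
        return 0.0
    matches = sum(h.strip() == r.strip() for h, r in zip(hypotheses, references))
    return matches / len(references)


def generalization_gap(train_score: float, heldout_score: float) -> float:
    """Positive values mean the system does better on data it was trained on."""
    return train_score - heldout_score


# Hypothetical outputs on sentences taken from the synthetic training data.
train_refs = ["the cat sleeps", "the dog eats", "the bird sings", "the fish swims"]
train_hyps = ["the cat sleeps", "the dog eats", "the bird sings", "the fish swims"]

# Hypothetical outputs on held-out, human-translated sentences.
heldout_refs = ["it is raining cats and dogs", "break a leg", "the cat sleeps", "see you soon"]
heldout_hyps = ["it rains cats and dogs", "break one leg", "the cat sleeps", "see you quickly"]

train_score = exact_match_rate(train_hyps, train_refs)
heldout_score = exact_match_rate(heldout_hyps, heldout_refs)
print(generalization_gap(train_score, heldout_score))
```

A large gap between the two scores is a warning sign that the system has memorized patterns in the synthetic data and does not carry them over to real-world text.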
Good-quality parallel corpora and extensive cleansing are essential to the building of a good Machine Translation system.
As we have seen, the use of back translation and synthetic data has several disadvantages. If properly managed by a team of experienced Machine Learning engineers, however, it can be effective in improving the performance of Machine Translation systems, particularly those that are built for specific purposes or domains. It is therefore important to be aware of the potential issues associated with these techniques and to take steps to mitigate the six consequences listed above.
Some ML teams have chosen to train on a combination of monolingual data and carefully selected back-translated data. With proper linguistic guidance, quality assurance, and review, this can help to reduce the impact of errors from the original monolingual data.
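The selection step could look something like the sketch below. It rests on assumptions that are not stated in this article: that the trusted portion of the mix is a set of human-quality sentence pairs, that each back-translated pair carries a quality score between 0 and 1 (for example from a reviewer or a quality-estimation tool), and that the synthetic share is capped relative to the trusted data. The names `ScoredPair` and `build_training_mix` and the threshold values are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class ScoredPair(NamedTuple):
    source: str
    target: str
    score: float  # hypothetical quality score in [0, 1]


def build_training_mix(
    trusted: List[Tuple[str, str]],
    back_translated: List[ScoredPair],
    min_score: float = 0.8,
    max_synthetic_ratio: float = 1.0,
) -> List[Tuple[str, str]]:
    """Combine trusted pairs with the best-scoring back-translated pairs."""
    # Keep only pairs above the quality threshold, best first.
    selected = sorted(
        (pair for pair in back_translated if pair.score >= min_score),
        key=lambda pair: pair.score,
        reverse=True,
    )
    # Cap the synthetic share relative to the amount of trusted data.
    cap = int(len(trusted) * max_synthetic_ratio)
    kept = selected[:cap]
    return list(trusted) + [(pair.source, pair.target) for pair in kept]


trusted_pairs = [("good morning", "buenos días"), ("thank you", "gracias")]
candidates = [
    ScoredPair("the cat sleeps", "el gato duerme", 0.95),
    ScoredPair("the the cat", "el gato", 0.50),
    ScoredPair("see you soon", "hasta pronto", 0.85),
    ScoredPair("the dog eats", "el perro come", 0.90),
]
mix = build_training_mix(trusted_pairs, candidates)
print(mix)
```

With the default settings, the low-scoring pair is rejected and only as many synthetic pairs as trusted pairs are admitted, so the two best candidates are kept.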
Lastly, MT systems can be evaluated on a variety of datasets, including synthetic data, to identify and address any biases that may have been introduced.
Of course, our wholehearted recommendation is to use properly curated, translated, or post-edited data!
Only by amassing large amounts of human-quality data will a system reflect the nuances of each language and offer not literal or machine-like translations but true expressions from one language to another that convey ideas, not just words.
Relying on either method can lead to loss of contextual accuracy, increased noise and error propagation, limited coverage of language variations, ethical concerns, higher computational costs, and overfitting and generalization issues. These are serious challenges for developers who are planning to develop or update solid Machine Translation systems based on reliability and performance metrics.
By relying on human-produced translations and parallel corpora from Pangeanic’s translation services, developers can address these challenges and continue to improve and update Machine Translation systems with current expressions, ensuring their systems deliver accurate and culturally sensitive translations.