
20/10/2021

How to train your machine translation engine

A machine translation engine offers many advantages, with reduced translation times and minimized use of human resources being the main benefits.

The market value of translation engines is estimated to grow at an annual rate of 7.1%, from US$153.8 million in 2020 to US$230.67 million in 2026, according to Mordor Intelligence.

As machine learning and deep learning technologies get smarter, the results produced by machine translation engines become ever more accurate. These advances make a strong case for adequately training machine translation engines if successful translations are to be achieved.

Here are some key guidelines for training translation engines and achieving quality translation results.

What is a machine translation engine?

A machine translation engine is software capable of translating texts from a source language to a target language.

Applying artificial intelligence to these technologies has boosted their accuracy. Today, they are capable of analyzing huge amounts of data and transforming them into information in order to produce accurate translations, including at the semantic and speaker intent level.

AI-enabled machine translation engines identify correlations and structure in huge amounts of data, extracting information that helps them solve problems that would take a human thousands or millions of hours to process.

The capabilities of a machine translation engine are multiplied by the addition of technologies such as machine learning and deep learning, which allow the engine to learn continuously and improve the results it provides. But enhancing translation quality depends on good training.

How to train your machine translation engine

Optimum machine translation results start with adequate translation technology solutions. Machine learning and deep learning capabilities must be developed by a team of competent human professionals tasked with routinely overseeing them.

The goal of training is for the engine to provide the most accurate translations possible and to adapt its output to user preferences (specific terminology, tone, and stylistic preferences, for example).

The training of a machine translation engine can be summarized in these four steps:


1. Incorporation of the base data

The basic ingredient, the fuel for training an engine, is data in the form of example sentences translated from the source language into the target language.

At this point, it's vital that the data fed into the AI system is of high quality; there is a whole market of training data available for this purpose.
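
In practice, this base data is simply a parallel corpus: pairs of source and target sentences. Here is a minimal Python sketch, assuming two hypothetical line-aligned plain-text files (corpus.en and corpus.es, one sentence per line); the file names and language pair are illustrative:

```python
def load_parallel_corpus(src_path: str, tgt_path: str) -> list[tuple[str, str]]:
    """Read two line-aligned files and return (source, target) sentence pairs."""
    with open(src_path, encoding="utf-8") as src, open(tgt_path, encoding="utf-8") as tgt:
        return [
            (s.strip(), t.strip())
            for s, t in zip(src, tgt)
            if s.strip() and t.strip()  # skip pairs where either side is empty
        ]

# Usage (assuming corpus.en and corpus.es exist):
# pairs = load_parallel_corpus("corpus.en", "corpus.es")
# pairs[0] -> e.g. ("The engine translates text.", "El motor traduce texto.")
```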

Open-source software such as Pangeanic's ECO, together with NLP (Natural Language Processing) experts, has allowed organizations to create their own artificial intelligence and machine translation processes.

Using data beyond text is also a possibility for training, but image and video data must be labeled correctly when incorporated into the training process. It is crucial that a compatible annotation and data segmentation process is created.
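
As an illustration only, a labeled record for one image might look like the sketch below; the schema and field names are hypothetical, not a real annotation format, and the point is simply that every asset carries consistent, machine-readable labels:

```python
import json

# Hypothetical annotation record for one labeled image.
annotation = {
    "file": "street_sign_0042.jpg",        # image being labeled (illustrative name)
    "language": "de",                      # language of the text in the image
    "segments": [
        {
            "bbox": [120, 48, 340, 96],    # text region as (x, y, width, height)
            "text": "Ausfahrt freihalten"  # transcription of that region
        }
    ],
}

print(json.dumps(annotation, ensure_ascii=False, indent=2))
```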

Voice data is another type of data that can be used when training a machine translation engine. This is a specific process, as automatic speech recognition systems require large amounts of high-quality audio data recorded in numerous contexts and environments. Pangeanic's machine translation technology has the necessary resources to provide customized audio data sets that match specific requirements such as age, accent, language, speaker profile, and subject, and can even include background noise.

 

2. Data cleaning and normalization

After raw data collection, the data should be cleaned and normalized. This process includes, for example, always using the correct quotation marks for each language. From this point on, the translation engine can be fed the appropriate data.

ECO cleans data automatically when sending files to be trained, and only requires that the data be in TMX (Translation Memory eXchange), the standard XML-based translation memory format. Using the TMX standard guarantees compatibility and easy integration with machine translation platforms like ECO.
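
As a rough sketch of what this stage involves (not ECO's actual implementation), the following Python reads translation units from a TMX file and applies the normalization step mentioned above, replacing straight quotes with the curly quotes each language expects; the per-language quote conventions shown are illustrative:

```python
import xml.etree.ElementTree as ET

# xml:lang attributes appear under the XML namespace in ElementTree.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
# Illustrative opening/closing double-quote conventions per language.
QUOTES = {"en": ("\u201c", "\u201d"), "de": ("\u201e", "\u201c")}

def normalize_quotes(text: str, lang: str) -> str:
    """Replace straight double quotes with the language's opening/closing quotes."""
    opening, closing = QUOTES.get(lang, ("\u201c", "\u201d"))
    out, open_next = [], True
    for ch in text:
        if ch == '"':
            out.append(opening if open_next else closing)
            open_next = not open_next
        else:
            out.append(ch)
    return "".join(out)

def read_tmx(path: str) -> list[dict[str, str]]:
    """Return one {language: normalized segment} dict per TMX translation unit."""
    units = []
    for tu in ET.parse(path).getroot().iter("tu"):
        unit = {}
        for tuv in tu.iter("tuv"):
            lang = tuv.get(XML_LANG, "").split("-")[0].lower()
            seg = tuv.findtext("seg", default="")
            unit[lang] = normalize_quotes(seg.strip(), lang)
        units.append(unit)
    return units
```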


3. Possibility of sentiment analysis

Increasingly advanced technologies are enabling translation engines to analyze the sentiment of texts, i.e. to understand and take into account the true meaning of a text or the speaker's intention when translating. For this purpose, machine learning and NLP are combined: translation tools can now evaluate the tone of a message and consider its genuine intent.


When analyzing documents and texts (taken, for example, from social networks) to determine the sentiment or opinions of users, these texts are classified (as positive, negative, or neutral) and labeled to improve the quality of translation results.
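
A toy sketch of that labeling step is shown below; the tiny keyword lexicon is purely illustrative, standing in for the trained ML/NLP classifiers a production system would actually use:

```python
# Illustrative keyword lexicons; real systems use trained classifiers.
POSITIVE = {"great", "excellent", "love", "fantastic"}
NEGATIVE = {"terrible", "awful", "hate", "broken"}

def label_sentiment(segment: str) -> str:
    """Classify a segment as positive, negative, or neutral."""
    words = {w.strip(".,!?").lower() for w in segment.split()}
    if words & POSITIVE and not words & NEGATIVE:
        return "positive"
    if words & NEGATIVE and not words & POSITIVE:
        return "negative"
    return "neutral"

labeled = [(s, label_sentiment(s)) for s in [
    "I love this new phone!",
    "The delivery was terrible.",
    "The package arrived on Tuesday.",
]]
# -> [('I love this new phone!', 'positive'),
#     ('The delivery was terrible.', 'negative'),
#     ('The package arrived on Tuesday.', 'neutral')]
```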


4. Maintenance

Basic training can last days, and a stopping criterion allows the engine to automatically detect when it has ceased learning anything new, so training can end without wasting time. In addition, when specializing a model for a specific domain, training is performed with the available data, and more aggressive or more conservative training is applied depending on how far the model is to be specialized.
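
Here is a minimal sketch of such a stopping criterion (commonly called early stopping): halt when the validation score has not improved for a set number of checks. The score sequence is simulated; in practice it would come from evaluating the engine (e.g. BLEU on a held-out set) after each training epoch:

```python
def early_stopping(scores, patience: int = 3) -> int:
    """Return the epoch index at which training would stop."""
    best, stale = float("-inf"), 0
    for epoch, score in enumerate(scores):
        if score > best:
            best, stale = score, 0   # new best model: reset the counter
        else:
            stale += 1               # no improvement this epoch
            if stale >= patience:
                return epoch         # the engine has stopped learning new things
    return len(scores) - 1

# Simulated validation scores per epoch: steady gains, then a plateau.
scores = [21.3, 24.1, 26.0, 26.4, 26.3, 26.4, 26.2, 26.1]
print(early_stopping(scores))  # -> 6: stop after 3 checks without improvement
```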


Beyond the initial training, achieving the best results requires a continuous training process.

Platforms such as ECO, in its new version 2, have the advantage of allowing users to train the engine in a private, simple and intuitive way, continuously improving results.

 

May be of interest: NLP Techniques: The Most Powerful Natural Language Processing Methods

Tips for improving the quality of your machine translation

1. Amount of data

It’s advisable to work with large amounts of data in order to guarantee translation quality. This is, in part, one of the challenges for the translation of ‘minority’ languages. Pangeanic offers large amounts of scalable data thanks to its huge repository of 10 billion aligned segments, which allows for scalable training of machine translation engines, ensuring higher translation quality. We also customize our services for each data set used to train the AI of each client’s machine translation engine.


2. Data quality

Quantity is not everything. Successful training of translation technology requires data of the highest possible quality, in the desired domain, i.e., using the correct terminology.

That’s why at Pangeanic, we supply clean parallel segments from our large database for our on-demand translation services. In addition, all translated data undergoes strict quality controls and checks to ensure it is clean and valid for the correct training of machine translation engines.


3. The importance of the team

Pangeanic's expert team provides advice adjusted to each client’s needs. To achieve this, we combine our data science experts, linguists, developers, and other human resources to obtain quality data that can be managed successfully.

With over 20 years’ experience in the language service industry, and as NLP developers since 2009, our clients trust us to evaluate each project with precision and care. Our professional linguists manage data collection by following a specific workflow tailored to each client’s needs. All Pangeanic data is scalable, accurate and human-generated – a key feature for any successful machine/deep learning project, as human-generated data carries less ‘noise’ compared to web translation alignment (scraping) or crowdsourcing.

As developers of machine translation systems, we understand the adverse effects poor quality data can have on algorithms. We have full confidence in our data and our extensive experience in translation quality control services. Want to learn more about the right machine translation engine for your business? Get in touch to discuss how our ECO system can best fit your needs.