TAUS has published a new guide on its website for anyone interested in working with MT and using their own data to create machine translation engines. The Technical Guide to SMT Training Data is intended for users and organizations keen to train engines with their own data. It deals with the preparation of translation training data for Statistical Machine Translation (SMT) and examines the data preparation processes (typically for bilingual TMX) that enable data and algorithms to work together. The TAUS report, by Tom Hoar, also explores how to define an organization's training data strategy to match overall system design and how to identify potential sources of bilingual, well-aligned TMX. It discusses the challenges of merging corpora from multiple sources into large but stable data sets, and explores several methods for turning translation memories from different sources into SMT training data. Finally, it looks into the speech roots of SMT and introduces the concept of exception management as a context for preparing SMT training data.

Pangeanic has made use of many bilingual data sets from several organizations, including the EU and the UN, to mix data and customize machine translation engines for some of its clients.
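To make the idea of preparing TMX from several sources more concrete, here is a minimal Python sketch. It is not taken from the TAUS guide. It assumes TMX files that store each language in a `tuv` element carrying an `xml:lang` (or older `lang`) attribute, and it matches language codes exactly after lowercasing. The function names `read_tmx_pairs` and `merge_pairs` are hypothetical, chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Yield (source, target) segment pairs from a TMX document string."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # Concatenate all text in the segment (including text inside
                # inline elements) and normalize whitespace.
                segs[lang] = " ".join("".join(seg.itertext()).split())
        src, tgt = segs.get(src_lang), segs.get(tgt_lang)
        # Skip translation units that lack either side or have empty text.
        if src and tgt:
            yield src, tgt


def merge_pairs(*sources):
    """Merge pairs from several sources, dropping exact duplicates
    while keeping first-seen order."""
    seen = set()
    merged = []
    for pairs in sources:
        for pair in pairs:
            if pair not in seen:
                seen.add(pair)
                merged.append(pair)
    return merged
```

In practice, the merged pairs would typically be written out as two line-aligned plain-text files, one per language. Real-world cleaning usually goes further than exact-duplicate removal (for example, length-ratio filtering), which this sketch leaves out.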
Full report:
https://www.taus.net/think-tank/reports/translate-reports/technical-guide-to-smt-training-data
Further reading:
- Knowledge Center: What is translation memory?
- Translation Technology: Translation Memory