Statistical machine translation, known as SMT or StatMT, is an approach to machine translation that yields the most probable output (translation) of each element that makes up a sentence. It is based on the use of statistical models that analyze and search for relationships between two texts with the same content: one in the source language and the other in the target language.
This type of MT model has its advantages, but also certain challenges and drawbacks that one should be aware of. It is also important to understand the differences between statistical machine translation and neural machine translation (NMT).
The Origins of statistical machine translation
Warren Weaver introduced the first notions of SMT back in 1949. However, statistical machine translation as it is known today dates from the late 1980s and early 1990s, when researchers at IBM's Thomas J. Watson Research Center reintroduced the approach: having used stochastic techniques to develop a speech recognition system, they decided to experiment in the field of translation.
The research was carried out using existing human translations (bilingual corpora), specifically the proceedings of the Canadian Parliament (the Hansard) in English and French.
It was a successful experiment that consisted of aligning sentences, sets of words, and single words in order to calculate the probability of correspondence between the words in the source language and those in the target language. SMT was the most widely studied machine translation approach before the introduction of NMT.
Foundation and rules
SMT is grounded in information theory: the study of the storage, processing, extraction, and use of information, drawing on statistics, computer science, information engineering, electrical engineering, and statistical mechanics.
A text is translated based on the probability that a string of words in the target language is the translation of the string of words in the source language. That is, based on the probability p(e|f), where:
- f is the source language string.
- e is the target language string.
This probabilistic distribution model has been approached from different perspectives. The most widely implemented is Bayes' Theorem:
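- p(e|f) ∝ p(f|e)p(e)
This theorem splits the model into two subproblems: a translation model, p(f|e), which gives the probability that the source string is the translation of the target string, and a language model, p(e), which gives the probability of seeing that target string. The best translation is the target string e with the highest value of p(f|e)p(e).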
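As a minimal sketch of this decision rule, the snippet below picks the candidate e that maximizes p(f|e)p(e). The candidate strings and all probability values are made-up illustrative assumptions, not figures from a trained system.

```python
# Toy noisy-channel decision: choose the target string e that
# maximizes p(f|e) * p(e) for the source string f = "el clavo".
# All probabilities are invented for illustration.

# Translation model p(f|e): probability of the source given each candidate.
translation_model = {
    "the nail": 0.6,
    "the clove": 0.4,
    "nail the": 0.6,
}

# Language model p(e): how plausible each candidate is as target-language text.
language_model = {
    "the nail": 0.05,
    "the clove": 0.01,
    "nail the": 0.001,
}

def best_translation(tm, lm):
    """Return the candidate with the highest p(f|e) * p(e)."""
    return max(tm, key=lambda e: tm[e] * lm.get(e, 0.0))

best = best_translation(translation_model, language_model)
print(best)  # the nail
```

Note how "nail the" ties with "the nail" under the translation model alone; the language model is what separates them.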
Types of statistical machine translation
The different kinds of SMT are as follows:
Word-based translation
In this case, the basic translation unit is a word in the source language. In other words, it is a model that translates word for word. However, due to idioms, morphology, and compound words, the number of words in the target sentence is usually different from the source.
Fertility is the number of target-language words that a source-language word gives rise to. Word-based models must also deal with ambiguity: for example, the Spanish word "clavo" can mean both "clove" and "nail" in English. A source word can be assigned to several target words, but two source words cannot be grouped together to produce a single target word.
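To make the word-level probabilities concrete, here is a rough sketch of how word translation probabilities t(f|e) could be estimated from sentence pairs with expectation-maximization, in the style of the classic word-based models. The three sentence pairs, the function name train_word_model, and the simplifications (no empty word, no fertility or position modeling) are assumptions made for illustration.

```python
from collections import defaultdict

def train_word_model(pairs, iterations=10):
    """Estimate t(f|e) by EM from (source_words, target_words) pairs."""
    src_vocab = {f for fs, _ in pairs for f in fs}
    # Start from a uniform distribution over source words.
    t = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (f, e) links
        total = defaultdict(float)   # expected count of e
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm  # share of f explained by e
                    count[(f, e)] += frac
                    total[e] += frac
        # Re-estimate t(f|e) from the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

pairs = [
    ("la casa".split(), "the house".split()),
    ("la flor".split(), "the flower".split()),
    ("una flor".split(), "a flower".split()),
]
t = train_word_model(pairs)
print(t[("casa", "house")] > t[("la", "house")])  # True
```

Because "la" co-occurs with "the" in two pairs, the model gradually assigns "la" to "the", which leaves "casa" to be explained by "house".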
Phrase-based translation
This type of machine translation translates complete sequences of words in order to reduce the restrictions of word-based SMT. These sequences are called phrases or blocks. They are typically not linguistic phrases but sequences found in corpora using statistical methods; restricting them to linguistically motivated phrases has been shown to reduce translation quality.
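The following sketch illustrates why translating whole sequences helps. It uses a tiny hand-written phrase table and a greedy longest-match lookup; both the table and the function translate_greedy are hypothetical, and a real phrase-based system would score many segmentations with probabilities and reordering rather than take the first match.

```python
# Hypothetical phrase table: source phrase (tuple of words) -> target phrase.
phrase_table = {
    ("por", "favor"): "please",
    ("por",): "for",
    ("favor",): "favor",
    ("espere",): "wait",
}

def translate_greedy(words, table, max_len=3):
    """Translate left to right, always taking the longest known phrase."""
    out = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            key = tuple(words[i:i + n])
            if key in table:
                out.append(table[key])
                i += n
                break
        else:
            # No phrase matched: copy the unknown word through unchanged.
            out.append(words[i])
            i += 1
    return " ".join(out)

print(translate_greedy("por favor espere".split(), phrase_table))  # please wait
```

A word-for-word lookup over the same table would produce "for favor wait", which is the kind of error phrase-based translation is meant to avoid.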
Syntax-based translation
In this type of machine translation, the SMT model translates syntactic units, such as the (partial) parse trees of sentences or expressions, rather than individual words or strings of words.
Translation based on language models
Language models help make translation flow better and sound more natural. A language model is a function that assigns each candidate sentence in the target language the probability that a native speaker would use it, so that the system can select the most natural result. It also makes it easier to choose the most appropriate word when multiple translations are possible.
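One simple way to build such a function is a bigram model with add-one smoothing, sketched below. The three-sentence corpus, the candidate list, and the helper names are illustrative assumptions; production language models are trained on far larger corpora and use more refined smoothing.

```python
import math
from collections import Counter

corpus = [
    "the nail is in the wall",
    "the hammer hit the nail",
    "the clove is in the soup",
]

def train_bigram(sentences):
    """Count bigrams and their left contexts, with sentence boundary markers."""
    contexts, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return contexts, bigrams, len(vocab)

def log_prob(sentence, model):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )

model = train_bigram(corpus)
candidates = [
    "the hammer hit the nail",
    "the hammer hit the clove",
    "hammer the nail hit the",
]
best = max(candidates, key=lambda c: log_prob(c, model))
print(best)  # the hammer hit the nail
```

The scrambled candidate scores lowest because most of its word pairs never occur in the corpus, which is how a language model penalizes unnatural word order.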
Operational phases of statistical machine translation
Statistical machine translation works in three main phases:
Elaborating the parallel text
The creation of parallel text follows these steps:
- Choice: two texts or documents with the same content are chosen, one in the source language and the other in the target language. The larger the volume of text, the higher the quality of the final translation.
- Extraction: sections of the content are extracted from the source language text and the corresponding section in the target language.
- Separation: each section is broken down into sentences.
- Preparation: entries are prepared for the system.
- Alignment: each sentence in one language is mapped to the corresponding sentence in the other language.
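The separation and alignment steps above can be sketched as follows. This is a deliberately naive version that assumes both texts contain the same sentences in the same order; the splitter, the length-ratio filter, and the sample texts are illustrative assumptions, and real aligners also handle sentences that are merged, split, or missing in one of the languages.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]

def align_by_position(source_text, target_text, max_ratio=2.0):
    """Pair sentences by position, dropping pairs with very different lengths."""
    pairs = []
    for s, t in zip(split_sentences(source_text), split_sentences(target_text)):
        ratio = max(len(s), len(t)) / max(1, min(len(s), len(t)))
        if ratio <= max_ratio:
            pairs.append((s, t))
    return pairs

english = "The session is open. We will vote tomorrow."
french = "La séance est ouverte. Nous voterons demain."
pairs = align_by_position(english, french)
for src, tgt in pairs:
    print(src, "|", tgt)
```

The length-ratio filter is a cheap sanity check: a very short sentence paired with a very long one is more likely a misalignment than a translation.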
Modeling
This phase includes:
- Translation modeling: determines the set of possible translations for each of the sentences.
- Language modeling: determines the flow of each sentence. This model is the one that assigns the highest probability to the sentence that uses the most natural language.
- Searching: this is the process in which the system navigates through all the aligned sentences in order to find the most likely translation for a given sentence.
Estimating and Refining
The estimation and refining phase minimizes any possible errors for a higher-quality result. Grammatical connectors and heuristic algorithms are used for this purpose.
Differences between SMT and NMT
There are several fundamental differences between statistical and neural machine translation:
- Neural machine translation requires more training and a larger corpus than statistical machine translation.
- NMT is better than SMT at handling morphology, syntax, word order, and agreement.
- SMT is a model that generates the translation based on dividing sentences into phrases and words, while NMT uses complete sentences.
Basically, SMT works by collecting statistics, that is to say, it bases its method on counting repetitions of phrases and words.
NMT also learns from observed examples, but instead of only storing counts it uses real-valued parameters that it updates whenever it observes something new, and it works with complete sentences.
Advantages of statistical machine translation compared to other methods
In comparison to traditional human translation, and in certain contexts, SMT presents the following advantages:
- Although its output is partial and may contain errors, SMT translates text quickly, which allows:
  - Urgent access to data.
  - Easier work for human translators, since they only have to make corrections.
In addition, SMT offers more cost-effective translations and better use of resources, although its quality is not at the level of professional translators.
The challenges of SMT today
Statistical machine translation faces two main challenges: word order in different languages and unknown words.
Word order within a sentence is not the same in different languages. For example, the typical word order in English is subject, verb, object, but this may differ in other languages. There are also differences in the order of modifiers, such as those of nouns.
Because SMT must take word order into account, reordering models are used to provide better alignment between the two texts.
Additionally, SMT stores words separately, without establishing any relationship between them. As a result, out-of-vocabulary words or phrases, that is, those that were not in the training resources, cannot be translated.
To address this second problem, word embeddings and semantic lexical resources are used, among other methods.
SMT was a dominant technique until several years ago. The field of machine translation has made a quantum leap towards neural models based on artificial intelligence (AI) that allow reliable translations and facilitate global communication.
At Pangeanic, we combine the expert knowledge of our professional translators with the best of AI to offer near-human quality neural machine translation.
Contact us! We will devise and deliver the solution you need.