18/11/2011

The science of machine translation

One important aspect that is often overlooked in discussions of machine translation is that MT is one branch of a more general science: pattern recognition and machine learning. MT marketing staff often gloss over the rationale behind the maths they do not understand (I don't claim to understand all the maths myself!). Linguists, on the other hand, tend to concentrate on solving the impossible, overrating the importance of rules and linguistic data within MT systems.

Therefore, before reporting on the events Pangeanic has been involved in, where its DIY SMT (or "Machine Translation for the Masses", as it has been called) has been present in one way or another, it is worth reading an interview with Enrique Vidal of the Polytechnic University of Valencia, one of the world's leading figures in pattern recognition and machine learning. Mr Vidal recently received Spain's National Prize for Computing 2011.

The interview is in Spanish, but it is well worth reading if you want to place machine translation within the larger scientific domains to which it belongs. MT is not just about selling output, and not just about computing: it is about finding the (in)correct patterns (and then adding some specific features). Coincidentally, news that code-cracking had helped decipher a previously untranslatable text (the Copiale Cipher) appeared days later in The New York Times. There is no room here to summarize our attendance at TAUS Silicon Valley and Localization World :)

I will leave that for our next post and the upcoming exhibition at the Japan Translation Federation (JTF) Festival. However, I would like to point out that some of the most valuable input I gathered about real advances in MT came from the academic International Workshop on Using Linguistic Information for Hybrid Machine Translation (LIHMT-2011) and the practical Saturday session ML4HMT (META-NET WP2), held in conjunction with DFKI. These sessions were not for the faint-hearted or for commercially driven minds: they addressed those with a genuine interest in taking the best of MT research and applying it to development. They were technical, and also trend-setting. Research teams worldwide are facing the same problems, and similar, new and quite imaginative approaches are emerging all over the world, from Hong Kong to Spain, from the US to Norway. The advances published at LIHMT-2011 will set the agenda for features that, sooner or later, will be integrated into future MT offerings. For example:

  • Lemmatisation and annotation for morphologically rich languages, for example Czech and Basque (the latter with even fewer resources available); see the lemma-annotation sketch after this list.
  • Syntax-based approaches and word re-ordering for very unrelated language pairs (such as Asian or Semitic languages into and out of European languages); see the reordering sketch below.
  • Web-based annotation tools.
  • Hybridisation of techniques: analysis climbs from the morphological layer (m-layer) through the analytical layer (a-layer) to the tectogrammatical layer (t-layer), transfer happens at the t-layer, and synthesis descends back through the a-layer and m-layer; see the pipeline sketch below.
  • Word-sense disambiguation; see the Lesk sketch below.
  • Mixtures of rule-based and statistical approaches to improve predictability.
  • Post-editing effort estimation for MT systems, with and without linguistic features. Linguistic features are useful for direct error detection and for automatic post-editing, but for sentence-level confidence estimation (CE) they raise issues of sparsity and representation (length bias); see the toy CE regressor below.
  • New metrics such as VERTa, which use linguistic knowledge organized at different levels (lexical, morphological and syntactic information, and sentence semantics); see the layered-metric sketch below.
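
To make the lemmatisation bullet concrete, here is a minimal sketch of lemma annotation in Python. The tiny Czech lexicon, the tag strings and the lemma|tag output format are all hypothetical illustrations; real systems use full morphological analysers with far richer tag sets.

```python
# Minimal sketch of lemma annotation for SMT training data.
# The tiny Czech lexicon below is hypothetical; real systems use
# full morphological analysers with much richer tag sets.
TOY_LEXICON = {
    "domech": ("dům", "NOUN+Loc+Pl"),   # "(in the) houses", locative plural
    "domy":   ("dům", "NOUN+Nom+Pl"),
    "viděl":  ("vidět", "VERB+Past+Masc"),
}

def annotate(tokens):
    """Replace each surface form with 'lemma|tag' so the statistical
    model can share counts across inflected forms."""
    annotated = []
    for tok in tokens:
        lemma, tag = TOY_LEXICON.get(tok.lower(), (tok, "UNK"))
        annotated.append(f"{lemma}|{tag}")
    return annotated

print(annotate(["viděl", "domy"]))
# ['vidět|VERB+Past+Masc', 'dům|NOUN+Nom+Pl']
```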
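
For the reordering bullet, the toy rule below moves the clause-final verb of an SOV clause (a romanised Japanese example) next to the subject, roughly approximating SVO order before phrase-based translation. The rule, the tag set and the example clause are illustrative assumptions, not any system presented at the workshop.

```python
# A toy syntax-based reordering rule: move the clause-final verb of an
# SOV source clause next to the subject, approximating SVO order.
def sov_to_svo(tagged):
    """tagged: list of (token, pos) pairs for one clause."""
    verbs = [i for i, (_, pos) in enumerate(tagged) if pos == "VERB"]
    if not verbs or verbs[-1] == 0:
        return [tok for tok, _ in tagged]       # nothing to move
    v = verbs[-1]                               # clause-final verb
    reordered = tagged[:1] + [tagged[v]] + tagged[1:v] + tagged[v + 1:]
    return [tok for tok, _ in reordered]

clause = [("watashi", "PRON"), ("ringo", "NOUN"),
          ("o", "PART"), ("tabeta", "VERB")]
print(sov_to_svo(clause))    # ['watashi', 'tabeta', 'ringo', 'o']
```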
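
The hybridisation bullet describes a Prague-style layered pipeline (as in TectoMT). The stub functions below only show the data flow between the m-, a- and t-layers under that assumption; each one stands in for a full NLP component.

```python
# A stub pipeline showing the layered data flow only: analysis climbs
# m-layer -> a-layer -> t-layer, transfer happens at the t-layer, and
# synthesis descends t -> a -> m back to surface text.
def analyse_m(text):   return text.split()                  # tokens + morphology
def analyse_a(m):      return {"deps": m}                   # surface dependency tree
def analyse_t(a):      return {"tnodes": a["deps"]}         # deep syntax/semantics
def transfer(t_src):   return {"tnodes": [f"tgt({n})" for n in t_src["tnodes"]]}
def synth_a(t):        return {"deps": t["tnodes"]}         # t-layer -> a-layer
def synth_m(a):        return a["deps"]                     # a-layer -> m-layer
def render(m):         return " ".join(m)                   # m-layer -> text

def translate(text):
    return render(synth_m(synth_a(transfer(
        analyse_t(analyse_a(analyse_m(text)))))))

print(translate("Viděl jsem domy"))   # tgt(Viděl) tgt(jsem) tgt(domy)
```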
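
Word-sense disambiguation can be illustrated with the classic Lesk algorithm, which picks the dictionary sense whose gloss overlaps most with the surrounding context. The sketch below uses NLTK's implementation, assuming NLTK and its WordNet data are installed; it is one simple baseline, not necessarily what any workshop paper used.

```python
# Word-sense disambiguation with the classic Lesk algorithm via NLTK.
# Requires: pip install nltk, then nltk.download('wordnet') once.
from nltk.wsd import lesk

context = "I deposited the cheque at the bank on Monday".split()
sense = lesk(context, "bank", pos="n")   # pick the best WordNet noun sense
if sense is not None:
    print(sense.name(), "->", sense.definition())
```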
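
The length-bias caveat in the confidence-estimation bullet can be seen in a toy regressor: when only surface features are available, length terms dominate the prediction of post-editing effort. The feature set, the invented training scores and the Ridge model are illustrative assumptions.

```python
# Toy sentence-level confidence estimation (CE): predict post-editing
# effort from surface features only. All data here is invented; note
# how length features dominate the learned weights (the length bias).
import numpy as np
from sklearn.linear_model import Ridge

# Features per MT output: [source length, target length, length ratio]
X = np.array([[10, 11, 1.10], [25, 31, 1.24], [ 7,  7, 1.00],
              [40, 52, 1.30], [15, 14, 0.93], [30, 41, 1.37]])
# Hypothetical HTER-like effort scores in [0, 1]
y = np.array([0.15, 0.42, 0.10, 0.65, 0.20, 0.55])

model = Ridge(alpha=1.0).fit(X, y)
print("weights:", model.coef_)
print("predicted effort:", model.predict([[20, 24, 1.20]]))
```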
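
Finally, a VERTa-inspired toy: combining match scores computed at several linguistic levels with tunable weights. The levels, the weights and the F1 combination are illustrative and do not reproduce VERTa's actual formulation.

```python
# A toy layered metric: combine F1 match scores computed at several
# linguistic levels with tunable weights (VERTa-inspired, not VERTa).
def level_f1(hyp_tokens, ref_tokens):
    overlap = len(set(hyp_tokens) & set(ref_tokens))
    if overlap == 0:
        return 0.0
    p = overlap / len(set(hyp_tokens))
    r = overlap / len(set(ref_tokens))
    return 2 * p * r / (p + r)

def layered_score(hyp_levels, ref_levels, weights):
    """hyp_levels/ref_levels: dict mapping level name -> token list."""
    return sum(w * level_f1(hyp_levels[lvl], ref_levels[lvl])
               for lvl, w in weights.items())

hyp = {"lexical": ["the", "houses", "are", "big"],
       "lemma":   ["the", "house", "be", "big"]}
ref = {"lexical": ["the", "house", "is", "big"],
       "lemma":   ["the", "house", "be", "big"]}
print(layered_score(hyp, ref, {"lexical": 0.5, "lemma": 0.5}))  # 0.75
```
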
A very intensive four weeks, which included TAUS Santa Clara, Localization World Silicon Valley and these science-driven MT workshops in Barcelona: three different venues to bring the best of research to market, to learn and to develop.