Pangeanic has been officially invited as a guest speaker at the TAUS meeting in Portland, Oregon, in October 2009. “We are thrilled and honored to be...
Valencia, 1st October 2009.
Pangeanic conducted a series of tests with PangeaMT for specific language domains during late September, combining its own statistical data with data obtained from TAUS's TDA. The aim of the tests was to show that larger amounts of reliable, consistent data from TDA would help Pangeanic's own technologies improve output quality scores and open up development in new domains.
PangeaMT is based on a Moses engine with a set of heuristics applied according to the language.
Data
Three domains were selected for the test in the English-Spanish language pair (no distinction as to Lat.Am/EU), with the following number of files:
- ECH (Electronics-Computer Hardware): 800 tmx
- MBE (Marketing-Business-Economics): 76 tmx
- SOF (Software): 80 tmx
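
As an illustration of how English-Spanish segment pairs might be pulled out of TMX files like the ones counted above, here is a minimal sketch. It is not Pangeanic's actual pipeline: the function name `read_tmx_pairs` and the inline sample are hypothetical, and it assumes each translation unit carries one English and one Spanish variant. Regional codes such as `es-ES` and `es-MX` are collapsed to `es`, mirroring the "no distinction as to Lat.Am/EU" choice.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def read_tmx_pairs(tmx_text, src="en", tgt="es"):
    """Return (source, target) segment pairs from a TMX document string.

    Hypothetical helper: regional variants (es-ES, es-MX, ...) are merged
    by keeping only the primary language subtag.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # Older TMX files may use a plain "lang" attribute instead.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                segs[lang.split("-")[0]] = "".join(seg.itertext()).strip()
        if src in segs and tgt in segs:
            pairs.append((segs[src], segs[tgt]))
    return pairs

# Invented sample data, for illustration only.
SAMPLE = """<tmx version="1.4"><header/><body>
<tu><tuv xml:lang="en-US"><seg>Press the power button.</seg></tuv>
<tuv xml:lang="es-MX"><seg>Presione el botón de encendido.</seg></tuv></tu>
<tu><tuv xml:lang="en-GB"><seg>Insert the disc.</seg></tuv>
<tuv xml:lang="es-ES"><seg>Inserte el disco.</seg></tuv></tu>
</body></tmx>"""

pairs = read_tmx_pairs(SAMPLE)
```

Both sample units end up in `pairs`, regardless of whether the Spanish side is tagged as Latin American or European.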
Valencia, 27th October 2009.
| Electronics-Computer Hardware | | English | Spanish |
|---|---|---|---|
| Training | Sentences (segments) | 373803 | |
| Training | Different file pairs | 373803 | |
| Training | Words | 3934319 | 4457167 |
| Training | Vocabulary | 219789 | 234920 |
| Training | Average sentence length | 10.5 | 11.9 |
| Test | Sentences (segments) | 2000 | |
| Test | Different file pairs | 2000 | |
| Test | Common pairs with training | 18 | |
| Test | Words | 20875 | 23564 |
| Test | Perplexity (Trigrams) | 100 | 77 |

| Software | | English | Spanish |
|---|---|---|---|
| Training | Sentences (segments) | 273537 | |
| Training | Different file pairs | 273537 | |
| Training | Words | 3190340 | 3710593 |
| Training | Vocabulary | 117449 | 126331 |
| Training | Average sentence length | 11.7 | 13.6 |
| Test | Sentences (segments) | 2000 | |
| Test | Different file pairs | 2000 | |
| Test | Common pairs with training | 12 | |
| Test | Words | 22593 | 26392 |
| Test | Perplexity (Trigrams) | 115 | 72 |

| MBE | | English | Spanish |
|---|---|---|---|
| Training | Sentences (segments) | 71721 | |
| Training | Different file pairs | 71721 | |
| Training | Words | 873284 | 1006106 |
| Training | Vocabulary | 76394 | 82585 |
| Training | Average sentence length | 12.2 | 14 |
| Test | Sentences (segments) | 2000 | |
| Test | Different file pairs | 2000 | |
| Test | Common pairs with training | 2 | |
| Test | Words | 23838 | 27544 |
| Test | Perplexity (Trigrams) | 243 | 154 |
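
The corpus statistics in the tables above (segments, running words, vocabulary, average sentence length) can be reproduced with a few lines of code. The sketch below assumes plain whitespace tokenization, which may differ from the tokenization actually used for these figures; the function name `corpus_stats` and the two sample sentences are invented for illustration. The last lines re-derive the ECH training averages from the reported word and segment counts.

```python
def corpus_stats(sentences):
    """Return segment count, running words, vocabulary size and mean length.

    Assumes whitespace tokenization (an illustrative simplification).
    """
    tokenized = [s.split() for s in sentences]
    segments = len(tokenized)
    words = sum(len(tokens) for tokens in tokenized)
    vocabulary = {tok for tokens in tokenized for tok in tokens}
    return {
        "segments": segments,
        "words": words,
        "vocabulary": len(vocabulary),
        "avg_len": words / segments if segments else 0.0,
    }

# Invented sample sentences, not taken from the test corpora.
english = ["press the power button", "the power cable is included"]
stats = corpus_stats(english)

# Cross-check of the ECH training row: words divided by segments.
ech_avg_en = 3934319 / 373803   # rounds to 10.5
ech_avg_es = 4457167 / 373803   # rounds to 11.9
```

The averages in the tables are simply the word counts divided by the segment counts, so they can be verified from the other rows.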
Results
Model training + optimization: Moses+MERT
Language models: 5-grams
Number of TMX files per domain:
- ECH: 800
- MBE: 76
- SOF: 80
Translation results, English->Spanish
BLEU:
- ECH: 49.98
- MBE: 24.39
- SOF: 47.78
METEOR 0.8.3:
- ECH: 0.4312
- MBE: 0.2610
- SOF: 0.4377
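
For readers unfamiliar with how a BLEU figure such as 49.98 comes about, the following is a simplified sketch of corpus-level BLEU on a 0-100 scale. It is not the scoring tool used for the results above: it assumes a single reference per segment, whitespace tokenization and no smoothing, and the name `corpus_bleu` is only illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100), one reference per segment."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped matches: an n-gram counts at most as often as in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if min(matches) == 0:
        return 0.0  # without smoothing, any zero precision gives a zero score
    # Geometric mean of the 1- to max_n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for output shorter than the reference.
    penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * penalty * math.exp(log_precision)

# Invented examples: a perfect match and a one-word mismatch.
perfect = corpus_bleu(["pulse el botón de encendido"], ["pulse el botón de encendido"])
partial = corpus_bleu(["a b c d e"], ["a b c d f"])
```

A single wrong word in a five-word segment already lowers every n-gram precision, which is why scores near 50 on real test sets are considered high.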
The best-scoring domain is Electronics-Computer Hardware, with a BLEU score of almost 50 and a METEOR score of 0.43.
Results in Software are also very high (47.78 BLEU and 0.4377 METEOR, respectively).
This is a new domain for our development, and we have used almost exclusively TDA data plus data from one of our clients.
Marketing-Business-Economics lags behind, at around 25 in BLEU (24.39) and 0.26 in METEOR. Specific, “imaginative” marketing TMs carry a lot of weight here, and there is less content from TDA. Marketing literature may be closer to human speech. The result also highlights the need to have at least 2M available for a customized development (the client corpus was under 1M).
Nevertheless, the results surpass our expectations. A BLEU score of 50 can translate into large increases in language production. Even the 25 obtained as an initial result for marketing leaves a lot of room for improvement once more data is available.

