
27/10/2009

PangeaMT with TDA data provides up to 50% more

Valencia, 27th October 2009.
Pangeanic conducted a series of tests with PangeaMT for specific language domains during late September, combining its own statistical data with data obtained from TAUS's TDA. The aim of the tests was to prove that increased amounts of trustable, regular data from TDA would help Pangeanic's own technologies improve output quality and open up new domain developments. PangeaMT is a custom-built, Moses-based engine with a set of heuristics applied according to the language. Initially developed for internal SMT use in a TMX workflow, Pangeanic now offers SMT training services and on-demand translation services.

Data

Three domains were selected for the test in the English-Spanish language pair (no distinction as to Lat.Am/EU), with the following number of files:
- ECH (Electronics-Computer Hardware): 800 tmx
- MBE (Marketing-Business-Economics): 76 tmx
- SOF (Software): 80 tmx
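As an illustration of this kind of TMX-driven workflow, the sketch below shows one possible way to flatten a folder of TMX files into the parallel plain-text corpora a Moses training run expects. The directory name, file names and helper functions are hypothetical examples, not part of PangeaMT.

```python
# Hypothetical helper (not part of PangeaMT): flattens TMX files into the
# parallel plain-text corpora that Moses-style training expects.
import xml.etree.ElementTree as ET
from pathlib import Path

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def extract_pairs(tmx_path, src="en", tgt="es"):
    """Yield (source, target) segment pairs from one TMX file."""
    root = ET.parse(tmx_path).getroot()
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None and seg.text:
                # Keep only the base language code, e.g. "es" from "es-ES".
                segs[lang.split("-")[0]] = " ".join(seg.text.split())
        if src in segs and tgt in segs:
            yield segs[src], segs[tgt]

def build_corpus(tmx_dir, out_prefix, src="en", tgt="es"):
    """Write corpus.en / corpus.es style files from every TMX in a folder."""
    with open(f"{out_prefix}.{src}", "w", encoding="utf-8") as f_src, \
         open(f"{out_prefix}.{tgt}", "w", encoding="utf-8") as f_tgt:
        for tmx in Path(tmx_dir).glob("*.tmx"):
            for s, t in extract_pairs(tmx, src, tgt):
                f_src.write(s + "\n")
                f_tgt.write(t + "\n")

# Example call with a placeholder folder name.
build_corpus("tda_ech_tmx", "corpus.ech", "en", "es")
```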
         
Electronics-Computer Hardware (ECH)            English     Spanish
Training  Sentences (segments)                 373803
          Different file pairs                 373803
          Words                                3934319     4457167
          Vocabulary                           219789      234920
          Average sentence length              10.5        11.9
Test      Sentences (segments)                 2000
          Different file pairs                 2000
          Common pairs with training           18
          Words                                20875       23564
          Perplexity (trigrams)                100         77

Software (SOF)                                 English     Spanish
Training  Sentences (segments)                 273537
          Different file pairs                 273537
          Words                                3190340     3710593
          Vocabulary                           117449      126331
          Average sentence length              11.7        13.6
Test      Sentences (segments)                 2000
          Different file pairs                 2000
          Common pairs with training           12
          Words                                22593       26392
          Perplexity (trigrams)                115         72

Marketing-Business-Economics (MBE)             English     Spanish
Training  Sentences (segments)                 71721
          Different file pairs                 71721
          Words                                873284      1006106
          Vocabulary                           76394       82585
          Average sentence length              12.2        14.0
Test      Sentences (segments)                 2000
          Different file pairs                 2000
          Common pairs with training           2
          Words                                23838       27544
          Perplexity (trigrams)                243         154

Perplexity is a measure that gives an idea of the complexity of the task and of how similar the test set is to the training data. The higher the perplexity, the higher the difficulty.
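The trigram perplexity figures in the tables above follow the standard definition: the exponential of the average negative log-probability per word under the language model. The sketch below is a minimal illustration of that calculation with a toy add-one-smoothed trigram model; it is not the smoothing or tooling PangeaMT actually uses, and the toy corpus is made up.

```python
# Minimal sketch of trigram perplexity: exp of the average negative
# log-probability per word on a test corpus. Add-one smoothing stands in
# for whatever smoothing a real SMT toolkit would use.
import math
from collections import Counter

def train_trigrams(sentences):
    tri, bi, vocab = Counter(), Counter(), set()
    for sent in sentences:
        words = ["<s>", "<s>"] + sent.split() + ["</s>"]
        vocab.update(words)
        for i in range(2, len(words)):
            tri[tuple(words[i-2:i+1])] += 1   # trigram counts
            bi[tuple(words[i-2:i])] += 1      # context (bigram) counts
    return tri, bi, vocab

def perplexity(sentences, tri, bi, vocab):
    log_sum, n_words = 0.0, 0
    v = len(vocab)
    for sent in sentences:
        words = ["<s>", "<s>"] + sent.split() + ["</s>"]
        for i in range(2, len(words)):
            # Add-one smoothed conditional probability p(w | w-2, w-1).
            p = (tri[tuple(words[i-2:i+1])] + 1) / (bi[tuple(words[i-2:i])] + v)
            log_sum += math.log(p)
            n_words += 1
    return math.exp(-log_sum / n_words)

# Toy data: the closer the test set is to the training data, the lower the value.
train = ["the engine translates the file", "the engine loads the file"]
test = ["the engine translates the file"]
tri, bi, vocab = train_trigrams(train)
print(f"trigram perplexity: {perplexity(test, tri, bi, vocab):.1f}")
```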

Results

Model training + optimization: Moses + MERT
Language models: 5-grams
Number of TMX files per category: ECH 800, MBE 76, SOF 80

Translation results, English -> Spanish:

Domain                                   BLEU     METEOR 0.8.3
ECH (Electronics-Computer Hardware)      49.98    0.4312
MBE (Marketing-Business-Economics)       24.39    0.2610
SOF (Software)                           47.78    0.4377
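For readers who want to reproduce this kind of scoring, the sketch below computes a corpus-level BLEU score with NLTK. The file names and whitespace tokenization are placeholder assumptions; in practice the Moses multi-bleu.perl script and METEOR 0.8.3 were the tools behind the figures above.

```python
# Hedged sketch: corpus-level BLEU with NLTK. File names are placeholders.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def read_tokenized(path):
    """One segment per line, whitespace tokenization as a simplification."""
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f]

hypotheses = read_tokenized("test.output.es")      # MT output, one segment per line
references = read_tokenized("test.reference.es")   # human reference, aligned line by line

# corpus_bleu expects a list of reference lists for each hypothesis.
score = corpus_bleu(
    [[ref] for ref in references],
    hypotheses,
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {100 * score:.2f}")
```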
The best scoring domain is Electronics-Computer Hardware, with almost 50% in BLEU and 43 in METEOR.
Results in Software are also very high (47.78 and 43.77 respectively).
This is a new domain for our development, for which we used almost exclusively TDA data plus data from one of our clients.
Marketing-Business-Economics lags behind with around 25% in both metrics. Specific, “imaginative” marketing TMs weigh a lot here, and there is less content from TDA; marketing literature may also be closer to human speech. The result also highlights the need to count on at least 2M words for a customized development (the client corpus was under 1M).
Nevertheless, the results surpass our expectations. A 50% BLEU score can translate into large increases in language production. Even the 25% for marketing, as an initial result, leaves a lot of room for improvement once more data is available.

Next time you think languages, think Pangeanic