Asian Parallel Corpora: The Availability Challenge and the Imperative of Human Curation

Written by Manuel Herranz | 07/21/23

Machine translation systems have proliferated across the global digital landscape, powered primarily by advances in Natural Language Processing (NLP) and Artificial Intelligence (AI). While these systems perform remarkably well in many linguistic contexts, they often struggle with Asian languages, particularly Traditional Chinese, Chinese from Hong Kong, Chinese from Taiwan, Thai, and Vietnamese.

At the core of these struggles lies the difficulty of acquiring and effectively using parallel corpora for these languages.

"A parallel corpus is a collection of texts in two languages that correspond with each other, acting as the foundation for training machine translation models."

 

Asian context

In Chinese translation services in particular, the varieties of Chinese used in different regions differ markedly. Traditional Chinese, Chinese from Hong Kong, and Chinese from Taiwan can differ significantly from each other, not only in the characters used but also in grammar, vocabulary, and even cultural references. Recognizing and understanding these differences is crucial for the development of effective translation systems.

In addition, Thai and Vietnamese are languages for which far less parallel data is available. This makes high-quality parallel corpora harder to obtain, which is an additional challenge. Human curators can play a crucial role here by carefully reviewing, cleaning, and enriching the existing data to improve the quality and relevance of the corpus.

Overall, the path to effective machine translation systems for Asian languages is paved with numerous challenges. However, human curation and careful data maintenance can improve the quality and accuracy of translations and thus promote better, more efficient, and more inclusive communication.

Equally important is mastering these languages' writing systems. Traditional Chinese, Chinese from Hong Kong, and Chinese from Taiwan share a complex writing system, which can make matching source text to the correct translation considerably more difficult. Thai and Vietnamese also have specific characteristics that pose challenges for machine translation systems. For example, Thai is written without spaces between words, which makes word segmentation a significant challenge.
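To make the segmentation problem concrete, here is a minimal sketch of greedy longest-match segmentation, one simple dictionary-based approach. The function name `segment_longest_match` and the five-word toy vocabulary are assumptions made for illustration only. Real segmenters rely on far larger dictionaries or on statistical models, and this sketch is not taken from any particular system.

```python
def segment_longest_match(text, vocabulary):
    """Split text into tokens by greedy left-to-right longest match.

    `vocabulary` is a set of known words (a hypothetical toy list here).
    """
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a known word fits.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
        else:
            # Unknown character: emit it alone so the loop always advances.
            tokens.append(text[i])
            i += 1
    return tokens


# Toy vocabulary (assumption): "I", "love", "language", "Thai", "Thai language".
TOY_VOCABULARY = {"ฉัน", "รัก", "ภาษา", "ไทย", "ภาษาไทย"}

# The sentence "I love the Thai language" is written with no spaces at all.
print(segment_longest_match("ฉันรักภาษาไทย", TOY_VOCABULARY))
```

Because the greedy strategy always prefers the longest dictionary entry, it groups the last seven characters as one word rather than two. Whether that is the right choice depends on context, which is one reason segmentation errors creep into corpora and why human review remains useful.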

 

It is therefore crucial that we rely on human curators when creating parallel corpora for these languages. Through their knowledge of the target language and its nuances, they can ensure that the generated translations are correct in terms of both linguistic accuracy and cultural context. It is not only about providing a technically accurate translation but also about taking into account the cultural differences that play an equally important role in communication. 

Because machine translation systems are increasingly used in commercial and institutional contexts, translation quality can have a significant impact. An inaccurate or culturally inappropriate translation can lead to misunderstandings and undermine trust in such systems.

The creation of high-quality parallel corpora for Asian languages is therefore an urgent task that requires a mixture of technical expertise, linguistic knowledge, and cultural understanding.

Human curators can make a significant difference here. With their help, the data used to train machine translation systems can be carefully selected and processed. They can also help bridge the gap between the technical aspects of translation and the linguistic and cultural nuances encoded in the data.

Despite the challenges, however, there are also positive developments. In recent years, there has been considerable progress in the creation and improvement of machine translation systems for Asian languages. For example, special translation services for various Chinese dialects have been developed and improved. 

In addition, there have also been efforts to develop technologies that are specifically tailored to the peculiarities of languages such as Thai and Vietnamese. These developments show that despite the challenges, considerable progress is possible. 

Challenges in sourcing Asian parallel corpora

1. Linguistic diversity:

The most immediate challenge is the linguistic diversity within the regions themselves. For instance, in Chinese translation services, there is a significant difference between Traditional Chinese, Chinese from Hong Kong, and Chinese from Taiwan. These variants have differences in vocabulary, syntax, and even semantics, all of which must be taken into account when creating and using parallel corpora.  

2. Limited availability:

Asian languages, particularly Thai and Vietnamese, lack substantial, freely available parallel corpora, unlike languages such as English, French, or Spanish. This scarcity is a significant impediment for machine translation services striving to offer comprehensive coverage of languages. 

3. Context sensitivity:

Many Asian languages are highly context-sensitive: a word's meaning can change significantly depending on the context in which it is used. This adds another layer of difficulty to compiling and using parallel corpora for these languages.

 

The essential role of human curation

Given the complexities outlined above, human curation becomes an indispensable part of building optimal machine translation systems for these Asian languages. 

1. Quality control:

Human reviewers can ensure the quality of the parallel corpora, checking for accurate alignment and contextually appropriate translations. This process is crucial for maintaining the integrity of the data that the machine translation systems learn from. 

2. Handling linguistic nuances:

The intricacies of language, especially with context-sensitive languages like Thai and Vietnamese, often require human understanding to handle effectively. Humans can discern subtle changes in meaning and tone that current AI systems might miss. 

3. Cultural appropriateness:

A crucial aspect of translation that often gets overlooked is cultural appropriateness. Translations should not just be linguistically accurate; they should also be culturally sensitive and appropriate. Human curators, with their understanding of cultural nuances, play a pivotal role in ensuring this.

4. Data augmentation:

Human curators can also augment the existing parallel corpora by generating new translations, especially in domains where the available data is limited. 
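As a rough illustration of how automated checks can support the quality-control step above, the sketch below flags sentence pairs that look misaligned so a human reviewer can inspect them first. The function name `flag_for_review` and the ratio thresholds are assumptions. Sensible thresholds depend heavily on the language pair (character counts for Chinese and English differ widely, for example), so these values are placeholders rather than recommendations.

```python
def flag_for_review(pairs, min_ratio=0.5, max_ratio=2.0):
    """Return (index, reason) for sentence pairs a human should double-check.

    `pairs` is a list of (source, target) strings. The thresholds are
    illustrative assumptions and should be tuned per language pair.
    """
    flagged = []
    for index, (source, target) in enumerate(pairs):
        source, target = source.strip(), target.strip()
        if not source or not target:
            flagged.append((index, "empty side"))
            continue
        if source == target:
            # Identical text on both sides often means nothing was translated.
            flagged.append((index, "untranslated copy"))
            continue
        ratio = len(target) / len(source)
        if ratio < min_ratio or ratio > max_ratio:
            # A large length mismatch is a common symptom of bad alignment.
            flagged.append((index, "length ratio"))
    return flagged


sample_pairs = [
    ("Hello", "Xin chào"),
    ("Thank you", ""),
    ("Thank you very much for your help today", "Cảm ơn"),
    ("OK", "OK"),
]
for index, reason in flag_for_review(sample_pairs):
    print(index, reason)
```

A check like this only prioritizes the reviewer's attention. It cannot judge whether a plausible-looking pair is contextually or culturally appropriate, which is the part that still requires a human.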

 

Freely available Asian corpora  

  1. United Nations parallel corpus: The United Nations has a multilingual corpus with documents in the six official languages of the UN, including Chinese.

  2. Chinese-English parallel corpora: Available on the Linguistic Data Consortium's website, this data set includes bilingual text for translation and language pair training.

  3. OpenSubtitles: A collection of subtitle files from movies and TV shows, available in many languages, including Traditional Chinese, Cantonese, and Thai.

  4. Global Voices: Global Voices is a community of bloggers and translators around the world that translates its articles into multiple languages, providing a valuable resource for parallel text.

  5. TED Talks Transcripts: TED Talks are translated into many languages, including Vietnamese and Thai, providing a valuable source of parallel text.

  6. Wikipedia: Though not strictly parallel, it can provide a considerable amount of translated content for languages that have a significant presence on the platform.

  7. Tatoeba: A collaborative, multilingual collection of example sentences and their translations in numerous languages.

  8. OPUS: An open collection of parallel corpora gathered from the web, covering hundreds of languages, including Traditional Chinese, Chinese from Hong Kong, Chinese from Taiwan, Thai, and Vietnamese.

  9. The Asian Language Treebank (ALT): The ALT project provides parallel corpora in Asian languages, including Vietnamese and Thai.

 

Remember that even though these resources are free, some of them may require permission for commercial use. Also, the quality of parallel corpora varies, and the data may require additional cleaning or preprocessing before use.
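As one example of what such preprocessing might look like, the sketch below normalizes Unicode, collapses stray whitespace, and drops empty or duplicate sentence pairs. The function name `clean_pairs` and the choice of steps are assumptions for illustration. Actual pipelines vary by corpus and usually include more steps.

```python
import unicodedata


def clean_pairs(pairs):
    """Normalize and deduplicate (source, target) sentence pairs.

    A minimal, illustrative pipeline; the steps chosen here are assumptions.
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        # NFC gives accented letters (common in Vietnamese) a single
        # canonical form, so visually identical strings compare as equal.
        source = " ".join(unicodedata.normalize("NFC", source).split())
        target = " ".join(unicodedata.normalize("NFC", target).split())
        if not source or not target:
            continue  # drop pairs with an empty side
        key = (source, target)
        if key in seen:
            continue  # drop exact duplicates after normalization
        seen.add(key)
        cleaned.append(key)
    return cleaned


raw_pairs = [
    ("Hello  world", " Xin chào "),
    ("Hello world", "Xin chào"),
    ("", "Xin chào"),
]
print(clean_pairs(raw_pairs))
```

Deduplicating after normalization matters because web-collected corpora often contain the same sentence many times with small whitespace or encoding differences.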

In summary, while the acquisition of parallel corpora for Asian languages—particularly Traditional Chinese, Chinese from Hong Kong, Chinese from Taiwan, Thai, and Vietnamese—presents considerable challenges, the blend of sophisticated AI technology and indispensable human curation offers a promising solution.

This synergy can ensure that machine translation systems are not just linguistically accurate but culturally nuanced and contextually appropriate. As such, even as we advance towards increasingly automated systems, the human touch remains a crucial ingredient for optimizing machine translation services.