

26/09/2025

An AI language data crisis? Quality data collection is the only path forward for vulnerable languages


Recent research reveals a devastating “doom spiral” threatening the world’s smaller languages, but proven solutions exist

A crisis is unfolding across the digital landscape that threatens to accelerate the extinction of the world’s most vulnerable languages. Recent investigations by MIT Technology Review have revealed what researchers call a “linguistic doom loop”: a devastating cycle in which machine translation tools flood smaller-language resources with error-plagued content, which then corrupts AI training data, leading to even worse translations and further degradation of language representation online. This comes as no surprise to those of us who use the buzzword “AI” with some reluctance but were trained in statistical machine translation, pattern recognition (one of our university partners is PRHLT, “Pattern Recognition and Human Language Technologies”; the name itself signals the link between the two fields), and early neural machine translation. We called it “Google polluting its own waters” when Google Translate was trained on crawled web pages that had previously been translated by Google Translate itself, without human review. For a time, possibly after such a retraining, the quality of Google Translate visibly declined.

The scale of this crisis is staggering. Volunteers working on African language Wikipedia editions estimate that between 40% and 60% of articles are uncorrected machine translations. When researchers audited the Inuktitut Wikipedia, they found that more than two-thirds of pages containing substantial content featured portions created through automated translation. For 27 under-resourced languages, Wikipedia represents the sole easily accessible source of online linguistic data—making the quality of this content a matter of digital life or death for these languages.

When bad data becomes dangerous data

The consequences extend far beyond digital inconvenience. Consider Fulfulde, a language used mainly by pastoralists and farmers across the Sahel region of Africa. When agricultural information is mistranslated into Fulfulde, it can “easily harm” farmers who rely on accurate seasonal guidance for their livelihoods. Yet current machine translation systems demonstrate shocking inaccuracy: Google Translate incorrectly translates the Fulfulde (Fulani) word for January as June, while ChatGPT suggests it means August or September. The same systems suggest the Fulfulde word for “harvest” means “fever” or “well-being”—errors that could prove catastrophic when farmers need precise agricultural timing information.

This connection to Fulfulde (Fulani) is particularly meaningful for Pangeanic, whose leadership has demonstrated a commitment to supporting African communities through initiatives such as the Malima Project in northern Cameroon. This work with the Kapsiki people in the village of Gouria has provided direct insight into how language technology can either empower or harm vulnerable communities, reinforcing the critical importance of accurate linguistic representation in digital systems.

Manuel Herranz of Pangeanic in Gouria, Cameroon, as part of the Malima Project

The problem compounds when AI systems trained on this corrupted data begin producing error-strewn learning materials. AI-generated phrasebooks for languages like Inuktitut and Cree have appeared on Amazon, containing what linguists describe as “complete nonsense.” Rather than democratizing access to minority languages, these systems are creating an “ever-expanding minefield” for learners and speakers to navigate.

Breaking the cycle: A proven alternative approach

While this crisis deepens across the industry, some organizations have demonstrated that a different path is possible. At Pangeanic, two decades of experience in language data collection have yielded a methodology that directly addresses the root causes of the doom spiral: poor data quality, lack of native speaker involvement, and absence of cultural context.

The contrast between automated approaches and human-centered data collection becomes clear when examining success stories. Pangeanic’s collaboration with the Barcelona Supercomputing Center (BSC) exemplifies this alternative approach. The partnership involved creating customized datasets containing parallel corpora (bilingual segments) classified by domain and style, together with comprehensive data annotation and related NLP tasks such as bias detection and hate speech detection, all of which were validated by human experts rather than automated systems.

Read More: Barcelona Supercomputer Use Case

This collaboration contributed to the development of the first family of Catalan LLMs, an industry-leading multilingual Spanish-Catalan-English LLM (covering other European languages as well), and an LLM for all the regional languages of Spain, which significantly outperforms competing Spanish LLMs. The success demonstrates how quality-focused data collection can elevate languages from vulnerability to strength. As Maite Melero from BSC noted at a recent TAUS Summit in Dublin, strategic community engagement through regional channels is paramount: BSC made a call through the local Catalan-language TV channel and the response was overwhelming, with tens of thousands of hours of spoken Catalan and millions of written texts donated. These resources fed the Catalan LLM, and as a result Catalan is no longer considered an endangered language.

Read More: How Many Languages Are Spoken in Spain?

The model: Scaling quality through collaboration

This success story is being replicated across Europe through coordinated initiatives that prioritize quality over quantity. The European Language Data Space (LDS) represents a systematic approach to creating sustainable language data ecosystems. At the upcoming LDS workshop in Barcelona on October 14, 2025, Manuel Herranz from Pangeanic will join other European language technology leaders to discuss “Language data production, management, and market development: overcoming obstacles”, addressing precisely the challenges that have created the current crisis.

The workshop agenda reflects the comprehensive approach needed: sessions on the strategic role of data in language AI, the European Commission’s Digital Europe Programme supporting the Common European Language Data Space, and practical discussions on developing markets for language data and services. This represents the kind of coordinated effort required to establish alternatives to the failing automated approaches.

European initiatives, such as EuroLLM and OpenEuroLLM, demonstrate how this coordinated approach can be scaled. EuroLLM aims to build open-source European Large Language Models supporting all 24 official European languages plus strategically important languages, while OpenEuroLLM brings together 20 European research institutions, companies, and EuroHPC centers to develop next-generation open-source multilingual models. These projects emphasize transparent development, cultural sensitivity, and compliance with European AI regulations, all elements that are currently lacking in the automated systems contributing to the current crisis.

Technical solutions to a technical problem

The doom spiral exists because current approaches treat language data as a commodity that can be automatically generated and processed. Pangeanic’s methodology recognizes language data as a cultural artifact requiring specialized expertise. The company’s ECO platform recruits native speakers to write on specific topics, while language teams curate non-crawlable data and clean freely available sources through processes that preserve linguistic authenticity and cultural nuances.

Quality control mechanisms distinguish this approach from automated alternatives. The PECAT tool provides human-in-the-loop validation, ensuring annotated data meets the standards necessary for effective AI training. This contrasts sharply with the automated translation dumps that have corrupted smaller language resources across the internet.
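The human-in-the-loop principle described above can be sketched as a simple admission gate over candidate training segments. This is a minimal illustration only, not Pangeanic’s actual ECO or PECAT implementation; the `Segment` fields, origin labels, and placeholder strings are all invented for the example:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """A hypothetical bilingual training segment with provenance metadata."""
    source: str
    target: str
    origin: str                   # invented labels: "native_author" or "web_crawl"
    human_validated: bool = False # set True once a native speaker approves it

def accept_for_training(seg: Segment) -> bool:
    # Text written by recruited native speakers is admitted directly;
    # crawled text is only admitted after human review, so unreviewed
    # machine-translation dumps never reach the training set.
    if seg.origin == "native_author":
        return True
    return seg.human_validated

corpus = [
    Segment("src-1", "tgt-1", origin="native_author"),
    Segment("src-2", "tgt-2", origin="web_crawl"),                       # unreviewed MT dump
    Segment("src-3", "tgt-3", origin="web_crawl", human_validated=True), # reviewed and approved
]

training_set = [seg for seg in corpus if accept_for_training(seg)]
```

The design point is that provenance and validation status travel with every segment, so the gate is a property of the data rather than a one-off cleaning pass.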

Data ownership models also differ fundamentally. Pangeanic provides customers with full ownership and copyright of datasets, whether for monolingual datasets or speech data and transcriptions for machine learning training. This ensures communities and organizations maintain control over their linguistic heritage while building necessary technological infrastructure.

The company follows processes ensuring Ethical AI principles are built into every step, with these commitments passed on to client products. This ethical foundation distinguishes human-centered approaches from purely extractive models that have contributed to the current crisis.

Evidence-Based Results Across Language Families

Pangeanic’s track record demonstrates that quality-focused approaches work across diverse linguistic contexts. Beyond the Catalan success with BSC, the company has developed datasets for languages across different continents, recognizing that each presents unique challenges that require specialized approaches.

Work with Spanish language models includes the provision of data for next-generation Large Language Models containing input from the National Library and multiple sources. Knowledge extraction projects, such as building models for major financial institutions to extract client and contract information at scale, demonstrate how quality data enables practical applications. Machine translation leadership, as shown through projects like NTEU (Neural Translation for the EU), has enabled the creation of custom translation models for European public administrations by utilizing big data repositories and curated bilingual data collections.

The company’s anonymization work, which led the European MAPA project, involved data labeling and annotation to create the first LLM-based open-source personal data anonymizer, demonstrating how quality data collection enables compliance with privacy regulations while advancing technological capabilities.

The Path Forward: Industry-Wide Transformation

The linguistic doom spiral represents a crossroads for the language technology industry. The current trajectory—automated data harvesting, machine translation dumping, and extraction-focused business models—leads inevitably to the degradation and eventual digital extinction of vulnerable languages.

The alternative path requires recognizing language data as cultural heritage, requiring specialized expertise, community engagement, and ethical frameworks. This approach requires a higher initial investment but produces sustainable results that strengthen, rather than weaken, linguistic diversity.

European initiatives demonstrate how this transformation can scale through coordinated investment, shared infrastructure, and collaborative development models. The European Language Data Space provides the ecosystem framework, while projects such as EuroLLM and OpenEuroLLM demonstrate how quality-focused approaches can compete with and surpass automated alternatives.

The urgent need for industry leadership

With UNESCO estimating that a language goes extinct every two weeks, the window for action continues to narrow. The doom spiral accelerates as more corrupted data enters training datasets, making recovery increasingly difficult for affected languages.

Organizations working with smaller and under-resourced languages face a critical choice: continue relying on automated systems that perpetuate the crisis, or partner with specialized providers who understand both the technical requirements of modern AI systems and the cultural sensitivity required for authentic language representation.

The success stories emerging from Europe, from Catalan’s transformation through quality data collection to the coordinated initiatives now scaling across the continent, prove that alternatives exist. The question facing the global language technology community is whether these proven approaches will spread quickly enough to prevent the digital extinction of hundreds of vulnerable languages.

For languages like Fulfulde, spoken by pastoralists and farmers who need accurate information for their livelihoods, the stakes could not be higher. The choice between automated approximations and expert-crafted authentic datasets is ultimately a choice between technological colonialism and digital sovereignty for the world’s linguistic minorities.

The solution exists. The expertise is available. The only question remaining is whether the industry will choose quality over convenience before it’s too late.

Pangeanic combines expertise as an AI developer with skills as a translation services company, creating unique capabilities for speech datasets, text annotation, data classification, and content moderation across vulnerable languages. Contact us to learn how quality data collection can break the doom spiral for the languages that matter to your community and mission.