14/06/2026

AI Data Operations, Small Language Models and the Cost of Renting Cognition

The next phase of enterprise AI will be decided less by access to generic models and more by who controls the data, the cost, the deployment, and the judgment behind them. In a recent LT-Innovate conversation with its Vice Chairman, Bruno Herrmann, Pangeanic CEO Manuel Herranz explained why language technology has moved from translation automation to governed language intelligence, and why the organizations that win the next phase will own their data strategy rather than rent it by the token.

The full conversation is also available on YouTube.

From a translation company to multilingual AI infrastructure

Pangeanic did not begin as an AI company. It began as a language service provider, then built statistical machine translation from local and European research projects, and from 2017 onward expanded into the full range of natural language technologies: machine translation that still runs on-premises, sentiment analysis, anonymization, evaluation and the data infrastructure beneath all of it.

That history is important because the experience of doing language work the hard way, before large language models made fluent text cheap, is exactly what makes a modern AI pipeline reliable rather than merely impressive. Translation forced the industry to solve practical problems that enterprise AI is rediscovering under new names: domain adaptation, terminology control, human evaluation, quality thresholds, multilingual coverage and measurable output.

The hidden cost of renting your cognition

The most important point in the conversation was also the least discussed in the wider market: the difference between buying capability and renting it. When an organization subcontracts its language intelligence to a model billed by the token, it quietly shifts from capital expenditure to open-ended operational expenditure. You can calculate the return on ten software licenses or ten GPUs. You cannot easily calculate the return on a cost that scales with every question your employees ask.

The deeper risk is not the invoice. It is that an organization billed by the token is, in effect, subcontracting part of its own decision-making to a third party. For a company, a ministry or a public institution that needs to remain the master of its own affairs, that dependency is a strategic exposure, not a convenience.

This is the part of total cost of ownership that adoption-led AI projects most often miss. They budget for infrastructure and data, then watch token costs climb as usage spreads, because the cost was never linked to the rate of adoption in the first place.
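The link between adoption and cost can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and every figure (queries per user, tokens per query, price per million tokens, the fixed monthly cost of self-hosting) is an assumption chosen to keep the numbers round, not a vendor price or a figure from the conversation.

```python
import math

# All inputs are illustrative assumptions, not real prices or usage data.

def monthly_token_cost(users, queries_per_user, tokens_per_query,
                       price_per_million_tokens):
    """Metered cost for one month: grows linearly with the number of users."""
    total_tokens = users * queries_per_user * tokens_per_query
    return total_tokens / 1_000_000 * price_per_million_tokens


def break_even_users(fixed_monthly_cost, queries_per_user, tokens_per_query,
                     price_per_million_tokens):
    """Smallest user count at which metered cost reaches a fixed monthly cost."""
    cost_per_user = monthly_token_cost(
        1, queries_per_user, tokens_per_query, price_per_million_tokens
    )
    return math.ceil(fixed_monthly_cost / cost_per_user)


if __name__ == "__main__":
    # Assumed usage profile: 400 queries per user per month, 2,500 tokens each,
    # billed at 10 currency units per million tokens.
    profile = dict(queries_per_user=400, tokens_per_query=2500,
                   price_per_million_tokens=10)

    for users in (100, 500, 2000):
        print(users, "users:", monthly_token_cost(users, **profile))

    # Assumed fixed monthly cost of a self-hosted alternative: 5,000 units.
    print("break-even users:", break_even_users(5000, **profile))
```

Under these assumptions the metered bill is simply a function of headcount and usage, which is why a budget that fixes infrastructure and data costs but leaves adoption out of the model will be overtaken as usage spreads.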

AI Data Operations: data strategy before model strategy

Everything above starts and ends with data. Pangeanic calls this discipline AI Data Operations: the work of sourcing, preparing, annotating, evaluating, governing and continuously improving the data that AI systems actually depend on.

Having data is not the same as being ready for AI. The real questions are whether the data is relevant, trustworthy, accurate, representative, properly structured and legally usable. Most organizations have data. Far fewer have a data strategy that can support AI in production.

This is also where the commercial paths diverge. Some teams need datasets for AI they can license quickly through off-the-shelf training data. Others need bespoke data collection when no existing dataset matches the language, domain, format, consent, annotation or compliance requirement.

Pangeanic has applied this discipline across public data spaces and national language efforts, including data work for the language models built at the Barcelona Supercomputing Center, and it operates secure document translation for the Spanish Tax Agency. The point is consistent: data strategy comes before model strategy, not after it.

Small, task-specific language models

The market spent two years assuming bigger was always better. The evidence now points in a more practical direction. Gartner predicts that by 2027 organizations will use small, task-specific AI models at least three times more than general-purpose large language models, because accuracy on real business tasks comes from specialization, not scale alone.

Pangeanic has been here before. A single machine translation engine is, in effect, a task-specific small language model, and the company has been building those for years. The next step is framing these models inside a knowledge graph so they can retrieve, interpret and connect data intelligently.

This is the same engineering lineage behind Deep Adaptive AI Translation and Machine Translation Quality Estimation: specialized models, tuned to a task, evaluated against a standard and deployed in workflows where output quality can be measured.

For organizations that need private deployment, domain control and lower inference costs, small language model customization offers a practical path between generic AI access and fully governed operational intelligence.

Where human judgment still decides

If raw information processing has become abundant at scale, the obvious question is what humans are still for. The answer is the distinction between being intelligent and being clever. A machine can compile a report, draft code or produce fluent language faster than any person. What it cannot do is exercise judgment, hold a network of relationships, or take responsibility for a decision.

The value of AI is not that journalists write more articles or doctors write more prescriptions. It is that capable people make better decisions, with the deep checking and the heavy lifting delegated to the machine and the clever execution kept human.

This is why human feedback, expert evaluation and model alignment sit at the center of a serious data pipeline rather than at its edge. In production AI, human judgment cannot be a decorative review layer; it has to be part of the control system.

The next eighteen months: knowledge graphs, secure deployment and sovereign AI

Looking ahead, the priority is execution on small, task-specific models framed within knowledge graphs (where necessary), deployed where the organization keeps control. The advice to any organization is to protect its independence: take a task-specific small model, host it on-premises or in a secure private cloud, and build the intelligence layer over it with the right indexing, retrieval and evaluation workflow.
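As a rough illustration of what "indexing, retrieval and evaluation" can mean in practice, here is a minimal sketch that runs entirely locally. It is a toy under stated assumptions, not Pangeanic's method: a plain keyword index stands in for whatever indexing a real deployment would use, no language model is involved, and the documents, queries and function names are all hypothetical.

```python
from collections import defaultdict

PUNCTUATION = ".,;:!?"


def tokenize(text):
    """Lowercase the text and strip basic punctuation from each word."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def build_index(docs):
    """Indexing: map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def retrieve(index, query, k=1):
    """Retrieval: rank documents by how many distinct query tokens they share."""
    scores = defaultdict(int)
    for token in set(tokenize(query)):
        for doc_id in index.get(token, ()):
            scores[doc_id] += 1
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked[:k]


def hit_rate(index, labelled_queries, k=1):
    """Evaluation: share of queries whose expected document is in the top k."""
    hits = sum(
        1 for query, expected in labelled_queries
        if expected in retrieve(index, query, k)
    )
    return hits / len(labelled_queries)


# Hypothetical internal documents and a small human-labelled test set.
DOCS = {
    "hr-01": "Annual leave requests must be approved by the line manager.",
    "fin-07": "Invoices above the approval threshold require two signatures.",
    "it-03": "Password resets are handled by the service desk.",
}
LABELLED_QUERIES = [
    ("Who approves annual leave?", "hr-01"),
    ("How many signatures do invoices require?", "fin-07"),
    ("Who handles password resets?", "it-03"),
]

if __name__ == "__main__":
    index = build_index(DOCS)
    print(retrieve(index, "Who approves annual leave?"))
    print(hit_rate(index, LABELLED_QUERIES))
```

The design point is the third function rather than the first two: the labelled query set is written by people who know the domain, so the quality of the retrieval layer is measured against human judgment instead of being assumed.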

The strongest opportunities lie in governed data linking inside regulated environments, where connecting separate, authorized systems answers questions that isolated databases never could. In healthcare, education, public administration and enterprise knowledge environments, secure and localized AI is not a convenience feature. It is a way to deliver services that depend on data being linked safely, under the organization’s own governance.

That is the practical entry point for teams evaluating custom datasets, model alignment, private AI deployment, MTQE, adaptive translation or multilingual RAG. If your organization is weighing how to build AI it can actually control, that conversation starts with the data.
