AI's Language Gap: Why Smaller Languages Struggle to Be Heard in the Age of Large Models

When Tenzin Norbo founded a company to build the world's first Tibetan large language model, he faced a question that large tech companies rarely ask: why bother? Mainstream AI models already produce Tibetan text, answer questions in the language, and handle basic translation. From a purely functional standpoint, that seems sufficient.
Norbo's answer, laid out in a 21 May 2026 interview with CGTN, points to something the efficiency calculus of mainstream AI development obscures: languages are not interchangeable components in a translation pipeline. A model trained primarily on Mandarin or English will handle Tibetan as a secondary concern at best, missing tonal distinctions, cultural context, and the specific literary conventions of a language with roots going back centuries.
The tension Norbo articulates is not unique to Tibetan. It is playing out across dozens of languages that, by the raw metrics AI developers care about, do not justify dedicated investment.
The Training Data Problem
Large language models are, at their foundation, statistical summaries of written text. The quality and breadth of that text determine what a model can do. Languages with abundant digital writing, such as English, Mandarin, Spanish, and French, yield training corpora measured in hundreds of billions or even trillions of tokens. Languages with smaller speaker populations and less digitised literary heritage yield far less.
This creates a compounding disadvantage. A language with fewer digital texts trains weaker models; weaker models are less useful; reduced utility means fewer speakers adopt them for digital tasks; reduced adoption generates less new digital content, closing the loop. For many of the world's approximately 7,000 languages, this cycle is already well advanced.
The structural incentives work against breaking this cycle. AI developers are commercial enterprises or well-funded research groups that answer to investors, grant-makers, or national competitiveness agendas. None of those pressures naturally favours languages spoken by communities that lack purchasing power or geopolitical weight.
What Dedicated Models Can and Cannot Do
The case for building a Tibetan-specific model rests on capabilities that general-purpose systems struggle to deliver: accurate handling of Tibetan script conventions, nuanced engagement with religious and literary texts, and conversational interfaces that reflect how Tibetan is actually spoken across different regional dialects.
These are legitimate technical claims. Frontier models can recognise Tibetan characters and produce passable translations, but they frequently falter on idiomatic expressions, fail to capture register distinctions between formal and colloquial usage, and lack the contextual knowledge that comes from immersion in a specific cultural environment.
A purpose-built model, trained on curated Tibetan corpora and evaluated against benchmarks designed for Tibetan speakers, can in principle address these gaps. Norbo's company is not the only outfit attempting this. Similar efforts are underway for indigenous languages across Southeast Asia, Sub-Saharan Africa, and Latin America, typically led by academic institutions, non-profit organisations, or diaspora communities rather than major tech firms.
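One of the script conventions at issue can be shown in a few lines. Written Tibetan separates syllables with a mark called the tsheg (U+0F0B) rather than with spaces, so tooling that assumes space-delimited words sees a whole phrase as a single unit. The sketch below is illustrative only: the sample phrase and variable names are assumptions chosen for this example, not anything drawn from Norbo's system or from a real Tibetan processing pipeline.

```python
# Illustrative sketch only; the phrase and names here are assumptions.
TSHEG = "\u0f0b"  # Tibetan intersyllabic mark (tsheg)

# "bod kyi skad yig" (roughly "the Tibetan language"), written as explicit
# code points so the example does not depend on font support.
syllables_expected = [
    "\u0f56\u0f7c\u0f51",  # bod
    "\u0f40\u0fb1\u0f72",  # kyi
    "\u0f66\u0f90\u0f51",  # skad
    "\u0f61\u0f72\u0f42",  # yig
]
phrase = TSHEG.join(syllables_expected)

whitespace_tokens = phrase.split()        # naive, space-based splitting
tsheg_tokens = phrase.split(TSHEG)        # syllable-level splitting
utf8_bytes = len(phrase.encode("utf-8"))  # each Tibetan code point takes 3 bytes

print(len(whitespace_tokens), len(tsheg_tokens), len(phrase), utf8_bytes)
# prints: 1 4 15 45
```

Space-based splitting returns the phrase as one token, while splitting on the tsheg recovers four syllables. Syllables are still not words, so real word segmentation needs more than this. The byte count also hints at why tokenisers tuned on Latin-script text tend to spend more tokens on the same amount of Tibetan.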
The Sustainability Question
Technical viability, however, does not solve the economics. Building a language model requires annotated training data, compute infrastructure, and ongoing maintenance. Keeping it current demands continuous data collection and model updating. All of this costs money.
For languages spoken by communities with limited commercial leverage, the funding models tend to be fragile: short-term grants, institutional goodwill, volunteer labour. None of these provides the stable foundation that a language technology ecosystem needs in order to grow.
International bodies, academic linguistics programmes, and cultural foundations have attempted to fill this gap, with mixed results. Open-source model releases have helped — a capable base model can be fine-tuned for a minority language at a fraction of the cost of training from scratch. But the gap between "technically possible" and "reliably available" remains wide for most of the world's smaller languages.
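The "fraction of the cost" claim can be made concrete with a back-of-envelope calculation. A widely used rule of thumb estimates training compute as roughly 6 × parameters × tokens. The sketch below applies it to purely hypothetical figures: the model size and token counts are assumptions chosen for illustration, not numbers reported by CGTN or by any Tibetan-language project.

```python
def training_flops(params: float, tokens: float) -> float:
    """Rough training compute via the common 6 * N * D rule of thumb."""
    return 6.0 * params * tokens

# All three figures below are hypothetical assumptions for illustration.
BASE_PARAMS = 7e9        # parameters in an open base model
PRETRAIN_TOKENS = 2e12   # tokens used to train that base model from scratch
FINETUNE_TOKENS = 1e9    # tokens of curated minority-language text

scratch = training_flops(BASE_PARAMS, PRETRAIN_TOKENS)
finetune = training_flops(BASE_PARAMS, FINETUNE_TOKENS)
ratio = finetune / scratch

print(f"fine-tuning needs about {ratio:.2%} of the from-scratch compute")
# prints: fine-tuning needs about 0.05% of the from-scratch compute
```

Under these assumptions the compute bill shrinks by a factor of about two thousand. The estimate covers compute only. It leaves out data collection, annotation, evaluation, and the ongoing maintenance described above, which is where much of the sustainability problem lies.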
What Gets Left Behind
The consequence of this structural gap is not merely technological. As AI systems become embedded in education, government services, healthcare, and employment markets, the languages that lack capable AI support risk becoming less functional in digital environments. Younger speakers may find themselves navigating official systems in languages they do not natively speak, accelerating assimilation pressures that already exist.
Whether dedicated models like the one Norbo is building can reverse this trajectory is an open question. The honest answer is that a single company, however well-intentioned, cannot solve a problem rooted in global investment patterns and the incentive structures of the AI industry. What such efforts can do is demonstrate that better tools are possible and keep a language visible in a technology landscape that otherwise renders it invisible.
The alternative — relying on general-purpose models that treat Tibetan as a corner case — is not neutral. It is a choice, made by the economics of AI development, to leave something behind.
This publication's coverage of language technology and cultural preservation aims to foreground the structural conditions that shape which languages thrive in digital environments. CGTN's reporting provided the primary basis for this piece.
Wire provenance
This editorial synthesis draws on the following public wire/social posts:
- https://x.com/cgtnofficial/status/1921929577816096873
- https://x.com/cgtnofficial/status/1921895745589588289