45 Million Speakers • 21 Languages • 8 States

Your Language.
Your Legacy.

AI doesn't understand Northeast India's languages. You can change that. Contribute in 2 minutes.

21 Languages • 2 min to contribute

Quick Contribute

No login required • Open data • Contributor credited

Why This Matters

45 million people speak Northeast India's languages. But AI doesn't understand them. Yet.

Invisible Languages

Google Translate doesn't support Khasi. Siri can't understand Garo. ChatGPT can't read Mizo. Without data, these languages are invisible to technology.

The Data Gap

English: ~trillions of tokens
NE languages: ~millions of tokens

Big tech ignores low-resource languages. But WE can build the datasets needed.

Community Power

You don't need to be a linguist. Every sentence you write and every voice clip you record helps train smarter AI for YOUR linguistic heritage.

What We're Building

Your contributions power real AI applications for education, government, and daily life

Translation Models

Bidirectional translation between English and indigenous languages for schools and government.

Voice Recognition

Speech-to-text systems that understand regional accents and dialects.

Open Datasets

High-quality, freely available corpora for researchers and developers worldwide.

Languages We're Preserving

21 languages across 8 Northeast states

Adi • Text • Tibeto-Burman
Angami • Text • Tibeto-Burman
Ao • Text • Tibeto-Burman
Assamese (অসমীয়া) • Text + Voice • Indo-Aryan
Bhutia • Text • Tibeto-Burman
Bodo • Text • Tibeto-Burman
Garo (Garo) • Text + Voice • Tibeto-Burman
Hmar • Text • Tibeto-Burman
Karbi • Text • Tibeto-Burman
Khasi (Ka Ktien Khasi) • Text + Voice • Austroasiatic
Kokborok • Text • Tibeto-Burman
Lepcha • Text • Tibeto-Burman
Limbu • Text • Tibeto-Burman
Meitei (Manipuri) (ꯃꯩꯇꯩꯂꯣꯟ) • Text • Tibeto-Burman
Mizo (Mizo ṭawng) • Text • Tibeto-Burman
Nagamese • Text • Indo-Aryan Creole
Nyishi • Text • Tibeto-Burman
Others • Text • Mixed/Unclassified
Pnar • Text • Austroasiatic
Tangkhul • Text • Tibeto-Burman
Thadou • Text • Tibeto-Burman
Wancho • Text • Tibeto-Burman
War • Text • Austroasiatic

Part of the Community

Contributing to linguistic research, dataset creation, and endangered language documentation