Our Foundation Models

We create and open-source foundational models designed to support research, education, and practical applications.

Kren-M

Generative Model

Kren-M is a bilingual (Khasi–English) language model developed through extensive continued pre-training and supervised fine-tuning of Gemma 2 (2B). It is designed specifically for Khasi, a low-resource Austroasiatic language spoken in Meghalaya, Northeast India, while retaining the English fluency of its base model.

~3B params

NE-BERT

Encoder Model

NE-BERT is a state-of-the-art open-source encoder model for nine Northeast Indian languages, built on ModernBERT for superior speed and accuracy in low-resource NLP.

~149M params

NE-LID

Language Identification Model

NE-LID is a high-accuracy language identification model for Northeast Indian languages. Our benchmark study shows that character-level models outperform generic LID systems on low-resource, script-diverse text.

fastText
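Why character-level features help can be sketched in a few lines of plain Python: build a character n-gram profile per language and match new text by cosine similarity. The sample sentences and labels below are toy assumptions for illustration only; the production NE-LID model is a fastText classifier trained on a far larger corpus.

```python
from collections import Counter
import math

def char_ngrams(text, n=3):
    """Extract overlapping character n-grams, with edge padding."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy per-language profiles from single sample sentences (illustrative only).
samples = {
    "eng": "the quick brown fox jumps over the lazy dog",
    "kha": "nga leit sha iew ka sngi mynta",  # assumed Khasi-like text, for contrast
}
profiles = {lang: char_ngrams(text) for lang, text in samples.items()}

def identify(text):
    """Return the language whose n-gram profile best matches the input."""
    return max(profiles, key=lambda lang: cosine(char_ngrams(text), profiles[lang]))

print(identify("the dog jumps"))
```

Because the features are character n-grams rather than whole words, the classifier degrades gracefully on unseen vocabulary and on languages sharing a script, which is the regime these benchmarks target.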

Northeast India NLP

Multilingual NLP models and tokenizers for underrepresented languages of Northeast India, built for civic use and reproducibility.
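Subword tokenizers for low-resource languages are commonly trained with byte-pair encoding (BPE), whose core step is merging the most frequent adjacent symbol pair. A minimal sketch of that loop, on a toy character-split corpus (illustrative assumptions, not our actual training pipeline):

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Fuse every occurrence of the pair into a single new symbol."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in words.items()}

# Toy corpus: words pre-split into characters, with corpus frequencies.
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

learned = []
for _ in range(3):  # learn three merges
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    learned.append(pair)

print(learned)
```

Each learned merge becomes a vocabulary entry, so frequent morphemes end up as single tokens even when no word list for the language exists.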

Assamese RoBERTa

Language Model

Assamese RoBERTa is a custom monolingual RoBERTa-Base model pre-trained from scratch on the Assamese language.

~110M params

Meitei RoBERTa

Language Model

The Meitei-RoBERTa-Base model is a high-performance, monolingual transformer encoder pre-trained from scratch on the entire Meitei Monolingual Corpus.

~110M params

KhasiBERT

Language Model

Foundational Khasi model trained on ~3.6M sentences. Useful for translation, summarization, and low-resource NLP research.

~110M params  Encoder

Mizo-RoBERTa

Language Model

Mizo-RoBERTa is a transformer-based language model for Mizo. Built on the RoBERTa architecture and trained on a large-scale curated corpus, this model provides state-of-the-art language understanding capabilities for Mizo NLP applications.

~110M params

NyishiBERT

Language Model

NyishiBERT is a monolingual masked language model for Nyishi (njz-Latn), a Sino-Tibetan language spoken in Northeast India.

~110M params
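A masked language model learns to predict a hidden token from its surrounding context. As a rough intuition only (these models are transformers, not count tables, and the English toy corpus below is a readability stand-in), here is a context-matching version of the fill-mask task:

```python
from collections import Counter

# Toy monolingual corpus standing in for training text.
corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
    "the cat sat near the door",
]

# Count (left_word, word, right_word) contexts seen in the corpus.
contexts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    for left, word, right in zip(tokens, tokens[1:], tokens[2:]):
        contexts[(left, word, right)] += 1

def fill_mask(left, right, vocab):
    """Pick the candidate most often observed between the two context words."""
    return max(vocab, key=lambda w: contexts[(left, w, right)])

vocab = {"cat", "dog", "sat", "mat", "fish"}
print(fill_mask("the", "sat", vocab))  # fills "the [MASK] sat"
```

A real masked LM generalizes far beyond exact context matches, which is what makes pre-training on a few million monolingual sentences useful for downstream tasks.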

NagameseBERT

Language Model

NagameseBERT is a 7M-parameter RoBERTa-style BERT model pre-trained on 42,552 Nagamese sentences. It achieves competitive performance on downstream NLP tasks while offering significant efficiency advantages.

~7M params

ChakmaBERT

Language Model

ChakmaBERT is a Latin-script Chakma language model built on XLM-RoBERTa and trained on 41k conversational sentences. The model supports masked language modeling but is limited to informal, Latin-script Chakma.

MLM

GaroBERT

Language Model

GaroBERT is a masked language model for the A’chik (Garo) language, built on XLM-RoBERTa-base. It was trained on 50,673 cleaned Latin-script A’chik sentences, yielding much lower perplexity than multilingual baselines.

~278M params
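The perplexity comparison works as follows: perplexity is the exponentiated mean negative log-probability a model assigns to held-out tokens, so lower means a better fit to the language. A minimal sketch with hypothetical per-token probabilities (not GaroBERT's actual scores):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the mean negative log-probability per token."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# Hypothetical per-token probabilities on the same held-out sentence.
monolingual = [0.4, 0.5, 0.3, 0.6]     # a model adapted to the language
multilingual = [0.05, 0.1, 0.08, 0.2]  # a generic multilingual baseline

print(round(perplexity(monolingual), 2))
print(round(perplexity(multilingual), 2))  # higher = worse fit
```

A multilingual baseline spreads probability mass across many languages and scripts, which is why a monolingual model trained on even ~50k clean sentences can score dramatically lower perplexity.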

Speech

Automatic speech recognition (ASR) systems for Northeast India’s underserved languages, enabling voice-to-text conversion and spoken language processing.

Garo ASR

Speech-to-text

Fine-tuned Whisper Small model for automatic speech recognition in the Garo language.

Nagamese ASR

Speech-to-text

Nagamese-ASR is a speech recognition model for Nagamese, trained on conversational audio. It transcribes spoken Nagamese into text, supporting Latin-script usage.

Expanding Modalities
Speech, Vision, OCR, and Multimodal models are under active development.

Let's Build Together

Are you a researcher, developer, or part of a language community in Northeast India? We are always looking for partners to collaborate on new datasets, fine-tune models, and advance the state of regional AI.