Our Foundation Models
We create and open-source foundation models designed to support research, education, and practical applications.
Kren-M
Generative Model
Kren-M is a bilingual (Khasi–English) language model developed through extensive continued pre-training and supervised fine-tuning of Gemma 2 (2B). It is designed specifically for Khasi, a low-resource Austroasiatic language spoken in Meghalaya, Northeast India, while retaining the English fluency of its base model.
~3B params
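As an instruction-tuned generative model, Kren-M can be driven through the standard `transformers` causal-LM API. A minimal sketch follows; the hub id `KrenLabs/kren-m` and the Gemma-style chat template are assumptions for illustration, not confirmed release details.

```python
# Hypothetical usage sketch; "KrenLabs/kren-m" is an assumed hub id, not a
# confirmed repository name.

def build_prompt(instruction: str) -> str:
    """Wrap an instruction in Gemma-style chat turn markers (illustrative)."""
    return f"<start_of_turn>user\n{instruction}<end_of_turn>\n<start_of_turn>model\n"

def generate(instruction: str) -> str:
    """Run one generation; requires `transformers` and a model download."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("KrenLabs/kren-m")  # assumed id
    model = AutoModelForCausalLM.from_pretrained("KrenLabs/kren-m")
    inputs = tokenizer(build_prompt(instruction), return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=64)
    return tokenizer.decode(output[0], skip_special_tokens=True)
```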
NE-BERT
Encoder Model
NE-BERT is a state-of-the-art open-source encoder model for nine Northeast Indian languages, built on ModernBERT for superior speed and accuracy in low-resource NLP.
~149M params
NE-LID
Language Identification Model
NE-LID is a high-accuracy language identification model for Northeast Indian languages. Our benchmark study shows that character-level models outperform generic LID systems on low-resource, script-diverse text.
fastText
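Because NE-LID is a fastText classifier, it can be queried with the standard fastText Python API. The sketch below assumes a local model file named `ne-lid.bin` and illustrative label codes; both are assumptions, not documented specifics.

```python
# Sketch of querying a fastText language-identification model; the filename
# "ne-lid.bin" and the label codes are assumptions for illustration.

def strip_label(label: str) -> str:
    """fastText prefixes predicted classes with '__label__'; return the bare code."""
    return label.removeprefix("__label__")

def identify(text: str, k: int = 3):
    """Top-k language guesses; requires the `fasttext` package and the model file."""
    import fasttext
    model = fasttext.load_model("ne-lid.bin")  # assumed local filename
    labels, probs = model.predict(text, k=k)
    return [(strip_label(lbl), float(p)) for lbl, p in zip(labels, probs)]
```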
Northeast India NLP
Multilingual NLP models and tokenizers for underrepresented languages of Northeast India, built for civic use and reproducibility.
Assamese RoBERTa
Language Model
Assamese RoBERTa is a custom monolingual RoBERTa-Base model pre-trained from scratch on the Assamese language.
~110M params
Meitei RoBERTa
Language Model
The Meitei-RoBERTa-Base model is a high-performance, monolingual transformer encoder pre-trained from scratch on the entire Meitei Monolingual Corpus.
~110M params
KhasiBERT
Language Model
Foundational Khasi model trained on ~3.6M sentences. Useful for translation, summarization, and low-resource NLP research.
~110M params
Mizo-RoBERTa
Language Model
Mizo-RoBERTa is a transformer-based language model for Mizo. Built on the RoBERTa architecture and trained on a large-scale curated corpus, this model provides state-of-the-art language understanding capabilities for Mizo NLP applications.
~110M params
NyishiBERT
Language Model
NyishiBERT is a monolingual masked language model for Nyishi (njz-Latn), a Sino-Tibetan language spoken in Northeast India.
~110M params
NagameseBERT
Language Model
NagameseBERT is a 7M-parameter RoBERTa-style BERT model pre-trained on 42,552 Nagamese sentences. It achieves competitive performance on downstream NLP tasks while offering significant efficiency advantages.
~7M params
ChakmaBERT
Language Model
ChakmaBERT is a Latin-script Chakma language model built on XLM-RoBERTa and trained on 41k conversational sentences. The model supports masked language modeling but is limited to informal, Latin-script Chakma.
~278M params
GaroBERT
Language Model
GaroBERT is a masked language model for the A’chik (Garo) language, built on XLM-RoBERTa-base. It was trained on 50,673 cleaned Latin-script A’chik sentences, yielding much lower perplexity than multilingual baselines.
~278M params
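All of the encoder models above are masked language models, so the standard `transformers` fill-mask pipeline applies to any of them. A minimal sketch, assuming a hypothetical hub id `NortheastNLP/assamese-roberta-base` (the actual repository names may differ); RoBERTa-style tokenizers use `<mask>` as the mask token.

```python
# Illustrative fill-mask usage for the monolingual encoders above; the hub id
# below is an assumption, not a confirmed repository name.

def top_tokens(results, n=3):
    """Pull the n highest-scoring token strings from a fill-mask result list."""
    return [r["token_str"].strip() for r in results[:n]]

def fill_blank(sentence: str):
    """Requires `transformers` and a model download; sentence must contain <mask>."""
    from transformers import pipeline
    fill = pipeline("fill-mask", model="NortheastNLP/assamese-roberta-base")  # assumed id
    return top_tokens(fill(sentence))
```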
Speech
Automatic speech recognition (ASR) systems for Northeast India’s underserved languages, enabling voice-to-text conversion and spoken language processing.
Garo ASR
Speech-to-text
A fine-tuned Whisper Small model for automatic speech recognition in the Garo language.
Nagamese ASR
Speech-to-text
Nagamese-ASR is a speech recognition model for Nagamese, trained on conversational audio. It transcribes spoken Nagamese into text, supporting Latin-script usage.
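Both ASR models are Whisper fine-tunes, so they can be run through the `transformers` speech-recognition pipeline. A sketch under stated assumptions: the hub id `NortheastNLP/whisper-small-garo` and the audio path are hypothetical placeholders.

```python
# Sketch of transcribing audio with a fine-tuned Whisper checkpoint; the hub id
# and audio filename are assumptions for illustration.

def format_timestamp(seconds: float) -> str:
    """Render a second offset as MM:SS for a simple transcript listing."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"

def transcribe(audio_path: str):
    """Requires `transformers`; returns (MM:SS, text) pairs per transcribed chunk."""
    from transformers import pipeline
    asr = pipeline("automatic-speech-recognition",
                   model="NortheastNLP/whisper-small-garo")  # assumed id
    result = asr(audio_path, return_timestamps=True)
    return [(format_timestamp(c["timestamp"][0]), c["text"]) for c in result["chunks"]]
```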
Expanding Modalities
Speech, Vision, OCR, and Multimodal models are under active development.
Let's Build Together
Are you a researcher, developer, or part of a language community in Northeast India? We are always looking for partners to collaborate on new datasets, fine-tune models, and advance the state of regional AI.