Our Foundation Models

We create and open-source foundation models designed to support research, education, and practical applications.

Kren-M

Generative Model

Kren-M is a bilingual (Khasi–English) language model developed through extensive continued pre-training and supervised fine-tuning of Gemma 2 (2B). It is designed specifically for Khasi, a low-resource Austroasiatic language spoken in Meghalaya, Northeast India, while retaining the English fluency of its base model.

~3B params
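As a sketch of how a Gemma 2 fine-tune like Kren-M could be queried with Hugging Face `transformers`: the repo id below is a placeholder, not a published path, and the assumption that Kren-M inherits Gemma 2's `<start_of_turn>`/`<end_of_turn>` chat markers is ours.

```python
# Sketch: generating Khasi text with a Gemma 2 fine-tune such as Kren-M.
# "your-org/kren-m" is a hypothetical repo id, not a real model path.

def build_khasi_prompt(instruction: str) -> str:
    """Wrap an instruction in Gemma 2's chat-template markers.

    Assumption: a supervised fine-tune of Gemma 2 (2B) keeps the base
    model's <start_of_turn>/<end_of_turn> turn delimiters.
    """
    return (
        f"<start_of_turn>user\n{instruction}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )

def generate(instruction: str, repo_id: str = "your-org/kren-m") -> str:
    # Imported lazily so the prompt helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    inputs = tokenizer(build_khasi_prompt(instruction), return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=64)
    # Decode only the newly generated tokens, skipping the prompt.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

In practice you would swap in the published repo id and, for chat-tuned checkpoints, prefer `tokenizer.apply_chat_template` over hand-built markers.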

NE-BERT

Encoder Model

NE-BERT is a state-of-the-art open-source encoder model for 9 Northeast Indian languages, built on ModernBERT for fast, accurate low-resource NLP.

~149M params

Northeast India NLP

Multilingual NLP models and tokenizers for underrepresented languages of Northeast India, built for civic use and reproducibility.

Assamese RoBERTa

Language Model

Assamese RoBERTa is a custom monolingual RoBERTa-Base model pre-trained from scratch on the Assamese language.

~110M params

Meitei RoBERTa

Language Model

The Meitei-RoBERTa-Base model is a high-performance, monolingual transformer encoder pre-trained from scratch on the entire Meitei Monolingual Corpus.

~110M params

KhasiBERT

Language Model

Foundational Khasi model trained on ~3.6M sentences. Useful for translation, summarization, and low-resource NLP research.

~110M params

Mizo-RoBERTa

Language Model

Mizo-RoBERTa is a transformer-based language model for Mizo. Built on the RoBERTa architecture and trained on a large-scale curated corpus, this model provides state-of-the-art language understanding capabilities for Mizo NLP applications.

~110M params
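The RoBERTa-style encoders above are typically used for fill-mask inference or as backbones for fine-tuning. A minimal sketch, assuming a hypothetical repo id and the standard RoBERTa `<mask>` token:

```python
# Sketch: fill-mask inference with a RoBERTa-style encoder such as the
# Assamese, Meitei, or Mizo models above. The repo id is a placeholder.

MASK = "<mask>"  # standard RoBERTa mask token

def has_single_mask(text: str) -> bool:
    """The fill-mask pipeline expects exactly one mask token per input."""
    return text.count(MASK) == 1

def fill_mask(text: str, repo_id: str = "your-org/assamese-roberta-base"):
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    if not has_single_mask(text):
        raise ValueError(f"text must contain exactly one {MASK} token")
    unmasker = pipeline("fill-mask", model=repo_id)
    # Returns a ranked list of candidate fills with token_str and score.
    return unmasker(text)
```

The same pattern applies to any of the monolingual encoders; only the repo id and the language of the input text change.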

Let's Build Together

Are you a researcher, developer, or part of a language community in Northeast India? We are always looking for partners to collaborate on new datasets, fine-tune models, and advance the state of regional AI.