# KhasiBERT
KhasiBERT is a foundational language model developed by MWirelabs, designed to bring state-of-the-art natural language processing (NLP) capabilities to the Khasi language. It is a RoBERTa-based model pre-trained from scratch on a large, curated corpus of Khasi text, providing a robust base for building downstream applications like text classification, named entity recognition, and question answering.
## Training Data
The model was pre-trained on a corpus of 3.6 million Khasi sentences, meticulously cleaned and processed to ensure high quality. This dataset provides the model with a deep understanding of the language’s structure and nuances.
## Tokenizer
- **Type:** Byte-level Byte-Pair Encoding (BPE)
- **Vocabulary size:** 50,265 tokens
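As a quick illustration, the snippet below loads the tokenizer and inspects how it segments the example sentence used in the quick start further down. The exact token split shown at runtime depends on the learned merges, so treat the output as indicative:

```python
from transformers import RobertaTokenizerFast

# Load the KhasiBERT tokenizer from the Hugging Face Hub
tokenizer = RobertaTokenizerFast.from_pretrained('MWirelabs/khasibert')

# Byte-level BPE segments raw bytes into subword units; the 'Ġ' prefix
# marks tokens that begin with a space
print(tokenizer.tokenize("Ka Meghalaya ka ha ka jingpyrkhat jong ki Khasi."))

# Should match the 50,265-token vocabulary noted above
print(tokenizer.vocab_size)
```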
## Model Performance & Evaluation
The primary evaluation for a model trained with a Masked Language Modeling objective is its ability to accurately predict masked tokens. The following results are from the model’s performance on a held-out evaluation set.
| Evaluation Metric | Score |
|---|---|
| Evaluation Loss | 1.5986 |
| Evaluation Perplexity | 4.95 |
| Evaluation Accuracy | 0.6592 |
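For reference, the reported perplexity is simply the exponential of the evaluation loss (the mean cross-entropy over masked tokens), so the two figures in the table are consistent:

$$
\text{Perplexity} = e^{\text{Eval Loss}} = e^{1.5986} \approx 4.95
$$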
These results demonstrate the model’s strong foundational understanding of the Khasi language and its readiness to be fine-tuned for specific tasks (a brief fine-tuning sketch follows the quick-start example below).
## Quick start
```python
from transformers import RobertaForMaskedLM, RobertaTokenizerFast, pipeline

# Load the pre-trained model and tokenizer from the Hugging Face Hub
model = RobertaForMaskedLM.from_pretrained('MWirelabs/khasibert')
tokenizer = RobertaTokenizerFast.from_pretrained('MWirelabs/khasibert')

# Create a fill-mask pipeline
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# Example usage: the input must contain the tokenizer's mask token,
# so one word of the sentence is replaced with it here
text = f"Ka Meghalaya ka ha ka {tokenizer.mask_token} jong ki Khasi."

# Returns a list of candidate completions, each with a score,
# token id, token string, and the filled-in sequence
results = fill_mask(text)
print(results)
```
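The quick start covers masked-token prediction; downstream tasks instead fine-tune a task head on top of the encoder. The sketch below is illustrative only: the two-label setup, example text, and label are hypothetical placeholders, and a real run would use a labelled Khasi dataset and a full training loop.

```python
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

# Load the KhasiBERT encoder with a randomly initialised classification
# head; num_labels=2 is a hypothetical choice for illustration
model = RobertaForSequenceClassification.from_pretrained(
    'MWirelabs/khasibert', num_labels=2
)
tokenizer = RobertaTokenizerFast.from_pretrained('MWirelabs/khasibert')

# Hypothetical labelled example; replace with a real task dataset
texts = ["Ka Meghalaya ka ha ka jingpyrkhat jong ki Khasi."]
labels = torch.tensor([1])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)

# outputs.loss can be backpropagated in a standard training loop
print(outputs.loss, outputs.logits)
```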
## Let's Build Together
Are you a researcher, developer, or part of a language community in Northeast India? We are always looking for partners to collaborate on new datasets, fine-tune models, and advance the state of regional AI.