KhasiBERT

KhasiBERT is a foundational language model developed by MWirelabs, designed to bring state-of-the-art natural language processing (NLP) capabilities to the Khasi language. It is a RoBERTa-based model pre-trained from scratch on a large, curated corpus of Khasi text, providing a robust base for building downstream applications like text classification, named entity recognition, and question answering.

Training Data

The model was pre-trained on a corpus of 3.6 million Khasi sentences, carefully cleaned and processed to ensure high quality. This corpus gives the model broad coverage of Khasi vocabulary, grammar, and sentence structure.

Tokenizer

  • Type: Byte-level Byte-Pair Encoding (BPE)

  • Vocabulary Size: 50,265 tokens
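
Because the tokenizer is a byte-level BPE, any Khasi string can be encoded without out-of-vocabulary failures. Below is a minimal sketch of inspecting the tokenizer, using the MWirelabs/khasibert checkpoint from the Quick Start further down; the exact subword splits your run prints may differ:

from transformers import RobertaTokenizerFast

# Load the KhasiBERT tokenizer (byte-level BPE, 50,265-token vocabulary)
tokenizer = RobertaTokenizerFast.from_pretrained('MWirelabs/khasibert')

# Split a Khasi sentence into subword tokens; a leading 'Ġ' on a token
# marks a preceding space (word boundary) in byte-level BPE
tokens = tokenizer.tokenize('Ka Meghalaya ka <mask> ha ka jingpyrkhat jong ki Khasi.')
print(tokens)

# The vocabulary size should match the figure listed above
print(tokenizer.vocab_size)  # expected: 50265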

Model Performance & Evaluation

The primary evaluation for a model trained with a masked language modeling (MLM) objective is its ability to predict masked tokens accurately. The following results are from the model’s performance on a held-out evaluation set.

  • Evaluation Loss: 1.5986

  • Evaluation Perplexity: 4.95

  • Evaluation Accuracy: 0.6592

These results demonstrate the model’s strong foundational understanding of the Khasi language and its readiness to be fine-tuned for specific tasks.
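
As a quick consistency check, the reported perplexity is the exponential of the evaluation loss (assuming, as is standard for MLM evaluation, that the loss is the mean cross-entropy in nats):

import math

# Perplexity = exp(cross-entropy loss)
print(round(math.exp(1.5986), 2))  # 4.95, matching the figure above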

Quick Start

from transformers import RobertaForMaskedLM, RobertaTokenizerFast, pipeline

# Load model and tokenizer
model = RobertaForMaskedLM.from_pretrained('MWirelabs/khasibert')
tokenizer = RobertaTokenizerFast.from_pretrained('MWirelabs/khasibert')

# Create fill-mask pipeline
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# Example usage: predict the masked token in a Khasi sentence
text = 'Ka Meghalaya ka <mask> ha ka jingpyrkhat jong ki Khasi.'
results = fill_mask(text)
print(results)
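
For downstream use, the pre-trained encoder can be loaded with a task-specific head and fine-tuned with the Hugging Face Trainer. The sketch below is a minimal, hypothetical example for text classification; train_dataset and num_labels are placeholders you would replace with your own labelled Khasi data:

from transformers import (RobertaForSequenceClassification,
                          RobertaTokenizerFast, Trainer, TrainingArguments)

# Load the pre-trained encoder with a freshly initialised classification head;
# num_labels=2 is a placeholder for your own label set
model = RobertaForSequenceClassification.from_pretrained(
    'MWirelabs/khasibert', num_labels=2)
tokenizer = RobertaTokenizerFast.from_pretrained('MWirelabs/khasibert')

# train_dataset is a placeholder: your own labelled Khasi examples,
# tokenized with the tokenizer above and carrying an integer 'labels' field
training_args = TrainingArguments(
    output_dir='khasibert-classifier',
    num_train_epochs=3,
    per_device_train_batch_size=16,
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()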

Let's Build Together

Are you a researcher, developer, or part of a language community in Northeast India? We are always looking for partners to collaborate on new datasets, fine-tune models, and advance the state of regional AI.