# KhasiBERT
KhasiBERT is a foundational language model developed by MWirelabs, designed to bring state-of-the-art natural language processing (NLP) capabilities to the Khasi language. It is a RoBERTa-based model pre-trained from scratch on a large, curated corpus of Khasi text, providing a robust base for building downstream applications like text classification, named entity recognition, and question answering.
## Training Data
The model was pre-trained on a corpus of 3.6 million Khasi sentences, meticulously cleaned and processed to ensure high quality. This dataset provides the model with a deep understanding of the language’s structure and nuances.
## Tokenizer
- **Type:** Byte-level Byte-Pair Encoding (BPE)
- **Vocabulary size:** 50,265 tokens
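As a quick illustration, the snippet below loads the tokenizer and inspects how it segments the example sentence used in the quick start further down. The exact token split shown at runtime depends on the learned merges, so treat the output as indicative:

```python
from transformers import RobertaTokenizerFast

# Load the KhasiBERT tokenizer from the Hugging Face Hub
tokenizer = RobertaTokenizerFast.from_pretrained('MWirelabs/khasibert')

# Byte-level BPE segments raw bytes into subword units; the 'Ġ' prefix
# marks tokens that begin with a space
print(tokenizer.tokenize("Ka Meghalaya ka ha ka jingpyrkhat jong ki Khasi."))

# Should match the 50,265-token vocabulary noted above
print(tokenizer.vocab_size)
```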
## Model Performance & Evaluation
The primary evaluation for a model trained with a Masked Language Modeling objective is its ability to accurately predict masked tokens. The following results are from the model’s performance on a held-out evaluation set.
| Evaluation Metric | Score |
|---|---|
| Evaluation Loss | 1.5986 |
| Evaluation Perplexity | 4.95 |
| Evaluation Accuracy | 0.6592 |
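For reference, the reported perplexity is simply the exponential of the evaluation loss (the mean cross-entropy over masked tokens), so the two figures in the table are consistent:

$$
\text{Perplexity} = e^{\text{Eval Loss}} = e^{1.5986} \approx 4.95
$$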
These results demonstrate the model’s strong foundational understanding of the Khasi language and its readiness to be fine-tuned for specific tasks (a brief fine-tuning sketch follows the quick-start example below).
## Quick start
```python
from transformers import RobertaForMaskedLM, RobertaTokenizerFast, pipeline

# Load the pre-trained model and tokenizer from the Hugging Face Hub
model = RobertaForMaskedLM.from_pretrained('MWirelabs/khasibert')
tokenizer = RobertaTokenizerFast.from_pretrained('MWirelabs/khasibert')

# Create a fill-mask pipeline
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# Example usage: the input must contain the tokenizer's mask token,
# so one word of the sentence is replaced with it here
text = f"Ka Meghalaya ka ha ka {tokenizer.mask_token} jong ki Khasi."

# Returns a list of candidate completions, each with a score,
# token id, token string, and the filled-in sequence
results = fill_mask(text)
print(results)
```
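The quick start covers masked-token prediction; downstream tasks instead fine-tune a task head on top of the encoder. The sketch below is illustrative only: the two-label setup, example text, and label are hypothetical placeholders, and a real run would use a labelled Khasi dataset and a full training loop.

```python
import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast

# Load the KhasiBERT encoder with a randomly initialised classification
# head; num_labels=2 is a hypothetical choice for illustration
model = RobertaForSequenceClassification.from_pretrained(
    'MWirelabs/khasibert', num_labels=2
)
tokenizer = RobertaTokenizerFast.from_pretrained('MWirelabs/khasibert')

# Hypothetical labelled example; replace with a real task dataset
texts = ["Ka Meghalaya ka ha ka jingpyrkhat jong ki Khasi."]
labels = torch.tensor([1])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)

# outputs.loss can be backpropagated in a standard training loop
print(outputs.loss, outputs.logits)
```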
## Let's Build Together
Are you a researcher, developer, or part of a language community in Northeast India? We are always looking for partners to collaborate on new datasets, fine-tune models, and advance the state of regional AI.