NE-OCR: Northeast India's First Multilingual OCR System
Standard OCR models treat our scripts as “noise.” NE-OCR treats them as native. We built Northeast India’s first multilingual OCR system on thousands of handcrafted and synthetic document images to preserve, protect, and digitize our region’s diverse written heritage.
Why the World Needs a Northeast Indian OCR System
Millions of documents, manuscripts, government records, and cultural texts across Northeast India remain locked in physical form – inaccessible, unsearchable, and at risk of being lost forever.
Generic OCR tools like Tesseract and PaddleOCR fail on Northeast Indian scripts because they were never trained on them. They produce garbled output for Assamese, Meitei, and Latin-script languages like Mizo, Garo, and Khasi – treating our writing systems as edge cases.
NE-OCR is different. It is a domain-specific multilingual OCR system built from the ground up for Northeast India. By training on curated document datasets across multiple languages and scripts, we have created a single system that accurately reads the unique written context of the Northeast.
Results: Character Accuracy
Character Accuracy (ChA%) for all five models across 12 language-script pairs. NE-OCR achieves the highest ChA on 10 of the 12 pairs. Bold values indicate the best result per language-script pair.
| Language | Script | NE-OCR (Ours) | EasyOCR | Tesseract 5 | TrOCR (large-printed) | Chandra |
|---|---|---|---|---|---|---|
| Assamese | Bengali | **97.46%** | 32.25% | 8.79% | 0.80% | 57.83% |
| Bodo | Devanagari | **83.38%** | 82.65% | 64.85% | 1.85% | 74.76% |
| English (anchor) | Latin | 90.35% | 68.91% | 50.77% | 88.87% | **91.30%** |
| Garo | Latin | 93.52% | 69.43% | 69.90% | 87.83% | **94.15%** |
| Hindi (anchor) | Devanagari | **97.69%** | 49.54% | 41.48% | 1.27% | 85.78% |
| Khasi | Latin | **98.85%** | 77.78% | 80.72% | 93.22% | 94.15% |
| Kokborok | Latin | **97.59%** | 83.00% | 78.76% | 94.58% | 96.19% |
| Meitei (Bengali script) | Bengali | **97.09%** | 33.64% | 7.30% | 0.55% | 48.34% |
| Meitei (Meitei Mayek) | Meitei Mayek | **95.56%** | 2.50% | 2.24% | 2.45% | 2.57% |
| Mizo | Latin | **95.96%** | 67.62% | 68.44% | 84.58% | 92.96% |
| Nagamese | Latin | **97.91%** | 81.60% | 78.05% | 93.46% | 97.60% |
| Nyishi | Latin | **94.50%** | 69.56% | 69.92% | 87.23% | 91.85% |
| Average | – | **94.99%** | 59.87% | 51.77% | 53.06% | 77.29% |
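For readers who want to reproduce scores like those in the table on their own data, a common definition of Character Accuracy is ChA% = (1 − Levenshtein(prediction, reference) / |reference|) × 100, floored at zero. The sketch below implements that definition in plain Python; the exact metric used in the NE-OCR benchmark may differ in details such as whitespace handling.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the DP row short
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (free on match)
        prev = curr
    return prev[-1]

def char_accuracy(pred: str, ref: str) -> float:
    """ChA% = (1 - edit_distance / len(reference)) * 100, floored at 0."""
    if not ref:
        return 100.0 if not pred else 0.0
    return max(0.0, 1.0 - levenshtein(pred, ref) / len(ref)) * 100

# One substitution ('K' vs 'k') in an 8-character reference -> 87.5%
print(char_accuracy("ka Khasi", "ka khasi"))  # 87.5
```

Averaging `char_accuracy` over all crops in a test set gives a per-language ChA% comparable in spirit to the table above.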
Why NE-OCR Works Where Others Fail
Generic OCR systems were never built for Northeast India. They fragment agglutinative words, misread Bengali-Assamese script ligatures, and produce garbage output on low-resource Latin-script languages like Nyishi and Kokborok.
NE-OCR is built on DocTR ViTSTR-Base – a Vision Transformer architecture selected after rigorous cross-language benchmarking. One unified model. Twelve language-script pairs. No pipeline switching.
What This Means For You
- Digitize government records in Assamese, Meitei, Kokborok, Khasi, and other regional languages – automatically and at scale.
- Preserve endangered language texts before they are lost to time.
- Deploy once, cover the Northeast: a single model handles Bengali-Assamese and Latin scripts without any reconfiguration.
Quick start
```python
import json

import numpy as np
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from doctr.models import vitstr_base

# Download the model weights and vocabulary
model_path = hf_hub_download(repo_id='MWirelabs/ne-ocr', filename='ne_ocr_best.pt')
vocab_path = hf_hub_download(repo_id='MWirelabs/ne-ocr', filename='ne_ocr_vocab.json')

# Load the vocabulary, skipping the first (reserved) entry
with open(vocab_path, encoding='utf-8') as f:
    vocab_data = json.load(f)
vocab_str = ''.join(vocab_data['vocab'][1:])

# Load the model
model = vitstr_base(pretrained=False, vocab=vocab_str)
model.load_state_dict(torch.load(model_path, map_location='cpu'))
model.eval()

# Inference on a single word/line crop (max 32 characters)
img = Image.open('your_crop.jpg').convert('RGB').resize((128, 32))  # (width, height)
img_tensor = torch.tensor(np.array(img, dtype=np.float32) / 255.0).permute(2, 0, 1).unsqueeze(0)
with torch.no_grad():
    out = model(img_tensor)
print(out['preds'][0][0])  # predicted text
```
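The quick start handles one crop at a time. When digitizing whole documents you will typically have many word/line crops; the same preprocessing can be batched before a single forward pass. `preprocess` below is a hypothetical helper, not part of the NE-OCR release, that mirrors the quick-start steps: resize each crop to 128×32, scale pixel values to [0, 1], and stack into an NCHW tensor.

```python
import numpy as np
import torch
from PIL import Image

def preprocess(crops):
    """Resize word/line crops to the model's 32x128 input and stack into a batch tensor."""
    arrs = []
    for img in crops:
        img = img.convert('RGB').resize((128, 32))  # (width, height)
        arrs.append(np.asarray(img, dtype=np.float32) / 255.0)
    # (N, H, W, C) -> (N, C, H, W), as expected by the model
    return torch.from_numpy(np.stack(arrs)).permute(0, 3, 1, 2)

# Synthetic check: two blank crops of different original sizes
crops = [Image.new('RGB', (200, 50)), Image.new('RGB', (64, 64))]
batch = preprocess(crops)
print(batch.shape)  # torch.Size([2, 3, 32, 128])
```

The resulting batch can be passed to `model(batch)` exactly as in the quick start, with `out['preds'][i][0]` giving the predicted text for the i-th crop.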
Let's Build Together
Are you a researcher, developer, or part of a language community in Northeast India? We are always looking for partners to collaborate on new datasets, fine-tune models, and advance the state of regional AI.