NE-OCR: Northeast India's First Multilingual OCR System
Standard OCR models treat our scripts as “noise.” NE-OCR treats them as native. We built Northeast India’s first multilingual OCR system on thousands of handcrafted and synthetic document images to preserve, protect, and digitize our region’s diverse written heritage.
Why the World Needs a Northeast Indian OCR System
Millions of documents, manuscripts, government records, and cultural texts across Northeast India remain locked in physical form – inaccessible, unsearchable, and at risk of being lost forever.
Generic OCR tools like Tesseract and PaddleOCR fail on Northeast Indian scripts because they were never trained on them. They produce garbled output for Assamese, Meitei, and Latin-script languages like Mizo, Garo, and Khasi – treating our writing systems as edge cases.
NE-OCR is different. It is a domain-specific multilingual OCR system built from the ground up for Northeast India. By training on curated document datasets across multiple languages and scripts, we have created a single system that accurately reads the unique written context of the Northeast.
Results: Character Accuracy
Character Accuracy (ChA%) for all five models across 12 language-script pairs. NE-OCR achieves the highest ChA on 10 of the 12 pairs. Bold values indicate the best result per language-script pair.
| Language | Script | NE-OCR (Ours) | EasyOCR | Tesseract 5 | TrOCR (large-printed) | Chandra |
|---|---|---|---|---|---|---|
| Assamese | Bengali | **97.46%** | 32.25% | 8.79% | 0.80% | 57.83% |
| Bodo | Devanagari | **83.38%** | 82.65% | 64.85% | 1.85% | 74.76% |
| English (anchor) | Latin | 90.35% | 68.91% | 50.77% | 88.87% | **91.30%** |
| Garo | Latin | 93.52% | 69.43% | 69.90% | 87.83% | **94.15%** |
| Hindi (anchor) | Devanagari | **97.69%** | 49.54% | 41.48% | 1.27% | 85.78% |
| Khasi | Latin | **98.85%** | 77.78% | 80.72% | 93.22% | 94.15% |
| Kokborok | Latin | **97.59%** | 83.00% | 78.76% | 94.58% | 96.19% |
| Meitei (Bengali script) | Bengali | **97.09%** | 33.64% | 7.30% | 0.55% | 48.34% |
| Meitei (Meitei Mayek) | Meitei Mayek | **95.56%** | 2.50% | 2.24% | 2.45% | 2.57% |
| Mizo | Latin | **95.96%** | 67.62% | 68.44% | 84.58% | 92.96% |
| Nagamese | Latin | **97.91%** | 81.60% | 78.05% | 93.46% | 97.60% |
| Nyishi | Latin | **94.50%** | 69.56% | 69.92% | 87.23% | 91.85% |
| Average | – | **94.99%** | 59.87% | 51.77% | 53.06% | 77.29% |
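For readers who want to reproduce scores like those in the table on their own data, a common definition of Character Accuracy is ChA% = (1 − Levenshtein(prediction, reference) / |reference|) × 100, floored at zero. The sketch below implements that definition in plain Python; the exact metric used in the NE-OCR benchmark may differ in details such as whitespace handling.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep the DP row short
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (free on match)
        prev = curr
    return prev[-1]

def char_accuracy(pred: str, ref: str) -> float:
    """ChA% = (1 - edit_distance / len(reference)) * 100, floored at 0."""
    if not ref:
        return 100.0 if not pred else 0.0
    return max(0.0, 1.0 - levenshtein(pred, ref) / len(ref)) * 100

# One substitution ('K' vs 'k') in an 8-character reference -> 87.5%
print(char_accuracy("ka Khasi", "ka khasi"))  # 87.5
```

Averaging `char_accuracy` over all crops in a test set gives a per-language ChA% comparable in spirit to the table above.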
Why NE-OCR Works Where Others Fail
Generic OCR systems were never built for Northeast India. They fragment agglutinative words, misread Bengali-Assamese script ligatures, and produce garbage output on low-resource Latin-script languages like Nyishi and Kokborok.
NE-OCR is built on DocTR ViTSTR-Base – a Vision Transformer architecture selected after rigorous cross-language benchmarking. One unified model. Twelve language-script pairs. No pipeline switching.
What This Means For You
- Digitize government records in Assamese, Meitei, Kokborok, Khasi, and other regional languages – automatically and at scale.
- Preserve endangered language texts before they are lost to time.
- Deploy once, cover the Northeast: a single model handles Bengali-Assamese and Latin scripts without any reconfiguration.
Quick start
```python
import json

import numpy as np
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from doctr.models import vitstr_base

# Download the model weights and vocabulary
model_path = hf_hub_download(repo_id='MWirelabs/ne-ocr', filename='ne_ocr_best.pt')
vocab_path = hf_hub_download(repo_id='MWirelabs/ne-ocr', filename='ne_ocr_vocab.json')

# Load the vocabulary, skipping the first (reserved) entry
with open(vocab_path, encoding='utf-8') as f:
    vocab_data = json.load(f)
vocab_str = ''.join(vocab_data['vocab'][1:])

# Load the model
model = vitstr_base(pretrained=False, vocab=vocab_str)
model.load_state_dict(torch.load(model_path, map_location='cpu'))
model.eval()

# Inference on a single word/line crop (max 32 characters)
img = Image.open('your_crop.jpg').convert('RGB').resize((128, 32))  # (width, height)
img_tensor = torch.tensor(np.array(img, dtype=np.float32) / 255.0).permute(2, 0, 1).unsqueeze(0)
with torch.no_grad():
    out = model(img_tensor)
print(out['preds'][0][0])  # predicted text
```
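The quick start handles one crop at a time. When digitizing whole documents you will typically have many word/line crops; the same preprocessing can be batched before a single forward pass. `preprocess` below is a hypothetical helper, not part of the NE-OCR release, that mirrors the quick-start steps: resize each crop to 128×32, scale pixel values to [0, 1], and stack into an NCHW tensor.

```python
import numpy as np
import torch
from PIL import Image

def preprocess(crops):
    """Resize word/line crops to the model's 32x128 input and stack into a batch tensor."""
    arrs = []
    for img in crops:
        img = img.convert('RGB').resize((128, 32))  # (width, height)
        arrs.append(np.asarray(img, dtype=np.float32) / 255.0)
    # (N, H, W, C) -> (N, C, H, W), as expected by the model
    return torch.from_numpy(np.stack(arrs)).permute(0, 3, 1, 2)

# Synthetic check: two blank crops of different original sizes
crops = [Image.new('RGB', (200, 50)), Image.new('RGB', (64, 64))]
batch = preprocess(crops)
print(batch.shape)  # torch.Size([2, 3, 32, 128])
```

The resulting batch can be passed to `model(batch)` exactly as in the quick start, with `out['preds'][i][0]` giving the predicted text for the i-th crop.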
Let's Build Together
Are you a researcher, developer, or part of a language community in Northeast India? We are always looking for partners to collaborate on new datasets, fine-tune models, and advance the state of regional AI.