NE-OCR: Northeast India's First Multilingual OCR System

Standard OCR models treat our scripts as “noise.” NE-OCR treats them as native. We built Northeast India’s first multilingual OCR system on thousands of handcrafted and synthetic document images to preserve, protect, and digitize our region’s diverse written heritage.

Why the World Needs a Northeast Indian OCR System

Millions of documents, manuscripts, government records, and cultural texts across Northeast India remain locked in physical form – inaccessible, unsearchable, and at risk of being lost forever.

Generic OCR tools like Tesseract and PaddleOCR fail on Northeast Indian scripts because they were never trained on them. They produce garbled output for Assamese, Meitei, and Latin-script languages like Mizo, Garo, and Khasi – treating our writing systems as edge cases.

NE-OCR is different. It is a domain-specific multilingual OCR system built from the ground up for Northeast India. By training on curated document datasets across multiple languages and scripts, we have created a single system that accurately reads the region's diverse writing systems.

Results: Character Accuracy

Character Accuracy (ChA%) for all five models across 12 language-script pairs. NE-OCR achieves the highest ChA on 10 of the 12 pairs. Bold values indicate the best result per language-script pair.

| Language | Script | NE-OCR (Ours) | EasyOCR | Tesseract 5 | TrOCR (large-printed) | Chandra |
|---|---|---|---|---|---|---|
| Assamese | Bengali | **97.46%** | 32.25% | 8.79% | 0.80% | 57.83% |
| Bodo | Devanagari | **83.38%** | 82.65% | 64.85% | 1.85% | 74.76% |
| English (anchor) | Latin | 90.35% | 68.91% | 50.77% | 88.87% | **91.30%** |
| Garo | Latin | 93.52% | 69.43% | 69.90% | 87.83% | **94.15%** |
| Hindi (anchor) | Devanagari | **97.69%** | 49.54% | 41.48% | 1.27% | 85.78% |
| Khasi | Latin | **98.85%** | 77.78% | 80.72% | 93.22% | 94.15% |
| Kokborok | Latin | **97.59%** | 83.00% | 78.76% | 94.58% | 96.19% |
| Meitei (Bengali script) | Bengali | **97.09%** | 33.64% | 7.30% | 0.55% | 48.34% |
| Meitei (Meitei Mayek) | Meitei Mayek | **95.56%** | 2.50% | 2.24% | 2.45% | 2.57% |
| Mizo | Latin | **95.96%** | 67.62% | 68.44% | 84.58% | 92.96% |
| Nagamese | Latin | **97.91%** | 81.60% | 78.05% | 93.46% | 97.60% |
| Nyishi | Latin | **94.50%** | 69.56% | 69.92% | 87.23% | 91.85% |
| **Average** | | **94.99%** | 59.87% | 51.77% | 53.06% | 77.29% |
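If you want to reproduce this kind of metric on your own data: character accuracy is commonly defined as one minus the character-level Levenshtein distance divided by the reference length. The exact definition used for the table above is not stated here, so the following is a minimal sketch under that common assumption:

```python
def edit_distance(pred: str, truth: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    m, n = len(pred), len(truth)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + 1,                                   # deletion
                dp[i][j - 1] + 1,                                   # insertion
                dp[i - 1][j - 1] + (pred[i - 1] != truth[j - 1]),   # substitution
            )
    return dp[m][n]


def character_accuracy(pred: str, truth: str) -> float:
    """ChA% = (1 - edit_distance / len(truth)) * 100, floored at 0."""
    if not truth:
        return 100.0 if not pred else 0.0
    return max(0.0, 1.0 - edit_distance(pred, truth) / len(truth)) * 100.0
```

For example, `character_accuracy("Mizp", "Mizo")` gives 75.0, since one of four reference characters is wrong.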

Why NE-OCR Works Where Others Fail

Generic OCR systems were never built for Northeast India. They fragment agglutinative words, misread Bengali-Assamese script ligatures, and produce garbage output on low-resource Latin-script languages like Nyishi and Kokborok.

NE-OCR is built on docTR's ViTSTR-Base – a Vision Transformer recognition architecture selected after rigorous cross-language benchmarking. One unified model. Twelve languages. No pipeline switching.

What This Means For You

  1. Digitize government records in Assamese, Meitei, Kokborok, and Khasi, among others – automatically and at scale.
  2. Preserve endangered language texts before they are lost to time.
  3. Deploy once, cover the Northeast. A single model handles Bengali-Assamese, Devanagari, Latin, and Meitei Mayek scripts without any reconfiguration.
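Because one model emits text in several scripts, it can be useful to check which Unicode block a recognized string falls in – for example, when splitting OCR output by script for evaluation. The helper below is a hypothetical illustration, not part of NE-OCR; the block ranges come from the Unicode standard (the Meetei Mayek Extensions block U+AAE0–U+AAFF is omitted for brevity):

```python
# Unicode block ranges for the scripts NE-OCR covers (illustrative helper only)
SCRIPT_RANGES = {
    'Devanagari':   (0x0900, 0x097F),
    'Bengali':      (0x0980, 0x09FF),   # also covers Assamese letters
    'Meitei Mayek': (0xABC0, 0xABFF),
    'Latin':        (0x0000, 0x024F),   # Basic Latin through Latin Extended-B
}


def dominant_script(text: str) -> str:
    """Return the script whose Unicode block contains the most characters of text."""
    counts = {name: 0 for name in SCRIPT_RANGES}
    for ch in text:
        if ch.isspace():
            continue
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
    return max(counts, key=counts.get)
```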

Quick start

```python
import json

import numpy as np
import torch
from PIL import Image
from huggingface_hub import hf_hub_download
from doctr.models import vitstr_base

# Download the checkpoint and vocabulary from the Hugging Face Hub
model_path = hf_hub_download(repo_id='MWirelabs/ne-ocr', filename='ne_ocr_best.pt')
vocab_path = hf_hub_download(repo_id='MWirelabs/ne-ocr', filename='ne_ocr_vocab.json')

# Load the vocabulary, dropping the reserved token at index 0
with open(vocab_path, encoding='utf-8') as f:
    vocab_data = json.load(f)
vocab_str = ''.join(vocab_data['vocab'][1:])

# Build the recognizer and load the fine-tuned weights
model = vitstr_base(pretrained=False, vocab=vocab_str)
model.load_state_dict(torch.load(model_path, map_location='cpu'))
model.eval()

# Inference on a single word/line crop (max 32 characters)
img = Image.open('your_crop.jpg').convert('RGB').resize((128, 32))
img_tensor = torch.tensor(np.array(img, dtype=np.float32) / 255.0).permute(2, 0, 1).unsqueeze(0)
with torch.no_grad():
    out = model(img_tensor, return_preds=True)
print(out['preds'][0][0])  # decoded text for the crop
```

Let's Build Together

Are you a researcher, developer, or part of a language community in Northeast India? We are always looking for partners to collaborate on new datasets, fine-tune models, and advance the state of regional AI.