Garo ASR: Automatic Speech Recognition for Garo Language

First AI-Powered Speech-to-Text Model for Latin-Script Garo

Garo ASR is a breakthrough in automatic speech recognition (ASR): a publicly available deep learning model that converts Garo speech to text. Garo is a Tibeto-Burman language spoken by approximately 1.5 million people across Meghalaya, Assam, Tripura, and parts of Bangladesh.

What is Garo ASR?

Garo ASR is an artificial intelligence model that automatically transcribes spoken Garo into written text. Using a transformer-based neural network, the model processes audio recordings and generates accurate text transcriptions in real time. This technology enables applications like voice assistants, automated transcription services, accessibility tools for the hearing impaired, and digital documentation of Garo oral traditions.

The model specifically handles Latin-script Garo (used in India), distinguishing it from the Bengali-script variant used in Bangladesh. This makes it particularly valuable for educational institutions, government services, media organizations, and technology companies operating in Northeast India.

Model Performance & Accuracy

Technical Metrics:

  • Word Error Rate (WER): 9.74% – Among the lowest error rates for low-resource Indian languages
  • Character Error Rate (CER): 3.82% – Exceptional character-level accuracy
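
If you want to evaluate the model on your own recordings, WER and CER can be computed with the open-source jiwer library. This is a minimal sketch under an assumed tool choice, not the evaluation pipeline behind the reported numbers, and the strings are placeholders:

import jiwer

# Placeholder strings; substitute your ground-truth transcript and the model's output
reference = "the ground truth transcript"
hypothesis = "the model transcript"

wer = jiwer.wer(reference, hypothesis)   # word error rate
cer = jiwer.cer(reference, hypothesis)   # character error rate
print(f"WER: {wer:.2%}  CER: {cer:.2%}")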

Technology & Architecture

Built on State-of-the-Art AI: Garo ASR is built on OpenAI's Whisper architecture, a transformer-based neural network with 244 million parameters. The model leverages transfer learning: starting from Whisper's multilingual pre-training, it is fine-tuned specifically on Garo language data. This approach achieves strong accuracy despite training on only 47 hours of Garo speech – roughly one-tenth of the data traditional ASR systems typically require.
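As a quick sanity check on the figure above, the parameter count can be read directly off the checkpoint; this minimal sketch uses the same Transformers API as the Quick Start below:

from transformers import WhisperForConditionalGeneration

# Count parameters in the fine-tuned checkpoint; Whisper small is ~244M
model = WhisperForConditionalGeneration.from_pretrained("MWirelabs/garo-asr")
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.0f}M parameters")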

Key Technical Features

  • End-to-end neural architecture (no phoneme dictionaries required)
  • Handles code-mixing with English
  • Robust to background noise and varying audio quality
  • Supports multiple audio formats (.wav, .mp3, .ogg, .flac, .m4a)
  • Optimized for a 16 kHz audio sampling rate (see the resampling sketch below)
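
Because the model expects 16 kHz input, audio in other formats or at other sample rates should be resampled before inference. One option (an assumed tool choice, not a requirement of the model) is librosa, which can read the formats listed above, depending on your installed codecs, and resample on load:

import librosa

# Read an audio file and resample to the 16 kHz rate the model expects;
# "recording.mp3" is a placeholder path
audio_array, sampling_rate = librosa.load("recording.mp3", sr=16000)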

Quick Start

The snippet below loads the model with Hugging Face Transformers. Loading audio via librosa and the file name "audio.wav" are illustrative choices; swap in your own audio pipeline as needed.

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
import torch

# Load model and processor
processor = WhisperProcessor.from_pretrained("MWirelabs/garo-asr")
model = WhisperForConditionalGeneration.from_pretrained("MWirelabs/garo-asr")
model.eval()

# Load audio at the 16 kHz rate the model expects
# ("audio.wav" is a placeholder path; see the resampling sketch above)
audio_array, _ = librosa.load("audio.wav", sr=16000)

# Generate transcription
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    predicted_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
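Alternatively, the Transformers pipeline API wraps loading, feature extraction, and decoding in a single call. The file path below is a placeholder, and decoding non-WAV formats this way requires ffmpeg on your system:

from transformers import pipeline

# One-call inference; "sample.wav" is a placeholder path
asr = pipeline("automatic-speech-recognition", model="MWirelabs/garo-asr")
result = asr("sample.wav")
print(result["text"])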

Let's Build Together

Are you a researcher, developer, or part of a language community in Northeast India? We are always looking for partners to collaborate on new datasets, fine-tune models, and advance the state of the art in regional AI.