Community-Led African Language AI

KivuLingua AI is an initiative building the first open infrastructure for speech data and AI models for Eastern Congo's Bantu languages, beginning with Mashi (1.9M speakers). We build ethical ASR and TTS systems anchored in real-world use cases: literacy education via Teaching at the Right Level (TaRL) and community health communication. These use cases sustain continuous linguistic datafication and digital sovereignty.

200+

Native Mashi speakers and contributors already mobilized for community validation, linguistic preservation and participatory data governance.

8

Underrepresented Bantu languages across Eastern Congo (North Kivu, South Kivu) prioritized for sustained ASR, TTS and multilingual AI development.

250h+

High-quality aligned speech corpus in Mashi collected across diverse genres: read speech, spontaneous conversations, oral narratives and functional content.

Apache 2.0

All datasets, models, and tools published under open licenses on Hugging Face, GitHub and Zenodo as African digital public goods.

Our Goals

Strategic Objectives

KivuLingua AI is committed to building the first open, ethical, and sustainable infrastructure for voice data and AI models for Mashi, reproducible across all Bantu languages in the Kivu region.

Collect Open Speech Data

Gather and publish at least 250 hours of validated Mashi speech, balanced by gender, age, and dialect, from 200+ native speakers

Develop Open-Source ASR

Create an edge-optimized ASR model for Mashi (CPU-only, offline, <1B parameters) using Whisper, wav2vec 2.0, and MMS fine-tuning

Develop TTS System

Build an initial TTS model for Mashi using Coqui TTS for accessible speech synthesis and educational content generation

Deploy Educational Pilot

Implement a Teaching at the Right Level (TaRL) literacy program in 5+ rural Bushi schools, serving 500+ students in mother-tongue education

Deploy Health Pilot

Equip 30+ community health workers with a mobile app for offline voice data collection in Mashi-speaking regions

Build Reproducible Pipeline

Create documented, open-source infrastructure enabling cost-effective extension to 8 other Kivu Bantu languages in Phase 2

Train Community Contributors

Build capacity of 50+ local data collectors, transcribers, and validators in ethical data practices and quality standards

Publish All Resources

Release corpus, models, tools, and benchmarks under Apache 2.0 on Hugging Face, Zenodo, and GitHub as African digital public goods

Impact & Applications

Real-World Impact Through Education & Health

KivuLingua AI resources directly serve two high-impact use cases that anchor the entire data collection strategy.

Literacy Education (TaRL)

The Teaching at the Right Level approach is implemented in Mashi, the students' mother tongue. Children in rural Bushi schools learn to read in their native language with AI-powered audio content delivered through the Ntina platform (African STEM Resources Hub).

Expected Impact

  • 500+ students across 5 schools
  • Mother-tongue literacy foundation
  • Measurable learning gains with TaRL methodology

Community Health Communication

Community health workers use a voice-based mobile app to deliver prevention messages in Mashi (vaccination, maternal health, nutrition) and to record patient data offline in rural Eastern DRC. Every interaction automatically generates contextualized linguistic data.

Expected Impact

  • 30+ health workers equipped
  • 1,500+ households reached indirectly
  • Sustainable data generation beyond project funding

Strategic Innovation: Continuous Datafication

Unlike episodic data campaigns, KivuLingua AI builds infrastructure before annotation. We deploy mobile apps that transform daily workflows (teaching, health service delivery) into continuous streams of contextualized linguistic data. This 'datafication' approach, grounded in the framework of Fendji (2026), ensures data generation is sustainable, meaningful, and embedded in community needs rather than artificially extracted.

Technical Infrastructure

Open-Source Models & Tools

We develop production-ready AI systems using proven open-source frameworks, optimized for offline deployment in resource-constrained settings.

ASR Model (Speech Recognition)

Whisper, wav2vec 2.0, and MMS fine-tuned for Mashi. Optimized for edge deployment: <1B parameters, CPU-only, <2x real-time on low-end Android devices.

Target WER < 25%
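
As an illustration of how a WER target like this could be checked, here is a minimal sketch in plain Python. It assumes whitespace tokenization and a corpus-level WER (total word edits divided by total reference words). The function names and the placeholder tokens are hypothetical and are not part of the KivuLingua toolkit. A real evaluation would also need agreed text normalization for Mashi orthography.

```python
from typing import Iterable, List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of word substitutions, deletions and insertions."""
    # prev[j] holds the distance between the reference words seen so far
    # (minus the current one) and the first j hypothesis words.
    prev: List[int] = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER over (reference, hypothesis) transcript pairs."""
    total_edits = 0
    total_ref_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        total_edits += edit_distance(ref_words, hypothesis.split())
        total_ref_words += len(ref_words)
    if total_ref_words == 0:
        raise ValueError("reference transcripts contain no words")
    return total_edits / total_ref_words


def meets_wer_target(pairs: Iterable[Tuple[str, str]], target: float = 0.25) -> bool:
    """True when corpus WER is strictly below the target (25% by default)."""
    return corpus_wer(pairs) < target


# Placeholder tokens stand in for real Mashi transcripts.
example_pairs = [
    ("w1 w2 w3 w4", "w1 w2 wX w4"),  # one substitution
    ("w5 w6 w7 w8", "w5 w6 w7 w8"),  # exact match
]
print(corpus_wer(example_pairs))        # 0.125
print(meets_wer_target(example_pairs))  # True
```

Corpus-level WER weights long utterances more heavily than an average of per-utterance scores would, which is one reason to state which variant a reported figure uses.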

TTS Model (Speech Synthesis)

Coqui TTS trained on Mashi corpus for text-to-speech in educational and health applications with natural, community-validated voice quality.

Target MOS > 3.5
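
A MOS target of this kind is usually checked by averaging listener ratings on a 1 to 5 scale. The sketch below is one possible way to do that and is not the project's evaluation protocol. It assumes ratings are pooled across listeners and utterances, and it reports a normal-approximation 95% confidence interval. The function names are hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def mos_summary(ratings: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) with a normal-approximation 95% interval."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate an interval")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("MOS ratings are expected on a 1 to 5 scale")
    mean = statistics.fmean(ratings)
    # Sample standard deviation (n - 1 in the denominator).
    spread = statistics.stdev(ratings)
    half_width = 1.96 * spread / math.sqrt(len(ratings))
    return mean, mean - half_width, mean + half_width


def meets_mos_target(ratings: Sequence[float], target: float = 3.5) -> bool:
    """True when the mean rating is strictly above the target."""
    mean, _, _ = mos_summary(ratings)
    return mean > target


# Made-up ratings for illustration only.
mean, lower, upper = mos_summary([3, 4, 5, 4])
print(f"MOS = {mean:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

With few listeners the interval is wide, so a stricter check could require the lower bound, rather than the mean, to exceed the target.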

Mobile Collection App

Co-developed with Kwetu Best Technologies. Android app for offline voice capture, transcription integration, and automatic linguistic data generation.

30 health workers + 5 schools

Reproducible Pipeline

Complete open-source toolkit: annotation scripts, quality assurance, evaluation benchmarks, and documentation, ready for 8 additional Kivu languages.

Apache 2.0 License

Why Edge-First Design?

Rural Eastern Congo has limited connectivity. Our models are optimized for offline deployment on low-cost Android phones and Raspberry Pi devices. No cloud dependency. No internet required. This ensures teachers and health workers can use AI tools in the field, making KivuLingua resources genuinely accessible to the communities that need them most.

Governance & Values

Ethics, Governance & Community Sovereignty

Community control over linguistic data comes before open access. We prioritize informed consent, ethical governance, and meaningful benefit-sharing.

Sovereignty Before Openness

The Mashi community council (Matabaro, Nshombo, Prof. Nkunzimwami) validates all data access terms before publication. Communities retain control.

Infrastructure Before Annotation

Build sustainable systems where data generation is continuous and community-centered, not episodic campaigns extracting data from participants.

Informed Consent

Bilingual consent forms (Mashi/French), explicit right of withdrawal, anonymization of personal data. Every participant owns their contribution.

Gender & Diversity Balance

Minimum 50% women contributors across collection, spanning ages 15-65 and 4+ geographic zones. Amplifying marginalized voices.
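
The balance criteria above could be monitored automatically during collection. The following is a minimal sketch under stated assumptions: contributor metadata is available as simple records, the gender label "F" marks women, and "spanning ages 15-65" is read as every contributor falling inside that range. The record fields, labels, and zone names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Union


@dataclass(frozen=True)
class Contributor:
    gender: str  # hypothetical label: "F" or "M"
    age: int
    zone: str    # hypothetical geographic zone identifier


def balance_report(
    contributors: Sequence[Contributor],
    min_women_share: float = 0.5,
    min_age: int = 15,
    max_age: int = 65,
    min_zones: int = 4,
) -> Dict[str, Union[float, int, bool]]:
    """Check a contributor list against the stated diversity thresholds."""
    if not contributors:
        raise ValueError("no contributors to report on")
    women = sum(1 for c in contributors if c.gender == "F")
    women_share = women / len(contributors)
    zones = {c.zone for c in contributors}
    return {
        "women_share": women_share,
        "women_ok": women_share >= min_women_share,
        "ages_ok": all(min_age <= c.age <= max_age for c in contributors),
        "zone_count": len(zones),
        "zones_ok": len(zones) >= min_zones,
    }


# Made-up records for illustration only.
sample = [
    Contributor("F", 22, "zone_a"),
    Contributor("M", 47, "zone_b"),
    Contributor("F", 63, "zone_c"),
    Contributor("M", 16, "zone_d"),
]
print(balance_report(sample))
```

Running such a report after each collection round would show early which group or zone is falling behind, while there is still time to adjust recruitment.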

Continuity Before Exhaustivity

Prioritize robust ongoing data flow over attempting complete linguistic coverage in isolation. Sustainability over perfection.

Conflict-Sensitive Approach

Security of participants prioritized. Data never used for surveillance or identification. No collection in active conflict zones.

Masakhane Values & Ubuntu Philosophy

KivuLingua AI is aligned with Masakhane's principles of Ubuntu, Equity, Transparency, and Community Innovation. All outputs (corpus, models, tools) are published under open licenses (Apache 2.0, CC-BY 4.0) and contributed to the African AI ecosystem as digital public goods. Language is a shared heritage, and we build infrastructure to serve the entire community.

Meet Our Team

Leadership Driving Language Preservation

Our interdisciplinary team combines expertise in artificial intelligence, machine learning, and indigenous language documentation to build sustainable infrastructure for African language technologies.

Muhigiri Ashuza

Technical Lead & AI Specialist

Software and AI engineer specializing in voice technologies (Audio AI), generative models, and end-to-end AI systems. Currently an AI consultant on the AIAM (AI & African Music) project with The MIND Institute. Founder of African STEM Resources Hub, which develops virtual labs and offline educational solutions. Expertise covers TTS, low-resource language NLP, LLMs, educational AI systems, and AI infrastructure for African contexts.

CIRUZA Alain

AI Architect & ML Infrastructure

Master's in Machine Intelligence from AIMS/AMMI and 5+ years of experience designing, developing, and deploying large-scale AI systems in agriculture, finance, and HR. Deep learning and NLP expertise with production frameworks: PyTorch, JAX, FastAPI, Django, LangChain, React. Cloud infrastructure experience with AWS and GCP. Specializes in multilingual NLP platforms, low-latency systems, and large-scale data pipelines.

Marius Nshombo

Mashi Language Custodian

Curator of Muruhula.com, a comprehensive digital Mashi-French linguistic library. 8+ years dedicated to Mashi language documentation, preservation, and digital accessibility. Deep expertise in Mashi grammar, vocabulary, cultural context, and linguistic practice. Drives community validation and orthographic standards, and ensures authenticity in language resource development.