Language Ecosystem

Building datasets for underrepresented Bantu languages

KivuLingua builds community-driven speech and language datasets for indigenous languages across Eastern Congo, starting with Mashi as its flagship pilot.

Mashi (Shi)

ISO 639-3: shr

Active

Speakers: 1.9M

Region: Bukavu, Walungu, Kabare, Kalehe

Culturally foundational Bantu language of the Bushi region, with a relatively stable orthography and some existing text resources. Priority language for ASR and TTS system development.

Nande (Kinande)

ISO 639-3: nnb

Active

Speakers: 2.6M

Region: North Kivu, Ituri

One of the largest regional Bantu languages, spoken by Yira/Nande communities. High sociolinguistic vitality, but little digital corpus material is available for machine learning.

Hunde (Kihunde)

ISO 639-3: hke

Planned

Speakers: 800K-950K

Region: Masisi, Rutshuru, Walikale

Historically significant language with very few digital resources. Priority for documentation and corpus creation to support neural model development.

Fuliru (Kifuliru)

ISO 639-3: flr

Planned

Speakers: 250K-400K

Region: Uvira, Fizi

Language with existing community-led preservation initiatives. Structured speech data remains too scarce for robust neural model performance.

Tembo (Kitembo)

ISO 639-3: tbt

Planned

Speakers: 500K+

Region: Kalehe, Masisi, Rutshuru

Upland Bantu language facing displacement pressure, with minimal contemporary digital documentation. Needs a prioritized preservation strategy.

Havu (Kihavu)

ISO 639-3: hav

Planned

Speakers: 1.1M

Region: Idjwi Island, Kalehe

Strategically important language with a geographically concentrated speaker base and a vibrant oral heritage. High priority for cultural preservation initiatives.

Nyanga (Kinyanga)

ISO 639-3: nyj

Planned

Speakers: 150K+

Region: Walikale

Minority forest language that is increasingly vulnerable to Swahili dominance. Critical focus for linguistic preservation and digital inclusion.

Rega (Kirega/Lega)

ISO 639-3: lea (Lega-Shabunda), lgm (Lega-Mwenga)

Planned

Speakers: 250K-450K

Region: Mwenga, Shabunda, Pangi

Forest-region language at heightened risk of erosion. Specialized preservation strategies are essential for its digital documentation and archiving.

Swahili Congolais (Kingwana)

ISO 639-3: swc

Planned

Speakers: 11M

Region: Regional Lingua Franca

Major regional lingua franca, essential for building robust multilingual models and cross-lingual transfer strategies.