IIT-Madras’ AI4Bharat Launches IndicVoices Dataset Covering 22 Languages with Bhashini Funding
- By StartupStory | March 7, 2024
In a significant stride towards linguistic inclusivity, IIT Madras’ research lab AI4Bharat unveiled the IndicVoices dataset on March 6. This open-source dataset of natural and spontaneous speech spans 22 Indian languages and is intended to support development of the first Automatic Speech Recognition (ASR) model covering all languages listed in the Eighth Schedule of the Indian Constitution.
Funded by Bhashini, a government-backed project supported by the Ministry of Electronics and Information Technology, EkStep Foundation, and Nilekani Philanthropies, IndicVoices is a comprehensive collection of 7,348 hours of read (9%), extempore (74%), and conversational (17%) audio. The dataset covers 16,237 speakers across 145 Indian districts.
The ambitious project, estimated to cost approximately Rs 30 crore, has already transcribed 1,639 hours, with a median of 73 hours per language. Alongside the dataset, AI4Bharat has shared an open-source blueprint offering standardized protocols, centralized tools, quality control mechanisms, and transcription guidelines, intended to facilitate similar efforts globally.
Bhashini, which is investing $5-6 million in AI4Bharat, aims to use the open-source data to build a National Public Digital Platform. This platform seeks to harness AI and emerging technologies to develop services and products in various Indian languages, especially in domains like governance and policy.
Amitabh Nag, CEO of Bhashini, emphasized, “This [datasets] will lead us to 22 language models and further lead us to use cases which we are building up.” The government-backed initiative has funded over 70 research institutes, including IIT-Bombay, IIT-Mandi, and the Indian Institute of Science Bengaluru.
Tanuj Bhojwani, Head of PeoplePlusAI, highlighted the transformative impact of removing the high cost of dataset collection as a barrier. He expressed optimism that startups and academia would use these voice datasets to create innovative solutions, and suggested that the government could build on this foundation to provide essential services in remote areas of the country.