An Indic-accented English speech dataset

Description

India is the second-largest English-speaking country in the world, with a speaker base of roughly 130 million. Unfortunately, Indian speakers are poorly represented in existing English ASR benchmarks such as LibriSpeech, Switchboard, and Speech Accent Archive. We address this gap by creating Svarah, a benchmark that contains 9.6 hours of transcribed English audio from 117 speakers from 65 districts in 19 states of India, resulting in a diverse range of accents. The collective set of native languages spoken by the speakers covers 19 of the 22 constitutionally recognized languages of India, belonging to 4 different language families. Svarah includes both read speech and spontaneous conversational data, covering a variety of domains such as history, culture, tourism, government, and sports. It also contains data corresponding to popular use cases such as ordering groceries, making digital payments, and using government services (e.g., checking pension claims or passport status). The resulting diversity in vocabulary and use cases allows a more robust evaluation of ASR systems for real-world applications.

Downloads
Datasets  Benchmark
Svarah    link
Details

Tutorial

The following manifest format applies to svarah_manifest.json and saa_l1_manifest.json, with one JSON object per line:

{"audio_filepath": <path to audio file 1>, "duration": <seconds>, "text": <transcript 1>}
{"audio_filepath": <path to audio file 2>, "duration": <seconds>, "text": <transcript 2>}

For Azure and Google Cloud evaluations, you need to add the API key for the respective service. For the other models, run the following:

python eval_<hf_model>.py  --manifest <manifest path>

Update the audio file paths in the scripts to match your directory structure.
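
As an alternative to editing paths by hand, the path stored in each manifest entry could be remapped to a local directory before evaluation. This is only a sketch of one possible approach, not the authors' documented workflow. The helper `remap_audio_path` and the example directory names are hypothetical, and POSIX-style paths are assumed.

```python
from pathlib import PurePosixPath


def remap_audio_path(audio_filepath, old_root, new_root):
    """Replace the leading old_root of a path with new_root.

    Paths that do not start with old_root are returned unchanged.
    """
    path = PurePosixPath(audio_filepath)
    try:
        relative = path.relative_to(old_root)
    except ValueError:  # path is not located under old_root
        return audio_filepath
    return str(PurePosixPath(new_root) / relative)
```

Applied to every "audio_filepath" value in a manifest, this moves all entries from one root directory to another while keeping the sub-directory layout intact.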

The meta_speaker_stats.csv file consists of 11 columns that describe metadata statistics for the speakers involved in Svarah.


Table 1: WER comparison

Table 1 reports the word error rates (WERs) of different models on (i) Svarah, which contains data from Indian speakers, and (ii) SAA_L1 and LibriSpeech Clean (Libri), which contain data from native English speakers.

Model      Variant  # Params.  Svarah  SAA_L1  LibriSpeech
Whisper    base     74M        13.6    2.9     4.2
Whisper    medium   769M       8.3     1.7     3.1
Whisper    large    1550M      7.2     1.6     2.7
Wav2Vec2   large    317M       24.9    3.1     1.8
HuBERT     large    316M       25.6    3.2     2.0
WavLM      large    300M       33.7    9.2     3.4
Data2Vec   large    313M       24.5    2.5     1.8
Conformer  large    120M       14.6    1.1     2.1
Azure      US       -          20.9    24.2    -
Azure      IN       -          21.3    30.1    -
Google     US       -          30.0    16.8    -
Google     IN       -          20.7    63.7    -
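
For readers unfamiliar with the metric, WER is commonly defined as the word-level edit distance (substitutions, deletions and insertions) between the hypothesis and the reference transcript, divided by the number of reference words. The sketch below implements that standard definition. It is not taken from the Svarah evaluation scripts, which may apply text normalization or use a dedicated library, and the function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

The function returns a fraction, so multiply by 100 to express it as a percentage. A corpus-level score is usually computed by summing the edit distances over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance values.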

Table 2: Accent-wise split of Svarah

Table 2: Number of hours and number of tokens for each accent

Accent     # Hours  # Tokens
Assamese   0.26     869
Bengali    0.33     1024
Bodo       0.63     1520
Dogri      0.44     1262
Gujarati   0.37     1051
Hindi      0.40     1068
Kannada    0.71     1892
Kashmiri   0.40     1310
Konkani    0.54     1325
Maithili   0.76     1662
Malayalam  0.68     1711
Marathi    0.30     948
Nepali     1.16     2236
Odia       0.61     1548
Punjabi    0.27     820
Sindhi     0.18     536
Tamil      0.44     1352
Telugu     0.50     1311
Urdu       0.64     1814

Citation

If you benefit from this dataset, kindly cite as follows:

@misc{javed2023svarah,
      title={Svarah: Evaluating English ASR Systems on Indian Accents},
      author={Tahir Javed and Sakshi Joshi and Vignesh Nagarajan and Sai Sundaresan and Janki Nawale and Abhigyan Raman and Kaushal Bhogale and Pratyush Kumar and Mitesh M. Khapra},
      year={2023},
      eprint={2305.15760},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}