Resource name | link
---|---
Dataset | Dhwani - Scripts

Scripts for downloading, chunking, and SNR-filtering publicly available audio data, for building ASR systems for the next billion users.
Required libraries
youtube_dl, yt_dlp, pandas, ffmpeg, tqdm
Usage:
bash process_data.sh <path/to/download> <num_of_threads>
The above command downloads all the YouTube URLs for the given language, extracts the audio as WAV, downsamples it to 16 kHz, and names each file after its unique YouTube ID. The data is then passed through the VAD -> SNR -> Chunking pipeline automatically.
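The extract-and-downsample step amounts to an ffmpeg invocation per downloaded file. The sketch below is an illustrative reconstruction, not the exact flags used by process_data.sh; `build_ffmpeg_downsample_cmd` and the sample filenames are hypothetical:

```python
import subprocess

def build_ffmpeg_downsample_cmd(src, dst, sr=16000):
    """Build an ffmpeg command that extracts mono 16 kHz WAV audio.

    Illustrative only; the flags in process_data.sh may differ.
    """
    return [
        "ffmpeg", "-i", src,  # input file fetched by yt-dlp
        "-ac", "1",           # mix down to a single channel
        "-ar", str(sr),       # resample to the target rate (16 kHz)
        "-f", "wav", dst,     # write a WAV named after the YouTube ID
    ]

cmd = build_ffmpeg_downsample_cmd("some-youtube-id.m4a", "some-youtube-id.wav")
# subprocess.run(cmd, check=True)  # uncomment to actually run ffmpeg
```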
Required libraries
ffmpeg, tqdm
bash normalize_sr.sh <path/to/root/of/NoA>
to normalize the sampling rate and number of channels
python vad.py <path/to/root/of/NoA> <path/to/refined/data/storage> language-specific-foldername
python snr_filter.py <path/to/refined/data/storage> language-specific-foldername <path/to/store/rejected/files>
python chunking.py <path/to/refined/data/storage/languagespecificfolder>
bash dw_util.sh <path/to/txt/of/a/particular/language> <path/to/root/where/data/will/be/stored> <#ofthreads>
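The <#ofthreads> argument fans the downloads out over multiple workers. A minimal Python sketch of that fan-out (dw_util.sh itself is a bash script; `download_one` here is a hypothetical stand-in for the per-URL download step):

```python
from concurrent.futures import ThreadPoolExecutor

def download_all(urls, num_threads, download_one):
    """Run download_one over every URL using num_threads workers.

    Mirrors the thread fan-out of dw_util.sh; results come back in
    input order because ThreadPoolExecutor.map preserves ordering.
    """
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(download_one, urls))

# Demo with a dummy worker that just extracts the video id.
urls = ["https://youtu.be/a", "https://youtu.be/b", "https://youtu.be/c"]
done = download_all(urls, num_threads=2, download_one=lambda u: u.split("/")[-1])
```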
Required libraries
webrtcvad, tqdm
Usage:
python vad.py <data_read_dir> <data_write_dir> <folder_name>
folder_name is kept as a separate argument to allow parallelization: multiple folders can be processed simultaneously.
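vad.py relies on webrtcvad's statistical model; the numpy sketch below swaps that for a plain RMS-energy threshold just to show the frame-wise speech/non-speech decision the pipeline makes. The function name and threshold are illustrative assumptions:

```python
import numpy as np

def simple_vad(samples, sr=16000, frame_ms=30, threshold=0.01):
    """Flag each fixed-length frame as speech (True) or silence (False)
    by comparing its RMS energy against a threshold.

    Energy-threshold stand-in for webrtcvad, for illustration only.
    """
    frame_len = sr * frame_ms // 1000
    flags = []
    for i in range(len(samples) // frame_len):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
        flags.append(rms > threshold)
    return flags

# Half a second of silence followed by a loud 440 Hz tone:
# early frames should be flagged silent, later frames speech.
sr = 16000
t = np.arange(sr) / sr
audio = np.concatenate([np.zeros(sr // 2), 0.5 * np.sin(2 * np.pi * 440 * t)])
flags = simple_vad(audio, sr)
```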
Required libraries
numpy, soundfile
Usage:
python snr.py <data_path> <folder/language_name>
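snr.py implements the repo's own filtering criterion on top of numpy and soundfile; the sketch below illustrates the general idea of a frame-energy SNR estimate (treat the quietest frames as the noise floor, the loudest as signal). The percentile choices are assumptions, not the repo's actual method:

```python
import numpy as np

def estimate_snr_db(samples, sr=16000, frame_ms=30):
    """Rough SNR estimate from frame energies: the quietest 10% of
    frames approximate noise, the loudest 10% approximate speech.

    Illustrative only; snr.py's criterion may differ.
    """
    frame_len = sr * frame_ms // 1000
    n = len(samples) // frame_len
    frames = samples[:n * frame_len].reshape(n, frame_len)
    energies = np.mean(frames.astype(np.float64) ** 2, axis=1)
    noise = np.percentile(energies, 10) + 1e-12   # noise-floor estimate
    signal = np.percentile(energies, 90) + 1e-12  # speech-energy estimate
    return 10.0 * np.log10(signal / noise)

# Demo: low-level noise, then a tone over the same noise -> high SNR.
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
audio = np.concatenate([
    rng.normal(0, 0.01, sr),
    0.5 * np.sin(2 * np.pi * 220 * t) + rng.normal(0, 0.01, sr),
])
snr_db = estimate_snr_db(audio, sr)
```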
Required libraries
pydub, joblib, tqdm
Usage:
python chunking.py <chunking_path>
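chunking.py uses pydub to split audio on silence; the numpy version below shows the underlying idea (group consecutive non-silent frames into one chunk, and let silent frames close a chunk). Function name, frame size, and threshold are illustrative assumptions:

```python
import numpy as np

def chunk_on_silence(samples, sr=16000, frame_ms=30, threshold=0.01):
    """Split audio into chunks separated by silent frames.

    Numpy stand-in for pydub-style silence splitting, for illustration.
    """
    frame_len = sr * frame_ms // 1000
    chunks, current = [], []
    for i in range(len(samples) // frame_len):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = np.sqrt(np.mean(frame.astype(np.float64) ** 2))
        if rms > threshold:
            current.append(frame)  # speech frame: extend the current chunk
        elif current:
            chunks.append(np.concatenate(current))  # silence closes a chunk
            current = []
    if current:
        chunks.append(np.concatenate(current))
    return chunks

# Demo: tone, silence, tone -> two chunks of equal length.
sr = 16000
t = np.arange(int(0.3 * sr)) / sr
tone = 0.3 * np.sin(2 * np.pi * 330 * t)
audio = np.concatenate([tone, np.zeros(int(0.3 * sr)), tone])
chunks = chunk_on_silence(audio, sr)
```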