YouTube Data for IndicWav2Vec

Description

Scripts for downloading, chunking, and SNR filtering publicly available audio data, used for building ASR systems for the next billion users.

Downloads
Resource name    Link
Dataset          Dhwani - Scripts
Details

Pretraining Data Processing

For Downloading and Processing YT Data

Required libraries: youtube_dl, yt_dlp, pandas, ffmpeg, tqdm

Usage: bash process_data.sh </path/to/download> <num_of_threads>

The above command downloads all the YouTube URLs for the given language, extracts the audio as wav, downsamples it to 16 kHz, and names each file after its unique YouTube ID. The data is then passed through the VAD -> SNR -> Chunking pipeline automatically.
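The download stage described above can be sketched with the yt_dlp Python API. This is a minimal sketch under stated assumptions, not the contents of process_data.sh: the function names build_ydl_opts and download_audio are hypothetical, and downmixing to mono is an assumption (the description above only specifies wav output at 16 kHz).

```python
from pathlib import Path


def build_ydl_opts(out_dir, sample_rate=16000):
    """Build yt_dlp options: best audio -> wav, resampled, named by video id."""
    return {
        "format": "bestaudio/best",
        # %(id)s is the unique YouTube id, so each file is named after it.
        "outtmpl": str(Path(out_dir) / "%(id)s.%(ext)s"),
        "postprocessors": [
            {"key": "FFmpegExtractAudio", "preferredcodec": "wav"}
        ],
        # Extra ffmpeg arguments: resample and (assumption) downmix to mono.
        "postprocessor_args": ["-ar", str(sample_rate), "-ac", "1"],
        "ignoreerrors": True,  # skip unavailable videos instead of aborting
        "quiet": True,
    }


def download_audio(urls, out_dir):
    """Download every URL in `urls` as wav files under `out_dir`."""
    import yt_dlp  # imported lazily so the options can be inspected without it

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with yt_dlp.YoutubeDL(build_ydl_opts(out_dir)) as ydl:
        ydl.download(list(urls))
```

Calling download_audio(["<youtube-url>"], "</path/to/download>") would leave one <youtube-id>.wav file per video in the target directory, ready for the VAD step.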

For Downloading and Processing NoA Data

Required libraries: ffmpeg, tqdm

  1. Download the NoA data from the publicly available links
  2. Put the data in language-specific folders
  3. Run bash normalize_sr.sh <path/to/root/of/NoA> to normalize the sampling rate and number of channels
  4. Run python vad.py <path/to/root/of/NoA> <path/to/refined/data/storage> language-specific-foldername
  5. Run python snr_filter.py <path/to/refined/data/storage> language-specific-foldername <path/to/store/rejected/files>
  6. Run python chunking.py <path/to/refined/data/storage/languagespecificfolder>
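
Steps 3 to 6 above can be chained for one language folder with a small driver. The sketch below only assembles and runs the commands exactly as listed; the function names noa_commands and run_noa_pipeline are hypothetical, and it assumes the scripts sit in the current working directory.

```python
import subprocess
from pathlib import Path


def noa_commands(raw_root, refined_root, language, rejected_dir):
    """Return the argv lists for steps 3-6, in order, for one language folder."""
    return [
        ["bash", "normalize_sr.sh", str(raw_root)],
        ["python", "vad.py", str(raw_root), str(refined_root), language],
        ["python", "snr_filter.py", str(refined_root), language, str(rejected_dir)],
        ["python", "chunking.py", str(Path(refined_root) / language)],
    ]


def run_noa_pipeline(raw_root, refined_root, language, rejected_dir):
    """Run the steps one after another, stopping at the first failure."""
    for cmd in noa_commands(raw_root, refined_root, language, rejected_dir):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
```

Because normalize_sr.sh takes the root of the NoA data rather than a single language folder, a driver that loops over several languages would likely run that first command once and only repeat the remaining three per language.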

For Processing Individual Directories

  1. Download the data using

    bash dw_util.sh <path/to/txt/of/a/particular/language> <path/to/root/where/data/will/be/stored> <#ofthreads>

  2. Pass the data through the VAD step as described under Additional Tools
  3. Pass the data through the SNR step as described under Additional Tools
  4. Pass the data through the Chunking step as described under Additional Tools

Additional Tools

For Voice Activity Detection Only

Required libraries: webrtcvad, tqdm

Usage: python vad.py <data_read_dir> <data_write_dir> <folder_name>

folder_name is kept as a separate argument to allow parallelization: multiple folders can be processed simultaneously.
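
For readers unfamiliar with webrtcvad, frame-level voice activity detection can be sketched as follows. This is an illustration of the library's general usage, not the logic inside vad.py: the function names, the 30 ms frame length, and the aggressiveness value are assumptions. webrtcvad expects 16-bit mono PCM in frames of 10, 20, or 30 ms.

```python
def split_frames(pcm, sample_rate=16000, frame_ms=30):
    """Split 16-bit mono PCM bytes into fixed-size frames, dropping the remainder."""
    frame_bytes = sample_rate * frame_ms // 1000 * 2  # 2 bytes per sample
    usable = len(pcm) - len(pcm) % frame_bytes
    return [pcm[i:i + frame_bytes] for i in range(0, usable, frame_bytes)]


def voiced_frames(pcm, sample_rate=16000, frame_ms=30, aggressiveness=2):
    """Return the frames that webrtcvad classifies as speech."""
    import webrtcvad  # imported lazily; only needed for classification

    vad = webrtcvad.Vad(aggressiveness)  # 0 (least) to 3 (most aggressive)
    return [frame for frame in split_frames(pcm, sample_rate, frame_ms)
            if vad.is_speech(frame, sample_rate)]
```

Joining the returned frames with b"".join(...) gives the speech-only audio for a file; how vad.py actually groups or pads voiced regions is not specified here.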

For SNR Filtering

Required libraries: numpy, soundfile

Usage: python snr.py <data_path> <folder/language_name>
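
The SNR estimation method used by the script is not described here. As a rough illustration of the idea, the sketch below uses a simple energy-based estimate that treats the quietest frames as the noise floor. The function names, the frame length, the noise fraction, and the 15 dB threshold are all assumptions made for this example, not values taken from the script.

```python
import numpy as np


def estimate_snr_db(samples, frame_len=400, noise_fraction=0.1):
    """Estimate SNR in dB from frame energies (quietest frames = noise floor)."""
    samples = np.asarray(samples, dtype=np.float64)
    n_frames = len(samples) // frame_len
    if n_frames < 2:
        raise ValueError("audio too short to estimate SNR")
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy = np.sort((frames ** 2).mean(axis=1))  # ascending frame energies
    n_noise = max(1, int(n_frames * noise_fraction))
    noise = energy[:n_noise].mean()
    signal = energy[n_noise:].mean()
    eps = 1e-12  # avoids division by zero on digital silence
    return float(10.0 * np.log10((signal + eps) / (noise + eps)))


def keep_file(path, threshold_db=15.0):
    """Read an audio file and check it against the (hypothetical) threshold."""
    import soundfile as sf  # imported lazily; only needed for file I/O

    samples, _ = sf.read(path)
    if samples.ndim > 1:  # downmix multi-channel audio to mono
        samples = samples.mean(axis=1)
    return estimate_snr_db(samples) >= threshold_db
```

Files for which keep_file returns False would be the ones moved to the rejected-files directory in a filtering pass.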

For Chunking

Required libraries: pydub, joblib, tqdm

Usage: python chunking.py <chunking_path>
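
Fixed-length chunking with pydub can be sketched as below. The 15-second maximum chunk length, the output naming scheme, and the function names are assumptions for illustration; chunking.py may cut at different lengths or at silence boundaries.

```python
from pathlib import Path


def chunk_bounds(duration_ms, max_chunk_ms=15000):
    """Return (start, end) millisecond pairs covering [0, duration_ms)."""
    return [(start, min(start + max_chunk_ms, duration_ms))
            for start in range(0, duration_ms, max_chunk_ms)]


def chunk_file(wav_path, out_dir, max_chunk_ms=15000):
    """Split one wav file into chunks of at most `max_chunk_ms` milliseconds."""
    from pydub import AudioSegment  # imported lazily; only needed for audio I/O

    audio = AudioSegment.from_wav(wav_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = Path(wav_path).stem
    # len(audio) is the duration in milliseconds, and pydub slices by milliseconds.
    for i, (start, end) in enumerate(chunk_bounds(len(audio), max_chunk_ms)):
        audio[start:end].export(str(out_dir / f"{stem}_{i}.wav"), format="wav")
```

Since each file is chunked independently, chunk_file is a natural unit of work to hand to joblib's Parallel and delayed helpers when processing a whole directory.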