Open Indic-language Transliteration datasets and models for the Next Billion Users

Description

Aksharantar is the largest publicly available transliteration dataset for 20 Indic languages. The corpus has 26M Indic language-English transliteration pairs.

Dataset Card for Aksharantar

Dataset Description

Dataset Summary

Aksharantar is the largest publicly available transliteration dataset for 20 Indic languages. The corpus has 26M Indic language-English transliteration pairs.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

Assamese (asm)  Hindi (hin)     Maithili (mai)   Marathi (mar)  Punjabi (pan)   Tamil (tam)
Bengali (ben)   Kannada (kan)   Malayalam (mal)  Nepali (nep)   Sanskrit (san)  Telugu (tel)
Bodo (brx)      Kashmiri (kas)  Manipuri (mni)   Oriya (ori)    Sindhi (snd)    Urdu (urd)
Gujarati (guj)  Konkani (kok)   Dogri (doi)

Dataset Structure

Data Instances

A random sample from the Hindi (hin) training set:

{
'unique_identifier': 'hin1241393',
'native word': 'स्वाभिमानिक',
'english word': 'swabhimanik',
'source': 'IndicCorp',
'score': -0.1028788579
}
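
The record above can be handled as a plain mapping. The following is a minimal sketch, not part of the dataset's tooling: the field types are inferred from this single sample, and the helper names `language_code` and `to_pair` are hypothetical. It also assumes the identifier begins with the three-letter language code, as 'hin1241393' does for Hindi.

```python
from typing import Tuple, TypedDict

# Field names mirror the sample record; the types are inferred from that
# one sample and are an assumption, not a documented schema.
AksharantarRecord = TypedDict(
    "AksharantarRecord",
    {
        "unique_identifier": str,
        "native word": str,
        "english word": str,
        "source": str,
        "score": float,
    },
)


def language_code(record: AksharantarRecord) -> str:
    """Return the leading three characters of the identifier.

    Assumption: identifiers start with the 3-letter language code,
    as in 'hin1241393'.
    """
    return record["unique_identifier"][:3]


def to_pair(record: AksharantarRecord) -> Tuple[str, str]:
    """Return the (native word, romanized word) transliteration pair."""
    return record["native word"], record["english word"]


sample: AksharantarRecord = {
    "unique_identifier": "hin1241393",
    "native word": "स्वाभिमानिक",
    "english word": "swabhimanik",
    "source": "IndicCorp",
    "score": -0.1028788579,
}

if __name__ == "__main__":
    print(language_code(sample), to_pair(sample))
```

Note that two of the keys contain a space ('native word', 'english word'), so they have to be accessed with subscript syntax rather than as attributes.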

Data Fields

Data Splits

Subset    Training  Validation  Test
asm-en    179K      4K          5531
ben-en    1231K     11K         5009
brx-en    36K       3K          4136
guj-en    1143K     12K         7768
hin-en    1299K     6K          5693
kan-en    2907K     7K          6396
kas-en    47K       4K          7707
kok-en    613K      4K          5093
mai-en    283K      4K          5512
mal-en    4101K     8K          6911
mni-en    10K       3K          4925
mar-en    1453K     8K          6573
nep-en    2397K     3K          4133
ori-en    346K      3K          4256
pan-en    515K      9K          4316
san-en    1813K     3K          5334
sid-en    60K       8K          -
tam-en    3231K     9K          4682
tel-en    2430K     8K          4567
urd-en    699K      12K         4463
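
For planning purposes, the training sizes in the table above can be kept as a small mapping and queried. This is only an illustrative sketch: the values are the rounded "K" figures copied from the table, so any total is approximate, and the names `TRAIN_K`, `total_training_pairs_k` and `extremes` are hypothetical.

```python
# Training-set sizes in thousands of pairs, copied from the Data Splits table.
# Subset keys are spelled as they appear there.
TRAIN_K = {
    "asm-en": 179, "ben-en": 1231, "brx-en": 36, "guj-en": 1143,
    "hin-en": 1299, "kan-en": 2907, "kas-en": 47, "kok-en": 613,
    "mai-en": 283, "mal-en": 4101, "mni-en": 10, "mar-en": 1453,
    "nep-en": 2397, "ori-en": 346, "pan-en": 515, "san-en": 1813,
    "sid-en": 60, "tam-en": 3231, "tel-en": 2430, "urd-en": 699,
}


def total_training_pairs_k(sizes=TRAIN_K):
    """Approximate total training pairs, in thousands (inputs are rounded)."""
    return sum(sizes.values())


def extremes(sizes=TRAIN_K):
    """Return (largest, smallest) subset names by training-set size."""
    return max(sizes, key=sizes.get), min(sizes, key=sizes.get)


if __name__ == "__main__":
    largest, smallest = extremes()
    print(f"subsets: {len(TRAIN_K)}")
    print(f"approx. training pairs: {total_training_pairs_k()}K")
    print(f"largest: {largest}, smallest: {smallest}")
```

The spread between the largest and smallest subsets is wide, which is worth keeping in mind when sampling across languages.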

Dataset Creation

Details are given in the paper "Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users".

Curation Rationale

[More Information Needed]

Source Data

Initial Data Collection and Normalization

Details are given in the paper "Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users".

Who are the source language producers?

[More Information Needed]

Annotations

Details are given in the paper "Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users".

Annotation process

Details are given in the paper "Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users".

Who are the annotators?

Details are given in the paper "Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users".

Personal and Sensitive Information

[More Information Needed]

Considerations for Using the Data

Social Impact of Dataset

[More Information Needed]

Discussion of Biases

[More Information Needed]

Other Known Limitations

[More Information Needed]

Additional Information

Dataset Curators

[More Information Needed]

Licensing Information

This data is released under the following licensing scheme:

CC-BY License

CC0 License Statement

Citation Information

@misc{madhani2022aksharantar,
      title={Aksharantar: Towards Building Open Transliteration Tools for the Next Billion Users},
      author={Yash Madhani and Sushane Parthan and Priyanka Bedekar and Ruchi Khapra and Anoop Kunchukuttan and Pratyush Kumar and Mitesh Shantadevi Khapra},
      year={2022},
      eprint={},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Contributions