Native-script and romanized Language Identification for 22 Indic languages
Description
We create publicly available language identification (LID) datasets and models for all 22 languages listed in the Constitution of India, covering both native-script and romanized text. First, we create Bhasha-Abhijnaanam, a language identification test set for native-script as well as romanized text that spans all 22 Indic languages. We also train IndicLID, a language identifier for the same languages in both native and romanized script.
Downloads
Details
Languages
| | | | | | |
|---|---|---|---|---|---|
| Assamese (asm) | Hindi (hin) | Maithili (mai) | Nepali (nep) | Sanskrit (san) | Tamil (tam) |
| Bengali (ben) | Kannada (kan) | Malayalam (mal) | Oriya (ori) | Santali (sat) | Telugu (tel) |
| Bodo (brx) | Kashmiri (kas) | Manipuri (mni) | Punjabi (pan) | Sindhi (snd) | Urdu (urd) |
| Gujarati (guj) | Konkani (kok) | Marathi (mar) | | | |
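
When working with the three-letter codes in the table above, a small lookup can be handy for turning a code back into a readable language name. The following is a minimal sketch, not part of the released tooling: the names `LANGUAGE_NAMES` and `language_name` are hypothetical, the mapping contains only the codes that appear in the table, and the label format actually produced by IndicLID may differ.

```python
# Codes and names copied from the table above.
# LANGUAGE_NAMES and language_name are hypothetical helpers for illustration;
# they are not part of IndicLID or Bhasha-Abhijnaanam.
LANGUAGE_NAMES = {
    "asm": "Assamese", "ben": "Bengali", "brx": "Bodo", "guj": "Gujarati",
    "hin": "Hindi", "kan": "Kannada", "kas": "Kashmiri", "kok": "Konkani",
    "mai": "Maithili", "mal": "Malayalam", "mni": "Manipuri", "mar": "Marathi",
    "nep": "Nepali", "ori": "Oriya", "pan": "Punjabi",
    "san": "Sanskrit", "sat": "Santali", "snd": "Sindhi",
    "tam": "Tamil", "tel": "Telugu", "urd": "Urdu",
}


def language_name(code: str) -> str:
    """Return the language name for a code from the table.

    Surrounding whitespace and letter case are ignored.
    Raises KeyError if the code is not in the table.
    """
    return LANGUAGE_NAMES[code.strip().lower()]


if __name__ == "__main__":
    for code in ("hin", "tam", "brx"):
        print(code, language_name(code))
```

Keeping the mapping as a plain dictionary makes it easy to extend or to adapt if the labels in use carry extra information such as a script suffix.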