Bodo Resources for NLP - An Overview of Existing Primary Resources for Bodo

Proceedings of Intelligent Computing and Technologies Conference

July 2021

Mwnthai Narzary, Gwmsrang Muchahary, Maharaj Brahma, Sanjib Narzary, Pranav Kumar Singh, Apurbalal Senapati

Abstract

With over 1.4 million Bodo speakers, there is a need for automated language processing systems such as machine translation, part-of-speech tagging, speech recognition, and named entity recognition. Developing such systems requires a sufficient amount of data. In this paper we present a detailed description of the primary resources available for the Bodo language that can be used as datasets to study Natural Language Processing and its applications. We list the resources available for the Bodo language: a lexicon of 8,005 entries collected from the agriculture and health domains, a raw corpus of 2,915,544 words, a tagged corpus of 30,000 sentences, a parallel corpus of 28,359 sentences from the tourism, agriculture, and health domains, and a tagged and parallel corpus of 37,768 sentences. We further discuss the challenges and opportunities present in the Bodo language.

Citation

@inproceedings{narzary-etal-2021-bodo-resource,
    title = "Bodo Resources for NLP - An Overview of Existing Primary Resources for Bodo",
    author = "Mwnthai Narzary and Gwmsrang Muchahary and Maharaj Brahma and Sanjib Narzary and Pranav Kumar Singh and Apurbalal Senapati",
    booktitle = "Proceedings of Intelligent Computing and Technologies Conference",
    month = jul,
    year = "2021",
    publisher = "AIJR Proceedings",
    url = "https://doi.org/10.21467/proceedings.115.12",
    issn = "2582-3922",
    abstract = "With over 1.4 million Bodo speakers, there is a need for automated language processing systems such as machine translation, part-of-speech tagging, speech recognition, and named entity recognition. Developing such systems requires a sufficient amount of data. In this paper we present a detailed description of the primary resources available for the Bodo language that can be used as datasets to study Natural Language Processing and its applications. We list the resources available for the Bodo language: a lexicon of 8,005 entries collected from the agriculture and health domains, a raw corpus of 2,915,544 words, a tagged corpus of 30,000 sentences, a parallel corpus of 28,359 sentences from the tourism, agriculture, and health domains, and a tagged and parallel corpus of 37,768 sentences. We further discuss the challenges and opportunities present in the Bodo language.",
}
