Part-of-speech tagger for Bodo language using deep learning approach

Natural Language Processing (formerly Natural Language Engineering)

June, 2024

Dhrubajyoti Pathak, Sanjib Narzary, Sukumar Nandi and Bidisha Som

Abstract

Language processing systems such as part-of-speech (POS) tagging, named entity recognition, machine translation, speech recognition, and language modeling have been well studied in high-resource languages. Nevertheless, research on these systems for several low-resource languages, including Bodo, Mizo, Nagamese, and others, is either yet to commence or is in its nascent stages. Language models (LMs) play a vital role in the downstream tasks of modern natural language processing. Extensive studies have been carried out on LMs for high-resource languages. However, these low-resource languages are still underrepresented. In this study, we first present BodoBERT, an LM for the Bodo language. To the best of our knowledge, this work is the first such effort to develop an LM for Bodo. Second, we present an ensemble deep learning-based POS tagging model for Bodo. The POS tagging model is based on combinations of BiLSTM with conditional random field and stacked embedding of BodoBERT with BytePairEmbeddings. We cover several LMs in the experiment to see how well they work in POS tagging tasks. The best-performing model achieves an F1 score of 0.8041. A comparative experiment was also conducted on Assamese POS taggers, considering that the language is spoken in the same region as Bodo.
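
The "stacked embedding" mentioned in the abstract is commonly understood as concatenating, for each token, the vectors produced by several embedding sources before passing them to the sequence tagger. The sketch below illustrates that idea only. It is not the authors' implementation: the helper names (`make_lookup_embedder`, `stack_embeddings`), the toy vectors, and the dimensions are all hypothetical, and a simple per-token lookup stands in for BodoBERT, whose real vectors depend on the surrounding sentence.

```python
from typing import Callable, Dict, List

Vector = List[float]
Embedder = Callable[[str], Vector]


def make_lookup_embedder(table: Dict[str, Vector], dim: int) -> Embedder:
    """Return a toy embedder: table lookup, zero vector for unknown tokens."""
    def embed(token: str) -> Vector:
        return list(table.get(token, [0.0] * dim))
    return embed


def stack_embeddings(tokens: List[str], embedders: List[Embedder]) -> List[Vector]:
    """Concatenate the vectors from every embedder, token by token."""
    stacked: List[Vector] = []
    for token in tokens:
        vector: Vector = []
        for embed in embedders:
            vector.extend(embed(token))
        stacked.append(vector)
    return stacked


# Hypothetical stand-ins: these are NOT real BodoBERT or byte-pair vectors.
contextual_like = make_lookup_embedder({"a": [0.1, 0.2, 0.3]}, dim=3)
subword_like = make_lookup_embedder({"a": [0.9, 0.8]}, dim=2)

tokens = ["a", "b"]  # placeholder tokens, not Bodo text
features = stack_embeddings(tokens, [contextual_like, subword_like])
```

Each entry of `features` has length 3 + 2 = 5, the sum of the source dimensions. In a tagger of the kind the abstract describes, a sequence of such vectors would be the input to the BiLSTM, with the conditional random field layer choosing the final tag sequence.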

Citation

@article{Pathak_Narzary_Nandi_Som_2024, 
  title={Part-of-speech tagger for Bodo language using deep learning approach}, 
  DOI={10.1017/nlp.2024.15}, 
  journal={Natural Language Processing}, 
  author={Pathak, Dhrubajyoti and Narzary, Sanjib and Nandi, Sukumar and Som, Bidisha}, 
  year={2024}, 
  pages={1--15}
} 

Paper Link Cambridge Core