ViMedNER: A Medical Named Entity Recognition Dataset for Vietnamese

Pham Van Duong; Tien-Dat Trinh; Minh-Tien Nguyen; Huy-The Vu; Minh Chuan Pham; Tran Manh Tuan; Le Hoang Son

doi:10.4108/eetinis.v11i3.5221

Authors

Pham Van Duong Hanoi University of Science and Technology
Tien-Dat Trinh FPT University
Minh-Tien Nguyen Hung Yen University of Technology and Education
Huy-The Vu Hung Yen University of Technology and Education
Minh Chuan Pham Hung Yen University of Technology and Education
Tran Manh Tuan Thuyloi University
Le Hoang Son Vietnam National University, Hanoi

Keywords:

Named entity recognition, Vietnamese corpus, Medical text, Pre-trained language model

Abstract

Named entity recognition (NER) is one of the most important tasks in natural language processing; it identifies entity boundaries and classifies the entities into pre-defined categories. NER systems have been developed for many languages, but little work has been done for Vietnamese. This is mainly due to the scarcity of high-quality annotated data, especially for specific domains such as medicine and healthcare. In this paper, we introduce a new medical NER dataset, named ViMedNER, for recognizing Vietnamese medical entities. Unlike existing work designed for common or overly specific entities, we focus on entity types that can be used in common diagnostic and treatment scenarios: disease names, symptoms, causes, diagnostics, and treatments. These entities help doctors diagnose and treat common diseases. Our dataset is collected from four well-known Vietnamese websites that specialize in drug sales and disease diagnosis, and is annotated by domain experts with high agreement scores. To create benchmark results, strong NER baselines based on pre-trained language models, including PhoBERT, XLM-R, ViDeBERTa, ViPubMedDeBERTa, and ViHealthBERT, are implemented and evaluated on the dataset. Experimental results show that XLM-R consistently outperforms the other pre-trained language models. Furthermore, additional experiments are conducted to explore the behavior of the baselines and the characteristics of our dataset.
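To make "identifying entity boundaries and classifying them" concrete, the following is a minimal sketch of how token-level tags can be decoded into typed entity spans. It assumes a BIO tagging scheme and uses hypothetical label names (DISEASE, SYMPTOM, CAUSE, DIAGNOSTIC, TREATMENT) for the five entity types; the abstract does not state the dataset's actual tag set or scheme, and the example sentence is invented for illustration rather than taken from ViMedNER.

```python
from typing import List, Tuple

# Hypothetical label names for the five entity types listed in the abstract;
# the dataset's actual tag set and tagging scheme may differ.
ENTITY_TYPES = ["DISEASE", "SYMPTOM", "CAUSE", "DIAGNOSTIC", "TREATMENT"]


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert a BIO tag sequence into (start, end_exclusive, type) spans."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # An open entity continues only on an I- tag of the same type.
        continues = tag.startswith("I-") and current == tag[2:]
        if not continues:
            if current is not None:
                spans.append((start, i, current))
            start, current = None, None
            if tag.startswith("B-") or tag.startswith("I-"):
                # A stray I- tag is treated as the start of a new entity.
                start, current = i, tag[2:]
    if current is not None:
        spans.append((start, len(tags), current))
    return spans


if __name__ == "__main__":
    # Invented sentence, roughly "dengue fever causes high fever and headache".
    tokens = ["sốt", "xuất", "huyết", "gây", "sốt", "cao", "và", "đau", "đầu"]
    tags = ["B-DISEASE", "I-DISEASE", "I-DISEASE", "O",
            "B-SYMPTOM", "I-SYMPTOM", "O", "B-SYMPTOM", "I-SYMPTOM"]
    for start, end, label in bio_to_spans(tags):
        print(label, " ".join(tokens[start:end]))
```

A baseline built on a pre-trained language model would predict one such tag per token; this decoding step then recovers the boundaries and types that span-level evaluation compares against the expert annotations.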

Author Biography

Pham Van Duong, Hanoi University of Science and Technology

FPT University, Hanoi
