Unsupervised Machine Learning based Documents Clustering in Urdu

Atta Ur Rahman; Khairullah Khan; Wahab Khan; Aurangzeb Khan; Bibi Saqia

doi:10.4108/eai.19-12-2018.156081

Authors

Atta Ur Rahman, University of Science and Technology Bannu
Khairullah Khan, University of Science and Technology Bannu
Wahab Khan, Isles International University
Aurangzeb Khan, University of Science and Technology Bannu
Bibi Saqia, University of Science and Technology Bannu

DOI:

https://doi.org/10.4108/eai.19-12-2018.156081

Keywords:

Urdu, Documents clustering, Similarity Measures, K-Means Algorithm

Abstract

The volume of data on the web is growing rapidly due to the proliferation of news sources, blogs, journals, and other content. Like other languages, Urdu has seen tremendous growth on the internet. As the volume of data expands, information retrieval (IR) becomes more complicated. Document clustering is an unsupervised machine learning (ML) approach used to group a large number of dispersed documents into a small number of meaningful and coherent clusters, thus providing a basis for indexing, IR, and browsing mechanisms. Document clustering has a long tradition in English and similar Western languages, but Urdu lags behind in terms of sophisticated natural language processing (NLP) tools and resources for document clustering. Document clustering is a challenging task in Urdu because of the language's rich morphology, particular structure, syntactic peculiarities, and cursive script. In this study, we developed a framework for document clustering and analysed various similarity measures for Urdu documents. We also examined the effect of stop-word removal on Urdu document clustering.
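As a rough illustration of why stop-word removal can matter when comparing Urdu documents, the following minimal sketch computes cosine similarity over raw term counts, with and without a stop-word filter. It is not the framework from the paper: the stop-word list, the whitespace tokenizer, the two sample sentences, and all function names are assumptions made for this example only.

```python
import math
from collections import Counter

# Hypothetical, very small stop-word list used only for this illustration;
# it is not the list used in the paper.
URDU_STOP_WORDS = {"کا", "کی", "کے", "میں", "ہے", "اور", "سے", "پر"}


def tokenize(text, remove_stop_words=True):
    """Split on whitespace (a simplifying assumption) and optionally drop stop words."""
    tokens = text.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in URDU_STOP_WORDS]
    return tokens


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two documents represented as raw term counts."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two invented sentences on unrelated topics (sports and politics).
doc_sports = "کرکٹ کا میچ لاہور میں ہے"
doc_politics = "حکومت کا اجلاس اسلام آباد میں ہے"

with_stop = cosine_similarity(
    tokenize(doc_sports, remove_stop_words=False),
    tokenize(doc_politics, remove_stop_words=False),
)
without_stop = cosine_similarity(
    tokenize(doc_sports, remove_stop_words=True),
    tokenize(doc_politics, remove_stop_words=True),
)

print(f"similarity with stop words:    {with_stop:.3f}")
print(f"similarity without stop words: {without_stop:.3f}")
```

In this toy case the two sentences share only function words, so filtering those words lowers their similarity. A clustering algorithm such as K-Means would then be less likely to group them on the basis of function words alone.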

Published

19-12-2018

Issue

Vol. 5 No. 19 (2018): EAI Endorsed Transactions on Scalable Information Systems

Section

Research articles

License

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

This is an open access article distributed under the terms of the CC BY-NC-SA 4.0, which permits copying, redistributing, remixing, transformation, and building upon the material in any medium so long as the original work is properly cited.

How to Cite

Unsupervised Machine Learning based Documents Clustering in Urdu. EAI Endorsed Scal Inf Syst [Internet]. 2018 Dec. 19 [cited 2025 Nov. 1];5(19):e5. Available from: https://publications.eai.eu/index.php/sis/article/view/2188
