BERTopic-Based Topic Modeling and Thematic Discovery in Long-Form Narrative Text

I.B.N HimaBindu; Sarojamma B.; Haragopal V.V.

doi:10.4108/eetismla.12836

Authors

I.B.N HimaBindu, CVR College of Engineering
Sarojamma B., Sri Venkateswara University
Haragopal V.V., Osmania University

DOI:

https://doi.org/10.4108/eetismla.12836

Keywords:

Textual data analytics, BERTopic, Topic modeling, Topic probabilities, Dimensionality reduction, HDBSCAN, UMAP

Abstract

With the growing volume of digital text data, the demand for Natural Language Processing (NLP) techniques is rising significantly. Topic modeling is an NLP technique that automatically identifies the topics present in a large corpus of text and uncovers the hidden patterns in that document collection, thereby facilitating improved decision-making. The present work explores the major topics of the renowned book “Autobiography of a Yogi”, written by Paramahansa Yogananda, an eloquent orator and a profound spiritual master. To this end, the widely used neural topic model ‘BERTopic’ was applied to the book. The result is a number of intriguing topics that are especially useful for researchers and scholars delving into the complexities of the book, as well as for those interested in spirituality, Indian philosophy, and the life journey and teachings of Paramahansa Yogananda.
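As a rough illustration of how BERTopic might be applied to a single long-form book, the sketch below splits the text into paragraph-level passages and fits a model with explicit UMAP and HDBSCAN components, requesting per-topic probabilities. This page does not state the authors' preprocessing or settings, so the helper names (`split_into_passages`, `fit_topic_model`), the paragraph-based splitting rule, and every hyperparameter value are assumptions made for illustration only.

```python
import re


def split_into_passages(text, min_words=5):
    """Split long-form text into passages separated by blank lines.

    Hypothetical helper: very short fragments (headings, stray lines)
    are dropped because they carry little topical signal.
    """
    passages = []
    for block in re.split(r"\n\s*\n", text):
        passage = " ".join(block.split())  # collapse internal whitespace
        if len(passage.split()) >= min_words:
            passages.append(passage)
    return passages


def fit_topic_model(passages, min_cluster_size=10):
    """Fit BERTopic on the passages and return (model, topics, probabilities).

    Third-party imports are local so the splitting helper works without them.
    All parameter values below are illustrative assumptions, not the
    settings used in the paper.
    """
    from bertopic import BERTopic
    from hdbscan import HDBSCAN
    from umap import UMAP

    # Dimensionality reduction of the document embeddings.
    umap_model = UMAP(
        n_neighbors=15, n_components=5, min_dist=0.0,
        metric="cosine", random_state=42,
    )
    # Density-based clustering of the reduced embeddings.
    hdbscan_model = HDBSCAN(
        min_cluster_size=min_cluster_size, metric="euclidean",
        cluster_selection_method="eom", prediction_data=True,
    )
    topic_model = BERTopic(
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        calculate_probabilities=True,  # document-by-topic probabilities
    )
    topics, probabilities = topic_model.fit_transform(passages)
    return topic_model, topics, probabilities
```

With the `bertopic`, `umap-learn`, and `hdbscan` packages installed, one could read the book's text from a file, call `split_into_passages` on it, pass the result to `fit_topic_model`, and inspect the extracted topics with `topic_model.get_topic_info()`. Passage length and `min_cluster_size` would likely need tuning for a single book, since too few passages can leave HDBSCAN with little to cluster.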
