Ontology-Enhanced Machine Learning Models for Breast Cancer Diagnosis

Thi Thu Thuy Pham; Chi Thanh Bui

doi:10.4108/eetpht.11.10650

Authors

Thi Thu Thuy Pham Nha Trang University https://orcid.org/0000-0001-8896-3140
Chi Thanh Bui Nha Trang University

DOI:

https://doi.org/10.4108/eetpht.11.10650

Keywords:

Breast Cancer, Machine Learning, Ontology, Semantic Reasoning, Predictive Modeling

Abstract

INTRODUCTION: Breast cancer remains one of the most prevalent causes of cancer-related mortality among women globally. While machine learning (ML) has demonstrated promise in early detection, conventional models often rely solely on statistical features, lacking domain-specific knowledge and interpretability.

OBJECTIVES: This study aims to enhance breast cancer prediction by integrating ontology-driven semantic features with ML models to improve both predictive accuracy and clinical interpretability.

METHODS: We applied a pipeline comprising data preprocessing, statistical testing, and dimensionality reduction with PCA, followed by supervised training of Logistic Regression, k-NN, SVM, Random Forest, XGBoost, LightGBM, and an attention-enhanced MLP. In the proposed approach, clinical data are transformed into RDF triples and structured within a domain-specific breast cancer ontology. Semantic reasoning via SPARQL queries extracts high-level features, which feed a leakage-safe stacking design that integrates (i) tabular features, (ii) knowledge graph embedding (KGE) features, (iii) semantic subtyping signals, and (iv) SPARQL rule features. Reproducible templates and code are released.
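As an illustration of what "leakage-safe" can mean in a stacking design, the following minimal sketch builds out-of-fold scores, so that each sample's meta-feature comes from a base model that never saw that sample during training. It is not the authors' released code. The synthetic data, the two feature views (`X_tab`, `X_sem`), the nearest-centroid base learner, and all function names are assumptions made for this example, and only the Python standard library is used.

```python
import random
from statistics import mean


def stratified_folds(labels, k, seed=0):
    """Assign each sample index to one of k folds, balancing the classes."""
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            fold_of[i] = pos % k
    return fold_of


def centroid_score(train_X, train_y, x):
    """Toy base learner: a positive score means x is nearer the class-1 centroid."""
    def centroid(cls):
        rows = [row for row, label in zip(train_X, train_y) if label == cls]
        return [mean(col) for col in zip(*rows)]

    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    return sqdist(x, centroid(0)) - sqdist(x, centroid(1))


def out_of_fold_scores(X, y, fold_of, k):
    """Score every sample with a model fitted only on the other folds."""
    oof = [None] * len(y)
    for fold in range(k):
        train = [i for i in range(len(y)) if fold_of[i] != fold]
        held_out = [i for i in range(len(y)) if fold_of[i] == fold]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        for i in held_out:
            oof[i] = centroid_score(train_X, train_y, X[i])
    return oof


# Synthetic stand-in data (hypothetical, not from any of the paper's datasets).
rng = random.Random(42)
y = [0] * 20 + [1] * 20
# View 1: two numeric "tabular" features.
X_tab = [[rng.gauss(2.0 * label, 1.0), rng.gauss(-1.0 * label, 1.0)] for label in y]
# View 2: one binary flag standing in for a rule-derived semantic feature.
X_sem = [[1 if rng.random() < (0.8 if label else 0.2) else 0] for label in y]

K = 5
fold_of = stratified_folds(y, K)
oof_tab = out_of_fold_scores(X_tab, y, fold_of, K)
oof_sem = out_of_fold_scores(X_sem, y, fold_of, K)

# One row per sample, one column per view. A meta-learner would be fitted on
# these rows; none of the scores was produced by a model trained on that sample.
meta_features = [list(pair) for pair in zip(oof_tab, oof_sem)]
```

Reusing the same fold assignment for every view keeps the held-out samples aligned across base models, which is what allows the per-view scores to be stacked column by column.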

RESULTS: Across four benchmark datasets, the ontology-enhanced meta-learner performed consistently well, reaching a ROC-AUC of 0.996 ± 0.006 on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset under stratified evaluation.
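A figure of the form "mean ± spread" for ROC-AUC is typically obtained by computing the metric on each held-out fold and summarizing across folds. The sketch below shows one way to do that, using the pairwise-ranking definition of ROC-AUC (the probability that a positive case is scored above a negative one, with ties counted as one half). The per-fold labels and scores are made-up toy values. The use of the sample standard deviation is an assumption, since the abstract does not state which spread measure was reported.

```python
from statistics import mean, stdev


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical held-out predictions for three folds: (labels, scores).
fold_results = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([0, 1, 0, 1], [0.2, 0.9, 0.3, 0.7]),
    ([1, 0, 1, 0], [0.6, 0.5, 0.9, 0.1]),
]

aucs = [roc_auc(labels, scores) for labels, scores in fold_results]
summary = f"{mean(aucs):.3f} ± {stdev(aucs):.3f}"
print(summary)
```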

CONCLUSION: Incorporating ontology-derived semantic knowledge significantly improves the performance, robustness, and interpretability of ML models for breast cancer prediction. This approach holds strong potential for real-world integration into clinical decision support systems.


