Multimodal Data-Driven Intelligent Systems for Breast Cancer Prediction

Cancer, a malignant disease, results from abnormalities in body cells that lead to uncontrolled growth and division, surpassing healthy growth and stability. In breast cancer, this uncontrolled growth and division occurs in breast cells. Early identification of breast cancer is key to lowering mortality rates. Several new developments in artificial intelligence predictive models show promise for assisting decision-making. The primary goal of the proposed study is to build an efficient Breast Cancer Intelligent System using a multimodal dataset. The aim is to establish Computer-Aided Diagnosis for breast cancer by integrating various data. This study uses The Cancer Genome Atlas Breast Invasive Carcinoma Collection (TCGA-BRCA) dataset, which is part of an ongoing effort to create a community integrating cancer phenotypic and genotypic data. The TCGA-BRCA dataset includes clinical data, RNASeq gene data, mutation data, and methylation data. Both clinical and genomic data are used in this study for breast cancer diagnosis. Integrating multiple data modalities enhances the robustness and precision of diagnostic and prognostic models in comparison with conventional techniques. The approach offers several advantages over unimodal models due to its ability to integrate diverse data sources. Additionally, these models can be employed to forecast the likelihood of a patient developing breast cancer in the near future, providing a valuable tool for early intervention and treatment planning.


Introduction
Cancer is one of the leading causes of death worldwide, primarily due to late diagnosis and inadequate treatment options. It is characterized by the abnormal and uncontrolled development of cells in the body, which can spread from one region to another [1] (Figure 1). Breast cancer (BC) is the most severe and fatal ailment afflicting women. It has recently surpassed other cancers in incidence as a significant cause of malignancy, particularly in women. Alarmingly, younger age groups are experiencing a higher prevalence than the worldwide average [3]. For 2023, it is anticipated that there will be 55,720 new cases of ductal carcinoma in situ (DCIS), 297,790 new instances of invasive BC, and 43,170 BC fatalities among women in the United States. Nearly 10% of BCs are hereditary or caused by inherited DNA mutations, with most hereditary cases linked to defective BRCA1 and BRCA2 genes [4,5].
Options for BC treatment have increased in both complexity and effectiveness. Improvements in machine learning (ML) and deep learning (DL) have facilitated the development of automated computer-aided diagnosis (CAD) systems that deliver precise results, increasing the efficiency of malignant tumor identification and saving time through optimal utilization [6,7].
Numerous studies have used unimodal and multimodal data sources to predict BC prognosis from clinical data, imaging biomarkers, and genetic markers. However, traditional BC prediction approaches primarily rely on unimodal data, which fails to capture the full spectrum of BC characteristics. Though conventional unimodal methods have proven useful in predicting BC, they are insufficient on their own for accurate diagnosis. To minimize medical errors, developing a multimodal approach is essential to accurately and precisely predict BC using multiple data modalities. This approach facilitates a more precise and reliable diagnosis. Multimodal deep learning provides a comprehensive understanding of data, improving accuracy and efficiency [9]. This powerful technique allows for the extraction of meaningful information from large datasets by combining multiple modalities.
The main focus of this study is to formulate a computer-aided diagnosis (CAD) system for BC by integrating various data modalities. Combining different data sources enables more reliable and accurate models for diagnosing and predicting outcomes than traditional techniques. The development of a generic, high-performing BC prediction system using different modalities is expected to give viable solutions for BC prognosis with high accuracy. The research aims to investigate the potential of Artificial Intelligence tools, which are increasingly achieving significant advancements in various research fields. The results are intended to show that the proposed system is a feasible alternative to existing computational systems.
Section 2 provides a literature review summarizing unimodal and multimodal dataset predictions. Section 3 details the dataset and proposed methodology of the research work. Section 4 presents results and discussion, and Section 5 concludes the paper with observations and future directions.

Literature Review
This review serves as a foundation for the study of existing solutions for BC prediction. Van't Veer et al. [10] analyzed 117 primary breast carcinomas using DNA microarrays and supervised classification algorithms to identify a 70-gene prognostic signature. This signature was used to establish prognostic markers for detecting carcinoma. The study found that the poor-prognosis signature was linked to metastatic, invasion, and angiogenic pathways, resulting in improved predictive performance for disease outcomes.
Yap et al. [11] explored the use of Deep Learning (DL) methods to detect breast lesions in ultrasound images, experimenting with U-Net, LeNet, and a pretrained AlexNet. Their experiments, conducted on two custom datasets with 306 and 163 images, demonstrated that pretrained AlexNet-based models outperformed all other models, achieving F-measures of 0.91 and 0.89, respectively.
An integrated deep learning architecture capable of classifying, segmenting, and detecting breast tumors was proposed by Antari et al. [12]. The authors employed a Full Resolution Convolutional Network (FRCN) for tumor segmentation, a deep Convolutional Neural Network (CNN) for classification, and a YOLO-based system for tumor detection. The dataset was synthetically expanded 8-fold through application of the YOLO algorithm. The researchers tested the model on digital mammograms from the INbreast dataset, obtaining a detection accuracy of 98.96% and a Dice score of 92.69%.
Sun D et al. [13] initiated BC prediction by combining genome data with pathology images. A multiple kernel learning method was used and compared with various independent models that used genome data only. Their findings, evaluated with 10-fold cross-validation, suggested that adding the images contributed to the robustness of the prediction. Gevaert et al. [14] integrated clinical and 70-gene data using three strategies: full integration, decision integration, and partial integration on Bayesian networks. The results showed that methods using both clinical and microarray data performed better than or comparably to those using only one of the two.
Sun D et al. [15] enhanced BC prognosis prediction using a multimodal deep neural network (NN). They combined multi-dimensional data, including gene expression, copy number alteration profiles, and clinical data, using novel deep learning techniques. This approach outperformed single-dimensional prediction methods. By combining two independent models of microarray and clinical data, Khademi et al. [16] developed a Probabilistic Graphical Model (PGM) for BC prediction and detection. They began by reducing the dimensionality of microarray data with Principal Component Analysis (PCA), and then built a deep belief network to extract data feature representations. The clinical data was then processed through a structural learning algorithm before merging with SoftMax nodes to calculate BC prognosis.
Qian et al. 2021 [17] employed comparable modalities for diagnosing BC, including multimodal, multiview ultrasound imaging. A deep-learning framework was employed to construct the model, with US B-mode multiview ultrasound images and view-level multiview US images as inputs for each potential clinical test lesion. The model then analyzed the suspected lesion from multiple perspectives and provided an overall likelihood of malignancy. The model's performance was tested with each bimodal and multimodal combination to predict the malignancy risk and establish the Breast Imaging Reporting and Data System (BI-RADS) category.
In Binder et al. [18], a novel comprehensive machine-learning approach was proposed for the identification and prediction of morphologic and molecular features from histologic BC imaging datasets. The predictions are derived from a morphologic feature training database, which contains manually annotated types of breast cells in a variety of data modes, as well as histological images from the TCGA database. Liu et al. [19] designed a deep learning prediction model to predict molecular subtypes of BC. The framework is based on gene and image data, employed with Image Filtration. Validation was performed by combining deep NN with convolutional networks to achieve a high level of accuracy. Arya et al. [20] proposed a sophisticated, multimodal deep learning model. This model is further enhanced by the generation of convolutional feature maps and the extraction of stacked features using a Sigmoid CNN algorithm and a Random Forest classifier.
In BC prediction, the focus has shifted from unimodal to multimodal approaches. This study assesses the efficacy of these two approaches and identifies areas for further research and advancement. Deep learning models utilizing multimodal datasets are recommended due to their ability to provide richer information than their unimodal counterparts. Additionally, multimodal models are capable of processing multiple data sources, which is an advantage over their unimodal peers.

Dataset
The data was collected from The Cancer Genome Atlas (TCGA), the world's most extensive repository of genomic data. The Center for Cancer Genomics of the National Cancer Institute (NCI), together with the National Human Genome Research Institute, initiated TCGA in 2006. From the TCGA repository, this research specifically used the TCGA Breast Invasive Carcinoma Collection (TCGA-BRCA) dataset [21], a repository of over 10,000 patient profiles and related genomic data from BC patients. This dataset contains information related to the pathological and molecular features of BC tumors, as well as demographic factors such as age, race, and gender.

Data Preprocessing
Data preprocessing is an important element of machine learning projects, allowing the cleanup and modification of data to enhance its usability and resolve issues. Before executing any algorithms, it is necessary to verify the representation and quality of the data. This dataset contained no missing values, and values with no counts were treated as 0. The primary objective is to preprocess the data into the format required by the models to ensure a reasonably accurate representation of the data [22].

Clinical Data Preprocessing
The TCGA-BRCA clinical data mainly include data from 1,097 patients, covering variables such as age at diagnosis, vital status, days to death, days to last follow-up, tumor status, pathologic stage, gender, and race. Vital status refers to whether the patient is alive, deceased, or unknown. Days to last follow-up is calculated as the number of days between the initial diagnosis and the last follow-up. Table 1 provides an overview of the dataset. Data imputation (DI) is the process of filling in missing values within a dataset. This step can improve data quality and accuracy by replacing missing values with estimates based on existing information [23]. DI can help prevent bias and skew when analyzing data and reduce errors due to incomplete datasets. It can also reduce noise in a dataset and make it easier to identify meaningful patterns in the data. The DI techniques for the various features were carefully chosen depending on the nature of the data and its value [24]. In this work, DI is applied to attributes that have 'not available' values, based on the count and importance of the feature(s).
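The imputation step can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say which estimator was used for each feature, so the median (numeric) and most frequent value (categorical) are assumed here, and the column names and records are hypothetical.

```python
from statistics import median, mode

MISSING = "not available"  # missing-value marker, as described in the text


def impute(records, numeric_cols, categorical_cols):
    """Return a copy of `records` with missing entries filled per column.

    Numeric columns use the median of the observed values and categorical
    columns use the most frequent observed value (both are assumptions).
    """
    filled = [dict(r) for r in records]  # leave the input untouched
    estimators = [(col, median) for col in numeric_cols]
    estimators += [(col, mode) for col in categorical_cols]
    for col, estimator in estimators:
        observed = [r[col] for r in records if r[col] != MISSING]
        fill_value = estimator(observed)
        for r in filled:
            if r[col] == MISSING:
                r[col] = fill_value
    return filled


# Hypothetical patient records
patients = [
    {"age_at_diagnosis": 55, "tumor_status": "tumor free"},
    {"age_at_diagnosis": MISSING, "tumor_status": "with tumor"},
    {"age_at_diagnosis": 61, "tumor_status": MISSING},
    {"age_at_diagnosis": 47, "tumor_status": "tumor free"},
]
clean = impute(patients, ["age_at_diagnosis"], ["tumor_status"])
```

Choosing the estimator per column mirrors the statement that the technique depends on the nature of each feature.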
One-hot encoding is one of the data encoding techniques used during preprocessing. Each category is represented as a separate column with binary values (1 or 0). Additionally, it helps reduce overfitting by providing features that can be used to train the model [25]. In this study, one-hot encoding was performed on the three attributes gender, tumor status, and vital status. After one-hot encoding, the attribute data consist mostly of 0s and 1s, which may lead to inefficient pattern learning during the training phase. Other attributes such as race, margin status, histological type, and pathologic stage are converted into categorical data according to the categories found in the dataset. Datasets are typically represented by a distribution of values that helps in understanding the data. The data distribution can reveal the underlying structure, such as its range and outliers. Figure 2 illustrates the frequency distribution of the TCGA-BRCA dataset. Analyzing the frequencies can provide valuable insights into the prevalence and characteristics of the clinical parameters.
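A minimal sketch of one-hot encoding in plain Python is given here for illustration. The attribute values and the `column_value` naming scheme are assumptions; the paper does not describe its implementation.

```python
def one_hot(records, column):
    """Replace `column` with one 0/1 indicator column per observed category."""
    categories = sorted({r[column] for r in records})
    encoded = []
    for r in records:
        row = {k: v for k, v in r.items() if k != column}
        for category in categories:
            row[f"{column}_{category}"] = 1 if r[column] == category else 0
        encoded.append(row)
    return encoded


# Hypothetical records with two of the attributes named in the text
patients = [
    {"gender": "female", "vital_status": "alive"},
    {"gender": "male", "vital_status": "dead"},
    {"gender": "female", "vital_status": "dead"},
]

encoded = patients
for col in ("gender", "vital_status"):
    encoded = one_hot(encoded, col)
```

Each patient ends up with exactly one indicator set to 1 per original attribute.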

Figure 2. Data distribution of TCGA BRCA clinical data
Feature selection involves choosing the most significant features to enhance the performance of the computational models; it also helps to avoid overfitting, which is essential when constructing a model. The dataset was processed using a correlation matrix and a heatmap feature selection approach. A correlation matrix can be used to evaluate the degree of similarity between independent and dependent properties [26]. The heatmap provides a visual summary of the correlation matrix, making it easier to identify patterns and relationships between features. The heatmap of the resulting correlation matrix is shown in Figure 3. This correlation matrix was used to determine which traits were most closely related: gender, age, pathologic stage, and histological type. Table 3 represents the clinical data after the application of preprocessing methods. Feature scaling is a data preprocessing step that helps ensure all the features are on the same scale. This step is essential to guarantee optimal model performance and avoid any potential bias. Standardization and normalization are two popular techniques used for feature scaling. Standardization changes each feature so that its mean is 0 and its standard deviation is 1, while normalization transforms each feature into a range between 0 and 1. For this purpose, standardization replaces the values by their z-scores [27], as given in Eqn 1:

$$z = \frac{x - \mu}{\sigma} \qquad (1)$$

where x is the original feature value, µ is the feature mean, and σ is the feature standard deviation, so that the features are redistributed with a mean of 0 and a standard deviation of 1.
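As a small illustration of the two computations described above, the sketch below standardizes a feature to z-scores and computes a Pearson correlation coefficient between two features. The use of the population standard deviation, the Pearson coefficient, and the example values are assumptions; the paper does not specify them.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / standard deviation.

    Uses the population standard deviation (an assumption); a constant
    feature has zero deviation and cannot be standardized this way.
    """
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def pearson(x, y):
    """Pearson correlation as the mean product of the two z-score series."""
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / len(x)


# Hypothetical values for two clinical features
age = [40, 50, 60, 70]
stage = [1, 2, 2, 3]

z_age = standardize(age)
r_age_stage = pearson(age, stage)
```

Computing `pearson` for every pair of features yields the entries of a correlation matrix like the one visualized in the heatmap.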

Clinical Data Deep Neural Networks Architecture
Deep Neural Networks (DNNs) are algorithms that are revolutionizing the healthcare industry by providing new techniques to analyse clinical datasets. They are increasingly used on clinical datasets to obtain meaningful insights and patterns from large amounts of data, performing complex, nonlinear computations that can be used to identify trends, correlations, and outliers in the data. DNNs are composed of several neuron layers, allowing the network to learn from data more effectively than traditional machine learning algorithms [28]. They can also be used for predictive analytics, allowing healthcare professionals to anticipate potential issues before they occur.
Initially, the proposed Optimized DNN Multimodal analytics (ODNN-MA) architecture is created using the sequential model. In a DNN, data is entered into the input layer and then forwarded to the several hidden layers. Finally, the result is passed to the output layer [29]. The TCGA-BRCA clinical data has 12 input parameters. The dataset was split into training and testing sets with a ratio of 70:30, respectively. Subsequently, 10 hidden layers were used with ReLU as the activation function. This study examined a binary classification problem, and hence the Sigmoid activation function was applied at the output node. Once the layers are defined, the NN is configured to measure the difference between the actual and expected outputs. Adam is the optimizer, and accuracy is the metric used to evaluate model performance. The training data was fit to the model using a batch size of 32, and the model was trained for 30 epochs, each a full pass over the training data. Regularization is employed to address overfitting. It discourages learning an overly complex or flexible model, reducing the risk of overfitting [30]. To train the suggested model, two regularizers were used: L2 parameter regularization and dropout. Figure 4 depicts the ODNN-MA architecture of the proposed clinical work. L2 parameter regularization penalizes large model weights, which reduces the risk of overfitting, as large model weights can result in overly complex models that learn from noise instead of true underlying patterns [31].
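The forward pass and the L2-penalized loss of such a network can be sketched in plain Python as follows. The layer count, input size, and activations follow the text; the hidden width of 16 units, the He-style initialization, the penalty coefficient, and binary cross-entropy as the loss are assumptions made only for this illustration, and no training loop is shown.

```python
import math
import random

N_INPUTS = 12          # input parameters, from the text
N_HIDDEN_LAYERS = 10   # hidden layers, from the text
HIDDEN_UNITS = 16      # assumption: width is not stated in the paper
L2_LAMBDA = 0.01       # assumption: penalty coefficient is not stated

rng = random.Random(0)


def init_layer(n_in, n_out):
    """Random weights (He-style scale, an assumption) and zero biases."""
    scale = math.sqrt(2.0 / n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


sizes = [N_INPUTS] + [HIDDEN_UNITS] * N_HIDDEN_LAYERS + [1]
layers = [init_layer(a, b) for a, b in zip(sizes, sizes[1:])]


def dense(x, weights, biases):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]


def predict(x):
    """ReLU hidden layers followed by a single sigmoid output node."""
    for weights, biases in layers[:-1]:
        x = [max(0.0, a) for a in dense(x, weights, biases)]
    logit = dense(x, *layers[-1])[0]
    return 1.0 / (1.0 + math.exp(-logit))


def l2_penalty():
    """Penalty term added to the loss: lambda times the sum of squared weights."""
    return L2_LAMBDA * sum(w * w for weights, _ in layers for row in weights for w in row)


def loss(x, y):
    """Binary cross-entropy (assumed) plus the L2 penalty."""
    p = min(max(predict(x), 1e-12), 1.0 - 1e-12)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p)) + l2_penalty()


sample = [rng.gauss(0.0, 1.0) for _ in range(N_INPUTS)]
probability = predict(sample)
```

Because the penalty grows with the squared weights, minimizing this loss favors smaller weights, which is the mechanism the text attributes to L2 regularization.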
The dropout process involves randomly eliminating a subset of neurons from a NN during training to induce the model to generate more accurate representations of the data [32]. The dropout pattern may differ depending on the layers used. Each iteration of the dropout process involves the random deletion of nodes and their connections. Thus, each iteration has its own set of nodes with its own set of outputs. The summary of the sequential DNN model can be seen in Figure 5.
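A minimal sketch of a dropout mask is shown here. It uses the common "inverted dropout" variant, in which the surviving activations are rescaled during training; that variant and the rate of 0.2 are assumptions, since the paper does not state them.

```python
import random


def dropout(activations, rate, training=True, rng=random):
    """Zero each activation with probability `rate` during training.

    Survivors are divided by (1 - rate) so the expected activation is
    unchanged (inverted dropout, an assumed variant). At inference time
    the activations pass through untouched.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


hidden = [1.0] * 1000
dropped = dropout(hidden, rate=0.2, rng=random.Random(0))
unchanged = dropout(hidden, rate=0.2, training=False)
```

Calling `dropout` again draws a fresh random mask, which corresponds to each iteration having its own set of active nodes.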

RNA Seq Data
RNA sequencing (RNA-seq) is pivotal in cancer research for helping researchers understand tumor classification and progression by tracking changes in gene expression and the transcriptome. Preprocessing RNA sequencing data is critical to gaining meaningful insights from raw sequencing data. This process involves several steps, such as filtering, normalization, and mapping sequencing reads onto the reference genome or transcriptome [33]. Preprocessing helps to identify relevant gene expression levels, splicing variants, and alternative transcripts. In addition, it facilitates subsequent analysis, including gene set enrichment and differential expression profiling.
In this study, genomic data primarily focus on RNASeq data from TCGA-BRCA. Genomic data are typically supplemented with clinical outcomes, including general clinical information and cancer status [34]. The dataset comprises 60,660 gene data points across tumor and normal cases. Both normalized fragments per kilobase per million (FPKM) and raw count data were utilized. Raw count data helped select genes that exhibited significant differential expression, while normalized FPKM data were employed in classification and ensemble procedures [35]. Table 4 provides an overview of the number of tumor samples for BRCA subtypes. The BRCA data analysis has the following phases:
• Obtain the gene data with its subtypes.
• Split the data into training and testing sets.
• Train on the training data with the ODNN-MA architecture and regularization parameters.
• Evaluate the classification result on the testing data.
Once the tumor data were obtained, genes with a mean value below 0.2 and a variance below 2 across tumor samples were filtered out, leaving 1,085 genes selected for the BRCA data. Subsequently, the tumor samples were divided into five subtypes based on the clinical BRCA data: Basal_like, Her_2, Lum_A, Lum_B, and Normal_like tumor samples. Table 4 lists the size of the tumor sample for each subtype. The data for the selected genes were divided into training and testing sets in a 75:25 ratio. Then, the ODNN-MA architecture, as depicted in Figure 4, is applied to the entire data.
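The gene filtering and the 75:25 split can be illustrated with the sketch below. It assumes that a gene is kept only when both its mean is at least 0.2 and its variance is at least 2, that the population variance is used, and that the split is made over samples with a fixed random seed; the gene names and expression values are hypothetical.

```python
import random
from statistics import mean, pvariance

MEAN_MIN = 0.2   # mean threshold from the text
VAR_MIN = 2.0    # variance threshold from the text


def select_genes(expression):
    """Keep genes whose mean and variance across tumor samples pass both thresholds.

    `expression` maps a gene name to its values across tumor samples.
    """
    return [
        gene
        for gene, values in expression.items()
        if mean(values) >= MEAN_MIN and pvariance(values) >= VAR_MIN
    ]


def train_test_split(sample_ids, test_fraction=0.25, seed=0):
    """Shuffle the sample identifiers and split them 75:25 by default."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


# Hypothetical expression values for three genes across four tumor samples
expression = {
    "GENE_A": [0.0, 0.1, 0.0, 0.2],  # mean below 0.2: filtered out
    "GENE_B": [5.0, 5.1, 4.9, 5.0],  # variance below 2: filtered out
    "GENE_C": [1.0, 6.0, 2.0, 9.0],  # passes both thresholds
}
selected = select_genes(expression)
train_ids, test_ids = train_test_split(range(8))
```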

Results and Discussion
This section provides an overview of the experimental results for two dataset modalities of TCGA-BRCA using the deep NN system for the prediction of breast cancer.

Performance Metrics
This work addresses a prediction problem; thus, the performance measures used are mainly related to classification. For detecting breast cancer, a target variable of 1 is considered deceased, and a target variable of 0 is considered a negative instance. This negative instance indicates that the patient is free of the tumor and is still alive.
The confusion matrix evaluates the model's preciseness and completeness and is used for classification problems with two or more output classes. The arrangement of the table or matrix helps to visualize the performance of the algorithm [36]. The confusion matrix observations represent the classifier's performance as precision, recall, F1-score, accuracy, and specificity. When the target variable classes in the data are approximately balanced, accuracy is a good metric [37]. Table 5 provides a quick explanation of the performance metrics.

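Metrics | Description | Formula
Precision | The proportion of correctly identified positive cases among all instances predicted as positive. | TP / (TP + FP)
Recall | The proportion of actual positive cases that were correctly identified. | TP / (TP + FN)
F1-score | The harmonic mean of precision and recall. | 2 × Precision × Recall / (Precision + Recall)
Accuracy | The fraction of all cases that were correctly classified (true positives and true negatives). | (TP + TN) / (TP + TN + FP + FN)
Specificity | The fraction of actual negatives predicted as negatives. | TN / (TN + FP)
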
Table 6 reports the precision, accuracy, recall, specificity, and F1-score of ODNN-MA on the TCGA-BRCA clinical dataset, demonstrating the importance of choosing the preprocessing technique according to the nature of the features. The confusion matrix for ODNN-MA on the TCGA-BRCA clinical dataset is depicted in Figure 6; it represents the classification of the ODNN-MA testing set. The matrix indicates that out of the 350 testing cases studied, 287 were observed and predicted as alive (True Positive (TP)), 36 were observed and predicted as deceased (True Negative (TN)), no cases were False Negative (FN), and 27 were observed as deceased but predicted to be alive (False Positive (FP)).
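As a worked check, the standard metric definitions can be applied to the confusion matrix counts stated above. The values in the comments are recomputed from those four counts only and are not copied from Table 6; the function name is illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard metrics derived from the four confusion matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts reported in the text for the 350 testing cases
metrics = classification_metrics(tp=287, tn=36, fp=27, fn=0)

# accuracy    = 323 / 350 ≈ 0.923
# precision   = 287 / 314 ≈ 0.914
# recall      = 287 / 287 = 1.000
# specificity =  36 /  63 ≈ 0.571
# f1          = 574 / 601 ≈ 0.955
```

The accuracy of about 92% agrees with the figure quoted for the clinical data, while the lower specificity reflects the 27 deceased cases predicted as alive.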
Table 7 illustrates the performance of RNASeq-based BRCA classification. The RNASeq-based BRCA subtype classification is based on the proposed ODNN-MA architecture. From the table, it is evident that the Basal_like subtype produces a higher accuracy score than the other types. The Stratified K-fold Cross-Validation method supports a more reliable estimate of classification accuracy by partitioning the dataset randomly so that each fold preserves the class ratios. The quality of the classifier output is assessed with the area under the Receiver Operating Characteristic (ROC) curve. The ROC curve visually represents the diagnostic capacity of a binary classifier, plotting the true positive rate (TPR) against the false positive rate (FPR). Area values close to one indicate superior performance of the machine learning model [38].
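The idea behind stratified K-fold partitioning can be sketched in plain Python as follows. The number of folds (5), the seed, and the label distribution are assumptions for illustration; the paper does not state the value of K.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) with class ratios preserved per fold."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share
        for position, index in enumerate(indices):
            folds[position % k].append(index)

    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test


# Hypothetical imbalanced labels: 40 positive and 10 negative cases
labels = [1] * 40 + [0] * 10
splits = list(stratified_k_fold(labels, k=5))
```

With these labels, every test fold contains 8 positive and 2 negative cases, the same 4:1 ratio as the full dataset.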
The ROC curve illustrates the trade-off between the TPR and the FPR as the discrimination threshold changes. AUC (Area Under the ROC Curve) provides an aggregated performance measure across all potential classification thresholds. AUC values range from 0 to 1, with higher values indicating better classification performance [39]. The ROC curves (Figures 7 and 8) summarize the results of the proposed ODNN-MA based on clinical data for TCGA-BRCA, which show an AUC value of 0.93 ± 0.02. This indicates that the model assigns a higher score to a randomly chosen positive case than to a randomly chosen negative case with a probability of about 93%, meaning that it distinguishes well between alive and deceased cases.
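The ranking interpretation of AUC can be made concrete with a short sketch that counts, over all positive-negative pairs, how often the positive case receives the higher score. The labels and scores are hypothetical; this pairwise formulation is equivalent to the area under the ROC curve but is quadratic in the number of cases, so it is meant for illustration only.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted scores
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]
auc = roc_auc(y_true, scores)  # 5 of 6 pairs ranked correctly
```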

Conclusion
Breast cancer is one of the most prevalent cancers among women, accounting for 69% of cancer-related deaths in this demographic. Early detection of breast cancer is crucial, as it remains a significant health challenge today. Detecting it early can substantially improve survival rates by enabling timely treatment. Though traditional deep learning models used for detection excel with specific data types, multimodal deep learning models are even more effective due to their ability to integrate richer, more comprehensive data from multiple sources. In this study, the proposed model achieved 92% accuracy on clinical data, with an AUC-ROC score of 0.93 ± 0.02, and 96% accuracy on RNASeq data. This capability enables the model to leverage diverse data sources, offering significant advantages over unimodal approaches. The evidence of this study suggests that integrating these models is likely to improve the estimation of patients' risk of developing breast cancer in the future.

Figure 1. Number of estimated cancer cases in India

Figure 3. Correlation matrix heatmap for TCGA-BRCA clinical data

Figure 4. ODNN-MA architecture for TCGA-BRCA clinical data
The L2 parameter regularization technique improves the accuracy of models by reducing overfitting. It is an optimization strategy that alters the loss function during training by adding a penalty term. Model weights are thereby penalized so that they remain small and close to zero.

Figure 5. Summary of the Sequential DNN model
The elements of the confusion matrix are:
• True Positives (TP): the actual and predicted BC instances are both positive.
• True Negatives (TN): the actual and predicted instances are both negative.
• False Positives (FP): the actual instance is negative, but the prediction is positive.
• False Negatives (FN): the actual instance is positive, but the prediction is negative.
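The four definitions above can be turned into a small counting routine. The sketch assumes labels encoded as 1 (positive) and 0 (negative); the example labels and predictions are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, and FN for a binary classification result."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts


# Hypothetical actual labels and model predictions
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
```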

Figures 7 and 8 illustrate the ROC curves and AUC of the proposed ODNN-MA applied to the clinical dataset of TCGA-BRCA, comparing TPR against FPR.

Table 1. Overall analysis of the cardiovascular disease dataset

Table 2. Clinical characteristics of the TCGA-BRCA clinical dataset

Table 4. Number of tumor samples for BRCA subtypes

Table 6. Performance of ODNN-MA on TCGA-BRCA clinical data
Figure 6. Confusion matrix for ODNN-MA on TCGA-BRCA clinical dataset