Deep spectral network for time series clustering

Deep clustering uses deep learning to cluster data: a neural network is trained to learn a data representation that is suitable for clustering. Deep clustering has been applied to a wide range of data types, including images, texts, and time series, and has the advantage of learning features from the data automatically, which can be more effective than using hand-crafted features. It is also able to handle high-dimensional data, such as time series with many variables, which can be challenging for traditional clustering techniques. In this paper, we introduce a novel deep neural network that improves the performance of the auto-encoder part by discarding unnecessary extra noise and by labelling the input data. Our approach is helpful when only a limited amount of labelled data is available, while labelling a large amount of data would be costly or time-consuming. It also applies to data that are high-dimensional and for which it is difficult to define a good set of features for clustering.


Introduction
Clustering time series is an interesting and challenging problem for several main reasons [2]. While the most common time series data types are univariate, multivariate, or tensor fields, the high dimensionality and irregular lengths of time sequence data present numerous difficulties for conventional clustering techniques in practice [1]. Among conventional approaches, partition-based and density-based methods are widely used in many fields for their simplicity, but they also exhibit weaknesses [15]. For example, hierarchical agglomerative clustering (HAC) suffers from the ambiguity in determining the final number of clusters and the difficulty of selecting a suitable distance metric when merging two clusters. On the other hand, clustering time series with k-means also has disadvantages, such as the imprecision of the silhouette analysis score and the difficulty of choosing the value of k. Furthermore, the "mean" of time series with different lengths is not well defined, and k-means typically relies on the following assumptions:
• The clusters in the data set have balanced sizes.
• Features within a cluster are independent and have equal variance.
• The clusters have similar densities.
Instead of centroid-based grouping methods, a graph-based clustering method, namely spectral clustering, is introduced here. A first approach for clustering time series, based on spectral decomposition of the affinity matrix followed by k-means clustering in a low-dimensional space, can be found in the works of [3,35]. Spectral clustering is not only able to deal with the high dimensionality of arbitrary-length time series, but can also automatically determine the optimal number of clusters and self-tune the variance of the Gaussian kernel. Moreover, the work of Wang et al. [35] provides theoretical support for the feasibility of our approach. For large-scale and high-dimensional data, we refer to the spectral clustering based on the iterative optimization method of Zhao et al. [34]. "Deep clustering" refers to a family of clustering techniques that use deep neural networks to learn representations that are conducive to clustering. The last decade has seen deep learning emerge as a new research trend for the clustering problem because of its ability to capture non-linearity in data representations. The most well-known neural network architectures are the fully-connected neural network (FCN), the convolutional neural network (CNN), the deep belief network (DBN), the autoencoder (AE), and the generative adversarial network (GAN) [27]. Without neural networks, classical representation methods such as principal component analysis (PCA), multidimensional scaling (MDS), the discrete Fourier transform (DFT), or the discrete wavelet transform (DWT) reduce the dimensionality in order to obtain a better loss function and achieve higher clustering accuracy. In [31], the authors investigate a deep learning approach to approximate spectral clustering called "SpectralNet". This is one of the very first neural network models that linearizes the input data and then applies spectral clustering directly to the encoded part. However, this model does not introduce better methods to improve the data representation or to enhance the overall loss function. Several other deep clustering models can be found in [1,27]. Deep clustering convolutional neural networks for image data, such as the DAE and DCAE models, can be found in the works of Huang et al. [20] and Guo et al. [17]. Recently, some interesting methods have been proposed to enhance the representation in the encoding part, such as the self-organizing map (SOM), the growing neural gas (GNG), and generative adversarial networks (GANs). In the case of missing data, the CRLI model was proposed by Ma et al. with a joint optimization method on two parallel branches [24]. Some deep clustering models have been proposed in the specific context of time series, such as deep temporal clustering (DTC) [26] and the deep temporal clustering representation (DTCR) [23].
Motivated by the existing approaches, we propose an improved spectral clustering method based on bidirectional dilated recurrent neural networks, namely DSTR, for large-scale and high-dimensional time series data. Even though in many circumstances an outcome variable exists or some basic information about the clusters is available, one may still wish to run a cluster analysis. Departing from the approach of [23], we employ a novel model that extends conventional cross entropy minimization to an optimal transport problem by maximizing the information between labels and input data indices. In practice, clustering on latent representations produced by a neural network is an unsupervised task that may increase the model's complexity and lead to trivial solutions [32].
To increase the information between data indices and labels, we add the restriction that the labels must induce an equipartition of the data. This effectively turns the task into a semi-supervised problem.

State of the art
Problem statement: Consider a set of time series X = {x_1, x_2, ..., x_M}. Each time series has the same length d, i.e. x_i = (x_{i,1}, ..., x_{i,d}), which leads to a data matrix A = [x_1, x_2, ..., x_M]. The goal of our problem is to cluster the set X into a finite partition C = {C_1, ..., C_k} of k clusters in a way that maximizes the similarity between objects within a cluster while minimizing the similarity between objects in different clusters. Each C_i is called a cluster, where A = ∪_{i=1}^{k} C_i and C_i ∩ C_j = ∅ for all i ≠ j. Let us denote by X′ a pairwise constraint matrix and by R the mapping that assigns the set X to k clusters, so that the result of the mapping is C. The most representative approaches to the classical time series clustering problem can be grouped into the following main categories: hierarchical methods, which agglomerate the items and merge clusters bottom-up; partitioning methods, which decompose a data set of unlabelled objects into disjoint clusters such that each cluster contains at least one element; model-based methods, which recover an underlying model that fits the data well; and density-based, grid-based, and multistep methods. In the deep clustering setting, the mapping R can be implemented with any of the aforementioned neural architectures and contains two modules, the representation module and the clustering module, which play central roles in modern deep clustering. Normally, we distinguish the following four kinds of deep clustering problems:
• The one view deep clustering problem: R(X) → C.
• The partially supervised deep clustering problem: R(X, X′) → C.
• The multi view deep clustering problem: R(X_1, ..., X_n) → C, where X_i is the i-th view of X.
• The domain adaptation deep clustering problem: R(X_i, X_i′, X_j) → C, where (X_i, X_i′) is the labelled source domain and X_j is the target domain without labels.
When performing traditional clustering tasks, it is frequently assumed that all of the data form a single view or modality with the same shape and structure. There are five different types of one view deep clustering algorithms: the DAE, DNN, VAE, GAN, and GNN types [42]. When the data to be processed only include a small amount of prior constraints, standard clustering techniques cannot effectively use this prior information [42]. Multi-view deep clustering exploits the consistent and complementary information present in multi-view data, combining these frameworks with DEC-type, subspace-clustering-based, and GNN-type methods, to enhance clustering performance [42]. Approaches to deep clustering based on transfer learning, such as the DNN and GAN types, aim to improve the effectiveness of existing clustering tasks by using data from related tasks [42].
Model-based approaches: The most well-known architectures for deep clustering are the Fully Connected Neural Network (FCNN), the Convolutional Neural Network (CNN), and the Recurrent Neural Network (RNN) [1]:
• FCNNs are feedforward neural networks that consist of layers of interconnected "neurons", where every neuron in one layer is connected to every neuron in the following layer. FCNNs are generally used for tasks whose input is a fixed-size feature vector, such as image classification or regression.
• CNNs, a particular type of neural network, are particularly well suited to tasks involving data with a spatial structure, such as images. CNNs are composed of layers of convolutional filters, which learn local patterns in the data, as well as pooling layers, which reduce the spatial resolution of the data and increase the network's ability to generalize.
• RNNs are a particular kind of neural network that function well for tasks that need sequential data, including time series forecasting or natural language processing.RNNs are composed of "recurrent" units, which allow the network to process data with an arbitrary length by maintaining a hidden state that is updated at each time step.
Representative FCNN architectures are Deep Embedded Clustering (DEC) (see Xie et al. [32]) and Improved Deep Embedded Clustering (IDEC) (see Guo et al. [18]). Here, the loss function refers to the deep clustering model's optimization objective. While the DEC model proposes a pure clustering loss, namely the Kullback-Leibler (KL) divergence, IDEC generalizes it by adding a reconstruction term to the total loss function L = λ L_network + (1 − λ) L_clustering, where λ ∈ [0, 1]. The total loss of the IDEC model is thus a weighted sum of the network loss and the clustering loss, where the weighting factor λ determines the relative importance of each term. When λ = 0, the model reduces to the DEC model, and when λ = 1, the model reduces to an autoencoder. By varying λ, it is possible to trade off the reconstruction error against the clustering quality, or vice versa (a small implementation sketch of this weighted loss is given after the following list). Several variations of the IDEC model have been proposed in the literature. Some examples include:
• The Structural Deep Clustering Network (SDCN), which incorporates structural information into the clustering process by adding a graph convolutional layer to the IDEC model (see Bo et al. [6]).
• The Deep Embedded Regularized Clustering (DEPICT) model, which adds a regularization term to the overall loss function of the IDEC model to encourage the latent representation to be smooth and continuous (see Dizaji et al [13]).
• The Deep Clustering Network (DCN), which extends the IDEC model by adding a deep autoencoder to the network architecture and using a new clustering loss function based on the minimum entropy principle (see Yang et al [38]).
• The Deep Adaptive Image Clustering (DAIC) model, which combines the IDEC model with an adversarial learning framework to improve the robustness and generalization of the learned representations (see Chang et al [8]).
• The Deep Clustering with Convolutional Autoencoder (DCCAE) model, which uses a convolutional autoencoder as the encoder-decoder component of the IDEC model, and a new clustering loss function based on the maximum mean discrepancy (MMD) criterion (see Guo et al [18]).
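To make the weighted objective above concrete, the following minimal PyTorch sketch computes the DEC/IDEC-style terms; the encoder output z, the decoder output x_rec, and the trainable cluster centers are assumed to be supplied by the surrounding model, and all names are illustrative rather than part of the original implementations.

```python
# A minimal sketch (PyTorch assumed) of the weighted IDEC-style objective
# L = lambda * L_network + (1 - lambda) * L_clustering discussed above.
import torch
import torch.nn.functional as F

def soft_assignments(z, centers, alpha=1.0):
    """Student's t-kernel soft assignment q_ij used by DEC/IDEC."""
    d2 = torch.cdist(z, centers) ** 2                       # (n, k) squared distances
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Sharpened target p_ij proportional to q_ij^2 / f_j, renormalized per sample."""
    p = q ** 2 / q.sum(dim=0, keepdim=True)
    return p / p.sum(dim=1, keepdim=True)

def idec_style_loss(x, x_rec, z, centers, lam=0.5):
    q = soft_assignments(z, centers)
    p = target_distribution(q).detach()                     # target is fixed for this step
    clustering_loss = F.kl_div(q.log(), p, reduction="batchmean")   # KL(p || q)
    network_loss = F.mse_loss(x_rec, x)                     # autoencoder reconstruction
    return lam * network_loss + (1.0 - lam) * clustering_loss
```

Setting lam = 0 recovers the pure clustering (DEC) objective, while lam = 1 recovers a plain autoencoder, matching the discussion above.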
Many well-known configurations of convolutional neural networks (CNNs) have been proposed in the literature. Some examples include:
• LeNet: a classic CNN architecture that includes a number of convolutional and pooling layers, followed by a small number of fully connected layers. LeNet was among the first CNNs to be used successfully for tasks such as handwritten digit recognition (see Lecun et al. [11]).
• AlexNet: a CNN architecture with five convolutional layers and three fully connected layers; it was the first CNN to win the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (see Krizhevsky et al. [21]).
• VGG: an architecture in which a large number of convolutional layers precede a few fully connected layers; it is known for its good performance on image classification tasks (see Simonyan et al. [29]).
• ResNet: a CNN architecture built around residual (skip) connections, which make it possible to train very deep networks (see [36]).
• Dilated convolutional neural networks (DCNN): a type of CNN that uses "dilated" convolutions, which have a larger effective receptive field than standard convolutions. Dilated convolutions can be used to increase the context captured by the network without increasing the number of parameters or the computational cost, and they have been shown to be effective in tasks such as semantic segmentation and image restoration (see Franceschi et al. [16]).
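As a small illustration of the dilation mechanism, the hedged PyTorch sketch below compares a standard and a dilated 1D convolution on a toy batch of time series; with dilation d and kernel size s, the effective receptive field grows to (s − 1)·d + 1 while the parameter count stays the same.

```python
# A toy sketch (PyTorch assumed) showing that dilation enlarges the receptive
# field of a 1D convolution without adding parameters.
import torch
import torch.nn as nn

x = torch.randn(8, 1, 128)                  # (batch, channels, time) toy time series
standard = nn.Conv1d(1, 16, kernel_size=3, padding=1)             # receptive field 3
dilated = nn.Conv1d(1, 16, kernel_size=3, padding=4, dilation=4)  # receptive field 9

print(standard(x).shape, dilated(x).shape)  # both keep the time length: (8, 16, 128)
print(sum(p.numel() for p in standard.parameters()),
      sum(p.numel() for p in dilated.parameters()))               # identical parameter counts
```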
For recurrent neural networks, more configurations are available, such as Deep Temporal Clustering (DTC) (see Madiraju et al. [26]), the Bidirectional Long Short Term Memory (BLSTM), the Bidirectional Gated Recurrent Unit (BGRU), and the Dilated Recurrent Neural Network (see Ma et al. [23]).
Many different configurations of recurrent neural networks (RNNs) have been proposed in the literature. Some examples include:
• Bidirectional Long Short Term Memory (BLSTM): an RNN design made up of long short term memory (LSTM) units arranged in a "bidirectional" fashion. This means that the network processes the data in both the forward and backward directions, which allows it to capture contextual information from both the past and the future. BLSTMs are often used for tasks involving sequential data, such as natural language processing (see Yulita et al. [39]).
• Bidirectional Gated Recurrent Unit (BGRU): an RNN architecture similar to the BLSTM, but using gated recurrent units (GRUs) instead of LSTM units. GRUs are a simpler variant of LSTMs that have fewer parameters and are easier to train. BGRUs are often used for tasks involving sequential data, such as natural language processing (see Sammani et al. [28]).
• Dilated Recurrent Neural Network (DRNN): a type of RNN architecture that uses "dilated" recurrent connections, which have a larger effective receptive field than standard recurrent units. Dilated recurrent units can be used to increase the context captured by the network without increasing the number of parameters or the computational cost, and they have been shown to be effective in tasks such as natural language processing and machine translation (see Chang et al. [8]).
• Variational recurrent neural networks (VRNNs): a type of recurrent neural network (RNN) that combines an RNN architecture with a probabilistic model. It can be used to learn latent representations of sequential data such as natural language or time series (see Chien et al. [9]).
• Deep Temporal Clustering (DTC): an RNN architecture introduced by Madiraju et al. in their paper "Deep Temporal Clustering: Unsupervised Learning of Temporal Patterns". It is designed for unsupervised learning of temporal patterns in data and consists of an RNN encoder and a clustering layer (see Madiraju et al. [26]).
The representation module: Recently, the use of deep learning techniques for representation learning problems, especially unsupervised ones, has become a new trend in this research area. However, the majority of approaches currently in use are not specifically intended for clustering tasks, and they are therefore unable to exploit potential clustering information to learn improved representations [20]. Several unsupervised techniques exist in representation learning, such as clustering-friendly, auto-encoder, subspace, deep generative, mutual information maximization, and contrastive representation learning [42]. The data used for representation learning can be of various types handled by the aforementioned architectures, such as images, texts, videos, or graphs [42].
Auto-Encoder based: The part of a deep neural network that produces the representation of the time series is called the encoder. For different data formats, such as vectors, images, graphs, or videos, the general structure of the auto-encoder can be adapted.

Generative model based:
The generative model is a further branch of deep unsupervised representation learning, which seeks to produce new data samples resembling the training dataset. Generative approaches assume that the data x were generated from a latent representation h and infer the posterior of the representation p(h|x) from the data. There are many different types of generative models, ranging from simple models such as the Gaussian mixture model to more complex models such as generative adversarial networks (GANs) [22]. Deep generative models, implemented with deep neural networks, have gained a lot of popularity in recent years due to their ability to learn complex, high-dimensional distributions and generate high-quality synthetic data [14]. Among these, the variational auto-encoder (VAE) is the most widely used technique, and VAE-based deep clustering algorithms aim at maximizing the evidence lower bound (ELBO) on the data likelihood, L_ELBO = E_{q(h|x)}[log p(x, h) − log q(h|x)]. In this expression, p(x, h) is the true joint distribution over the observed variables x and the latent variables h, and q(h|x) is the approximating distribution, also known as the variational distribution. The expectation E_{q(h|x)} is taken with respect to the variational distribution q(h|x).
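For concreteness, a minimal PyTorch sketch of the negative ELBO for a Gaussian VAE is given below; the encoder outputs mu and logvar and the decoder output x_rec are assumed to be produced elsewhere, and the Gaussian reconstruction term is approximated by a mean squared error up to constants.

```python
# A hedged sketch (PyTorch assumed) of the negative ELBO
# -L_ELBO ≈ reconstruction error + KL(q(h|x) || N(0, I)).
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """Sample h ~ q(h|x) = N(mu, diag(exp(logvar))) with the reparameterization trick."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def negative_elbo(x, x_rec, mu, logvar):
    rec = F.mse_loss(x_rec, x, reduction="sum")              # approximates -E_q[log p(x|h)]
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl                                          # minimizing this maximizes the ELBO
```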
Mutual Information: In unsupervised learning, mutual information is used, for discrete variables or known probability distributions, to measure the amount of information that one random variable contains about another. It is defined as the expected value of the logarithm of the ratio of the joint probability distribution of the two variables to the product of their individual probability distributions. The mutual information I(U, V) is introduced to measure the dependence between two random variables U and V,
I(U, V) = D_KL(P_UV ‖ P_U ⊗ P_V),
where P_UV is the joint distribution, P_U = ∫_V dP_UV and P_V = ∫_U dP_UV are the marginals, and P_U ⊗ P_V is the product of the individual probability distributions of U and V. We recall that the work of Hjelm et al. [19] showed the efficiency of representation learning through deep neural network encoders that maximize the mutual information between input and output. Considering a discriminator function D modeled by a neural network, the Jensen-Shannon divergence yields a popular mutual information estimate,
I_JSD(U, V) = E_{P_UV}[−sp(−D(u, v))] − E_{P_U ⊗ P_V}[sp(D(u, v))],
where sp(z) = log(1 + e^z) is the softplus function. In [37], the authors introduce the mutual information I(X, H, θ) between the input X and the features H, with encoder parameters θ, to extract more discriminative information from the inputs so that the learned latent representations are more robust to noise. Using the Jensen-Shannon divergence estimate, the authors maximize the term I(X, H, θ) as much as possible when training the encoder network.
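The Jensen-Shannon estimate above only requires discriminator scores on paired ("joint") and mismatched ("product of marginals") samples, as in the following hedged sketch; the discriminator T is a hypothetical network not specified in this paper.

```python
# A small sketch (PyTorch assumed) of the Jensen-Shannon mutual information
# estimate: I_JSD = E_P[-softplus(-T(x, h))] - E_{P x P~}[softplus(T(x', h))].
import torch.nn.functional as F

def jsd_mi_estimate(t_joint, t_marginal):
    """t_joint: scores T(x, h) on matching pairs; t_marginal: scores T(x', h) on shuffled pairs."""
    return (-F.softplus(-t_joint)).mean() - F.softplus(t_marginal).mean()
```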
Similarity measure: The choice of distance measure directly affects the clustering performance. Let dist(x_i, x_j) be the total distance between two time series x_i and x_j, i.e. dist(x_i, x_j) = Σ_{t=1}^{d} dist(x_{i,t}, x_{j,t}). Consider the Steiner sequence R_j, i.e. the time series that minimizes the sum of distances between the observations of a cluster and the cluster prototype, and the average distance E(C_i, R_j) = (1/|C_i|) Σ_{F ∈ C_i} dist(F, R_j). This expression is the average distance between the set of feature vectors F_1, F_2, ..., F_n of the cluster C_i and a reference vector R_j. In a clustering context, the reference vector might represent the centroid of a cluster, and the average distance can be used as a measure of the compactness of the cluster. In practice, the most common metrics are the Euclidean distance, dynamic time warping (DTW), the Hausdorff distance, and the shape-based distance.
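Since DTW is the metric adopted later for the affinity matrix, a compact dynamic-programming sketch is given below; it assumes univariate series, squared pointwise costs, and no warping window, which is a simplification of production DTW implementations.

```python
# A compact sketch of the DTW distance between two univariate time series.
import numpy as np

def dtw_distance(a, b):
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2                  # pointwise squared cost
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return np.sqrt(D[n, m])

print(dtw_distance(np.array([0.0, 1.0, 2.0]), np.array([0.0, 1.0, 1.0, 2.0])))  # 0.0
```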

Model architecture
In this paper, we introduce a new kind of deep neural network whose objective consists of three losses: the principal network loss, the classification loss, and the auxiliary clustering loss. Deep embedded clustering (DEC) [32] can be considered the first model to propose a clustering loss, namely the Kullback-Leibler (KL) divergence. Subsequently, improved deep embedded clustering (IDEC) [18] generalized the DEC model by adding a reconstruction term to the total loss function, L = λ L_network + (1 − λ) L_clustering. In [23], the authors introduce a binary classification loss that distinguishes real and fake time series data. In detail, this model generates around 20% extra data from the input as noise, and then uses a softmax formula to classify the data after encoding. However, the authors do not model the noise well with respect to its probability distribution in the deep neural network, which may make the model lose its regularization effect. Moreover, the mutual information is not well estimated by the current loss function. Our plan is to employ semi-supervised clustering algorithms (or occasionally supervised clustering methods) that can be used with partially labelled data or data that contains other kinds of outcome measures. By proposing a new architecture that eliminates superfluous noise and labels the input data, we increase the performance of the auto-encoder component and introduce a new classification loss. In practice, learning from labelled data is easier to implement and can dramatically reduce the cost of deploying the algorithms. In fact, we require labels for the data to train the network, and we require the network to predict the labels of the outcome. We use the dilated recurrent neural network (DRNN) proposed by Chang et al. [8], which reduces the number of network parameters and improves training efficiency. Overall, the total loss of the deep neural network model can be written as L = λ L_network + (1 − λ) L_clustering + γ L_classification. In the case of γ = 0, our model collapses to the IDEC architecture [18].
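Under the assumption that the total loss has the weighted form stated above, a hedged PyTorch sketch of one training step's objective could look as follows; q and p are the soft assignments and target distribution of the DEC/IDEC-type clustering loss, and pseudo_labels are the spectral-clustering labels described in the next section (all names are illustrative).

```python
# A hedged sketch (PyTorch assumed) of
# L = lambda * L_network + (1 - lambda) * L_clustering + gamma * L_classification.
import torch
import torch.nn.functional as F

def dstr_total_loss(x, x_rec, q, p, class_logits, pseudo_labels, lam=0.5, gamma=1.0):
    network_loss = F.mse_loss(x_rec, x)                                 # reconstruction term
    clustering_loss = F.kl_div(q.log(), p, reduction="batchmean")       # KL(p || q) clustering term
    classification_loss = F.cross_entropy(class_logits, pseudo_labels)  # pseudo-label term
    return lam * network_loss + (1.0 - lam) * clustering_loss + gamma * classification_loss
```

With gamma = 0 the objective reduces to the IDEC form, as stated above.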

Labelling data via spectral clustering
At the beginning, we assign pseudo-labels to each element of the original data set A. Using spectral clustering, one can categorize the data in the eigenspace associated with the Laplacian matrix. In fact, the whole data set can be viewed as a graph, with each node being an observation and each edge weight a similarity. The goal of the clustering problem is to divide the graph into smaller parts so that edges within each subcomponent have a high degree of similarity and edges across subcomponents have a low degree of similarity. Such partitions can be obtained by solving the mincut problem. Let us define the weight function w : R^d × R^d → (0, +∞) by the Gaussian kernel w(x, y) = exp(−d(x, y)^2 / (2σ^2)), where d(x, y) is a distance metric between the two vectors x and y and σ is a scaling parameter. The loss function of spectral clustering, computed on a sample mini-batch of size m at each iteration, is defined as L_spectral = (1/m^2) Σ_{i,j=1}^{m} w(x_i, x_j) ‖y_i − y_j‖^2, where the y_i are the embedded points. In our case, we choose the dynamic time warping (DTW) distance as the metric. According to [3], the steps of spectral clustering are as follows (a short implementation sketch is given after the list):
• Construct an affinity matrix W ∈ R^{M×M} whose diagonal elements are set to zero and whose off-diagonal elements are W_{i,j} = w(x_i, x_j).
• Construct a diagonal matrix K of size M × M whose i-th diagonal element is K_{i,i} = Σ_j W_{i,j}; a normalized Laplacian matrix can then be formed as L = K^{−1/2} W K^{−1/2}.
• Create an orthogonal matrix O consisting of the k largest eigenvectors of L, then normalize each row of O to unit length.
• Cluster the rows of O into k clusters (e.g. with k-means), then assign the point x_i to cluster j if and only if row i of O belongs to cluster j.
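The listed steps can be sketched in a few lines of Python, assuming a precomputed matrix of pairwise DTW distances; sigma, k, and the use of scikit-learn's KMeans are assumptions of this illustration rather than prescriptions of the paper.

```python
# A sketch of the pseudo-labelling steps above: Gaussian affinity, normalized
# Laplacian-type matrix, top-k eigenvectors with unit-length rows, then k-means.
import numpy as np
from sklearn.cluster import KMeans

def spectral_pseudo_labels(dist, k, sigma=1.0):
    """dist: (M, M) matrix of pairwise (e.g. DTW) distances."""
    W = np.exp(-dist ** 2 / (2.0 * sigma ** 2))                  # affinity matrix
    np.fill_diagonal(W, 0.0)                                     # zero diagonal, as in step 1
    d = W.sum(axis=1)
    K_inv_sqrt = np.diag(1.0 / np.sqrt(d + 1e-12))
    L = K_inv_sqrt @ W @ K_inv_sqrt                              # L = K^{-1/2} W K^{-1/2}
    _, eigvecs = np.linalg.eigh(L)
    O = eigvecs[:, -k:]                                          # k largest eigenvectors
    O = O / (np.linalg.norm(O, axis=1, keepdims=True) + 1e-12)   # unit-length rows
    return KMeans(n_clusters=k, n_init=10).fit_predict(O)        # cluster the rows of O
```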

Clustering via latent representation
We define a non-linear mapping Φ_enc(θ): x_i → h_i that produces the latent representation of the encoding (AE-based) part, i.e. Φ(x_i, θ) = h_i ∈ R^h with h < d. The new space created by the map Φ_enc(θ) is called the latent space, in contrast to the original data space. Let us denote the latent representation by H = [h_1, h_2, ..., h_M]. If E is a permutation matrix that reorders the columns by cluster, then HE = [H_1, H_2, ..., H_k] represents a clustering into k different groups, where the i-th cluster is H_i = [h_{i1}, ..., h_{is_i}] and s_i is the size of cluster i. We denote by m the sample mean of the latent representation, i.e. m(θ) = (1/M) Σ_{i=1}^{M} Φ(x_i, θ), and for each i ∈ [1, k] we denote the mean vector of the i-th cluster by m_i(θ) = (1/s_i) Σ_{j=1}^{s_i} Φ(x_{ij}, θ). The total within-cluster scatter matrix (Total_wsm) and the total between-cluster scatter matrix (Total_bsm) for the encoding part are
Total_wsm(θ) = Σ_{i=1}^{k} Σ_{j=1}^{s_i} (Φ(x_{ij}, θ) − m_i(θ)) (Φ(x_{ij}, θ) − m_i(θ))^T and
Total_bsm(θ) = Σ_{i=1}^{k} s_i (m_i(θ) − m(θ)) (m_i(θ) − m(θ))^T.
According to [35], the total-data scatter matrix (Total_dsm) does not depend on the number of clusters; in fact,
Total_dsm(θ) = Σ_{i=1}^{M} (Φ(x_i, θ) − m(θ)) (Φ(x_i, θ) − m(θ))^T = Total_wsm(θ) + Total_bsm(θ).
To obtain high within-cluster similarity and low between-cluster similarity, we should decrease trace(Total_wsm(θ)) and increase trace(Total_bsm(θ)). Note that, since Total_dsm is fixed, the minimization of trace(Total_wsm) and the maximization of trace(Total_bsm) are two equivalent problems. We have
trace(Total_wsm(θ)) = trace(H^T H) − trace(Q^T H^T H Q),
where e_i ∈ R^{s_i} is the vector of all ones for every i, and Q = diag(e_1/√s_1, ..., e_k/√s_k) is a block-diagonal orthogonal matrix, i.e. Q^T Q = I. The solution of the problem min_θ trace(Total_wsm(θ)) can be characterized by the Ky Fan theorem [33].
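The trace identity used above can be checked numerically with a short numpy sketch; H is a latent matrix with one column per sample and labels is any cluster assignment (both are illustrative placeholders).

```python
# A numpy sketch verifying trace(Total_dsm) = trace(Total_wsm) + trace(Total_bsm).
import numpy as np

def scatter_traces(H, labels):
    m = H.mean(axis=1, keepdims=True)                        # global mean of the latent codes
    tr_wsm, tr_bsm = 0.0, 0.0
    for c in np.unique(labels):
        Hc = H[:, labels == c]
        mc = Hc.mean(axis=1, keepdims=True)                  # cluster mean m_i
        tr_wsm += ((Hc - mc) ** 2).sum()                     # trace of within-cluster scatter
        tr_bsm += Hc.shape[1] * ((mc - m) ** 2).sum()        # trace of between-cluster scatter
    tr_dsm = ((H - m) ** 2).sum()                            # trace of total-data scatter
    return tr_wsm, tr_bsm, tr_dsm

H = np.random.randn(5, 100)                                  # 5-dimensional codes for 100 series
labels = np.random.randint(0, 3, size=100)
w, b, t = scatter_traces(H, labels)
print(np.isclose(w + b, t))                                  # True
```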

Reconstruction loss
If we define the non-linear decoding mapping Ψ_dec(θ′): h_i → x̂_i, then the loss function of the reconstruction phase is taken in the L2 norm, i.e. the mean squared error L_reconstruction = (1/M) Σ_{i=1}^{M} ‖x_i − Ψ_dec(Φ_enc(x_i, θ), θ′)‖^2.
The optimization problem for the reconstruction loss can be written as min_{θ, θ′} L_reconstruction.

Multi-class classification loss
In this section, we utilize an appropriate classification model to reduce the discrepancy between the assignments computed by the clustering step and the labels learned by the deep neural network. Define a classification head h: R^h → R^k, which transforms the feature vector into a class score vector, and denote by {ℓ_1, ℓ_2, ..., ℓ_M} ∈ {1, 2, ..., k} the pre-assigned labels. The class scores are converted to class probabilities via the softmax operator, p(ℓ = · | x_i) = softmax(h ∘ Φ(x_i, θ)). Therefore, the loss for minimizing the average cross entropy is given as
E(p, q) = −(1/M) Σ_{i=1}^{M} Σ_{y=1}^{k} q(y | Φ(x_i, θ)) log p(y | Φ(x_i, θ)).
If the posterior distribution q(y | Φ(x_i, θ)) is set to be deterministic (one-hot), then another way to express this equation is
E(p, q) = −(1/M) Σ_{i=1}^{M} log p(ℓ_i | Φ(x_i, θ)).
If we further assume that the labels are assigned uniformly and that each data point can take only one label, then the optimization problem for the above formula with constraints becomes
min_{θ, q} E(p(θ), q) subject to q(y | Φ(x_i, θ)) ∈ {0, 1} and Σ_{i=1}^{M} q(y | Φ(x_i, θ)) = M/k for every y ∈ {1, ..., k}.
The above problem can be viewed as a form of the optimal transport problem with binary constraints. The first constraint specifies that q(y | Φ(x_i, θ)), as is common in optimization problems with binary variables, must only take the values 0 and 1. The second constraint specifies that, for each label, the sum of q(y | Φ(x_i, θ)) over all M data points must equal M/k; this is a balance condition, in which the sum of the variables must reach a prescribed value. Several optimization strategies can be used to tackle the relaxed problem, and the specific algorithm depends on the characteristics of the objective function and the constraints, as well as the desired accuracy and computational efficiency; for instance, it can be solved in (nearly) linear time with algorithms such as gradient descent or a primal-dual interior point method [4].
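One common way to solve a relaxed version of this balanced assignment problem is a Sinkhorn-Knopp iteration that softly enforces the equipartition constraint before taking hard labels, as in the hedged sketch below; this is one possible solver among several, not the specific algorithm of [4].

```python
# A hedged sketch of balanced pseudo-label assignment via Sinkhorn-Knopp:
# each of the k labels should receive roughly M / k points.
import numpy as np

def balanced_pseudo_labels(log_probs, n_iter=50):
    """log_probs: (M, k) log class probabilities from the classification head."""
    M, k = log_probs.shape
    Q = np.exp(log_probs - log_probs.max())          # positive matrix, numerically safe
    Q /= Q.sum()
    r = np.full(M, 1.0 / M)                          # every point carries equal mass
    c = np.full(k, 1.0 / k)                          # every label should receive M / k points
    for _ in range(n_iter):
        Q *= (r / (Q.sum(axis=1) + 1e-12))[:, None]  # match the row marginals
        Q *= (c / (Q.sum(axis=0) + 1e-12))[None, :]  # match the column marginals
    return Q.argmax(axis=1)                          # hard labels after the relaxation
```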

Experimental results
We consider around 10 sample data sets extracted from the UCR time series database [12]. There is a standard train/test split for each data set, and the statistics of the benchmark time series data sets can be found in the appendix section. We compare our results with those of the DTCR model and the k-means method. The hidden sizes of the encoder part are [100, 50, 50], while the regularization coefficient λ is chosen in the set {1, 1e−1, 1e−2, 1e−3}. The following table shows how the performance improves during the learning process of our deep spectral clustering model, and we also compare it to the DTCR model of [23]. The recorded results are taken at epochs 0, 30, and 50, respectively. The experiment is repeated 5 times consecutively, and we report the average values. Model evaluation: metrics for assessing clustering output may be external or internal. For the external measures, we consider the Rand Index (RI), the Adjusted Rand Index (ARI), the Adjusted Mutual Information (AMI), the Fowlkes-Mallows score (FMS), Homogeneity, and Completeness.
• Rand Index: the RI measures the similarity between the partition produced by the clustering algorithm and the underlying structure of the data set.
• Adjusted Rand Index: the ARI is a variant of the Rand Index (RI), which counts the pairs of points that are either in the same cluster or in different clusters in both clusterings; it is defined as ARI = (RI − Expected(RI)) / (Max(RI) − Expected(RI)).
• Normalized Mutual Information: the NMI measures the mutual dependence between two random variables. It is a normalized version of the mutual information, which quantifies how much knowledge one random variable provides about another.
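All of the external measures listed above are available in scikit-learn, as the short sketch below illustrates; the label vectors are toy placeholders.

```python
# A short sketch computing the external clustering measures with scikit-learn.
from sklearn import metrics

true_labels = [0, 0, 1, 1, 2, 2]          # ground-truth classes (toy example)
predicted_labels = [0, 0, 1, 2, 2, 2]     # clustering output (toy example)

print("RI  :", metrics.rand_score(true_labels, predicted_labels))
print("ARI :", metrics.adjusted_rand_score(true_labels, predicted_labels))
print("NMI :", metrics.normalized_mutual_info_score(true_labels, predicted_labels))
print("AMI :", metrics.adjusted_mutual_info_score(true_labels, predicted_labels))
print("FMS :", metrics.fowlkes_mallows_score(true_labels, predicted_labels))
print("Hom :", metrics.homogeneity_score(true_labels, predicted_labels))
print("Com :", metrics.completeness_score(true_labels, predicted_labels))
```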
For the internal measures, we consider the Davies-Bouldin index, the Calinski-Harabasz index, the Silhouette score, the I-index, and the sum of squared errors (SSE).
• Calinski-Harabasz index: the CH index measures how compact the clusters are by calculating the distances between the points and the centroid of each cluster. Denote by µ_k = (1/|I_k|) Σ_{i∈I_k} x_i the average point of group k and by µ = (1/N) Σ_{i=1}^{N} x_i the center of the data set. The Calinski-Harabasz index is defined as CHI = ((N − K) B) / ((K − 1) Σ_{k=1}^{K} W_k), where B = Σ_{k=1}^{K} |I_k| ‖µ_k − µ‖^2 is the between-group variance and W_k = Σ_{i∈I_k} ‖x_i − µ_k‖^2 is the within-group variance.
• Davies-Bouldin index: the DB index measures the average similarity between each cluster and its most similar cluster. Defining γ_k = (1/|I_k|) Σ_{i∈I_k} d(x_i, µ_k) as the average distance between the points of the set I_k and their group center, the Davies-Bouldin index is defined as DB = (1/K) Σ_{k=1}^{K} max_{l≠k} (γ_k + γ_l) / d(µ_k, µ_l).
We list the RI and NMI comparison results on all 36 time series data sets for the DSTR and DTCR models as follows:

Conclusions
In this paper, we introduce a new method for spectral clustering of data using deep learning techniques. Deep spectral clustering has the benefit of automatically extracting features from the data, which can be more efficient than doing it manually. Additionally, it can handle high-dimensional data that might be difficult for conventional clustering approaches to manage, such as photos with many pixels or texts with long word sequences. However, deep spectral clustering also has some challenges, such as the need for large amounts of labelled data for training and the risk of overfitting if the model is too complex. It is important to carefully select the model architecture and the training procedure in order to achieve good performance on the clustering task.


Figure 1. The general architecture design for the deep clustering network.


Figure 2. The model evaluation for the Beef data set through the epochs: (a) Rand Index curve, (b) Normalized Mutual Information curve.