EM_GA-RS: Expectation Maximization and GA-based Movie Recommender System

This work introduces a novel clustering-based machine learning approach for a movie recommender system (RS). Conventional clustering approaches suffer from clustering error, which degrades performance. To overcome this issue, we develop an expectation-maximization (EM) based clustering approach. However, with imbalanced data, multicollinearity degrades the performance of the RS. Hence, we incorporate a PCA (Principal Component Analysis) based dimensionality reduction model to improve the performance. Finally, to reduce the error further, a Genetic Algorithm (GA) is included to obtain the optimal clusters and assign suitable recommendations. The experimental study is carried out on publicly available movie datasets, and the performance of the proposed approach is measured in terms of Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE). The comparative study shows that the proposed approach achieves better performance than state-of-the-art movie recommendation systems.


Introduction
Nowadays, we live in a society where we come across several types of information in daily life. Moreover, the demand for Internet-based content such as online books, movies, images, and web pages is increasing rapidly. With this growth in information, finding the relevant information becomes a tedious task, which can be addressed by developing a support mechanism for decision making. Such a mechanism helps provide relevant information according to the user's interest.
Based on this concept, recommender systems (RS) have attracted the research community and various e-commerce and Internet-based businesses [1]. A recommender system is an automated system that collects information from the user and supports the decision about which information to access. Generally, the information is collected explicitly (based on the user's ratings for different types of items or products) or implicitly (based on the user's behaviour during interaction with the RS) [2]. RS ideas are widely adopted to address the retrieval challenge, in which users enter a query and the retrieval system returns relevant information. These systems are widely adopted in real-time applications such as e-commerce [3,8], e-learning [4], book recommendation [5], and movie recommendation [6].
Online storage of multimedia data is growing on platforms such as YouTube and Netflix, where recommending a suitable video or movie becomes a challenging task. Several techniques have been introduced to develop an efficient and robust system for movie recommendation. These schemes include collaborative filtering [6], content-based filtering [7], demographic [9], knowledge-based [10] and hybrid filtering [11]. Recently, collaborative filtering schemes have attracted considerable attention for recommendation systems. These methods use the user's historical interactions to provide recommendations for the desired application. However, these techniques fail to achieve the desired performance when interactions are sparse, such as on online shopping platforms.
Moreover, these techniques do not recommend a product that has not been rated previously [12]. Content-based recommendation systems work on the user's past ratings. Machine learning, optimization techniques, and evolutionary computation have been widely adopted for recommendation systems. For example, Hassan et al. [13] introduced a neural network-based approach for improving the accuracy of the recommender system, Kuang et al. [14] introduced a fuzzy rule model for e-commerce RS, and Kothari et al. [15] introduced a support vector machine (SVM) recommender system. Moreover, various optimization schemes have been incorporated to improve RS performance.
Katarya et al. [16] improved collaborative filtering (CF) by incorporating particle swarm optimization (PSO); CF has also been improved using cuckoo search optimization [17], grey wolf optimization [18], a genetic algorithm [19], and an evolutionary scheme [20]. However, the existing schemes suffer from various challenges such as the cold-start problem, accuracy, scalability, and privacy. To overcome these issues, hybrid filtering schemes have been introduced recently. These schemes combine the filtering approaches mentioned earlier to devise a new filtering scheme. Paradarami et al. [21] introduced a hybrid recommendation system using a neural network scheme.
Similarly, various hybrid recommender systems have been introduced, such as Hyper [22], a hybrid RS based on context feature relationships [23] and fuzzy logic systems [24]. These recommender systems have several advantages over conventional schemes. However, computational complexity and accuracy remain challenging in this field. Hence, in this work, we focus on developing a novel hybrid approach for a movie recommendation system. The main contributions of the work are as follows:
(a) We introduce an expectation-maximization clustering scheme for data clustering.
(b) We introduce a PCA-based regression model to reduce multicollinearity.
(c) We introduce a genetic algorithm-based optimization scheme to reduce the clustering error and improve prediction.
The rest of the article is organized as follows: Section II describes the proposed system, and Section III presents the experimental analysis and comparative performance. Finally, Section IV presents the concluding remarks and future work.

Proposed Model
This section introduces the proposed hybrid filtering approach for the movie recommendation system. First, we define the movie recommendation problem. Next, we introduce the expectation-maximization clustering scheme to obtain the clusters. In the following phase, we find the nearest neighbour clusters to estimate the ratings.

Problem formulation
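Let us consider that a total of $N$ users have rated $M$ movies $m_{1:M}$. The observed ratings form the training matrix, which is denoted as a set of triples $(u, m, r)$ for user $u$, movie $m$ and rating $r$. The movie rating matrix can be given as follows:

$$R = \begin{bmatrix} r_{1,1} & ? & \cdots & r_{1,M} \\ ? & r_{2,2} & \cdots & ? \\ \vdots & \vdots & \ddots & \vdots \\ r_{N,1} & ? & \cdots & r_{N,M} \end{bmatrix}$$

The user-movie rating matrix is denoted as $R \in \mathbb{R}^{N \times M}$, where "?" marks a missing rating. In this work, our main aim is to predict the user's rating, that is, to replace each "?" in the dataset with a predicted rating, and to recommend new movies from the list based on these predictions. The movie ratings are integer values from 1 to 5. We focus on reducing the RMSE (Root Mean Squared Error) on the given test set during prediction. The RMSE can be expressed as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,m) \in T} \left(r_{u,m} - \hat{r}_{u,m}\right)^2}$$

where $(u, m) \in T$ ranges over the entries of the test matrix $T$, $r_{u,m}$ denotes the actual rating and $\hat{r}_{u,m}$ is the predicted rating for the same entry. Similarly, we try to improve the prediction accuracy using the proposed hybrid filtering.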
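As a minimal sketch of how the RMSE objective above can be evaluated on held-out triples: the function name `rmse`, the toy triples and the constant predictor below are hypothetical and serve only to illustrate the computation, not the authors' implementation.

```python
import math

def rmse(test_triples, predict):
    """Root mean squared error over held-out (user, movie, rating) triples."""
    squared_errors = [(rating - predict(user, movie)) ** 2
                      for user, movie, rating in test_triples]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical held-out ratings on the 1-to-5 scale.
test_triples = [(0, 1, 4), (1, 0, 2), (2, 2, 5)]

def constant_predictor(user, movie):
    """Hypothetical baseline: predict a rating of 3 for every (user, movie) pair."""
    return 3.0

# Squared errors are 1, 1 and 4, so the score equals sqrt(6 / 3) = sqrt(2).
score = rmse(test_triples, constant_predictor)
```

Any rating predictor with the same `(user, movie)` signature can be passed in place of the baseline.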

Proposed Solution
According to the proposed approach, we first consider the movie dataset and apply the expectation-maximization approach to perform the data clustering in the initial stage.
After obtaining the clusters, we identify the nearest clusters and estimate the rating. The estimated rating is then processed through a genetic algorithm to achieve the optimized ranking. Finally, the resulting rankings are used to recommend movies from the list of movies.

Expectation Maximization (EM) Clustering
EAI Endorsed Transactions on Scalable Information Systems Online First Asha K N and R Rajkumar
Several recommendation systems have used K-means clustering, an instance of the Expectation Maximization (EM) algorithm. Generally, this algorithm is used for density estimation based on the distance parameter. The movie datasets contain irregularities: not every user rates every movie, so the data contain missing values, which are difficult to handle using K-means clustering. Missing-value imputation methods could be incorporated, but because they increase complexity we avoid them and instead adopt a Gaussian mixture model-based iterative clustering approach. This scheme starts with an initial estimate and iterates to find the maximum likelihood for the input parameters.
As a result, our user modelling objectives differ from those generally assumed in recommender systems, such as enhancing accuracy or related objective functions. Our intentions also differ from past work on interactive customer modelling, which focuses on following the recipient's goals as the conversation develops but does not usually retain models across several conversations. Our assumption is that by employing an unobtrusively derived user model to guide the platform's search for items to propose, we can improve efficiency and efficacy. Our method assumes that there is a large repository of items to choose from and that describing these items involves a multitude of attributes. When the information is small, or the objects are simple to describe, simpler techniques may be sufficient.
Moreover, the EM approach is more accurate and requires less computational time when compared to K-Means [25].
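Let us consider that a total of $n$ points are given in the data as $\{x_i\}_{i=1}^{n}$. We aim to assign these data points to $K$ clusters. In this work, we model the cluster as a random variable $Z = k$ with multinomial probability $\pi_k$, which satisfies $\sum_{k} \pi_k = 1$, such that

$$P(Z = k) = \pi_k, \quad k = 1, \dots, K$$

The Gaussian distribution is $(X \mid Z = k) \sim \mathcal{N}(\mu_k, \sigma_k^2 I)$, where $I$ is an identity matrix of order $j$, $\mu_k$ denotes the mean, the variances are collected as $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_K)$ and $\phi$ is the distribution function:

$$\phi(x \mid \mu_k, \sigma_k^2 I) = \frac{1}{(2\pi\sigma_k^2)^{j/2}} \exp\left(-\frac{\lVert x - \mu_k \rVert^2}{2\sigma_k^2}\right)$$

Since $Z$ is a hidden variable, the log-likelihood of the complete data can be given as:

$$\ell(\theta; X, Z) = \sum_{i=1}^{n} \sum_{k=1}^{K} \mathbb{1}[z_i = k] \left(\log \pi_k + \log \phi(x_i \mid \mu_k, \sigma_k^2 I)\right)$$

Here, $X$ is the input data, and the conditional expectation of $Z = k$ is represented as:

$$\gamma_{ik} = \mathbb{E}\left[\mathbb{1}[z_i = k] \mid x_i\right] = \frac{\pi_k \, \phi(x_i \mid \mu_k, \sigma_k^2 I)}{\sum_{l=1}^{K} \pi_l \, \phi(x_i \mid \mu_l, \sigma_l^2 I)}$$

Each data point is used for constructing the $\gamma_{ik}$; for any given $\gamma_{ik}$, the parameters can be expressed as follows:

$$\pi_k = \frac{1}{n} \sum_{i=1}^{n} \gamma_{ik}, \quad \mu_k = \frac{\sum_{i=1}^{n} \gamma_{ik} \, x_i}{\sum_{i=1}^{n} \gamma_{ik}}, \quad \sigma_k^2 = \frac{\sum_{i=1}^{n} \gamma_{ik} \lVert x_i - \mu_k \rVert^2}{j \sum_{i=1}^{n} \gamma_{ik}}$$

To obtain the final estimates, we start from initial random mean and variance models $\mu_k^{(0)}$, $\Sigma_k^{(0)}$, $\pi_k^{(0)}$ and iterate these two steps.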
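The EM iteration for a Gaussian mixture described above can be sketched as follows. This is a minimal illustration, assuming a spherical covariance per cluster, fully observed input vectors, and initial means drawn from the data; the function name `em_spherical_gmm`, the toy points and the small numerical guards are assumptions, not details taken from the paper.

```python
import math
import random

def em_spherical_gmm(points, k, iterations=50, seed=0):
    """Fit a k-component spherical Gaussian mixture with EM (illustrative sketch)."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    means = [list(p) for p in rng.sample(points, k)]  # initial means: random data points
    variances = [1.0] * k                             # initial per-cluster variances
    weights = [1.0 / k] * k                           # initial mixing probabilities

    def sq_dist(x, mu):
        return sum((a - b) ** 2 for a, b in zip(x, mu))

    resp = []
    for _ in range(iterations):
        # E-step: responsibilities of each cluster for each point, in log space.
        resp = []
        for x in points:
            logs = [math.log(weights[j])
                    - 0.5 * dim * math.log(2 * math.pi * variances[j])
                    - sq_dist(x, means[j]) / (2 * variances[j])
                    for j in range(k)]
            top = max(logs)
            unnorm = [math.exp(v - top) for v in logs]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: re-estimate weights, means and variances from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12      # guard against an empty cluster
            weights[j] = nj / n
            means[j] = [sum(r[j] * x[d] for r, x in zip(resp, points)) / nj
                        for d in range(dim)]
            spread = sum(r[j] * sq_dist(x, means[j]) for r, x in zip(resp, points))
            variances[j] = max(spread / (dim * nj), 1e-6)  # keep the variance positive
    # Hard assignment: each point goes to its most responsible cluster.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels

# Hypothetical toy data: two well-separated groups of four points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
weights, means, variances, labels = em_spherical_gmm(points, k=2)
```

Working in log space in the E-step avoids underflow when a point lies far from a cluster mean, which matters once the variances shrink.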
Later, we apply cosine similarity to identify the neighbouring clusters. This can be expressed as:

$$\mathrm{sim}(u, v) = \frac{\sum_{m} r_{u,m} \, r_{v,m}}{\sqrt{\sum_{m} r_{u,m}^2} \, \sqrt{\sum_{m} r_{v,m}^2}}$$

where $u$ and $v$ are the users in the cluster and $r$ denotes their corresponding ratings. The selected clusters are processed, and the difference in their ratings is computed. The centroid of the cluster with the smallest difference is assigned as the current rating of the user.
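The neighbour selection step can be illustrated with a short sketch. It assumes that users and cluster centroids are represented as rating vectors of equal length; the function names and the toy vectors are hypothetical.

```python
import math

def cosine_similarity(ratings_u, ratings_v):
    """Cosine similarity between two rating vectors of equal length."""
    dot = sum(a * b for a, b in zip(ratings_u, ratings_v))
    norm_u = math.sqrt(sum(a * a for a in ratings_u))
    norm_v = math.sqrt(sum(b * b for b in ratings_v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is treated as dissimilar
    return dot / (norm_u * norm_v)

def nearest_cluster(user_ratings, centroids):
    """Index of the centroid most similar to the user's rating vector."""
    return max(range(len(centroids)),
               key=lambda j: cosine_similarity(user_ratings, centroids[j]))

centroids = [[5.0, 4.0, 1.0], [1.0, 2.0, 5.0]]  # hypothetical cluster centroids
user = [4.0, 5.0, 2.0]                          # hypothetical user ratings
best = nearest_cluster(user, centroids)         # index of the closest centroid
```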

Multicollinearity reduction scheme
Generally, the clustering scheme suffers from the multicollinearity issue, which reduces the accuracy of the prediction system [26]. PCA (Principal Component Analysis) is considered the most suitable scheme [27] to reduce multicollinearity. In this work, we incorporate a regression model with PCA to overcome multicollinearity. Let us consider a model given as

$$Y = X\beta + \varepsilon$$

where $X$ is the data obtained from the maximum-likelihood model, $Y$ is the response of the PCA regression [28], $\beta$ denotes the constant coefficient vector, and $\varepsilon$ denotes the random error. The regression model of the principal components can be given as:

$$Y = Z\alpha + \varepsilon, \quad Z = XP, \quad \alpha = P'\beta \tag{10}$$

where $P$ is a matrix that satisfies the conditions $P'(X'X)P = \Lambda$ and $P'P = PP' = I$, and $\Lambda$ is a diagonal matrix. Later, we apply an optimization scheme to obtain a reliable rating and improve the system performance.
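A short worked derivation may help show why regressing on principal components mitigates multicollinearity. It writes $Z = XP$ for the component scores and $\alpha = P'\beta$ for their coefficients, and assumes uncorrelated errors with common variance $\sigma^2$ and a truncation to the $q$ largest eigenvalues; these assumptions and the symbols $\sigma^2$, $q$, $z_j$ and $\lambda_j$ are introduced here for illustration only.

```latex
% The two models describe the same fit, because P P' = I:
X\beta = X P P' \beta = Z\alpha .
% The component scores are orthogonal:
Z'Z = P' X' X P = \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_p) .
% Least squares therefore decouples across components:
\hat{\alpha} = (Z'Z)^{-1} Z' Y = \Lambda^{-1} Z' Y ,
\qquad \hat{\alpha}_j = \frac{z_j' Y}{\lambda_j} ,
\qquad \operatorname{Var}(\hat{\alpha}_j) = \frac{\sigma^2}{\lambda_j} .
% Near-collinear columns of X give \lambda_j \approx 0 and hence a very large variance.
% Keeping only the q components with the largest eigenvalues removes those terms:
\hat{\beta}_{PC} = P_q \hat{\alpha}_q ,
\qquad P_q = [\, p_1, \dots, p_q \,] .
```

The choice of $q$ trades a small bias for a large reduction in variance, which is the usual motivation for combining PCA with regression.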

GA Optimization scheme
This section describes the genetic algorithm (GA) [29] used as the optimization scheme to improve the prediction accuracy. This scheme [30] is a population-based stochastic approach that includes several steps, such as parameter initialization, selection, crossover and mutation, and finally yields the optimal solution.

Input: cluster data, number of clusters, GA parameters such as maximum iterations = 20, size of population = 50, crossover probability $p_c$, mutation probability $p_m$
Output: optimal cluster according to the fitness value (distance)
Step 1: Initialize the GA parameters and define the fitness function as $f(ch) = \sum_{x_i \in X} \min_j d(x_i, c_j)$
Step 2: Generate the initial population: generate an initial random population $P(0)$ for each cluster.
Step 8: Select the best-fit chromosome as the optimal solution.
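The listing above can be sketched in code as follows. The source does not spell out the intermediate steps, so the operators used here (tournament selection, uniform crossover, Gaussian mutation, elitism) and the values `p_c=0.8` and `p_m=0.1` are assumptions chosen for illustration; only the population size of 50, the 20 iterations and the distance-based fitness come from the text.

```python
import math
import random

def fitness(chromosome, points):
    """Sum over points of the distance to the nearest centroid (lower is better)."""
    return sum(min(math.dist(p, c) for c in chromosome) for p in points)

def run_ga(points, k, pop_size=50, max_iter=20, p_c=0.8, p_m=0.1, seed=0):
    """Search for k centroids that minimise the fitness (illustrative sketch)."""
    rng = random.Random(seed)

    def score(chromosome):
        return fitness(chromosome, points)

    # Initial population: each chromosome is a list of k centroids drawn from the data.
    population = [[list(rng.choice(points)) for _ in range(k)]
                  for _ in range(pop_size)]
    best = min(population, key=score)
    for _ in range(max_iter):
        offspring = [[c[:] for c in best]]  # elitism: carry the best forward unchanged
        while len(offspring) < pop_size:
            # Tournament selection of two parents (assumed operator).
            mother = min(rng.sample(population, 3), key=score)
            father = min(rng.sample(population, 3), key=score)
            # Uniform crossover at the centroid level, applied with probability p_c.
            if rng.random() < p_c:
                child = [list(rng.choice(pair)) for pair in zip(mother, father)]
            else:
                child = [c[:] for c in mother]
            # Gaussian mutation of each coordinate, applied with probability p_m.
            for centroid in child:
                for d in range(len(centroid)):
                    if rng.random() < p_m:
                        centroid[d] += rng.gauss(0.0, 0.5)
            offspring.append(child)
        population = offspring
        best = min(population, key=score)
    return best, score(best)

# Hypothetical toy data: two well-separated groups of four points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
best_centroids, best_score = run_ga(points, k=2)
```

Because the best chromosome is copied into every new generation, the returned fitness can never be worse than that of the best initial chromosome.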

Results and Analysis
In this section, we introduce the experimental analysis using the proposed approach and compare the obtained performance with existing techniques [34].

Dataset description
To evaluate the performance of the proposed model, we have considered several publicly available datasets. MovieLens dataset: This dataset contains ratings of different movies. It contains 6040 users and ratings for 3952 movies. These ratings are provided on a scale of 1 to 5 stars. From this dataset, we have considered only the users who have provided at least 20 ratings [35][36][37][38][39].
Yahoo movie dataset: In this dataset, the ratings are also provided on a scale of 1 to 5 [40]. The training dataset includes 7642 users, 11915 movies and 211231 movie ratings. The test data includes 2309 users, 2380 movies and a total of 10136 ratings [41].
Netflix dataset: In this dataset, a total of 480,000 customers are identified using a unique ID for each user [42][43][44][45]. Rating data are extracted for over 17000 movies, with each user's unique ID and rating organized for recommendation purposes [46]. This dataset was collected from October 1998 to December 2005 and comprises 100 million ratings [47][48][49]. Table 1 shows a general arrangement of these datasets [50].
Table 1. Movie data description

Performance measurement
The proposed model was implemented using the MATLAB simulation tool on a machine with a 4 GHz processor, 8 GB RAM and the 64-bit Windows 8 operating system [51][52][53][54]. The performance of the proposed model is computed in terms of Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) and compared with the existing techniques [55][56][57][58].
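The mean absolute error is computed as:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| r_i - \hat{r}_i \right|$$

where $N$ denotes the number of users, and the actual and predicted ratings are denoted as $r_i$ and $\hat{r}_i$.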
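A small worked example of the MAE computation, using hypothetical ratings; the function name and the sample values are assumptions made for illustration.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted ratings."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("both lists must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [4, 2, 5, 3]               # hypothetical true ratings
predicted = [3.5, 2.5, 4.0, 3.0]    # hypothetical model outputs

# Absolute errors are 0.5, 0.5, 1.0 and 0.0, so the MAE is 2.0 / 4 = 0.5.
mae = mean_absolute_error(actual, predicted)
```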
Similarly, RMSE can be computed as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( r_i - \hat{r}_i \right)^2}$$

with $N$, $r_i$ and $\hat{r}_i$ defined as for the MAE.
First of all, we measured the performance for a varied number of clusters and compared the MAE and RMSE performance with the existing K-means and cuckoo search-based optimization [17], [59][60][61][62]. Table 2 shows the comparative performance in terms of MAE [63]. To obtain these results, we varied the number of clusters from 4 to 40 [64]. We obtained an average MAE of 0.78 and 0.695 using cuckoo search and the proposed approach, respectively [65][66][67]. Table 3 compares the performance with existing techniques for varied cluster sizes. The proposed approach achieves an MAE of 0.681 and 0.701 for the Yahoo Movie and Netflix datasets, respectively.
Similarly, we measure the performance in terms of RMSE for the MovieLens dataset, and the results are given in Table 5. In this experiment, the proposed approach achieved RMSE values of 1.303 and 1.309 for the Yahoo Movie and Netflix datasets, respectively. These experiments show that the proposed hybrid model achieves better accuracy for movie recommendation.

Conclusion
In this work, we have introduced a hybrid movie recommender system. In the proposed approach, we first apply an EM clustering scheme to group the rating data; the nearest cluster is then selected, and a rating is assigned to each user based on the centroid. In the next stage, we apply a PCA-based regression model followed by a genetic algorithm to minimize the clustering error and achieve better rating prediction accuracy based on the optimal rating. Top neighbours are selected to obtain the recommended movies. The experimental study shows that the proposed approach achieves better performance in terms of RMSE and MAE. In summary: (a) we introduce an expectation-maximization clustering scheme for data clustering; (b) we introduce a PCA-based regression model to reduce multicollinearity; and (c) we introduce a genetic algorithm-based optimization scheme to reduce the clustering error and improve prediction.

Table 2. MAE performance for varied cluster numbers for the MovieLens dataset

Table 3. MAE comparative performance
Similarly, we computed the MAE performance for the Yahoo movie and Netflix datasets. The obtained performance is presented in Table 4.

Table 4. Comparative MAE performance using the proposed approach

Table 5. RMSE performance
Similarly, we have measured the RMSE performance for the Yahoo and Netflix datasets, as given in Table 6.

Table 6. RMSE performance for Yahoo and Netflix data