MFRLMO: Model-free reinforcement learning for multi-objective optimization of Apache Spark

Hyperparameter optimization (HO) is essential for determining to what extent a specific configuration of hyperparameters contributes to the performance of a machine learning task. The hardware and the MLlib library of Apache Spark have the potential to improve big data processing performance when a tuning operation is combined with the exploitation of hyperparameters. To the best of our knowledge, most existing studies employ a black-box approach that yields misleading results because it ignores the interior dynamics of big data processing. They suffer from one or more drawbacks, including high computational cost, large search space, and sensitivity to the dimension of multi-objective functions. To address these issues, this work proposes a new model-free reinforcement learning method for multi-objective optimization of Apache Spark, leveraging reinforcement learning (RL) agents to uncover the internal dynamics of Apache Spark in HO. To bridge the gap between multi-objective optimization and the interior constraints of Apache Spark, our method runs many iterations to update each cell of the RL grid. The proposed model-free learning mechanism achieves a tradeoff between three objective functions: time, memory, and accuracy. To this end, optimal values of the hyperparameters are obtained via an ensemble technique that analyzes the individual results yielded by each objective function. The experimental results show that the number of cores does not have a direct effect on speedup. Further, although grid size has an impact on the time elapsed between two adjoining iterations, its contribution to the computational burden is negligible. Dispersion and risk values of model-free RL differ when the size of the data is small. On average, MFRLMO produced a speedup that is 37% better than those of the competitors. Last, our approach is very competitive in terms of converging to a high accuracy when optimizing Convolutional Neural Networks (CNN).


Introduction
Apache Spark is a prevalent distributed data processing platform that provides various machine learning algorithms to interpret raw data. Generally, machine learning [1,2], fog computing [3], event detection [4], and interactive analysis [5] are the foremost application areas of Spark. Optimization is key to understanding the underlying mechanism of the parameters of Spark, which provides data processing and machine learning methods that enable us to plan computational resources. To obtain the best performance in Spark, it is necessary to seek the ultimate configuration of the parameter set, which is determined before execution and changes depending on the type of data set. That parameter set is called the hyperparameter configuration.
Spark is optimized not only to make the use of time and hardware resources more efficient, but also to obtain reliable measurement methods that evaluate a machine learning model. Therefore, instead of relying on a single objective function, multi-objective function groups should be preferred in HO. Various ways of configuring the parameters of Spark have been commonly applied in this field, including choosing from a machine learning group [6], heuristic methods [6], cost estimation models [7], and utility-based designs [8]. The major drawback in this field is the lack of dynamic models that consider the internal states of HO, as opposed to the traditional black-box approaches used in the tuning of Spark. As such, HO performed with regard to the interaction of sequential tiny tasks is more feasible than establishing optimization models [9] with a single objective.
Few works propose HO with multi-objective designs [8,10]. Although these works optimize Spark with more than one objective, they were not able to combine three or more objective functions. Further, the objective functions are generally constructed in relation to the system's performance. In that case, there is a greater risk that the reliability of the performance evaluation of Spark sub-jobs degrades gradually. Therefore, traditional methods obtain an optimal hyperparameter configuration that is limited to accuracy and execution time.
In the last decade, random search [11], grid search [12], Bayesian optimization [13], and evolutionary methods [14] have been commonly applied to HO problems. However, none of them can always yield the best solution for every type of HO problem due to their design constraints. Instead, a choice can be made with regard to problem-specific constraints and the size of the experimental data. For instance, though grid search is an exhaustive algorithm, it takes a lot of time compared to the alternatives. On the other hand, although Bayesian optimization performs relatively fast in HO, it poses a significant risk that some of the optimal results may be ignored. Since Spark processes resilient distributed datasets (RDDs) and divides the problem into small job tasks, there is a need to develop a sophisticated HO technique to reveal the internal dynamics of big data processing.
While RDD provides explicit parallelism, it brings two critical issues: 1) Since the Spark kernel divides small jobs into threads, it becomes considerably harder to establish a second parallel architecture. A second parallelization is only possible through asynchronous execution [15]. 2) The objective functions are prepared based on speedup. Therefore, increasing the types of objective functions becomes almost impossible.
There are two common targets in tuning the parameters of Spark: storage and execution improvements. While configuring operations such as sort and shuffle addresses execution issues, storage problems concern the caching parameters of HO. Besides, the parameters associated with shuffle and partition are of great importance for achieving optimal performance in big data processing. In rule-based tuning, a user defines the execution strategies and the performance is improved without applying expert knowledge of Spark [16]. But that operation is very effort intensive, and it is very difficult to find the optimal hyperparameter configuration when no expert opinion is available. It is thus immensely challenging to find the best combination of a set of specific hyperparameters.
To yield minimum execution time, the values of hyperparameters can be represented as bytecode definitions [17]. However, if the search space is very large, search time increases tremendously. To shorten that time, the less effective hyperparameters can be ignored in small iterations [18]. However, the optimal hyperparameter set may be lost if the threshold of trial execution is not set well. Therefore, detecting the most effective hyperparameters in HO is an alternative to ignoring the less effective ones. The target hyperparameter set is produced by leveraging different log files of the execution with the help of a block diagram [19]. In this way, some of the hyperparameters constituting the search space are very difficult to hold out of scope when there are a great number of setting options. In adaptive methods, resource sharing [20] is reconsidered in each iteration of HO and the changes in workload are observed [21][22][23]. However, concurrent workloads in online systems may cause adverse effects. Instead, optimal performance can be achieved by leveraging the hyperparameters detected during the training of machine learning methods [16,24]. But that approach is a black-box model that is strongly related to the internal settings of Spark and the hardware. Retrieving training data is very costly in machine learning-based models and it necessitates different execution trials to avoid misleading results. If we define the basic execution steps of Spark as a group of interrelated processes, reinforcement learning (RL), which is one of the popular machine learning techniques, fits that definition well. RL follows the major principles of the Markov Decision Process (MDP) and it has been tested with a Factor Analysis, utilizing a dynamic configuration that makes reward values more informative in terms of performance metrics [25]. But multi-objective optimization is not considered in that study. Metareasoning techniques are of great importance for deciding optimal stopping criteria and they are very useful for planning HO. They have been used to increase the comprehensiveness of approaches that achieve optimal solutions using quality and time features [26,27]. Although metareasoning techniques have yielded promising results in robot path planning and classification problems, interpreting model-based techniques is very difficult when the environment dynamics are quite complex.
The aforementioned studies have the following drawbacks: 1. They are mostly designed with objective functions based on speedup. 2. Seeking the optimal hyperparameters is not easy when the search space is very large. 3. The training time increases if the features of Spark are involved in machine learning processes. 4. The models developed for HO are designed with a black-box approach that disregards the internal dynamics of Spark. Therefore, the relationship between the configuration of the hyperparameter set and the obtained performance cannot be explicitly defined.
The disadvantages of existing works lead to a waste of computational resources or result in longer training times. This study aims to develop a multi-objective HO method that utilizes the major principles of model-free reinforcement learning [28]. In this way, we provide new insight into how the internal dynamics of Spark affect HO efficiency through a white-box approach. The method produces optimal hyperparameters for three objective functions (memory, time, and accuracy) by means of an ensemble technique. Hence, the tradeoffs are found for three objective functions. To this end, a dynamic set of hyperparameters, modeled with 14 grids, is retrieved from a search space created with grid search. Thereafter, the HO is completed with the ultimate rewards updated in each iteration of RL. Before configuring the hyperparameters of ml_multilayer_perceptron_classifier, Direct Search is performed for binary classification data sets. One of the distinctive properties of the proposed method is that, owing to its model-free design, it does not include the properties of the environment in the training data. In this respect, the computational burden of determining which feature contributes to the performance of HO is alleviated. A multi-objective design is a requirement for hyperparameter optimization since traditional optimizers generally choose one target and do not seek a tradeoff between various purposes.
The study claims the following contributions: 1. Unlike the methods in the literature, MFRLMO is the first model-free adaptive HO method developed for Spark. 2. MFRLMO is devised not just for tuning machine learning algorithms; it can also configure the hyperparameters of HiBench benchmarks. 3. Compared with the state-of-the-art methods, it produces the highest speedup. 4. If MFRLMO is used for tuning CNN, accuracy is improved up to 150 iterations in the observation of training. 5. The execution memory and the number of cores are the most effective hyperparameters for MFRLMO. 6. Compared to the state-of-the-art, MFRLMO is much more suitable for large-scale tuning setups when the HO budget limit is not low.
The rest of the study comprises six sections. Section 2 summarizes the related works. The method is elucidated in Section 3. The details of the experiment are presented in Section 4. Section 5 discusses the results obtained from the comparison methods. The paper is concluded in Section 6.

Related works
Robotics manipulation is one of the real-world application areas of RL. In particular, RL is utilized for the problems encountered while guiding the moves of robots. Generally, robot navigation studies focus on two approaches: heuristics and learning-based techniques. Concretely, RL is very useful for collecting or squeezing objects with robots. In the case of a limited budget, the constraints should be considered in real-time scenarios [29]. RL may be combined with traditional control schemes for unprecedented circumstances. In this context, the results produced with such techniques in the simulation environment are very promising for real-world problems [30]. Cooperative exploration strategies help reduce the time allocated for training RL [31]. Hence, the robots move more reliably in settings where there is little or no information about the obstacles. Oliff et al. investigated the benefits of RL in manufacturing techniques [32]. The optimal policy achieved under changing environmental conditions may lead to faster degradation of performance. In the meantime, RL has been tested for the coordination of multiple robots [33,34]. To that end, the data trained with CNN and the instances collected with RL constitute the elements of the experiment. Although simulation environments contribute to finding solutions for real-world problems, RL requires sophisticated HO methods to improve the reliability and the generalizability of training performance. If RL is employed for finding paths for robots in cluttered terrain, the time and range spent reaching the target can be reduced. However, diversifying the conditions of the environment chosen for testing robots enhances the generalizability of RL techniques [35].
Game theory is an active research field that utilizes RL. Model-based RL is commonly applied in game theory due to the use of environmental features. The policy parameter of RL is optimized to support the models of game theory [36]. The coalition formula of game modeling is devised for reducing the error of convergence. To this end, approximated RL is one of the solutions for fixing problems that may arise in training. The anticipation of the levels of opposing agents in game theory has unresolved issues. For instance, how to proceed with the interaction of agents having various levels is a complex process that demands high computational effort [37]. Establishing a learning-based scheme for mobile social networks without knowing the details of network parameters may lead to security deficiencies. To solve that problem, a novel Q-learning-based edge caching strategy was designed [38]. Although the method, tested on the Stackelberg game, has shown resistance to degradation of caching quality as the number of cost parameters increases, it should be combined with special techniques that have the potential to reduce training time. Unifying game theory with deep learning resulted in a remarkable increase in the success of simulation-based anticipation studies [39]. However, cross-validation may be tried to avoid overfitting when there are independent experimental data sets. Ahad et al. [40] utilized RL to choose a path for 5G-based network packet transmission. RL helped to reduce energy consumption during the transfer of data. Bui et al. [41] proposed an RL-based technique to plan the budget of trading costs in grid systems. They stressed that the number of RL episodes should be high in order to compare the congestion of internal trading and external trading.
If we want continuous improvement with respect to the feasibility of the chosen hyperparameter values, RL can be considered as an auxiliary method. In Dong et al.'s work, CNN-based algorithms, which trace target objects defined by a user, were tuned with RL and the results were compared with those of the default configuration. Specifically, RL contributed to the acceleration of object tracking. If the use of the model is driven dynamically, the complexity of the related algorithm increases depending on the iteration of RL [42]. In such cases, model-free RL is an alternative way to alleviate the computational burden posed by a high number of iterations. Preceding works revealed that if RL is preferred in HO, the budget allocated for computing is reduced [43]. But if the features of the environment are not involved in the training data set, a high number of training iterations is needed. RL was investigated in the multi-objective optimization problem of image classification data to improve the success of CNN [44]. Some performance metrics, such as accuracy, are of vital importance in deciding whether a hyperparameter set is optimal. Another metric, latency, refers to the time elapsed between two consecutive iterations. Although [44] was able to achieve significant success for these two objectives, ignoring memory consumption in the scalarization function of the reward limits the generalizability of the results.

Background
In this section, basic definitions and proofs are presented along with the formulations explaining the relationship between HO and RL. We then elaborate on the structural properties of model-free reinforcement learning that differentiate it from model-based approaches.

Hyperparameter optimization
Let SS be a search space representing the hyperparameters of Spark and MLlib. Assuming h_1, h_2, ..., h_m is the optimal hyperparameter configuration chosen from that search space, there are m hyperparameters. Applying one function f → SS to the search space means performing single-objective optimization. f may be related to classification accuracy, memory consumption, or time. On the other hand, a set of functions f_1, f_2, ..., f_n → SS denotes multi-objective optimization and should satisfy n ≥ 2. Each f decides whether the optimization meets the minimization or the maximization criterion. For instance, if the objective function is associated with time, minimization is preferred.
Definition 1. Hyperparameter: Let T_r be training, V be validation, and T be testing. A hyperparameter is a parameter applied between hf → S and T_r, whose effects are observed in V along with the results at T.
Definition 2. Search space: S is a set of values obtained from experience gained in the preceding studies of HO.

Model-free reinforcement learning
Contrary to model-based learning, model-free RL does not require the agent to model the details of the environment. Instead, new actions are decided by checking the rewards produced in the previous episodes. Since there is no modeling of the environment, the computational cost of feature selection is alleviated as well. However, to achieve the ultimate solution, model-free RL necessitates a great number of training iterations. As such, unlike black-box approaches, a distinct mechanism is adapted for improving traditional HO techniques.
Assume an agent is associated with an action A_t, a state S_t, and a reward R sequentially at discrete time points. When the learning process is formulated as a Markov Decision Process (MDP), the transition function is denoted with a tuple (S, A, R, σ) in which σ: S × A × R → S generates a new state in the episodes. A probabilistic policy p: S → A manages the agent over the states, and the value of each state-action pair S − A is yielded as in Equation 1. Those values are called Q-values.
where γ is the discount factor quantifying how much importance will be given for future rewards.
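For reference, the textbook action-value function that Q-values usually denote can be written as follows. This is a sketch of the standard definition, not necessarily the authors' exact notation: the expectation operator, the index k, and the subscripted rewards R_{t+k+1} are assumptions introduced here.

```latex
Q^{p}(s, a) = \mathbb{E}_{p}\left[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s,\ A_t = a \right],
\qquad 0 \le \gamma \le 1
```

With γ close to 0 the agent values only immediate rewards, while γ close to 1 gives distant rewards nearly the same weight as immediate ones.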
Definition 3. Exploitation: a mathematical indicator showing that agents have produced the best results up to the current episode.
Definition 4. Exploration: inversely proportional to exploitation; it refers to the long-term search for the optimal strategy.
The balance between exploitation and exploration is determined with ϵ and ϵ = 0.3 means that 30% of the total actions are reserved for exploration and the rest of them are used for exploitation.
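The role of ϵ can be illustrated with a minimal ϵ-greedy selection sketch in Python. The function names and the sample Q-values are hypothetical and serve only as an illustration; they are not part of MFRLMO's published implementation.

```python
import random

def epsilon_greedy(q_values, epsilon=0.3, rng=random):
    """Return an action index: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        # Exploration: choose any action uniformly at random.
        return rng.randrange(len(q_values))
    # Exploitation: choose the first action with the highest Q-value.
    return q_values.index(max(q_values))

def exploration_share(epsilon, trials=100_000, seed=0):
    """Empirical fraction of decisions that fall into the exploration branch."""
    rng = random.Random(seed)
    explored = sum(rng.random() < epsilon for _ in range(trials))
    return explored / trials

if __name__ == "__main__":
    print(epsilon_greedy([0.1, 0.9, 0.2], epsilon=0.3))
    print(exploration_share(0.3))  # close to 0.3
```

With ϵ = 0.3, roughly 30% of the decisions are random (exploration) and the remaining 70% pick the currently best-known action (exploitation).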
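Theorem 1. New Q-values can be formulated with a constant a ∈ (0, 1) as
$$Q_{n+1} = (1 - a)^{n} Q_1 + \sum_{i=1}^{n} a (1 - a)^{n-i} R_i$$
where n is the number of iterations. In this weighted average, $(1 - a)^{n} + \sum_{i=1}^{n} a(1 - a)^{n-i} = 1$.
Proof. The weight $a(1 - a)^{n-i}$ given in Theorem 1 depends on how long ago R_i was produced. Since 1 − a is less than 1, the weight given to R_i decreases as the number of subsequent rewards increases. Summing the geometric series gives $\sum_{i=1}^{n} a(1 - a)^{n-i} = 1 - (1 - a)^{n}$, so adding $(1 - a)^{n}$ makes the weights sum to 1.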
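The weighted-average identity in Theorem 1 can be checked numerically. The sketch below assumes the usual constant-step-size update Q ← Q + a(R − Q) as the incremental form behind the theorem; the function names and sample rewards are hypothetical.

```python
def incremental_q(q1, rewards, a):
    """Apply the constant-step-size update Q <- Q + a * (R - Q) for each reward."""
    q = q1
    for r in rewards:
        q = q + a * (r - q)
    return q

def weighted_q(q1, rewards, a):
    """Closed form: (1 - a)^n * Q1 + sum_i a * (1 - a)^(n - i) * R_i."""
    n = len(rewards)
    total = (1 - a) ** n * q1
    for i, r in enumerate(rewards, start=1):
        total += a * (1 - a) ** (n - i) * r
    return total

def weight_sum(n, a):
    """Sum of all weights in the closed form; should equal 1 for any n and a in (0, 1)."""
    return (1 - a) ** n + sum(a * (1 - a) ** (n - i) for i in range(1, n + 1))

if __name__ == "__main__":
    rewards = [1.0, 2.0, 3.0]
    print(incremental_q(0.0, rewards, 0.5), weighted_q(0.0, rewards, 0.5))
    print(weight_sum(50, 0.1))
```

Both forms give the same value, and recent rewards dominate because older rewards are multiplied by a higher power of (1 − a).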
Theorem 2. Given a state set S_1, S_2, ..., S_u and an action set A_1, A_2, ..., A_v with v ≤ u, a gridworld environment is defined with limited computation resources.
Proof. Assume the number of training iterations is x = 50, the number of states is y, and the number of actions is z. For 1000 bootstrapping random walks, state transitions overlap at least x/(z^y − y^y) times. For (x = 50, y = 3, z = 4), the overlapping occurs 2.85 times. To prevent that problem, the number of actions should be made as large as possible.
Theorem 3. If ϵ-greedy is applied to find a tradeoff between exploration and exploitation, ϵ = (1/x) + v is a general formula for ϵ = v, in which x is the number of iterations.
Proof. As the iterations increase, 1/x → 0 for x → ∞. For instance, if v is 0.2, the 1/x term of the formula goes to 0 after a specific number of iterations and v becomes the ultimate value.
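The decay described in Theorem 3 can be sketched as a small schedule function. Clamping the value to 1.0 for the earliest iterations is an assumption added here so that ϵ remains a valid probability; the function name is hypothetical.

```python
def epsilon_schedule(x, v=0.2):
    """Exploration rate at iteration x (1-based): 1/x + v, clamped to at most 1.0."""
    if x < 1:
        raise ValueError("iteration index x must be at least 1")
    # The clamp is an assumption: at x = 1 the raw formula would exceed 1.
    return min(1.0, 1.0 / x + v)

if __name__ == "__main__":
    for x in (1, 2, 10, 100, 10_000):
        print(x, round(epsilon_schedule(x), 4))
```

Early iterations are dominated by exploration, and the rate settles at the floor v as x grows.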

MFRLMO
The main steps of MFRLMO are given in Figure 1. Different from model-based approaches, it manages the HO process according to the results of action-reward interactions. In that process, the experimental data is converted to RDD after feature selection. Once the cross-validation partitions are generated, a dynamic cell update is done in RL depending on the obtained reward values. Subsequently, the reliability of the RL model is evaluated with dispersion and risk analyses. In the last step, the success of the optimal configuration is interpreted with various performance parameters such as accuracy.
MFRLMO includes a definition process, since the experimental data is bound to a Spark connection object that holds contextual metadata including start time, identifiers, and session. The steps of MFRLMO are presented in Algorithm 1. The label column (0/1) of classification data sets is converted to factor values (true/false) in Step 1 by the Preprocessing function. In Step 2, the experimental data sets are subjected to feature selection with the Direct Search algorithm, which works as follows: 1) It determines a K value that gives the number of features to be chosen. 2) K features constitute the ultimate feature set, employing a threshold value that eliminates some features. To perform feature selection, the formula given in Equation 2 is applied to the data sets.
where R^2 is a coefficient used for continuous features. While SS_r refers to the sum of squared residuals, SS_t is the sum of squares. Direct Search has been executed with the help of the FSinR library of R [46]. Step 3 adds the new form of the data to the Spark connection, and in Step 4 D_2 is divided into 10 parts to assign D_3 in 10x10 cross-validation. hss is composed of values retrieved from a search space generated by grid search. count, given in Step 5, refers to the number of updates of the 4x4 grid env. Initially, the cells of the 4x4 grid are randomly generated values from the hyperparameter search space. They are renewed by checking the rewards obtained from subsequent actions. For instance, let (0.1,0.3,0.4) be a hyperparameter tuple of the hyperparameter set of a specific cell. If a remarkable increase is observed in the reward after that tuple is updated to (0.1,0.4,0.4), (0.1,0.5,0.4) is taken from the search space generated with grid search to update the RL grid having 14 cells. Hence, a dynamic grid is defined as model-free RL. The update mechanism of the 4x4 grid is [...]. The exploration potential of the agent may be limited due to a high alpha in a modeled environment.
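The feature-scoring step can be illustrated with a short sketch. It assumes that Equation 2 is the usual coefficient of determination, R^2 = 1 − SS_r/SS_t, with SS_t taken as the total sum of squares around the mean; the helper names and the way the top K features are picked are hypothetical and do not reproduce the FSinR implementation.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_r / SS_t (assumed form of Equation 2)."""
    mean_y = sum(y_true) / len(y_true)
    ss_r = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))   # squared residuals
    ss_t = sum((y - mean_y) ** 2 for y in y_true)              # total sum of squares
    if ss_t == 0:
        raise ValueError("y_true is constant, so R^2 is undefined")
    return 1 - ss_r / ss_t

def select_top_k(scores, k):
    """Keep the K features with the highest score (hypothetical selection rule)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

if __name__ == "__main__":
    print(r_squared([1, 2, 3], [1, 2, 3]))
    print(select_top_k({"f1": 0.9, "f2": 0.1, "f3": 0.5}, k=2))
```

A feature whose predictions explain none of the variance scores 0, and a perfect predictor scores 1, so ranking by this score and cutting at K mirrors the two-step description above.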
Step 16 records Q values depending on the state-action interaction for 50 iterations of 14 grids.In Steps 17-18, the mean hyperparameter values of three objective functions are calculated along with the rewards.
Step 20 returns the average reward and optimal hyperparameter configuration.
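The ensemble over the three objective functions can be sketched as plain averaging. This is a minimal illustration under the assumption that each objective (time, memory, accuracy) yields one reward and one hyperparameter set, and that the ensemble is an unweighted mean; the data layout and all numeric values are hypothetical.

```python
from statistics import mean

def ensemble(results):
    """Average rewards and hyperparameter values across per-objective results.

    results maps an objective name to {"reward": float, "hyperparameters": {name: value}}.
    """
    average_reward = mean(r["reward"] for r in results.values())
    names = next(iter(results.values()))["hyperparameters"].keys()
    mean_hyperparameters = {
        name: mean(r["hyperparameters"][name] for r in results.values())
        for name in names
    }
    return average_reward, mean_hyperparameters

if __name__ == "__main__":
    # Hypothetical per-objective outcomes, for illustration only.
    results = {
        "time":     {"reward": 0.6, "hyperparameters": {"step_size": 0.02, "tol": 1e-4}},
        "memory":   {"reward": 0.9, "hyperparameters": {"step_size": 0.04, "tol": 1e-5}},
        "accuracy": {"reward": 0.3, "hyperparameters": {"step_size": 0.03, "tol": 1e-6}},
    }
    print(ensemble(results))
```

A weighted mean would be a natural variation if one objective should dominate, but the source only states that mean values are calculated.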

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{3}$$
To measure the time difference between the two grid types, a mini-experiment is performed for various numbers of iterations. The time-iteration results given in Figure 3 show that the effect of the grid type is negligible, although the difference becomes clear after 200 iterations. The more cells we have, the more often the design of HO is updated, so a 4x4 grid is preferred in the experiment. Notably, the trends observed in the two grid types are quite similar in the way they rise and fall.

Experiment
In the experiment, the structure of the 4x4 gridworld is given in Figure 4. Here, the shaded cells are terminal states that halt the iterations. Otherwise, four actions, "up", "down", "right", and "left", are considered for each state. If an accuracy above 97% is achieved, the optimal configuration of hyperparameters is returned by Algorithm 1 regardless of whether the end of the iterations has been reached.
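A 4x4 gridworld with four actions can be sketched with tabular Q-learning as follows. This is a generic illustration, not Algorithm 1: the positions of the terminal cells (two opposite corners), the reward of −1 per move, and the values of alpha and gamma are assumptions, and the accuracy-based early stop is omitted.

```python
import random

SIZE = 4
TERMINALS = {(0, 0), (3, 3)}  # assumed positions of the shaded terminal cells
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Move inside the grid; bumping into a wall leaves the state unchanged."""
    dr, dc = ACTIONS[action]
    row = min(max(state[0] + dr, 0), SIZE - 1)
    col = min(max(state[1] + dc, 0), SIZE - 1)
    nxt = (row, col)
    return nxt, -1.0, nxt in TERMINALS  # assumed reward of -1 per move

def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    """Tabular Q-learning with epsilon-greedy action selection."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(SIZE) for c in range(SIZE)]
    q = {(s, a): 0.0 for s in cells for a in ACTIONS}
    starts = [s for s in cells if s not in TERMINALS]  # 14 non-terminal cells
    for _ in range(episodes):
        state, done = rng.choice(starts), False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(list(ACTIONS))                  # explore
            else:
                action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit
            nxt, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(nxt, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = nxt
    return q

if __name__ == "__main__":
    q = train()
    print(max(ACTIONS, key=lambda a: q[((0, 1), a)]))
```

Under these assumptions the grid has 14 non-terminal cells, and the learned Q-values become negative because every move is penalized until a terminal cell is reached.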

Settings
The machine used for the experiment has been configured to run Spark and Yarn with the following features: CentOS Linux, 64-bit, Intel(R) Xeon(R) 2.9 GHz, 24 CPU cores, and 222 GB RAM. To generate a Yarn cluster, the machine has allocated four workers, each of which has four cores. In total, 30 GB of memory has been spent on the Yarn experiment for each worker. The algorithm and the details of the experiment have been coded with R version 4.0.2.

Search space
The search space of the experiment is generated with the hyperparameters of ml_multilayer_perceptron_classifier and the parameters of Spark. Although MLlib can run a great number of supervised and unsupervised learning algorithms, the algorithm chosen for classification is the feedforward artificial neural network, which constitutes the fundamentals of deep learning. The hyperparameter layers is not involved in the tuning process. The reason is that the number of features does not change after feature selection. Further, it is not possible to change the number of features in the output layer since the type of classification is binary. In the first layer, [...] Table 1. Moreover, the hyperparameters of CNN are given in that table since one of the comparison methods was tested on CNN using image classification data sets. Therefore, the ranges of the hyperparameters have been set with regard to the baselines. Even though Spark has a large number of configurable parameters, an optimal hyperparameter group has been created considering pioneering studies [10,47]. On the other hand, since multi-objective optimization takes a lot of time compared to single-objective optimization, the number of hyperparameters has been kept to a minimum to alleviate the computational burden. In total, 17 hyperparameters and parameters of Spark are subjected to HO. The hyperparameter ranges have been set as large as possible due to the need for a high number of iterations in model-free RL. As such, employing constantly changing cells of the grid has bolstered the comprehensiveness of the hyperparameter configurations.
[Algorithm 1 excerpt: averageReward ← getEnsembleReward(model); meanHyperparameter ← getEnsembleHyperparameter(model)]
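How a grid-search space can seed the cells of the RL grid is sketched below. The parameter names are real Spark and MLlib settings, but the value lists are hypothetical and much smaller than the ranges in Table 1; the helper names are also assumptions.

```python
import random
from itertools import product

def grid_search_space(ranges):
    """Enumerate every combination of the candidate values (Cartesian product)."""
    names = list(ranges)
    return [dict(zip(names, combo)) for combo in product(*(ranges[n] for n in names))]

def initial_grid(space, size=4, seed=0):
    """Fill a size x size grid with configurations drawn at random from the space."""
    rng = random.Random(seed)
    return [[rng.choice(space) for _ in range(size)] for _ in range(size)]

if __name__ == "__main__":
    # Hypothetical candidate values, for illustration only.
    ranges = {
        "spark.executor.cores": [4, 8, 16],
        "spark.memory.fraction": [0.4, 0.6],
        "step_size": [0.01, 0.03],
    }
    space = grid_search_space(ranges)
    grid = initial_grid(space)
    print(len(space), grid[0][0])
```

The size of the space grows as the product of the number of candidate values per parameter, which is why the number of tuned hyperparameters is kept small.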

Data sets
The data sets presented in Table 2 are of three types: Spark benchmark, classification, and Keras. After executing Apache Hadoop YARN, the benchmarks of HiBench are included in the experiment to evaluate the success of Algorithm 1 in terms of resource management. The second type consists of classification data sets that are suitable for the algorithms of MLlib. Image processing data sets have been retrieved from the Keras library of R packages [48]. WordCount is the largest benchmark and was designed for counting a word in a text corpus. TeraSort performs sorting on big data leveraging Hadoop. Kmeans is an iterative clustering algorithm that uses distributions such as the Gaussian and is available in MLlib. Bayes is also a multi-label classification algorithm that works via MLlib. That algorithm is involved in the experiment since it is based on various mathematical inference techniques. Dense was originally devised for developing neural network models in classification. Microsoft consists of instances collected from a security intrusion contest and it has 1805 features. The Payload data set, which has 32 features, was generated from a research project conducted by Politecnico di Milano University. The Santander data set includes completely numeric values comprising 199 features that were retrieved from a customer transaction prediction platform, and the other data sets were retrieved from Kaggle. We employ the same data set group chosen in our preceding study, since in that study Pareto-based multi-objective optimization produced promising results [49].

Performance parameters
There exist various measures to evaluate the reliability of RL methods [50].They are generally employed for interpreting variability during training.Two types of RL evaluation measures were chosen in the experiment: Dispersion and risk.Different from distribution, dispersion shows to what extent the data is aggregated [51].To measure dispersion, the reward values were exposed to the Coefficient of quartile variation (CQV) designed by improving the Inter-quartile range (IQR) as defined in Equation 4.
where Q_3 and Q_1 denote the third and first quartiles, giving the general dispersion of data sets. CQV is preferred in the experiment since it is much more suitable for asymmetric distributions. The cqv_versatile function of the R cvcqv library has been combined with adjusted bootstrap percentile (BCa) statistical analysis [52].
A complementary method for dispersion, called Conditional Value at Risk (CVaR), is utilized to measure how heavy the lower tail of the distribution is. CVaR is a numerical measure defining the risk of the occurrence of the worst scenario with respect to dispersion and it is calculated as in Equation 5. CVaR has been recorded over varying numbers of iterations.
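The dispersion measure can be sketched in a few lines. The sketch assumes the common form CQV = (Q_3 − Q_1)/(Q_3 + Q_1) × 100 for Equation 4 and uses Python's inclusive quartile method; it does not reproduce the bootstrap (BCa) confidence intervals of the cvcqv library, and the sample rewards are hypothetical.

```python
from statistics import quantiles

def cqv(values):
    """Coefficient of quartile variation in percent: (Q3 - Q1) / (Q3 + Q1) * 100."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    if q1 + q3 == 0:
        raise ValueError("Q1 + Q3 is zero, so CQV is undefined")
    return (q3 - q1) / (q3 + q1) * 100

if __name__ == "__main__":
    rewards = [1, 2, 3, 4, 5]  # hypothetical reward values
    print(round(cqv(rewards), 2))
```

Because it relies only on quartiles, the measure is less affected by extreme rewards than a standard-deviation-based coefficient of variation.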
where φ ∈ (0,1) and VaR_φ(V) represents the value at risk for the related quantile. It is also used for measuring risks in various research fields including finance [53], robotics [54], and health [55]. The results were yielded with the cvar library of R, setting standard deviation and quadratic form (QF) distributions. The efficacy of the tuning processes in HiBench benchmarks was evaluated by speedup. While perf_0 of the speedup given in Equation 6 is the time recorded with the default settings, perf_1 denotes the time after applying the tuning process. Since the machine utilized in the experiment enables us to increase the number of cores up to 24, speedup was recorded between 4 and 24 cores.
EAI Endorsed Transactions on Scalable Information Systems, Online First
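The risk and speedup measures can be illustrated with a small sketch. It assumes an empirical lower-tail CVaR (the mean of the worst φ fraction of the rewards) and the ratio form speedup = perf_0 / perf_1; both are common definitions but may differ from the exact Equations 5 and 6, and the function names are hypothetical.

```python
def cvar_lower_tail(values, phi=0.05):
    """Mean of the worst phi fraction of the values (empirical lower-tail CVaR)."""
    if not 0 < phi < 1:
        raise ValueError("phi must lie in (0, 1)")
    ordered = sorted(values)
    k = max(1, int(len(ordered) * phi))  # tail size; at least one sample
    tail = ordered[:k]
    return sum(tail) / len(tail)

def speedup(perf0, perf1):
    """Ratio of the default-settings time to the tuned time (assumed form)."""
    if perf1 <= 0:
        raise ValueError("tuned time must be positive")
    return perf0 / perf1

if __name__ == "__main__":
    rewards = list(range(1, 101))          # hypothetical rewards 1..100
    print(cvar_lower_tail(rewards, 0.1))   # mean of the 10 smallest values
    print(speedup(120.0, 60.0))            # tuned run is twice as fast
```

A speedup above 1 means the tuned configuration finished faster than the default one, while a low CVaR signals that the worst training outcomes are far below the typical reward.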

Competitors
MFRLMO is compared with Effective Multi-Objective Reinforcement Learning (EMORL) [44] and Hyperparameter Optimization by Reinforcement Learning (Hyp-RL) [43]. Two criteria are used for choosing the baselines: 1) The comparison method should have been published in the recent past, and 2) The baselines regard RL as a tool to solve an HO problem rather than aiming at configuring the hyperparameters of RL. Hyp-RL works as follows: 1) It asks users to choose a data set D and the inputs of hyperparameters such as Λ. 2) RL initiates Q-networks. 3) The reward and Q-values are updated along with the data of the minibatch. 4) The Q-network is re-defined for the reward and discount factor having the maximum Q-value. 5) The set of Q-values is returned at the end of the iteration. Hyp-RL was compared against five alternatives in estimating the life of large batteries.
EMORL has five steps to perform tuning and it can be summarized as follows: 1) The user is asked to give a target hyperparameter set after assigning the first state of RL. 2) The accuracy and latency changes are observed while sampling new hyperparameters during the state transitions. 3) The Q-value is updated with the help of the PPO-clip method [57] to return the optimal value for each hyperparameter.
The reasons we compare the proposed method against Hyp-RL and EMORL are as follows: 1. EMORL is a multi-objective tuning technique and it is compatible with big data processes in updating optimal hyperparameter sets by means of RL. It includes a formula, namely latency, indicating the magnitude of the reward signal. EMORL performs a two-objective (accuracy-latency) optimization by using that formula. The complexity of EMORL is O(n_h * k_LAT * k_accuracy), wherein n_h is the number of hyperparameters, k_LAT is the constraint of latency, and k_accuracy is the constraint of accuracy. Since EMORL is defined considering various constraints, dynamic changes are observed depending on the type of optimization problem. 2. Hyp-RL updates reward values in each episode of actions. In this respect, Hyp-RL resembles the proposed method in design.
The Q-values have been produced for each objective function of MFRLMO as presented in Table 3. The three-objective optimization is established for 50 iterations of each data set, which results in 3x50=150 Q-values. In that table, the bold-faced values represent the state-action pair closest to 97% accuracy. It is worth noting that the negative values in the transitions of other cells are yielded by the model-free approach. Since the experiment exploits 11 data sets in total, 11x150=1650 tables should be evaluated. However, since the performance metrics are only associated with the changes in rewards, Q-values have only been used for controlling the internal dynamics of the cells.

Results
The optimal values of the hyperparameters were found within a low number of iterations in classification, as given in Tables 4-5. We can conclude that the variability is relatively high for the hyperparameters tol and step_size. However, since only one algorithm of the MLlib library was tested in the experiment, a more general result could be achieved by increasing the number of algorithms. On the other hand, the parameters that are directly related to the size of the data, such as spark.memory.fraction, differ across the HiBench benchmarks. The largest value of spark.executor.memory, which defines the memory allocated for execution, was detected in WordCount (120GB). The storage memory was set to 0.4 since TeraSort requires large memory. However, the compression rate is set by MFRLMO without considering the type of benchmark. Although KryoSerializer performs faster than JavaSerializer, it was not chosen for the optimal configuration due to the need for registering classes. As for the speedup results, MFRLMO traces a relatively more linear line in machine learning processes, whereas TeraSort and WordCount, which are workloads, have low acceleration, as given in Figure 5.
Since Hyp-RL has a high complexity, it has a stable speedup although it shows a remarkable improvement in accuracy. It is thus not effective with more than 8 cores. In particular, EMORL differs from Hyp-RL at a high number of cores. However, MFRLMO is similar to Hyp-RL in terms of convergence. Regardless of the type of benchmark, it is clear that each algorithm has a specific threshold. For instance, the threshold of MFRLMO for TeraSort is 20. Exceeding that value does not contribute to speedup and leads to a waste of computational resources. Although the threshold of Hyp-RL is very high for workload-type benchmarks, it has a lower speedup than MFRLMO since Hyp-RL is devised as a two-objective optimization. That finding indicates that Hyp-RL is much more suitable for low-budget HO operations. These results support the inference that the threshold for the number of cores is 16 regardless of the type of benchmark when the optimization is based on RL.
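One way to make the notion of a core threshold concrete is sketched below. It assumes that speedup is the runtime at the smallest core count divided by the runtime at a given core count, and that the threshold is the last core count whose successor improves speedup by less than a relative tolerance. Both the function names and the toy runtimes are hypothetical and are not taken from the experiments.

```python
def speedup(runtimes):
    """Map core count -> speedup relative to the smallest core count measured."""
    base = runtimes[min(runtimes)]
    return {cores: base / t for cores, t in sorted(runtimes.items())}


def core_threshold(runtimes, min_gain=0.05):
    """Return the core count after which adding cores no longer pays off.

    A step counts as worthwhile only if it raises speedup by at least
    `min_gain` (relative). This tolerance is an assumption of this sketch.
    """
    s = speedup(runtimes)
    cores = list(s)
    for prev, nxt in zip(cores, cores[1:]):
        if s[nxt] - s[prev] < min_gain * s[prev]:
            return prev
    return cores[-1]


# Toy runtimes in seconds per core count (made-up numbers for illustration).
toy_runtimes = {1: 100.0, 2: 55.0, 4: 30.0, 8: 18.0, 16: 12.0, 32: 11.9}
print(core_threshold(toy_runtimes))
```

Because the rule compares consecutive measurements, it also stops at the first point where speedup decreases, which suits a non-monotonic relationship between cores and speedup.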
The changes in mean rewards are given in Figure 6. The mean reward is initially very low, but it increases remarkably as the iterations proceed thanks to setting ϵ = 0.2. It is worth noting that there is a gradual decline in the churn of rewards. That churn does not depend on the type of data set: the tradeoff between exploration and exploitation is achieved after a specific number of iterations. The onsets of the rise and decline of the mean reward are very similar for Santander and Dense. On the other hand, the analysis shows the opposite trend for the Microsoft and Santander data sets after 20 iterations.
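The role of ϵ can be illustrated with a standard ϵ-greedy action selection rule. The sketch below assumes that MFRLMO uses ϵ in this conventional way, which the text does not spell out; the function name and the toy Q-values are hypothetical.

```python
import random


def epsilon_greedy(q_row, epsilon=0.2, rng=random):
    """Pick an action index from one row of Q-values.

    With probability `epsilon` a random action is explored; otherwise the
    action with the highest Q-value is exploited.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    return max(range(len(q_row)), key=q_row.__getitem__)


# Toy Q-values for four actions (illustrative numbers only).
toy_q_row = [0.1, 0.9, 0.3, 0.2]
rng = random.Random(0)
picks = [epsilon_greedy(toy_q_row, 0.2, rng) for _ in range(10_000)]
greedy_share = picks.count(1) / len(picks)
print(round(greedy_share, 2))
```

With four actions and ϵ = 0.2, the best action is expected to be chosen in roughly 85% of the draws (80% by exploitation plus a quarter of the exploratory 20%).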
The CQV results of all the data sets are given in Figure 7. The data set having the lowest dispersion is Santander, which shows no notable churn although it could not yield the best average reward. The Payload data set moved within a narrow range in yielding average reward. Likewise, the distance between the extreme outliers of its CQV boxplot is very short. WordCount has the best dispersion among the benchmarks. Note that it is also the largest experimental data set. WordCount may have produced the best result because its required operations grow linearly in size with the number of instances. Although Bayes yielded the worst CQV, it showed promising results in classification data sets. The results given in Figure 8 show the opposite: the inferences we draw from dispersion are not valid for CVaR. For instance, Dense, which is a classification data set, produced worse CQV results than those of CVaR. The ordering has not changed in the benchmarks, but it has expanded along with the outliers of the boxplot.
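For readers unfamiliar with the two measures, the sketch below computes them on a list of rewards. It assumes the common textbook definitions, CQV = (Q3 - Q1) / (Q3 + Q1) and CVaR as the mean of the worst alpha fraction of rewards; the exact formulas and the confidence level used in the experiments are not restated in this section, so both are assumptions.

```python
import math
import statistics


def cqv(rewards):
    """Coefficient of quartile variation: (Q3 - Q1) / (Q3 + Q1)."""
    q1, _, q3 = statistics.quantiles(rewards, n=4, method="inclusive")
    if q1 + q3 == 0:
        # Can happen when rewards are centered on zero (negative rewards occur).
        raise ValueError("CQV is undefined when Q1 + Q3 equals zero")
    return (q3 - q1) / (q3 + q1)


def cvar(rewards, alpha=0.05):
    """Conditional value at risk: mean of the worst `alpha` share of rewards.

    Lower rewards are treated as worse; `alpha` is an assumed tail level.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(rewards)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k


# Toy reward samples (illustrative numbers only).
print(cqv([1, 2, 3, 4, 5]))
print(cvar(list(range(1, 11)), alpha=0.5))
```

CQV depends only on the quartiles while CVaR depends on the lower tail, so the two measures can rank data sets differently.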

The performance of the three competing methods in tuning a CNN is compared with respect to the average test accuracy, as presented in Figure 9. The comparison, comprising three image processing data sets, revealed that convergence is achieved after 50 iterations for the Cifar10 and FashionMnist data sets. MFRLMO outperformed the other two methods on all three data sets. Although Hyp-RL has a high complexity, it was observed to have the lowest fluctuation. EMORL has the highest fluctuation and produced the lowest accuracy. This is because achieving a good policy in RL becomes harder when the search space of HO is very large. In addition, EMORL has only been tested for tuning image recognition in the preceding studies. It is worthwhile to note that although MFRLMO is a three-objective optimization technique, the tradeoff found during the tuning process resulted in high accuracy.
Although FashionMnist and Mnist include similar data instances, they show different fluctuations. The findings obtained from Figure 9 support the claim that the similarity across the data sets has no impact on achieving a fast convergence. MFRLMO has the highest accuracy among the compared methods, as given in Table 6. The Payload data set produced the highest improvement (24%). The results of EMORL and Hyp-RL depend on the type of data set. Concretely, although Hyp-RL could yield a better improvement for the Microsoft data set, EMORL is much more feasible for Kmeans, which is a medium-sized benchmark compared to Microsoft. The rates at which accuracy improves are very difficult to group. The algorithmic design properties given in Section 5.5 may have led to that churn due to their effects on the complexity.

Threats to Validity
Firstly, the proposed method was only evaluated by using the ml_multilayer_perceptron_classifier classifier of MLlib. Therefore, the results may not be generalizable to the other algorithms of MLlib. However, the experimental data sets chosen for the classification have various sizes and they are compatible with the other types of classifiers. Hence, the results may incur a loss of generality of the proposed method, but this can be avoided by performing a replication study. Moreover, some hyperparameters related to the input and output layers are out of the scope of the experiment due to the stability of the number of features after performing Direct Search for ml_multilayer_perceptron_classifier.
The experiment can be replicated by utilizing the major steps of MFRLMO for all hyperparameters of some classifiers such as the Gradient-boosted tree.
The second threat concerns the metrics chosen for the data sets. Accuracy is the sole formula used for evaluating the success of classification. However, the width of the distribution of reward values is calculated with CQV and its risk is interpreted with CVaR. Hence, the reliability of the results could be examined with the help of three different evaluation metrics. Nevertheless, these measures are not always sufficient, given that Q-values are of great importance in some RL experiments. In this respect, new formulas could be designed to uncover the underlying mechanism of Q-value changes during the state transitions of the iterations.

Conclusion
Although Apache Spark is an open-source big data processing platform, it is a must to perform tuning with respect to various criteria, which results in a significant reduction in computational cost. In this study, a multi-objective RL-based optimization method called MFRLMO is proposed to solve the HO problem of Apache Spark. Unlike black-box approaches, some dynamic internal improvements have been made by considering the objective functions as a set of sequential HO problems. Specifically, the study is the first to use model-free RL in tuning the parameters of a big data processing platform. To this end, the rewards are updated in each iteration of MFRLMO by checking action-reward results. Though a tradeoff between memory, time, and accuracy has been achieved thanks to an ensemble technique, MFRLMO outperforms the baselines in accuracy. The proposed method does not suffer from a grid-based dimensionality that may affect the computation time. It can be deduced that, to obtain consistent results from dispersion and risk analyses, the number of instances should be as large as possible. We can conclude from the comparison that there is no monotonic relationship between speedup and the number of cores. Although the model of the environment is not involved in the feature set of data sets when the type of RL is model-free, potential consequences of actions should be rigorously analyzed. Therefore, in future work, the exact number of random sequences needed to obtain a reliable RL depending on the state transitions could be investigated. Further, it is planned to analyze to what extent the interaction between the number of grid cells
EAI Endorsed Transactions on Scalable Information Systems Online First

Figure 1. Overview of the proposed RL-based approach.

Figure 2. Dynamic update mechanism of 4x4 grid in model-free RL.

JavaSerializer
.parallelism: number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user; 20
.compression.codec: codec used to compress internal data such as RDD partitions, event log, broadcast variables and shuffle outputs; lz4, lzf, snappy, and zstd; Apache Spark
spark.reducer.maxSizeInFlight: maximum size of map outputs to fetch simultaneously from each reduce task (MB)
.file.buffer: size of the in-memory buffer for each shuffle file output stream (KB)
class to use for serializing objects that will be sent over the network or need to be cached in serialized form
batches to run during each tensor flow object; 1 4

Figure 8. Short-term CVaR results of the experimental data sets.
presented in Figure 2. Steps 8-14 change the rules of env depending on the yielded rewards. The objective functions described herein are of three types: sparkAccu, sparkMem, and sparkTime. sparkAccu is the success of ml_multilayer_perceptron_classifier in terms of accuracy, and it is calculated from the elements of the confusion matrix as given in Equation 3. The optimal value gives the highest reward at the end of 50 iterations, a number that was determined after conducting various trials on the experimental data. In contrast, for sparkMem and sparkTime the values representing the lowest rewards are the optimal configuration of hyperparameters. This is because they are associated with memory and time, for which a maximization-based model is established. In Step 14, the rewards are saved depending on the random progressive 1000 transitions. Step 15 defines the learning rate alpha, the discount factor gamma, and ϵ. Since the model relies on a large number of iterations of actions, alpha is kept to a minimum.
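The interplay of alpha, gamma, and ϵ over a 4x4 grid can be sketched with a generic tabular Q-learning loop. This is not Algorithm 1 itself: the update rule is the textbook Q-learning rule, the values alpha = 0.1 and gamma = 0.9 are assumptions (only ϵ = 0.2 and the 1000 transitions are taken from the text), and the reward function is a hypothetical stand-in for sparkAccu, sparkMem, or sparkTime.

```python
import random

N_ROWS, N_COLS = 4, 4                         # 4x4 grid as in Figure 2
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def make_q_table():
    """One row of Q-values (one per action) for every grid cell."""
    return {(r, c): [0.0] * len(ACTIONS)
            for r in range(N_ROWS) for c in range(N_COLS)}


def move(state, action):
    """Apply an action and clamp the result to the grid borders."""
    row = min(max(state[0] + action[0], 0), N_ROWS - 1)
    col = min(max(state[1] + action[1], 0), N_COLS - 1)
    return (row, col)


def q_update(q, state, a_idx, reward, next_state, alpha, gamma):
    """Textbook Q-learning update; a small alpha gives small, stable steps."""
    target = reward + gamma * max(q[next_state])
    q[state][a_idx] += alpha * (target - q[state][a_idx])


def run(reward_fn, n_transitions=1000, alpha=0.1, gamma=0.9,
        epsilon=0.2, seed=0):
    """Run epsilon-greedy transitions; alpha and gamma are assumed values."""
    rng = random.Random(seed)
    q = make_q_table()
    state = (0, 0)
    for _ in range(n_transitions):
        if rng.random() < epsilon:
            a_idx = rng.randrange(len(ACTIONS))
        else:
            a_idx = max(range(len(ACTIONS)), key=q[state].__getitem__)
        next_state = move(state, ACTIONS[a_idx])
        q_update(q, state, a_idx, reward_fn(next_state), next_state,
                 alpha, gamma)
        state = next_state
    return q


def toy_reward(state):
    """Hypothetical reward: one good cell, a small penalty elsewhere."""
    return 1.0 if state == (3, 3) else -0.01


q_table = run(toy_reward)
print(len(q_table))
```

For an objective such as memory or time, where a lower measurement is better, the reward function would have to encode that preference; how MFRLMO does so is not shown in this sketch.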

Table 1. Search spaces of hyperparameters and Spark parameters.
Payload can be accessed through Zenodo open source code sharing.

Table 2. Spark benchmarks and MLlib data sets preferred in this work.

Table 3. Mean state-action function Q-values of the Microsoft data set in optimizing for time.
The complexity of Hyp-RL is O(N * D * T), where N denotes the number of episodes, D is the number of data sets, and T is the number of tasks. The complexity of MFRLMO is O(count * N_rs), where count is the number of updates of env and N_rs is the number of sequences in sampling defined in Step 14 of Algorithm 1.

Table 6. The performance of optimizing ml_multilayer_perceptron_classifier with the baselines.