Heterogeneous High-Performance System Algorithm Based on Computer Big Data Technology

INTRODUCTION: This paper proposes a scheduling algorithm for heterogeneous systems based on prioritization (PQDSA). This algorithm is a sort method based on a directed acyclic graph (DAG). The critical nodes in the network are grouped according to the communication and computing costs in the network. This increases the parallelism between task schedules and reduces the completion time of work sets. Then, a method of assigning multiple tasks to multiple processors using interpolation is proposed. The PQDSA method can effectively reduce the time of scheduling multiple tasks and improve the scheduling effect. PQDSA is compared with EDL-θ and EDF scheduling methods. The results show that this method has better scheduling efficiency.


Introduction
Heterogeneous computing is a computing mode that combines arithmetic units with different instruction sets and architectures. Commonly used computing units include CPUs, GPUs, DSPs, ASICs, and FPGAs [1]. Its advantage is that, in addition to traditional CPU computation, other computing units can accelerate processing so that the computing device has a more efficient processing capacity. Heterogeneous computing has been increasingly applied in cluster systems such as supercomputers and cloud servers. Graphics processing units (GPUs) are increasingly used in compute-intensive software fields because of their high computing speed, high memory bandwidth, and parallelism. From PCs to supercomputers, GPUs play an important role. Many of the world's top 500 supercomputers, including systems such as the SGI Altix UV and Cray XK6, use graphics processors on a large scale. GPU parallel computing is moving into the mainstream. In recent years, with the development of CPU and GPU technologies, many high-performance computers have been built as heterogeneous computing platforms [1].
CPU and GPU hardware characteristics are different. The CPU uses out-of-order execution and relies on prefetching and a cache hierarchy to reduce memory access latency. It has a larger cache capacity and more data and logic units, which makes the CPU well suited to complex and data-dependent computing tasks such as distributed computing, data compression, artificial intelligence, and physical simulation. The GPU uses in-order execution across many compute units. It relies on massive multithreading to hide its memory access latency, and GPU performance depends on the application [1]. Even for the same program, GPU performance can vary. The main uses of graphics processors are image analysis, data processing, and video processing. These differences lead to significant differences in CPU and GPU execution performance. Some applications run fast on the CPU but slowly on the GPU, and vice versa [2]. It is therefore imperative to exploit the respective advantages of the CPU and GPU in heterogeneous systems, improve the operating efficiency of the hardware, and achieve optimal resource allocation. To this end, a dynamic task scheduling algorithm is adopted to allocate computing tasks to the corresponding hardware computing units in an appropriate proportion [3].
The academic community divides workload balancing in computer systems into two types: static and dynamic. Static scheduling sets the work assignment ratio before execution starts, according to the expected execution time. The predicted time is calculated from the execution performance of each processor, compile-time parameters, and offline training. Although this method needs no synchronization between tasks and has a low communication cost, its scope of application is narrow and it suffers from unbalanced load distribution. The Qilin method is a classical static task scheduling method. It uses the execution and data transfer speeds measured during a training phase to construct a performance model for the CPU and GPU. A hybrid static scheduling algorithm is presented in [4], which proposes an optimization method based on a genetic algorithm. It first uses a list heuristic scheduling algorithm that takes successor nodes into account to obtain a scheduling result close to the best, and then uses the genetic algorithm to optimize the first-stage result. A method that accounts for communication and computation is proposed in [5]. The algorithm collects statistics on system running time and data transfer time offline. The accuracy of the predicted execution and transfer times depends primarily on the actual operation: because operation requirements and hardware conditions differ, the required execution time also differs. This method therefore has some limitations in practice.
Dynamic scheduling determines the load allocation ratio at run time, based on how resources are shared between the GPU and CPU. Although dynamic scheduling costs more than static scheduling, its time estimation is more accurate. Reference [6] proposes a dynamic scheduling algorithm based on universal memory address space access (AHS). The system's load characteristics and the CPU and GPU computing speeds are analysed in real time without offline data processing. Reference [7] proposes a dynamic adaptive scheduling algorithm (DSS). The work blocks assigned to the processors at the start are relatively large. As the number of remaining tasks decreases, so does the number of tasks assigned to the CPU and GPU. Reference [8] describes a work-stealing method. Because the GPU cannot actively communicate with the CPU, the GPU cannot make load requests to the work pool. In [9], when there is no dedicated route, communication is controlled by a dynamic program that combines the GPU and CPU, but this reduces operating efficiency. This paper aims to improve the parallelism and efficiency of multi-core systems. A priority queue partitioning scheduling algorithm (PQDSA) is designed for parallel jobs with multiple entry and multiple exit nodes. First, the number of priority queues is determined according to the number of entry nodes, and then the tasks are sorted and scheduled. This approach can simplify a complex and diverse set of tasks.

Task scheduling model
In a heterogeneous system, assigning a limited number of tasks to different processing units is a complex problem. Such problems are usually described by a directed acyclic graph (DAG). Each task is expressed as a node, and nodes are the smallest units in the scheduling process [10]. When a task is to be executed, the host finds an appropriate processor according to a particular search strategy and schedules the task on it, distributing the work so as to approach a predetermined optimal solution. Figure 1 shows a task assignment model that assigns five tasks to two processors. $D = \{d_{ij}\}$ denotes the computation costs, where $d_{ij}$ is the computation cost of node $p_i$ on processor $q_j$. Here $q_j$ refers to the $j$-th processor in the heterogeneous system.
The set of processors is $\{q_1, q_2, \dots, q_n\}$, where $n$ represents the total number of processors. When there are several entry nodes or several exit nodes, a virtual node can be constructed to serve as the single entry or exit node [11]. This virtual node requires neither extra communication cost nor extra computation cost. A node must receive all the data from each of its predecessor nodes before it can enter the ready state and be scheduled. The DAG task graph is shown in Figure 2. The value on each directed edge represents the dependence between the two nodes and is called the communication cost. If $T_1(p_i, q_j)$ and $T_2(p_i, q_j)$ represent the start and end times of task $p_i$ on processor $q_j$, then the relationship between them can be expressed by formula (1):

$$T_2(p_i, q_j) = T_1(p_i, q_j) + d_{ij} \quad (1)$$

$T_2(P_{exit}, q_j)$ represents the completion time of the entire schedule [12]. If there are multiple exit nodes, a virtual node with zero computation cost and zero communication cost is added after them so that it becomes the unique final exit node. The scheduling length is this completion time. The ultimate goal of the scheduling problem is to distribute the work of each node reasonably, in a particular order, so as to achieve the shortest scheduling length.
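The construction of virtual entry and exit nodes can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `add_virtual_nodes`, the dictionary layout, and the labels `"entry"` and `"exit"` are hypothetical and chosen only for the example.

```python
def add_virtual_nodes(cost, comm):
    """Return copies of (cost, comm) with a single entry and a single exit node.

    cost: dict mapping task name -> computation cost
    comm: dict mapping (src, dst) -> communication cost of that edge
    Virtual nodes get zero computation cost and zero communication cost.
    The labels "entry" and "exit" are assumptions made for this sketch.
    """
    has_pred = {dst for _, dst in comm}
    has_succ = {src for src, _ in comm}
    entries = [t for t in cost if t not in has_pred]
    exits = [t for t in cost if t not in has_succ]

    cost, comm = dict(cost), dict(comm)
    if len(entries) > 1:
        cost["entry"] = 0.0
        for t in entries:
            comm[("entry", t)] = 0.0
    if len(exits) > 1:
        cost["exit"] = 0.0
        for t in exits:
            comm[(t, "exit")] = 0.0
    return cost, comm


# Hypothetical DAG with two entry nodes (p1, p2) and two exit nodes (p3, p4).
example_cost = {"p1": 2.0, "p2": 3.0, "p3": 1.0, "p4": 4.0}
example_comm = {("p1", "p3"): 1.0, ("p2", "p3"): 2.0, ("p2", "p4"): 1.0}
new_cost, new_comm = add_virtual_nodes(example_cost, example_comm)
```

Because the added nodes and edges all have zero cost, the scheduling length of the extended graph equals that of the original one.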
Definition 5: Scheduling efficiency [13]. The scheduling efficiency of an algorithm is the ratio of the speedup to the total number of processors. Its value varies between 0 and 100%. This measure considers not only the execution speed of the DAG schedule but also the number of processing units. Its formula is:

$$Efficiency = \frac{Speedup}{n} \times 100\%$$

where $n$ represents the number of processors and $Speedup$ is the acceleration ratio, that is, the ratio of the sequential computation time of the task set to the scheduling length. The notation and parameters used in this article are summarized in Table 1.
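As a small worked example of Definition 5, the following Python sketch computes the speedup and the scheduling efficiency. The function names and the numbers are illustrative assumptions, not values from the experiments.

```python
def speedup(sequential_time, schedule_length):
    """Ratio of the sequential computation time to the scheduling length."""
    return sequential_time / schedule_length


def efficiency(sequential_time, schedule_length, n_processors):
    """Scheduling efficiency in percent: speedup divided by processor count."""
    return speedup(sequential_time, schedule_length) / n_processors * 100.0


# Hypothetical task set: 40 time units sequentially, 16 on 4 processors.
example_speedup = speedup(40.0, 16.0)
example_efficiency = efficiency(40.0, 16.0, 4)
```

With these assumed numbers the speedup is 2.5 and the efficiency is 62.5%, which shows how adding processors lowers efficiency unless the scheduling length shrinks proportionally.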

Priority queue division phase
When tasks are scheduled, the number of entry nodes determines the number of priority queues. A node with more than one direct predecessor node is called a critical (key) node. The main idea of task partitioning is to divide key nodes into suitable queues to generate the best task-scheduling queues. To do so, the average completion time AVGT of each node is calculated [14]. The nodes are divided into several queues based on their AVGT values, and each queue is assigned to an appropriate processor. Blank queues $S_i$ ($i = 1, 2, 3, \dots$) are used for the partitioning, and $Mpred(p_i)$ represents the total number of direct predecessor nodes of node $p_i$. For complex task applications with multiple entry and exit nodes, nodes with no computation cost and no transmission delay can act as pseudo-entry and pseudo-exit nodes. Virtual nodes do not affect the overall task assignment. AVGT is obtained recursively starting from the entry node; its expression is given in formula (7). If $Mpred(p_i) > 1$, the node is a key node.
The AVGT values of its predecessor nodes are compared, and the key node is placed in the queue of the predecessor with the largest AVGT value.
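The placement rule described above can be sketched in Python. The AVGT recursion used here (a node's cost plus the largest predecessor AVGT plus edge cost) is an assumption made for the sketch, since formula (7) is not reproduced in the text; the function name `partition_queues` and the example graph are also hypothetical.

```python
from collections import deque


def partition_queues(cost, comm):
    """Partition DAG nodes into priority queues, one queue per entry node.

    cost: dict task -> average computation cost
    comm: dict (src, dst) -> communication cost
    Assumed recursion: AVGT(p) = cost[p] + max over predecessors q of
    (AVGT(q) + comm[(q, p)]); entry nodes have AVGT(p) = cost[p].
    """
    preds = {t: [] for t in cost}
    succs = {t: [] for t in cost}
    for src, dst in comm:
        preds[dst].append(src)
        succs[src].append(dst)

    # Topological order (Kahn's algorithm) so predecessors are handled first.
    indeg = {t: len(preds[t]) for t in cost}
    ready = deque(t for t in cost if indeg[t] == 0)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for s in succs[t]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)

    avgt, queue_of, queues = {}, {}, []
    for t in order:
        if not preds[t]:
            # Entry node: it opens a new blank queue.
            avgt[t] = cost[t]
            queue_of[t] = len(queues)
            queues.append([t])
            continue
        avgt[t] = cost[t] + max(avgt[p] + comm[(p, t)] for p in preds[t])
        # One predecessor: join its queue. Several predecessors (key node):
        # join the queue of the predecessor with the largest AVGT.
        anchor = max(preds[t], key=lambda p: avgt[p])
        queue_of[t] = queue_of[anchor]
        queues[queue_of[t]].append(t)
    return queues, avgt


# Hypothetical DAG: p5 is a key node with predecessors p3 and p4.
example_cost = {"p1": 3, "p2": 5, "p3": 2, "p4": 4, "p5": 1}
example_comm = {("p1", "p3"): 1, ("p2", "p4"): 2,
                ("p3", "p5"): 1, ("p4", "p5"): 3}
example_queues, example_avgt = partition_queues(example_cost, example_comm)
```

In this example the key node `p5` joins the queue that contains `p4`, because `p4` has the larger AVGT of its two predecessors.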
Figure 3 shows the result of the task splitting queue (image cited from Wireless Communications and Mobile Computing, 2018).

Stages of prioritization
The number of entry nodes determines the number of queues. Adding nodes during the queuing process increases the communication cost and the parallelism between nodes. Because of the critical nodes, the data dependencies between queues cannot be eliminated. When deciding the order of a group of tasks, the $TV(p_i)$ value of each task node is estimated first and the nodes are then sorted. During execution, an important problem is how to determine the position of critical node tasks [15]. If a predecessor of a critical node has not been scheduled, the predecessor must be scheduled first. A node is critical if its number of direct predecessor nodes $Mpred(p_i)$ is more than one. If all direct predecessors of a critical node come from the same queue, the node is assigned directly. If one of its direct predecessors belongs to another queue, it must be checked whether that predecessor has been scheduled and whether its data has arrived.
Once the predecessor nodes have been scheduled and their data has arrived, the critical node can be scheduled. Nodes with the same $TV(p_i)$ value can be ordered arbitrarily. In the priority selection phase, the $TV(p_i)$ values of the nodes in priority queues $S_1$ and $S_2$ are calculated first, and the nodes are sorted by $TV(p_i)$. In the example, each of the priority queues $S_1$ and $S_2$ holds five nodes; the node set of $S_2$ is $\{P_2, P_4, P_5, P_7, P_{10}\}$.
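The sorting step can be illustrated with a short Python sketch. It assumes the common t-level definition, in which the value of a node is the longest path from an entry node to it (excluding its own cost), because the paper's $TV(p_i)$ formula is not reproduced in the text. Sorting in ascending order is likewise an assumption; the names and the example graph are hypothetical.

```python
def t_levels(cost, comm):
    """Compute an assumed t-level TV(p) for every node of a DAG.

    Assumed definition: TV(p) = max over predecessors q of
    (TV(q) + cost[q] + comm[(q, p)]), and TV(p) = 0 for entry nodes.
    """
    preds = {t: [] for t in cost}
    for src, dst in comm:
        preds[dst].append(src)

    tv = {}

    def level(t):
        # Memoised recursion over predecessors.
        if t not in tv:
            tv[t] = max(
                (level(p) + cost[p] + comm[(p, t)] for p in preds[t]),
                default=0,
            )
        return tv[t]

    for t in cost:
        level(t)
    return tv


def sort_queue(queue, tv):
    """Order a priority queue by ascending TV; ties keep their input order."""
    return sorted(queue, key=lambda t: tv[t])


# Hypothetical DAG and one unsorted priority queue.
example_cost = {"p1": 3, "p2": 5, "p3": 2, "p4": 4, "p5": 1}
example_comm = {("p1", "p3"): 1, ("p2", "p4"): 2,
                ("p3", "p5"): 1, ("p4", "p5"): 3}
example_tv = t_levels(example_cost, example_comm)
example_sorted = sort_queue(["p5", "p2", "p4"], example_tv)
```

With positive computation costs, ascending order under this definition always places a predecessor before its successors within a queue, which keeps the precedence constraint intact.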

Processor selection phase
The algorithm sorts multiple tasks in order, substantially reduces the dependence between tasks and increases parallelism, thus shortening the average execution time of the nodes. Then the execution order is determined from the tasks in each queue, ensuring that every critical node meets the precedence constraint [16]. This section focuses on scheduling tasks on the appropriate processor so that each node completes sooner. In the DAG task graph, each task is assigned to a specific processing unit. A node must wait until all of its predecessors have finished and sent their data to its processor before it can start.
There is no communication overhead when two task nodes run on the same processor. A critical node has the highest task priority within its level. The actual start time $ST$ and actual finish time $FT$ of a task are calculated with a communication coefficient $k$, where
$k = 0$ when $p_i$ and its predecessor $p_j$ run on the same processor, and
$k = 1$ when $p_i$ and its predecessor $p_j$ run on different processors.
If each task in a queue has only one direct predecessor node, then all the tasks in the queue are assigned to the same processor.

EAI Endorsed Transactions on Scalable Information Systems
Online First

D. Pan
In this case there is no additional communication overhead and no need to wait for data from other nodes.
When a task has more than one direct predecessor node, it is a critical node. The predecessor whose data arrives last is its critical parent node.
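The role of the coefficient $k$ and of the critical parent can be sketched in Python. The start-time rule used here (the latest data arrival over all predecessors, bounded below by the time the processor becomes free) is the usual list-scheduling form and is an assumption, since the paper's formulas for $ST$ and $FT$ are not legible in the text. All names and numbers are hypothetical.

```python
def start_finish(task, proc, preds, assigned, finish, comm, exec_cost,
                 proc_ready=0.0):
    """Return (ST, FT, critical_parent) for `task` placed on `proc`.

    preds: direct predecessors of the task
    assigned: dict predecessor -> processor it ran on
    finish: dict predecessor -> its finish time FT
    comm: dict (src, dst) -> communication cost
    exec_cost: computation cost of `task` on `proc`
    proc_ready: time at which `proc` becomes free (assumed input)
    """
    arrivals = {}
    for p in preds:
        # k = 0 on the same processor, k = 1 on different processors.
        k = 0 if assigned[p] == proc else 1
        arrivals[p] = finish[p] + k * comm[(p, task)]

    # The critical parent is the predecessor whose data arrives last.
    critical_parent = max(arrivals, key=arrivals.get) if arrivals else None
    data_ready = arrivals[critical_parent] if arrivals else 0.0
    st = max(data_ready, proc_ready)
    return st, st + exec_cost, critical_parent


# Hypothetical state: p3 finished on q1 at t=6, p4 finished on q2 at t=11.
assigned = {"p3": "q1", "p4": "q2"}
finish = {"p3": 6.0, "p4": 11.0}
comm = {("p3", "p5"): 1.0, ("p4", "p5"): 3.0}

on_q2 = start_finish("p5", "q2", ["p3", "p4"], assigned, finish, comm,
                     exec_cost=1.0, proc_ready=11.0)
on_q1 = start_finish("p5", "q1", ["p3", "p4"], assigned, finish, comm,
                     exec_cost=1.0, proc_ready=6.0)
```

In this example, placing `p5` on the processor of its critical parent `p4` avoids the communication cost of 3 and lets it finish at time 12 instead of 15.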

Experimental parameter setting and deployment
The hardware requirements for this test are shown in Table 2. The precise values of the CPU's DVFS power consumption parameters $P_{C0}$ and $\lambda$ are obtained as follows. The 11 tasks are run on the CPU at ten different voltage/frequency settings, and the parameters are obtained by fitting the measured results. The result is $P_{C0}$ = 57.57 W and $\lambda$ = 31.75. The same 11 tasks are run on the GPU at ten different voltage/frequency settings; the fitted result is $P_{C0}$ = 65.11 W and $\sigma$ = 66.44.
The execution time and energy consumption measured when the 11 tasks run at ten different frequencies on the CPU are used to establish the system performance model. The behaviour of a given type of task at a given voltage/frequency setting on a processor can then be derived from this model.
When the 11 test tasks are executed on the GPU, the execution efficiency of the GPU is significantly affected if fewer than 3 CPU cores assist the GPU computation. Therefore, in this experiment, three of the 18 cores of the Xeon E5-2695 CPU were selected as auxiliary cores running at the highest frequency of 3300 MHz. The remaining 15 cores are used as computing cores to maximize the utilization of CPU computing resources.
This paper uses the cpufreq tool under Linux to adjust the CPU voltage and frequency level dynamically. It can view and individually adjust the frequency of each CPU core. The nvidia-smi tool is used to adjust the GPU voltage/frequency. This tool, developed by Nvidia, monitors GPU usage and changes its state. Intel's Power Governor tool is used to measure the power consumption of the CPU, and nvidia-smi is used to measure that of the GPU. The energy consumption of the whole server and the energy consumed by switching the server on and off are measured with an energy meter.
The energy consumed by switching a server on and off, $E_{turnon/off}$, is 262.50 MJ. When the CPU and GPU are idle, they are set to the lowest voltage/frequency setting. The CPU power consumption at 1200 MHz is 28.05 W, the GPU power consumption is 9.49 W, and the GPU power consumption at a core frequency of 544 MHz is 9.89 W. The measured value of the other power consumption, $P_0$, is 44.46 W. It is also assumed that each server can be equipped with 1, 2, 4, 8, 16, or 32 CPU/GPU pairs, denoted P1, P2, P4, P8, P16, and P32. This paper models tasks being submitted to the cluster over one hour, with a base time unit of 1 second, so that Time ∈ [1, 3600]. The amount of work in each period follows a normal distribution. Each task is randomly selected from the 11 benchmark programs.
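A workload of this shape could be generated as in the following Python sketch. The mean and standard deviation of the per-second task count, the rounding and clamping at zero, the seed, and the function name `generate_workload` are all assumptions made for illustration; the paper does not state these parameters.

```python
import random

# The 11 Rodinia programs named in the experimental setup.
BENCHMARKS = [
    "backprop", "bfs", "gaussian", "hotspot", "hotspot3D", "lud",
    "nn", "nw", "pathfinder", "srad", "streamcluster",
]


def generate_workload(mean=2.0, std=1.0, duration=3600, seed=0):
    """Return a dict mapping each second in [1, duration] to a task list.

    The number of tasks per second is drawn from a normal distribution
    (assumed mean and std), rounded, and clamped at zero. Each task is
    picked uniformly at random from the 11 benchmark programs.
    """
    rng = random.Random(seed)
    workload = {}
    for t in range(1, duration + 1):
        count = max(0, round(rng.gauss(mean, std)))
        workload[t] = [rng.choice(BENCHMARKS) for _ in range(count)]
    return workload


example_workload = generate_workload()
```

Using a seeded `random.Random` instance keeps the generated trace reproducible, so the same submissions can be replayed against each scheduling algorithm.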

Experimental comparison algorithm
This section gives a comprehensive evaluation of the scheduling methods. The PQDSA, EDL-θ, and EDF scheduling algorithms were configured on servers with different numbers of CPU/GPU pairs. Experimental results are given for different running times and load conditions [18], and the differences in energy consumption and running speed between the three methods are compared. The EDL-θ algorithm is a method that combines DVFS technology with DRS technology.

Algorithm comparison results
The PQDSA method is a priority queuing method based on job category. The benefits of this approach are shown in Figure 4. Compared with the standard EDF method, the PQDSA method saves 20.52%, 26.15%, 30.42%, 31.56%, 36.25%, and 39.17% of energy consumption for P1 to P32, respectively. Compared with the EDL-θ method, it saves 7.08%, 5.1%, 5.73%, 6.67%, 10.73%, and 2.92% of energy consumption. The idle energy consumption and switching energy consumption of the servers can be effectively reduced by the heterogeneous algorithm. The PQDSA method first determines the optimal execution time of each job and then schedules it, by default placing the task on its preferred processor at the optimal frequency. In addition, the PQDSA algorithm groups tasks. As a result, tasks have low energy consumption and short execution times. This ensures that most of the work is done at the optimal frequency on the preferred processor, so that most work is done in the theoretically lowest energy state [19]. It also prevents tasks from being executed at high frequencies on non-preferred processors under DVFS. Before DVFS-based alternatives are applied, alternatives that do not increase the system's energy consumption are adopted as far as possible. The PQDSA algorithm consumes less energy, significantly reducing idle and server switching consumption, and has the best energy efficiency of the three methods. The EDL-θ algorithm is based on DVFS technology. Its performance is not ideal in the experimental environment of this paper. In the P1 to P32 cases, it performs worse than the EDF algorithm: its total energy consumption is 148.33%, 155.31%, 161.35%, 158.54%, 152.50%, and 152.40% of that of EDF, respectively. Although DVFS technology can significantly reduce the number of powered-on servers, and thus the system's idle and switching energy consumption, the current DVFS approach does not classify tasks by type, resulting in strong randomness in task placement. In addition, because DVFS is used, tasks are executed at a higher frequency when they are not placed on their preferred processor [20]. This sharply increases energy consumption during execution and reduces the efficiency of the algorithm. The PQDSA method has the best energy efficiency under high load. Compared with the EDF and EDL-θ algorithms, the PQDSA algorithm achieves energy savings of 31.56% and 56.46%, respectively, when each server is equipped with 8 CPU/GPU pairs.

Effect of the number of CPU/GPU pairs on algorithm performance
When the number of CPU/GPU pairs in a server is small, the running power consumption accounts for most of the total power consumption. At the same time, the consumption of the servers themselves and the server switching consumption also occupy a large proportion. The reason is that with too few CPU/GPU pairs, the computing power of a single server is too low. More servers must be powered on to process the submitted tasks, resulting in significant server switching energy consumption (Figure 5).
The total energy consumption of the three methods decreases as the number of CPU/GPU pairs in each server increases. With 32 CPU/GPU pairs, the overall energy consumption of EDF is 14.48% lower than with 1 CPU/GPU pair. For the PQDSA method, the overall energy consumption is reduced by 36.68%, and for the EDL-θ algorithm it is reduced by 11.00%. When the number of CPUs and GPUs in a single server increases, the computing capacity of a single server increases, and the number of servers that need to be powered on decreases significantly. This greatly reduces the server switching energy consumption, which in turn reduces the overall energy consumption. Under the test conditions used here, the more CPU/GPU pairs there are, the less total energy the cluster consumes to complete the tasks. The ratio of the energy consumed while working to the total energy consumption is taken as the effective energy consumption ratio, as shown in Table 3.
The PQDSA method performs well. Especially when the number of CPU/GPU pairs is large, the effective energy utilization of the system reaches nearly 100%.
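The effective energy consumption ratio can be written as a one-line computation. The sketch below assumes that the total energy is the sum of running, idle, and switching energy, which are the three components discussed in this paper; the function name and the sample numbers are illustrative only.

```python
def effective_energy_ratio(running, idle, switching):
    """Share of the total energy that is spent while executing tasks.

    Assumes total energy = running + idle + switching (all in the same unit).
    """
    total = running + idle + switching
    return running / total if total else 0.0


# Hypothetical energy breakdown in arbitrary units.
example_ratio = effective_energy_ratio(running=80.0, idle=15.0, switching=5.0)
```

As idle and switching energy shrink, for instance because fewer servers must be powered on, this ratio approaches 1.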

Conclusion
This paper presents a scheduling algorithm based on priority queue partitioning (PQDSA). The PQDSA algorithm sorts the nodes and divides node tasks with high complexity and low parallelism into several queues. There is a dependency between a key node and its predecessor nodes. In the grouping process, the critical nodes are placed in the corresponding queue according to the average completion time of the nodes. This can reduce the time consumed by the tasks. Compared with the traditional EDF and EDL-θ methods, the PQDSA method shows more significant advantages in scheduling length and scheduling efficiency.

Figure 1. A scheduling model for five jobs on two processors

Figure 2. DAG application with ten work nodes

$d_{ij}$ represents the amount of computation of task $p_i$ on processor $q_j$, and $n$ represents the number of processors. The communication cost, the computing cost of a task on its processor, and the resource allocation of its predecessor tasks are three important factors that affect performance.
Definition 3: Task level value $TV(p_i)$. The T-level value $TV(p_i)$ of each node is calculated recursively.
Definition 4: Scheduling length. $P_{exit}$ represents the exit node.
Blank queues $S_i$ ($i = 1, 2, \dots, L$) must be used when prioritizing. The size of $i$ is determined by the number of entry nodes. The parameter $Mexit(S_i)$ represents the number of nodes removed from queue $S_i$.

Figure 4. Cluster energy consumption for different CPUs/GPUs

Figure 5. Cluster energy consumption under three different time settings

$v_{ij}$ stands for the additional communication load when two task nodes $p_i$ and $p_j$ run on different processors. $D = \{d_{ij}\}$ represents the computation cost of node $p_i$ on processor $q_j$.

Table 1. Symbol definitions

| Symbol | Definition |
| --- | --- |
| $succ(p_i)$ | Set of successor nodes of task node $p_i$ |
| $D_i$ | Average computation cost |
| $P_{entry}$ | Entry node |
| $P_{exit}$ | Exit node |
| $S_i$ | Priority queue (the blank queue required for sorting) |
| $AVGT(p_i)$ | Average completion time of task node $p_i$ |
| $Mpred(p_i)$ | Number of direct predecessor nodes of task node $p_i$ |
| $Mexit(S_i)$ | Number of nodes removed from queue $S_i$ |
| $v_{ij}$ | Communication cost required to transmit data from task node $p_i$ to task node $p_j$ |

First, select the entry node and place it in the blank queue. The AVGT of each node is recursively calculated from the entry node. When a node is selected, it is processed as follows:
1) If node $p_i$ has only one direct predecessor node, then this node is placed directly in the queue of its predecessor node.
2) If node $p_i$ has multiple direct predecessor nodes, that is, $Mpred(p_i) > 1$, the node is a key node.

Using the Rodinia benchmark suite, 11 OpenCL programs are selected as research objects [17]. They are backprop, bfs, gaussian, hotspot, hotspot3D, lud, nn, nw, pathfinder, srad, and streamcluster. Because the adjustable voltage/frequency range differs for each CPU/GPU, the voltage/frequency settings are normalized first. The minimum CPU frequency is 1200 MHz and its maximum operating frequency is 3300 MHz. This range is divided into ten frequency levels with a step size of 233 MHz. Normalizing by the default CPU frequency of 2100 MHz gives $f_C \in [0.59, 1.64]$. A GPU with a minimum core frequency of 567 MHz and a maximum core frequency of 1595 MHz was used, and its range is likewise divided into ten levels. The default frequency of the GPU core is 1357 MHz, and normalizing by it gives $f_{Gc} \in [0.44, 1.22]$.

Table 2. Experimental environment

Table 3. Effective energy consumption ratio