An Accurate Viewport Estimation Method for 360 Video Streaming using Deep Learning

Nowadays, Virtual Reality (VR) is becoming more and more popular, and 360 video is an essential part of VR systems. Transmitting 360 video over the Internet faces many difficulties due to its large size. Therefore, Viewport Adaptive Streaming (VAS) was proposed to reduce the network bandwidth required by 360-degree video. An important issue in VAS is how to estimate the user's future viewing direction. In this paper, we propose an algorithm called GLVP (GRU-LSTM-based-Viewport-Prediction) to estimate the viewport for the VAS system. The results show that our method improves viewport estimation by 9.5% to nearly 20% compared with other methods.


Introduction
According to various surveys, 360-degree video has become increasingly popular in recent years [1]. Because the network capacity required for a 360-degree video is much higher than that of an ordinary video, streaming 360-degree video is a major challenge today. Furthermore, real-time applications impose extremely low latency requirements on video streaming. To reduce the transmitted video volume while maintaining a good quality of user experience, numerous 360-degree video streaming methods have been proposed. Among them, viewport adaptive streaming is one of the most widely used approaches today. In this method, a 360-degree video is broken into tiles of varying quality weights before delivery over the network. As illustrated in [2], the quality of a VAS system suffers significantly from incorrect viewport prediction [3], [4], [5], [6]. Accurate viewport prediction is therefore an essential requirement of a VAS system: estimating the viewport helps distribute quality weights among the tiles most likely to be watched, so we can reduce the bitrate of the tiles that users ignore while retaining the quality of the tiles they do pay attention to.
Recently, numerous viewport prediction algorithms have been proposed. However, because these approaches use different input data sets and parameters, comparing their quality is difficult. Metrics used to evaluate the quality of viewport prediction systems include the root-mean-square error (RMSE) [4], the number of tiles lost [5], and the percentage of black area in the viewport [6]. These methods share an unsolved difficulty: the authors only measured over a short period of time, primarily near the beginning of the video. Since users tend to vary their viewing angle throughout the entire video, it is not possible to predict the viewport for a whole video at once. Figure 1 visually represents the change in the user's perspective and the difference between the predicted viewport and the viewport actually seen by the user.
In this paper, we assess and compare common viewport prediction algorithms for 360-video viewport adaptive streaming. Prediction performance is examined not only in terms of accuracy but also in terms of redundancy, to shed light on the actual performance of existing models such as Last [3], Linear [6], LSTM [5], and GRU [7].
The rest of the paper is organized as follows: Section 2 discusses related work. Section 3 elaborates the proposed method, named GLVP. Section 4 contains the performance evaluation. Finally, Section 5 concludes our findings and future work.

Related work
Viewport Adaptive Streaming has been proposed in [8], [9], [10], [11] to cope with the high bitrate of 360 video. The concept of VAS is to transmit at high quality the video parts visible to the user (i.e., the viewport) and at lower quality the rest of the video [8]. The majority of previous research [1] used a tiling-based VAS method, in which the entire 360 video is spatially separated into small portions called tiles, each of which is encoded into numerous versions of varying quality. High-quality versions are chosen for the tiles that overlap the user viewport, whereas low-quality versions are picked for the tiles beyond the user viewport [11].
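As an illustration of this tile-selection logic, the sketch below assigns a quality level to each tile given a predicted viewport center. The 6x12 tile grid, the 100-degree viewport extent, and the naive center-distance overlap test are illustrative assumptions, not parameters taken from the cited works.

```python
# Illustrative sketch: choose a high-quality version for tiles that overlap
# the predicted viewport and a low-quality version elsewhere. Grid size,
# viewport extent, and the overlap test are assumptions for illustration.
def select_tile_qualities(center_lon, center_lat, rows=6, cols=12,
                          fov_lon=100.0, fov_lat=100.0):
    qualities = {}
    for r in range(rows):
        for c in range(cols):
            # Tile center in degrees: longitude in [-180, 180), latitude in [-90, 90).
            t_lon = -180.0 + (c + 0.5) * 360.0 / cols
            t_lat = -90.0 + (r + 0.5) * 180.0 / rows
            # Wrap-around distance in longitude, plain distance in latitude.
            d_lon = min(abs(t_lon - center_lon), 360.0 - abs(t_lon - center_lon))
            d_lat = abs(t_lat - center_lat)
            inside = d_lon <= fov_lon / 2 and d_lat <= fov_lat / 2
            qualities[(r, c)] = "high" if inside else "low"
    return qualities

# Example: viewport centered at the equator, longitude 30 degrees east.
print(sum(q == "high" for q in select_tile_qualities(30.0, 0.0).values()))
```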
The so-called viewport predictor, which anticipates where the user will gaze in the future [11], is a critical component of Viewport Adaptive Streaming. Because of its simplicity, early publications [9], [3], [11], [12], [13] used linear regression and its modifications (e.g., Weighted Linear Regression) to predict the viewport location. Recent research has used neural networks to forecast viewports. In particular, Long Short-Term Memory (LSTM) [5], [14], Gated Recurrent Units (GRU) [7], and other Recurrent Neural Networks (RNN) have received a lot of attention. Furthermore, probabilistic models such as Gaussian Mixtures [15] and reinforcement learning algorithms such as Contextual Bandits [16] have been applied. In [15], the authors proposed a hybrid user- and video-evidenced viewport prediction method to reduce bandwidth consumption in live mobile VR streaming; that goal differs from ours. In [16], the authors proposed a viewport prediction algorithm and ran it on a video streaming testbed, but their data were collected from experiments showing users 360-degree videos, so the dataset in [16] differs from ours. However, all of the above solutions still have limited accuracy; in this paper, we propose a new model to improve viewport prediction accuracy.
Furthermore, several other approaches for viewport prediction are mentioned in [17], [18], and [19]. The authors in [17] proposed employing FoV prediction and caching to create a live streaming system for 360-degree videos. Besides, in [18], the authors developed a clustering-based viewport prediction algorithm that uses viewport pattern data from prior video streaming sessions; nevertheless, this method is very dependent on the video content. In [19], the authors extracted video semantic information, whose deep-learning-based video analysis requires powerful processing resources and vast memory space, whereas most client devices, such as small mobile devices or Head-Mounted Displays (HMD), have limited computing and memory resources. In general, the three aforementioned studies are based on a fixed context and consume a lot of memory. Meanwhile, our solution automatically adapts to the head movement, self-learns through the training process, and is capable of removing unnecessary memory areas, thus consuming less memory.

Let $P(t_0)$ be the position of the viewport at time $t_0$. The position of the viewport's center point can be specified by its longitude and latitude values [1]. As Figure 2 shows, spherical video captures a 360-degree view of a scene; it is the main content type in Virtual Reality, providing an "immersive" viewing experience. The viewport is the video area that a user can see at a given instant due to the human Field of View.

Problem Formulation
The viewport predictor's job is to predict the position of the viewport $P(t_0 + m)$ at a future point in time, where the forecast horizon is denoted by $m$. Because 360-video streaming is typically done on a segment/adaptation-interval basis [20], the predictor must provide a prediction for the interval $[t_0 + m, t_0 + m + s]$, where $s$ denotes the segment duration, as shown in Figure 3.
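Written out, and assuming the predictor conditions on a window of $h$ past samples ($h$ is our notation for illustration; the formulation above fixes only $m$ and $s$), the task is to learn a mapping $f$ such that

$$\hat{P}(t) = f\big(P(t_0 - h), \ldots, P(t_0)\big), \qquad t \in [t_0 + m,\; t_0 + m + s].$$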

Design of viewport prediction and selection
Our proposed position prediction is based on a hybrid of the LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit) algorithms, called GLVP, which stands for GRU-LSTM-based-Viewport-Prediction. LSTM can estimate long-term correlations in the data and can thereby model a longer-term trend; as a result, LSTM can potentially provide more accurate viewport prediction. However, LSTM requires a rather long initial processing time. Therefore, we design a GRU block in front of the LSTM to speed up input data processing and, as a result, improve accuracy in the initial seconds when compared with an algorithm that solely uses LSTM. Figure 4 shows the architecture of a cell in our proposed technique.
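At a coarse, layer-level granularity, the idea of placing a GRU block in front of an LSTM block can be sketched with off-the-shelf Keras layers, as below. This stacked model is a simplified approximation for illustration only; the actual GLVP design fuses the GRU and LSTM gates into a single custom cell, detailed next. The window length and layer widths are assumptions, not values from the paper.

```python
# Simplified, layer-level sketch of the GRU-before-LSTM idea. GLVP itself
# fuses the gates into one custom cell; this stacked model is illustrative.
from tensorflow import keras
from tensorflow.keras import layers

HISTORY = 30  # number of past viewport samples per input window (assumption)

model = keras.Sequential([
    keras.Input(shape=(HISTORY, 1)),        # one angle series (longitude OR latitude)
    layers.GRU(64, return_sequences=True),  # GRU block: fast early processing
    layers.LSTM(64),                        # LSTM block: long-term trend modeling
    layers.Dense(1),                        # next predicted angle
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```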
The GLVP model includes $n$ inputs corresponding to the $n$ frames/images of a video: $\{x_1, x_2, \ldots, x_t, \ldots, x_n\}$. $M_{t-1}$ and $a_{t-1}$ are the cell state and hidden state at time $t-1$, while $M_t$ and $a_t$ are the cell state and hidden state at time $t$. In our design, $a_t$ represents the selection of the predicted viewport, and $M_t$ is the data that will be used as input for the next cell. The gates used in the cell are defined as follows:
• Forget gate ($q_t$): removes unnecessary information from the current cell.
• Input gate ($v_t$, $\tilde{M}_t$): selects the important information to be used in the current cell.
• Reset gate ($r_t$, $n_t$): controls how much of the previous state is retained.
• Output gate ($M_t$, $d_t$, $a_t$): determines what information from the current cell is used as output data.
The operation of the whole model is described step by step as follows.

Step 1: In the first step, the forget gate $q_t$ determines which information from the input $x_t$ and the hidden state $a_{t-1}$ should be removed and eliminated:

$$q_t = \sigma(U_q x_t + W_q a_{t-1} + b_q)$$

where:
• $q_t$: the data filter gate computed from the output of time step $(t-1)$;
• $U_q$, $W_q$: the corresponding weight matrices of the forget gate;
• $b_q$: the bias vector of the forget gate;
• $x_t$: the input vector at time step $t$;
• $a_{t-1}$: the output of the cell at the previous time step $(t-1)$;
• $\sigma$: the sigmoid function, which maps $q_t$ into the range $(0, 1)$, from 0 (completely forgotten) to 1 (completely remembered).

Step 2: In the second step, the input data is not taken entirely from the input vector $x_t$ and the output $a_{t-1}$ of the previous cell, so it is necessary to select which information should be used to calculate the cell state. For this, we create an input gate ($v_t$, $\tilde{M}_t$) that selects the important information to be used in the current cell:

$$v_t = \sigma(U_v x_t + W_v a_{t-1} + b_v)$$
$$\tilde{M}_t = \tanh(U_M x_t + W_M a_{t-1} + b_M)$$
The parameters $U_M$, $W_M$, $b_M$, $U_v$, $W_v$, $b_v$ play the same roles as in Step 1. The tanh activation function maps the value into the range $(-1, 1)$.
Step 3: The operation of the Reset gate is illustrated in Figure 5. Instead of creating just one gate $q_t$ as in GRU, we need an additional gate $r_t$ to ensure that the effect of the previous hidden state is reduced as much as possible. To further reduce this effect, we also add a new potential hidden state $n_t$ (as seen in Figure 5b).

By adding these two new gates, we can combine the strengths of both the GRU and LSTM algorithms. The Reset gate ($r_t$, $n_t$) is designed with the following formulae:

$$r_t = \sigma(U_r x_t + W_r a_{t-1} + b_r)$$
$$n_t = \tanh\big(U_n x_t + W_n (r_t \odot a_{t-1}) + b_n\big)$$

Step 4: In the fourth step, the cell state $M_t$ at the current time $t$ is calculated from the gates obtained in Steps 1-3.

Step 5: In the final step, the output value $a_t$ of the proposed cell is calculated from the output gate: the variable $d_t$ decides how much information to take from the memory gate $M_t$, combining it with the gate $n_t$ to calculate the hidden state $a_t$ at time $t$.
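For concreteness, one consistent way to instantiate Steps 4 and 5 from the gates above is an LSTM-style state update followed by a GRU-style interpolation between the memory output and the potential hidden state. This is a plausible sketch consistent with the gate definitions, not necessarily the exact GLVP formulae:

$$M_t = q_t \odot M_{t-1} + v_t \odot \tilde{M}_t$$
$$d_t = \sigma(U_d x_t + W_d a_{t-1} + b_d)$$
$$a_t = d_t \odot \tanh(M_t) + (1 - d_t) \odot n_t$$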
The whole process of estimating viewports is summarized in the pseudo code of Algorithm 1 (Viewport Estimation).
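To make the whole cell concrete in code, the following numpy sketch wires the five steps together for a single time step. The per-gate affine forms follow the text above, while the Step 4 and Step 5 combinations use the instantiation assumed earlier, so this is a sketch rather than the paper's verbatim cell:

```python
# One forward step of the described hybrid cell in plain numpy. The gate
# equations follow the text; the Step 4/5 combinations are the assumed
# instantiation sketched above, not the paper's verbatim formulae.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def glvp_cell_step(x_t, a_prev, M_prev, p):
    """One time step; p holds weight matrices U_*, W_* and bias vectors b_*."""
    q = sigmoid(p["U_q"] @ x_t + p["W_q"] @ a_prev + p["b_q"])        # Step 1: forget gate
    v = sigmoid(p["U_v"] @ x_t + p["W_v"] @ a_prev + p["b_v"])        # Step 2: input gate
    M_cand = np.tanh(p["U_M"] @ x_t + p["W_M"] @ a_prev + p["b_M"])   # Step 2: candidate state
    r = sigmoid(p["U_r"] @ x_t + p["W_r"] @ a_prev + p["b_r"])        # Step 3: reset gate
    n = np.tanh(p["U_n"] @ x_t + p["W_n"] @ (r * a_prev) + p["b_n"])  # Step 3: potential hidden state
    M_t = q * M_prev + v * M_cand                                     # Step 4 (assumed combination)
    d = sigmoid(p["U_d"] @ x_t + p["W_d"] @ a_prev + p["b_d"])        # Step 5: output gate
    a_t = d * np.tanh(M_t) + (1.0 - d) * n                            # Step 5 (assumed combination)
    return a_t, M_t

# Tiny demo with random parameters (hidden size 4, input size 2).
rng = np.random.default_rng(0)
H, D = 4, 2
p = {}
for g in "qvMrnd":
    p[f"U_{g}"] = 0.1 * rng.normal(size=(H, D))
    p[f"W_{g}"] = 0.1 * rng.normal(size=(H, H))
    p[f"b_{g}"] = np.zeros(H)
a_t, M_t = glvp_cell_step(rng.normal(size=D), np.zeros(H), np.zeros(H), p)
```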

Experimental Settings
In our experiment, we used 360 videos, specifically diving videos, together with head motion traces. The head movement traces and videos were obtained from the dataset in [21], where the viewport positions are given as quaternions. In our implementation, the viewport positions are transformed into longitude and latitude values to reduce computational complexity. For each of the evaluated approaches, longitude and latitude are computed independently and then blended for the final evaluation. We use the input data set of viewport positions shown in Figure 6. As the figure shows, viewport position #1 differs from viewport position #2: the longitude and latitude of viewport position #1 are more volatile than those of viewport position #2. Because of the variations between the two viewport positions, we can evaluate the algorithms under several distinct patterns of perspective adjustment. In our evaluation, we compare GLVP with other approaches, namely LAST [3], LINEAR [6], LSTM [5], and GRU [7], under the context of tiling-based VAS [9] over the first 6 seconds, because our experiments show that in the following seconds the accuracy of all methods is almost the same; the difference only appears in the first 6 seconds.
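For reference, a minimal sketch of the quaternion-to-longitude/latitude conversion step is shown below, assuming unit quaternions $(w, x, y, z)$ and a forward viewing vector of $(0, 0, 1)$; the dataset's exact axis conventions may differ, so treat this as illustrative:

```python
# Minimal sketch: rotate the forward unit vector (0, 0, 1) by a unit
# quaternion (w, x, y, z) and read off longitude/latitude in degrees.
# Axis conventions are assumptions and may differ from the dataset's.
import numpy as np

def quaternion_to_lon_lat(w, x, y, z):
    # Third column of the rotation matrix, i.e., the rotated forward vector.
    fx = 2.0 * (x * z + w * y)
    fy = 2.0 * (y * z - w * x)
    fz = 1.0 - 2.0 * (x * x + y * y)
    lon = np.degrees(np.arctan2(fx, fz))                  # longitude in [-180, 180]
    lat = np.degrees(np.arcsin(np.clip(fy, -1.0, 1.0)))   # latitude in [-90, 90]
    return lon, lat

print(quaternion_to_lon_lat(1.0, 0.0, 0.0, 0.0))  # identity rotation -> (0.0, 0.0)
```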
Let $total$ denote the set of actually visible tiles at time $t$, and let $total_e$ denote the estimated visible tile set. The Accuracy metric is used in our performance evaluation.

Accuracy [22]: the ratio of accurately-estimated visible tiles to the total number of visible tiles:

$$Accuracy = \frac{|total \cap total_e|}{|total|}$$
Since the purpose of this prediction is to drastically reduce the transmitted data volume rather than consume bandwidth as usual, accuracy is the key metric: the more accurately the set of visible tiles is estimated, the fewer tiles need to be delivered at high quality, bringing better results for users.
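The metric can be computed directly over tile sets, as in the minimal sketch below; the tile indexing and the example sets are our own illustration:

```python
# A minimal sketch of the Accuracy metric over tile sets, assuming tiles are
# identified by (row, col) indices; 'visible' and 'estimated' play the roles
# of total and total_e in the text.
def tile_accuracy(visible: set, estimated: set) -> float:
    return len(visible & estimated) / len(visible)

# Example: 8 truly visible tiles, 6 of them estimated correctly -> 0.75.
visible = {(r, c) for r in range(2) for c in range(4)}
estimated = (visible - {(0, 0), (1, 3)}) | {(0, 4), (1, 4)}
print(tile_accuracy(visible, estimated))  # 0.75
```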

Estimation performance evaluation
As can be seen in Figure 7, GLVP is the most accurate among all solutions throughout the whole 6 seconds. From the third second to the sixth second, the accuracy of all algorithms is above 80%, and so is their redundancy. Over all 6 seconds, only the GLVP algorithm keeps its accuracy above 80%.
The LSTM algorithm yields a low accuracy of only approximately 20% in the first second, because LSTM needs some initial time to process the data and is therefore not highly accurate at the start. From the third second, LSTM reaches a higher accuracy of over 80%. LAST achieves a high accuracy ranging from 70% to 90%; however, LAST is not steady over the entire 6 seconds because it relies only on the previous viewport for prediction, so it will not be highly accurate when the viewer's viewport changes a lot. In addition, over all 6 seconds, LINEAR achieves an accuracy between 60% and 85%. It is interesting to see that LINEAR is always less accurate than LAST throughout the 6 seconds: the LAST technique merely takes the last viewport position as its estimate, whereas the LINEAR approach fits the previous viewport position data to a linear function so as to minimize the root mean squared error. The accuracy of the GRU approach ranges from 50% to 95%. GRU is less accurate than LAST and LINEAR in the first second and better than both over the next three seconds. This can be explained by the fact that the GRU algorithm needs the first moments to process the data, so its accuracy is low in the first seconds, but from then on GRU gives good results. In the second case study, viewport position #2, presented in Figure 8, GLVP also outperforms LAST, LINEAR, LSTM, and GRU, which proves that GLVP works well for different viewport position patterns. From another aspect, we also show the redundancy landscape of those solutions: the results show that GLVP has a smaller redundancy than any of the other solutions. Redundancy serves to reduce the impact of viewport prediction errors on the user experience.

Training time evaluation
Besides investigating the estimation performance of GLVP, we also study the training time of the GLVP model in comparison with the other existing solutions. To measure this parameter, a Python-written experiment was run on a machine with 64-bit Windows 10, 16384 MB of RAM, and an Intel(R) Core(TM) i7-6500U CPU running at 2.50 GHz (4 CPUs). Figure 9 shows the training time of each solution, in which LAST takes the least time. However, this is only the training time: once the training process is finished and the learning model is exported, the inference time of the estimation process is very fast in real time. We assume the learning model only needs to be updated and re-trained periodically, since the data pattern of such an application does not change quickly over time.

Conclusion
In this research, we have proposed a new method called GLVP to predict the viewport position, thereby selecting the most appropriate viewports. The solution has been shown to outperform four existing estimation methods in different scenarios. In the future, we will concentrate on improving the proposed method by incorporating more content-based data.