Retina-based quality assessment of tile-coded 360-degree videos

Nowadays, omnidirectional content, which delivers 360-degree views of scenes, is a significant aspect of Virtual Reality systems. While 360 video requires very high bandwidth, users only see the visible tiles, so a large amount of bitrate can be saved without affecting the user's experience of the service. This fact has led current video adaptation solutions to filter out superfluous parts and save extraneous bandwidth. To form a good basis for these adaptations, it is necessary to understand human video quality perception. In our research, we contribute an effective omnidirectional video database that can be applied to study the effects of the five zones of the human retina. We also design a new video quality assessment method to analyze the impacts of the zones of a 360 video that correspond to those retinal zones. The proposed scheme is found to outperform 22 current objective quality measures by 11 to 31% in terms of PCC.


Introduction
Nowadays, virtual reality technology has become popular and is of great interest to researchers. Virtual reality (VR) systems use omnidirectional content, which includes 360-degree views of scenes, to provide viewers with immersive experiences. Omnidirectional content is typically consumed using Head-Mounted Displays (HMDs), as opposed to standard content displayed on a flat screen. In addition, a user only sees a small portion of the content (called a viewport) that corresponds to the current viewing direction at any given time [1].
In fact, because 360 videos have such a high bitrate, managing limited system resources while ensuring user satisfaction (Quality of Experience, QoE) is a major challenge in omnidirectional content delivery. In particular, such immersive services are expected to be delivered in the future over sixth-generation cellular networks, which require a more comprehensive and prompt capture of QoE [2].
For this goal, many encoding and delivery schemes have been proposed in the literature [3]-[9]. Tiling-based viewport-adaptive streaming is one of the most widely used approaches for 360 video streaming and is receiving a lot of attention from both academia and industry due to its ability to effectively reduce network bandwidth. In tiling-based viewport-adaptive streaming, a 360 video is spatially divided into small sections called tiles, each of which is encoded into numerous copies at varying quality levels.
In general, when choosing tile versions, the visible tiles (those that overlap the viewport) are delivered in high quality, while the other tiles are delivered in low quality. Because users only see the visible tiles, a large amount of bitrate can be saved without affecting the user's experience.
Figure 1. The proposed fitting model for the impact of retina-related areas

To support tile-based viewport-adaptive streaming in VR systems, findings on the weights of different zones of a 360 video are highly necessary. Therefore, the goal of our research is to study user perception of omnidirectional content, with the following contributions: • To build a consistent subjective test scenario to accurately evaluate user perception of different retina-related zones of omnidirectional video. Thereby, we successfully established a database for omnidirectional videos that can be used for further research in the VR field.
• To build an efficient fitting method that can accurately find the impact weights of those different zones on users' quality of experience with omnidirectional videos.
An overview of our overall analysis process is shown in Figure 1 and will be explained in detail in the following sections. To the best of our knowledge, this is the first omnidirectional video database devoted to the effects of the five zones of the human retina. Using this database, we measure, for the first time, the effects of the different zones on perceptual quality using a simple zone-weighted formulation. The zones corresponding to the fovea and parafovea of the human retina turn out to be quantitatively the most significant for quality perception. The proposed fitting model is found to outperform 22 existing objective quality metrics, including foveal quality metrics, under multicast video scenarios with heterogeneous quality.
The rest of the paper is organized as follows: Section 2 provides an overview of the state of the art. Section 3 describes our proposed method, which is followed by an experimental description in Section 4. Finally, Section 5 provides conclusions and future work.

Related Work
A wide variety of objective quality metrics have been proposed over several decades [10]-[28]. Some of these metrics take the foveation feature into account and are hereafter referred to as foveal quality metrics [13]. All of these measures, however, are restricted to traditional content. So far, there has been no foveal quality metric for omnidirectional material, except for the study in [28], whose results we find still limited. According to experimental data, PSNR [26] is the most effective of these metrics for assessing the quality of omnidirectional videos. It is worth mentioning, however, that the images used in that study are of uniform quality; PSNR is ineffective when quality varies spatially. According to our survey, there has been no comprehensive study of objective quality metrics for omnidirectional images with tile-varying quality in the literature. In this paper, we will show that our proposed solution outperforms existing metrics in assessing the quality of omnidirectional videos. On the other hand, subjective quality assessment has also drawn researchers' attention recently as omnidirectional images/videos have become popular. There has been some research on subjective quality assessment of omnidirectional content [30]-[34]. In these studies, numerous distortion types, such as compression and Gaussian blur, were used to generate images for user rating in the subjective tests. In [30], the authors used four types of distortion: JPEG compression, JPEG2000 compression, Gaussian blur, and Gaussian noise. The authors in [31] used only one distortion type, H.265/HEVC compression. In [32], JPEG compression, JPEG2000 compression, and HEVC-intra compression were used. In [33], downsampling and JPEG compression were used to create the distorted images. In [34], video references are encoded using H.265 with a constant rate factor (CRF) of 10 and compressed with quantization parameters (QP) of 22, 27, 32, 37, and 42 using the libx265 encoder of the FFmpeg tool. However, the above five studies only used images and videos with uniform distortion and did not contain images or videos of non-uniform quality. Therefore, these schemes are not suitable for VR video streaming, where user-focused areas should have high quality and less noticeable areas should have lower quality to save network resources. Furthermore, the foveation feature of the human eye was not taken into consideration when these databases were built.
Currently, there are also some studies on subjective quality assessment of images/videos with non-uniform quality in the literature [28], [35], [36], [37]. However, the studies in [36] and [37] only consider traditional content, not omnidirectional content. In [36], each image is split into four equal-width zones. To produce a distorted image with non-uniform quality, the quality level of the zones is gradually decreased by a fixed step size. According to the findings, when the step size is small, the difference in quality of experience (QoE) between non-uniform and uniform videos is negligible. Furthermore, the greatest step size that may be used without generating noticeable quality changes is determined by content characteristics. In [37], the authors divided each image into three zones with different quality levels: foveal, blending, and peripheral zones. The experimental results show that participants rarely perceive quality declines in peripheral zones with eccentricities greater than 7.5 degrees. Unlike [36] and [37], our research focuses on evaluating the quality of omnidirectional video experiences across five different regions. Therefore, it is possible to predict the quality deterioration at peripheral locations with eccentricities higher than 7.5 degrees.
Besides, studies [35] and [28] are the two works that evaluate the quality of experience of omnidirectional content with non-uniform quality. The authors of [35] are interested in figuring out how to spatially lower image quality without affecting user perception. They propose that an omnidirectional image be divided into three zones corresponding to the macula, the near periphery, and the far periphery of the human retina. Each zone's image quality is gradually reduced until participants perceive a difference. From the study results, a model including coding parameters for the regions is proposed as a guide to reduce spatial image quality without loss of sensation. In addition, the authors have also shown that model-guided non-uniform quality scales accelerate image rendering by about 10× compared to legacy schemes with uniform quality. However, the impacts of the zones of the human retina on the quality of experience were not clearly quantified in [35]. Also, no performance evaluation of existing objective quality assessment metrics was performed in [35]. In [28], the authors extract viewports from the original omnidirectional images to create their dataset. In our study, the dataset is generated from omnidirectional videos, which is more in line with the current trend, as VR video streaming is of great interest. Based on the zones of the human retina, the authors in [28] divided each image into 5 zones: the fovea, the parafovea, the perifovea, the near periphery, and the far periphery. To produce a distorted image with non-uniform quality, these zones are encoded with different quality levels based on two scenarios: the quality degrades from the center to the periphery and vice versa. However, creating these non-uniform quality levels based on subjective factors is not suitable for real systems, where network conditions will mainly determine perceived quality. In this paper, we use a tile-based non-uniform dataset in which the quality levels of the tiles are selected based on bandwidth traces and head-movement traces. In addition, a measure to quantify the impact of regions on the quality of experience, developed from the mean squared error (MSE), is also presented. In our study, a new metric, WZUQI, is formed from the UQI index (Universal Image Quality Index), which was proven to outperform common metrics such as PSNR and MSE under different types of image distortion.

Proposed solution for quantifying the impacts of viewport zones on human quality of experience
In this paper, we first propose a QoE metric for 360 video services, called WZUQI (weighted-zone UQI), which will be used to investigate the effects of various zones on the perceived quality of 360-degree videos. Second, we propose a new mapping function to predict QoE (or MOS) from WZUQI automatically, without requiring a large subjective test measurement conducted by a large pool of users. This mapping function is shown to predict MOS values that closely match the real MOS rated by end users. Let us elaborate our whole method step by step as follows:

Step 1: Formulate a new QoE metric - WZUQI
In our WZUQI method, a virtual viewport is divided into $K = 5$ zones, as illustrated in Figure 2, with the corresponding zone eccentricities $e$ shown in TABLE 1. Zone $Z_1$ corresponds to the region of the fovea, zone $Z_2$ to the parafovea, zone $Z_3$ to the perifovea, zone $Z_4$ to the near periphery, and zone $Z_5$ to the far periphery of the retina. Let $\{w_k \mid 1 \leqslant k \leqslant K\}$ denote the weights representing the impact of each zone $Z_k$ on the human quality experience. Here the $w_k$ must satisfy the condition $\sum_{k=1}^{K} w_k = 1$.

Let $X = \{x_i \mid i = 1, 2, \ldots, N\}$ and $Y = \{y_i \mid i = 1, 2, \ldots, N\}$ be the original viewport and the distorted viewport, respectively.

WZUQI is formed from the UQI index (Universal Image Quality Index) proposed in [12]. This objective image quality index is mathematically determined and has advantages such as ease of computation, low computational complexity, and independence from the images being tested, the viewing conditions, and the individual observers. UQI models any distortion as a combination of three factors: loss of correlation, luminance distortion, and contrast distortion, and it was proven to outperform common measures such as PSNR and MSE under different types of image distortion. The Universal Image Quality Index over a viewport of $N$ pixels is defined as follows:

$$UQI = \frac{4\,\sigma_{xy}\,\bar{x}\,\bar{y}}{\left(\sigma_x^2 + \sigma_y^2\right)\left(\bar{x}^2 + \bar{y}^2\right)} \quad (1)$$

where $\bar{x}$ and $\bar{y}$ are the means of $\{x_i\}$ and $\{y_i\}$, $\sigma_x^2$ and $\sigma_y^2$ are their variances, and $\sigma_{xy}$ is their covariance, calculated as follows:

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i,$$

$$\sigma_x^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2, \qquad \sigma_y^2 = \frac{1}{N-1}\sum_{i=1}^{N}(y_i - \bar{y})^2, \qquad \sigma_{xy} = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y}).$$

Then WZUQI is formed as the zone-weighted sum of the per-zone UQI values:

$$WZUQI = \sum_{k=1}^{K} w_k \cdot UQI_k \quad (2)$$

where $UQI_k$ is the UQI computed over the pixels of zone $Z_k$.
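To make the computation concrete, the following Python sketch implements Eqs. (1) and (2) for a grayscale viewport. It is our own illustration, not released code: the linear pixel-to-eccentricity mapping, the 55-degree half field of view, and the function and argument names are assumptions, and the eccentricity bounds should be taken from TABLE 1.

```python
import numpy as np

def uqi(x, y):
    """Universal Image Quality Index, Eq. (1), over two pixel vectors."""
    x = np.asarray(x, dtype=np.float64).ravel()
    y = np.asarray(y, dtype=np.float64).ravel()
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(ddof=1), y.var(ddof=1)
    cov = ((x - mx) * (y - my)).sum() / (x.size - 1)
    return 4 * cov * mx * my / ((vx + vy) * (mx ** 2 + my ** 2))

def zone_masks(height, width, ecc_bounds_deg, half_fov_deg=55.0):
    """Boolean masks for the K concentric zones of a viewport.

    ecc_bounds_deg: outer eccentricity of each zone in degrees (the
    TABLE 1 values); pixel eccentricity is approximated as linear in
    the distance from the viewport centre, which is a simplification.
    """
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    yy, xx = np.mgrid[0:height, 0:width]
    ecc = np.hypot(yy - cy, xx - cx) / (width / 2.0) * half_fov_deg
    bounds = [0.0] + list(ecc_bounds_deg[:-1]) + [np.inf]  # last zone covers the rest
    return [(ecc >= lo) & (ecc < hi) for lo, hi in zip(bounds, bounds[1:])]

def wzuqi(orig, dist, weights, ecc_bounds_deg):
    """Weighted-zone UQI, Eq. (2): sum over zones of w_k * UQI_k."""
    masks = zone_masks(orig.shape[0], orig.shape[1], ecc_bounds_deg)
    return sum(w * uqi(orig[m], dist[m]) for w, m in zip(weights, masks))
```

The weights passed to wzuqi must sum to one; the 55-degree half field of view is a natural default given the HTC VIVE PRO EYE's roughly 110-degree field of view, but any calibration can be substituted.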

Step 2: Form a new MOS mapping function
After determining the UQI index of each region ($UQI_k$), the objective QoE metric WZUQI is defined by formula (2). Based on this QoE metric, we propose a new mapping function to calculate the Mean Opinion Score (MOS). MOS is rated on a five-point scale (1 - bad, 2 - poor, 3 - fair, 4 - good, and 5 - excellent). Basically, MOS is given by users and reflects their satisfaction with a video service or any internet service. Initially, the average MOS of a population must be calculated based on a large subjective test measurement. Such a subjective test requires a lot of time to collect data as well as a large number of participants. Therefore, it is necessary to have a mapping function that can calculate a predicted MOS from some QoE metric, and that predicted MOS should be highly correlated with the real MOS values rated by end users in reality.
In our paper, a 5-parameter logistic function is used to predict the MOS (Mean Opinion Score) values from WZUQI values. The 5-parameter logistic function has demonstrated good performance in mapping objective quality indicators to MOS in the state of the art [26], [38]. The predicted MOS ($\widehat{MOS}$) is computed as follows:

$$\widehat{MOS} = \alpha_1 \left( \frac{1}{2} - \frac{1}{1 + e^{\alpha_2 (WZUQI - \alpha_3)}} \right) + \alpha_4 \cdot WZUQI + \alpha_5 \quad (3)$$

where $\{\alpha_i \mid i = 1, 2, 3, 4, 5\}$ are the parameters to be fitted. In our work, least-squares fitting is used to fit the parameters $\alpha_i$ and the weights $w_k$, as described in [39]. In the following subsections, we describe our testing scenario in order to evaluate the performance of WZUQI in terms of fitting accurately to the real MOS rated by real users. In our experiment, the proposed model runs on top of the tile-based non-uniform dataset formed by the research team. Figure 5 shows that the predicted MOS gets quite close to the subjective MOS rated by viewers.
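As a minimal sketch of this mapping step, formula (3) can be fitted with scipy.optimize.curve_fit; the initial guess p0, the placeholder data, and the choice of curve_fit (rather than the exact least-squares routine of [39]) are our assumptions.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic5(q, a1, a2, a3, a4, a5):
    """Five-parameter logistic mapping, formula (3), from objective score q to MOS."""
    return a1 * (0.5 - 1.0 / (1.0 + np.exp(a2 * (q - a3)))) + a4 * q + a5

# wzuqi_scores: WZUQI values of the rated viewports; mos: their subjective MOS.
rng = np.random.default_rng(0)
wzuqi_scores = rng.uniform(0.2, 1.0, 240)         # placeholder data
mos = 1.0 + 4.0 * (wzuqi_scores - 0.2) / 0.8      # placeholder data

p0 = [np.ptp(mos), 1.0, float(np.median(wzuqi_scores)), 0.0, float(np.mean(mos))]
params, _ = curve_fit(logistic5, wzuqi_scores, mos, p0=p0, maxfev=20000)
mos_pred = logistic5(wzuqi_scores, *params)
```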

Data Set Establishment
Our established dataset is formed from a subjective test measurement with 240 non-uniform viewports. Figure 3 shows the 95% confidence intervals of the MOS values. It can be seen that the scores cover the entire value range from 2.5 to nearly 3.5. At the two extremities of the score scale, the confidence intervals are typically smaller. This is because participants are more comfortable rating very high (or very poor) quality stimuli. In the following subsections, we describe how we prepared the videos and set up the subjective test to achieve a reliable dataset.

Video preparation
In this section, we describe how this tile-based non-uniform dataset is established and the experiment setup used to capture consistent ratings of users on the video quality. A dataset with tighter confidence intervals shows lower variability, which results in a smaller margin of error. Therefore, designing a good experiment to achieve a reliable dataset that truly represents the large population of viewers is extremely important. In our experiment, we use four 360-degree videos available on YouTube with different types of content, such as indoors, under the ocean, natural landscapes, crowded streets, human faces, daylight, night light, etc. The purpose of covering a variety of different video textures is to achieve a dataset that can represent a wide range of user quality experiences. The specific characteristics of the four videos are described in TABLE 2.
All these videos have a resolution of 3840 × 1920 (4K), 1792 frames, and a frame rate of 30 fps. In order to create videos of non-uniform quality, each video is divided into T = 64 tiles (i.e., 8 × 8 tiling), each with a resolution of 480 × 240. We use the HEVC format to encode each tile into N = 7 versions corresponding to the 7 QP values 24, 28, 32, 36, 40, 44, and 48. From each streaming session, we select 60 viewports to ensure that the error values vary widely, corresponding to user experience quality scores from high to low. Therefore, we finally obtain 240 tile-based images corresponding to 240 viewports, forming the desired tile-based dataset.
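To illustrate the encoding step, the loop below crops one tile out of the 8 × 8 grid and encodes it with libx265 at the seven QP values via FFmpeg. The file names and the fixed-QP x265 option reflect a typical pipeline of this kind, not necessarily the exact commands used in our experiments.

```python
import subprocess

QPS = [24, 28, 32, 36, 40, 44, 48]
TILE_W, TILE_H = 480, 240  # 3840/8 x 1920/8

def encode_tile(src, row, col, qp, out):
    """Crop tile (row, col) and encode it as HEVC at a fixed QP using FFmpeg."""
    crop = f"crop={TILE_W}:{TILE_H}:{col * TILE_W}:{row * TILE_H}"
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-vf", crop,
         "-c:v", "libx265", "-x265-params", f"qp={qp}",
         "-an", out],
        check=True,
    )

for qp in QPS:
    encode_tile("video_4k.mp4", row=0, col=0, qp=qp, out=f"tile_r0_c0_qp{qp}.mp4")
```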
These viewports are collected continuously, in sequence, from the context of the same video. In this way, the content-correlated viewports give viewers a consistent quality experience, as if they were watching and assessing the quality of a video, rather than assessing sporadic viewports from different contexts or topics. With the dataset described above, we show some viewport samples extracted from the four videos in Figure 4.

Subjective test set-up
To display the viewports, we use a set of devices consisting of an HTC VIVE PRO EYE headset and a computer. The HTC VIVE PRO EYE has dual OLED displays with a combined resolution of 2880 × 1600 pixels, 615 PPI, and a 110-degree field of view. In the testing, we employed the Absolute Category Rating approach [40], which was shown to be the best method in [37]. Before starting the testing process, participants are guided to familiarize themselves with the equipment, the testing process, and the scoring. To avoid fatigue, the viewports from different videos are displayed alternately. Before an image is displayed, participants are asked to focus their gaze on a central point of the screen and hold that gaze as the image appears. For each viewport, participants spend 5 seconds maintaining their gaze and 10 seconds rating the quality level, and then take a break of 5 seconds. Each participant gives a verbal score on a scale of 1 (bad) to 5 (excellent) that is then recorded by an assistant. The test is divided into 4 sessions of 60 viewports each, with an average duration of no more than 20 minutes per session. The scores are collected from 40 participants with ages ranging from 18 to 40 years. Following Recommendation ITU-T P.913 [41], a screening analysis of the collected scores is conducted. Consequently, three participants are rejected. The average scores of the valid participants are then used as the MOSs of the corresponding images.
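As an illustration of the screening step, the sketch below applies one common P.913-style rule, rejecting participants whose ratings correlate poorly with the rest of the panel; the 0.75 threshold and the exclude-self averaging are assumptions, not necessarily the exact procedure applied in this study.

```python
import numpy as np
from scipy.stats import pearsonr

def screen_subjects(scores, threshold=0.75):
    """scores: (num_subjects, num_stimuli) matrix of raw ratings.

    Keeps a subject if the Pearson correlation between their ratings
    and the mean ratings of all other subjects reaches the threshold.
    """
    scores = np.asarray(scores, dtype=float)
    keep = []
    for s in range(scores.shape[0]):
        others = np.delete(scores, s, axis=0).mean(axis=0)
        r, _ = pearsonr(scores[s], others)
        keep.append(r >= threshold)
    return np.array(keep)
```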

Table 3. An overview of the referenced methods

MSE [25]: The Mean Squared Error, calculated over the visible pixels in a viewport with equal weights.
SSIM [10]: Structural SIMilarity, calculated using the structural similarity idea.
UQI [12]: Universal Image Quality index, in which any distortion is modeled as a combination of three factors: loss of correlation, luminance distortion, and contrast distortion.
VIFp / VIF [14]: The Visual Information Fidelity in the pixel domain (VIFp) and in the wavelet domain (VIF), calculated using the relationship between image information and visual quality.
NQM [15]: Noise Quality Measure, the signal-to-noise ratio of the restored distorted image in comparison to the model-restored image.
IW-PSNR [16]: Information Content Weighted PSNR, combining information content weighting with the PSNR metric.
FSIM [17]: Feature SIMilarity, combining low-level feature weighting with local similarity measurements.
FSIMc [17]: Feature SIMilarity incorporating chromatic information, combining low-level feature weighting with local similarity measurements.
SR-SIM [19]: Similarity based on spectral residuals, calculated using a spectral residual visual saliency model.
RFSIM [18]: Riesz-transform-based Feature SIMilarity, combining low-level feature weighting based on Riesz transforms with local similarity measurements.
ADD-SSIM [20]: Analysis of distortion distribution, considering four factors: distortion position, distortion intensity, frequency changes, and histogram alteration.
PSIM [21]: Perceptual SIMilarity, combining gradient magnitude similarities at two scales with color information similarity and a reliable perceptual-based pooling.
WSNR [22]: Weighted Signal-to-Noise Ratio, the ratio of the average weighted signal power to the average weighted noise power, with the contrast sensitivity function as the weighting function.
FMSE [24]: Foveal MSE, which weights errors according to the fovea, the area of the retina with the highest density of photoreceptors and the highest visual acuity, with acuity rapidly decreasing for regions away from the point of gaze.
FPSNR [23]: Foveal Peak Signal-to-Noise Ratio, combining the PSNR measurement with a weighting for each pixel based on the local frequency at that pixel.
F-SSIM [13]: Foveal SSIM, combining the SSIM metric with a weighting for each macroblock based on the local frequency of the pixels in that macroblock.
GSIM [27]: Estimates the visual quality of an omnidirectional photo; a composite assessment of all weights and patch scores is used to estimate an image quality score.
PSNR [26]: Peak Signal-to-Noise Ratio, calculated with each pixel's weighting multiplied by the local frequency at that pixel.
ZWF [28]: A zone-based method focusing on the visual features of the human eye, used to assess the quality of omnidirectional images by concentrating on various zones around the foveation point.
In this study, to reflect what viewers actually watched, the 22 metrics were calculated only over the viewports (i.e., the visible pixels) of the omnidirectional videos. We utilize the 360Lib software created by the Joint Video Experts Team (JVET) [42] to extract the viewports.
To quantify the fitting performance of the OQM metrics with respect to the MOS values, we use three performance measures: the Pearson Correlation Coefficient (PCC), the Root Mean Square Error (RMSE), and the Spearman Rank Order Correlation Coefficient (SROCC). As for our proposed metric, the OQM and MOS values are mapped using the five-parameter logistic function (i.e., formula (3)).
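For reference, a short sketch of the three measures, assuming arrays of subjective MOS and predicted MOS such as those produced by the logistic mapping above:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def fitting_performance(mos, mos_pred):
    """PCC, RMSE, and SROCC between subjective MOS and predicted MOS."""
    mos, mos_pred = np.asarray(mos, float), np.asarray(mos_pred, float)
    pcc, _ = pearsonr(mos_pred, mos)
    srocc, _ = spearmanr(mos_pred, mos)
    rmse = float(np.sqrt(np.mean((mos - mos_pred) ** 2)))
    return pcc, rmse, srocc
```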

Impacts of the zones
As described in Section 3, the weights $w_k$ are derived for each source video by the five-parameter logistic fitting function. Note that we use only the viewports of each video in our fitting. The values of the weights obtained through the experiment are shown in TABLE 7. As illustrated in TABLE 7, the $w_k$ values of the four videos are approximately the same. This can be explained by the fact that, during our subjective measurement experiment, users were asked to keep their gaze on the center of the screen. Except for $w_1$ and $w_2$, all other weight values are low (i.e., ⩽ 0.18), which shows that the zones with eccentricity $e > 4$ have little impact on human perceptual quality. Furthermore, the result $w_1 \geqslant w_2 \geqslant w_3 \geqslant w_4 \geqslant w_5$ proves that near-center distortions have a more significant effect on the user's quality experience than far-center distortions. In addition, the fovea region of the retina has the highest cone density, which explains why $w_1$ always has the highest value. Overall, although we used different video streams, our experimental results show that our method has good stability and is suitable for many different videos.
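Since the weights are obtained jointly with the logistic parameters, one possible sketch of the joint fit parameterizes $w_5 = 1 - \sum_{k=1}^{4} w_k$ to enforce the sum-to-one constraint and solves the least-squares problem directly. The optimizer, the initialization, and this constraint handling are our assumptions; [39] may prescribe a different procedure.

```python
import numpy as np
from scipy.optimize import least_squares

def residuals(theta, zone_uqi, mos):
    """theta = [a1..a5, w1..w4]; w5 = 1 - (w1+w2+w3+w4) keeps sum(w) = 1."""
    a, w_head = theta[:5], theta[5:9]
    w = np.append(w_head, 1.0 - w_head.sum())
    q = zone_uqi @ w  # WZUQI of every viewport, Eq. (2)
    pred = a[0] * (0.5 - 1.0 / (1.0 + np.exp(a[1] * (q - a[2])))) + a[3] * q + a[4]
    return pred - mos

# zone_uqi: (num_viewports, 5) per-zone UQI values; mos: subjective scores.
rng = np.random.default_rng(1)
zone_uqi = rng.uniform(0.2, 1.0, size=(240, 5))   # placeholder data
mos = 1.0 + 4.0 * zone_uqi[:, 0]                  # placeholder data

theta0 = np.concatenate(([np.ptp(mos), 1.0, 0.5, 0.0, float(mos.mean())],
                         np.full(4, 0.2)))
sol = least_squares(residuals, theta0, args=(zone_uqi, mos))
alphas, w_head = sol.x[:5], sol.x[5:9]
weights = np.append(w_head, 1.0 - w_head.sum())
```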

Conclusion
In this paper, we conducted subjective and objective quality assessments of omnidirectional video with tile-varying quality, with a focus on the human eye's foveation feature. We found that the sensitivity of the human eye and the content being scored affect perceived quality. For perceptual quality, the zones of a viewport that correspond to the fovea and parafovea of the human eye are particularly significant. In our experimental evaluations, we found that our scheme can improve PCC by 11% to 31% compared to the other 22 methods. In the future, we will extend the content genres and investigate quality oscillation patterns to gain insights into the perceptual habits of viewers and the performance of the metrics.