Multi-target trajectory tracking in multi-frame video images of basketball games based on deep learning

INTRODUCTION: Occlusion interferes with multi-target visual tracking in basketball video images, which leads to poor accuracy of multi-target trajectory tracking. This paper studies a multi-target trajectory tracking method for multi-frame video images of basketball games based on deep learning. OBJECTIVES: To address target occlusion during tracking and the trajectory tracking anomalies it causes, a correction algorithm is proposed. METHODS: The method is divided into two parts: detection and tracking. In the detection part, the YOLOv3 algorithm is used to detect each target in the video, and the original YOLOv3 backbone network Darknet-53 is replaced by the lightweight backbone network MobileNetV2 to extract the target features. RESULTS: Based on the target detection results, the Kalman filter predicts the next position and bounding box size of each target from its current position to obtain the target trajectory prediction results. A hierarchical data association algorithm is then designed, and multi-target tracking of the same category is completed based on the target appearance feature similarity and motion feature similarity. CONCLUSION: The experimental results show that the method can accurately detect the targets in multi-frame video images of basketball games and obtain high-precision target trajectory tracking results.


Introduction
Among many sports events, basketball games have the largest audience and the highest attention. With the improvement in quality of life and the rapid advancement of technology, the requirements for basketball videos keep rising. For example, the current passive, flat viewing model will gradually fail to meet the needs of TV viewers, and broadcasters need to add various visual effects to meet the visual demands of the audience. In game research and analysis, coaches need to extract relevant data from game videos to assist players in tactical research. In commercial applications, broadcasters also need to fully tap the commercial value contained in basketball game broadcasts. All of these require analyzing basketball video data and processing basketball video images according to different requirements, so as to meet the segmentation and classification requirements of online video objects [1]. Therefore, the segmentation of moving objects in online video and of their motion attributes has high practical value and relevance [2]. Through the automatic extraction of visually salient image regions in video sequences, the accuracy and efficiency of moving object detection, extraction, localization and tracking can be effectively improved [3].
Yong Gong and Gautam Srivastava
Scholars have proposed several approaches to target tracking. For multi-agent target tracking, Reference [4] formulated the tracking task as a distributed model predictive control (DMPC) problem, combined the adaptive differential evolution (ADE) algorithm with Nash optimization, and proposed a Nash-combined ADE method. Reference [5] proposed a multi-target tracking performance and trajectory prediction method for basketball players to address the multi-target localization and tracking problem. The subjects tracked multiple targets moving on a computer screen, which could disappear briefly and then reappear during the movement; they were asked to track the targets continuously and to report the final positions, the number of manipulated targets, and the locations at which the targets reappeared after disappearing.
To improve the tracking accuracy of spatially dense group targets, Reference [6] proposed a group target association and tracking algorithm based on the global nearest neighbour. Following the principle of "global optimum", the closest group target and measurement are selected for priority association and update, which avoids association conflicts, reduces association errors, and effectively resolves the trade-off between association accuracy and real-time tracking. At the same time, a combination of track prediction and track forecast is proposed to solve the track intermittency and fusion problem in the tracking process. Reference [7] proposes a secure distributed detection system for industrial image and video data based on IPFS and blockchain, together with a decentralized peer-to-peer image and video sharing platform that uses IPFS and perceptual hash (pHash) technology to detect copyright infringement of multimedia. When multimedia is uploaded to IPFS, its pHash is computed and checked against the pHash values already stored in the blockchain network; similarity to an existing pHash value causes the multimedia to be flagged as tampered with. Blockchain technology has the advantage of requiring no third-party participation, thus avoiding a single point of failure. Reference [8] proposes a sports safety information mining platform based on multimedia data sharing technology. The hardware part of the platform includes a teaching multimedia data sharing module, a shared server module, a shared client and a Web server. To achieve low-latency and low-energy information transmission, ZigBee technology is introduced into the software design to implement information communication and complete the evaluation of mining quality.
Although research on target tracking has made considerable progress and breakthroughs in recent years, methods for sequence target feature extraction, joint target detection and target trajectory tracking are still imperfect because of the complexity of the external environment, noise, and target deformation. The core problem of object tracking is feature expression: appropriate features need to be selected according to the application scenario. However, the tracking effect still falls far short of the needs of practical applications.
In recent years, research on deep learning has given researchers new ideas for target tracking. The field of computer vision has developed rapidly since the emergence of deep learning technology, which has been widely used in image classification, and deep learning-based multi-target tracking algorithms have also made some breakthroughs. Multi-target tracking is a very challenging research direction in computer vision and has a wide range of application scenarios, such as intelligent video monitoring and control, abnormal behavior analysis and mobile robot research. Traditional multi-target tracking algorithms often track poorly because of poor target detection. A detector based on deep learning can obtain better target detection results and improve the accuracy of target tracking. Therefore, the effective combination of object tracking and deep learning has become a focus of researchers in the tracking field. This paper studies a multi-target trajectory tracking method for multi-frame video images of basketball games based on deep learning. The YOLOv3 algorithm is used to detect each target in the video, and the original YOLOv3 backbone network Darknet-53 is replaced with the lightweight backbone network MobileNetV2 to extract target features. The Kalman filter is used to predict the next position and bounding box size from the current basketball target position, which gives the trajectory prediction result of the basketball target. Multi-target tracking of the same category is then completed, and a correction algorithm is proposed for target occlusion in the tracking process and the abnormal trajectory tracking it causes. Together, these steps implement a deep learning-based multi-target trajectory tracking method for basketball video.

Overall framework of video multi-target trajectory tracking method
The overall framework of the video multi-target trajectory tracking method based on deep learning is shown in Figure 1, which is mainly divided into two parts: detection and tracking.
The detection mainly combines the YOLOv3 algorithm in deep learning to detect and recognize video multitargets. Firstly, image preprocessing is performed on the video sequence. Then multi-scale point convolutional neural network [9] is used to achieve the target segmentation of the image and obtain the convolution feature map in the video image. The input video feature map is analyzed and filtered by the detection network. Finally, the frame of the optimal target is obtained by confidence calculation and multi-scale prediction, the multi-target in the video is classified by the classifier, and the center point coordinates of the frame of the optimal target in the video are obtained.
The tracking part combines the multi-target detection results of the detection stage to perform data association and tracking. The center point coordinates of each target's optimal box are input into the Kalman filter to predict the center point at the next time step, that is, the multi-target trajectory prediction. The bounding box data of different targets output by the detector are associated to determine the number of different targets. The measured center points of the targets and the center points estimated for the same time step are combined to obtain the optimal estimates of the real states of the targets. If data association fails due to occlusion, the hierarchical data association algorithm is used, and a target that newly appears at that moment is associated with the target that disappeared due to occlusion. If partial occlusion causes abnormal trajectory fluctuations, the trajectory anomaly correction algorithm is used to correct the target boxes and trajectories.

Multi-object detection based on deep learning
Multi-target detection methods based on deep learning can be divided, by detection mode, into two-stage detection algorithms, represented by the region-based convolutional neural network (R-CNN), and single-stage detection algorithms. The two-stage detection algorithm works as follows: first, region proposals provide location information, and then classifiers provide category information. Because of this detection mode, real-time detection cannot be guaranteed. The single-stage detection algorithm provides a new and more direct idea: the whole image is used as the network input, and the position and category of a compact rectangular bounding box containing each object are regressed directly at the output layer. This transforms the multi-object detection [10] problem into a regression problem, which greatly improves the detection speed.

YOLOv3 algorithm design
The YOLOv3 algorithm does not use classic backbone networks such as VGG-16 or ResNet-50; instead, it uses its own backbone network, Darknet-53, for feature extraction. There is no pooling layer or fully connected layer in the original network structure of YOLOv3. During forward propagation, the size of the tensor is changed only by the stride of the convolution kernel: a stride of 2 halves the side length of the feature map and reduces its area to 1/4. Therefore, after five downsampling steps, the side length of the feature map is 1/32 of that of the original video image. Borrowing the idea of FPN (Feature Pyramid Networks), the algorithm detects moving video objects of different sizes at multiple scales and outputs three feature maps of different scales, namely 13×13, 26×26 and 52×52. This significantly improves the detection performance of YOLOv3 compared with previous versions of the YOLO algorithm.
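The relation between stride-2 downsampling and the three output scales can be checked with a few lines of arithmetic. The sketch below is only illustrative: the input resolution of 416 pixels is an assumption (it is the value consistent with a 13×13 coarsest map), and `feature_map_sizes` is a hypothetical helper name, not part of any YOLOv3 code base.

```python
def feature_map_sizes(input_size, strides=(32, 16, 8)):
    """Return the side length of each detection feature map.

    Five stride-2 convolutions give a total stride of 2**5 = 32 for the
    coarsest map; the two finer scales correspond to total strides 16 and 8.
    """
    sizes = []
    for stride in strides:
        if input_size % stride != 0:
            raise ValueError("input_size must be divisible by every stride")
        sizes.append(input_size // stride)
    return sizes


# 416 is an assumed input resolution, chosen because 416 / 32 = 13.
print(feature_map_sizes(416))
```

With a 416×416 input the helper returns the side lengths 13, 26 and 52 quoted above; a larger input simply scales all three maps.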
Compared with Faster R-CNN and other two-stage detection algorithms, the YOLOv3 algorithm has obvious advantages in detection speed, but it has two shortcomings. First, the trained model is large and not suitable for embedded devices. Second, inference on a CPU takes a long time. To solve these problems, an optimization method is designed for YOLOv3.

Optimization method design
Model optimization can be approached through backbone network optimization, optimizer optimization and model pruning. Considering the implementation steps and difficulty of each optimization method, YOLOv3 is optimized here by adjusting the backbone network used in its multi-target detection process.
To obtain a lightweight model, MobileNetV2 is used as the backbone network to replace the original network for feature extraction. MobileNetV2 is an optimized version of MobileNets and an efficient model for mobile and embedded devices. MobileNets is a mainstream lightweight network based on a streamlined architecture, which uses depthwise separable convolution to build deep neural networks. MobileNets decomposes the standard convolution into a depthwise convolution and a pointwise convolution to greatly reduce the number of parameters and the amount of computation.
The computation of the standard convolution is as follows. If the number of input channels is $M$ and the number of output channels is $N$, then, writing $D_K$ for the side length of the convolution kernel and $D_F$ for the side length of the feature map as in the standard MobileNets notation, the corresponding calculation amount is:
$$D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$$
The standard convolution can be split into a depthwise (DW) convolution and a pointwise (PW) convolution [11]. Specifically, the depthwise convolution performs the filtering, with one filter of size $D_K \times D_K \times 1$ per input channel, and the pointwise convolution combines the channels with $1 \times 1$ kernels. The calculation amount of the depthwise convolution and the pointwise convolution is:
$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$$
Compared with the standard convolution, the total calculation amount is reduced to the fraction
$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}$$
Depthwise separable convolution [12] is used to extract multi-object features from video images.
The improvement of MobileNetV2 is to add a PW convolution before the DW convolution [13]. Due to its computational nature, the DW convolution cannot change the number of channels: its number of output channels can only equal the number of channels passed in by the previous layer. If the previous layer has few channels, DW can only extract video multi-object features in a low-dimensional space, which often leads to poor results. Therefore, a PW convolution is added before each DW convolution to raise the dimension. MobileNetV2 also drops the activation function after the second PW convolution, which is called a linear bottleneck. Although the activation function effectively increases nonlinearity in a high-dimensional space, it destroys the multi-object features of the video in a low-dimensional space, and the main function of the second PW convolution is dimensionality reduction.
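The saving from splitting a standard convolution into a depthwise and a pointwise convolution can be verified numerically. The sketch below counts multiply-add operations under the usual MobileNets assumptions (stride 1, output feature map the same size as the input). The names `d_k` (kernel side length) and `d_f` (feature map side length) and the sample values are assumptions chosen for illustration; `m` and `n` are the input and output channel counts from the text.

```python
def standard_conv_cost(d_k, m, n, d_f):
    """Multiply-adds of a standard convolution."""
    return d_k * d_k * m * n * d_f * d_f


def depthwise_separable_cost(d_k, m, n, d_f):
    """Multiply-adds of a depthwise convolution plus a 1x1 pointwise one."""
    depthwise = d_k * d_k * m * d_f * d_f  # one d_k x d_k filter per input channel
    pointwise = m * n * d_f * d_f          # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative values only: 3x3 kernel, 32 -> 64 channels, 52x52 feature map.
d_k, m, n, d_f = 3, 32, 64, 52
ratio = depthwise_separable_cost(d_k, m, n, d_f) / standard_conv_cost(d_k, m, n, d_f)
print(f"separable / standard cost ratio: {ratio:.4f}")
print(f"1/N + 1/D_K^2:                   {1 / n + 1 / d_k ** 2:.4f}")
```

For a 3×3 kernel the ratio is dominated by the 1/9 term, so the separable form needs roughly one eighth to one ninth of the computation of the standard convolution.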
MobileNetV2 uses a 1×1, 3×3, 1×1 structure with a shortcut that adds the input to the output. ResNet uses standard convolution (SC) for multi-object feature extraction, while MobileNetV2 always uses DW convolution. Intuitively, MobileNetV2 combines inverted residuals and DW convolution to extract multi-object features effectively.

Design of target tracking algorithm
After YOLOv3 detects multiple targets in each frame of the video, the same target needs to be tracked across successive frames so that its trajectory is generated frame by frame. The Kalman filter model [14] is used to predict the position of the target in the next frame.
The advantage of the Kalman filter is that it can be applied to any dynamic system with uncertain information to make a reasonable prediction of what the system will do next, and its estimate remains close to the real situation even with noise interference.
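The prediction step of the Kalman filter is given by Equation (5):
$$\hat{x}_k = F_k \hat{x}_{k-1} + B_k u_k, \qquad P_k = F_k P_{k-1} F_k^{T} + Q_k \tag{5}$$
where $\hat{x}_k$ is the state estimate, $P_k$ is its covariance, $F_k$ is the state transition matrix, $B_k u_k$ is the known external control quantity and $Q_k$ is the covariance of the external environmental disturbance. Equation (5) shows that the new optimal estimate is obtained by adding the known external control quantity to the prediction made from the previous optimal estimate, while the new uncertainty is obtained by adding the external environmental disturbance to the uncertainty propagated from the previous step.
To make the Kalman filter model work continuously, some parameters in the model need to be updated with each new measurement to ensure the real-time accuracy of video multi-target trajectory tracking:
$$\hat{x}'_k = \hat{x}_k + K (z_k - H_k \hat{x}_k), \qquad P'_k = P_k - K H_k P_k, \qquad K = P_k H_k^{T} (H_k P_k H_k^{T} + R_k)^{-1} \tag{6}$$
In the formula, $K$ is the Kalman gain, $z_k$ is the measurement, $H_k$ is the measurement matrix and $R_k$ is the measurement noise covariance. $\hat{x}'_k$ is the new optimal estimate, which, together with $P'_k$, can be iterated into the next prediction and update equation.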
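The predict and update cycle can be illustrated with a minimal constant-velocity filter for a single coordinate, for example the x-coordinate of a bounding box center. This is a sketch under stated assumptions, not the authors' implementation: the state is (position, velocity), only the position is measured, there is no control input, and the class name `Kalman1D` and the noise values `q`, `r` and `p0` are hypothetical. In a full tracker the same filter would be run for each tracked quantity (center coordinates and box size).

```python
class Kalman1D:
    """Constant-velocity Kalman filter for one coordinate (position, velocity)."""

    def __init__(self, q=0.01, r=1.0, p0=100.0):
        self.pos, self.vel = 0.0, 0.0
        # Covariance P stored as four scalars (row-major 2x2 matrix).
        self.p00, self.p01, self.p10, self.p11 = p0, 0.0, 0.0, p0
        self.q, self.r = q, r  # assumed process and measurement noise variances

    def predict(self, dt=1.0):
        """Prediction: x = F x and P = F P F^T + Q with F = [[1, dt], [0, 1]]."""
        self.pos += dt * self.vel
        p00 = self.p00 + dt * (self.p10 + self.p01) + dt * dt * self.p11
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        self.p00, self.p01, self.p10 = p00 + self.q, p01, p10
        self.p11 += self.q
        return self.pos

    def update(self, z):
        """Correction with a measured position z (measurement matrix H = [1, 0])."""
        s = self.p00 + self.r                    # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s      # Kalman gain
        residual = z - self.pos
        self.pos += k0 * residual
        self.vel += k1 * residual
        # P = (I - K H) P, computed from the pre-update covariance entries.
        p00, p01 = self.p00, self.p01
        self.p00, self.p01 = (1 - k0) * p00, (1 - k0) * p01
        self.p10, self.p11 = self.p10 - k1 * p00, self.p11 - k1 * p01
        return self.pos


# Synthetic, noiseless measurements: a center moving 2 pixels per frame.
kf = Kalman1D()
for k in range(1, 31):
    kf.predict()
    kf.update(2.0 * k)
next_pos = kf.predict()  # predicted position for frame 31
print(round(kf.vel, 2), round(next_pos, 1))
```

After a few frames the estimated velocity settles near the true 2 pixels per frame, so the prediction for the next frame can serve as the expected target position when the detector output is missing or unreliable.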

Design of hierarchical data association algorithm
In previous multi-target tracking algorithms, the main problems in data association are as follows. When a target is occluded, the uncertainty of the Kalman filter prediction increases, and when the tracked targets are highly similar in appearance, the predicted trajectory may not belong to the target that needs to be tracked. To avoid associating data from targets that do not need to be tracked, a hierarchical data association method is used. When the tracked object appears continuously in adjacent video frames with little appearance change and no occlusion, data association using only the appearance features of the tracked object can achieve good results. In general, however, the situation of the tracked object is not so simple, and it often faces appearance change and occlusion. To alleviate the occlusion and appearance change encountered in the tracking scene, this paper proposes a hierarchical data association method, which uses both the appearance features and the motion features of the target for association. The first layer uses the appearance feature to associate targets that are not occluded or have changed little in appearance: if the appearance similarity of the target meets the confidence value, the first-layer data association is performed. If the appearance similarity is lower than the confidence value, the appearance of the target has changed or occlusion has occurred in the tracking scene, and the second layer of association is adopted to judge and associate multiple targets using motion features [15]. The appearance feature similarity and motion feature similarity mentioned above are described by cosine similarity.

Similarity calculation of target appearance features
The similarity of target appearance features is expressed by Equation (7):
$$S_a(i,j) = \frac{f_i \cdot f_j}{\lVert f_i \rVert \, \lVert f_j \rVert} \tag{7}$$
where $f_i$ represents the appearance feature of the existing target, and $f_j$ represents the appearance feature of the target to be matched.
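Since the appearance similarity is described by cosine similarity, it can be computed directly from two feature vectors. The sketch below is a plain-Python illustration; the function name `cosine_similarity` and the short example vectors are assumptions, and real appearance features would be much longer embedding vectors produced by the detection network.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_u * norm_v)


existing = [0.2, 0.8, 0.1]    # hypothetical feature of an existing target
candidate = [0.25, 0.7, 0.2]  # hypothetical feature of a target to be matched
print(round(cosine_similarity(existing, candidate), 3))
```

Because the measure depends only on the angle between the vectors, scaling a feature vector does not change its similarity to another one.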

Calculation of motion feature similarity
The video frame rate is required to be high enough that the motion trajectory of the target is continuous and smooth. Considering the spatial information of the target, the speed and direction of the target movement are used to represent the motion features of the tracked target. However, the cosine similarity can only represent the direction consistency of the target. Therefore, a modified similarity that represents both the direction and the speed of the target is adopted, and the motion feature similarity is expressed by Equation (8).
EAI Endorsed Transactions on Scalable Information Systems 01 2023 -01 2023 | Volume 10 | Issue 2 | e9
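The limitation of plain cosine similarity, and one possible way to include speed, can be shown with a small example. The formula below is purely hypothetical and is not the paper's modified similarity: it multiplies the direction cosine by the ratio of the smaller speed to the larger one, so two velocities score 1 only when both direction and speed agree. The function name `motion_similarity` and the velocity tuples are likewise assumptions.

```python
import math


def motion_similarity(v_a, v_b):
    """Illustrative only: direction cosine scaled by a speed ratio.

    v_a and v_b are 2-D velocity vectors (dx, dy) in pixels per frame.
    """
    speed_a = math.hypot(v_a[0], v_a[1])
    speed_b = math.hypot(v_b[0], v_b[1])
    if speed_a == 0.0 or speed_b == 0.0:
        # Two stationary targets agree; a stationary and a moving one do not.
        return 1.0 if speed_a == speed_b else 0.0
    direction = (v_a[0] * v_b[0] + v_a[1] * v_b[1]) / (speed_a * speed_b)
    speed_ratio = min(speed_a, speed_b) / max(speed_a, speed_b)
    return direction * speed_ratio


# Same direction, but one target moves twice as fast as the other.
print(motion_similarity((1.0, 0.0), (2.0, 0.0)))
```

Plain cosine similarity would score the two velocities in the example as identical because they point the same way, whereas the scaled version penalizes the speed mismatch.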

Implementation of Hierarchical data Association
In the data association of the current video frame, the object detection boxes $D_t$ have been obtained by the detector, and the target trajectories $T_{t-1}$ of the previous frame are known. In the matching process, the detected candidate objects are matched with the target trajectories, and the trajectories $T_t$ of the current frame are obtained after the matching. The cost matrix is shown in Equation (9). In data association, mutual occlusion and the disappearance and reappearance of objects usually occur. To reduce such problems, a hierarchical data association method is proposed, and different similarity functions are used for association in the two layers [16]. The first layer only considers matches in which the appearance similarity of the target is higher than the confidence $T_a$. If the appearance similarity of the target is high, it means that the target has not been occluded during the movement.
In the proposed algorithm, in the similarity function of the first layer, the influence factor $w_1$, which represents the appearance of the target, is set to 1, and the influence factor $w_2$ of the motion is set to 0. However, when the appearance similarity of the target is low, the target is occluded or has otherwise changed in appearance during the movement. In this case, association matching cannot be carried out accurately by relying only on the appearance similarity, so the motion feature is introduced for discrimination. Experiments show that, in the second layer, setting the appearance influence factor $w_1$ to 0.4 and the motion influence factor $w_2$ to 0.6 achieves the best test results.
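The two-layer association can be sketched as follows. This is a minimal illustration under several assumptions rather than the authors' algorithm: the similarity matrices are taken as precomputed inputs (rows are existing trajectories, columns are detections), matching is done greedily by descending score instead of with an optimal assignment solver, and the thresholds `t_a` and `t_second` are hypothetical values. Only the weights (1 and 0 in the first layer, 0.4 and 0.6 in the second) come from the text.

```python
def greedy_match(scores, rows, cols, min_score):
    """Pair rows with columns one-to-one in order of decreasing score."""
    candidates = sorted(
        ((scores[r][c], r, c) for r in rows for c in cols if scores[r][c] >= min_score),
        reverse=True,
    )
    matches, used_rows, used_cols = [], set(), set()
    for _, r, c in candidates:
        if r not in used_rows and c not in used_cols:
            matches.append((r, c))
            used_rows.add(r)
            used_cols.add(c)
    return matches


def hierarchical_associate(appearance, motion, t_a=0.7, t_second=0.5):
    """Return (first_layer_matches, second_layer_matches) as (track, detection) pairs."""
    if not appearance or not appearance[0]:
        return [], []
    tracks, dets = range(len(appearance)), range(len(appearance[0]))
    # Layer 1: appearance only (w1 = 1, w2 = 0), accepted above the confidence t_a.
    first = greedy_match(appearance, tracks, dets, t_a)
    matched_tracks = {r for r, _ in first}
    matched_dets = {c for _, c in first}
    left_tracks = [r for r in tracks if r not in matched_tracks]
    left_dets = [c for c in dets if c not in matched_dets]
    # Layer 2: weighted appearance and motion (w1 = 0.4, w2 = 0.6) for the rest.
    combined = [
        [0.4 * appearance[r][c] + 0.6 * motion[r][c] for c in dets] for r in tracks
    ]
    second = greedy_match(combined, left_tracks, left_dets, t_second)
    return first, second


appearance = [[0.9, 0.2], [0.3, 0.5]]  # hypothetical appearance similarities
motion = [[0.8, 0.1], [0.2, 0.9]]      # hypothetical motion similarities
print(hierarchical_associate(appearance, motion))
```

In the example, trajectory 0 is matched in the first layer on appearance alone, while trajectory 1, whose appearance similarity is below the confidence value, is recovered in the second layer because its motion agrees with detection 1.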

Video multi-target trajectory anomaly correction algorithm
When partial occlusion occurs during multi-target trajectory tracking in motion video [17][18], often only the unoccluded part of a target can be obtained [19][20], and the trajectory fluctuates greatly from the beginning of the occlusion to its end. To address this, a new trajectory correction algorithm is proposed to avoid the trajectory deviation caused by partial occlusion of the target. The algorithm uses the property that the target box does not change suddenly during tracking to correct the box and the trajectory.
The specific process of the video multi-target trajectory anomaly correction algorithm is shown as follows: The numbers of average height and width of the targets  
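The idea that a target box should not change suddenly can be illustrated with a simple rule based on the average width and height over recent frames. The sketch below is a hypothetical stand-in, not the authors' procedure: the function name `correct_box`, the 30% change threshold and the choice to fall back on a predicted center (for example from the Kalman filter) are all assumptions.

```python
def correct_box(history, measured, predicted_center, max_change=0.3):
    """Replace a box whose size jumps abruptly relative to recent frames.

    history: list of (width, height) pairs from recent frames.
    measured: (cx, cy, width, height) of the current detection.
    predicted_center: (cx, cy) predicted for the current frame.
    """
    if not history:
        return measured
    avg_w = sum(w for w, _ in history) / len(history)
    avg_h = sum(h for _, h in history) / len(history)
    _, _, w, h = measured
    abrupt = (
        abs(w - avg_w) > max_change * avg_w or abs(h - avg_h) > max_change * avg_h
    )
    if not abrupt:
        return measured
    # Likely partial occlusion: keep the average size and the predicted center.
    px, py = predicted_center
    return (px, py, avg_w, avg_h)


recent = [(40, 100), (42, 98), (38, 102)]  # hypothetical player box sizes
print(correct_box(recent, (205, 320, 40, 55), (201, 302)))
```

In the example the detected height drops from about 100 to 55 pixels, which the rule treats as a partial occlusion, so the box is restored to the average size at the predicted center instead of following the shrunken detection.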

Experimental Results
To verify the effect of the proposed multi-object trajectory tracking method on actual basketball video, footage of a CBA tournament is collected as the experimental data set, and one CBA game is selected as the application video. The proposed method is used to track multiple targets in the application video, and the target recognition and trajectory tracking results are as follows. As can be seen from Figure 2, the method proposed in this paper is used to carry out multi-target recognition for players in white No. 12

Analysis of trajectory tracking effect
In analyzing the trajectory tracking effect of the proposed method, the trajectory correction results and the trajectory tracking error are examined, and the results are as follows.

Trajectory correction results
The proposed method is used to correct the abnormal trajectory in the process of target trajectory tracking, and the results are shown in Figure 3. Figure 3 shows that the abnormal target trajectory, after correction by the proposed method, is closer to the actual trajectory. The method uses the Kalman filter to predict the next position and bounding box size from the current basketball target position and obtains the trajectory prediction result of the basketball target. It then completes the tracking of multiple targets of the same category based on similarity and corrects both the target occlusion problem in the tracking process and the abnormal trajectory tracking caused by occlusion, so its correction of abnormal target trajectories is effective [21,22].

Trajectory tracking error analysis
In the application video, player No. 13 in black is taken as an example, and the target trajectory tracking error of the proposed method is compared under occluded and non-occluded conditions. The results are shown in Figure 4. Figure 4 shows that when the target is occluded, the trajectory tracking error of the proposed method ranges from 0.15 cm to 0.45 cm, with a mean of about 0.30 cm. When the target is not occluded, the trajectory tracking error ranges from 0.05 cm to 0.25 cm, with a mean below 0.15 cm. These data indicate that the proposed method can track moving targets accurately and that abnormal trajectory correction significantly improves the tracking accuracy.

Operation time
Taking the 100-frame application video used in the above experiment as an example, the size of the tracking target area is set to 32×32 pixels. Figure 5 shows the operation time for each frame in the multi-target tracking process of the proposed method. According to Figure 5, the average computation time per frame is about 96 ms, which meets the requirements of conventional multi-target trajectory tracking in video images and indicates that the proposed method has good real-time tracking performance.

Comparison of center position error and coverage
Center position error and coverage describe the center deviation and the overlap ratio between the tracking box and the actual target box, respectively. To quantitatively evaluate the trajectory tracking effect of the proposed method, these two indexes are taken as evaluation criteria, and the methods of References [5] and [6] are taken as comparison methods. The results are shown in Table 1 and Table 2.
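Both indexes can be computed from a tracking box and a ground-truth box. The sketch below assumes boxes in (x_min, y_min, x_max, y_max) form, measures the center position error as the Euclidean distance between box centers, and takes coverage to be intersection over union. That last choice is an assumption, since the overlap ratio is sometimes defined relative to the ground-truth area instead; the function names are also hypothetical.

```python
import math


def center_error(box_a, box_b):
    """Euclidean distance between the centers of two boxes."""
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    return math.hypot(ax - bx, ay - by)


def coverage(box_a, box_b):
    """Overlap ratio of two boxes, computed here as intersection over union."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


tracked = (0, 0, 10, 10)  # hypothetical tracking box
truth = (5, 0, 15, 10)    # hypothetical ground-truth box
print(center_error(tracked, truth), round(coverage(tracked, truth), 3))
```

A lower center position error and a higher coverage both indicate that the tracking box follows the actual target more closely.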
Table 1 and Table 2 show that, in multi-target trajectory tracking, the center position error and coverage obtained by the proposed method are significantly better than those of the two comparison algorithms, which indicates that the proposed method has the best tracking effect among these methods. In summary, the proposed method reduces, to a certain extent, the interference that occlusion causes to the accuracy of trajectory tracking and basketball moving target recognition, has high real-time performance, and has application value in the field of moving video target tracking.

Conclusion
This paper studies a deep learning-based multi-target trajectory tracking method for basketball video. High-performance detectors are used to detect multiple targets of the same type in basketball video, with a focus on the relationship of the same target between preceding and following frames and on occlusion problems in the tracking process. After experimental verification, the following conclusions are drawn: (1) The proposed method reduces, to a certain extent, the interference that occlusion causes to the accuracy of trajectory tracking and basketball target recognition, and improves the anti-interference capability and recognition accuracy for the target.
(2) The proposed method has good real-time performance and can meet the requirements of multitarget trajectory tracking in conventional video images.
(3) The proposed method has good calculation results of the target center position error and coverage rate in the process of basketball target tracking, indicating that the tracking effect of the method in this paper is good.
However, the current research is still limited to targets of the same type. Future research will focus on the multi-target tracking problem for different types of targets and will use feature information to improve the existing inter-frame relationship.