Enhancing Real-time Object Detection with YOLO Algorithm

This paper presents YOLO (You Only Look Once), a widely used approach to real-time object detection. Real-time detection plays a significant role in domains such as video surveillance, computer vision, autonomous driving, and robotics. The YOLO algorithm has emerged as a popular and well-structured solution for real-time object detection because it detects objects in a single pass through a neural network. This article aims to give a thorough account of the YOLO algorithm, its architecture, and its impact on real-time object detection. YOLO frames object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. Tasks such as recognition, detection, and localization find wide applicability in real-world scenarios, which makes object detection a crucial branch of computer vision. The algorithm detects objects in real time using convolutional neural networks (CNNs). Overall, this paper serves as a comprehensive guide to real-time object detection with YOLO. By examining the architecture, variations, and implementation details, the reader can gain an understanding of YOLO's capabilities.


Introduction
A person glancing at an image immediately knows what objects it contains and how they interact with each other. Real-time object detection refers to the ability to detect and localize objects in a continuous stream of data. Its background lies in the growing need for efficient and accurate analysis of visual data in real-world scenarios. Object detection is a crucial task in many fields, including robot navigation, augmented reality, automation, and medical diagnosis [1]. Neural networks play a similar role to human vision by detecting and differentiating objects in images. For complex and unpredictable scenes, current detection methods are built on deep learning: convolutional neural networks (CNN), SPP-net, Fast R-CNN, Faster R-CNN, Region-based Fully Convolutional Networks (R-FCN), Feature Pyramid Networks (FPN), and the YOLO algorithm show clear advantages over traditional methods. Among these, FPN, R-FCN, and YOLO are reported to be more convenient and effective than earlier methods [2]. You Only Look Once is a fast and reliable object detection method with high accuracy and good training performance, and it has continued to improve through its successive versions since it was introduced.

Unified Detection
Unified detection in YOLO is based on dividing the input image into a grid, where every grid cell predicts a fixed number of bounding boxes along with the corresponding class probabilities. These bounding boxes are responsible for detecting objects that fall within the grid cell. YOLO processes the entire image with a single neural network and makes predictions for all objects simultaneously. It unifies the separate components of object detection into one network. To predict each bounding box, the network uses features from the entire image [3]. Key components of YOLO's unified detection approach include grid division, anchor boxes, prediction generation, and non-maximum suppression (NMS), which eliminates redundant bounding boxes. First, the image containing the object is divided into an S x S grid. The confidence score reflects how confident the model is that a box contains an object and how accurate it believes the predicted box to be [4]. YOLO is a single-stage object detection model [5]. A single neural network predicts bounding boxes and class probabilities directly from the full image in one evaluation. Once an image is received, YOLO processes it and detects the objects in it using the respective libraries. It has difficulty detecting small objects that appear in groups. Object detection algorithms must not only predict the object class and location accurately but also be fast enough to meet the demands of real-time video processing.
YOLOv2 removes the fully connected layers, introduces anchor boxes for bounding box prediction, and adds features such as multi-scale training and a higher-resolution classifier. In general, earlier object detection models performed poorly and with low precision on tiny objects. Rather than predicting bounding box coordinates directly with fully connected layers, as the original YOLO does, YOLOv2 predicts boxes relative to anchor boxes [6]. As shown in Figure 3, O-YOLO-v2 (Optimized YOLOv2), a deep learning model based on YOLOv2, addresses the low performance and precision on tiny objects [7].
To improve detection further, mainly for smaller objects, a multi-scale detection algorithm built on YOLOv3 was introduced [8]. YOLOv3 performs well across a wide range of input resolutions and improved on its predecessors with a few key enhancements. It uses a variant of the Darknet architecture called Darknet-53, which improves feature extraction. YOLOv3 scored 37 mAP at an input resolution of 608x608 on the COCO-2017 validation set [9]. Tiny YOLOv3 is a newer variant of the algorithm with a comparatively smaller model size, intended for constrained environments. On slow computational machines, its real-time performance comes at the cost of detection accuracy [10]. Figure 4 shows the detection of fruits using the YOLOv3 algorithm.
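The grid division, box overlap, and non-maximum suppression steps described in this section can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by any YOLO release: the function names (iou, responsible_cell, nms), the corner-coordinate box format, and the 0.5 overlap threshold are assumptions made for the example.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def responsible_cell(cx: float, cy: float, img_w: float, img_h: float, s: int) -> Tuple[int, int]:
    """(row, col) of the S x S grid cell that contains the box centre (cx, cy)."""
    col = min(int(cx / img_w * s), s - 1)
    row = min(int(cy / img_h * s), s - 1)
    return row, col


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))                      # indices of the boxes that survive
    print(responsible_cell(224, 224, 448, 448, 7))  # grid cell holding the image centre
```

In this example the second box overlaps the first heavily and has a lower score, so it is suppressed, while the third box is far away and is kept.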

Fast YOLO: A fast You Only Look Once system for real-time embedded object detection in video
Object detection is among the most challenging tasks in computer vision because it involves both image classification, which assigns a class to each object, and localization, which determines where the object is in the image. Deep neural networks (DNNs) have been shown to achieve better object detection performance than other approaches, and YOLOv2 is one of the state-of-the-art DNN-based object detection techniques in terms of both accuracy and processing speed. Although YOLOv2 can achieve real-time performance on a powerful graphics processing unit, it remains very challenging to use this method for real-time object detection in video on embedded devices with limited memory and computational power. Fast YOLO (Fast You Only Look Once) is a proposed framework that accelerates YOLOv2 so that object detection in video can run in real time on embedded devices [12]. A single convolutional network simultaneously predicts multiple bounding boxes and the class probabilities for those boxes [13]. With no batch processing on a Titan X GPU, the base network runs at 45 frames per second (fps) and the faster version runs at more than 150 fps, so streaming video can be processed in real time with less than 25 ms of latency [14].
YOLO reasons globally about the image when making predictions. Unlike sliding-window and region-proposal-based methods, YOLO sees the entire image during training and at test time, so it implicitly encodes contextual information about classes as well as their appearance. YOLO also learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms leading detection methods such as DPM and R-CNN by a wide margin [15]. Figure 5 visualizes the detection of different objects. YOLO-SA is an improved version of the one-stage detection model YOLOv4 [16]. Table 1 presents a comparative analysis of all YOLO versions by year, improvements, and total number of layers in the network.
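The frame rates quoted above translate directly into a per-frame time budget. The short sketch below does that arithmetic; the helper name frame_time_ms is hypothetical, and the result is only the average compute time per frame, ignoring capture, transfer, and display delays.

```python
def frame_time_ms(fps: float) -> float:
    """Average time available per frame, in milliseconds, at a given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


if __name__ == "__main__":
    # Frame rates taken from the text: base network and faster version.
    for name, fps in [("base network", 45), ("faster version", 150)]:
        print(f"{name}: {fps} fps -> {frame_time_ms(fps):.1f} ms per frame")
```

At 45 fps each frame takes roughly 22 ms, which is consistent with the stated latency of under 25 ms.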

Network Architecture
Network design for object detection has grown rapidly in the deep learning era. The YOLO network architecture consists mainly of three types of layers: convolutional layers, max-pooling layers, and fully connected layers. The YOLO network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, it uses 1 × 1 reduction layers followed by 3 × 3 convolutional layers. Fast YOLO uses a neural network with 9 convolutional layers instead of 24 and fewer filters in those layers. Apart from the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO [17]. Figure 6 presents the network architecture of the YOLO algorithm. The model is optimized for sum-squared error in its output. The main advantage of sum-squared error is that it is easy to optimize, even though it does not align perfectly with the goal of maximizing average precision. It weights localization error equally with classification error, which may not be ideal. In addition, in every image many grid cells do not contain any object, so the confidence scores of those empty cells are pushed towards zero. This can cause model instability [18].
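The weighting problem described above can be made concrete with a small sketch of a sum-squared loss for one grid cell. The original YOLO paper counters the imbalance with two weights, 5 for coordinate errors and 0.5 for the confidence error of empty cells, and compares square roots of width and height; those values are used here. Everything else is a simplifying assumption for illustration: one predicted box per cell, dictionary inputs, and the function name cell_loss.

```python
import math

LAMBDA_COORD = 5.0  # weight on localization error (value from the original YOLO paper)
LAMBDA_NOOBJ = 0.5  # weight on confidence error of empty cells (same source)


def cell_loss(pred: dict, target: dict) -> float:
    """Sum-squared loss for one grid cell with a single predicted box.

    Assumed keys: x, y, w, h, conf, classes (list of floats); the target also
    carries has_object, which says whether an object centre falls in the cell.
    Widths and heights are assumed to be non-negative.
    """
    conf_err = (pred["conf"] - target["conf"]) ** 2
    if not target["has_object"]:
        # Empty cell: only the down-weighted confidence error contributes.
        return LAMBDA_NOOBJ * conf_err
    centre_err = (pred["x"] - target["x"]) ** 2 + (pred["y"] - target["y"]) ** 2
    size_err = (math.sqrt(pred["w"]) - math.sqrt(target["w"])) ** 2 \
        + (math.sqrt(pred["h"]) - math.sqrt(target["h"])) ** 2
    class_err = sum((p - t) ** 2 for p, t in zip(pred["classes"], target["classes"]))
    return LAMBDA_COORD * (centre_err + size_err) + conf_err + class_err


if __name__ == "__main__":
    target = {"x": 0.5, "y": 0.5, "w": 0.25, "h": 0.25, "conf": 1.0,
              "classes": [1.0, 0.0], "has_object": True}
    pred = dict(target, x=0.6)  # prediction that is off by 0.1 in x only
    print(cell_loss(pred, target))
```

With these weights, a coordinate error in a cell that contains an object costs ten times as much as the same squared error in the confidence of an empty cell, which keeps the many empty cells from dominating the gradient.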
Error analysis comparing YOLO with Fast R-CNN shows that localization errors account for a large share of YOLO's errors.
Batch normalization is used to regularize the model. With batch normalization, dropout can be removed from the model without overfitting. Anchor boxes are predefined bounding boxes with fixed widths and heights that are used with the convolutional layers. The original YOLO predicts bounding box coordinates directly, using fully connected layers on top of the convolutional feature extractor. Using anchor boxes causes a small decrease in accuracy. YOLO predicts only 98 boxes per image, but with anchor boxes the model predicts more than a thousand. Without anchor boxes, the base model achieves 69.5 mAP with a recall of 81% [19]. With anchor boxes, the model achieves 69.2 mAP with a recall of 88% [20]. mAP thus decreases slightly while recall increases, which leaves more room to improve the model. The loss from bounding box coordinate predictions is increased, and the loss from confidence predictions for boxes that do not contain objects is decreased. In this way the focus is on improving localization and recall while maintaining classification accuracy. Adding batch normalization to every convolutional layer in YOLO gives an improvement of about 2% in mAP. Figure 6 visualizes the network architecture of the YOLO algorithm.
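The figure of 98 boxes per image follows from the grid arithmetic of the original YOLO, which uses a 7 x 7 grid with 2 boxes per cell. The sketch below reproduces that count and the matching output depth for 20 classes. The helper names are hypothetical, and the 13 x 13 grid with 9 anchors is an assumed configuration chosen only to show how an anchor-based model passes a thousand boxes.

```python
def boxes_per_image(grid_size: int, boxes_per_cell: int) -> int:
    """Total number of boxes predicted for one image on a square grid."""
    return grid_size * grid_size * boxes_per_cell


def output_depth(boxes_per_cell: int, num_classes: int) -> int:
    """Values per grid cell in the original YOLO layout: each box carries
    (x, y, w, h, confidence), and class probabilities are shared per cell."""
    return boxes_per_cell * 5 + num_classes


if __name__ == "__main__":
    # Original YOLO: 7 x 7 grid, 2 boxes per cell, 20 classes.
    print(boxes_per_image(7, 2), output_depth(2, 20))
    # Assumed anchor-based configuration, for illustration only.
    print(boxes_per_image(13, 9))
```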

Literature Survey
To detect objects, YOLO requires only a single forward propagation pass through a neural network. Frame differencing uses the pixel-wise difference between two frames to detect objects, while background subtraction detects moving regions using a Gaussian mixture model of the background [22]. Stauffer and Grimson proposed a background model for identifying objects [23]. Liu et al. proposed background subtraction for detecting moving objects, in which the pixel-by-pixel difference between the current image and a reference background image is computed [24]. Sungandi et al. introduced the detection of objects in low-resolution images using frame differences [25]. Jacques et al. proposed a new method for shadow detection in video sequences together with a background model [26]. The authors of YOLO framed object detection as a regression problem rather than a classification problem [27], which allows YOLO to run much faster while remaining accurate. It also generalizes well to artwork. Figure 7 shows the flow chart of the process.
Ghosh et al. (2023) carried out a comprehensive study to assess water quality through predictive machine learning. Their research underscored the potential of machine learning models for assessing and classifying water quality. The dataset included parameters such as pH, dissolved oxygen, BOD, and TDS. Among the models they employed, the Random Forest model was the most accurate, with an accuracy of 78.96%, whereas the SVM model registered the lowest accuracy, 68.29% [33]. Alenezi et al. (2021) developed a novel Convolutional Neural Network (CNN) integrated with a block-greedy algorithm to enhance underwater image dehazing. The method addresses color channel attenuation and optimizes local and global pixel values. By employing a unique Markov random field, the approach refines image edges. Performance evaluations using metrics such as UCIQE and UIQM demonstrated the superiority of this method over existing techniques, resulting in sharper, clearer, and more colorful underwater images [34].
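The frame differencing idea surveyed at the start of this section, a pixel-by-pixel difference between two frames, can be illustrated with a minimal sketch. It is not taken from any of the cited works: the nested-list grayscale frame format, the function name frame_difference_mask, and the threshold of 25 intensity levels are assumptions made for the example.

```python
from typing import List

Frame = List[List[int]]  # assumed format: rows of grayscale intensities, 0 to 255


def frame_difference_mask(prev: Frame, curr: Frame, threshold: int = 25) -> List[List[bool]]:
    """Mark a pixel as moving when its intensity changed by more than threshold."""
    if len(prev) != len(curr) or any(len(p) != len(c) for p, c in zip(prev, curr)):
        raise ValueError("frames must have the same dimensions")
    return [
        [abs(c - p) > threshold for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


if __name__ == "__main__":
    previous = [[10, 10], [10, 10]]
    current = [[10, 200], [12, 10]]
    print(frame_difference_mask(previous, current))
```

Only the pixel whose intensity jumps from 10 to 200 is marked as moving; the small change from 10 to 12 stays below the threshold and is treated as noise.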

The YOLO object detection algorithm is important for the following reasons:
Speed: Because YOLO predicts objects in real time, it improves detection speed. Compared with other algorithms, YOLO runs much faster, at 45 frames per second. Another main difference is that YOLO sees the complete image at once, which earlier methods did not [28]. The image is passed through the CNN only once at run time. All training and testing parameters are the same between Fast YOLO and YOLO [29].

High accuracy: YOLO has a high prediction capacity and gives accurate results with few background errors.
Several heuristics can increase YOLO's accuracy, such as a cosine learning rate scheduler, data augmentation, synchronized batch normalization, and image mix-up [30]. A higher input resolution improves accuracy but slows down inference and training. A higher resolution can also help the model detect small objects.
Learning capabilities: YOLO has a high learning capability, which allows it to learn the patterns of objects and apply them during detection. YOLO performs object detection by dividing an image into N grid cells of equal dimensions, arranged as an S x S grid. Trained on the COCO (Common Objects in Context) dataset, the algorithm can detect the 80 COCO object classes, such as bus, person, car, bicycle, motorbike, aeroplane, truck, train, and boat [31].
High-resolution classifier: The original YOLO pre-trains its classifier network on 224×224 pixel images and then increases the resolution to 448×448 pixels for detection. When switching from the classification model to the detection model, the network must therefore also adjust to the new input resolution. YOLOv2 splits pre-training into two steps: the network is first trained at 224×224 pixels and then at 448×448 pixels [32].

Multi-scale training:
To detect objects in images of different resolutions, the model is trained on inputs of different sizes with the same network. Training is fast when the input is small and slow when the input is large. Multi-scale training can also improve accuracy, giving a good balance between speed and accuracy.
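Training on inputs of different resolutions, as described above, can be sketched as a sampler that changes the input size at regular intervals. The YOLOv2 paper draws a new size every 10 batches from multiples of 32 between 320 and 608; the sketch uses those values, while the function names and the use of Python's random module are assumptions for illustration.

```python
import random
from typing import List


def candidate_sizes(min_size: int = 320, max_size: int = 608, stride: int = 32) -> List[int]:
    """Square input sizes that are multiples of the network stride."""
    return list(range(min_size, max_size + 1, stride))


def pick_input_size(batch_index: int, current_size: int, rng: random.Random, every: int = 10) -> int:
    """Draw a new input size every `every` batches, otherwise keep the current one."""
    if batch_index % every == 0:
        return rng.choice(candidate_sizes())
    return current_size


if __name__ == "__main__":
    rng = random.Random(0)
    size = 416
    for batch in range(30):
        size = pick_input_size(batch, size, rng)
        if batch % 10 == 0:
            print(f"batch {batch}: training at {size}x{size}")
```

Smaller sizes in the list give faster passes and larger sizes give better accuracy, so a single trained network can later be run at whichever resolution suits the speed and accuracy requirements.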

Yearly Trends
This section organizes the publication data to show the yearly growth of the YOLO versions. Table 2 gives the number of research papers for each version of YOLO: YOLOv1, YOLOv2, YOLOv3, YOLOv4, YOLOv5, YOLOv6, YOLOv7, and YOLOv8. The breakdown shows that the number of publications increased slowly in 2020 and 2021. Apart from YOLOv3, the original YOLO and YOLOv2 have attracted the most researchers because of their properties; the time since release is a separate factor. The counts for YOLOv5, v6, v7, and v8 are low because these versions are recent, and they are expected to grow in the coming years.

Future Scope
Real-time object detection is a core capability required by most robots and computer vision systems. Building on early research in this area, it is making great progress and producing results in many directions. Object detection with the YOLO algorithm is still little used in many areas where it could be of great help, and this could improve in future. YOLO object detection in images has received a lot of attention in pattern recognition and computer vision in recent years. These techniques are still maturing and could relieve people of routine jobs, which systems and machines would perform more precisely. Key areas of future exploration are improved accuracy, handling of complex scenes, multi-object tracking, and domain-specific object detection.
Figure 1 represents the grid division and the cell parameters, and Figure 2 represents the image and object classification of a single-object image.
(a) How it works when there are multiple objects.

Figure 1. 4 x 4 x 7 volume (a 4 x 4 grid of 16 cells, each with a vector of size 7)
(b) Work progress when there is a single object.

Figure 2. Image of a single object and its classification
(c) Brief about 3 main versions and improvements of all versions:

Figure 7. Flow chart of the object detection model
Figure 8 gives a graphical view of Table 2. Table 3 presents a comparative analysis of object detection algorithms, including the year of invention, the novelty of each algorithm, and recent searches for the algorithms.

Figure 8. Graphical representation of yearly publication data

[13] George, Jose, Shibon Skaria, and V. V. Varun. "Using YOLO based deep learning network for real time detection and localization of lung nodules from low dose CT scans." Medical Imaging 2018: Computer-Aided Diagnosis. Vol. 10575. SPIE, 2018.
madJavad, et al. "Fast YOLO: A fast you only look once system for real-time embedded object detection in video." arXiv preprint arXiv:1709.05943 (2017).

Table 1. Comparison of YOLO version improvements

Table 2. Yearly trends of publication data

Table 3. Comparative analysis of different object detection techniques