Personalized recognition system in online shopping using deep learning

This study presents an effective monitoring system that tracks the buying experience across multiple shop interactions by refining information derived from physiological data and facial expressions. The system's efficacy in recognizing consumers' emotions, and in avoiding bias based on age, race, and gender, was evaluated in a pilot study. The system's data has been compared with the outcomes of conventional video analysis. The study's conclusions indicate that the suggested approach can aid in the analysis of consumer experience in a store setting.


Introduction
Product identification aims to make it easier for retailers to keep track of their stock and provide a better experience for customers. Today, barcode [1] detection is the standard method for automated product authentication in many fields, including academia. By reading barcodes printed on retail packaging, product tracking and inventory control are greatly simplified. You can find a barcode on pretty much anything you buy. However, because the barcode's printing location is unpredictable, the cashier commonly has to locate the barcode and physically help the computer recognize it. Digimarc [2] found that 45 per cent of consumers said they avoided barcode scanners because of the hassle they posed. Since the advent of ubiquitous computing, RFID (radio frequency identification) [3] has been implemented in commercial settings to facilitate the computerized tracking of merchandise: a technology that automatically transmits data and information using radio frequency impulses. Every item is tagged with RFID technology, and every identifier is assigned a unique number that can be used to identify the tracked item via radio waves. Any RFID reader can read tag data without a clear line of sight, unlike barcodes, which require one. There is no doubt, however, that RFID has drawbacks. There is still considerable room for error when trying to identify numerous goods simultaneously, because radio signals can be obstructed or can interfere with one another. In addition to increasing sales expenses and raising environmental concerns, RFID stickers are costly and challenging to discard [4].
Customers may now scan and pay for their items through self-checkout machines and other forms of self-service technology (SST) without interacting with the shop staff [11]. By eliminating the need for customers to wait in lengthy checkout lines, SST has been shown to improve customer happiness and convenience [1,2]. SST, which uses barcode scanning and self-service payment equipment, is mainly deployed in major retail centres [12]. Indeed, checkout capacity is theoretically increased at peak moments [1] when customers take their baskets to self-checkout terminals to pay for and collect their purchases. Current self-checkout technologies are already widely used in ordinary grocery shops, making them a good fit for the digital transformation of these businesses [12]. However, they are not well suited to a quick-service setting. Unlike frequent shoppers at supermarkets, customers in convenience shops often buy only a few goods. Convenience shops in China are usually found in dense urban areas with premium space and high rents [13,14]. Checkout lines in such stores tend to get quite long in the early morning and late evening.
The quality of the customer's shopping experience is a significant factor in the overall profitability of a retail business. It may even affect customers' propensity to buy [1], their level of happiness [2], and their level of devotion [3]. A shopper's delight and inclination to make a purchase are influenced by how actively they are involved in the buying process and by the unique and thrilling events created for them [4].
The supermarket's efficiency, however, often falls short of customer expectations. Customers often have a terrible time when they go grocery shopping, such as when they have to wait in a long queue at the checkout counter, which, according to studies, may be due to the lengthy process of scanning barcodes [4][5][6][7]. All items must be removed from the bag and checked individually to complete the checkout procedure. It is tedious and time-consuming, and buyers sometimes feel overwhelmed by the sheer number of products available, when retailers could instead send the latest, most relevant information about those products directly to the people most likely to buy them. However, the existing channels for doing so are limited (such as paid news subscriptions). As a result, we anticipate more user-friendly manuals and aids (such as apps) [8][9][10].

Literature review
Since the advent of mass media, the advertising industry has relied heavily on personalization to reach its target audience [12,13,14]. Machine learning-based methods have now been included in ad personalization development.
Modern recommender systems and AI technologies are utilised to find new customers [6]. Parsable information, such as textual data, is analysed by Mooney and Roy [15] to determine the similarity between customer and product content profiles. Text mining of user-generated material on social networking sites (SNS), especially Twitter, has been widely employed for audience prediction and segmentation [10], blogger interest discovery [16], and content-based recommender systems for SNS users with appropriate interests and preferences [17]. Online content keyword extraction may uncover relevant language patterns [18], and machine learning classifiers can identify demographic information such as age, gender, and personality type [19] when consumer data is unavailable.
Using social interaction data, a data-mining framework [20] is built to create a tailored advertising system. Consumer preferences may be inferred from gathered behavioural data such as likes, comments, searches, and tweets [21]. Users' interests can be predicted using collaborative filtering by comparing their profiles and the similarity of their preferences over rated items [22]. In addition, similar users can be identified using similarity measurement methods such as the Pearson correlation coefficient [23]. Furthermore, users' preferences can be identified using graph-based approaches and purchase-time information [24].
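As a concrete illustration of the collaborative-filtering idea in [22,23], the sketch below predicts a missing rating from the ratings of Pearson-similar users. The rating matrix and function names are hypothetical, invented here for illustration, not taken from the cited works.

```python
import numpy as np

def pearson_similarity(u, v):
    """Pearson correlation between two users' rating vectors,
    computed only over the items both have rated (0 = unrated)."""
    mask = (u > 0) & (v > 0)
    if mask.sum() < 2:
        return 0.0
    u_c = u[mask] - u[mask].mean()
    v_c = v[mask] - v[mask].mean()
    denom = np.sqrt((u_c ** 2).sum() * (v_c ** 2).sum())
    return float(u_c @ v_c / denom) if denom else 0.0

def predict_rating(ratings, user, item):
    """Predict a missing rating as a similarity-weighted average of
    the ratings that other users gave to `item`."""
    sims = np.array([pearson_similarity(ratings[user], ratings[v])
                     for v in range(len(ratings))])
    sims[user] = 0.0                       # exclude the user themselves
    rated = ratings[:, item] > 0
    weights = sims * rated
    if np.abs(weights).sum() == 0:         # no usable neighbour: fall back
        return float(ratings[user][ratings[user] > 0].mean())
    return float(weights @ ratings[:, item] / np.abs(weights).sum())

# Toy user-item rating matrix (rows: users, cols: items, 0 = unrated).
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 4, 4]], dtype=float)
print(round(predict_rating(R, user=1, item=1), 2))
```

Note the correlation is computed only over co-rated items; with very sparse data, a shrinkage term is usually added, which this sketch omits.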
Categorizing user interests is crucial to personalized advertising, since it gives valuable insight into user behaviour that advertisers and marketers can utilize. Several studies [25,26,27,28,29] have attempted to categorize users' interests. Social networking site research [25] uses Word2vec and a support vector machine (SVM) classifier to organise users' interactive communication through comments into a grading system. A weighted ensemble approach was presented to categorize a user's feelings into multi-label binary classes via analysis of user-generated material [26]. This model produces a classification result without hyperparameter tweaking or overfitting. Online shoppers read many remarks from other customers before deciding whether or not to buy a product. To alleviate the burden of sifting through so many comments every day, [27] sorts user comments into categories using topic modelling and classification methods. Another study employs deep learning to anticipate user interest in and reaction to online advertising shown to the user [28]. This is done by correlating the ad's content with the user's interests. Intent segmentation in email is used to categorize messages for the benefit of their recipients; one such study proposes a method for detecting harmful spam emails [29].
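To make an embedding-plus-classifier pipeline such as [25] concrete, here is a minimal sketch. The toy two-dimensional word vectors are invented for illustration, and a nearest-centroid rule stands in for the SVM step; a real system would use trained Word2vec vectors and a proper SVM.

```python
import numpy as np

# Toy 2-d word vectors standing in for trained Word2vec embeddings.
WORD_VECS = {
    "football": np.array([1.0, 0.1]), "goal":   np.array([0.9, 0.2]),
    "match":    np.array([0.8, 0.0]), "paint":  np.array([0.1, 1.0]),
    "gallery":  np.array([0.0, 0.9]), "sketch": np.array([0.2, 0.8]),
}

def comment_vector(comment):
    """Average the embeddings of known words into one comment vector."""
    vecs = [WORD_VECS[w] for w in comment.lower().split() if w in WORD_VECS]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

def train_centroids(labelled_comments):
    """One mean vector per interest class (a linear stand-in for the SVM)."""
    centroids = {}
    for label in {lab for _, lab in labelled_comments}:
        vs = [comment_vector(c) for c, lab in labelled_comments if lab == label]
        centroids[label] = np.mean(vs, axis=0)
    return centroids

def classify(comment, centroids):
    """Assign the interest class whose centroid is nearest."""
    v = comment_vector(comment)
    return min(centroids, key=lambda lab: np.linalg.norm(v - centroids[lab]))

train = [("football goal", "sports"), ("match goal", "sports"),
         ("paint gallery", "art"), ("sketch paint", "art")]
cents = train_centroids(train)
print(classify("a gallery sketch", cents))
```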
However, there is a shortage of studies that use social media images or attempt to categorise the interests of SNS users by examining their profile pictures. Image-based research has evaluated users' emotions and found that most previous studies only considered the viewer's perspective, indicating a lack of image-based data for emotional classification. In light of this, one study utilised SNS users' post pictures and hashtags to categorise their feelings [30]. Since the influence of SNS platforms on cybersecurity is substantial, another study analysed images shared on SNS and categorised the SNS platforms to which the images were shared [31].
The Inception-ResNet v3 model [32] combines the traditional image-classification model ResNet with Google's Inception architecture. Multiple model combinations have been investigated to boost image-classification accuracy. Unfortunately, performance suffers because of the increased size brought on by merging various models. To address this, a simpler input model is provided and integrated with ResNet in the Inception-ResNet v3 models, yielding impressive results.
Regarding picture categorization on mobile devices with limited resources, a deep learning model called MobileNet v3 [33] excels. This is achieved by drastically cutting down on existing models' computational effort and complexity. MobileNet v1 presented a model that radically reduces computation and parameter count by introducing depthwise separable convolution. In MobileNet v2, the model structure was modified with an inverted residual structure, using less computing power and thereby boosting performance. While internally similar to Google's Inception v2 model, Inception v3 [34] is made the more capable model by several different training techniques. It uses far fewer parameters than VGG and AlexNet, which were in wide use at the time of Inception v3's publication, and was proposed to compensate for the shortcomings of models that were not widely adopted because they required a large number of parameters.
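The parameter savings from the depthwise separable convolution mentioned above can be verified with a little arithmetic; the layer sizes below (64 input channels, 128 output channels, 3x3 kernel) are an illustrative assumption, not taken from the MobileNet papers.

```python
def standard_conv_params(c_in, c_out, k):
    """Weight count of a standard k x k convolution."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """One depthwise k x k filter per input channel, followed by a
    1 x 1 pointwise convolution to mix channels (MobileNet v1 style)."""
    return c_in * k * k + c_in * c_out

std = standard_conv_params(64, 128, 3)
sep = depthwise_separable_params(64, 128, 3)
print(std, sep, round(std / sep, 1))  # 73728 8768 8.4
```

For this layer the separable form needs roughly 8x fewer weights, which is where MobileNet's efficiency comes from.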
ResNet [35] introduces a residual layer, while most other proposed image-classification models stack deep layers and perform well across multiple tasks. Skip connections that bypass layers minimise calculations and boost speed, especially for models with deep layers. ResNet v2 offers techniques for minimising computation and maximising performance in deep-layer models using a bottleneck structure with 1x1 convolutions. The EfficientNet [36] model was proposed at a time when researchers were less enthusiastic about studying image classification. Its creators recommended changing the model's width, depth, and resolution to change the model's overall size: depth scaling changes the number of layers, width scaling changes the number of channels, and resolution scaling changes the input image size. It is unusual to consider width, depth, and resolution scaling all at once, as EfficientNet does, and it appears to be the best-performing model. ResNet, a previously proposed image-classification model, regulates size through depth scaling, while MobileNet is a representative method of controlling the model through width scaling.
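EfficientNet's simultaneous depth/width/resolution scaling can be sketched as below. The coefficients alpha = 1.2, beta = 1.1, gamma = 1.15 follow the published EfficientNet compound-scaling search, while the base depth, width, and resolution used here are illustrative assumptions.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Depth, width and resolution multipliers grow jointly with phi."""
    return alpha ** phi, beta ** phi, gamma ** phi

def scaled_config(phi, base_depth=18, base_width=64, base_res=224):
    """Apply the three multipliers to a (hypothetical) base network."""
    d, w, r = compound_scale(phi)
    return round(base_depth * d), round(base_width * w), round(base_res * r)

print(scaled_config(0))  # (18, 64, 224) -- the unscaled base model
print(scaled_config(1))  # (22, 70, 258) -- one compound-scaling step
```

The point is that a single coefficient phi grows all three dimensions together, instead of tuning depth (ResNet-style) or width (MobileNet-style) in isolation.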

Framework to recognize the Customers' emotions and needs
The suggested platform is non-intrusive in analyzing customer sentiments and actions throughout the customer experience. It is composed of the primary components represented in Figure 1.

Figure 1: Framework with primary component
Sensor Analysis and Identity, Gaze Detection, Voice Recognition, Age and Gender Recognition, and Facial Language and Feelings Classification are all part of the suite. These modules, except the last two, use IP Wi-Fi full-HD cameras equipped with PTZ technology and autofocus, or 4K webcams when possible; this is because some environments do not allow cameras to be placed at eye level, so even if the webcam configuration is optimal, using IP or security cameras is often the only option. All of the key points' cameras are connected to a centralized computer, which analyses each picture to determine the customer's age, gender, emotional state, and gaze locations. The Expressions and Emotions Recognition module detects the primary Ekman emotions (happiness, sadness, anger, fear, contempt, disgust, and surprise) with a built-in Convolutional Neural Network. This network takes as input all the frames that make up the video stream and outputs a percentage value for each emotion. Additionally, it offers metrics for Interest (i.e., how "focused" the subject is) and Valence (how positive or negative the encounter was). Ekman's FACS handbook describes a method for mapping users' feelings and approximating facial expressions using a concept called "Action Units"; these results show a high degree of correlation with that idea. This component employs a Convolutional Neural Network trained on the FER+ and EmotioNet crowd-sourced datasets. The first dataset comprises around 13,000 pictures categorized using Ekman's emotion classification system. The second dataset contains about 100,000 images obtained and classified using the FACS action units.
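As a sketch of how per-frame network outputs could be turned into the percentage values described above: the CNN itself is replaced here by invented logits, and only the seven-way softmax over Ekman's categories and the per-stream averaging reflect the text.

```python
import numpy as np

EMOTIONS = ["happiness", "sadness", "anger", "fear",
            "contempt", "disgust", "surprise"]

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def stream_emotion_percentages(frame_logits):
    """Average per-frame softmax outputs of the (hypothetical) CNN and
    report each Ekman emotion as a percentage over the whole stream."""
    probs = np.array([softmax(f) for f in frame_logits])
    mean = probs.mean(axis=0)
    return {emo: round(100 * p, 1) for emo, p in zip(EMOTIONS, mean)}

# Fake logits for a 3-frame clip in which "happiness" dominates.
frames = np.array([[3.0, 0, 0, 0, 0, 0, 0],
                   [2.0, 1, 0, 0, 0, 0, 0],
                   [4.0, 0, 1, 0, 0, 0, 0]])
scores = stream_emotion_percentages(frames)
print(max(scores, key=scores.get))  # happiness
```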
CNNs are also the basis for the Age and Gender module, which is trained on data scraped and annotated from the IMDB website (the IMDB-Wiki dataset), comprising over 700,000 pictures.
The gaze detection mechanism is the second component built with deep learning. Once again, this module uses a CNN trained on publicly available datasets and on additional datasets acquired via crowdsourcing platforms such as Microworkers and Amazon Mechanical Turk. To determine what the user was looking at when the video was taken, this component receives as input a video feed acquired from the camera, splits it into frames, and examines each frame to identify the face and the ocular location through the neural network. An instrument like this can help businesses estimate which product types attract the most customer attention. For output, such a tool creates heatmaps overlaid on either photographs of the retail space or detailed diagrams of the various places of interaction.
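A minimal version of the heatmap output might look like the following, where frame-by-frame gaze estimates (invented coordinates here, not real tracker output) are accumulated as Gaussian blobs over a grid representing the shop image:

```python
import numpy as np

def gaze_heatmap(gaze_points, shape=(48, 64), sigma=2.0):
    """Accumulate estimated gaze fixations into a heatmap over the
    shop image; each fixation adds a Gaussian blob of attention."""
    heat = np.zeros(shape)
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    for (y, x) in gaze_points:
        heat += np.exp(-((ys - y) ** 2 + (xs - x) ** 2) / (2 * sigma ** 2))
    return heat

# Frame-by-frame gaze estimates: two fixations near one shelf, one elsewhere.
points = [(10, 20), (11, 21), (30, 50)]
heat = gaze_heatmap(points)
hot_y, hot_x = np.unravel_index(heat.argmax(), heat.shape)
print(hot_y, hot_x)
```

The hottest cell ends up where fixations cluster, which is exactly the "which products attract attention" signal described above.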
Speech samples can be gathered in a poll setting or captured from the mics built into each touchpoint and used by the speech recognition subsystem. If an expression detection system already exists, it can be reused within the suggested framework. Researchers are still determining which programmes are the most effective at processing various speech characteristics (e.g., the AudioProfiling tool).
The Biofeedback mechanism makes it possible to collect physiological data such as heart rate and respiration rate. Such factors can be monitored covertly with the help of VPGLIB (formerly QPULSECAPTURE). This software is an OpenCV extension module that analyses facial footage using digital image processing to determine a person's heart rate and estimate how often they breathe. The same sensor used for face identification can thus also be used to obtain the subject's pulse rate and breathing rate, with an error of fewer than five beats per minute (bpm). The following stage correlates these numbers with the percentage values of the detected emotions (joy, anger, etc.). It has been shown that there is a connection between these metrics and mental state. To achieve this goal, we use SensauraTech's tools.
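The video-based pulse estimation can be sketched roughly as follows: a generic rPPG-style frequency analysis of the mean green-channel intensity of the face region, not VPGLIB's actual algorithm, and the input signal here is synthetic.

```python
import numpy as np

def estimate_heart_rate(green_signal, fps):
    """Estimate pulse (bpm) from the mean green-channel intensity of
    the face region: pick the dominant frequency in the 0.7-3 Hz band."""
    sig = green_signal - np.mean(green_signal)
    freqs = np.fft.rfftfreq(len(sig), d=1.0 / fps)
    power = np.abs(np.fft.rfft(sig)) ** 2
    band = (freqs >= 0.7) & (freqs <= 3.0)      # 42-180 bpm: plausible pulses
    return 60.0 * freqs[band][np.argmax(power[band])]

# Synthetic 10 s clip at 30 fps with a 1.2 Hz (72 bpm) pulse plus noise.
fps = 30
t = np.arange(300) / fps
signal = (0.5 * np.sin(2 * np.pi * 1.2 * t)
          + 0.05 * np.random.default_rng(1).normal(size=300))
print(round(estimate_heart_rate(signal, fps)))  # 72
```

Restricting the search to a physiological band is what keeps lighting flicker and slow illumination drift from being mistaken for a pulse.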
The Identity Module employs a facial recognition engine that consults a library of previously saved pictures (such images can be gathered during the customer enrolment required for the loyalty card) to verify the customer's identity at each point of interaction. This way, a unique image is captured each time a client walks past a camera. This can be accomplished with the help of several application programming interfaces (APIs), such as the Facial Detection APIs from Lambda Labs [6].
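The gallery lookup behind such a module can be illustrated with a small cosine-similarity matcher. The 4-dimensional embeddings, customer IDs, and threshold below are invented stand-ins for the output of a real face-recognition API.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(probe, gallery, threshold=0.8):
    """Match a face embedding against enrolled customers; return the
    best customer id, or None if no enrolment is similar enough."""
    best_id, best_sim = None, threshold
    for cust_id, emb in gallery.items():
        sim = cosine(probe, emb)
        if sim > best_sim:
            best_id, best_sim = cust_id, sim
    return best_id

# Hypothetical embeddings gathered at loyalty-card enrolment.
gallery = {"cust_001": np.array([0.9, 0.1, 0.0, 0.1]),
           "cust_002": np.array([0.1, 0.8, 0.2, 0.0])}
probe = np.array([0.85, 0.15, 0.05, 0.1])   # face seen at a touchpoint
print(identify(probe, gallery))  # cust_001
```

The threshold trades off false matches against missed re-identifications; a deployed system would calibrate it on held-out enrolment data.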
A web-based graphical user interface presents the information generated by the platform in the form of panels. The interface graphically represents the user's emotional state as a function of time, with values expressed on a scale from -100 to 100. In addition, it shows breakdowns of how people generally feel across a range of options. One or more clients' touchpoint data can be visualised in real time. Lastly, median values are reported.

Application of the proposed framework
In a pilot study, the system's performance was compared to that of conventional video analysis in identifying the sex, age, and emotional state of the target audience; to this end, the system was installed in a modest men's and women's apparel shop for one week.
To monitor the customer experience with respect to these three factors, the technology was integrated into the retail space at three key points: (1) the customer entering the store; (2) the customer adjusting their garments in front of the mirror; (3) the customer standing at the register to pay. As shown in Figure 2, the actual system configuration consists of the following: a Foscam R2 1080p IP camera mounted on a bracket 2.10 m above ground level in the front right corner of the entrance, directly behind the counter; to the left of the mirror, a platform at a height of 1.62 m, on which a Logitech QuickCam Pro 9000 and a Logitech Brio 4K camera are placed. When a client turns their back to the mirror but turns their head and shoulders towards it, the video cameras are angled so as to capture the entire length of their body in the picture. On the same rack as the two USB webcams is an Asus VivoMini UN62 running Linux Ubuntu 14.01, with an Intel Core i5-4210U processor, Intel HD Graphics 4400, and 4 GB of dual-channel DDR3L 1600 MHz memory. A Sitecom N300 modem connects the Asus VivoMini and the IP camera over Wi-Fi. Ninety shoppers (45 men and 45 women, evenly split among three age groups: 21-35, 36-45, and 46-55) were randomly chosen from the store loyalty card programme. Participants gave their full permission for their confidential and private data to be processed after being briefed on the study's aims and procedures. They were given coupons for a 9% discount as an incentive. Three regular business days were used for testing. The selected clients were given the option of coming in on any scheduled day and were at liberty to select any available date they liked. Everyone was instructed to pick out an article of clothing and try it on. They were free to make a purchase whenever they were ready, but were under no pressure to do so. In any case, they had to pick up the coupon at the register before leaving the shop.
To ensure the accuracy of the system's output data, a behavioural psychology specialist analysed the captured videos. For key point 1, we focused on the first thirty seconds a consumer spends in the shop. Time spent in front of the mirror is associated with the second key point, and time spent at the cash register to receive the coupon and ultimately pay is associated with the third.

Result for the application
The analysis was performed by watching the videos corresponding to the contact areas taken into account. There were 1116 video clips, each correlating to a distinct consumer reaction (312 for key point 1, 551 for key point 2, and 392 for key point 3). An expert associated each of them with the following: • a primary Ekman emotion (or "neutral" or "not disclosed"); • the client's sex and age as they appear in the clip.
The suggested Face Emotion Detection System's data was then contrasted with these results. On average, the algorithm successfully identified faces in 96% of the video clips captured by the Logitech QuickCam and 94% of those captured by the Logitech webcam, but only 82% of the videos captured by the IP camera. The IP camera's location near the storefront made it susceptible to fluctuations in natural light, which caused several recordings to be overexposed.

Figure 3: Characterization of customers' emotions
The method accurately predicts feelings of joy and surprise. Nonetheless, a few glaring misinterpretations stand out. The algorithm consistently made incorrect predictions, such as predicting "neutral" when "sadness" was intended, "anger" when "disgust" was intended, and "surprise" or "sadness" when "fear" was intended. Though the algorithm has a higher error rate for women, it can still accurately predict consumer sex (total accuracy equals 0.92). In terms of age bracket, the precision of the forecast is equal to 0.83. Figures 5 and 6 demonstrate that the method is most accurate at estimating a person's age when they are between 21 and 35, while it performs less well for those between 46 and 55. This is likely attributable to features of the actual training sets used.
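The reported accuracies can be read off a confusion matrix as follows; the matrix below is a hypothetical example constructed to match the reported pattern (overall sex accuracy 0.92, with a higher error rate for women), not the study's actual counts.

```python
import numpy as np

def accuracy(cm):
    """Overall accuracy from a confusion matrix (rows: true, cols: predicted)."""
    return np.trace(cm) / cm.sum()

def per_class_recall(cm):
    """Fraction of each true class that was predicted correctly."""
    return np.diag(cm) / cm.sum(axis=1)

# Hypothetical sex-classification counts: rows = true male / true female.
cm = np.array([[480, 20],     # true male:   480 correct, 20 misclassified
               [60, 440]])    # true female: 440 correct, 60 misclassified
print(accuracy(cm), per_class_recall(cm))
```

Here the overall accuracy is 0.92 while female recall (0.88) trails male recall (0.96), mirroring the asymmetry described in the text.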

Conclusion
This article presents an emotion detection system that can monitor customers' feelings at every stage of the customer life cycle. This method represents several firsts in the shopping industry. By watching customers in a non-intrusive manner, the system can provide the business with vast data about consumers (including spontaneous emotional reactions) that has previously been impossible to gather with conventional anthropological methods. As a result, businesses can use this technology to compile a more comprehensive picture of their customers than basic information alone.
When compared to the current state of the art in emotion identification and research technology, the suggested real-time emotion recognition platform offers the following benefits: it enables tracking consumers' emotional state without them being conscious of it (although, for privacy reasons, they should be told that they are being monitored), and it is modular, implementing non-invasive methods and integrating multiple technologies. Furthermore, it uses numerous emotion detection tools and techniques, so the information it provides will be more trustworthy than that of any competing system. Each emotion-recognition device represents a standalone component; in this manner, each piece can operate as a standalone utility even when others are unavailable.
The user interface is entirely web-based and can be accessed from anywhere (protected by security protocols), allowing users to view the stored information remotely.
The Face Expression and Mood Detection Module and the Age and Gender Identification Module have undergone thorough testing to verify their correctness. The data indicates that the suggested system can be used to successfully evaluate CX in a shop by monitoring customers' feelings and gathering personal information (i.e., age and gender) in place of the standard video analysis done by specialists in behavioural psychology. However, it showed several shortcomings. In particular, external circumstances (such as lighting and customer-camera distance) significantly impact the efficacy and precision of the examined module. More research is needed to fully comprehend the constraints of its execution and to evaluate the effectiveness of its other functional components.

Figure 2 :
Figure 2: System setup with different cameras

Figure 6 :
Figure 6: Confusion matrix for Age prediction

Figures 5 and 6:
Figures 5 and 6 display the system's performance on the sex categorization and age-range classification tasks.