An Intelligent Fashion Object Classification Using CNN

Every year the count of visually impaired people is increasing drastically around the world. At present time, approximately 2.2 billion people are suffering from visual impairment. One of the major areas where our model will affect public life is the area of house assistance for specially-abled persons. Because of visual improvement, these people face lots of issues. Hence for this group of people, there is a high need for an assistance system in terms of object recognition. For specially-abled people sometimes it becomes really difficult to identify clothing-related items from one another because of high similarity. For better object classification we use a model which includes computer vision and CNN. Computer vision is the area of AI that helps to identify visual objects. Here a CNN-based model is used for better classification of clothing and fashion items. Another model known as Lenet is used which has a stronger architectural structure. Lenet is a multi-layer convolution neural network that is mainly used for image classification tasks. For model building and validation MNIST fashion dataset is used.


Introduction
Fashion is among one of the world's most rapidly growing and ever-changing industries.With new trends and designs appearing every season, the fashion industry is always evolving.With the advent of technology and the widespread use of the internet, fashion has become more accessible to people worldwide.Fashion object detection is a rapidly growing field of computer vision that has garnered significant interest in the past few years due to its potential uses in various industries, like fashion retail, e-commerce, and social media.The ability to detect and classify different fashion objects such as clothing, accessories, and footwear in images has numerous practical benefits, including improved product search and recommendation systems, enhanced inventory management, and more effective marketing campaigns.In the fashion industry, object classification can aid in automating product categorization, improving online shopping experiences, and enabling more accurate visual search capabilities.This can help identify the type of clothing, material, and color of an item in an image, retailers can better organize their inventory and offer personalized recommendations to customers.Additionally, social media platforms such as Instagram and Pinterest can use fashion object detection to enable more accurate tagging of products in images, making it easier for users to purchase items they see online.In past years, deep learning has come up as an important tool used for image recognition and its classification as well, allowing machines to recognize patterns in complex datasets with remarkable accuracy [1].Deep learning models have been used in a wide range of fields, including robotics and autonomous systems as well as computer vision and natural language processing [2].The main goal of this research is to create a deep learning model that can correctly identify a variety of clothing articles.This kind of application can be helpful for visually impaired individuals to recognize different fashion-related items.Convolutional Neural Networks (CNNs) [3] are among the type of deep learning model that is frequently used in image recognition tasks.These models consist of multiple layers of interconnected neurons that learn to identify different features in images, such as edges, shapes, and textures.In this work, a fashion object recognition system has been developed using CNN.

Literature Review
Owais et.al. [4] focused on one of the hyperparameters of CNN which is the filter size.They recorded the Accuracy of 3x3, 5x5, and 7x7 filter sizes.Aside from the filter sizes everything is kept unchanged, with the model architecture remaining unchanged.The purpose of this experiment was to clarify how filter sizes affect image classification.13 layered CNN was used here and they revealed that a 3x3 filter size was the best fit for the model.According to their experimental findings, accuracy, and filter size are inversely related.They have achieved an Accuracy of 92.68%.Though being less dense our proposed model gives better accuracy.Their proposed model uses Max Pooling while ours uses Average Pooling.Greeshma et.al. [5] performed several techniques to examine the effects of various regularisation and Hyper-Parameter Optimisation (HPO) strategies using deep neural networks.Two alternative neural network architectures having 2 ConvNets and 4 ConvNets are displayed in this work.Data Augmentation is also used along with deep ConvNet.The pooling function is also used here to resize the dimensions of the input image and also to prevent the model from being overfitted.This paper uses Adadelta as an optimizer along with Adam.Adam is more powerful than Adadelta which is one of the reasons our proposed system uses Adam as an optimizer.They have concluded that among all the models used, they have obtained the best accuracy in combination with CNN4, HPO, and Reg models.It achieves an Accuracy of 93% by Using classical CNN which is a bit less than ours as well.Tang et.al. [6] have studied in depth the roots of deep residual networks and worked in the direction of their optimization.It also includes the use of Residual Networks (RNN), Wide Residual Networks (WRN), and Pyramidal Residual networks (PyramidNet).The results gained here demonstrate that the model's performance can be enhanced by widening the network partly.This model takes more time and requires more computation power than our proposed system.They have found a combination of models, TTA technologies, Ranger, Auto-Augment, and CutMix, and is based on PyramidNet; which gives an accuracy was 96.21%.Kadam et.al. [7] explores the development of convolutional neural networks (CNNs) for image classification, focusing on their application to fashion classification.The review highlights five different architectures with varying convolutional layers, filter size and fully connected layers.This analysis is crucial to understanding the effectiveness of the proposed state-of-the-art model for fashion image classification, which achieves 92.76% accuracy with batch normalization using CNN.Greeshma et.al. [8] have utilized one of the most straightforward and efficient single feature descriptors HOG.The HOG-based feature extraction method is used here for recognizing fashion products.One of the popular machine learning classifier techniques, multiclass SVM, is utilized to train the images.As a robust gradient-based feature descriptor that excels at data discrimination and is substantially superior to other feature sets, this combination is used.Feature Extraction is also used here.This work is performed in MATLAB and it has achieved an accuracy of 86.53%.Vijayaraj et.al. [9] have proposed a model with different activation functions, optimizers, learning rates, dropout rates, and batch sizes.Used deep neural network for Fashion based Image Classification.For extraction of the classification pattern different layers such as max, average and, sum pooling with fully connected layers are applied.The images are tested with ANN as well as the CNN.The highest Accuracy of 90%.Yusi Tang et.al. [10] have studied deep residual networks in detail and implemented them.Hyper-parameters' considerable effects on neural network models prompted research into the best way to build model structure and streamline the training process.Next, the training models' performances under various conditions were compared, and then the best training model was filtered which achieved an accuracy of 96.21%.Kexin Zhang et.al. [11] aims to develop and optimize a Recurrent Neural Network (RNN).To reduce the vanishing gradient problem and overcome RNN, LSTM's are used here.In addition to testing the Heuristic Pattern Reduction approach and the Network Pruning method, fine-tuning and crossvalidation are also applied to optimize the model.Pytorch was used and they secured an accuracy of 89%.Bhatnagar et.al. [12] have used CNN as their deep learning model.For simplicity and learning process acceleration, they have developed three distinct convolutional neural network designs and utilized batch normalization and residual skip connections.By using a two-layer CNN along with Batch Normalization and Skip Connections They have attained an accuracy of 92.54%.Olivia Nocentini et.al. [13] have increased state-of-the-art accuracy by multiple convolutional neural networks with 15 convolutional layers (MCNN15).MCNN is a combination of Multiple Convolutional Neural Networks.Searching hyper-optimization and data augmentation techniques are also applied to improve the generalization of the models.It achieved a classification accuracy of 94.04%.Where, 0 <   < 1.

Proposed architecture
In this proposed work two models are used namely Classical CNN and Lenet as shown in figure 2 and 3. Classical CNN consists of different types of layers.Initially, a block of layers composed of the Convolution Layer, Dropout Layer, and Average Pooling layer is used twice.After that, another block comprised of Convolution, Dropout followed by a flatten, and finally two dense layers are used.Except for the output layer, where the Softmax activation function is utilized, each convolution layer has used the relu activation function [15].The second model used here is Lenet-5 which is an advanced version of CNN.It is the earliest pre-trained model.It has three different sets of convolutional layers, followed by average pooling.Two fully connected layers follow them.In the end, a Softmax classifier is used which classifies the images into respective classes [16].
The optimizer and loss function used in both models is Adam and sparse categorical cross-entropy respectively.ReLU (R(x)) has become one of the desired selections for a hidden layer's activation function because it is usually easier to teach and offers greater accuracy than others [17].Additionally, it resolves the vanishing gradients, and its behavior is nearly linear as per equation 2.

Softmax
It is also known as an exponential function, it is typically an output layer activation function [18,19].The function creates a probability distribution from the k input values in such a way that the sum of all the resulting output values is 1 as shown in equation 3.
Where e zi = Individual class output k= Number of classes

Adam
Adam is a stochastic gradient descent algorithm used for training deep learning models.The learning rate used here was 0.001.One of the main reasons for its popularity is it can handle sparse gradients on noisy problems easily [20,21] which we can observe from equation 4.
Where w new , w old = new and old weights η= Learning rate ḿ= Momentum term ŝ= Scaling term ℇ= Smoothing term to avoid zero division error

Sparse categorical cross-entropy
A low level of categorical cross-entropy calculates the difference between labels and forecasts [22].It is mainly utilized for multi-class classification models as a loss function where an integer value is assigned to the output label which can be seen in equation 5.

Dropout
Dropout is one of the most widely used Regularization Techniques [23].The dropout function performs by arbitrarily dropping out unit activations for a single gradient step in a network [24].Increasing dropout value results in a better model.

Results and Analysis
The accuracy of a classification model's predictions is gauged by its accuracy.In the figure 4 and 6, the accuracy of both the models during training and validation is depicted.The accuracy is the proportion between correctly predicted and total predicted images [25].It is possible to achieve the traditional CNN model's training and validation accuracy at 91.64 and 93.76 respectively.The LENET model's training and validation accuracy is determined as 97.73 and 98.48 respectively as shown in Table 1.
The LENET model's prediction effectiveness increases with its increased precision.When compared to traditional CNN, LENET performs better overall because of its superior architectural design and wise choice of activation function, which includes ReLu, regularization techniques such as dropout, and the use of the most efficient optimizer which is Adam [26].The accuracy and loss curve of both models is shown in figure 4 gradient descent problem [27].In this case, the Adam optimizer moved the gradient in the descent direction to reach the global minima.This reduction in the gradient value results in achieving better accuracy and reducing the loss.The pixel normalization operation also played a vital role during the process of model training.It helped the neural network to learn the features more easily from the images.This is also one of the key reasons to avoid overfitting of the proposed Neural Network structure.

Table1. Comparison between results of CNN and LeNet
Model Name

Conclusion
Our tests show that CNN and LeNet models are successful in classifying the Fashion MNIST dataset.The LeNet model achieved a higher accuracy than the CNN model, indicating that deeper and more complex architectures can yield better results for image classification tasks.
The results also suggest that the Fashion MNIST dataset is a challenging but reliable benchmark for evaluating the performance of image classification models.Future research could explore the use of other deep learning architectures and transfer learning techniques to further improve the classification accuracy of the dataset.A detailed investigation has been done on deep learning-based models for the classification of several fashion items.It can be used for robot assistance systems which could help visually impaired people as well as people of old age.The Proposed model has shown higher accuracy as compared to other models.

3. 1 .
Dataset description and pre-processing Classical CNN and Lenet are the two models proposed here for the accurate identification of different fashion EAI Endorsed Transactions on Industrial Networks and Intelligent Systems | Volume 10 | Issue 4 | This is the title 3 items.MNIST fashion dataset is utilized for the training and validation of the model [14].The size of the images is 28x28 pixels.Out of the total of 60,000 images, 5000 were considered for testing and 55,000 were used for training.The training samples are subdivided into training and validation.For validation, 5000 samples were taken and the rest were utilized for training.Each image pixel   is further divided by 255 to normalize them between 0 and 1 as shown in equation 1.This normalization helps the Neural Network to understand the features in a better manner.The dataset consists of fashion items which are classified into

Fig 3 :
Fig 3: LeNet Architecture , 5, 6 and 7.The Table 2 shows a comparative Analysis between the performance of the proposed models with the different literatures considered in this work.The Lenet model is computationally more efficient than the different models discussed in the table 2 because of its parameter efficiency and simplicity.The other reason for this significant performance of the model is due to the modern related parameter selections such as Activation function, Optimizer and Dropout.The Relu activation function and Dropout operation helped in avoiding the vanishing EAI Endorsed Transactions on Industrial Networks and Intelligent Systems | Volume 10 | Issue 4 |