Google Maps Data Analysis of Clothing Brands in South Punjab, Pakistan

The Internet is a popular, first-hand source of data about products and services. Before buying a product, people try to gain quick insight by scanning online reviews of it. However, searching for a product, collecting all the relevant information, and reaching a decision is a tedious task that calls for automation. Such decision-assisting text analysis systems are not readily available worldwide, and they remain out of reach for the major cities of South Punjab, such as Bahawalpur, Multan, and Rahimyar Khan. This creates a gap that needs to be filled. In this work, the popularity of clothing brands in these three cities of South Punjab is assessed using sentiment analysis, prioritizing brands based on organic feedback from their potential customers. This study uses a combination of quantitative and qualitative research to examine online reviews from Google Maps. The task is accomplished by applying two machine learning techniques, Logistic Regression (LR) and Support Vector Machine (SVM), to Google Maps review data using the n-gram feature extraction approach. The SVM algorithm performed better than LR with the uni-bi-trigram feature extraction method, achieving an average accuracy of 80.93%.


Introduction
Customer feedback is vital in determining a brand's reputation. Most customers willingly share their genuine experiences and thoughts about a product or service. Online reviews help brands improve their product quality, customer engagement, credibility, and trust [1]. Customers commonly check the overall rating of a product or manually skim the previously recorded text feedback, analyse it, and then decide whether or not to buy. This manual effort of searching, analysing, comparing, and concluding is tedious and often ends in a shallow or misleading conclusion about the targeted product [2,3]. Therefore, an algorithmic method should be introduced to overcome this limitation [4]. This study is an effort to provide the people of South Punjab with a quick, review-driven sentiment classification system for clothing brands that may assist them in quantifying trends in the cities of Bahawalpur, Multan, and Rahimyar Khan.
* Corresponding author. Email: mbalvi@iub.edu.pk
The data in this research is extracted from brand reviews posted on Google Maps by potential customers. Google Maps is the fastest-growing review site, surpassing Facebook [5]. Many brands and other businesses are listed on Google Maps, where a user can look up an outlet's location and give feedback about its products or services. The proliferation of reviews on Google Maps increases the platform's significance. The extracted customer reviews are in free-text form. Text data have inherent shortcomings that can hinder the sentiment analysis process. Such drawbacks of free-text data include:
• Data is noisy.
• Data contains irrelevant content.
• It contains spelling errors and contractions.
• It often consists of user-improvised language related to computer-mediated social networks.
The flow of this research follows data extraction from Google Maps, data integration, data cleaning, preprocessing, feature engineering, and model development. The selected features are fed to two state-of-the-art machine learning algorithms, LR and SVM, to develop the predictive model. The predictive model can classify customer reviews of clothing brands in Bahawalpur, Multan, and Rahimyar Khan. This research benefits a broad community, including brand managers, owners, and potential customers, in terms of statistical analysis and descriptive judgment. Through the statistical analysis, brand managers and owners can better understand customer sentiments and market trends, and hence reach a better market share. At the same time, customers can scrutinize the descriptive judgments about the brands before using their services. This study resulted in a machine learning-based model that utilizes Google Maps reviews to help people evaluate a clothing brand quickly. The developed model obtained an average accuracy of 80.93% on the validation dataset.
The paper is organized as follows: section 2 reviews the related literature, section 3 describes the dataset, and section 4 presents the research method. The experimental results are discussed in section 5, and section 6 concludes the work and indicates future directions. Finally, the paper ends with a list of references.

Related Work
This section describes research related to systems that use web technologies to accumulate customer feedback in order to enhance, improve, or change a service style or product quality. Such feedback analysis systems can be developed using Python and allied tools such as Selenium, Beautiful Soup, lexicon-based sentiment analysis libraries, and scikit-learn [6]. The Natural Language Toolkit (NLTK) is another powerful tool for working with human-generated text data; it encompasses a cluster of open-source modules that provide easy-to-use interfaces to over 50 corpora and lexical resources [7]. Each passing second adds millions of megabytes of data to the digital world via the Internet [8], and the extraction of data from the Internet is known as web harvesting, web scraping, or web extraction. Researchers are taking advantage of the vast amount of available data and using it for research purposes [9]. Beautiful Soup [10] and Selenium WebDriver [11] are among the handy tools that make scraping [12] easier by automating tasks to reduce human intervention. Brand review data is vital and appealing to many researchers and analysts seeking insights into trends from customer-generated product reviews. The authors of [13] applied text mining and network analysis to product reviews. They also suggested an approach that allowed managers to control the strengths and weaknesses of brand image effectively. Filipa Rosado-Pinto et al. worked on customers' online restaurant reviews [14]. They explored the brand's authenticity and consumer brand engagement using text-mining techniques.
Online reviews do not directly impact sales, but indirectly they are very effective in helping customers choose the right brand, which can boost the brand's sales. Stephen J. Carson et al. [15] studied the effects of online customer reviews on sales and the relationship between them. Boonyanit M. and Viriya T. [16] examined the customer experience by analysing online reviews of different restaurants listed on Google Maps and used VADER [17] for review classification. They applied a logistic regression algorithm to obtain results.
Supervised learning methods require annotated data samples, whereas text data is often unlabelled. Usually, human annotators are used to label data samples. The overall human labelling accuracy is 82.9%, as reported in human labelling experiments [18]. However, human annotation involves a lot of time and expense. An alternative for text data annotation, especially for sentiment classification, is to use sentiment lexicons such as VADER, SentiWordNet, and TextBlob.
Text preprocessing plays a vital role with unstructured text data. The authors in [19] presented text preprocessing techniques that eliminated noisy data and improved the model's performance. Another work that reported the impact of text preprocessing on model performance is described by S. Alam and N. Yao in [20]. M. B. Alvi et al. studied the implications of recursive preprocessing on text data [21]. They have used 19 preprocessing techniques and suggested a preprocessing pipeline that works iteratively. Another work by the same authors reported on developing a hybrid model that could perform sentiment analysis on the issue of global warming [22]. They claimed to achieve 86% accuracy using their model. Shuai Liu et al. [23,24] introduced engineering applications of effective hybrid information and big data processing.
Tokenization is one of the initial and essential processes in preprocessing pipelines, separating input strings into individual tokens. The significance and complexity of tokenization are addressed in [25]. The authors of [26] performed a comparative evaluation of tokenization based on quantitative methods. Stemming and lemmatization are similar and effective preprocessing steps that help reduce the vector space of a given data set. Vimala B. and Ethel Lloyd-Yemoh [27] proposed a performance-based comparison between stemming and lemmatization and found that lemmatization techniques produced the best result. Michael W. Browne reported on cross-validation methods in [28]. The research concluded that predictive accuracy depends on sample size and the number of predictor variables.
Ronen Feldman described the applications and challenges of one of the hottest fields in computer science: sentiment analysis [29]. Alessia D'Andrea et al. in [30] explained the classification of approaches, tools, applications, and implementation of sentiment analysis. The performance of various sentiment analysis methods showed that SVM gave higher accuracy than the entropy method, as shown in [31]. The authors of [32] investigated Decision Tree, LR, Naïve Bayes, Random Forest, and SVM classifiers implemented in Apache Spark (an in-memory intensive computing platform). Their findings indicated that the LR for product reviews achieved the highest classification accuracy.

Dataset
Data mining is the computational process of identifying, engaging, categorizing, analysing, and maintaining data [33]. People use Google Maps not only for guidance on locations but also to share their feedback. Many clothing brands in Bahawalpur, Multan, and Rahimyar Khan are listed on Google Maps. The targeted brands with more than 13 reviews were scraped, and brands with fewer reviews were ignored. The final list contained 51 brands: 18 in Bahawalpur, 24 in Multan, and 9 in Rahimyar Khan. The initial list held 4121 reviews, of which 300 were excluded; the remaining reviews comprised 1416 from Bahawalpur, 1951 from Multan, and 454 from Rahimyar Khan. Reviews were examined and excluded according to the following criteria: (1) empty reviews, (2) non-English reviews, and (3) irrelevant reviews.
Google Maps allows customers to provide star ratings and text feedback, but some users give their feedback only as star ratings. Such star-only feedback produces blank (empty) text fields in the dataset, and the data extractor carried these blank inputs into the primary dataset. The scraped dataset also contained non-English data samples (Urdu script and Roman Urdu reviews). Additionally, the dataset included irrelevant data samples related to business promotions and sports. All these inappropriate reviews were identified and excluded from the dataset used for the final qualitative analysis to avoid skewed results and peculiar outcomes. A sample of the excluded reviews is shown in Table 1, while a sample of the filtered reviews is given in Table 2.
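The three exclusion criteria can be illustrated with a minimal Python sketch. It is not the procedure used in this study: the keyword list and the ASCII-letter heuristic are hypothetical stand-ins, and the heuristic catches Urdu-script text but not Roman Urdu, which would need a language-identification step or manual inspection.

```python
# Hypothetical keywords for off-topic reviews (promotions, sports); the paper
# does not state how irrelevant reviews were detected.
IRRELEVANT_KEYWORDS = {"cricket", "football", "discount code", "follow my page"}


def is_mostly_ascii_letters(text, threshold=0.9):
    """Crude script check: share of alphabetic characters that are ASCII."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    ascii_letters = sum(1 for ch in letters if ch.isascii())
    return ascii_letters / len(letters) >= threshold


def keep_review(text):
    """Return True if the review passes all three exclusion criteria."""
    if text is None or not text.strip():
        return False  # (1) empty review, e.g. a star-only rating
    if not is_mostly_ascii_letters(text):
        return False  # (2) non-English script (Roman Urdu is NOT caught here)
    lowered = text.lower()
    if any(keyword in lowered for keyword in IRRELEVANT_KEYWORDS):
        return False  # (3) irrelevant content
    return True


raw_reviews = ["Good quality fabric", "", "Watch cricket live tonight"]
filtered = [review for review in raw_reviews if keep_review(review)]
print(filtered)
```

A keyword filter of this kind is only a first pass; borderline reviews would still need to be checked by hand.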

Experimental Method
This section describes the method adopted in this work. The method includes four main phases: data extraction and integration, data preprocessing, exploratory data analysis (EDA), and model development. In the first phase, a custom-built scraper was used to extract the data from Google Maps. The data integrator merges multiple .csv files into a single .csv file. Since the review text carries no polarity label, the next step is annotating the data. After annotation, the annotators' labels were compared for disagreement, and the conflicting reviews were taken up for discussion. In the data preprocessing step, the integrated data was cleaned by removing unwanted and trivial reviews. The features in the training data were selected to fulfil the assured bias. EDA involved comparative analysis. Figure 1 shows the proposed sentiment analysis system for brand reviews.

Figure 1.
Step-by-step methodology for sentiment analysis system for brand reviews

Data Extraction
Privacy of user credentials is one of the major concerns in the digital world. Therefore, only publicly available reviews that users themselves had published on the Internet were extracted, and no personal information was gathered. Various social networking sites provide Application Programming Interfaces (APIs) that assist in data collection. However, rate limiting and API licensing restrict API-based collection, and manual data collection is too costly in terms of data and time [34].
EAI Endorsed Transactions on Scalable Information Systems 01 2023 -04 2023 | Volume 10 | Issue 3 | e10
The motive for a custom extractor is to provide some additional advantages over traditional extraction methods, i.e., the usage of APIs. Google Maps allows users to search by brand name and to apply search filters such as "location". Hundreds of clothing-related brands were identified as having reviews. The well-known brands in three major cities of South Punjab were examined on Google Maps to determine whether they had enough customer reviews. Brands with more than 13 reviews were retained to maintain the quality of the analysis. Fifty-one brands in these cities fulfilled this requirement. In the second phase, the brands listed in the initial search round were scraped by the custom scraper. The customer reviews were extracted using the developed Google Maps reviews data extractor (RDE). The extractor was developed using "Selenium" and "Beautiful Soup". The scraper operates on the URL of the required web page. It parsed each filtered brand URL, extracted all the reviews on that page, and saved them in a .csv file with the name of the brand in the next column. Once all the reviews from a URL were scraped, the scraper proceeded to the next brand URL. All the reviews from the list were extracted using the same procedure, yielding three different files, one per city, to analyse the brands' popularity among people.
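The parse-and-save step of such an extractor might look like the following sketch. It is not the RDE itself: it swaps Beautiful Soup for Python's standard `html.parser`, and the `review-text` class name, the `extract_reviews` and `save_reviews` helpers, and the sample markup are all hypothetical, since the actual Google Maps page structure is not described here. The page source is assumed to be an HTML string, for example the one exposed by Selenium's `driver.page_source`.

```python
import csv
from html.parser import HTMLParser


class ReviewTextParser(HTMLParser):
    """Collects the text of every <span class="review-text"> (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.reviews = []
        self._inside = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "span" and ("class", "review-text") in attrs:
            self._inside = True
            self._buffer = []

    def handle_endtag(self, tag):
        # Assumes review spans contain no nested <span> elements.
        if tag == "span" and self._inside:
            self.reviews.append("".join(self._buffer).strip())
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self._buffer.append(data)


def extract_reviews(page_source):
    """Return the list of review texts found in an HTML string."""
    parser = ReviewTextParser()
    parser.feed(page_source)
    parser.close()
    return parser.reviews


def save_reviews(brand, reviews, path):
    """Write one row per review, with the brand name in the next column."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["review", "brand"])
        for review in reviews:
            writer.writerow([review, brand])


sample_page = (
    '<div><span class="review-text">Nice fabric</span>'
    '<span class="stars">5</span>'
    '<span class="review-text"> Too pricey </span></div>'
)
print(extract_reviews(sample_page))
```

In a full scraper, a loop over the filtered brand URLs would call `extract_reviews` on each loaded page and then `save_reviews` with a per-brand file name.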

Data Integration
Algorithm 1 shows the functional approach of the data integrator. The developed data integrator combines the data files (in .csv format) into one file, representing all the brand reviews of respective cities. The file integration provides an easy city-wise comparison of all the clothing brands.

Algorithm 1
Input: The data files of the multiple brands of a specific city.
For each data file, do:
1 If (Reviews >= 13), append it to a new data file.
2 Drop the data file having fewer than 13 reviews (if any).
3 Fetch the name of the brand as a value.
4 Extend the new data file with reviews and values.
5 Repeat the steps for other cities.
Output: The unified data file.
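The steps of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading rather than the integrator used in the study: the function name `integrate_city`, the `review` column header, and the use of the file name as the brand name are assumptions.

```python
import csv
from pathlib import Path

MIN_REVIEWS = 13  # threshold taken from step 1 of Algorithm 1


def integrate_city(brand_files, output_path, min_reviews=MIN_REVIEWS):
    """Merge per-brand CSV files of one city into a single (review, brand) file.

    Assumes each input file has a header row with a 'review' column and that
    the file name (without extension) is the brand name.
    """
    merged = []
    for path in map(Path, brand_files):
        with path.open(newline="", encoding="utf-8") as handle:
            reviews = [row["review"] for row in csv.DictReader(handle)]
        if len(reviews) < min_reviews:
            continue  # step 2: drop files with fewer than 13 reviews
        brand = path.stem  # step 3: brand name used as the value
        merged.extend((review, brand) for review in reviews)  # step 4
    with open(output_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["review", "brand"])
        writer.writerows(merged)
    return len(merged)
```

Step 5 corresponds to calling `integrate_city` once per city with that city's list of brand files.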

Data Annotation
The obtained data set does not have any polarity labels. Therefore, a multi-annotator methodology was used to annotate the integrated data set reliably. The annotators agreed beforehand on a few points, such as what to annotate (according to the contextual perspective). Three annotators independently annotated the data into two categories (positive and negative). Contradictory labels on a few data samples were discussed and resolved by adopting the majority-vote label.
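Majority-vote resolution with three annotators and two labels can be sketched as below. The function name `resolve_annotations` and the sample reviews are illustrative only; with an odd number of annotators and binary labels, a strict majority always exists.

```python
from collections import Counter


def resolve_annotations(annotations):
    """Resolve labels by majority vote and list reviews with disagreement.

    annotations: list of (review, [label_1, label_2, label_3]) pairs.
    """
    resolved, conflicts = [], []
    for review, votes in annotations:
        majority = Counter(votes).most_common(1)[0][0]
        resolved.append((review, majority))
        if len(set(votes)) > 1:
            conflicts.append(review)  # flagged for discussion
    return resolved, conflicts


sample = [
    ("Great stitching", ["positive", "positive", "positive"]),
    ("Expensive for shopping", ["negative", "positive", "negative"]),
]
labels, disputed = resolve_annotations(sample)
print(labels)
print(disputed)
```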

Data Pre-processing
Data preprocessing is a rudimentary but essential step of sentiment analysis [19,35]. Textual data cannot be fed directly to machine learning algorithms, so the preprocessing steps convert the raw, uncleaned text into a format compatible with them. In the preprocessing pipeline, the output of one process becomes the input of the next. Tokenization [36,37] is the second preprocessing step; it breaks continuous strings into small fragments: words, keywords, phrases, or symbols. Tokenization, or word segmentation, is a significant step because it helps in masking unimportant words, punctuation, and digits. A raw annotated data set contains impurities, which may cause redundancy or inaccuracy in an algorithm. Text data preprocessing also involves non-trivial term reduction, which comprises case normalization, removal of stop words, special characters, and numbers, and stemming. Case conversion brings uniformity to the text. Removal of special characters and numbers constitutes another preprocessing step. Stemming reduces the different forms of verbs, nouns, and word variants to their base words. These initial preprocessing steps not only result in a cleaner dataset but also reduce the feature space, which increases computational efficiency. After these steps, a document-term matrix is built, which is the second stage of preprocessing. The rows of the document-term matrix denote reviews, whereas the columns represent the (n-gram) features. An n-gram is a contiguous sequence of n words, and n-gram Language Models (LMs) are built from the frequencies with which such sequences appear in a corpus. N-grams are used to improve a system's predictions by means of the probability of occurrence of a particular word. A probability calculation example for the n-gram model is given in Table 3.
• Expensive for shopping.
• Worse brand for shopping.
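The pipeline described above (case normalization, removal of special characters and numbers, tokenization, stop-word removal, n-gram extraction, and the document-term matrix) can be sketched with the Python standard library. The stop-word list is a tiny illustrative one, stemming is omitted because it would need an external stemmer, and the bigram probability uses the usual maximum-likelihood estimate count(previous word)/count(previous), which may differ from the calculation in Table 3.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "it", "and"}  # illustrative list only


def preprocess(text):
    """Lowercase, strip non-letters, tokenize on whitespace, drop stop words."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return [token for token in text.split() if token not in STOP_WORDS]


def ngrams(tokens, n):
    """Return all contiguous n-word sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(tokens, max_n):
    """Union of 1-grams up to max_n-grams (max_n=3 gives uni-bi-trigrams)."""
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features


def document_term_matrix(documents, max_n):
    """Rows are reviews, columns are n-gram features, cells are counts."""
    counts = [Counter(ngram_features(preprocess(doc), max_n)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    matrix = [[doc_counts[term] for term in vocabulary] for doc_counts in counts]
    return vocabulary, matrix


def bigram_probability(token_lists, previous, word):
    """Maximum-likelihood estimate of P(word | previous)."""
    unigram_counts = Counter(tok for toks in token_lists for tok in toks)
    bigram_counts = Counter(bg for toks in token_lists for bg in ngrams(toks, 2))
    if unigram_counts[previous] == 0:
        return 0.0
    return bigram_counts[f"{previous} {word}"] / unigram_counts[previous]


reviews = ["Expensive for shopping.", "Worse brand for shopping."]
tokens = [preprocess(review) for review in reviews]
vocabulary, matrix = document_term_matrix(reviews, max_n=1)
print(vocabulary)
print(matrix)
print(bigram_probability(tokens, "for", "shopping"))
```

In this two-review corpus, "for" occurs twice and is followed by "shopping" both times, so the estimated P(shopping | for) is 2/2 = 1.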

Model Development
Machine learning algorithms are used to build models by using the training dataset. The following algorithms are used for this purpose.

Logistic Regression
Logistic regression is one of the most powerful and widely used algorithms for binary classification problems. Logistic regression describes and tests hypotheses about the relationship between a categorical outcome and predictor variables [38]. In logistic regression, the hypothesis function, parameterized by theta, may be determined using Equation 1, which predicts values between 0 and 1. With the threshold set to 0.5, it classifies the output as true (positive) or false (negative).
The hypothesis function is determined using Equation 2.
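For reference, the standard textbook form of the logistic regression hypothesis and the 0.5 threshold rule can be written as below. The notation (parameter vector \theta, feature vector x, predicted label \hat{y}) is the conventional one and is an assumption here; it may differ from the notation of Equations 1 and 2.

```latex
h_\theta(x) = \sigma(\theta^{T}x) = \frac{1}{1 + e^{-\theta^{T}x}},
\qquad
\hat{y} =
\begin{cases}
1 \ (\text{positive}) & \text{if } h_\theta(x) \ge 0.5 \\
0 \ (\text{negative}) & \text{if } h_\theta(x) < 0.5
\end{cases}
```

As a worked example, a review with \theta^{T}x = 2 gives h_\theta(x) = 1/(1 + e^{-2}) \approx 0.88, which exceeds 0.5 and is classified as positive, whereas \theta^{T}x = 0 lies exactly on the decision boundary with h_\theta(x) = 0.5.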

Support Vector Machine
The support vector machine algorithm has been found to be effective in text mining applications. This large-margin classifier classifies the output as either 0 (negative) or 1 (positive). The SVM finds the optimal separating hyperplane by maximizing the margin on the training data, which can be especially effective in high-dimensional spaces. The margin is the distance between the hyperplane and the nearest data points of the different classes. The SVC uses various parameters to modify the functionality of the algorithm (such as kernel="linear", C=1, loss="squared_hinge", max_iter=1000, penalty="l2", tol=0.0001). The hypothesis function of SVM can be determined by Equation 3:
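For reference, a standard textbook formulation of a linear SVM that matches the listed settings (squared hinge loss with an l2 penalty and regularization constant C) is sketched below. The symbols w and b, the number of training reviews m, and the relabelling of the classes to y_i in {-1, +1} during training are assumptions of this sketch and may differ from the authors' notation.

```latex
\hat{y}(x) =
\begin{cases}
1 \ (\text{positive}) & \text{if } w^{T}x + b \ge 0 \\
0 \ (\text{negative}) & \text{otherwise}
\end{cases}
\qquad
\min_{w,\,b}\; \tfrac{1}{2}\lVert w \rVert_2^{2}
+ C \sum_{i=1}^{m} \max\bigl(0,\; 1 - y_i\,(w^{T}x_i + b)\bigr)^{2},
\quad y_i \in \{-1, +1\}
```

The first term favours a wide margin, while the second penalizes training reviews that fall inside the margin or on the wrong side of the hyperplane; C = 1 weights the two terms equally.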

Results and Discussion
The results of the two models with four feature sets (unigrams, bigrams, uni-bigrams, and uni-bi-trigrams) are shown in Table 9. The SVM-based model outperformed logistic regression on all four feature sets. Overall, the best results were achieved with uni-bi-trigrams, obtaining 83.26% accuracy. The accuracy of the classifiers was also compared, and it was observed that SVM performed best with an average accuracy of 80.93%, followed by LR with an average accuracy of 79.23%. This work analysed the popularity of the clothing brands in Bahawalpur, Multan, and Rahimyar Khan. There was a total of 51 clothing-related brands in this data set. Only brands with 13 reviews or above were extracted and analysed, and others were ignored. After the analysis, it was found that, based on the count of positive reviews, "Edenrobe", "J.", and "Diners" were the most popular brands in Bahawalpur. "Rang Ali Fabrics", "Khaadi", and "Ideas" were found to be the most popular brands in Multan. In Rahimyar Khan, "J.", "Khaadi", and "Diners" were found to be the most popular brands, as shown in Figure 2.
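Ranking brands by their count of positive reviews can be illustrated with a short sketch. The function name `top_brands` and the brand names and labels in the sample rows are placeholders, not the study's data.

```python
from collections import Counter, defaultdict


def top_brands(rows, k=3):
    """Return the k brands with the most positive reviews in each city.

    rows: iterable of (city, brand, label) with label 'positive' or 'negative'.
    """
    positives = defaultdict(Counter)
    for city, brand, label in rows:
        if label == "positive":
            positives[city][brand] += 1
    return {
        city: [brand for brand, _ in counts.most_common(k)]
        for city, counts in positives.items()
    }


# Placeholder data for illustration only.
sample_rows = (
    [("CityA", "Brand1", "positive")] * 5
    + [("CityA", "Brand2", "positive")] * 3
    + [("CityA", "Brand2", "negative")] * 4
    + [("CityA", "Brand3", "positive")] * 1
    + [("CityB", "Brand1", "positive")] * 2
)
print(top_brands(sample_rows, k=2))
```

Note that a raw positive count favours brands with many reviews overall; a ranking by the share of positive reviews could order the same brands differently.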
A comparative study revealed that in the long run, "Limelight", "Uniworth", and "AlkaramStudios" tend to be more popular brands among the people of Bahawalpur. "Gravity", "Shirt and Tie", and "Ismail Gulgushat" were found to be the most popular brands in Multan. "AlkaramStudios", "Breakout", and "Uniworth" were found to be the most popular brands in Rahimyar Khan, as shown in Figure 3.

Conclusion and Future Work
In this work, a sentiment analysis system was developed to infer people's preferences among various brands in the three most populated cities of South Punjab, Pakistan, using Google Maps data. A custom data extractor was developed to collect brand reviews from Google Maps. The extracted 4121 reviews were integrated from three different files into a single file using a robust custom data integrator. The data was annotated by the multi-annotation method, and then preprocessing was applied to the data set. Four different feature sets (unigrams, bigrams, uni-bigrams, and uni-bi-trigrams) were utilized with two machine learning algorithms (LR and SVM) for building the classification model. The experiments were conducted using StratifiedKFold cross-validation to suppress model overfitting, and then accuracies were computed. The results showed that the SVM-based model with uni-bi-trigram features performed better than LR, obtaining an average accuracy of 80.93%. This accuracy is obtained by taking the average of all the accuracies obtained using uni-bi-trigrams with SVM. This comparative study further reveals that "Limelight" is the most popular brand in Bahawalpur, "Gravity" is famous in Multan city, and "AlkaramStudios" is prominent in Rahimyar Khan city.
An extension to this work may include adding multilingual data sets from different data sources (Facebook/Instagram shops, website reviews from the brand's online stores, and related websites) and transforming the city-oriented analysis into brand-oriented across all of Pakistan.