A Machine Learning Approach to Identify Phishing Websites: A Comparative Study of Classification Models and Ensemble Learning Techniques

Phishing attacks are one of the most prevalent types of cybercrime in the world today. Attackers send users emails and messages, and also set up websites, to steal information. Phishing primarily targets corporate websites, such as those for e-commerce, finance, and governmental organizations. To obtain sensitive user information, attackers impersonate legitimate websites. This research explores the use of machine learning algorithms to identify and stop web phishing attacks, and proposes detecting phishing URLs by analyzing various aspects of the URLs. Classification models (Logistic Regression, Random Forest, Decision Tree, KNN, Naive Bayes, and SVM) and ensemble learning techniques (Gradient Boosting, XGBoost, Histogram Gradient Boosting, Light Gradient Boosting, and AdaBoost) were used to detect phishing websites.


Introduction
As Internet usage rises and online transactions become more frequent, phishing is a serious security issue that is quickly getting worse. Phishing is the practice of attempting to obtain sensitive data, such as usernames, passwords, credit card numbers, or other private information, through electronic communication by impersonating a trustworthy entity. Attackers carry out phishing attacks using various methods, including fraudulent emails and websites. Blacklists, which are lists of URLs and Internet Protocol (IP) addresses that have been classified as dangerous, are a systematic method for detecting phishing websites. However, attackers can easily change the URL to avoid being listed on blacklists, using encoding and other techniques [1].
Phishing is a type of cyber-attack that steals user data, including credit card numbers and login credentials for accounts. Phishing attacks are becoming more prevalent all around the world [2]. In 2008, 51,401 phishing websites were identified by the Anti-Phishing Working Group (APWG). According to a survey by Rivest-Shamir-Adleman (RSA) Security, Inc., phishing attacks cost organizations worldwide $9 billion in 2016 [3]. These figures demonstrate the ineffectiveness of current defences against phishing attacks.
Organizations should create a comprehensive plan incorporating technical and non-technical methods to protect against phishing. A few of the technical safeguards include putting a user education program into place, setting up multi-factor authentication, using URL filtering software to block well-known harmful websites, and monitoring network traffic for unusual activity. Non-technical methods include adopting rules and procedures to deal with potential threats, raising awareness of phishing schemes, and constantly evaluating the organization's security posture. By implementing these safeguards, businesses can ensure their users are better protected from phishing scams. Organizations should also consider investing in cutting-edge technology like machine learning or artificial intelligence to spot questionable activity quickly.
Organizations may swiftly identify and mitigate any possible hazards by using these technologies before they become problematic. Lastly, businesses should ensure that their systems are frequently patched and upgraded to guard against the most recent vulnerabilities. Patching systems regularly will assist in lowering the possibility of attackers using known vulnerabilities to access sensitive data [45] [46][47] [48].
To classify phishing websites, the dataset mainly contains URL details. The term "Uniform Resource Locator" (URL) refers to the internet address of a web page or other resource. It is a unique address that allows users to view a certain web page using a web browser [44]. A URL comprises several elements, including the protocol, domain name, subdomain name, and path, as shown in Fig. 1.

Figure 1. Uniform Resource Locator for a Website
The protocol type in a URL (Uniform Resource Locator) indicates the communication protocol used to send data over the internet. The URL starts with the protocol type, followed by a colon and two forward slashes. These protocols include HTTP, HTTPS, FTP, SMTP, and POP3. HTTP is the most popular protocol, and HTTPS offers far higher security than HTTP; FTP and SMTP are only occasionally used.
Domain names are unique names that serve as a website's identifier and sit in the URL between the protocol and the path. A domain name consists of the second-level domain (SLD) and the top-level domain (TLD). Consider the URL http://www.exampleurl.com/info/aboutus.html. The domain name in this URL is "www.exampleurl.com", the top-level domain is ".com", and the second-level domain is "exampleurl".
A phisher creates a domain name that is extremely close to an original or legitimate website's domain name, so that a phishing email appears to come from "www.example-url.com" or "www.example_urls.com".
The Domain Name System (DNS) converts domain names into IP addresses, which are then used to route user requests to the servers that host the website.
Additionally, shortened URLs can be entirely under an attacker's control. Services like Bitly or TinyURL can be used to conceal the link's true destination, which makes it more challenging for the user to determine where the link will take them.
Following the domain name and any port number comes the path, which defines the position of the file ("aboutus.html") in the directory on the server hosting "www.exampleurl.com". The path may also comprise several directory names or a file name on the server. The path is a crucial part of the URL because it enables the web server to give the requesting client access to a particular resource [4] [5].
By using a valid-looking domain name and adding a bogus path to the URL, such as "/login.php", the attacker tricks users into providing their credentials on what appears to be a login or update page, while the attacker is in fact collecting the victims' information.
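To make the URL anatomy above concrete, the following is a minimal sketch that splits a URL into the components discussed in this section, using Python's standard `urllib.parse` module. The function name `split_url` and the returned keys are illustrative choices, not part of the study. The sketch assumes the TLD is the last dot-separated label of the host and the SLD is the label before it, which does not hold for multi-label suffixes such as "co.uk".

```python
from urllib.parse import urlsplit


def split_url(url):
    """Split a URL into protocol, subdomain, SLD, TLD, port, and path."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")
    # Assumption: last label is the TLD, the one before it is the SLD.
    # This naive rule is wrong for suffixes such as "co.uk".
    tld = labels[-1] if len(labels) >= 2 else ""
    sld = labels[-2] if len(labels) >= 2 else ""
    subdomain = ".".join(labels[:-2])
    return {
        "protocol": parts.scheme,
        "subdomain": subdomain,
        "sld": sld,
        "tld": tld,
        "port": parts.port,  # None when the URL has no explicit port
        "path": parts.path,
    }


components = split_url("http://www.exampleurl.com/info/aboutus.html")
lookalike = split_url("http://www.example-url.com/login.php")
```

Comparing `components["sld"]` with `lookalike["sld"]` shows how a lookalike domain differs from the legitimate one by only a character or two, which is the kind of signal URL-based features try to capture.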
The rest of the paper is organized as follows. Section 2 reviews existing work. Section 3 describes the research methodology. Section 4 addresses the state-of-the-art corpora utilized for this classification problem. Section 5 presents an overview of the evaluation criteria. Section 6 discusses the various machine learning approaches. Section 7 presents the authors' observations regarding the proposed research questions. Finally, the conclusion is presented in Section 8.

Literature Review
Rao et al. proposed a novel approach using the RF model and various other ML techniques [6]. The RF approach can deal with both the overfitting issue and sparse or missing data.
Logistic regression is reliable for relating independent variables to membership in one of two groups. Its limitations include redundancy and incompatibility among features and the negative effect of outlier values on prediction. The support vector machine approach can be used instead because it is better suited to a large number of independent variables [7]. It works well for nonlinear problems, although large, noisy datasets are a limitation of this method.
A survey of the main detection methods and a taxonomy for phishing detection was presented by Vijayalakshmi et al. in 2020 [8]. According to an APWG data analysis, phishing attacks increased from 2017 to 2019. The study presented a taxonomy of automated phishing detection solutions. Depending on the input parameters, the taxonomy divided all the solutions into three categories: web address-based methods, webpage content-based solutions, and hybrid approaches. Web address-based approaches were classified into list-based, heuristic rule-based, and learning-based approaches according to the techniques used in the solutions, and web content-based approaches were divided into rule-based and machine learning-based solutions.
Ozgur et al. [9] implemented web phishing classification by collecting their own data from the available resources. The Random Forest algorithm with solely NLP-based features performs the best with a 97.98% accuracy rate for phishing URL identification, according to experimental and comparative data from the implemented classification methods.
Jain et al. [10] proposed a methodology for detecting phishing websites in which, compared with other machine learning approaches, the logistic regression classifier achieved more than 98.4% accuracy.

Methodology
Three datasets altogether were utilized to detect web phishing. The datasets were obtained from the machine learning repositories at UCI and Kaggle. Their features have a direct connection to website content. The datasets varied in size, which is crucial for assessing the precision and effectiveness of the algorithms.
Pre-processing is done on each of the datasets to remove extraneous features and deal with missing values. The data is then divided for training and testing. To create a phishing classifier model, we explored a variety of algorithms [36][37][38]. They include decision trees, Bernoulli Naive Bayes, logistic regression, support vector machines, K-nearest neighbours, random forests, and ensembling techniques [41] like Gradient Boosting, Hist Gradient Boosting, AdaBoost, XGBoost, and LightGBM. Finally, test data are provided to validate the output of the algorithms. Several statistical indicators, including recall, accuracy, and precision, are used to assess performance. The adopted methodology is outlined in Fig. 2.
The objectives of this survey article are to investigate and comprehend the leading machine learning algorithms utilized for phishing website detection, as well as to answer a few research questions:
Q1. What recent datasets are available for this task?
Q2. Which evaluation techniques apply to this task?
Q3. How can machine learning techniques be used to categorize websites?
Q4. What are the results?

Datasets
Three datasets altogether were utilized to detect web phishing. The datasets are drawn from the Kaggle and UCI machine learning repositories [11] [12]. The datasets also vary in size, which is crucial for assessing the precision and effectiveness of the various algorithms.

Dataset-1: The first dataset is titled "Phishing Websites" and contains 2456 URLs with 28 attributes. The dataset documentation does not specify a date of collection. As noted in the dataset description, the dataset was donated to the UCI Machine Learning Repository on March 26, 2015. The information in the dataset, however, might have been obtained from several sources over time prior to the donation date, including the MillerSmiles archive, the PhishTank archive, and Google's search operators. The precise dates of data collection may therefore vary with the original sources and the methods used to obtain the data.
Dataset-2: The second dataset is titled "Web page phishing detection dataset". This 48-feature dataset was created from 5000 authentic websites and 5000 fraudulent websites collected between January and May 2015 and between May and June 2017. Features were extracted using the Selenium WebDriver browser automation framework, which is more accurate and reliable than a parsing strategy based on regular expressions. The dataset is suitable for use with WEKA.
Dataset-3: The third dataset is titled "Phishing Dataset for machine learning". It comprises 87 features extracted from 11430 URLs and is designed to be a standard reference for machine learning-based phishing detection systems. The features are divided into three categories: 56 are based on the structure and syntax of URLs, seven are derived from communication with external services, and the remaining 24 make up the third category. The dataset is balanced, with authentic and phishing URLs making up 50% each.
Among these datasets, Dataset-1 is unbalanced; the remaining two are balanced.

Evaluation Metrics
The effectiveness of machine learning algorithms for categorizing phishing websites can be assessed using a variety of evaluation approaches. Regularly employed metrics include Precision, Recall, Accuracy, and F1-Score [13] [14]. A confusion matrix is a tabular representation of the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) produced by a classifier. It is used to evaluate the effectiveness of a binary classifier.

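Confusion matrix (rows: predicted class; columns: actual class)
                        Actual positive    Actual negative
Predicted positive      TP                 FP
Predicted negative      FN                 TN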
Accuracy, the most fundamental evaluation statistic, is determined by dividing the number of correct predictions generated by the model by the total number of predictions, as shown in equation (1). When the classes are unbalanced, that is, when one class is far more numerous than the other, accuracy can be deceptive.
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \]
Precision is the ratio of true positives (phishing websites correctly identified as such) to all predicted positives (all websites identified as phishing), as in equation (2).
\[ \text{Precision} = \frac{TP}{TP + FP} \tag{2} \]
Recall is the ratio of true positives to all actual positives, i.e., all the phishing websites in the dataset, as in equation (3). When working with unbalanced data, it may be necessary to make a trade-off between precision and recall.
\[ \text{Recall} = \frac{TP}{TP + FN} \tag{3} \]
The F1-Score, the harmonic mean of precision and recall given in equation (4), is a more balanced way to evaluate the performance of the model than accuracy. It is frequently employed when the dataset is unbalanced.
\[ \text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4} \]
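As an illustration of the four metrics, the sketch below computes them from confusion-matrix counts. The function name and the example counts are hypothetical and are not taken from the study's results; they were chosen to show how accuracy can look strong on unbalanced data while precision and recall are noticeably lower.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical unbalanced test set: 50 phishing sites among 1000 websites.
metrics = classification_metrics(tp=40, fp=10, fn=10, tn=940)
```

With these counts, 980 of 1000 predictions are correct, yet one in five phishing sites is missed and one in five alarms is false, which is why the F1-Score is preferred on unbalanced datasets.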

Approaches
The following are some methods for classifying websites. One is analysis of a URL's syntax and structure to establish the category of the URL; phishing websites, for instance, may use URLs that closely resemble those of real websites [5]. Another is the use of machine learning algorithms [40], including the decision tree approach, which is straightforward and efficient and uses recursive partitioning of the data to classify phishing websites. Random forests combine multiple decision trees and aggregate their predictions. SVM is another well-liked technique for accurately identifying phishing websites. Logistic regression is used in binary classification tasks, such as classifying websites as legitimate or phishing. Naive Bayes is a probabilistic approach used to categorize web pages. KNN detects phishing websites based on the distance between data points. Gradient boosting is an ensemble method that increases classification accuracy by combining the results of various decision trees [19][20][21].
Decision Tree: The decision tree algorithm is a popular supervised machine learning method for categorizing web pages or URLs as legitimate or phishing. It is trained on a labelled dataset, with variables such as URL structure, URL length, and port employed as predictors. A decision tree represents a decision process as a set of branches, with the potential outcomes at the leaves. The tree is constructed recursively by splitting the dataset on the best feature and split criterion at each node. Gini impurity is the default splitting criterion used by the decision tree classifier. The following are the steps to implement a decision tree, where $D_t$ denotes the set of training records that reach node $t$.
Step 1: If every record in $D_t$ is a member of the same class $y_t$, then $t$ is a leaf node with the label $y_t$.
Step 2: If $D_t$ contains records that belong to multiple classes, an attribute test condition is chosen to divide the records into smaller subsets. A child node is formed for each outcome of the test condition, and the records in $D_t$ are assigned to the children based on the outcomes. The algorithm is then applied recursively to each child node.
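The attribute test condition in Step 2 is typically chosen by an impurity measure. The following is a minimal sketch, assuming a single numeric feature and Gini impurity as the criterion, of how one candidate split can be scored and the best threshold selected. The helper names and the URL-length values are hypothetical and serve only as an illustration.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, weighted_gini) for the best split `value <= threshold`.

    Returns (None, gini(labels)) when no split improves on the parent node.
    """
    n = len(labels)
    best = (None, gini(labels))
    # The largest value is skipped because it would leave the right child empty.
    for threshold in sorted(set(values))[:-1]:
        left = [y for v, y in zip(values, labels) if v <= threshold]
        right = [y for v, y in zip(values, labels) if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical URL lengths and labels, for illustration only.
url_lengths = [12, 18, 25, 60, 75, 90]
labels = ["legitimate"] * 3 + ["phishing"] * 3
threshold, impurity = best_threshold(url_lengths, labels)
```

In this toy example the split at a URL length of 25 separates the two classes completely, so both children become leaf nodes under Step 1.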

Random Forest: A supervised machine learning method that can be used to predict the outcome of a classification or regression problem. It is an ensemble method that creates several decision trees during training, with each split drawing on a subset of the attributes. Using hyperparameters such as n_estimators and criterion, it predicts the class of an input URL by passing it through each decision tree in the forest. The final prediction is then decided by majority voting. The random forest technique was originally developed to address the issue of overfitting in decision trees. This is accomplished by creating many decision trees, each one trained on a different subset of the input data [22] [23].
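Two ingredients of the random forest described above can be sketched in a few lines: drawing a different bootstrap subset of the data for each tree, and combining the per-tree predictions by majority vote. This is not a full random forest; the function names are illustrative, the records are integer stand-ins for training rows, and the per-tree predictions are made up for the example.

```python
import random
from collections import Counter


def bootstrap_sample(records, rng):
    """Draw len(records) records with replacement (one training set per tree)."""
    return [rng.choice(records) for _ in records]


def majority_vote(predictions):
    """Return the most common label among the per-tree predictions."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)       # fixed seed so the sketch is repeatable
records = list(range(10))    # stand-ins for training rows
samples = [bootstrap_sample(records, rng) for _ in range(3)]

# Hypothetical predictions from three trees for one input URL.
final_label = majority_vote(["phishing", "legitimate", "phishing"])
```

Because each tree sees a different resampled dataset, individual trees overfit in different ways, and the vote tends to average those errors out.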
Naive Bayes: A supervised classification technique that categorizes unknown inputs using Bayesian probability [24]. This model determines the conditional probability of a URL or web page being a legitimate or phishing site. Several variants exist, including Gaussian Naive Bayes and Bernoulli Naive Bayes. When using Bernoulli Naive Bayes, which works with binary features, the prior probabilities and the likelihoods for each feature and each class are calculated. The final class label for the URL or web page is predicted to be the class with the highest posterior probability. Formally, if $X = \{X_1, X_2, \ldots, X_d\}$ is a set of $d$ attributes, the Naive Bayes classifier calculates the posterior probability for each class $y$ to categorize a test record:
\[ P(y \mid X) = \frac{P(y) \prod_{i=1}^{d} P(X_i \mid y)}{P(X)} \]
The test record is assigned to the class with the highest posterior probability.
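The Bernoulli variant can be illustrated with a minimal from-scratch sketch. It estimates class priors and per-feature likelihoods from binary features and picks the class with the highest posterior, computed in log space. The Laplace smoothing parameter `alpha`, the function names, the three feature names, and the toy data are all assumptions made for this example and do not come from the study.

```python
import math


def train_bernoulli_nb(rows, labels, alpha=1.0):
    """Estimate class priors and P(feature_i = 1 | class) with Laplace smoothing."""
    n_features = len(rows[0])
    priors, likelihoods = {}, {}
    for c in sorted(set(labels)):
        class_rows = [r for r, y in zip(rows, labels) if y == c]
        priors[c] = len(class_rows) / len(rows)
        likelihoods[c] = [
            (sum(r[i] for r in class_rows) + alpha) / (len(class_rows) + 2 * alpha)
            for i in range(n_features)
        ]
    return priors, likelihoods


def predict_bernoulli_nb(row, priors, likelihoods):
    """Return the class with the highest log-posterior for a binary feature row."""
    scores = {}
    for c, prior in priors.items():
        log_posterior = math.log(prior)
        for x, p in zip(row, likelihoods[c]):
            log_posterior += math.log(p if x else 1.0 - p)
        scores[c] = log_posterior
    return max(scores, key=scores.get)


# Hypothetical binary features: [has_ip_address, uses_shortener, has_https]
rows = [[1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1], [0, 1, 1]]
labels = ["phishing"] * 3 + ["legitimate"] * 3
priors, likelihoods = train_bernoulli_nb(rows, labels)
prediction = predict_bernoulli_nb([1, 1, 0], priors, likelihoods)
```

The denominator $P(X)$ is omitted because it is the same for every class and therefore does not change which class has the highest posterior.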
Logistic Regression: A statistical technique for modelling the likelihood that an event will occur [24]. Given one or more independent variables, it is used to predict the outcome of a categorical dependent variable. Any type of data can be used with this method. Logistic regression uses a logistic function to predict the probability of an outcome. For an input variable $x$ and predicted outcome $Y$, this function takes the form
\[ P(Y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}} \]
where $\beta_0$ and $\beta_1$ are the fitted coefficients and $P(Y = 0) = 1 - P(Y = 1)$. The value of $x$ for which $P(Y = 0) = P(Y = 1)$ is known as the point of inflection, or breakpoint.
Support Vector Machine: A supervised machine learning approach that maximizes the margin, that is, the shortest distance between the decision boundary and the nearest data points. Here, SVM uses a linear kernel function to find an optimal hyperplane in an N-dimensional feature space that most effectively separates legitimate from phishing data [25] [26].
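Returning to logistic regression, the following sketch evaluates the logistic function for a single input variable and locates the breakpoint, the input at which both classes are equally likely. The coefficient names `beta0` and `beta1` and their values are hypothetical; in practice they would be fitted to the training data.

```python
import math


def logistic(z):
    """Numerically stable logistic (sigmoid) function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(x, beta0, beta1):
    """P(Y = 1) for a single input variable x."""
    return logistic(beta0 + beta1 * x)


# Hypothetical coefficients; the breakpoint is where beta0 + beta1 * x = 0.
beta0, beta1 = -4.0, 0.1
breakpoint_x = -beta0 / beta1
p_at_breakpoint = predict_proba(breakpoint_x, beta0, beta1)
```

Inputs above the breakpoint yield $P(Y = 1) > 0.5$ and would be assigned to the positive class, and inputs below it to the negative class.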

K-Nearest Neighbours: The KNN classifier is a supervised learning-based classification method that assigns data to categories [26]. It classifies each data point individually using its k nearest neighbours; the value of k in this study is 3. The classifier is given labelled training data, in which each data point already has its category, together with the data points to be classified. It then uses the distance between each new data point and its k nearest neighbours to determine which category the point should fall under. The following are the steps for implementing KNN.
• Choose the number of nearest neighbours to consider, denoted as 'k'.
• Compute the distance between each test example $z = (x', y')$ and all training examples $(x, y) \in D$ to determine its nearest-neighbour list $D_z$.
• Assign the test example the majority class among the examples in $D_z$.
Gradient Boosting: This model is frequently used when there are numerous features with significant levels of correlation [27]. The classifier learns to map the features of websites to their labels (such as legitimate or phishing). Gradient boosting combines multiple weak learners to produce a powerful model. The algorithm starts with an empty model and keeps adding weak classifiers until the target level of accuracy is attained [28].
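The K-Nearest Neighbours procedure described earlier in this section can be sketched as follows, with k = 3 as in the study. The sketch assumes Euclidean distance, which the text does not specify, and the two features (URL length and number of dots) and all data values are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Assumption: Euclidean distance between feature vectors.
    neighbours = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical features: (URL length, number of dots in the URL)
train = [
    ((20, 1), "legitimate"),
    ((25, 2), "legitimate"),
    ((30, 2), "legitimate"),
    ((80, 5), "phishing"),
    ((95, 6), "phishing"),
    ((110, 7), "phishing"),
]
label = knn_predict(train, (90, 5))
```

Because raw distances are dominated by the feature with the largest range (URL length here), features are usually scaled before KNN is applied.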
Extreme Gradient Boosting Classifier: XGBoost is an ensemble-based classifier that can be applied to machine learning problems involving sizable datasets. It builds powerful classifiers using a boosting method. XGBoost is trained on the website features, controlled by hyperparameters such as the learning rate and maximum depth, and then assigns a score that indicates the likelihood that the website is a phishing website. The primary weakness of XGBoost is its inability to handle categorical features.
Histogram Gradient Boosting Classifier: For large datasets (samples >= 10000), Histogram Gradient Boosting outperforms the Gradient Boosting Classifier, which combines many weak learning models. Hyperparameters such as max_bins and max_depth are used with this classifier, and their default values are adopted. During training, the tree grower learns at each split whether samples with missing values should go to the left or right child, based on the potential gain. If no missing values are encountered for a feature during training, samples with missing values are mapped to the child with the most samples.
Light Gradient Boosting: LightGBM is comparable to the XGBoost method and can also manage unbalanced datasets. It was created by Microsoft, is quicker, and uses less memory. Because LightGBM grows trees vertically (leaf-wise), it reduces more loss [29].
Adaptive Boosting: The AdaBoost classifier is an ensemble machine learning technique for categorizing new data points. The model is trained using the hyperparameters n_estimators and learning rate [30]. AdaBoost produces stumps, which are trees with only two leaves. Each stump's principal function is to correct the errors of the previous ones; however, the stumps are not given equal weight in the final decision. All the data points are first given equal weights. When a data point is classified incorrectly, its weight is increased. The classifier is built iteratively by sequential training on the training data and is subsequently evaluated on the test data.
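The reweighting step can be made concrete with a minimal sketch of one round of the classic binary AdaBoost update. This formulation is an assumption, since the study does not state which AdaBoost variant was used. The function name and the four-sample example are illustrative only.

```python
import math


def adaboost_round(weights, correct, learning_rate=1.0):
    """Run one AdaBoost weight update.

    weights: current sample weights (summing to 1).
    correct: booleans, True where the stump classified the sample correctly.
    Returns (stump_weight, new_weights).
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    # Clamp the error to avoid log(0) and division by zero.
    error = min(max(error, 1e-10), 1 - 1e-10)
    stump_weight = learning_rate * 0.5 * math.log((1 - error) / error)
    updated = [
        w * math.exp(-stump_weight if ok else stump_weight)
        for w, ok in zip(weights, correct)
    ]
    total = sum(updated)
    return stump_weight, [w / total for w in updated]


# Four samples with equal initial weights; the stump misclassifies the last one.
weights = [0.25, 0.25, 0.25, 0.25]
stump_weight, new_weights = adaboost_round(weights, [True, True, True, False])
```

After this round the misclassified sample carries half of the total weight, so the next stump is pushed to classify it correctly, and the stump's own weight determines how much say it has in the final vote.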
Various machine learning algorithms, along with optimization techniques, can be used to perform web phishing classification. The dimensionality of the dataset also plays a key role: if the dataset has a high-dimensional feature space, it may lead to overfitting [31]. The dimensions can therefore be reduced to improve the model's accuracy.

Results
From Table 1, XGBoost has the highest accuracy on dataset 1, with a score of 97.13%, and ran in 1.51 seconds. It is the only algorithm with accuracy greater than 97.05%. Logistic Regression, with an accuracy of 91.73%, is among the least accurate algorithms. Bernoulli Naive Bayes, at 89.06% accuracy, is less accurate than the other algorithms but takes very little time to run.
Among all the datasets, the UCI Phishing dataset yields better results than the other datasets, as shown in the figure. More advanced techniques such as deep learning [32] [33] and transfer learning can also be used to classify phishing websites [34].

Conclusion
This research paper describes how machine learning algorithms can effectively detect and predict web phishing attacks. The analysis shows that different machine learning techniques can accurately classify phishing websites based on features such as the path, URL, domain name, sub-domain name, and directory. For instance, decision tree models can effectively identify the relevant features for classification, random forests can improve the accuracy and robustness of the classification models, and SVM can easily handle high-dimensional feature spaces. Boosting methods like Gradient Boosting, XGBoost, and AdaBoost can accurately classify websites as phishing or legitimate. Boosting algorithms are highly effective in improving the performance of weaker machine learning algorithms: by iteratively reweighting the training examples, they raise the model's accuracy by giving more weight to misclassified examples. Other techniques like deep learning and neural networks [42] [43] can also be explored in future work, but this paper mainly focuses on machine learning and boosting algorithms that can be applied easily and with less complexity. Overall, machine learning algorithms can significantly enhance the security of web users by providing phishing detection. As cybercriminals continue to develop more sophisticated phishing attacks, the use of machine learning algorithms will become increasingly important in ensuring the safety and security of online users.