Water Quality Estimation and Anomaly Detection: A Review

Critical infrastructures, which provide irreplaceable services, rely on industrial control systems (ICSs); when the information these systems hold is corrupted, the result can be great economic loss, security vulnerabilities, and disruption of public order. ICSs that were previously isolated now incorporate online sensors, wireless networks, and artificial intelligence technologies. This has widened the attack surface available to malicious actors who intend to carry out industrial espionage or sabotage. In this study, water quality estimation systems and anomaly detection are comprehensively examined. To this end, we discuss the statistics of the studies in the literature, the methods for water quality anomaly detection, the existing datasets, and the difficulties that water systems face in achieving better water management. The principal findings of this research can be summarized as follows: (i) new methodologies and architectures have improved water quality assessment through anomaly detection, (ii) different datasets including multi-modal information have been presented, and (iii) remaining challenges and prospects have been investigated.


Introduction
About 71% of the Earth's surface is covered by water, which is vital to all known life forms. Treating drinking water is a major task for water supply companies around the world, and the high vulnerability of drinking water to possible attacks is a serious problem. In recent years, with the acceleration of industrialization, the water demands of intense human activity, agriculture, and other sectors have increased in most countries. Environmental pollution and variable climatic conditions have brought further problems, such as growing use of water resources, deterioration of water quality, and sabotage. In addition, water management systems are a research area that many developed countries prioritize. Accordingly, the sustainable management of water resources requires that advanced-technology methods for measurement, monitoring, and control be defined and put into practice [1][2][3]. Water management systems therefore occupy an important place in critical infrastructure protection, for several reasons. The first is the quality of the distributed water: even a slight deterioration directly affects the health of many people. The second is that, unlike certain other infrastructures where physical access to key assets can be restricted, water management systems have a significant number of remote stations that are challenging to control and safeguard from unintentional or intentional contamination incidents, and very few defense mechanisms exist once pollution occurs. Various techniques [4,5] can be used to model the distribution of pollutants, but the large and complex topologies of water distribution systems make these techniques difficult to apply. To improve emergency response capacity and safeguard water quality from the risks of intentional or unintentional contamination, it has become essential to develop effective detection methods that spot changes and anomalies in water quality and provide rapid early warning of potential hazards.
The detection of intentional and unintentional contamination events that threaten the safety of water management systems, together with systems for preventing them, is widely studied in the literature. These studies address the security of water management systems from many angles, such as water quality determination and anomaly detection, placement of water quality sensors, SCADA (Supervisory Control and Data Acquisition) security, pollutant detection, modeling of intervention and mitigation methods, and the development of anomaly detection models supported by artificial intelligence and machine learning for more complex situations [6,7].
Water management systems are cyber-physical systems in which physical processes work together with computational systems. In these systems, water quality measurement and water management are controlled by a SCADA system composed of sensors, actuators, programmable logic controllers (PLCs), remote terminal units (RTUs), and similar field devices. Recent cyber-physical incidents show that the SCADA systems of water management systems are exposed to cyber attacks and that water management is one of the leading critical infrastructures. It is therefore clear that these systems need tools that can detect anomalies in water quality, evaluate the risk of the cyber-physical system, and support the prevention of and intervention in cyber-physical attacks [8].

Smart city and its water management perspectives
The study and development of smart city applications have become hot topics in recent years. Although the concept of the smart city (also called connected city, intelligent city, digital city, etc.) was first proposed in the 1990s, recent technological advances driven by big data and AI have accelerated the adoption of these applications. These applications, made available by city governments, give residents amenities that can make daily life easier [9]. According to Bellini et al. [10], who manually categorized smart city applications, there are eight main classes: smart governance, smart economy, smart facilities, smart transport, smart energy [11,12], smart industry and production, smart environment (like smart water), and smart healthcare.
A smart water city enhances the quality of life of its residents by utilizing Information and Communications Technology (ICT) and other technologies to address urban water issues at every stage of the urban water cycle. Six categories can be used to classify general research: (i) management of alternative water resources and reuse, (ii) sustainability, (iii) water 4.0, (iv) sanitation and value-adding, (v) quality, and (vi) networks [13]. The restoration of the water cycle, waterfront use, and intelligent water management help to improve overall water management in addition to offering individualized solutions for traditional water management practices including drainage, water treatment, and wastewater treatment. ICT-based intelligent technologies augment and supplement existing infrastructure and water management technologies on a broad scale in a smart water city [14].

AI-driven next-generation cyber-physical systems
The Next Generation Cyber-Physical Systems (NG-CPS) have become complex, autonomous, sophisticated, and pervasive as a result of the gradual integration of technology. As a result, both academia and industry are interested in today's NG-CPS, which include the Internet of Things (IoT), cyber components, the Internet of Vehicles (IoV), Intelligent Implantable Medical Devices (IMDs), etc. NG-CPS offer a number of opportunities for service providers (stakeholders in the industry and the market) as well as for consumers (clients). Although NG-CPS technology has many benefits, it also presents a number of difficulties for the involved parties, including reliability, security, and interoperability [15].
The literature offers a number of ways to address these issues in NG-CPS technology; however, they do not seem able to recognize newly emerging risks. Designing trustworthy AI-driven solutions for NG-CPS technology is therefore imperative if these issues are to be handled effectively [16,17]. Because AI-driven solutions have the potential to foresee and detect both existing threats and newly emerging ones, they should be considered as an alternative to the approaches in the existing literature. To create new AI-driven solutions for this developing technology, researchers and industry experts must collaborate [18].

Anomalies and their types
There are numerous definitions of anomalies, each with varying levels of specificity. Anomalies are typically thought to be infrequent in comparison to non-anomalous observations in a dataset and to deviate from the norm in terms of their attributes. Any water-quality value or set of data that is the result of a manufacturing defect in the in-situ sensor equipment is considered an anomaly in this study. In collaboration with the end user, Leigh et al. [19] established the categories of anomalies that are expected to arise in water-quality data. These include drift, clusters of spikes, missing values, large and small abrupt spikes, low variability (including permanent values), continuous offsets, abrupt shifts, high variability, impossible values, out-of-sensor-range values, and others [20].
The main groups of anomalies can be summarized as follows: (i) Point anomaly: a single data instance deviates from the typical pattern of the overall dataset. (ii) Contextual (conditional) anomaly: a data instance is considered anomalous only in a given context. (iii) Collective anomaly: a group of data instances behaves abnormally when compared to the full dataset. The anomaly types can be assigned to these groups as given in Table 1.
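As a rough illustration of how a few of the anomaly classes listed above might be flagged in a single sensor series, the following Python sketch applies three simple rules: a robust z-score for abrupt spikes (point anomalies), a range check for out-of-sensor-range values, and a run-length check for repeated constant readings (a simple collective anomaly). The function name, the thresholds, and the 0 to 14 sensor range (chosen with a pH probe in mind) are illustrative assumptions and are not taken from the cited studies.

```python
from statistics import median

def flag_anomalies(values, sensor_range=(0.0, 14.0), z_thresh=3.5, flat_run=5):
    """Return {index: label} for a list of readings (None marks a missing value).

    All thresholds are hypothetical defaults for illustration only.
    """
    flags = {}
    present = [v for v in values if v is not None]
    med = median(present)
    # Median absolute deviation: a spread estimate that spikes barely affect.
    mad = median(abs(v - med) for v in present) or 1e-9
    lo, hi = sensor_range

    run_start = 0  # index where the current run of identical readings began
    for i, v in enumerate(values):
        if v is None:
            flags[i] = "missing"
            run_start = i + 1
            continue
        if not lo <= v <= hi:
            flags[i] = "out_of_range"
        elif 0.6745 * abs(v - med) / mad > z_thresh:
            flags[i] = "spike"
        # A change in value ends the current run of identical readings.
        if i > 0 and values[i - 1] is not None and v != values[i - 1]:
            run_start = i
        # Collective anomaly: the same value repeated flat_run times in a row.
        if i - run_start + 1 >= flat_run:
            for j in range(run_start, i + 1):
                flags.setdefault(j, "constant")
    return flags

readings = [7.1, 7.2, 7.0, 7.1, 9.8, 7.2, None,
            7.3, 7.3, 7.3, 7.3, 7.3, 15.2, 7.1]
flags = flag_anomalies(readings)
```

Contextual anomalies are deliberately left out of this sketch: detecting them requires a notion of context (for example, time of day or season) that a single unlabeled series does not provide.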

Motivations and objectives of this review
Much of the research on outlier or anomaly detection has been organized and classified in a few recent survey publications, with a focus on the research challenges that still require attention. Fanaee-T and João [21] highlight the potential of tensor-based techniques as a cutting-edge method for detecting and identifying anomalies and failures in interdisciplinary applications. Sebestyen and Hangan [22] discuss, through a number of case studies, the difficulties and potential solutions associated with putting computer-based anomaly detection systems into practice. Dogo et al. [23] conduct a thorough literature review to determine the current ML approaches used for water quality anomaly detection (WQAD), highlight the drawbacks and limitations of these approaches, propose a hybrid DL-ELM framework for WQAD that could be investigated further, and suggest future research directions. Sagan et al. [24] examine existing trends and improvements in water quality monitoring through remote sensing. They also identify and assess a variety of widely used estimation techniques across data sources and datasets, and point out the shortcomings of current systems as well as prospective improvements. Jiang et al. [25] provide a general systematic framework for analyzing the dynamics of river water quality in depth by combining high-temporal-resolution observations with Fourier and wavelet spectrum analysis. Ahmed et al. [26] examine the issue from a number of angles, covering both cutting-edge technologies such as the Internet of Things (IoT) and machine learning and the traditional methods of measuring water quality. After examining the present options, the authors suggest a low-cost, IoT-based system that uses machine learning techniques to track trends in water quality and identify unusual events. Gupta et al. [27] offer a summary of data analytics platforms appropriate for diverse Environmental Science and Engineering (ESE) research applications. Using three example case studies, they demonstrate recent ML algorithm implementations in the ESE sector; one of these case studies is the detection of anomalies in continuous data generated by engineered water systems. Shi et al. [28] review the applications of online UV-Vis spectrophotometers to drinking water quality management over the previous two decades. Table 2 compares this review with other review and survey papers.
The main objectives of this review are as follows.
• To conduct a thorough literature analysis to determine the most recent methods for estimating water quality and detecting anomalies.
• To draw attention to the flaws and limitations of these existing techniques.
• To present the remaining challenges and recommend future research directions.

Paper organization
The rest of this paper is organized as follows: Section 2 focuses on water management systems architectures with their components and cyber security in water quality. Section 3 describes the methodology for the systematic review process according to PRISMA 2020. Section 4 deals with the materials and methods for water quality and anomaly detection. Section 5 presents the remaining challenges and prospects. In the last section, we conclude the paper.

Table 2 Comparison with other review and survey papers
Reference | Year | Target system | Scope
Fanaee-T and João [21] | 2016 | None | Anomaly Detection
Sebestyen and Hangan [22] | 2017 | All | Anomaly Detection
Dogo et al. [23] | 2019 | Smart Water Grids | Water Quality and Anomaly Detection
Sagan et al. [24] | 2020 | None | Water Quality
Jiang et al. [25] | 2020 | None | Water Quality
Ahmed et al. [26] | 2020 | IoT-based low-cost system | Water Quality
Gupta et al. [27] | 2021 | Engineered Water Systems | Anomaly Detection
Shi et al. [28] | 2022 | None | Water Quality
Ours | 2023 | Water Management Systems | Water Quality and Anomaly Detection

Water management systems architecture and components
Water management facilities under the control of local governments should be organized according to the design principles and norms determined by the World Health Organization. For this reason, drinking water treatment plants are built to defined standards for purifying water drawn from surface and underground sources and are generally monitored by SCADA systems. All equipment is designed to be controlled from a single center, and an automation program is prepared for the facility. All processes run under computer control, and all facility data are continuously recorded through this program. Regional control panels are installed so that operators can intervene in the electro-mechanical parts of the units; when needed, a unit can be operated from its own panel.
The automation program contains all the information necessary for effective operation (flow rates, levels, pressures, temperatures, dissolved oxygen concentration, pH values, and other concentrations). The major automatic quality control and measurement equipment (flow meters, pH meters, turbidity meters, residual chlorine analyzers) is checked every day, calibrated if necessary, and renewed, and the instructions for use are followed. In drinking water treatment plants, all units and main equipment are connected to the automation system. An alarm system notifies the control room of malfunctions so that the necessary measures can be taken.
Thanks to smart technology, traditional water management systems gain the instrumentation to measure and record data, the interconnection to stay in touch with system administrators, and the intelligence to analyze the current situation quickly and respond to problems (Fig. 1). Smart water management can generally be defined as managing water in an intelligent, efficient, and sustainable way. Smart water systems are designed to collect all kinds of water-related data, such as flow, leakage, pressure changes, transmission, and current chemical parameters and levels, using technologies such as sensors, wireless communication devices, and control units. Analyzing the collected data helps the resource to be used efficiently. In general, smart water management technology has four components, as shown in Table 3: digital output devices such as meters and sensors, SCADA systems, geographic information systems (GIS), and related software. These components are used for various purposes. For example, with digital output devices, water quality can be monitored instantly, leakage and pressure can be detected in real time, asset management can be provided, and consumption can be measured with smart water meters. With SCADA systems, pumping stations can be optimized, treatment and drinking water facilities and environmental conditions can be controlled, and processes can be operated remotely by processing and optimizing the information obtained. With GIS, information about the environment can be collected, managed, and analyzed. In this way, the assets of a water management system can be managed, environmental data can be managed and analyzed, and integrated network models can be obtained. Related software is used to store, report, or use data collected by the other components. With this software, for example, water networks can be managed and possible attacks can be detected by working in integration with GIS and SCADA systems. Modeling the infrastructure and environmental systems of water management systems in this way facilitates decision-making and risk management. Making water management systems smart, that is, integrating them with information and communication technologies, increases efficiency and performance, but it also leaves the infrastructure vulnerable to cyber threats: the thousands of sensors added to the system are controlled over a network, and if an attacker gains access to that network, the security requirements of confidentiality, integrity, and availability will be violated. Recent studies have not yet produced a complete solution because of the difficulty of testing on real systems, limited computing resources, existing architectures that do not adapt to change, re-usability problems, and limitations in communication. In summary, more security studies on smart water management systems are needed in order to obtain new and sustainable solutions.

Water quality process
Water quality is defined as an indicator of the physical, biological, and chemical properties of water. Changes in water chemistry can occur due to natural disasters such as earthquakes, as well as terrorist attacks or man-made pollution. Today, water companies use pollution warning systems to control drinking water quality. With these systems, they regularly monitor the relevant water quality and environmental data at various measurement points, using different sensors.
In practice, institutions with water management systems measure and track many water quality parameters, but they do not analyze them together when determining water quality; instead, a few selected parameters are inspected one by one. The established systems therefore also need a monitoring system that accurately reports water quality changes by analyzing all measured parameters together. In other words, an adequate and accurate alarm system that enables early detection of any change is a basic requirement for the provision of clean and safe drinking water.
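To illustrate why analyzing parameters together can reveal changes that one-by-one checks miss, the sketch below standardizes each parameter against a baseline window and sums the squared z-scores into a single joint score. It is a minimal example under stated assumptions: the parameter names, the baseline readings, and both alarm limits are hypothetical, and the score ignores correlations between parameters, which a fuller treatment (for example, a Mahalanobis distance) would take into account.

```python
from statistics import mean, stdev

def fit_baseline(history):
    """history: {parameter: [readings]} -> {parameter: (mean, std)}."""
    return {name: (mean(xs), stdev(xs)) for name, xs in history.items()}

def z_scores(sample, baseline):
    """Standardize one multi-parameter sample against the baseline."""
    return {name: (sample[name] - mu) / sd for name, (mu, sd) in baseline.items()}

def per_parameter_alarm(sample, baseline, z_limit=3.0):
    """Alarm only if some single parameter is far from its own baseline."""
    return any(abs(z) > z_limit for z in z_scores(sample, baseline).values())

def joint_alarm(sample, baseline, score_limit=11.34):
    """Alarm on the combined deviation of all parameters.

    11.34 is roughly the 99th percentile of a chi-square distribution with
    3 degrees of freedom; it is an assumed limit for three parameters.
    """
    score = sum(z * z for z in z_scores(sample, baseline).values())
    return score > score_limit

# Hypothetical baseline readings for three parameters.
history = {
    "pH": [7.0, 7.2, 7.1, 6.9, 7.3],
    "turbidity": [1.0, 1.2, 1.1, 0.9, 1.3],
    "chlorine": [0.50, 0.52, 0.51, 0.49, 0.53],
}
baseline = fit_baseline(history)

# Each value is about 2.5 standard deviations from its mean.
sample = {"pH": 7.5, "turbidity": 1.5, "chlorine": 0.47}
single = per_parameter_alarm(sample, baseline)
joint = joint_alarm(sample, baseline)
```

In this example no parameter exceeds its individual limit, yet the joint score does exceed its limit, because all three parameters shift at the same time.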
A distribution system's water quality monitoring process is delicate and extremely complex, and it is influenced by a variety of factors. It is difficult to anticipate the water quality at a particular stage in the system's life because water quality data arrive from various sources and treatment facilities and because the water pathways in the system are diverse. Additionally, there is a lot of data pollution because the data produced differ from one another. Therefore, to ensure comparability of the produced data, a certain standardization is required. When problems arise in the production methods of water-related data, in duplicate data production, and in data sharing, institutional capacities for data collection, storage, and analysis at the local level prove insufficient and the data cannot be recorded adequately. This situation necessitates effective log/data management in water management systems. In addition, the inaccessibility of data in digital environments and similar issues are among the main problems in the production and use of water-related data worldwide.

Table 3 Components of smart water management technology (Geographic Information Systems and Software rows)
Component | Purpose | Example uses
Geographic Information Systems | To store, manage, process and analyze spatial information. | Asset mapping and asset management; fully integrated network models; environmental data analysis and management
Software | To store, use and report data. To model infrastructure and environmental systems to improve design, decision making and risk management. | Often integrated with GIS and/or SCADA systems to manage water networks, control pressure, monitor leakage; improved decision making and risk management; customer databases; intelligent metering, billing and collections; hydraulic design and optimization; water resources and hydrological modeling for water security; cloud-based data management and hosting options
When determining the quality of water, it is necessary to know where and for what purpose the water is used (such as drinking water, industry, agriculture, and the energy sector) and where the water comes from (rivers, lakes, coastal-transitional waters, and groundwater), since both play a role in determining water quality standards. For example, when determining the quality of water to be used for agriculture, parameters such as salinity and ion toxicity are involved, whereas for drinking water, parameters such as pH, the amount of chlorine, and dissolved oxygen should be considered. In water quality management, risk assessments generally cover the possible effects of pollutants in water resources on human health and the aquatic ecosystem, the analysis and rating of this risk, and the measures to be taken to prevent negative effects. General water quality standards have been set worldwide so that water resources can reach a good level of quality. The Environmental Protection Agency (EPA) has launched a significant push to create powerful, thorough, and completely integrated surveillance and monitoring systems, including global water quality data, that allow for the early identification and awareness of diseases, pests, and dangerous substances [29]. In this direction, environmental quality standards have been determined with reference to the World Health Organization (WHO) for EU priority substances for water quality in water management systems and for country-specific pollutants [1,2,30,31]. The Guidelines for drinking-water quality (GDWQ), the first version of which was published in 1958, is the international reference point used to establish national and regional regulations on water quality and includes an assessment of the health risks posed by various microbial, chemical, radiological, and physical contaminants that may be present in drinking water. In the literature, drinking water quality is generally determined by the analysis of various parameters. The physicochemical parameters that are vital for institutions, that determine water quality by direct measurement, and that are widely used in the literature are given in Table 4.
Kang et al. [32] examined big data analytics studies applied in the field of water quality. These studies were classified and compared according to big data prediction models, including artificial neural networks, the Radial Basis Function Network (RBFN), the Deep Belief Network, Decision Trees, Improved Decision Trees, and the Least Squares Support Vector Machine. The same study also groups the parameters that affect water quality according to the standards under different subheadings. Lu et al. [5] collected data from the Tualatin River, one of the world's most polluted rivers, and estimated indicators such as water temperature, dissolved oxygen, pH, specific conductivity, turbidity, and fluorescent dissolved organic matter (FDOM). They proposed XGBoost and Random Forest (RF) models designed to cope with data noise for forecasting systems and compared them with classical models (PSO-SVM, RBFNN, LSSVM, LSTM) under different metrics. The proposed RF model performed best in estimating temperature, dissolved oxygen, and specific conductivity. Chawla et al. [33] used regression and machine learning models such as linear regression, random forest, support vector machine (SVM), and long short-term memory (LSTM) to predict the salinity level of the Salton Sea and its future trend. Parameters such as temperature, conductivity, specific conductivity, dissolved oxygen, and salinity were studied. Selim et al. [34] present a study on water quality analysis using the Internet of Things and big data analytics. They propose an IoT-based model that considers parameters affecting water quality, such as Oxidation Reduction Potential (ORP), dissolved oxygen (DO), pH, Electrical Conductivity (EC), and turbidity, and they discuss the points to be considered in making the data read from these devices meaningful. Nemade and Shah [35] first cleaned a dataset collected with IoT sensors by removing missing values and outliers. They then proposed the G-SMOTE technique, which hybridizes SMOTE and a genetic algorithm, to solve the unbalanced dataset problem. In the proposed system, the usage area of the water is determined with a modified deep learning neural network (MDLNN) classifier. Jin et al. [36] estimate surface water quality to provide real-time early warnings based on past observation data. An improved genetic algorithm (IGA) and a back propagation neural network (BPNN) are integrated into the data-driven model: the genetic algorithm optimizes suitable initial weight parameters, and the BPNN adjusts suitable connection architectures and determines the characteristics of water quality variation.

Methodology for Systematic Review
The goal of this systematic review is to summarize the current state of knowledge and identify areas for future study that should be prioritized. To ensure a comparable and thorough outcome, we follow the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guideline [37] (see Fig. 2), which was created to offer detailed reporting guidelines for such assessments. This process typically has four steps: (i) Identification, (ii) Screening, (iii) Eligibility, and (iv) Inclusion.

Identification of sources and search terms
Scopus, Web of Science, and DBLP were the main online databases used in the search strategy to find publications. These are the most popular libraries for conference proceedings and journal papers in the field of water quality estimation and anomaly detection in water management systems. We also used Google Scholar, which returns articles covered in these databases and additionally finds relevant publications that appear in other databases. The databases were searched using suitable keywords and keyword combinations such as ["water quality" && "anomaly detection"]. The search was restricted to the years 2012 through 2022, which narrows the scope of our meta-analysis to more recent publications. The search string for each database is displayed in Table 5. Advanced searching was employed to weed out irrelevant papers when a basic database search produced a large number of results.

Screening
The papers from the Identification stage that received the highest ratings were manually annotated. The degree to which water quality and anomaly detection were discussed or explained in each publication was the central qualifying question for the screening procedure. The publication's relevance to the topics described above was another criterion for this screening phase. Using a second keyword annotation procedure, we distributed the final collection by classifying each manuscript according to a more specific set of categories based on its title, abstract, and keywords. The full text needed to be reviewed at this point only if a publication could not be categorized from these three elements.

Eligibility and Inclusion
This section outlines the procedures we used to select the final group of papers for this review. The following selection criteria were used to find publications for a systematic review:
• Must address anomaly detection and water quality in water management systems.
• Must have technical content.
• Must have undergone peer review and been published in a workshop, conference, or international journal.

Results
A total of 3,219 publications were found by searching the online databases. To facilitate further investigation, their information was exported as a CSV file. After duplicates were removed, the remaining peer-reviewed articles that had appeared in internationally renowned conferences, seminars, or publications were picked for more in-depth analysis. The list of eligible publications was selected by reading the title and abstract and skimming the text in accordance with the inclusion and exclusion criteria. As a consequence, a selection of 78 papers to be examined in order to address the study themes was finalized.
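The identification and screening steps described above could be partly automated along the following lines. This is a minimal sketch and not the procedure reported in this review: the column names (title, abstract, year, doi) are assumed names for fields in the exported CSV file, and the keyword test only approximates the manual screening of titles and abstracts.

```python
import csv
import io

# Keyword pair and year window taken from the search strategy described above.
KEYWORDS = ("water quality", "anomaly detection")

def load_records(csv_text):
    """Parse exported search results; column names are assumptions."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def deduplicate(records):
    """Keep the first record per DOI, falling back to the normalized title."""
    seen, unique = set(), []
    for rec in records:
        doi = (rec.get("doi") or "").strip().lower()
        key = doi or " ".join(rec["title"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

def screen(records, first_year=2012, last_year=2022):
    """Keep records in the year window whose title or abstract has every keyword."""
    kept = []
    for rec in records:
        text = (rec["title"] + " " + (rec.get("abstract") or "")).lower()
        in_window = first_year <= int(rec["year"]) <= last_year
        if in_window and all(k in text for k in KEYWORDS):
            kept.append(rec)
    return kept

# Tiny made-up export used only to show the calls.
example_csv = """title,abstract,year,doi
Water quality anomaly detection with LSTM,We study sensors,2020,10.1/a
Water Quality Anomaly Detection with LSTM,We study sensors,2020,10.1/A
Pipe corrosion survey,No relevant terms,2019,10.1/b
Anomaly detection for water quality,Older work,2010,10.1/c
"""
unique_records = deduplicate(load_records(example_csv))
candidates = screen(unique_records)
```

A filter of this kind can only narrow the candidate list; the eligibility decisions still require reading the papers against the inclusion and exclusion criteria.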

A software program called VOSviewer [38] is used to visualize and explore maps made from network data obtained from these papers. Country-based and abstract networks are shown in Figs. 3a and 3b, respectively. Each item is depicted with a circle and a label, whose size grows in proportion to the weight of the item.

Table 4 Physicochemical parameters used to determine water quality (Specific conductance row)
Parameter | Description
Specific conductance | Estimates the amount of dissolved salts in water. It is generally stable in water from the same source, but there can be significant changes in conductivity as waters mix.

Global description of the datasets
It is not safe to carry out attacks on real cyber-physical systems, or to test on them the intrusion detection and prevention systems built against such attacks. Researchers therefore often use platforms that simulate real systems or dedicated cyber-physical test environments. Such environments, called testbeds, have been established in about 30 countries for needs such as vulnerability analysis, training, and the development and testing of defense mechanisms. The most popular centers offering cyber-physical environments for critical infrastructure studies are iTrust [39] for cyber security research at the Singapore University of Technology and Design; the Sakarya University Critical Infrastructure National Test Bed center (CENTER) [40] in Turkey; the Mississippi State University (MSU) SCADA Security Lab [41]; the Technical Assessment Research Lab, China [42]; and the SCADA testbed recently built at the University of New Orleans, USA [43]. Apart from these, cyber security studies on water management systems are also carried out on simulation platforms or small-scale cyber-physical test environments (e.g., EpanetCDA [44], Facies [45], WaterBox [46]). Among the cyber-physical test environments, the most respected and popular are the Secure Water Treatment (SWaT) and Water Distribution (WADI™) testbeds located at the iTrust center. Among the publicly accessible datasets on water management systems in the literature, these are therefore the ones in which the widest range of scenarios has been tried and the most realistic data obtained. The SWaT architecture, which is designed around a six-stage water treatment process, aims to test a small series of cyber attacks on the testbed and to develop defense mechanisms against them, using carefully designed experiments that ensure no damage to the physical system.
In addition to the WADI dataset, BATADAL, which is the result of a competition to objectively compare the performance of algorithms for detecting cyber attacks on water distribution systems, includes one year of normal data without attacks and six months of labeled attack data [47].

Performance metrics
Performance metrics are an important part of machine learning: they indicate whether an analysis has actually made progress. Several criteria can be used to evaluate the performance of ML algorithms for both classification and regression. How the performance of machine learning algorithms is measured and compared, and how the importance of various features in the result is evaluated, depends entirely on the metric chosen. Therefore, metrics must be chosen carefully to evaluate machine learning performance [48].

The performance metrics used for classification problems are the Confusion Matrix, Accuracy, Precision, Recall, and F1-Score. The Confusion Matrix is the easiest way to measure the performance of a classification problem where the output can be two or more classes. A confusion matrix is a two-dimensional table whose dimensions are "Actual" and "Predicted"; for a binary problem, its cells count the "True Positives (TP)", "True Negatives (TN)", "False Positives (FP)", and "False Negatives (FN)".

Accuracy is the most common performance metric for classification algorithms. It is defined as the ratio of the number of correct predictions to the total number of predictions made. The formula is as follows.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

Precision is the fraction of the positive predictions made by the model that are actually positive. The formula is as follows.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

Recall is the fraction of the actual positives that the model correctly identifies.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

The F1-Score is the harmonic mean of precision and recall. The best value of F1 is 1 and the worst is 0. The formula is as follows.

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

Performance metrics that can be used to evaluate predictions for regression problems are Mean Absolute Error (MAE), Mean Square Error (MSE), and R Squared ($R^2$).

Mean Absolute Error (MAE) is the simplest error metric used in regression problems. It is the mean of the absolute differences between the predicted and actual values. The formula is as follows.

$$\text{MAE} = \frac{1}{D} \sum_{i=1}^{D} |x_i - y_i| \tag{5}$$

Here, $x$ and $y$ are $D$-dimensional vectors, and $x_i$ denotes the value on the $i$th dimension of $x$.

Mean Square Error (MSE) is like MAE except that, instead of taking the absolute value, it squares the difference between the actual and predicted output values before averaging them. The formula is as follows.

$$\text{MSE} = \frac{1}{D} \sum_{i=1}^{D} (x_i - y_i)^2 \tag{6}$$
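The classification and error metrics defined in this section can be computed directly from their definitions. The following is a minimal Python sketch; the function names and the example labels are hypothetical and chosen only for illustration, with label 1 assumed to denote an anomaly.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, TN, FP, FN for a binary problem (positive = anomaly)."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def classification_metrics(actual, predicted):
    """Accuracy, precision, recall and F1 from the confusion counts."""
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def mae(x, y):
    """Mean of the absolute differences between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def mse(x, y):
    """Mean of the squared differences between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# Hypothetical labels: 1 = anomalous reading, 0 = normal reading.
actual = [1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 0, 1, 0]
print(classification_metrics(actual, predicted))
print(mae([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]), mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Guarding the divisions avoids an error when a model predicts no positives at all, which is a realistic case for rare anomalies.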
The $R^2$ metric is often used for explanatory purposes and provides an indication of the fitness or

Traditional methods
The classical machine learning (ML) techniques that have recently drawn the most interest in water quality management and anomaly detection include logistic regression (LR), support vector machines (SVM), and artificial neural networks (ANN). In addition to these conventional ML methods, statistical techniques are also relied on for identifying anomalies in water quality data. Multivariate methods such as linear discriminant analysis and principal component analysis have also been used in this field of research.
Most of these classic ML approaches are limited by their large computational memory and time needs, imbalanced anomalous-to-normal data ratios, and noise in sensor signal processing. Because of this, they suffer from low accuracy, a high rate of false alarms, poor handling of missing data, and a lack of robustness when managing sizable real-time datasets from numerous and diverse sensory sources in high-dimensional search spaces. As a result, it becomes vital to research additional cutting-edge anomaly detection approaches in order to enhance performance and fix these ML systems' flaws [49,50].
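As a simple illustration of the statistical techniques mentioned above, the sketch below flags readings whose z-score exceeds a threshold. It is only an assumed example and is not taken from any of the reviewed studies: the function name, the threshold of 2.5, and the pH readings are hypothetical.

```python
from statistics import mean, stdev


def zscore_anomalies(readings, threshold=2.5):
    """Return the indices of readings lying more than `threshold`
    sample standard deviations away from the mean."""
    mu = mean(readings)
    sigma = stdev(readings)
    if sigma == 0:
        return []  # constant signal: nothing stands out
    return [i for i, r in enumerate(readings)
            if abs(r - mu) / sigma > threshold]


# Hypothetical pH readings containing one suspicious spike.
ph = [7.1, 7.0, 7.2, 7.1, 6.9, 7.0, 7.1, 9.8, 7.0, 7.2]
print(zscore_anomalies(ph))
```

Because the spike itself inflates the mean and standard deviation, a short window bounds the largest attainable z-score (about 2.85 for ten samples), so the common threshold of 3 would never fire here. This sensitivity to the data is one reason such simple rules produce the false alarms and missed detections noted above.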
The fundamental idea behind learning from data is to use a collection of observations to identify an underlying process. More formally, this can be seen as finding a function that maximizes a particular score on the data at hand. This function can be thought of as a rough approximation of the actual, unknown function that specifies the data generation process. We are in a supervised learning setting when the training data (the available data) provides explicit examples of what the desired output should be. In this context, classification refers to the process of assigning a label to an observation or piece of data in order to place it into one of several classes or categories. A classifier is an example of this approach: it is trained on a set of previously labeled observations to establish the proper parameters. The predicted label for an observation is the result of applying the classifier to that observation. Training a classifier is equivalent to finding a function from a hypothesis set, which includes all feasible functions for the chosen model [51].
Clustering techniques are designed to create clearly separated clusters that are internally cohesive. When the precise label of each observation is unavailable, clustering is the most popular type of unsupervised learning. In this case, the structure found in the data's features takes the place of the labels [52].
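To make the clustering idea concrete, the following is a minimal one-dimensional k-means sketch in Python. It is an illustrative assumption rather than a method from the cited works: the function name, the deterministic initialization, and the turbidity values are hypothetical, and the function expects at least two clusters.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Cluster scalar readings into k >= 2 groups.
    Centers start evenly spaced between the minimum and maximum."""
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j]))
                  for v in values]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers, labels


# Hypothetical turbidity readings with two unusually high values.
turbidity = [0.8, 1.1, 0.9, 1.0, 5.2, 1.2, 4.9, 0.7]
centers, labels = kmeans_1d(turbidity)
print(centers, labels)
```

No labels are supplied; the two groups emerge from the values alone. Whether the smaller group actually corresponds to anomalies still has to be judged by a domain expert.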

Deep learning
Starting in 2012, deep learning, a new area of machine learning, led to breakthrough advances. Deep learning is particularly well suited to the water quality estimation and anomaly detection tasks discussed above for three reasons: it offers specialized network types for sequential data that capture temporal structure; it largely automates feature engineering and selection by learning hierarchies of progressively more abstract representations of the inputs, which makes it well suited to high-dimensional data; and it can learn arbitrarily complex non-linear mappings [23,53,54,55].
The primary deep learning (DL) architecture models are deep belief networks (DBN), deep Boltzmann machines (DBM), stacked denoising autoencoders (SDAE), convolutional neural networks (CNN), and recurrent neural networks (RNN). These models have been applied to the analysis of water quality and anomaly detection [56].

Extreme learning machine
The extreme learning machine (ELM) algorithm was devised in response to the slow learning of feedforward neural networks, which is typically considered far slower than expected because of slow iterations and parameter tuning of the networks. The traditional ELM has a three-layer feedforward design. The first layer is the input layer, and the second is the only hidden layer. The hidden layer projects the input to a higher dimensionality using connection weights that are randomly generated and then kept fixed. The hidden layer's outputs are generated using non-linear sigmoid activation functions. The third layer is the output layer, which has a linear input-output characteristic. To train the connection weights between the hidden and output layers, a regularized least squares technique, such as one based on the Moore-Penrose pseudo-inverse, is applied to the hidden layer outputs and the desired output [57].
In contrast to backpropagation (BP) based neural networks, ELM does not use iterations or parameter adjustment. The ELM algorithm's key advantages include quick training and strong generalization, that is, the capacity to perform well on novel inputs other than those used to train the model. As a result, ML research uses the ELM algorithm extensively, and researchers have turned to ELM as a solution for anomaly detection problems in different areas because of its quick training times and strong generalization abilities [58].
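The training procedure described above can be sketched in a few lines. The following Python example is a minimal illustration under stated assumptions, not the implementation of [57]: the class name `TinyELM`, the weight range, the ridge value, and the toy regression target are all hypothetical, and the regularized least squares step is solved through the normal equations rather than an explicit pseudo-inverse.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


class TinyELM:
    def __init__(self, n_inputs, n_hidden=20, ridge=1e-6, seed=0):
        rng = random.Random(seed)
        # Hidden-layer weights are drawn once and never trained.
        self.w = [[rng.uniform(-4, 4) for _ in range(n_inputs)]
                  for _ in range(n_hidden)]
        self.b = [rng.uniform(-4, 4) for _ in range(n_hidden)]
        self.ridge = ridge
        self.beta = [0.0] * n_hidden

    def hidden(self, x):
        return [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(self.w, self.b)]

    def fit(self, xs, ys):
        h = [self.hidden(x) for x in xs]
        k = len(self.b)
        # Regularized least squares: (H^T H + ridge * I) beta = H^T y.
        a = [[sum(row[i] * row[j] for row in h)
              + (self.ridge if i == j else 0.0)
              for j in range(k)] for i in range(k)]
        rhs = [sum(row[i] * y for row, y in zip(h, ys)) for i in range(k)]
        self.beta = solve_linear(a, rhs)

    def predict(self, x):
        return sum(bi * hi for bi, hi in zip(self.beta, self.hidden(x)))


# Toy target (hypothetical): learn y = x^2 on [0, 1] in a single pass.
xs = [[i / 29] for i in range(30)]
ys = [x[0] ** 2 for x in xs]
model = TinyELM(n_inputs=1)
model.fit(xs, ys)
train_mse = sum((model.predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(train_mse)
```

Only the output weights `beta` are learned, and they are obtained from a single linear solve; this is what removes the iterative weight updates of backpropagation.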

Reinforcement learning
Apart from the supervised and unsupervised settings described above, a third, very different type must be distinguished: Reinforcement Learning (RL). In RL, instead of having an initial training dataset from which to learn, the learning system, called an agent, interacts with an environment and is responsible for selecting and performing actions, receiving rewards or penalties in return. The agent must learn by itself the best strategy (the so-called policy) to obtain the most reward over time. A policy determines (either in a probabilistic or deterministic way) what action the agent should take when it is in a given situation [59,60]. Table 6 shows different approaches for water quality estimation and anomaly detection, together with the models used and their parameters.
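The agent, environment, reward, and policy concepts can be illustrated with tabular Q-learning on a toy problem. The sketch below is purely hypothetical and is not drawn from the reviewed studies: it assumes three chlorine-level bands as states, three dosing actions, a reward of 1 whenever the level ends in the "ok" band, and arbitrary learning parameters.

```python
import random

STATES = ["low", "ok", "high"]   # hypothetical chlorine-level bands
ACTIONS = [-1, 0, +1]            # decrease, hold, or increase the dose


def step(state, action):
    """Toy environment: the action shifts the band; reward 1 if it ends 'ok'."""
    next_state = min(max(state + action, 0), len(STATES) - 1)
    reward = 1.0 if STATES[next_state] == "ok" else 0.0
    return next_state, reward


def q_learning(steps=5000, alpha=0.5, gamma=0.5, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in STATES]
    state = 0
    for _ in range(steps):
        # Epsilon-greedy choice: explore sometimes, otherwise exploit.
        if rng.random() < epsilon:
            a = rng.randrange(len(ACTIONS))
        else:
            a = max(range(len(ACTIONS)), key=lambda i: q[state][i])
        next_state, reward = step(state, ACTIONS[a])
        # Move the estimate toward reward plus discounted future value.
        target = reward + gamma * max(q[next_state])
        q[state][a] += alpha * (target - q[state][a])
        state = next_state
    return q


def greedy_policy(q):
    """Deterministic policy: the best-valued action in each state."""
    return {STATES[s]: ACTIONS[max(range(len(ACTIONS)), key=lambda i: q[s][i])]
            for s in range(len(STATES))}


print(greedy_policy(q_learning()))
```

No labeled dataset is involved: the agent discovers, from rewards alone, that it should raise the dose when the level is low, hold it when the level is acceptable, and lower it when the level is high.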

Future sensors
Future water management system sensors will be more accurate, effective, and economical. These sensors might detect, monitor, and analyze water quality using sophisticated machine learning algorithms, which would give more precise, current information regarding water supplies. New sensors might also be employed to monitor the effects of climate change on water supplies in real time, allowing water managers to take preventative action to safeguard their water resources from pollution and other environmental concerns [76,77].

Reproducibility
Reproducibility in water management systems is the ability to replicate the same results with the same set of data. This allows for a greater degree of confidence when making decisions based on the data available. Reproducibility also enables researchers and scientists to verify the accuracy of the results they are seeing and make sure they are reliable. By reproducing results, water management teams can ensure that their decisions are based on accurate, reliable data [78,79].

Explainable artificial intelligence
Explainability in water management systems refers to the ability to explain the decisions and predictions that have been made by an AI-based system. Explainability can provide insight into why and how an AI-based system has arrived at a certain decision, enabling users to evaluate the accuracy and reliability of the system. Water management teams can use this to better identify areas of concern and to inform decisions about how best to allocate resources and solve issues related to water quality, safety, and sustainability. Recent initiatives to increase the explainability of black-box models fall under the purview of explainable artificial intelligence (XAI) research. They include analysis tools such as DeepLIFT [80], RISE [81], SHAP [82], and LIME [83].

Contamination diffusion models
Contamination diffusion models are used in water management systems to simulate the transport and spread of water contaminants such as pollutants, chemicals, and viruses. These models are used to predict how contaminants will move over time, allowing water managers to identify areas of potential contamination and develop strategies to prevent or reduce their impact. The numerical modeling tools chosen for water pollution diffusion must be well established, reliable, comprehensive, and flexible enough to accommodate various scenarios [84].
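To give a sense of what such a simulation involves, the sketch below advances a contaminant pulse along a one-dimensional pipe with an explicit advection-diffusion scheme. It is a generic textbook-style illustration under strong assumptions (uniform flow, a periodic grid, dimensionless parameters) and is not the modeling tool discussed in [84]; the function name and all values are hypothetical.

```python
def advect_diffuse(conc, courant=0.5, diffusion=0.1, steps=20):
    """Explicit upwind advection plus central diffusion on a periodic grid.

    courant   = u * dt / dx    (flow velocity, assumed positive)
    diffusion = D * dt / dx**2 (diffusion coefficient)
    The scheme is stable when courant + 2 * diffusion <= 1.
    """
    n = len(conc)
    c = list(conc)
    for _ in range(steps):
        c = [
            c[i]
            - courant * (c[i] - c[i - 1])                          # transport
            + diffusion * (c[(i + 1) % n] - 2 * c[i] + c[i - 1])   # spreading
            for i in range(n)
        ]
    return c


# Hypothetical scenario: a unit pulse injected at cell 5 of a 50-cell pipe.
initial = [0.0] * 50
initial[5] = 1.0
final = advect_diffuse(initial)
peak_cell = max(range(len(final)), key=lambda i: final[i])
print(peak_cell, sum(final))
```

The pulse moves downstream while flattening, and the total contaminant mass is preserved. Real distribution networks add branching topology, variable demand, and reaction kinetics, which is why established simulation tools are preferred in practice.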

Class imbalance problem
A dataset with an imbalanced distribution of classes, where one or more classes contain more instances than the others, is said to have a class imbalance. For instance, in a binary-class situation, the class with the most instances is referred to as the majority class, and the class with the fewest instances as the minority class. Real-world anomalies in water quality are uncommon but interesting occurrences, and forecasting them from imbalanced data using conventional machine learning algorithms is extremely difficult [53]. To address the class imbalance problem, it is possible to use combinations of heterogeneous and homogeneous algorithms, such as bagging, boosting, stacking, and their variants embedded with resampling strategies, as well as optimized DNN models. Data-level, algorithm-level, and cost-sensitive methods can also be used. Models that stay robust as class imbalance increases and remain stable under extreme class imbalance ratios are still gaps in the literature [85].
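Among the data-level methods, the simplest resampling strategy is random oversampling of the minority class. The following Python sketch illustrates the idea only; the function name and the example readings are hypothetical, and more refined resampling schemes are typically used in practice.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every
    class has as many instances as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, count in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - count):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


# Hypothetical readings: 8 normal samples and 2 anomalies.
x = [7.0, 7.1, 6.9, 7.2, 7.0, 7.1, 6.8, 7.0, 9.8, 3.1]
y = ["normal"] * 8 + ["anomaly"] * 2
x_bal, y_bal = random_oversample(x, y)
print(Counter(y_bal))
```

Oversampling should be applied to the training split only; duplicating anomalies before the train/test split would leak copies of the same sample into the evaluation data.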

Optimal sensor placement problem
The sensor placement problem is an optimization problem that attempts to find the optimal locations for sensors in a given area in order to maximize their effectiveness. This could involve finding the most effective placements for traffic cameras, temperature sensors, or any other kind of sensor. The goal is to strategically place the sensors so they can provide the most accurate readings and insights while minimizing costs [86]. The number of nodes in a WDN is frequently substantially more than the number of accessible

Anomaly event localization
In water distribution networks, it is necessary not only to detect anomalies and behavioral changes in sensor data but also to find the correct position and source of the faults that cause the anomalous behavior (fault localization/anomaly event localization). To produce good results in an anomaly hotspot localization process, the hydraulic model must be built with nodal demands that are sufficiently accurate to reflect actual water consumption, accurate elevations at the locations (nodes) where pressure data are recorded, and accurate boundary conditions, such as service reservoirs, tanks, and pumps [72]. Not only pressure-based anomalies but also other types of anomalies should be investigated. The source of a contamination is another hotspot that needs to be localized. This addresses the requirement to respond as soon as the contamination is discovered and to implement the necessary defenses to isolate the system component that has been affected.

Anomaly correction
Anomaly correction is a process of detecting, diagnosing, and correcting anomalies in data. It helps identify any unusual patterns or behavior in datasets that may indicate an error or irregularity. Anomaly correction can be used to improve the accuracy and reliability of data-driven decision-making [87]. The type of the data directly affects the relevance of the detection and correction methods. Sensor data, that is, the information produced by sensors, can be either numerical or categorical. Numerical data behave like numbers: they support mathematical operations and are continuous, scalable, and have a zero. Categorical data, however, are discrete and support no mathematical operations. Since categorical data are represented as a string of symbols, an anomaly may take the form of an unknown symbol or symbol sequence. It should be noted that as processing power improves, the appeal of sensors with categorical output is rising. The inability to perform statistical analysis on such data makes anomaly detection and correction much more difficult.
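The difference between the two data types can be seen in a small sketch. For numerical data, a flagged value can be replaced by interpolating between its valid neighbors; for categorical data, only membership in a known symbol set can be checked. The function names, readings, and valve states below are hypothetical and serve only as an illustration.

```python
def correct_numeric(readings, anomalous):
    """Replace flagged readings by linear interpolation between the
    nearest unflagged neighbors (or copy the single available one)."""
    bad = set(anomalous)
    good = [i for i in range(len(readings)) if i not in bad]
    fixed = list(readings)
    for i in sorted(bad):
        left = max((g for g in good if g < i), default=None)
        right = min((g for g in good if g > i), default=None)
        if left is not None and right is not None:
            w = (i - left) / (right - left)
            fixed[i] = readings[left] + w * (readings[right] - readings[left])
        elif left is not None:
            fixed[i] = readings[left]
        elif right is not None:
            fixed[i] = readings[right]
    return fixed


def unknown_symbols(sequence, vocabulary):
    """Categorical data: flag positions holding a symbol outside the
    known vocabulary; no arithmetic correction is possible."""
    return [i for i, s in enumerate(sequence) if s not in vocabulary]


# Hypothetical pH series with a flagged spike at index 2.
print(correct_numeric([7.0, 7.2, 12.5, 7.6, 7.4], anomalous=[2]))
# Hypothetical valve-state log with one unreadable entry.
print(unknown_symbols(["OPEN", "CLOSED", "OPEN", "??", "CLOSED"],
                      {"OPEN", "CLOSED"}))
```

The numerical series can be repaired with arithmetic, whereas the categorical log can only be flagged, which reflects the added difficulty described above.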

Visualization and GUI design
Data visualization is the use of visual components to effectively communicate the relevance of large datasets and to find undiscovered data trends. Charts, graphs, maps, tables, and other visual representations of data are all examples of data visualization. Interactive data visualization, on the other hand, allows users to directly alter plot elements and create connections between several plots. With the help of data visualization, decision makers can understand analytical data more easily and swiftly, especially those without a background in computer science or statistical analysis. In most cases, the graphical user interface (GUI) is provided by the user interface layer of water management systems, from which users can export and view data, produce summary statistics, and edit data quality [88]. To visualize water management-related data, the following issues should be considered: (i) The data organization and analysis process must be done first. (ii) When working with massive datasets, it can be intimidating to try to spot trends by simply looking at the raw data, and it is crucial to present the data in an objective manner. (iii) The third phase involves monitoring data and analyzing trends. (iv) It is crucial to identify the audience before starting to produce infographics, social media posts, or academic outputs using the findings. (v) Science-congruent narratives that are values-driven can help us communicate with the right audiences. (vi) By its most basic definition, graphic design is the art of producing visual content, principally conveying messages through the use of visual hierarchy and page layout strategies. It is best to adhere to fundamental visual design guidelines and principles when creating graphics. (vii) The results should be announced to the public. In science, consistency and replication are crucial [89].

Parallel and distributed computing
The IoT alone is expected to comprise 50 billion connected sensors worldwide by 2025, while the volume of data is expanding quickly, at a rate of millions per second. Extracting knowledge or making accurate predictions by integrating, analyzing, and mining such enormous amounts of data requires an effective and efficient framework and algorithm [90]. Because data streams evolve continuously, detecting anomalies and monitoring water quality at high speed are crucial and challenging tasks [91]. The majority of current and traditional anomaly detection techniques rely significantly on stationary data, and a centralized algorithm can take hours or even days to compute and identify accurate results. Thus, parallel and distributed computing is critical for reducing execution time to meet the need for real-time or near-real-time detection and monitoring [92].

Water quality in social multimedia
Social media platforms have emerged as a reliable means of communication and information transmission during the past ten years. Because of their capacity to reach sizable audiences globally, they are a favored forum for discussing and expressing concerns over various domestic and international issues. Law enforcement, emergency management agencies, the public health community, and businesses monitor social media for security, disaster response, disease outbreaks [93], and consumer happiness [94,95]. Although social media monitoring is still relatively new to the water industry, it could be used for comparable objectives, given that consumer complaints are a good source for spotting distribution system issues early on.
The Water Research Foundation's 2017 project, Social Media for Water Utilities, showed that the water industry lagged behind other sectors, such as the electric industry, in embracing social media. According to the survey, just a small portion of the 60 drinking water and wastewater utilities in the United States with social media profiles were actually using them, and of those that did, only a small portion were able to successfully reach their customer base.
Nowadays, various studies are carried out to monitor water quality through social media [96]. One of them is "Water Quality in Social Multimedia" [95]. The focus of the WaterMM task is the analysis of social media posts (tweets) on water quality, security, and safety. Participants in this task are given a set of Twitter post IDs and use them to download the text, the accompanying image, and the metadata of tweets that were selected by a keyword-based search for words or phrases about the quality of drinking water (e.g., strange color, odor or taste, related illnesses, etc.). Participants can tackle the task using text features, image features, metadata, or a combination of these. Several papers have used the WaterMM benchmark dataset [97][98][99]. Using other social media platforms and collecting multimodal and cross-data will be the main focus of future work.
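A keyword-based selection of posts, of the kind described above, can be sketched as follows. This is an assumed illustration only and does not reproduce the actual WaterMM collection pipeline; the keyword list, the function names, and the example posts are hypothetical.

```python
# Hypothetical keyword list inspired by the examples in the text.
KEYWORDS = ("strange color", "odor", "taste", "illness")


def matches_water_quality(text, keywords=KEYWORDS):
    """True if the post text contains any keyword (case-insensitive)."""
    lowered = text.lower()
    return any(k in lowered for k in keywords)


def select_posts(posts, keywords=KEYWORDS):
    """Return the IDs of the posts that mention a water quality keyword."""
    return [p["id"] for p in posts if matches_water_quality(p["text"], keywords)]


# Hypothetical posts; real data would also carry images and metadata.
posts = [
    {"id": 1, "text": "Tap water has a strange color this morning"},
    {"id": 2, "text": "Great weather for a walk by the river"},
    {"id": 3, "text": "The water tastes like chlorine again"},
]
print(select_posts(posts))
```

Plain substring matching is deliberately crude: it retrieves many irrelevant posts, which is why the task then relies on text, image, and metadata features to decide which posts are truly about water quality.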

Conclusion
In this study, water management systems and water quality issues in critical infrastructures, cyber security studies on water quality, past cyber attacks on water quality processes, and security requirements are systematically examined. The studies reviewed in this paper are promising, but more work is required for implementation and validation on real water systems. Monitoring water quality in water systems is a highly complicated and critical process influenced by many factors. Therefore, methods that require advanced technology should be defined and applied. It has become essential to develop an efficient detection method to improve emergency response capacity in the event of a possible attack, protect against potential hazards caused by intentional or unintentional contamination, classify water quality changes and anomalies, and ensure early warning of potential hazards. It is not easy to create a static set of rules or restrictions that catches major attacks clearly and quickly. Therefore, the use of learning-based anomaly detection techniques is essential for water quality monitoring in water management systems. In this way, the anomaly detection system will provide a defense mechanism for water management systems, while supporting the maintenance, repair, and development of similar critical infrastructures. By developing models based on artificial intelligence and machine learning techniques, it may become possible to predict stealthier attacks and implement defense mechanisms against them. Finally, these intrusion detection systems need to be evaluated against real-time water management systems.

5 EAI
Endorsed Transactions on Internet of Things | Volume 9 | Issue 4 |


Table 4. Water quality physico-chemical parameters and explanations
Parameter | Explanation | Ref. Range (WHO) | References
Dissolved Oxygen (mg/L) | The amount of dissolved oxygen is a very important parameter in water pollution and wastewater treatment. Low dissolved oxygen is critical to aquatic life. A low level is one of the most important indicators of water pollution. It depends on the temperature of the water, the partial pressure of oxygen in the atmosphere, the organisms that provide oxygen to the water, and the mineral concentration in the water.
It measures how clean the water is. It is undesirable to be in the water.
L) Free chlorine is added to disinfect water. Free chlorine levels decrease over time, so the CL2 levels of the water in the tanks and the water from the treatment are different. Inactivates pathogenic bacteria.
constant, but changes with the seasons. Temperatures of different water sources are different. The temperatures of surface waters are naturally determined by the climate. | 7-12 °C (ideal)
data 3 The concentration of organic matter is measured in water. It can decrease over time because of decomposition of organic substances in the water.

9 EAI
Endorsed Transactions on Internet of Things | Volume 9 | Issue 4 | D. Balta et al.
Figure 3. VOSviewer-based visualization of the papers: (a) country-based network, (b) abstract network.

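goodness of a set of predicted output values to actual output values. The formula is as follows.

$$R^2 = 1 - \frac{\sum_{i=1}^{D} (y_i - x_i)^2}{\sum_{i=1}^{D} (y_i - \bar{y})^2} \tag{7}$$

Here, taking $y$ as the actual values and $x$ as the predicted values, $\bar{y}$ is the mean of the actual values.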
Numerous investigations have been carried out by different scholars with the goal of enhancing the theoretical and practical performance of the original ELM. Researchers are paying close attention


Table 2. Comparison with other review and survey papers

Table 3. Components of Smart Water Management Technologies

Table 5. Search string used for each data source

Table 6. Comparison of different approaches for water quality estimation and anomaly detection

sensors. In order to deliver network-wide, globally relevant information, sensors must be positioned in this manner. Moving sensors are utilized in real applications more frequently than static ones. It should be considered, though, that this raises the problem's level of complexity.
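The trade-off between a limited number of sensors and a large number of network nodes can be illustrated with a greedy coverage heuristic. The sketch below is an assumed, simplified formulation and not the method of the cited works: the function name, the candidate nodes, and the coverage sets are hypothetical, and coverage is treated as a fixed set of nodes that each candidate location can monitor.

```python
def greedy_placement(coverage, budget):
    """Pick up to `budget` sensor locations, each time choosing the
    candidate that monitors the most nodes not yet covered.

    coverage: dict mapping a candidate node to the set of nodes it monitors.
    """
    chosen, covered = [], set()
    for _ in range(budget):
        best = max(
            (n for n in coverage if n not in chosen),
            key=lambda n: len(coverage[n] - covered),
            default=None,
        )
        if best is None or not coverage[best] - covered:
            break  # no candidate adds anything new
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered


# Hypothetical network: four candidate locations, seven nodes to monitor.
coverage = {
    "A": {1, 2, 3},
    "B": {3, 4},
    "C": {4, 5, 6, 7},
    "D": {1, 7},
}
print(greedy_placement(coverage, budget=2))
```

With a budget of two sensors, the heuristic covers all seven nodes in this toy network. A greedy choice is not guaranteed to be optimal in general, and mobile sensors would make the coverage sets time-dependent, which adds to the complexity mentioned above.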