A Comparative Analysis of Various Deep-Learning Models for Noise Suppression

Excessive noise in speech communication systems is a major issue affecting various fields, including teleconferencing and hearing aid systems. To tackle this issue, various deep-learning models have been proposed, with autoencoder-based models showing remarkable results. In this paper, we present a comparative analysis of four deep-learning-based autoencoder models, namely model ‘alpha’, model ‘beta’, model ‘gamma’, and model ‘delta’, for noise suppression in speech signals. The performance of each model was evaluated using an objective metric, the mean squared error (MSE). Our experimental results showed that model ‘alpha’ outperformed the other models, achieving a minimum error of 0.0086 and a maximum error of 0.0158. Model ‘gamma’ also performed well, with a minimum error of 0.0169 and a maximum error of 0.0216. These findings suggest that the proposed models have great potential for enhancing speech communication systems in various fields.


Introduction
Exposure to high noise levels can pose a significant threat to human health and well-being. Noise pollution, which is an unwanted sound that has a detrimental effect on human health and the environment, is a growing concern in many cities worldwide. According to the World Health Organization, exposure to noise levels above 70 decibels (dB) for an extended period can cause hearing damage, while exposure to levels above 85 dB for an extended period can result in permanent hearing loss [1]. Furthermore, prolonged exposure to high noise levels can lead to stress, cardiovascular disease, and sleep disturbances. In addition to the health risks, high noise levels can also have adverse effects on productivity and cognitive performance. Studies have shown that exposure to high levels of noise can impair memory, reduce concentration, and interfere with communication. This can lead to decreased productivity, impaired academic performance, and even accidents in the workplace. Reducing noise levels is, therefore, essential to protect human health and well-being, as well as to ensure optimal cognitive and academic performance. One way to achieve this is through noise cancellation technology [2].
Noise cancellation is a technique that involves reducing or eliminating unwanted sounds by generating sound waves that are the opposite of the unwanted sound. This can be achieved using various methods, including active noise control, passive noise control, and adaptive noise control. Machine learning, specifically deep learning techniques such as Convolutional Neural Networks (CNN), can be used to improve noise cancellation by learning the patterns of the input signals and generating a more accurate and effective cancellation signal (Pandey & Wang, 2019). CNN models are trained using large datasets of sound samples and can learn to recognize and cancel out specific types of noise. By using machine learning algorithms, noise cancellation systems can adapt to different environments and noise sources, making them more versatile and effective. Furthermore, machine learning can also improve the speed and efficiency of noise cancellation systems by allowing them to process signals in real time, making them useful in applications such as noise-cancelling headphones, automotive noise reduction, and environmental noise control.
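The idea of cancelling a sound with its opposite can be illustrated with a minimal sketch. The sampling rate, the 200 Hz test tone, and the helper names below are assumptions chosen for illustration; they are not taken from the experiments in this paper. The sketch only shows that adding a polarity-inverted copy of a known noise waveform leaves no residual.

```python
import math

SAMPLE_RATE = 16000  # assumed sampling rate in Hz (illustrative only)


def sine_wave(freq_hz, n_samples, sample_rate=SAMPLE_RATE):
    """Return n_samples of a unit-amplitude sine wave as a list of floats."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def anti_phase(signal):
    """Build the cancellation signal: the same waveform with inverted polarity."""
    return [-s for s in signal]


# Hypothetical 200 Hz hum standing in for the unwanted sound.
noise = sine_wave(200.0, 160)
cancel = anti_phase(noise)

# Superposing noise and cancellation signal sample by sample.
residual = [n + c for n, c in zip(noise, cancel)]
peak_residual = max(abs(r) for r in residual)
```

In practice the noise is not known exactly in advance, which is why the cancellation signal has to be estimated, for example by a learned model.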
Convolutional Neural Networks (CNN) have emerged as a powerful tool for pattern recognition in various fields, including speech recognition and image processing. Their ability to reduce the number of parameters in Artificial Neural Networks (ANN) has been a significant factor in their success. As a result, researchers and programmers have been exploring the use of larger models to tackle more complex tasks that were previously deemed impossible using traditional ANNs. One key assumption of CNNs is that the problems they handle should not have spatially dependent properties. Additionally, the ability of CNNs to learn abstract features as the input moves through deeper layers is another critical aspect [3]. For example, in image classification, the first layer may recognize edges, the second layer may identify simpler shapes, and the third layer may recognize higher-level characteristics, such as faces [4]. Autoencoder models are a type of artificial neural network that can learn to compress data into a low-dimensional representation and then reconstruct it back into its original form [5].
Ghosh et al. (2023) embarked on a comprehensive study to assess water quality through predictive machine learning. Their research underscored the potential of machine learning models in effectively assessing and classifying water quality. The dataset used for this purpose included parameters like pH, dissolved oxygen, BOD, and TDS. Among the various models they employed, the Random Forest model emerged as the most accurate, achieving an accuracy of 78.96%. In contrast, the SVM model lagged behind, registering the lowest accuracy of 68.29% [17]. Alenezi et al. (2021) developed a novel Convolutional Neural Network (CNN) integrated with a block-greedy algorithm to enhance underwater image dehazing. The method addresses color channel attenuation and optimizes local and global pixel values. By employing a unique Markov random field, the approach refines image edges. Performance evaluations, using metrics like UCIQE and UIQM, demonstrated the superiority of this method over existing techniques, resulting in sharper, clearer, and more colorful underwater images [18]. Sharma et al. (2020) presented a comprehensive study on the impact of COVID-19 on global financial indicators, emphasizing its swift and significant disruption. The research highlighted the massive economic downturn, with global markets losing over US $6 trillion in a week in February 2020. Their multivariate analysis provided insights into the influence of containment policies on various financial metrics. The study underscores the profound effects of the pandemic on economic activities and the potential of using advanced algorithms for detection and analysis [19].
An autoencoder is trained by minimizing the difference between the input data and the reconstructed data. Autoencoders can be used for various tasks, including data compression, feature extraction, and denoising. In the context of noise cancellation, autoencoder models are particularly useful. One way to use autoencoders for noise cancellation is to train them to learn a mapping between a noisy input signal and a clean output signal. During training, the model is fed noisy input data and is trained to generate a clean output that is as close as possible to the original clean signal. Once the model is trained, it can be used to denoise new input data by passing it through the autoencoder and obtaining the reconstructed clean output. The encoder extracts useful features from the input signal, and the decoder is responsible for generating the reconstructed clean output from those features. By training the network to reconstruct clean signals from noisy inputs, the model learns to extract features that are robust to noise. This allows the autoencoder to denoise input signals by removing the noise while retaining the underlying information. Autoencoder models have been shown to be effective in various noise cancellation applications, including speech and image denoising.
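The training objective described above can be made concrete with a small sketch of the mean squared error (MSE), the metric used in this paper. The sample values and the function name below are illustrative assumptions, not data from the experiments; the point is only that a good denoised output lies closer to the clean reference than the noisy input does.

```python
def mean_squared_error(clean, reconstructed):
    """Average of squared sample-wise differences between two equal-length signals."""
    if len(clean) != len(reconstructed):
        raise ValueError("signals must have the same length")
    return sum((c - r) ** 2 for c, r in zip(clean, reconstructed)) / len(clean)


# Toy four-sample signals (hypothetical values, for illustration only).
clean = [0.0, 0.5, -0.5, 0.25]
noisy = [0.1, 0.4, -0.3, 0.25]      # clean signal plus some noise
denoised = [0.0, 0.5, -0.4, 0.25]   # what a reasonable model might output

loss_before = mean_squared_error(clean, noisy)     # error of the raw noisy input
loss_after = mean_squared_error(clean, denoised)   # error after denoising
```

During training, the model weights are adjusted so that the second quantity becomes as small as possible across the training set.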
This paper compares four distinct autoencoder model architectures. Section 2 reviews prior research in the field of noise cancellation and suppression. Section 3 describes the methodology, Section 4 presents the results and a comparison of the proposed models, and Section 5 concludes the paper by highlighting the key inferences from the results.

Related Works
Table 1 compares the results from different models. Various models, including CNN, CRN, LSTM, RNN, R-CED, and the DeepClean architecture, were implemented, and their performances were analyzed. A common pattern was observed across this research: all the proposed models were somewhat computationally expensive.

Reference | Model | Results
[6] | A custom CNN | The training set accuracy was 95.82% and the corresponding loss was 0.13. The validation set accuracy was 73.59% and the loss was 1.02.
[7] | Custom CNN and RNN | Different noise removal methods showed varying performance on different types of noise. The RNN method was found to be more effective in removing stationary and street noise compared to the CNN method, whereas the CNN method performed better in removing noise from music and voiced background.
 | | Compared to the AES+DNN method, the BLSTM approach showed superior performance, exhibiting a 5.4 dB increase in ERLE and a 0.5 improvement in PESQ. The subjective listening tests revealed that the proposed techniques were preferred by the listeners over 60% of the time.

Methodology
The first step of the study involves unzipping the dataset, which contains a diverse collection of noisy sounds. These sounds include speeches embedded with different types of background noises. Additionally, the dataset includes corresponding clean signals, which serve as references for noise-free audio. Upon unzipping, the dataset is loaded into memory for further processing. To facilitate data manipulation and model training, the loaded dataset is converted into tensors, which are numerical representations of the audio signals. This conversion enables efficient computation and manipulation of the data during subsequent stages. Next, the dataset is divided into training and testing subsets. This step ensures that the performance of the deep-learning models can be evaluated on unseen data, thereby providing a realistic assessment of their effectiveness in noise suppression tasks. The next step is the creation and compilation of various deep-learning models. These models, specifically designed for noise suppression, employ different architectural configurations and parameters [15]. The purpose of this diversity is to enable a comprehensive comparative analysis, shedding light on the strengths and weaknesses of each model. Once the models are created and compiled, they undergo evaluation. This involves assessing their performance on the testing dataset, utilizing appropriate evaluation metrics for noise suppression tasks. The results obtained during evaluation provide insights into the models' capabilities in reducing noise and preserving the clarity of the desired audio signals. Lastly, the best-performing deep-learning model from the comparative analysis is implemented. This implementation stage aims to showcase the practical application of the selected model, potentially serving as a basis for further research and development in noise suppression techniques. Figure 1 shows the flowchart for the algorithm.

Creating and loading the dataset
The dataset utilized in this study comprises a diverse collection of audio recordings. These recordings consist of speeches embedded with various background noises, simulating real-world acoustic environments. Additionally, the dataset includes corresponding clean signals, which serve as references for the noise-free versions of the speeches. Careful curation of the dataset ensures a wide range of noise types, including environmental, mechanical, and human-generated sounds. The collection covers different scenarios and contexts to provide a comprehensive representation of the challenges encountered in noise suppression tasks. The dataset's size and composition help to train and evaluate autoencoder models effectively while ensuring the privacy and anonymity of the individuals involved in the recordings.
The goal is to train a deep learning model to remove noise from the noisy speech sounds and obtain the clean speech sounds. The data is first read from files and decoded using the decode_wav function from the TensorFlow library. The clean sounds and noisy sounds are stored in separate lists, and these lists are then concatenated into two large arrays, clean_sounds_list and noisy_sounds_list, representing the clean and noisy sounds respectively. To train the deep learning model, we need to divide the data into training and testing sets. We use 80% of the data for training and 20% for testing [16]. The data is then split into smaller batches, with each batch having a fixed size of 12000 sound samples. This is done to make the training process more efficient and to avoid running out of memory during training.
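The splitting steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the helper names are hypothetical, incomplete trailing frames are assumed to be dropped, and the framing is assumed to happen before the 80/20 split. The total of 8288 examples is simply the sum of the 6630 training and 1658 validation samples reported in the results.

```python
def split_into_frames(samples, frame_size=12000):
    """Cut a 1-D sequence into consecutive non-overlapping frames.

    Assumption: a trailing chunk shorter than frame_size is discarded.
    """
    n_frames = len(samples) // frame_size
    return [samples[i * frame_size:(i + 1) * frame_size] for i in range(n_frames)]


def train_test_split(items, train_fraction=0.8):
    """Split a list into a leading training part and a trailing testing part."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


# A dummy 30000-sample recording yields two full 12000-sample frames.
frames = split_into_frames(list(range(30000)))

# An 80/20 split of 8288 examples reproduces the reported 6630 / 1658 counts.
train_items, test_items = train_test_split(list(range(8288)))
```

Fixed-length frames keep every training example the same shape, which is what allows them to be stacked into a single tensor.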
Next, we created two TensorFlow datasets: one for training and another for testing. Each dataset is created using a TensorFlow function that takes the clean and noisy tensors as input and returns a dataset object that can be used for training the model. The dataset is then shuffled and batched, with a batch size of 64. We also drop any incomplete batches at the end of each epoch to ensure that all batches are the same size. Figure 2 and Figure 3 show a sample clean waveform and a sample noisy waveform, respectively. The resulting dataset is then used to train a deep learning model to remove noise from the noisy speech sounds. The model architecture used in this study is a convolutional neural network (CNN) with an encoder-decoder architecture. The encoder part of the model consists of several 1D convolutional layers that extract relevant features from the input sound. The decoder part of the model consists of several 1D transposed convolutional layers that reconstruct the clean sound from the extracted features. The model is trained using the noisy speech sounds as input and the clean speech sounds as output.
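The shuffle-and-batch step can be mimicked in plain Python to show what dropping incomplete batches means in numbers. This is a sketch under assumptions: the function name, the fixed seed, and the string placeholders are hypothetical, and the study itself uses TensorFlow dataset objects (in tf.data, the analogous calls would be shuffle followed by batch with drop_remainder=True).

```python
import random


def shuffled_full_batches(pairs, batch_size=64, seed=0):
    """Shuffle (noisy, clean) pairs and return only the complete batches."""
    order = list(pairs)
    random.Random(seed).shuffle(order)  # fixed seed for a repeatable illustration
    n_full = len(order) // batch_size   # an incomplete final batch is dropped
    return [order[i * batch_size:(i + 1) * batch_size] for i in range(n_full)]


# Placeholder pairs standing in for the 6630 training examples.
pairs = [(f"noisy_{i}", f"clean_{i}") for i in range(6630)]
batches = shuffled_full_batches(pairs)

full_batches_per_epoch = len(batches)
dropped_examples = len(pairs) - full_batches_per_epoch * 64
```

With 6630 training examples and a batch size of 64, this gives 103 full batches per pass, with 38 examples left out of that pass.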

Model Formation
This paper introduces four distinct autoencoder architectures for better understanding and comparison of training and validation losses. The models are named 'alpha', 'beta', 'gamma', and 'delta' for a clear distinction between them. The models are all created using the Keras library in Python. We picked the sequential model from the Keras library since our model development was focused on a stack of layers with only one input and output for each layer. The architecture of model 'alpha' consists of a series of convolutional layers followed by a series of transpose convolutional layers. The convolutional layers downsample the input data to a lower-dimensional feature representation, while the transpose convolutional layers upsample the features to reconstruct the original input data. The network starts with an input layer of size (16000, 1). The first convolutional layer (c1) has 2 filters with a kernel size of 32 and a stride of 2. The activation function used is ReLU. The following convolutional layers (c2, c3, c4, c5) have 4, 8, 16, and 32 filters respectively, with the same kernel size and stride as c1 and the ReLU activation function. The first transpose convolutional layer (dc1) has 32 filters, a kernel size of 32, and a stride of 1, with padding set to 'same'. The Concatenate layer merges the output of the dc1 layer with the output of c5. The resulting tensor is passed through the next transpose convolutional layer (dc2), which has 16 filters, a kernel size of 32, and a stride of 2. Again, the output is concatenated with the output of the previous convolutional layer (c4) and passed through the next transpose convolutional layer (dc3) with 8 filters, a kernel size of 32, and a stride of 2. This process is repeated until the last transpose convolutional layer (dc7), which has 1 filter and a kernel size of 32. Finally, the output of the last concatenation layer is passed through a linear activation function and produces the output of the autoencoder. Figure 4 displays the input shape and the output form following each layer.
Padding is set to 'same' to ensure that the output shape of the layers is the same as the input shape. The decoder part is defined using three 1D transposed convolutional layers. The transposed convolutional layers are used to upsample the encoded input back to its original shape. The layers have 8, 16, and 1 filter respectively, with the same padding and activation function as the encoder. Figure 7 displays the model delta architecture.
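For model 'alpha', the layer sizes stated above can be traced with simple arithmetic. The sketch below assumes 'same' padding on every layer (the text states it explicitly only for dc1), under which a strided convolution divides the sequence length by the stride and a strided transposed convolution multiplies it. Only the layers whose filter counts and strides are given in the text (c1 to c5 and dc1 to dc3) are traced; the helper names are hypothetical.

```python
import math


def conv_same_len(length, stride):
    """Output length of a 1-D convolution with 'same' padding."""
    return math.ceil(length / stride)


def deconv_same_len(length, stride):
    """Output length of a 1-D transposed convolution with 'same' padding."""
    return length * stride


# Encoder: c1..c5, all with stride 2, starting from an input of shape (16000, 1).
length, channels = 16000, 1
encoder = []
for filters in (2, 4, 8, 16, 32):
    length, channels = conv_same_len(length, 2), filters
    encoder.append((length, channels))

c4, c5 = encoder[3], encoder[4]

# Decoder: each output is concatenated along channels with the matching encoder output.
dc1 = (deconv_same_len(c5[0], 1), 32)        # stride 1, 32 filters
merged1 = (dc1[0], dc1[1] + c5[1])           # Concatenate(dc1, c5)
dc2 = (deconv_same_len(merged1[0], 2), 16)   # stride 2, 16 filters
merged2 = (dc2[0], dc2[1] + c4[1])           # Concatenate(dc2, c4)
dc3 = (deconv_same_len(merged2[0], 2), 8)    # stride 2, 8 filters
```

Under this assumption, the lengths of each decoder output and the encoder output it is concatenated with agree (500 with 500, 1000 with 1000), which is what the concatenation along the channel axis requires.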

Future Scope
In considering future research directions for this topic, it is crucial to explore advanced techniques for noise suppression in deep-learning models. One potential area of investigation involves the development of novel architectures that incorporate attention mechanisms. Attention mechanisms enable the models to focus on relevant audio features while suppressing unwanted noise, leading to enhanced speech quality and improved noise reduction performance. Another promising avenue for future research lies in leveraging generative adversarial networks (GANs) for noise suppression. GANs have shown remarkable potential in generating realistic data, and their application to noise suppression could lead to more robust and accurate models. By training GANs on large-scale datasets containing diverse noise patterns, researchers can aim to achieve superior noise reduction performance and generalize well to unseen noisy environments.
Additionally, investigating the effectiveness of transfer learning approaches in noise suppression is another intriguing research direction. By leveraging models pre-trained on related tasks, researchers can explore the transferability of learned features and weights to enhance noise suppression performance. This approach could potentially reduce the amount of labeled data required for training, making noise suppression models more accessible and adaptable to various real-world scenarios. Moreover, exploring the combination of deep-learning models with other signal processing techniques, such as adaptive filtering or spectral enhancement, holds promise for further improving noise suppression performance. Integrating these techniques into the existing deep-learning frameworks could result in more comprehensive and effective noise reduction algorithms.
Furthermore, the investigation of real-time noise suppression systems for practical applications is an important direction for future research. Developing efficient and low-latency deep-learning models capable of processing audio signals in real-time scenarios can have significant implications for industries such as telecommunications, voice assistants, and audio-conferencing systems. Also, it is crucial to consider the ethical implications and potential bias in noise suppression algorithms. Future research should focus on developing fair and unbiased models that do not inadvertently discriminate against certain speech characteristics or demographic groups. By pursuing these future research directions, the field of deep-learning-based noise suppression can continue to advance, leading to more accurate, robust, and practical solutions for enhancing speech communication systems in various domains.

Conclusion
In this paper, the primary aim was to compare noise suppression models and identify the best one for use in various modern-day devices. To that end, four autoencoder models were built from scratch and compared on a fixed set of input parameters. Model 'alpha' was the best performer, with the lowest minimum loss of 0.0086, while models 'beta', 'gamma', and 'delta' obtained minimum errors of 0.0177, 0.0169, and 0.0188 respectively. The proposed autoencoder model for noise suppression holds significant promise in effectively handling adaptive noises in a wide range of environments. By leveraging its inherent capacity to learn robust representations of audio signals, the autoencoder can adaptively capture and model the complex dynamics of different noise types. This adaptability enables the model to effectively suppress adaptive noises that exhibit time-varying characteristics or non-stationary patterns. The autoencoder's ability to learn latent representations from both noisy and clean signals allows it to identify and extract relevant features that are essential for differentiating between the desired speech and adaptive noises. Consequently, the model can dynamically adjust its noise suppression parameters and adapt its filtering mechanisms to accommodate the changing nature of the noises in real time. This capability of the autoencoder model makes it a valuable tool in applications where adaptive noises are prevalent, such as communication systems, voice assistants, and audio recording devices. By effectively suppressing adaptive noises, the model enhances the intelligibility and quality of the desired speech signal, improving the overall user experience in various practical scenarios.
Table 1 (continued).
accuracy on INRNet and 0.955 on ROLD (both on 10% noise density)
signal-to-noise ratio (SNR) of the injected signal is increased by 21.6%
[13] | A custom Convolutional Recurrent Network (CRN) | The proposed SAES based on CRN performs better than the conventional Wiener and NLMS algorithms, particularly in low Signal-to-Error Ratio (SER) conditions and high Reverberation Time (RT60) conditions.

Figure 1. Flowchart for the Algorithm

Results and Discussions
All the models are tested and validated on a set of parameters for each noise file, i.e., batches, steps per epoch, validation steps, and the number of epochs. A total of 6630 samples are used for training and 1658 samples are used for validating the model.
EAI Endorsed Transactions on Internet of Things | Volume 10 | 2024

Table 2 depicts the results obtained from all the models, comparing their minimum and maximum errors. It is observed that model alpha has the lowest minimum and maximum loss and is the most consistent of the models, with model gamma being the second best, followed by model beta and model delta.
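The ranking described above follows directly from the minimum errors reported in this paper, as the short sketch below shows. The variable names are illustrative; the four values are the reported minimum MSE figures for each model.

```python
# Minimum MSE reported for each model in this paper.
min_error = {"alpha": 0.0086, "beta": 0.0177, "gamma": 0.0169, "delta": 0.0188}

# Sort model names from lowest to highest minimum error.
ranking = sorted(min_error, key=min_error.get)
best, runner_up = ranking[0], ranking[1]

# Relative reduction in minimum error of the best model versus the runner-up.
relative_gap = (min_error[runner_up] - min_error[best]) / min_error[runner_up]
```

The ordering is alpha, gamma, beta, delta, and the minimum error of model alpha is roughly half that of model gamma.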

Table 2. Model Error Comparison