Application of Deep Neural Network Algorithm in Speech Enhancement of Online English Learning Platform

INTRODUCTION: On online English learning platforms, noise interference prevents learners from hearing the teaching content clearly, which greatly reduces the efficiency of English learning. To improve the voice quality of such platforms, a speech enhancement method based on a deep neural network is studied. OBJECTIVES: This paper proposes a deep neural network-based speech enhancement method for online English learning platforms in order to obtain better results in speech quality optimization. METHODS: The Moth-flame optimization algorithm is used to optimize the VMD (Variational Modal Decomposition) algorithm by finding the optimal values of its decomposition mode number and penalty factor, and the optimized variational modal decomposition algorithm is then used to filter the noise information in the speech signal. The denoised speech signal is then taken as the enhancement target of a network speech enhancement method based on deep neural network learning. RESULTS: The results show that the method has significant denoising ability for speech signals; after it is applied, the PESQ (perceptual evaluation of speech quality) value of the speech signal is greater than 4.0, the spectral features are prominent, and the speech quality is improved. CONCLUSION: Experiments from three perspectives (speech signal denoising, speech quality enhancement and speech spectrum information) confirm the usability of the method in this paper.


Introduction
Speech [1], the communication method that people use most frequently, is both efficient and convenient. However, speech signals are inevitably disturbed by various noises during acquisition and transmission. During acquisition, the speech signal is mixed with other sounds.
During transmission, the speech signal is affected by various circuit noises, which degrade its quality at the receiving end. These noises greatly reduce the efficiency of information transmission and can even cause communication to fail. In a point-to-point English learning platform operating in a complex environment, the application scenarios of speech teaching are even more complex and adverse. Moreover, to ensure concealment, the signal is often weak, so it is easily submerged in the environmental noise, which affects the learning effect. Therefore, the application of speech enhancement technology in English learning platforms is particularly important.

Haiyan Peng and Min Zhang
EAI Endorsed Transactions on Scalable Information Systems
Norezmi Jamal et al. [2] proposed predicting a target mask to separate the speech from the noise background in a noisy mixture. Noise in Malay speech is reduced by a deep neural network combined with acoustic features such as the power spectrum of a gammatone filter bank, which enhances the intelligibility of Malay speech. Shoba Sivapatham et al. [3] proposed a performance analysis of various training targets for improving speech quality and intelligibility: the performance of binary and non-binary training targets of the deep neural network is evaluated under different SNR and noise conditions.
The deep neural network method [4] is widely applied in the field of speech signal processing. This paper therefore studies the problem of speech enhancement for online English learning platforms and proposes a deep neural network-based speech enhancement method in order to obtain better results in speech quality optimization. Speech enhancement technology can suppress noise interference as much as possible, make the English speech clearer, and improve the user's learning experience.

Speech signal denoising based on optimized VMD algorithm
The variational modal decomposition algorithm is referred to as the VMD algorithm. This paper optimizes the algorithm according to the following principle. First, the Moth-flame optimization algorithm is used to optimize the VMD algorithm: it finds the optimal combination $[H_0, \alpha_0]$ of the number of decomposition modes $H$ and the penalty factor $\alpha$, and the speech signal is decomposed with these values [5]. Then, based on the principle of correlation coefficient screening, the decomposed modal components are divided into effective modes and noise modes, and the noise modes are denoised by wavelet thresholding [6]. Finally, the denoised modal components and the effective modes are reconstructed to obtain the denoised speech signal. The operation flow chart is shown in Figure 1.

Signal decomposition based on variational modal decomposition algorithm
The variational modal decomposition algorithm [7] is an adaptive, non-recursive signal decomposition method that decomposes a signal into a finite number of intrinsic modal components whose sum is the original signal. Each of these components satisfies the definition of an intrinsic mode. Decomposing a signal by VMD amounts to solving a variational problem, and the problem must be constructed before it can be solved, so constructing and solving the variational problem are the core of the algorithm.
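Here, $e$ is the error term and $d$ is the derivative parameter. The variational problem is transformed with the Fourier transform and solved iteratively:
(1) Initialize the modal components and their center frequencies.
(2) Update the modal components and their center frequencies.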
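For orientation, the constrained variational problem and the iterative updates of the standard VMD algorithm, as commonly stated in the VMD literature, can be sketched as follows. This is the usual textbook form, not a reproduction of the numbered equations of this paper. The symbols $f$ (input signal), $u_h$ (modal component), $\omega_h$ (center frequency), $\lambda$ (Lagrange multiplier), $\tau$ (update step) and $\varepsilon$ (convergence tolerance) are assumptions introduced here; $H$ is the number of modes and $\alpha$ denotes the penalty factor.

```latex
% Constrained variational problem
\min_{\{u_h\},\{\omega_h\}} \sum_{h=1}^{H}
  \left\| \partial_t \left[ \left( \delta(t) + \frac{j}{\pi t} \right) * u_h(t) \right]
  e^{-j\omega_h t} \right\|_2^2
\quad \text{s.t.} \quad \sum_{h=1}^{H} u_h(t) = f(t)

% Frequency-domain update of each mode and of its center frequency
\hat{u}_h^{n+1}(\omega) =
  \frac{\hat{f}(\omega) - \sum_{i \neq h} \hat{u}_i(\omega) + \hat{\lambda}^{n}(\omega)/2}
       {1 + 2\alpha \left( \omega - \omega_h^{n} \right)^2},
\qquad
\omega_h^{n+1} =
  \frac{\int_0^{\infty} \omega \, |\hat{u}_h^{n+1}(\omega)|^2 \, d\omega}
       {\int_0^{\infty} |\hat{u}_h^{n+1}(\omega)|^2 \, d\omega}

% Multiplier update and stopping criterion
\hat{\lambda}^{n+1}(\omega) = \hat{\lambda}^{n}(\omega)
  + \tau \left( \hat{f}(\omega) - \sum_{h=1}^{H} \hat{u}_h^{n+1}(\omega) \right),
\qquad
\sum_{h=1}^{H} \frac{\| \hat{u}_h^{n+1} - \hat{u}_h^{n} \|_2^2}{\| \hat{u}_h^{n} \|_2^2} < \varepsilon
```

In this form, a larger $\alpha$ narrows the bandwidth of each mode and $H$ fixes how many modes are extracted, which is why these two parameters are the ones tuned by the optimization algorithm.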

Screening of correlation coefficients
After the signal is decomposed by the variational modal decomposition algorithm, the correlation coefficient of each modal component is screened using the autocorrelation function. If the correlation coefficient is greater than the threshold value, the modal component is considered to be well correlated with the original speech signal and is retained; otherwise, the corresponding modal component is subjected to wavelet denoising [11].
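The screening rule above can be sketched in a few lines. This is a minimal illustration, assuming the Pearson correlation coefficient between each mode and the original signal as the screening statistic and a hypothetical threshold of 0.3; the function names `correlation` and `screen_modes` and the toy modes are introduced here only for illustration and are not taken from the paper.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def screen_modes(signal, modes, threshold=0.3):
    """Split mode indices into effective modes and noise-dominated modes."""
    effective, noisy = [], []
    for idx, mode in enumerate(modes):
        if abs(correlation(mode, signal)) > threshold:
            effective.append(idx)   # retained unchanged
        else:
            noisy.append(idx)       # passed on to wavelet denoising
    return effective, noisy

# Toy example (assumed data): a strong 5 Hz mode plus a weak 50 Hz mode.
N = 200
mode_a = [math.sin(2 * math.pi * 5 * i / N) for i in range(N)]
mode_b = [0.1 * math.sin(2 * math.pi * 50 * i / N) for i in range(N)]
signal = [a + b for a, b in zip(mode_a, mode_b)]
effective, noisy = screen_modes(signal, [mode_a, mode_b])
```

In the toy example the strong mode is kept as an effective mode and the weak mode is routed to wavelet denoising; in practice the threshold would have to be chosen for the data at hand.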

Optimization of VMD algorithm based on Moth-flame optimization algorithm
After screening the correlation coefficients of the modal components, the VMD algorithm is optimized by the moth-flame optimization (MFO) algorithm. Suppose each moth is a candidate solution for the decomposition mode number $H$ and the penalty factor $\alpha$, and the variable to be solved is the position of the moth in space. Thus, by changing its position vector, a moth can fly in one, two, three, or even higher dimensions. Since the MFO algorithm is essentially a swarm intelligence optimization algorithm, the population of moths can be represented by the following matrix.

$$X = \begin{bmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,m} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,m} \end{bmatrix}$$

where $H \in X$, $\alpha \in X$; $n$ is the number of moths; and $m$ is the dimensionality of the variables to be solved, namely the mode number $H$ and the penalty factor $\alpha$. For these moths, it is also assumed that there exists a vector of corresponding fitness values, denoted $P_n$.
The MFO algorithm requires each moth to update its position using only the unique flame assigned to it, which prevents the algorithm from falling into local extrema and greatly enhances its global search capability. The moth positions and the flame positions in the search space are therefore matrices of the same dimension. To model mathematically the flight of a moth toward a flame, the position update mechanism of each moth relative to its flame can be represented by the equation

$$X_j = R(X_j, G_i)$$

where $X_j$ denotes the $j$-th moth, which is a candidate solution for the decomposition mode number $H$ and the penalty factor $\alpha$; $G_i$ denotes the $i$-th flame; and $R$ denotes the spiral function. The function satisfies the following conditions.
(1) The initial point of the spiral function should start from the moth.
(2) The end point of the spiral is the position of the flame.
$$R(X_j, G_i) = E_j \cdot e^{ck} \cdot \cos(2\pi k) + G_i \qquad (11)$$

where $E_j$ denotes the distance between the $j$-th moth and the $i$-th flame, $c$ is the defined logarithmic spiral shape constant, and the path coefficient $k$ is a random number. Equation (11) simulates the path of the moth's spiral flight, and the next position of the moth is determined by the flame it circles. The spiral equation shows that the moth can fly around the flame, not just in the space between them, which guarantees both the global search capability and the local exploitation capability of the algorithm. If the fitness value of the updated moth position is better than that of the corresponding flame of the current generation, the updated position is selected as the position of a flame in the next generation, which gives the moth its local exploitation capability. The model has the following characteristics when used.
(1) By modifying the parameter k , a moth can converge to an arbitrary neighborhood of the flame.
(2) The smaller the k , the closer the moth is to the flame.
(3) As the moth gets closer to the flame, it updates its position around the flame more and more frequently.
The flame position update mechanism described above ensures the local exploitation capability of the moths around the flames. To increase the probability of finding a better solution for the decomposition mode number $H$ and the penalty factor $\alpha$, the best solution found so far is used as the position of the next generation of flames. Thus, the flame position matrix usually contains the best solutions found so far. During the optimization process, each moth updates its position according to this matrix. The path coefficient $k$ in the MFO algorithm is a random number in a fixed interval; with this treatment, the moths converge more precisely to their corresponding flames as the iterative process proceeds. The general procedure for solving the problem with the MFO algorithm in this paper is as follows.
(1) Initialize the MFO algorithm by setting parameters such as the dimensionality of the variables to be optimized (the decomposition mode number $H$ and the penalty factor $\alpha$), the moth population size, the maximum number of iterations, and the logarithmic spiral shape constant.
(2) The variables to be solved are initialized, the moth positions are randomly generated in the search space, and the corresponding fitness value of each moth is evaluated.
(3) The spatial position of the moth is sorted in the order of increasing fitness value and assigned to the flame as the spatial position of the flame in the first generation.
(4) Use Equation (9) to update the position of the current generation of moths.
(5) Reorder the fitness values of the updated moth positions and the flame positions, and select the spatial positions with the better fitness values as the positions of the next generation of flames.
(6) Reduce the number of flames using the adaptive mechanism of equation (12).
(7) Return to step (4) to enter the next generation until the number of iterations meets the algorithm requirements.
(8) Output and display the optimization results for the decomposition mode number $H$ and the penalty factor $\alpha$, and the program ends.
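The eight steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the fitness function `toy_fitness` is a hypothetical stand-in with its minimum at $H = 5$, $\alpha = 2000$ (the paper's actual fitness function is not given in this section), the path coefficient $k$ is drawn uniformly from $[-1, 1]$, and the flame-reduction rule is the one commonly used in the MFO literature, assumed here to correspond to equation (12).

```python
import math
import random

def mfo(fitness, bounds, n_moths=20, max_iter=100, c=1.0, seed=0):
    """Minimise `fitness` over the box `bounds` with a basic moth-flame search."""
    rng = random.Random(seed)
    # Step (2): random initial moth positions inside the search space.
    moths = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_moths)]
    flames = []    # (fitness, position) pairs, best first
    history = []   # best fitness found up to each generation
    for it in range(1, max_iter + 1):
        scored = [(fitness(m), list(m)) for m in moths]
        # Steps (3) and (5): merge moths with old flames and keep the best ones.
        flames = sorted(flames + scored, key=lambda s: s[0])[:n_moths]
        history.append(flames[0][0])
        # Step (6): shrink the number of active flames (assumed standard rule).
        n_flames = round(n_moths - it * (n_moths - 1) / max_iter)
        # Step (4): spiral update of every moth around its assigned flame.
        for j, moth in enumerate(moths):
            flame = flames[min(j, n_flames - 1)][1]
            for d, (lo, hi) in enumerate(bounds):
                dist = abs(flame[d] - moth[d])        # distance E_j
                k = rng.uniform(-1.0, 1.0)            # path coefficient (assumed interval)
                new = dist * math.exp(c * k) * math.cos(2 * math.pi * k) + flame[d]
                moth[d] = min(max(new, lo), hi)       # keep the moth inside the bounds
    # Step (8): report the best flame found.
    best_fit, best_pos = flames[0]
    return best_pos, best_fit, history

def toy_fitness(pos):
    """Hypothetical stand-in objective; H is rounded because it is a mode count."""
    h, alpha = round(pos[0]), pos[1]
    return (h - 5) ** 2 + ((alpha - 2000.0) / 1000.0) ** 2

bounds = [(2, 10), (100, 5000)]   # assumed search ranges for H and alpha
best_pos, best_fit, history = mfo(toy_fitness, bounds)
```

Because the flames are always the best positions seen so far, the best fitness recorded in `history` never gets worse from one generation to the next, which is the elitist behaviour described in step (5).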

Wavelet denoising
After the optimal values of the mode number and the penalty factor are obtained, wavelet thresholding is used to filter the noise information in the speech signal. Selecting a suitable wavelet basis and determining the number of decomposition layers are the prerequisites of wavelet denoising. The noisy speech signal is then decomposed to obtain wavelet coefficients at different scales. Comparing these coefficients shows that the wavelet coefficients of the noise are smaller than those of the actual signal, so a suitable threshold can be selected and compared with the coefficients obtained from the wavelet decomposition. When a wavelet coefficient is higher than the threshold, it is taken to be generated mainly by the actual speech signal and is retained; otherwise, it is taken to be generated by noise and is filtered out. Finally, the inverse wavelet transform is applied to the remaining coefficients and the signal is reconstructed, which completes the wavelet denoising [12].
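The decompose, threshold and reconstruct procedure described above can be sketched with the Haar wavelet, chosen here only because it is the simplest wavelet basis. The wavelet basis, the number of levels, the threshold value and the function names are all assumptions for illustration, and the thresholding rule is the plain keep-or-discard rule described in the paragraph, not the paper's own threshold function. The signal length must be divisible by 2 raised to the number of levels.

```python
import math

S = 1.0 / math.sqrt(2.0)

def haar_dwt(x):
    """One level of the Haar transform: approximation and detail coefficients."""
    approx = [(x[i] + x[i + 1]) * S for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) * S for i in range(0, len(x), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: interleave the reconstructed sample pairs."""
    x = []
    for a, d in zip(approx, detail):
        x.extend(((a + d) * S, (a - d) * S))
    return x

def wavelet_denoise(x, levels, threshold):
    """Decompose, zero the small detail coefficients, then reconstruct."""
    approx, details = list(x), []
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        # Coefficients above the threshold are kept, the others are treated as noise.
        details.append([c if abs(c) > threshold else 0.0 for c in detail])
    for detail in reversed(details):
        approx = haar_idwt(approx, detail)
    return approx

# Toy example (assumed data): a step from 1 to 5 with a small alternating disturbance.
noisy = [1.05, 0.95, 1.05, 0.95, 5.05, 4.95, 5.05, 4.95]
denoised = wavelet_denoise(noisy, levels=2, threshold=0.2)
```

In this toy case the small detail coefficients produced by the disturbance fall below the threshold and are removed, while the step itself is carried by the approximation coefficients and survives the reconstruction.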
According to the wavelet denoising principle, the denoising effect depends largely on the choice of threshold and threshold function. Hard thresholding and soft thresholding are the two commonly used threshold functions. The signal reconstructed after hard thresholding suffers from discontinuity, oscillation and distortion. The soft thresholding function [13], although continuous, introduces a deviation in the coefficients, which causes loss of high-frequency information, edge blurring and other problems in the reconstructed signal. Because of these defects, the traditional threshold functions need to be improved to construct a new threshold function. In addition to the threshold function, the selection of the wavelet denoising threshold is also very important. If the threshold is too small, the noise in the signal will not be filtered out; if the threshold is too large, useful components may be filtered out, resulting in deviations in the data. The threshold function selected in this paper is.
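For reference, the two classical threshold functions contrasted in this section can be written as below. These are the standard textbook definitions, not the improved threshold function selected in this paper; the function names and the sample coefficients are chosen here for illustration.

```python
def hard_threshold(w, thr):
    """Keep a coefficient unchanged if it exceeds the threshold, else zero it."""
    return w if abs(w) > thr else 0.0

def soft_threshold(w, thr):
    """Zero small coefficients and shrink the remaining ones toward zero by thr."""
    if abs(w) <= thr:
        return 0.0
    return w - thr if w > 0 else w + thr

coeffs = [-1.5, -0.5, 0.5, 1.5]
hard = [hard_threshold(w, 1.0) for w in coeffs]   # jumps from 0 to the full value
soft = [soft_threshold(w, 1.0) for w in coeffs]   # continuous, but biased by thr
```

The hard function jumps at the threshold, which is the source of the discontinuity mentioned above, while the soft function is continuous but reduces every retained coefficient by the threshold, which is the constant deviation mentioned above.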

Network speech enhancement method based on deep neural network learning
For the speech signal of the online English learning platform denoised in Section 2.1, a network speech enhancement method based on deep neural network learning is used to enhance the speech signal. This method mainly uses a deep neural network [16]: the network model is initialized by training restricted Boltzmann machines (RBMs), and the deep neural network (DNN) is then gradually optimized by the stochastic gradient descent algorithm. To a certain extent, this model can solve the problem of the network model falling into local optima. The flow diagram of the whole algorithm is shown in Figure 2.

EAI Endorsed Transactions on Scalable Information Systems 01 2023 - 01 2023 | Volume 10 | Issue 2 | e10

Pre-training and fine tuning
(1) Pre-training
Pre-training uses the denoised speech signal of the online English learning platform to train the restricted Boltzmann machine [17], which is an energy-based model whose network is a bipartite graph. The first layer is the visible layer and the second layer is the hidden layer, connected by a sigmoid activation function. The joint probability of the visible and hidden layers is defined by the energy function of the RBM and a normalization constant. Since the speech signal follows a real-valued distribution, the first RBM is usually a Gaussian restricted Boltzmann machine (GRBM), on which Bernoulli restricted Boltzmann machines (BBRBM) are stacked.
For the GRBM, the energy function is defined with $j$ and $i$ as the indices of the visible and hidden units, respectively, $\sigma_j$ as the variance of the speech signal in the visible layer, and $w_{ji}$ as the connection weight between the visible and hidden layers; the conditional probabilities of the visible and hidden layers of the GRBM follow from this energy function.
For the BBRBM, the energy function and the conditional probabilities of the visible and hidden layers are defined for binary units.
Gibbs sampling is used for layer-by-layer training. Given a training sample of the speech signal, the conditional probability of each node is computed, and the contrastive divergence (CD) algorithm is used to update the RBM parameters. The output of the previous layer is the input of the next layer, and finally a stacked RBM network is formed.
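For reference, the quantities named in the pre-training step are commonly written as follows in the RBM literature. This is a sketch of the standard forms, which may differ in detail from the numbered equations of this paper. The visible and hidden vectors $v$ and $h$, the biases $a_j$ and $b_i$, the learning rate $\eta$ and the logistic function $\operatorname{sigm}$ are notation assumed here, and $\sigma_j$ appears as the standard deviation of visible unit $j$.

```latex
% Joint probability with energy function E and normalization constant Z
p(v, h) = \frac{1}{Z} \exp\bigl(-E(v, h)\bigr), \qquad
Z = \sum_{v, h} \exp\bigl(-E(v, h)\bigr)

% Gaussian-Bernoulli RBM (GRBM)
E(v, h) = \sum_j \frac{(v_j - a_j)^2}{2\sigma_j^2} - \sum_i b_i h_i
          - \sum_j \sum_i \frac{v_j}{\sigma_j} w_{ji} h_i
p(h_i = 1 \mid v) = \operatorname{sigm}\Bigl(b_i + \sum_j w_{ji} \frac{v_j}{\sigma_j}\Bigr), \qquad
p(v_j \mid h) = \mathcal{N}\Bigl(v_j ;\; a_j + \sigma_j \sum_i w_{ji} h_i ,\; \sigma_j^2\Bigr)

% Bernoulli-Bernoulli RBM (BBRBM)
E(v, h) = -\sum_j a_j v_j - \sum_i b_i h_i - \sum_j \sum_i v_j w_{ji} h_i
p(h_i = 1 \mid v) = \operatorname{sigm}\Bigl(b_i + \sum_j w_{ji} v_j\Bigr), \qquad
p(v_j = 1 \mid h) = \operatorname{sigm}\Bigl(a_j + \sum_i w_{ji} h_i\Bigr)

% One-step contrastive divergence (CD-1) weight update
\Delta w_{ji} = \eta \bigl( \langle v_j h_i \rangle_{\text{data}}
                          - \langle v_j h_i \rangle_{\text{recon}} \bigr)
```

The CD update contrasts statistics measured on the training sample with statistics measured after one Gibbs step, which is why Gibbs sampling and the conditional probabilities above are needed during layer-by-layer training.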
(2) Fine tuning
Fine tuning is an adaptive learning [18] process with three main phases: forward pass, back-propagation, and weight modification.
a. Forward pass: a mini-batch of speech features is input into the neural network, and the activation values of each layer are passed forward to the output layer to obtain the cost function $Loss$ based on the minimum mean square error criterion, in which $M$ is the mini-batch size and the error is computed over the total dimensionality of the speech features.
b. Back-propagation: the error given by the cost function is propagated backward through the network.
c. Weight modification: the weights are updated according to Eq. (21), in which the weight decay parameter controls the weight magnitude and prevents overfitting.
The above steps are repeated until the training is completed.
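The three fine-tuning phases can be illustrated on a very small network. The sketch below is an assumption-laden toy, not the paper's configuration: it uses one sigmoid hidden layer and a linear output layer, a mean-square-error cost with an added L2 weight decay term, and plain gradient descent on a single fixed mini-batch. The names `forward`, `loss` and `train_step`, the layer sizes, the learning rate, the decay value and the toy data are all hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass: sigmoid hidden layer followed by a linear output layer."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    out = [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(W2, b2)]
    return hidden, out

def loss(batch, params, decay):
    """Mean square error over the mini-batch plus an L2 weight decay penalty."""
    W1, b1, W2, b2 = params
    mse = 0.0
    for x, y in batch:
        _, out = forward(x, W1, b1, W2, b2)
        mse += sum((o - t) ** 2 for o, t in zip(out, y))
    l2 = sum(w * w for W in (W1, W2) for row in W for w in row)
    return mse / len(batch) + decay * l2

def train_step(batch, params, lr, decay):
    """Back-propagate the cost and apply one gradient descent update in place."""
    W1, b1, W2, b2 = params
    M = len(batch)
    gW1 = [[2.0 * decay * w for w in row] for row in W1]   # weight decay gradients
    gW2 = [[2.0 * decay * w for w in row] for row in W2]
    gb1, gb2 = [0.0] * len(b1), [0.0] * len(b2)
    for x, y in batch:
        hidden, out = forward(x, W1, b1, W2, b2)
        d_out = [2.0 * (o - t) / M for o, t in zip(out, y)]
        for k, g in enumerate(d_out):
            gb2[k] += g
            for i, h in enumerate(hidden):
                gW2[k][i] += g * h
        for i, h in enumerate(hidden):
            d_hid = sum(d_out[k] * W2[k][i] for k in range(len(d_out))) * h * (1.0 - h)
            gb1[i] += d_hid
            for j, xj in enumerate(x):
                gW1[i][j] += d_hid * xj
    for W, gW in ((W1, gW1), (W2, gW2)):
        for row, grow in zip(W, gW):
            for idx in range(len(row)):
                row[idx] -= lr * grow[idx]
    for b, gb in ((b1, gb1), (b2, gb2)):
        for idx in range(len(b)):
            b[idx] -= lr * gb[idx]

# Toy mini-batch (assumed data): 2 input features mapped to 1 target feature.
rng = random.Random(0)
params = (
    [[rng.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(3)], [0.0] * 3,
    [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(1)], [0.0],
)
batch = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]), ([1.0, 0.0], [1.0]), ([1.0, 1.0], [2.0])]
loss_before = loss(batch, params, decay=1e-4)
for _ in range(200):
    train_step(batch, params, lr=0.1, decay=1e-4)
loss_after = loss(batch, params, decay=1e-4)
```

In a real system the inputs would be features of the noisy speech, the targets would be the corresponding features of the denoised speech, and the initial weights would come from the pre-trained RBM stack rather than from random values.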

Enhancement phase
In the enhancement stage, the speech is fed forward through the deep neural network, and the mean network is used to obtain the output of the hidden layer. After the enhanced speech features are obtained, the speech waveform is reconstructed [20]; waveform reconstruction is the inverse of the preprocessing. Assuming that the pure speech obtained is ( )

Experimental analyses
The experimental hardware is an Intel Core i7-9750H processor with a maximum frequency of 4.5GHz and 16GB of memory. The software environment is MATLAB 2020b. In MATLAB, the denoising and enhancement effect on the speech of the online English learning platform is tested by adding noise at a given signal-to-noise ratio to the word "dark" from the TIMIT speech database. The time-domain waveforms of the pure speech signal and of the speech signal after adding noise are shown in Figures 3-4, and the frequency-domain waveforms are shown in Figures 5-6.

According to Figure 7 and Figure 8, after the noisy speech of the online English learning platform is denoised by the method in this paper, the time-domain and frequency-domain characteristics of the signal are recovered and are highly consistent with those of the original signal, which shows that the method has a good denoising effect on noisy speech signals.

To test the enhancement effect of the method on English speech signals, four types of unknown noise completely different from the original speech signal are used, namely A1 (white noise), A2 (pink noise), A3 (babble noise) and A4 (Leopard noise). These four types of noise are taken from the NOISEX-92 noise database. Noisy speech is then synthesized from the pure speech signals at three signal-to-noise ratios: -5dB, 0dB and 5dB. The enhancement effect of the method on the noisy English speech signals is shown in Table 1. It is mainly reflected by the PESQ (perceptual evaluation of speech quality) value, which ranges from -0.5 to 4.5; a larger value indicates better speech quality.

The data in Table 1 show that, in all of these scenarios, the PESQ values of the noisy English speech signals differ significantly before and after the method is applied. Before the method is applied, the PESQ values of the noisy English speech signals are all negative. After the method is applied, the PESQ values are all higher than 4.0, which shows that the method raises the PESQ value and improves the speech quality. Taking the 5dB noise of scene A1 as an example, the spectrograms of the English speech signal before and after enhancement by this method are shown in Figure 9.
Figure 9. Spectrograms of the speech signal before and after enhancement by this method: (a) before enhancement; (b) after enhancement.

It can be seen from Figure 9 that, before the method proposed in this paper is used to enhance the speech signal of the online English platform, the spectral characteristics of the speech signal are not obvious, indicating that the quality of the speech signal is not high. After the method is used, the spectral characteristics of the speech signal are prominent and the quality of the speech signal is improved [21].
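The synthesis of noisy test signals at -5dB, 0dB and 5dB described in this section relies on scaling the noise to a target signal-to-noise ratio. A minimal sketch of that scaling is given below; it is not the MATLAB code used in the experiments, and the function names, the sine-wave stand-in for speech and the Gaussian stand-in for white noise are assumptions for illustration.

```python
import math
import random

def power(x):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)

def mix_at_snr(clean, noise, target_db):
    """Scale `noise` so that the clean-to-noise power ratio equals `target_db`."""
    gain = math.sqrt(power(clean) / (power(noise) * 10.0 ** (target_db / 10.0)))
    scaled = [gain * v for v in noise]
    noisy = [s + v for s, v in zip(clean, scaled)]
    return noisy, scaled

def measured_snr_db(clean, noise):
    """Signal-to-noise ratio in dB between a clean signal and a noise signal."""
    return 10.0 * math.log10(power(clean) / power(noise))

# Toy stand-ins (assumed data): a 440 Hz tone as "speech", Gaussian samples as white noise.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * i / 16000) for i in range(1600)]
white = [rng.gauss(0.0, 1.0) for _ in range(1600)]
mixes = {snr: mix_at_snr(clean, white, snr) for snr in (-5, 0, 5)}
```

Each entry of `mixes` holds the noisy signal and the scaled noise that was added, so the achieved ratio can be checked with `measured_snr_db(clean, mixes[snr][1])`.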

Conclusion
In view of the shortcomings of traditional speech enhancement methods, this paper builds on previous research to propose a deep neural network-based speech enhancement method for online English learning platforms. The method differs from other deep neural network-based speech enhancement methods in that it first applies a dedicated denoising step based on the optimized VMD algorithm, which effectively removes the noise information in the speech signal and improves signal quality. The denoised speech signal is then input into the deep neural network, so that the speech of the online English learning platform is effectively enhanced, and the method is applied to the English learning application platform. Experiments from three perspectives (speech signal denoising, speech quality enhancement and speech spectrum information) confirm the usability of the method in this paper.