Exploring the Impact of Mismatch Conditions, Noisy Backgrounds, and Speaker Health on Convolutional Autoencoder-Based Speaker Recognition System with Limited Dataset
DOI:
https://doi.org/10.4108/eetsis.5697
Keywords:
MFCC, pitch, jitter, shimmer, convolutional autoencoder
Abstract
This paper presents a novel approach to improving the success rate and accuracy of speaker recognition and identification systems. The methodology employs data augmentation to enrich a small dataset of audio recordings from five speakers, covering both male and female voices. The Python programming language is used for data processing, and a convolutional autoencoder is chosen as the model. Spectrograms convert the speech signals into images, which serve as input for training the autoencoder. The developed speaker recognition system is compared against traditional systems that rely on Mel-Frequency Cepstral Coefficient (MFCC) feature extraction. In addition to addressing the challenges of a small dataset, the paper explores the impact of a "mismatch condition" by using different durations of the audio signal during the training and testing phases. Through experiments with various activation and loss functions, the optimal pair for the small dataset is identified, yielding a high success rate of 92.4% under matched conditions. Traditionally, MFCCs have been widely used for this purpose. However, the COVID-19 pandemic has drawn attention to the virus's impact on the human body, particularly on areas relevant to speech such as the chest, throat, and vocal cords. COVID-19 symptoms such as coughing, breathing difficulties, and throat swelling raise questions about the virus's influence on MFCC, pitch, jitter, and shimmer features. This research therefore investigates the potential effects of COVID-19 on these features, contributing valuable insights to the development of robust speaker recognition systems.
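As a rough illustration of the preprocessing the abstract describes, the sketch below converts a speech signal into a spectrogram "image" (the input representation for the convolutional autoencoder) and computes the standard local definitions of jitter and shimmer from cycle-to-cycle pitch periods and peak amplitudes. This is a minimal numpy sketch under assumed parameter values (16 kHz sample rate, 512-point FFT, 160-sample hop); it is not the authors' exact pipeline, and the function names are illustrative.

```python
import numpy as np

def spectrogram_db(signal, n_fft=512, hop=160):
    """Magnitude spectrogram in dB via a Hann-windowed STFT.

    Turns a 1-D speech signal into a 2-D array (frequency bins x time
    frames) that a convolutional autoencoder can treat as an image.
    """
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frame = signal[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    spec = np.array(frames).T  # shape: (n_fft // 2 + 1, n_frames)
    return 20 * np.log10(spec + 1e-10)  # small floor avoids log(0)

def jitter_shimmer(periods, amplitudes):
    """Local jitter and shimmer.

    Jitter: mean absolute difference between consecutive pitch periods,
    relative to the mean period. Shimmer: the same measure applied to
    consecutive cycle peak amplitudes.
    """
    periods = np.asarray(periods, dtype=float)
    amplitudes = np.asarray(amplitudes, dtype=float)
    jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods)
    shimmer = np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)
    return jitter, shimmer

# Example: one second of a 440 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
spec = spectrogram_db(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # 257 frequency bins, 97 frames
```

A perfectly steady voice would give jitter and shimmer of zero; pathological or COVID-affected voices are expected to show elevated values, which is why the paper singles out these features.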
License
Copyright (c) 2023 Arundhati Niwatkar, Yuvraj Kanse, Ajay Kumar Kushwaha
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This is an open access article distributed under the terms of the CC BY-NC-SA 4.0, which permits copying, redistributing, remixing, transforming, and building upon the material in any medium, so long as the original work is properly cited.