Conformer network-guided speech recognition for smart home Internet of Things system

Shengjun Huang; Dong-hyun Kim

doi:10.4108/eetiot.9818

Authors

Shengjun Huang Youngsan University, Nanjing Technical Vocational College
Dong-hyun Kim Youngsan University

DOI:

https://doi.org/10.4108/eetiot.9818

Keywords:

Internet of Things, conformer network, asymmetric convolution, speech recognition, gated feed-forward neural network

Abstract

As the world gradually enters an aging society, traditional Internet of Things (IoT) systems remain operationally complex and insufficiently user-friendly. To address this, this paper proposes a Conformer network-guided speech recognition system for the smart home IoT. First, a voice recognition module with an embedded processor is introduced. It provides conventional voice recognition and also transmits voice to the cloud, overcoming the limited computing and storage capacity of the main control chip. Then, using IoT technology, the complex algorithms are offloaded to the cloud for execution, which significantly improves voice recognition. The distributed storage of the cloud also allows a categorized, user-specific voice database to be built, providing a large data basis for learning each user's speech. To address the shortcomings of the existing Conformer speech recognition model, namely insufficient extraction of time-frequency features, a redundant model structure, and a large number of parameters, this paper proposes a speech recognition model based on asymmetric convolution and a gated feed-forward neural network. Asymmetric convolutions of different sizes perform multi-scale fusion and down-sampling on the time-frequency features of the speech sequence, which both strengthens the model's time-frequency feature extraction and reduces information loss during down-sampling. A gated feed-forward module replaces the two half-step feed-forward networks in Conformer, reducing the number of network parameters and simplifying the model structure. Finally, drawing on a large amount of data, the system gradually builds a personalized speech recognition library for the user through learning.
Experiments verify the effectiveness of the proposed intelligent fusion-based IoT system in terms of speech recognition accuracy, computing power, and the intelligence level of voice interaction.
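
The two model components named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the abstract does not give kernel sizes, the fusion rule, the stride, or the gating function, so the choices below (1xk and kx1 kernels with k = 3 and 5, fixed averaging weights instead of learned ones, fusion by averaging the branches, stride 2, and a sigmoid gate in the style of a gated linear unit) are all assumptions, and every function name is hypothetical.

```python
import math
import random


def conv2d_same(x, kernel, stride=2):
    """Zero-padded 2-D cross-correlation of x (time x frequency), sampled every `stride` steps."""
    t_len, f_len = len(x), len(x[0])
    kt, kf = len(kernel), len(kernel[0])
    out = []
    for t in range(0, t_len, stride):
        row = []
        for f in range(0, f_len, stride):
            acc = 0.0
            for i in range(kt):
                for j in range(kf):
                    tt, ff = t + i - kt // 2, f + j - kf // 2
                    if 0 <= tt < t_len and 0 <= ff < f_len:
                        acc += kernel[i][j] * x[tt][ff]
            row.append(acc)
        out.append(row)
    return out


def asymmetric_downsample(x, sizes=(3, 5), stride=2):
    """Average the outputs of 1xk and kx1 branches (assumed fusion rule)."""
    branches = []
    for k in sizes:
        # 1 x k kernel: looks along the frequency axis only.
        branches.append(conv2d_same(x, [[1.0 / k] * k], stride))
        # k x 1 kernel: looks along the time axis only.
        branches.append(conv2d_same(x, [[1.0 / k] for _ in range(k)], stride))
    n = len(branches)
    rows, cols = len(branches[0]), len(branches[0][0])
    return [[sum(b[t][f] for b in branches) / n for f in range(cols)]
            for t in range(rows)]


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def gated_ffn(x, w_gate, w_value, w_out):
    """Gated feed-forward layer: w_out @ (sigmoid(w_gate @ x) * (w_value @ x))."""
    gate = [1.0 / (1.0 + math.exp(-g)) for g in matvec(w_gate, x)]
    value = matvec(w_value, x)
    hidden = [g * v for g, v in zip(gate, value)]
    return matvec(w_out, hidden)


random.seed(0)
# Toy "spectrogram": 8 frames x 6 frequency bins.
spec = [[random.random() for _ in range(6)] for _ in range(8)]
down = asymmetric_downsample(spec)  # 4 frames x 3 bins after stride-2 sampling

d_model, d_hidden = 3, 4
w_gate = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
w_value = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_hidden)]
w_out = [[random.uniform(-1, 1) for _ in range(d_hidden)] for _ in range(d_model)]
frame_out = gated_ffn(down[0], w_gate, w_value, w_out)  # one frame through the gated layer
```

Because each branch uses a kernel that is long in one direction and one cell wide in the other, a 1xk plus kx1 pair costs 2k weights where a square kxk kernel would cost k squared, which is the usual motivation for asymmetric convolution. The gated layer is a single module, in contrast with the two half-step feed-forward networks that sandwich the attention and convolution modules in a standard Conformer block.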
