Lightweight Keyword Spotting with Inter-Domain Interaction and Attention for Real-Time Voice-Controlled Robotics

Authors

H. V. Pham, T. P. Vu, H. T. Nguyen, and M. Le

DOI:

https://doi.org/10.4108/airo.7877

Keywords:

TinyML, Speech Commands, Channel Attention, Keyword Spotting

Abstract

This study introduces a lightweight Keyword Spotting (KWS) model optimized for deployment on resource-constrained microcontrollers, with applications in robotic control and end-effector operation. The model uses inter-domain interaction to extract features jointly from Mel-frequency cepstral coefficients (MFCCs) and temporal audio characteristics, complemented by an attention mechanism that prioritizes the audio segments most relevant to keyword detection. It achieves 93.70% accuracy on the 12-command subset of the Google Speech Commands v2 dataset, outperforming existing benchmarks, and is efficient in both inference speed (0.359 s) and resource usage (34.9 KB peak RAM, 98.7 KB flash), offering 3× faster inference and a smaller memory footprint than the DS-CNN-S model. These attributes make it well suited to real-time voice-command applications in low-power robotic systems, enabling intuitive and responsive control of robotic arms, end-effectors, and navigation systems. In this work, the KWS model is demonstrated in a simple non-destructive testing system, where voice commands control sensor movement. This research lays the groundwork for voice-activated robotic technologies on resource-limited hardware platforms.
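To make the abstract's architecture concrete, below is a minimal, hypothetical TensorFlow/Keras sketch of a two-branch KWS model: a depthwise-separable 2-D convolution over the MFCC grid (spectral domain), a 1-D convolution along the frame axis (temporal domain), an element-wise inter-domain interaction between the two branches, and squeeze-and-excitation channel attention. The input shape (49 frames × 10 MFCCs), layer widths, and the broadcast-multiply interaction operator are illustrative assumptions, not the architecture published in the paper.

# Hypothetical sketch of a two-branch KWS model with inter-domain
# interaction and channel attention. Shapes, widths, and the interaction
# operator are assumptions for illustration, not the published design.
from tensorflow.keras import layers, models

NUM_CLASSES = 12          # 12-command Speech Commands v2 task
FRAMES, N_MFCC = 49, 10   # assumed MFCC feature grid

def channel_attention(x, reduction=4):
    # Squeeze-and-excitation: reweight channels by global context.
    c = x.shape[-1]
    w = layers.GlobalAveragePooling2D()(x)            # squeeze: (B, C)
    w = layers.Dense(c // reduction, activation="relu")(w)
    w = layers.Dense(c, activation="sigmoid")(w)      # excite: per-channel gate
    return layers.Multiply()([x, layers.Reshape((1, 1, c))(w)])

inp = layers.Input(shape=(FRAMES, N_MFCC, 1))

# Spectral branch: depthwise-separable 2-D convolution over the MFCC grid.
spec = layers.SeparableConv2D(32, 3, padding="same", activation="relu")(inp)

# Temporal branch: 1-D convolution along the frame (time) axis.
tmp = layers.Reshape((FRAMES, N_MFCC))(inp)
tmp = layers.Conv1D(32, 5, padding="same", activation="relu")(tmp)
tmp = layers.Reshape((FRAMES, 1, 32))(tmp)

# Inter-domain interaction: broadcast-multiply temporal features into the
# spectral feature map so each domain modulates the other's activations.
x = layers.Multiply()([spec, tmp])
x = channel_attention(x)

x = layers.GlobalAveragePooling2D()(x)
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = models.Model(inp, out)
model.summary()  # small enough to quantize and deploy via TFLite Micro

The squeeze-and-excitation block costs only two small dense layers per use, which is one reason channel attention is a common choice for TinyML models that must fit within tens of kilobytes of RAM.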

Author Biography

Minhhuy Le, Phenikaa University (Vietnam)

Minhhuy Le received a bachelor's degree in mechatronics engineering from the Hanoi University of Science and Technology, Vietnam, in 2009, and master's and doctoral degrees in control and instrumentation engineering from Chosun University, Korea, in 2011 and 2014, respectively. He is currently an Assistant Professor at the Faculty of Electrical and Electronic Engineering, Phenikaa University, Vietnam. His research interests include magnetic sensors, non-destructive testing, UWB radar, machine learning, and deep learning.

Published

18-03-2025

How to Cite

[1] H. V. Pham, T. P. Vu, H. T. Nguyen, and M. Le, "Lightweight Keyword Spotting with Inter-Domain Interaction and Attention for Real-Time Voice-Controlled Robotics", EAI Endorsed Trans AI Robotics, vol. 4, Mar. 2025. doi: 10.4108/airo.7877.