Lightweight Keyword Spotting with Inter-Domain Interaction and Attention for Real-Time Voice-Controlled Robotics
DOI: https://doi.org/10.4108/airo.7877

Keywords: TinyML, Speech Commands, Channel Attention, Keyword Spotting

Abstract
This study introduces a novel lightweight Keyword Spotting (KWS) model optimized for deployment on resource-constrained microcontrollers, with potential applications in robotic control and end-effector operations. The proposed model employs inter-domain interaction to extract features from both Mel-frequency cepstral coefficients (MFCCs) and temporal audio characteristics, complemented by an attention mechanism that prioritizes the most relevant audio segments for keyword detection. Achieving 93.70% accuracy on the 12-command Google Speech Commands v2 dataset, the model outperforms existing benchmarks. It also demonstrates strong efficiency in inference speed (0.359 seconds) and resource utilization (34.9 KB peak RAM and 98.7 KB flash memory), offering roughly 3x faster inference and a smaller memory footprint than the DS-CNN-S model. These attributes make it particularly suitable for real-time voice-command applications in low-power robotic systems, enabling intuitive and responsive control of robotic arms, end-effectors, and navigation systems. In this work, however, the KWS model is demonstrated in a simple non-destructive testing system for controlling sensor movement. This research lays the groundwork for advancing voice-activated robotic technologies on resource-limited hardware platforms.
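To illustrate the general idea described in the abstract, the sketch below shows a minimal, hypothetical Keras model with two convolutional branches over the MFCC input (one along the time axis, one along the coefficient axis), an element-wise inter-domain fusion of the two branches, and a squeeze-and-excitation style channel-attention block. The input shape (49 frames x 13 MFCCs), layer sizes, fusion choice, and 12-class head are assumptions for illustration only; the paper's exact architecture is not reproduced here.

```python
# Minimal, hypothetical sketch: two-branch MFCC feature extraction with
# inter-domain interaction and squeeze-and-excitation channel attention.
# All shapes and layer sizes are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 12              # 12-command task (assumption)
TIME_STEPS, N_MFCC = 49, 13   # typical 1 s clip with common MFCC settings (assumption)

def channel_attention(x, reduction=4):
    """Squeeze-and-excitation: reweight feature channels by global context."""
    c = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                  # squeeze to per-channel stats
    s = layers.Dense(c // reduction, activation="relu")(s)  # bottleneck excitation
    s = layers.Dense(c, activation="sigmoid")(s)            # per-channel weights in [0, 1]
    s = layers.Reshape((1, 1, c))(s)
    return layers.Multiply()([x, s])                        # rescale the feature maps

def build_kws_model():
    inp = layers.Input(shape=(TIME_STEPS, N_MFCC, 1))       # MFCC "image": time x coeffs
    # Temporal branch: convolve along the time axis only.
    t = layers.Conv2D(16, (9, 1), padding="same", activation="relu")(inp)
    # Spectral branch: convolve along the MFCC (frequency) axis only.
    f = layers.Conv2D(16, (1, 5), padding="same", activation="relu")(inp)
    # Inter-domain interaction: element-wise fusion of the two branches.
    x = layers.Multiply()([t, f])
    x = layers.Conv2D(32, (3, 3), strides=2, padding="same", activation="relu")(x)
    x = channel_attention(x)                                 # emphasize informative channels
    x = layers.GlobalAveragePooling2D()(x)
    out = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(inp, out)

model = build_kws_model()
model.summary()
```

A squeeze-and-excitation block is one common way to realize channel attention at negligible parameter cost, which is one plausible reason an attention mechanism can fit within a flash budget on the order of 100 KB.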
License
Copyright (c) 2025 Hien Vu Pham, Thuy Phuong Vu, Huong Thi Nguyen, Minhhuy Le

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
This is an open access article distributed under the terms of the CC BY-NC-SA 4.0 license, which permits copying, redistributing, remixing, transforming, and building upon the material in any medium, so long as the original work is properly cited.