Conformer-Performer: An Efficient Architecture for Voice Activity Detection
DOI: https://doi.org/10.58526/jsret.v4i4.979

Keywords: Voice Activity Detection, Conformer, Performer, Deep Learning, Linear Attention

Abstract
Voice Activity Detection (VAD) is a crucial pre-processing step for speech technologies, yet standard Conformer architectures suffer from the quadratic computational complexity of self-attention. This study introduces the Conformer-Performer, a novel architecture that replaces standard multi-head self-attention with the Fast Attention Via positive Orthogonal Random features (FAVOR+) mechanism to achieve linear complexity. The objective was to develop an efficient VAD model that maintains accuracy high enough for resource-constrained applications. The model was trained on the multilingual FLEURS dataset using a teacher-student approach and extensive data augmentation. Experimental results show that the Conformer-Performer achieves an F1-score of 98.29%, highly competitive with the standard Conformer's 98.41%, while reducing peak GPU memory usage 7.8-fold and accelerating CPU inference 3.46-fold. The proposed model also significantly outperforms the SileroVAD baseline. These findings confirm that the Conformer-Performer offers a compelling balance of accuracy and efficiency, making it well suited to real-time, on-device speech processing.
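For intuition, the core of the FAVOR+ mechanism can be sketched in a few lines of NumPy: queries and keys are mapped through positive random features so that attention factorizes into matrix products whose cost is linear in sequence length. This is a minimal sketch under simplifying assumptions (plain Gaussian rather than orthogonal random features, an illustrative feature count), not the authors' implementation:

import numpy as np

def favor_plus_attention(Q, K, V, num_features=256, seed=0):
    # Minimal sketch of FAVOR+ linear attention (Choromanski et al., 2022).
    # Cost is O(L * m * d) rather than O(L^2 * d): the L x L attention
    # matrix is never materialized. num_features (m) is illustrative.
    L, d = Q.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((num_features, d))  # full FAVOR+ uses orthogonal blocks

    def phi(X):
        # Positive random features for the softmax kernel:
        # phi(x) = exp(w.x - ||x||^2 / 2) / sqrt(m), with x scaled by d^(-1/4)
        # so that phi(Q) phi(K)^T approximates exp(Q K^T / sqrt(d)).
        X = X / d ** 0.25
        sq_norm = 0.5 * np.sum(X ** 2, axis=-1, keepdims=True)
        return np.exp(X @ W.T - sq_norm) / np.sqrt(num_features)

    Qp, Kp = phi(Q), phi(K)                    # (L, m)
    KV = Kp.T @ V                              # (m, d): keys/values summarized once
    normalizer = Qp @ Kp.sum(axis=0)[:, None]  # (L, 1): row-wise softmax denominator
    return (Qp @ KV) / (normalizer + 1e-6)

# Toy usage: 1,000 frames of 64-dim features, attention in linear memory.
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((1000, 64)) for _ in range(3))
out = favor_plus_attention(Q, K, V)
print(out.shape)  # (1000, 64)

The associativity trick in the last three lines of the function is the source of the memory and CPU-time savings reported in the abstract: the (L, L) score matrix of standard attention is replaced by an (m, d) summary that is independent of sequence length.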
References
Ball, J. (2023). Voice Activity Detection (VAD) in Noisy Environments. http://arxiv.org/abs/2312.05815
Braun, S., & Tashev, I. (2021). On training targets for noise-robust voice activity detection. http://arxiv.org/abs/2102.07445
Chaudhuri, S., Roth, J., Ellis, D. P. W., Gallagher, A., Kaver, L., Marvin, R., Pantofaru, C., Reale, N., Reid, L. G., Wilson, K., & Xi, Z. (2018). AVA-Speech: A Densely Labeled Dataset of Speech Activity in Movies. http://arxiv.org/abs/1808.00606
Choromanski, K., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J., Mohiuddin, A., Kaiser, L., Belanger, D., Colwell, L., & Weller, A. (2022). Rethinking Attention with Performers. http://arxiv.org/abs/2009.14794
Conneau, A., Ma, M., Khanuja, S., Zhang, Y., Axelrod, V., Dalmia, S., Riesa, J., Rivera, C., & Bapna, A. (2022). FLEURS: Few-shot Learning Evaluation of Universal Representations of Speech. http://arxiv.org/abs/2205.12446
Dinkel, H., Wang, S., Xu, X., Wu, M., & Yu, K. (2021). Voice activity detection in the wild: A data-driven approach using teacher-student training. https://doi.org/10.1109/TASLP.2021.3073596
Gulati, A., Qin, J., Chiu, C.-C., Parmar, N., Zhang, Y., Yu, J., Han, W., Wang, S., Zhang, Z., Wu, Y., & Pang, R. (2020). Conformer: Convolution-augmented Transformer for Speech Recognition. http://arxiv.org/abs/2005.08100
Jia, F., Majumdar, S., & Ginsburg, B. (2021). MarbleNet: Deep 1D Time-Channel Separable Convolutional Neural Network for Voice Activity Detection. http://arxiv.org/abs/2010.13886
Ko, T., Peddinti, V., Povey, D., Seltzer, M. L., & Khudanpur, S. (2017). A study on data augmentation of reverberant speech for robust speech recognition. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Köpüklü, O., & Taseska, M. (2022). ResectNet: An Efficient Architecture for Voice Activity Detection on Mobile Devices. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2022-September, 5363–5367. https://doi.org/10.21437/Interspeech.2022-820
Luckenbaugh, J., Abplanalp, S., Gonzalez, R., Fulford, D., Gard, D., & Busso, C. (2021). Voice activity detection with teacher-student domain emulation. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 6, 4521–4525. https://doi.org/10.21437/Interspeech.2021-1234
Ma, M., Koizumi, Y., Karita, S., Zen, H., Riesa, J., Ishikawa, H., & Bacchiani, M. (2024). FLEURS-R: A Restored Multilingual Speech Corpus for Generation Tasks. https://doi.org/10.48550/arXiv.2408.06227
Park, D. S., Chan, W., Zhang, Y., Chiu, C.-C., Zoph, B., Cubuk, E. D., & Le, Q. V. (2019). SpecAugment: A simple data augmentation method for automatic speech recognition. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2019-September, 2613–2617. https://doi.org/10.21437/Interspeech.2019-2680
Piczak, K. J. (2015). ESC: Dataset for environmental sound classification. MM 2015 - Proceedings of the 2015 ACM Multimedia Conference, 1015–1018. https://doi.org/10.1145/2733373.2806390
Salamon, J., Jacoby, C., & Bello, J. P. (2014). A dataset and taxonomy for urban sound research. MM 2014 - Proceedings of the 2014 ACM Conference on Multimedia, 1041–1044. https://doi.org/10.1145/2647868.2655045
Sharma, M., Joshi, S., Chatterjee, T., & Hamid, R. (2022). A comprehensive empirical review of modern voice activity detection approaches for movies and TV shows. In Neurocomputing (Vol. 494, pp. 116–131). Elsevier B.V. https://doi.org/10.1016/j.neucom.2022.04.084
Silero Team. (2021). Silero Models: pre-trained enterprise-grade STT / TTS models and benchmarks. GitHub. https://github.com/snakers4/silero-models
Snyder, D., Chen, G., & Povey, D. (2015). MUSAN: A Music, Speech, and Noise Corpus. http://arxiv.org/abs/1510.08484
Wang, C., Song, Y., Liu, H., Liu, H., Liu, J., Li, B., & Yuan, X. (2022). Real-Time Vehicle Sound Detection System Based on Depthwise Separable Convolution Neural Network and Spectrogram Augmentation. Remote Sensing, 14(19). https://doi.org/10.3390/rs14194848
Wilkinson, N., & Niesler, T. (2021). A Hybrid CNN-BiLSTM Voice Activity Detector. http://arxiv.org/abs/2103.03529
Yang, Q., Liu, Q., Li, N., Ge, M., Song, Z., & Li, H. (2024). sVAD: A Robust, Low-Power, and Light-Weight Voice Activity Detection with Spiking Neural Networks. http://arxiv.org/abs/2403.05772
Zhou, H., Du, J., Chen, H., Jing, Z., Xiong, S., & Lee, C. H. (2021). Audio-visual information fusion using cross-modal teacher-student learning for voice activity detection in realistic environments. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 6, 4550–4554. https://doi.org/10.21437/Interspeech.2021-592
License
Copyright (c) 2025 Echa Apriliyanto, Anita Fira Waluyo

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.


