| Field | Content |
|---|---|
| Student | 楊宥芩 Yang, You-Chin |
| Thesis Title | 基於對比式訓練之輕量化開放詞彙的關鍵詞辨識 (Small-footprint Open-vocabulary Keyword Spotting Using Contrastive Learning) |
| Advisor | 陳柏琳 Chen, Berlin |
| Committee Members | 陳柏琳 Chen, Berlin; 王新民 Wang, Xin Min; 洪志偉 Hung, Jeih-weih; 江振宇 Chiang, Chen-Yu |
| Oral Defense Date | 2024/07/22 |
| Degree | 碩士 Master |
| Department | 資訊工程學系 Department of Computer Science and Information Engineering |
| Year of Publication | 2024 |
| Academic Year of Graduation | 112 |
| Language | Chinese |
| Number of Pages | 41 |
| Keywords (Chinese) | 關鍵詞辨識、零樣本、對比學習、開放詞彙、自定義 |
| Keywords (English) | keyword spotting, user-defined, zero-shot, contrastive learning, open-vocabulary |
| Research Methods | Comparative research, observational research |
| DOI URL | http://doi.org/10.6345/NTNU202401799 |
| Thesis Type | Academic thesis |
As smart devices become increasingly widespread, keyword spotting technology has grown in importance. Its goal is to determine whether specific keywords are present in continuous speech. The task is challenging because it requires not only accurate detection of the target keywords but also effective rejection of non-target keywords. With the rapid development of deep neural networks, keyword spotting based on deep neural networks has achieved significant gains in accuracy. However, traditional deep-neural-network-based keyword spotting systems require a large amount of speech data containing the target keywords for training; as a result, they can only recognize a fixed set of keywords, and replacing a keyword after training is difficult. If a keyword needs to be replaced, new speech data for the target keyword must be collected and the model retrained. This thesis focuses on implementing an open-vocabulary keyword spotting system. The system uses a self-attention mechanism to fuse speech features with text embedding vectors into effective joint embeddings, and a discriminator computes a confidence score for each joint embedding; the system decides whether to activate based on these confidence scores. In addition, contrastive learning is employed to mitigate the false-alarm problem that arises when multiple keywords are enrolled and an incorrect keyword receives an excessively high confidence score. For pre-training the audio encoder, in addition to an audio encoder pre-trained on a classification task over a corpus of 5000 keyword classes, we also adopt a more parameter-efficient audio encoder architecture that reduces the parameter count by 100K and is pre-trained on a classification task over 500 keyword classes. Our approach achieves 94.08% accuracy in recognizing 10 new keywords that never appeared during training, a 12% improvement over the baseline methods.
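The abstract outlines three components: an attention module that fuses audio features with keyword text embeddings into a joint embedding, a discriminator that turns the joint embedding into a confidence score, and a contrastive objective that keeps the scores of non-matching enrolled keywords low. The following is a minimal PyTorch sketch of that idea only; the module names, dimensions, pooling strategy, and the InfoNCE-style loss are illustrative assumptions, not the thesis's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingKWS(nn.Module):
    """Hypothetical audio-text keyword spotter: attention-based joint
    embedding plus a discriminator that outputs a confidence score."""

    def __init__(self, dim=128, n_heads=4):
        super().__init__()
        # Attention lets each keyword token attend to the audio frames,
        # producing a keyword-conditioned joint embedding.
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # Discriminator maps the pooled joint embedding to a single logit:
        # the confidence that the utterance matches the keyword text.
        self.discriminator = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, audio_feats, text_embs):
        # audio_feats: (B, T_audio, dim) frame features from the audio encoder
        # text_embs:   (B, T_text,  dim) token embeddings of the keyword
        joint, _ = self.attn(query=text_embs, key=audio_feats, value=audio_feats)
        pooled = joint.mean(dim=1)                      # (B, dim)
        logit = self.discriminator(pooled).squeeze(-1)  # (B,)
        return pooled, logit

def contrastive_loss(audio_vecs, text_vecs, temperature=0.07):
    """InfoNCE-style loss: each utterance should score highest against its
    own keyword text and lower against the other enrolled keywords,
    which suppresses false alarms when multiple keywords are set."""
    a = F.normalize(audio_vecs, dim=-1)
    t = F.normalize(text_vecs, dim=-1)
    logits = a @ t.T / temperature                 # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

# Usage sketch with random tensors standing in for encoder outputs.
model = JointEmbeddingKWS()
audio_feats = torch.randn(8, 100, 128)   # 8 utterances, 100 frames each
text_embs = torch.randn(8, 12, 128)      # matching keyword token embeddings
joint_vec, confidence_logit = model(audio_feats, text_embs)
loss = F.binary_cross_entropy_with_logits(
    confidence_logit, torch.ones(8)       # positive (matching) pairs here
) + contrastive_loss(joint_vec, text_embs.mean(dim=1))
loss.backward()
```

In this sketch the detection decision would simply threshold `confidence_logit` at inference time; the thesis's actual score calibration and training schedule are not reproduced here.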