Graduate student: | 王詣承 Wang, Yi-Cheng |
---|---|
Thesis title: | 端到端情境化語音辨識技術之研究 A Study on End-to-End Contextualized Automatic Speech Recognition |
Advisor: | 陳柏琳 Chen, Berlin |
Oral defense committee: | 陳柏琳 Chen, Berlin; 王新民 Wang, Hsin-Min; 洪志偉 Hung, Jeih-Weih; 江振宇 Chiang, Chen-Yu |
Oral defense date: | 2024/07/22 |
Degree: | Master |
Department: | Department of Computer Science and Information Engineering |
Year of publication: | 2024 |
Academic year of graduation: | 112 |
Language: | English |
Number of pages: | 70 |
Keywords (Chinese): | 語音辨識、情境化語音辨識、長尾分布、自督式模型、負樣本訓練 |
Keywords (English): | Automatic speech recognition, Contextualized automatic speech recognition, Long-tailed distribution, Self-supervised models, Hard negative training |
Research method: | Experimental design |
DOI URL: | http://doi.org/10.6345/NTNU202401844 |
Thesis type: | Academic thesis |
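With the spread of smart home devices and smartphone assistants, voice interaction technology has become an indispensable part of daily life. Advances in end-to-end (E2E) neural network models have significantly improved the performance of automatic speech recognition (ASR) models, which now surpass conventional hybrid models on many benchmarks. However, E2E ASR models still struggle to recognize domain-specific vocabulary (for example, contact names and place names), a weakness that matters especially for downstream applications such as natural language understanding. This study aims to address the performance degradation of these models in real-world scenarios by enhancing contextualized ASR models.
We first analyze in depth the recognition errors of current state-of-the-art E2E ASR models and identify the main problems: insufficient prior knowledge and insufficient capacity to capture contextual information. To address them, we propose the XPhoneAdapter model, which incorporates the novel self-supervised phoneme encoder XPhoneBERT to provide richer phoneme-aware features. We also propose remedies for the context/no-context imbalance and long-tailed distribution problems, and introduce the Q-HNW method for hard negative training to improve model robustness.
The results show that combining fine-grained phoneme-aware self-supervised features with enhanced hard negative training achieves up to an 18% relative word error rate (WER) reduction and a 35% relative improvement in the error rate on rare words (C-WER) on the Librispeech dataset. Experiments on the AISHELL-1 benchmark dataset further confirm the effectiveness of the proposed methods, showing significant performance gains.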
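The main contributions of this thesis are as follows:
1) A detailed analysis of the recognition errors of state-of-the-art E2E ASR models, identifying the mismatch in vocabulary distribution between training and test conditions as a key factor.
2) Highlighting two main factors that hinder the generalization of ASR models: insufficient prior knowledge and insufficient capacity to capture contextual information.
3) Proposing the XPhoneAdapter model, which introduces the novel self-supervised phoneme encoder XPhoneBERT to provide richer phoneme-aware features.
4) Proposing a context-balanced adaptation method for the context/no-context imbalance and long-tailed distribution problems, to improve model performance on low-frequency context words.
5) Introducing the Q-HNW method for negative-sample training, to strengthen model robustness in challenging recognition scenarios.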
In the era of smart home devices and phone-based smart assistants, voice interaction technology has become increasingly prevalent. Advances in end-to-end (E2E) neural models have significantly improved the performance of automatic speech recognition (ASR), surpassing conventional hybrid models on various benchmark tasks. Despite these advances, E2E ASR models still struggle to recognize domain-specific phrases accurately, particularly named entities such as contact names and geo-locations, which are critical for downstream tasks such as natural language understanding. This study aims to address the performance decline of ASR models in real-world scenarios by enhancing contextualized ASR (CASR) models.
Our research investigates the limitations of current CASR models and identifies two key issues: insufficient prior knowledge and inadequate model capacity to capture contextual information. We propose the XPhoneAdapter, which integrates a novel self-supervised phoneme encoder, XPhoneBERT, to provide richer phoneme-aware representations. We also address the context/no-context imbalance and long-tailed distribution problems through a context-balanced adaptation approach, and introduce the Q-HNW method for hard negative training to enhance model robustness.
Our findings demonstrate that combining fine-grained phoneme-aware self-supervised learning (SSL) features with enhanced hard negative training can achieve up to an 18% relative word error rate (WER) reduction and a 35% relative improvement in context word error rate (C-WER) on the LibriSpeech dataset. Experiments on the AISHELL-1 benchmark dataset further validate the effectiveness of the proposed methods, showing significant performance improvements.
This thesis contributes to the field of ASR by providing an in-depth analysis of recognition errors, proposing novel methods to enhance CASR models, and demonstrating significant performance gains, thereby paving the way for more reliable and accurate speech recognition in real-world applications.
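The abstract reports its gains as relative reductions in WER. As a minimal sketch of what that quantity means, the Python below computes a word-level WER from an edit distance and then the relative reduction between a baseline and an improved system. The function names, the example sentences, and the numbers 0.10 and 0.082 are illustrative assumptions, not values or code from the thesis, and the thesis's exact scoring procedure for C-WER is not given in this record.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref_text, hyp_text):
    """Word error rate: edit distance divided by the reference length."""
    ref = ref_text.split()
    hyp = hyp_text.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline, improved):
    """Relative error reduction, e.g. 0.18 for an 18% relative reduction."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical utterance with a misrecognized contact name.
    print(wer("call anna kowalski now", "call anna kowalsky now"))
    # Hypothetical error rates, chosen only to illustrate the arithmetic.
    print(relative_reduction(0.10, 0.082))
```

Under these assumed numbers, a drop from 10.0% to 8.2% WER is a 1.8-point absolute reduction but an 18% relative reduction, which is the form in which the abstract states its results.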