| Field | Value |
| --- | --- |
| Graduate Student | 王泓壬 Wang, Hung-Ren |
| Thesis Title | 會議語音辨識之上下文語言模型 Reranking 研究 (Contextualize Language Model Reranking for Meeting Speech Recognition) |
| Advisor | 陳柏琳 Chen, Berlin |
| Oral Defense Committee | 陳冠宇 Chen, Guan-Yu; 陳柏琳 Chen, Berlin; 曾厚強 Tseng, Hou-Chiang; 洪志偉 Hung, Chih-Wei |
| Defense Date | 2023/07/21 |
| Degree | Master |
| Department | 資訊工程學系 Department of Computer Science and Information Engineering |
| Year of Publication | 2023 |
| Academic Year of Graduation | 111 (2022-2023) |
| Language | Chinese |
| Number of Pages | 44 |
| Keywords (Chinese) | 自動語音辨識、語言模型、對話語音、N-Best 列表、列表資訊、重新排序、跨句資訊、大型生成式語言模型、ChatGPT |
| Keywords (English) | Automatic Speech Recognition, Language Modeling, Conversational Speech, N-Best Lists, List Information, Reranking, Cross-utterance Information, Large Generative Language Models, ChatGPT |
| Research Method | Experimental design |
| DOI URL | http://doi.org/10.6345/NTNU202301357 |
| Thesis Type | Academic thesis |
ASR N-best reranking is a technique used in automatic speech recognition (ASR) systems to improve the accuracy of the transcribed output. For each input audio segment, the ASR system generates multiple candidate hypotheses, known as an N-best list. BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art language model that has shown remarkable performance on a variety of natural language processing (NLP) tasks such as text classification, named entity recognition, and question answering. Because BERT captures contextual information and produces high-quality representations of the input text, it is well suited to ASR N-best reranking. To further strengthen the predictions of the BERT-based reranker, we explore richer semantic information and training objectives along four directions: (1) effective methods for injecting information about the grammatical quality of a hypothesis into the model; (2) effective methods for indirectly incorporating information from the entire N-best list into the model; (3) the feasibility of classification, ranking, and multi-task objectives for model training; and (4) strengthening the textual information extracted by the model.
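As a concrete reference point for BERT-based N-best reranking, the following is a minimal sketch that scores each hypothesis with BERT's pseudo-log-likelihood and interpolates that score with the ASR score. The checkpoint name, the interpolation weight `alpha`, and the scoring scheme are illustrative assumptions, not the exact model or configuration used in the thesis.

```python
# Minimal sketch: rescore an N-best list with BERT pseudo-log-likelihood (PLL).
# The checkpoint and the interpolation weight are illustrative assumptions.
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log-probabilities of each token when it is masked in turn."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, ids.size(0) - 1):              # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

def rerank(nbest, alpha=0.3):
    """nbest: list of (hypothesis, asr_score); returns hypotheses best-first."""
    scored = [(hyp, asr + alpha * pseudo_log_likelihood(hyp)) for hyp, asr in nbest]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```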
Large generative language models (LLMs) have demonstrated excellent generalization ability across a wide range of language-related tasks. In this study, we also evaluate the feasibility of applying LLMs such as ChatGPT to the ASR N-best reranking task.
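To illustrate what LLM-based reranking can look like in practice, the sketch below prompts a chat model to pick the most plausible hypothesis from an N-best list. The prompt wording, the model name, and the answer parsing are illustrative assumptions (the sketch assumes the `openai` Python package, v1.x) and not the prompting scheme used in the thesis.

```python
# Minimal sketch: ask a chat LLM to choose the most plausible hypothesis.
# Prompt wording, model name, and answer parsing are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_rerank(hypotheses: list[str]) -> str:
    numbered = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
    prompt = (
        "The following are candidate transcriptions of the same meeting "
        "utterance. Answer with only the number of the most likely one.\n"
        + numbered
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    answer = response.choices[0].message.content.strip()
    digits = "".join(ch for ch in answer if ch.isdigit())
    index = int(digits) if digits else 1
    index = max(1, min(index, len(hypotheses)))     # guard against odd replies
    return hypotheses[index - 1]
```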
We conduct a series of experiments on the AMI meeting corpus. The results show that the proposed methods are effective in reducing the word error rate (WER), achieving an absolute WER reduction of up to 1.37% over the baseline ASR system.
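For completeness, word error rate is the word-level edit distance between the reference transcript and the hypothesis, normalized by the reference length; a minimal sketch of the computation follows (the function name and example are illustrative).

```python
# Minimal sketch of word error rate (WER): word-level Levenshtein distance
# between reference and hypothesis, divided by the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution out of four reference words -> WER = 0.25
assert wer("we met at noon", "we met at new") == 0.25
```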
ASR (automatic speech recognition) N-best reranking aims to improve the accuracy of ASR systems by reordering their candidate outputs, known as N-best lists; the task is to select the most likely transcription from the N-best list on the basis of additional contextual information. BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art language model that has shown remarkable performance on various natural language processing (NLP) tasks, and it is used for ASR N-best reranking because of its ability to capture contextual information and generate high-quality representations of the input text. We explore enhanced semantic information and training objectives in four parts: (1) effective methods for incorporating information about the grammatical quality of a hypothesis into the model; (2) effective methods for indirectly incorporating the whole N-best list into the model; (3) the feasibility of classification, ranking, and multi-task training objectives; and (4) strengthening the textual information extracted by the model. Large generative language models (LLMs) have demonstrated excellent generalization ability on a variety of language-related tasks, and we also evaluate the use of LLMs such as ChatGPT for the ASR N-best reranking task. We conduct a series of experiments on the AMI meeting corpus, and the results show that the proposed methods are effective in reducing the word error rate, with an absolute WER reduction of up to 1.37% over the baseline ASR system.
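On point (3), a ranking objective trains the reranker to reproduce the quality ordering of the whole N-best list rather than to judge single hypotheses in isolation. The sketch below shows a ListNet-style listwise loss; the tensor shapes, the use of negative WER as the target quality signal, and the multi-task combination in the final comment are illustrative assumptions rather than the thesis' actual training objectives.

```python
# Minimal sketch of a listwise ranking objective for an N-best reranker.
# Shapes and the negative-WER target are illustrative assumptions.
import torch
import torch.nn.functional as F

def listwise_loss(model_scores: torch.Tensor, target_quality: torch.Tensor) -> torch.Tensor:
    """
    model_scores:   (batch, N) scores the reranker assigns to each hypothesis.
    target_quality: (batch, N) target signal, e.g. the negative WER of each
                    hypothesis, so that better hypotheses receive larger values.
    Returns the cross-entropy between the target and model score distributions.
    """
    target_dist = F.softmax(target_quality, dim=-1)
    log_model_dist = F.log_softmax(model_scores, dim=-1)
    return -(target_dist * log_model_dist).sum(dim=-1).mean()

# A multi-task variant could add a classification term that identifies the
# oracle (lowest-WER) hypothesis, e.g.:
#   loss = listwise_loss(scores, -wers) + lam * F.cross_entropy(scores, oracle_idx)
```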