
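Graduate student: 林韋廷
Lin, Wei-Ting
Thesis title: 探究新穎深度學習方法於中英文混語語音辨識之使用
Several Novel Deep Learning Approaches to Mandarin-English Code-switching Automatic Speech Recognition
Advisor: 陳柏琳
Chen, Berlin
Oral defense committee: 王新民
Wang, Hsin-Min
洪志偉
Hung, Jeih-Weih
王家慶
Wang, Jia-Ching
陳柏琳
Chen, Berlin
Oral defense date: 2021/08/30
Degree: Master
Department: Department of Computer Science and Information Engineering
Publication year: 2021
Academic year of graduation: 109 (ROC calendar)
Language: Chinese
Number of pages: 56
Chinese keywords: 語碼轉換 (code-switching), 中英文混語語音辨識 (Mandarin-English code-switching speech recognition), 三元組損失 (triplet loss), Conformer, 語言模型 (language model)
English keywords: code-switching, Mandarin-English automatic speech recognition, triplet loss, Conformer, language model
Research methods: experimental design, grounded theory, comparative research, content analysis
DOI URL: http://doi.org/10.6345/NTNU202101427
Thesis type: Academic thesis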
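  • In multilingual societies, a single conversation often contains several languages. Even in monolingual societies, globalization means that words from other languages are frequently mixed into conversation. This phenomenon is called code-switching (CS). CS automatic speech recognition (ASR) must recognize two or more languages at once, but it differs from multilingual speech recognition: speakers switch languages not only between sentences but, more often, within a sentence, which is why CS ASR has recently drawn attention as a hard problem. This thesis studies improvements in two directions, end-to-end CS ASR and DNN-HMM CS ASR. The former focuses on the Mandarin-English code-switching corpus SEAME. We adopt the recently proposed Conformer model and design a language-masked multi-head attention architecture for the decoder, so that each language can learn its own characteristics and monolingual recognition is strengthened. In addition, to keep the Mandarin and English feature vectors learned by the model from lying close together, we train the model with a triplet loss. For the latter, we propose language model combination strategies at several stages, applied to multiple corpora from enterprise application domains. The experimental setup has two Mandarin-English CS language models and one monolingual Mandarin language model; the CS language models are trained on data from the same domain as the test sets, while the monolingual model is trained on a large general Mandarin corpus. By combining language models at different stages, we examine whether ASR can draw on the strengths of each language model and perform well across tasks. Three combination strategies are studied: N-gram language model combination, decoding graph combination, and word lattice combination. A series of experiments confirms that combining language models does let CS ASR perform well on the different test sets. The end-to-end methods do not yield a significant improvement in token error rate (TER) on the test sets, but further analysis of the data shows that they still have a slight effect.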
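The record does not say how the N-gram language models are merged. One common choice is linear interpolation of their probabilities, and the following is a minimal sketch under that assumption. The function name `interpolate`, the toy unigram tables, and the weights 0.7 and 0.3 are all hypothetical illustrations, not values from the thesis.

```python
def interpolate(models, weights):
    """Linearly interpolate word-probability tables.

    models  : list of dicts mapping word -> probability
    weights : list of non-negative floats that sum to 1
    Assumption: linear interpolation stands in for the thesis's
    unspecified N-gram combination method.
    """
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # The combined vocabulary is the union of all model vocabularies.
    vocab = set().union(*models)
    # A word missing from a model contributes probability 0 from that model.
    return {
        word: sum(lam * model.get(word, 0.0) for lam, model in zip(weights, models))
        for word in vocab
    }


# Toy unigram tables (hypothetical): a domain-specific code-switching model
# and a general Mandarin model.
cs_domain_lm = {"我": 0.4, "要": 0.3, "meeting": 0.3}
general_zh_lm = {"我": 0.5, "要": 0.3, "會議": 0.2}

mixed = interpolate([cs_domain_lm, general_zh_lm], [0.7, 0.3])
```

The interpolated table covers both vocabularies, so the English word known only to the code-switching model and the Mandarin word known only to the general model both keep non-zero probability. In practice the weights would be tuned on held-out data from the target domain.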

    In multilingual societies, a single conversation commonly contains multiple languages. Because of globalization, switching between languages is also pervasive in daily dialogue in monolingual societies. This phenomenon is called code-switching (CS). In CS automatic speech recognition (ASR), two or more languages must be recognized at the same time. Unlike multilingual speech recognition, however, speakers switch languages not only between sentences but, more often, within sentences, so CS ASR is regarded as a difficult problem and has recently drawn attention. The research in this thesis is divided into two parts, namely improvements to end-to-end CS ASR and to DNN-HMM CS ASR, with the former focusing on the Mandarin-English CS corpus SEAME. We use the recently proposed Conformer model and construct a language-masked multi-head attention architecture that is applied to the decoder, aiming to let each language learn its own attributes and to enhance monolingual recognition ability. Moreover, to prevent the Mandarin and English embeddings learned by the model from being similar to each other, triplet loss is used to train the model. For the latter, we put forward several strategies that combine language models at different stages, evaluated on CS speech corpora compiled from different industrial application scenarios. Our experimental configuration consists of two CS (i.e., mixed Mandarin Chinese and English) language models and one monolingual (i.e., Mandarin Chinese) language model, where the two CS language models are domain-specific and the monolingual language model is trained on a general text collection. By combining language models at different stages of the ASR process, we investigate whether the ASR system can integrate the strengths of the various language models to achieve improved performance across different tasks. More specifically, three strategies for combining language models are investigated, namely simple N-gram language model combination, decoding graph combination, and word lattice combination. A series of ASR experiments confirms the utility of these LM combination strategies, whereas the end-to-end CS ASR method yields no significant Token Error Rate (TER) reduction on the test sets. Nevertheless, other analyses of the data show that our method still has some minor effect.
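    As an illustration of the triplet loss mentioned in the abstract, here is a minimal sketch of the standard formulation: the loss is zero once the anchor is closer to the positive than to the negative by at least a margin. The record does not describe how the thesis chooses anchors, positives, negatives, or the margin, so the function names, the two-dimensional toy embeddings, and `margin=1.0` below are assumptions made only for this example.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    Assumption: anchor and positive come from the same language,
    negative comes from the other language.
    """
    return max(
        0.0,
        squared_distance(anchor, positive)
        - squared_distance(anchor, negative)
        + margin,
    )


# Hypothetical 2-D embeddings: two Mandarin tokens and two English tokens.
mandarin_a = [1.0, 0.0]
mandarin_b = [0.9, 0.1]
english_near = [0.8, 0.2]   # lies close to the Mandarin cluster
english_far = [-1.0, 0.0]   # already well separated

loss_near = triplet_loss(mandarin_a, mandarin_b, english_near)
loss_far = triplet_loss(mandarin_a, mandarin_b, english_far)
```

    With these toy values, the English embedding that sits near the Mandarin pair produces a positive loss, which pushes it away during training, while the well-separated one contributes nothing.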

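    Chapter 1 Introduction
        Section 1 Research Background
            1.1 Language Identification (LID)
            1.2 Design of the Phone Set or Lexicon (Dictionary)
            1.3 Monolingual Data
            1.4 Multi-task Learning and Model Architecture Improvements
        Section 2 Research Contributions
        Section 3 Thesis Organization
    Chapter 2 Literature Review
        Section 1 Language Identification
        Section 2 Constrained Output Vectors
    Chapter 3 Baseline End-to-End Speech Recognition Model Architectures
        Section 1 Connectionist temporal classification (CTC)
        Section 2 Attention Mechanism
        Section 3 Transformer
            3.1 Encoder
            3.2 Scaled Dot-Product Attention
            3.3 Multi-Head Attention
            3.4 Position-wise Feed-Forward Network
            3.5 Decoder
            3.6 Model Architecture
        Section 4 Hybrid CTC-Attention Model
        Section 5 Hybrid CTC-Attention Model Architecture
        Section 6 Conformer
            6.1 Multi-Head Attention
            6.2 Convolution Module
            6.3 Position-wise Feed-Forward Network
            6.4 Model Architecture
    Chapter 4 Improvements to End-to-End Code-Switching Speech Recognition
        Section 1 Language-masked decoder
        Section 2 Triplet loss
    Chapter 5 Improvements to DNN-HMM Code-Switching Speech Recognition
        Section 1 Language Model Combination
            1.1 N-gram Language Model Combination
            1.2 Graph Combination (WFST Combination)
            1.3 Lattice Combination
    Chapter 6 Experimental Setup and Results for End-to-End Code-Switching Speech Recognition
        Section 1 Datasets
        Section 2 Experimental Setup
        Section 3 Experimental Results
            3.1 Analysis of Experimental Results
            3.2 Comparison of Model Recognition Results
    Chapter 7 Experimental Setup and Results for DNN-HMM Code-Switching Speech Recognition
        Section 1 Datasets
        Section 2 Experimental Setup
        Section 3 Experimental Results
            3.1 Analysis of Experimental Results
            3.2 Comparison of Model Recognition Results
    Chapter 8 Conclusion and Future Work
    References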

