
Author: 陳思澄 (Chen, Ssu-Cheng)
Title: 使用詞向量表示與概念資訊於中文大詞彙連續語音辨識之語言模型調適
Exploring Word Embedding and Concept Information for Language Model Adaptation in Mandarin Large Vocabulary Continuous Speech Recognition
Advisor: 陳柏琳 (Chen, Berlin)
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of Publication: 2015
Academic Year of Graduation: 103
Language: Chinese
Number of Pages: 77
Keywords: speech recognition, language modeling, deep learning, word representation, concept model
Document Type: Academic thesis
    Research on deep learning has experienced a surge of interest in recent years. Alongside the rapid development of deep learning related technologies, various distributed representation methods have been proposed to embed the words of a vocabulary as vectors in a lower-dimensional space. Based on such distributed representations, the semantic relationship between any pair of words can be uncovered through similarity computations on the associated word vectors. With this background, this thesis explores a novel use of distributed representations of words for language modeling (LM) in speech recognition. First, word vectors are employed to represent the words in the search history and the upcoming candidate words during the speech recognition process, so as to dynamically adapt the language model on top of such vector representations and capture richer semantic information among words. Second, we extend the recently proposed concept language model (CLM) by conducting relevant training data selection at the sentence level instead of the document level. By doing so, the concept classes of the CLM can be more accurately estimated while redundant or irrelevant information is eliminated. In addition, since the resulting concept classes need to be dynamically selected and linearly combined to form the CLM during the speech recognition process, we determine the relatedness of each concept class to the test utterance based on word representations derived with either the continuous bag-of-words model (CBOW) or the skip-gram model, in the hope that the word representations capture the semantic relationships among the words within each concept class. Finally, we combine the above LM adaptation methods for better speech recognition performance. Extensive experiments carried out on the MATBN (Mandarin Across Taiwan Broadcast News) corpus, a broadcast news corpus collected from the Public Television Service of Taiwan, demonstrate the utility of the proposed LM adaptation methods in relation to several state-of-the-art baselines.
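
    To make the adaptation scheme described above more concrete, the following minimal Python sketch illustrates one way the two ideas could be realized. It is an illustrative assumption rather than the thesis's actual implementation: pre-trained CBOW or skip-gram word vectors are assumed to be available as a dictionary of NumPy arrays, the dynamically growing decoding history is represented by the average of its word vectors, candidate words are scored by cosine similarity to that history vector and interpolated with a baseline n-gram probability, and concept classes are ranked by the similarity of their (hypothetical) centroid vectors to the same history vector. All function names, the candidate vocabulary list, and the interpolation weight lam are assumptions made for illustration.

    # Minimal illustrative sketch (assumptions, not the thesis's implementation):
    # word-embedding-based language model adaptation and concept-class selection.
    import numpy as np

    def cosine(u, v):
        # Cosine similarity between two vectors.
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    def history_vector(history, embeddings):
        # Represent the dynamically growing decoding history as the average
        # of the word vectors of the words it contains.
        vecs = [embeddings[w] for w in history if w in embeddings]
        return np.mean(vecs, axis=0) if vecs else None

    def embedding_prob(word, history, embeddings, vocab):
        # Turn cosine similarities between the history vector and every
        # candidate word in `vocab` (a list) into a normalized probability;
        # a softmax is one simple choice among several possible normalizations.
        h = history_vector(history, embeddings)
        if h is None or word not in embeddings or word not in vocab:
            return 1.0 / len(vocab)
        scores = np.array([cosine(h, embeddings[w]) if w in embeddings else 0.0
                           for w in vocab])
        probs = np.exp(scores) / np.exp(scores).sum()
        return float(probs[vocab.index(word)])

    def adapted_prob(word, history, ngram_prob, embeddings, vocab, lam=0.7):
        # Linearly interpolate the baseline n-gram probability with the
        # embedding-based probability; lam is an illustrative weight.
        return lam * ngram_prob + (1.0 - lam) * embedding_prob(word, history, embeddings, vocab)

    def select_concept_classes(history, class_centroids, embeddings, top_k=3):
        # Rank concept classes by the similarity of their centroid vectors to
        # the history vector; the language models of the top-ranked classes
        # would then be linearly combined into a dynamic concept LM.
        h = history_vector(history, embeddings)
        if h is None:
            return []
        ranked = sorted(class_centroids.items(), key=lambda kv: -cosine(h, kv[1]))
        return [name for name, _ in ranked[:top_k]]

    In an actual recognizer, scores of this kind would be applied when rescoring N-best lists or word graphs produced by the baseline system, in the spirit of the methods discussed in Chapter 5 of the thesis.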

    Chapter 1  Introduction 1
    1.1 Research Background 1
    1.2 Overview of Speech Recognition 3
    1.3 Language Modeling Research 5
    1.4 Contributions of the Thesis 8
    1.5 Organization of the Thesis 9
    Chapter 2  Literature Review and Related Methods 10
    2.1 Language Model Adaptation 10
    2.2 Evolution of Language Models 11
    2.3 N-gram Language Models 17
    2.4 Topic Models 18
    2.4.1 Latent Semantic Analysis 19
    2.4.2 Probabilistic Latent Semantic Analysis 20
    2.4.3 Latent Dirichlet Allocation 21
    2.4.4 Word Topic Models 23
    2.5 Relevance Models 24
    2.6 Recurrent Neural Network Language Models 26
    2.7 Long Short-Term Memory Neural Networks 27
    Chapter 3  Incorporating Concept Information into Language Models 30
    3.1 Concept Language Models 30
    3.2 Word-based Concept Language Models 31
    3.3 Cluster-based Concept Language Models 33
    Chapter 4  Incorporating Word Representations into Language Models 35
    4.1 Word Vector Representations 35
    4.2 Continuous Bag-of-Words Model 36
    4.3 Skip-gram Model 38
    4.4 Hierarchical Softmax 39
    4.5 Negative Sampling 42
    4.6 Distributed Memory Model 44
    4.7 Distributed Bag-of-Words Model 45
    Chapter 5  Combining Word Representations and Concept Information for Language Modeling 46
    5.1 Applying Word Vectors to Language Modeling 46
    5.2 Applying Word Representations to Word Graph Search 47
    5.3 Combining Word Representations and Cluster-based Concept Information in Language Modeling 49
    Chapter 6  Experimental Setup and Results 50
    6.1 Experimental Setup 50
    6.1.1 The NTNU Large Vocabulary Continuous Speech Recognition System 50
    6.1.1.1 Feature Extraction 50
    6.1.1.2 Acoustic Models 50
    6.1.1.3 Lexicon Construction 51
    6.1.1.4 Lexical Tree Copying and Search 52
    6.1.1.5 Word Graph Search and N-Best List Generation 53
    6.1.2 Language Model Evaluation 54
    6.1.2.1 Perplexity 54
    6.1.2.2 Recognition Error Rate 55
    6.2 Experimental Corpora 56
    6.3 Experimental Results and Discussion 58
    6.3.1 Baseline Experiments 58
    6.3.2 Relevance Models 59
    6.3.3 Recurrent Neural Network Language Models 60
    6.3.4 Word Representations Applied to Cluster-based Concept Language Models 62
    6.3.5 Word Representations Applied to Word Graph Search 64
    6.3.6 Comparison of Results across Language Models 65
    Chapter 7  Conclusions and Future Work 67
    References 69
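
    Sections 6.1.2.1 and 6.1.2.2 above evaluate language models in terms of perplexity and recognition error rate. For reference, the standard definitions of these two measures are sketched below in LaTeX; this is the usual formulation, and the thesis may use a character-level variant of the error rate, as is common for Mandarin.

    PP(W) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}}
          = \left( \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1, \ldots, w_{i-1})} \right)^{\frac{1}{N}}

    \mathrm{ErrorRate} = \frac{S + D + I}{N} \times 100\%

    where N is the number of words (or characters) in the reference transcripts and S, D, and I are the numbers of substitution, deletion, and insertion errors produced by the recognizer.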

    [1] K.-F. Lee, “Automatic Speech Recognition: The Development of the SPHINX Recognition System,” Boston: Kluwer Academic Publishers, 1989.
    [2] C. Manning and H. Schutze, “Foundations of statistical natural language processing,” Cambridge, MA: MIT Press, 1999.
    [3] P. F. Brown, V. J. Della Pietra, S. A. Della Pietra and R. L. Mercer, “The mathematics of statistical machine translation: Parameter estimation,” Computational Linguistics, Vol. 19, No. 2, pp. 263–311, 1993.
    [4] C. Zhai and J. Lafferty, “A study of smoothing methods for language models applied to ad hoc information retrieval,” in Proceedings of the ACM Special Interest Group on Information Retrieval, pp. 334–342, 2001.
    [5] X. Zhu and R. Rosenfeld, “Improving trigram language modeling with the world wide web,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 941–944, 2006.
    [6] C. Chelba, D. Bikel, M. Shugrina, P. Nguyen and S. Kumar, “Large scale language modeling in automatic speech recognition,” Technical report, Google, 2012.
    [7] W.-Y. Ma and K.-J. Chen, “Introduction to CKIP Chinese word segmentation system for the first international Chinese word segmentation bakeoff,” in Proceedings of the Second SIGHAN Workshop on Chinese Language Processing, pp. 168–171, 2003 (http://ckipsvr.iis.sinica.edu.tw/).
    [8] F. Jelinek, “Up from trigrams! The struggle for improved language models,” in Proceedings of the International Speech Communication Association, pp. 1037–1040, 1991.
    [9] G. Tur and A. Stolcke, “Unsupervised language model adaptation,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp.173–176, 2007.
    [10] J. R. Bellegarda, “Statistical language model adaptation: review and perspectives,” Speech Communication, Vol. 42, No. 1, pp. 93–108, 2004.
    [11] G. Tur and A. Stolcke, “Unsupervised Language Model Adaptation For Meeting Recognition.” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2007.
    [12] M. Novak and R. Mammone, “Use of Non-negative Matrix Factorization for Language Model Adaptation in a Lecture Transcription Task,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2001.
    [13] L. Chen, J.-L. Gauvain, L. Lamel, and G. Adda, “Unsupervised Language Model Adaptation for Broadcast News,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003.
    [14] D. D. Lewis, “Naive (Bayes) at Forty: The Independence Assumption in Information Retrieval.” in Proceedings of the International Conference on Machine Learning, 1998.
    [15] M. Collins, “Discriminative reranking for natural language parsing.” in Proceedings of the International Conference on Machine Learning, 2000.
    [16] J. Gao, H. Suzuki, and W. Yuan, “An Empirical Study on Language Model Adaptation,” ACM Transactions on Asian Language Information Processing, Vol. 5, No. 3, pp. 209–227, 2005.
    [17] 劉鳳萍, “使用鑑別式語言模型於語音辨識結果重新排序” (Reranking speech recognition results with discriminative language models), Master's thesis, Department of Computer Science and Information Engineering, National Taiwan Normal University, 2009.
    [18] J. Goodman, “A bit of progress in language modeling (extended version),” Machine Learning and Applied Statistics Group, Technique Report, Microsoft, 2001.
    [19] R. Rosenfeld, “Two decades of statistical language modeling: where do we go from here,” Proceedings of the IEEE, Vol. 88, No. 8, pp. 1270–1278, 2000.
    [20] J. R. Bellegarda, “A multispan language modeling framework for large vocabulary speech recognition,” IEEE Transactions on Acoustic, Speech and Signal Processing, Vol. 6, No. 5, pp. 456–467, 1998.
    [21] I. J. Good, “The population frequencies of species and the estimation of population parameters,” Biometrika, Vol. 40, No. 3–4, pp. 237–264, 1953.
    [22] R. Kneser and H. Ney, “Improved backing-off for N-gram language modeling,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 181–184, 1995.
    [23] P. F. Brown, V. J. Della Pietra, P. V. deSouza, J. C. Lai and R. L. Mercer, “Class-based N-gram models of natural language,” Computational Linguistics, Vol. 18, No. 4, pp. 467–479, 1992.
    [24] X. Huang, F. Alleva, H.-W. Hon, M.-Y. Hwang, K.-F. Lee and R. Rosenfeld, “The SPHINX-II speech recognition system: An overview,” Computer, Speech, and Language, Vol. 7, No. 2, pp. 137–148, 1993.
    [25] R. Lau, R. Rosenfeld and S. Roukos, “Trigger-based language models: a maximum entropy approach,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 45–48, 1993.
    [26] L. Saul and F. Pereira, “Aggregate and mixed-order Markov models for statistical language processing,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 1997.
    [27] C. Chelba, “A structured language model,” in Proceedings of the Annual Meeting on Association for Computational Linguistics, pp. 498–450, 1997.
    [28] C. Chelba and F. Jelinek, “Exploiting syntactic structure for language modeling,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 225–231, 1998.
    [29] J. R. Bellegarda, “A latent semantic analysis framework for large-span language modeling,” in Proceedings of the European Conference on Speech Communication and Technology, pp. 1451–1454, 1997.
    [30] T. Hofmann, “Probabilistic latent semantic indexing,” in Proceedings of the ACM Special Interest Group on Information Retrieval, pp. 50–57, 1999.
    [31] M. Novak and R. Mammone, “Use of Non-negative Matrix Factorization for Language Model Adaptation in a Lecture Transcription Task,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2001.
    [32] Z. Chen, K. F. Lee and M. J. Li, “Discriminative training on language model,” in Proceedings of the International Speech Communication Association, pp. 493–496, 2000.
    [33] H.-K. J. Kuo, E. Fosler-Lussier, H. Jiang, and C.-H. Lee, “Discriminative Training of Language Models for Speech Recognition,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2002.
    [34] Y. Bengio, R. Ducharme, and P. Vincent, “A neural probabilistic language model,” Journal of Machine Learning Research, Vol. 3, pp. 1137–1155, 2003.
    [35] H.-S. Chiu and B. Chen, “Word Topical Mixture Models for Dynamic Language Model Adaptation,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2007.
    [36] M. Afify, O. Siohan, and R. Sarikaya, “Gaussian Mixture Language Models for Speech Recognition,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2007.
    [37] T. Mikolov, M. Karafiát, L. Burget, J. Černocký and S. Khudanpur, “Recurrent neural network based language model,” in Proceedings of the International Speech Communication Association, pp. 1045–1048, 2010.
    [38] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, Vol. 5, No. 2, pp. 157–166, 1994.
    [39] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, Vol. 9, No. 8, pp. 1735–1780, 1997.
    [40] G. E. Hinton, “Learning distributed representations of concepts,” in Proceedings of the Eighth Annual Conference of the Cognitive Science Society, pp. 1–12, 1986.
    [41] R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in Proceedings of the 25th International Conference on Machine Learning, pp. 160–167, 2008.
    [42] A. Mnih and G. E. Hinton, “Three new graphical models for statistical language modelling,” in Proceedings of the International Conference on Machine Learning, pp. 641–648, 2007.
    [43] A. Mnih and G. E. Hinton, “A scalable hierarchical distributed language model,” in Advances in Neural Information Processing Systems 21, pp. 1081–1088, 2009.
    [44] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in Proceedings of the International Conference on Learning Representations, 2013.
    [45] F. Morin and Y. Bengio, “Hierarchical probabilistic neural network language model,” In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 246–252, 2005.
    [46] A. Mnih and K. Kavukcuoglu, “Learning word embeddings efficiently with noise-contrastive estimation,” Advances in Neural Information Processing Systems, pp. 2265–2273, 2013.
    [47] Q. Le and T. Mikolov, “Distributed Representations of Sentences and Documents,” in Proceedings of the International Conference on Machine Learning, 2014.
    [48] J. R. Bellegarda, “A latent semantic analysis framework for large-span language modeling,” in Proceedings of the European Conference on Speech Communication and Technology (Eurospeech), Rhodes, Greece, 1997.
    [49] J. R. Bellegarda, “Latent Semantic Mapping.” IEEE Signal Processing Magazine, Vol. 22. No. 5, pp. 70–80, 2005.
    [50] D. Gildea and T. Hofmann, “Topic-based language models using EM.” in Proceedings of the International Speech Communication Association, 1999.
    [51] D. Blei, A. Ng, and M. Jordan, “Latent Dirichlet Allocation,” Journal of Machine Learning Research, 2003.
    [52] 邱炫盛, “利用主題與位置相關語言模型於中文連續語音辨識” (Exploiting topic- and position-dependent language models for Mandarin continuous speech recognition), Master's thesis, Department of Computer Science and Information Engineering, National Taiwan Normal University, 2007.
    [53] V. Lavrenko and W. Croft, “Relevance-based language models,” in Proceedings of the ACM Special Interest Group on Information Retrieval, pp. 120–127, 2001.
    [54] R. Baeza-Yates and B. Ribeiro-Neto, “Modern Information Retrieval: the Concepts and Technology behind Search,” Addison-Wesley Professional, 2011.
    [55] K.-Y. Chen and B. Chen, “Relevance language modeling for speech recognition,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 5568–5571, 2011.
    [56] B. Chen and K.-Y. Chen, “Leveraging relevance cues for language modeling in speech recognition,” Information Processing & Management, Vol. 49, No 4, pp. 807–816, 2013.
    [57] 郝柏翰, “運用鄰近與概念資訊於語言模型調適之研究” (A study on leveraging proximity and concept information for language model adaptation), Master's thesis, Department of Computer Science and Information Engineering, National Taiwan Normal University, 2014.
    [58] S. Ortmanns, H. Ney, and X. Aubert, “A word graph algorithm for large vocabulary continuous speech recognition,” Computer Speech and Language, Vol. 11, pp. 43–72, 1997.
    [59] S. Kullback and R. Leibler, “On information and sufficiency,” Annals of Mathematical Statistics, Vol. 22, No. 1, pp. 79–86, 1951.
    [60] R. Baeza-Yates and B. Ribeiro-Neto, “Modern Information Retrieval: the Concepts and Technology behind Search,” Addison-Wesley Professional, 2011.
    [61] C. X. Zhai, “Statistical language models for information retrieval: A critical review,” Foundations and Trends in Information Retrieval, Vol. 2, No. 3, pp. 137–213, 2008.
    [62] D. Povey and P. C. Woodland, “Minimum phone error and I-smoothing for improved discriminative training,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 105–108, 2002.
    [63] B. Chen, J.-W. Kuo, and W.-H. Tsai, “Lightly supervised and data-driven approaches to Mandarin broadcast news transcription,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 777–780, 2004.
    [64] H.-M. Wang, B. Chen, J.-W. Kuo, and S.-S. Cheng, “MATBN: A Mandarin Chinese broadcast news corpus,” International Journal of Computational Linguistics & Chinese Language Processing, Vol. 10, No. 1, pp. 219–235, 2005.
    [65] S.-H. Liu, F.-H. Chu, S.-H. Lin, H.-S. Lee, and B. Chen, “Training data selection for improving discriminative training of acoustic models,” in Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding, pp. 284–289, 2007.
    [66] A. Stolcke, “SRI Language Modeling Toolkit,” 2000. Available at: http://www.speech.sri.com/projects/srilm/.
