| Graduate student | 黃邦烜 Bang-Xuan Huang |
|---|---|
| Thesis title | 遞迴式類神經網路語言模型使用額外資訊於語音辨識之研究 Recurrent Neural Network-based Language Modeling with Extra Information Cues for Speech Recognition |
| Advisor | 陳柏琳 Chen, Berlin |
| Degree | Master |
| Department | Department of Computer Science and Information Engineering |
| Year of publication | 2012 |
| Academic year of graduation | 100 |
| Language | Chinese |
| Number of pages | 78 |
| Chinese keywords | 語音辨識、語言模型、前饋式類神經網路、遞迴式類神經網路 |
| English keywords | automatic speech recognition, language modeling, feed-forward neural network, recurrent neural networks |
| Thesis type | Academic thesis |
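After training on large amounts of text, a language model can capture the regularities of natural language and, given the history word sequence, discriminate which word should come next; it therefore plays an indispensable role in automatic speech recognition (ASR) systems. The traditional statistical N-gram language model is a common choice: it predicts how likely the next word is from the N-1 words that precede it. When N is small, the model lacks long-distance information; when N is large, insufficient training data leads to the data sparseness problem. In recent years, the rise of neural networks has prompted much related research, and the neural network language model is one example. Of particular interest is that a neural network language model can address the data sparseness problem: it estimates the probability of the next word by mapping word sequences into a continuous space, so it is not hampered by word-sequence combinations that never appeared in the training corpus. Beyond the traditional feed-forward neural network language model, researchers have recently also built language models with recurrent neural networks, which aim to store the history information recursively and thereby obtain long-distance information.
This thesis studies the use of recurrent neural network language models for Mandarin large vocabulary continuous speech recognition. It explores the additional use of relevance information to capture long-distance information more effectively, and it dynamically adjusts the language model according to the characteristics of each utterance. Experimental results show that using relevance information in the recurrent neural network language model improves large vocabulary continuous speech recognition performance to a considerable degree.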
Language modeling (LM) attempts to capture the regularities of natural languages. A language model is trained on large amounts of text so that it can help predict the most likely upcoming word given a word history, and it therefore plays an indispensable role in automatic speech recognition (ASR). The N-gram language model, which determines the probability of an upcoming word given the preceding N-1 words, is the most prominently used. When N is small, a typical N-gram language model cannot capture long-span lexical information. When N is large, on the other hand, it suffers from the data sparseness problem because the training data are insufficient. For this reason, the neural network-based language model (NNLM), and more specifically the feed-forward NNLM, has attracted considerable attention from researchers and practitioners in recent years. The feed-forward NNLM can mitigate the data sparseness problem because it estimates the probability of an upcoming word by mapping the word and its history into a continuous space. In addition to the feed-forward NNLM, a recent trend is to construct the language model for ASR with a recurrent neural network (RNNLM), which makes efficient use of the long-span lexical information in the word history by processing it recursively.
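The data sparseness problem mentioned above can be illustrated with a minimal sketch. It uses an unsmoothed maximum-likelihood bigram model (N = 2) on a toy corpus; the corpus, the function names `train_bigram` and `bigram_prob`, and the boundary tokens `<s>` and `</s>` are assumptions made for illustration and are not taken from the thesis.

```python
from collections import Counter


def train_bigram(sentences):
    """Count histories and bigrams from a list of tokenized sentences."""
    history_counts = Counter()
    bigram_counts = Counter()
    for tokens in sentences:
        # Hypothetical sentence-boundary markers.
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            history_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return history_counts, bigram_counts


def bigram_prob(prev, word, history_counts, bigram_counts):
    """Unsmoothed maximum-likelihood estimate of P(word | prev)."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / history_counts[prev]


# Toy corpus, for illustration only.
corpus = [
    ["speech", "recognition", "works"],
    ["speech", "models", "work"],
]
hist, bi = train_bigram(corpus)

# A bigram seen in training receives a non-zero estimate.
p_seen = bigram_prob("speech", "recognition", hist, bi)
# A plausible bigram that never occurred receives zero probability.
p_unseen = bigram_prob("recognition", "models", hist, bi)
```

Here `p_unseen` is zero even though "recognition models" is a plausible word pair. Unseen combinations become more frequent as N grows, which is why N-gram models are normally smoothed and why mapping words into a continuous space is attractive.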
In this thesis, we investigate how to leverage extra information relevant to the word history in the RNNLM, and we also devise a dynamic model estimation method that yields an utterance-specific RNNLM. Our experiments on a large vocabulary continuous speech recognition (LVCSR) task indicate that the proposed methods are promising and perform well in comparison with existing LM methods.
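As a rough illustration of how a recurrent language model carries history forward, the following sketch implements one step of an Elman-style network in plain Python. The abstract does not say how the extra information enters the network, so feeding an auxiliary vector `extra` into the hidden layer through a matrix `F` is purely an assumption, as are all names, sizes, and random weights here. It should not be read as the architecture used in the thesis.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(vocab_size, hidden_size, extra_size, seed=0):
    """Small random weights; all shapes are hypothetical."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "U": mat(hidden_size, vocab_size),   # current word -> hidden
        "W": mat(hidden_size, hidden_size),  # previous hidden -> hidden
        "F": mat(hidden_size, extra_size),   # extra cue -> hidden (assumed)
        "V": mat(vocab_size, hidden_size),   # hidden -> output
    }


def rnn_step(word_id, hidden, extra, params):
    """One recurrent step: returns (next-word distribution, new hidden state)."""
    # A one-hot word input selects column word_id of U.
    from_word = [row[word_id] for row in params["U"]]
    from_hidden = matvec(params["W"], hidden)
    from_extra = matvec(params["F"], extra)
    new_hidden = [
        1.0 / (1.0 + math.exp(-(a + b + c)))  # sigmoid activation
        for a, b, c in zip(from_word, from_hidden, from_extra)
    ]
    probs = softmax(matvec(params["V"], new_hidden))
    return probs, new_hidden


params = init_params(vocab_size=5, hidden_size=4, extra_size=3)
hidden = [0.0] * 4
extra = [0.2, 0.5, 0.3]  # hypothetical cue vector, e.g. topic weights

# The hidden state is fed back at every step, so it summarizes the whole history.
for word_id in [1, 3, 2]:
    probs, hidden = rnn_step(word_id, hidden, extra, params)
```

Because `hidden` is passed back into `rnn_step` at each position, the distribution for the next word depends on every earlier word rather than only on the last N-1, which is the long-span property the abstract refers to.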