Author: | 羅永典 Yueng-Tien Lo |
---|---|
Thesis title: | 使用邊際資訊於鑑別式聲學模型訓練 A Study on Margin-Based Discriminative Training of Acoustic Models |
Advisor: | 陳柏琳 Chen, Berlin |
Degree: | Master |
Department: | Department of Computer Science and Information Engineering |
Year of publication: | 2010 |
Academic year of graduation: | 98 |
Language: | Chinese |
Number of pages: | 51 |
Keywords: | speech recognition, discriminative training of acoustic models, margin information, data selection |
Thesis type: | Academic thesis |
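This thesis examines representative discriminative acoustic model training methods of recent years and the consistency underlying them, and extends a range of margin-based data selection methods to improve discriminative acoustic model training for Mandarin large vocabulary continuous speech recognition. First, to examine recent discriminative training methods more closely, we summarize the consistency underlying their objective functions. Second, we discuss different ways of applying margin information to discriminative training, which effectively reduce the recognition error rate in large vocabulary continuous speech recognition. Third, we combine the soft-margin and boosting-based methods so that the range of selected data is more clearly defined and more flexible, providing statistics that carry more discriminative information. In the implementation, we take utterance-level data selection as an example to better understand how various statistics affect the effectiveness of discriminative training. Finally, using the Public Television Service (公視) news corpus as the experimental platform, the experimental results provide preliminary evidence that the proposed approaches can, to some extent, alleviate the over-training problem faced by earlier methods.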
This thesis investigates the consistency properties underlying the most popular algorithms for discriminative training of acoustic models. It also explores a range of margin- and boosting-based training data selection methods, used in conjunction with these algorithms, for Mandarin large vocabulary continuous speech recognition (LVCSR). First, to evaluate the recently developed discriminative training algorithms in depth, we derive the consistency properties from their individual training objectives. Second, we compare different margin- and boosting-based methods, which make acoustic model training concentrate on the more discriminative training data and thereby improve LVCSR performance. Third, we pair the soft-margin method with the boosting-based method to make better use of discriminative statistics; the implementation is instantiated with utterance-level data selection. All experiments are conducted on a Mandarin broadcast news corpus compiled in Taiwan, and the results suggest that the proposed approaches can relieve the over-training problem to a certain extent.