Author: | 劉士弘 Shih-Hung Liu |
---|---|
Thesis Title: | 改善鑑別式聲學模型訓練於中文連續語音辨識之研究 Improved discriminative training for Mandarin continuous speech recognition |
Advisor: | 陳柏琳 Chen, Berlin |
Degree: | 碩士 Master |
Department: | 資訊工程學系 Department of Computer Science and Information Engineering |
Thesis Publication Year: | 2007 |
Academic Year: | 95 |
Language: | Chinese |
Number of pages: | 123 |
Keywords (in Chinese): | 鑑別式聲學模型訓練、大詞彙連續語音辨識、時間音框正確率函數、資料選取 |
Keywords (in English): | Discriminative training, Large vocabulary continuous speech recognition, time frame accuracy function, data selection |
Thesis Type: | Academic thesis/dissertation |
本論文探討改善鑑別式聲學模型於中文大詞彙連續語音辨識之研究。首先,本論文提出一個新的時間音框層次音素正確率函數來取代最小化音素錯誤訓練的原始音素正確率函數,此新的音素正確率函數在某種程度上能充分地懲罰刪除錯誤。其次,本論文提出一個新的以時間音框層次正規化熵值為基礎的資料選取方法來改進鑑別式訓練,其正規化熵值是由訓練語料所產生之詞圖中高斯分布之事後機率所求得。此資料選取方法可以讓鑑別式訓練更集中在那些離決定邊界較近的訓練樣本所收集的統計值,以達到較佳的鑑別力。此資料選取方法更進一步地應用到非監督鑑別式聲學模型訓練上。最後,本論文也嘗試修改鑑別式訓練的目標函數,以收集不同的統計值來改進最小化音素錯誤鑑別式訓練。所使用的實驗題材是公視新聞語料。由初步的實驗結果來看,結合時間音框層次的資料選取方法和新的音素正確率函數在前幾次的迭代訓練中確實有些微且一致的進步。
This thesis investigates improved discriminative training of acoustic models for Mandarin large vocabulary continuous speech recognition (LVCSR). First, we present a new phone accuracy function based on the frame-level accuracy of hypothesized phone arcs. It replaces the raw phone accuracy function of minimum phone error (MPE) training and can, to some extent, adequately penalize deletion errors. Second, we explore a novel data selection approach for discriminative training, based on the normalized frame-level entropy of Gaussian posterior probabilities obtained from the word lattice of each training utterance. Its merit is that it makes the training algorithm concentrate on the statistics of those frame samples that lie close to the decision boundary, which yields better discrimination. The proposed data selection approach is further applied to unsupervised discriminative training of acoustic models. Finally, we investigate a few other modifications of the training objective function, as well as of the lattice structures, for the accumulation of MPE training statistics. Experiments on the Mandarin broadcast news corpus (MATBN) collected in Taiwan show that combining the frame-level data selection with the new phone accuracy function achieves slight but consistent improvements over conventional MPE training in the early training iterations.
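As a rough illustration of the frame-level data selection idea described in the abstract, the following Python sketch computes a normalized entropy for each frame from a list of posterior probabilities and keeps the frames whose entropy exceeds a threshold. It is only a sketch under stated assumptions, not the thesis's actual formulation: the function names, the normalization by the logarithm of the number of candidates, the threshold value, and the direction of the comparison are all hypothetical choices made for illustration.

```python
import math


def normalized_entropy(posteriors):
    """Entropy of a posterior distribution, scaled to the range [0, 1].

    Assumption: the entropy is normalized by log(N), where N is the number
    of candidate Gaussians active at the frame.
    """
    n = len(posteriors)
    if n < 2:
        return 0.0  # a single candidate carries no uncertainty
    total = sum(posteriors)
    if total <= 0.0:
        raise ValueError("posteriors must have a positive sum")
    entropy = 0.0
    for p in posteriors:
        q = p / total  # renormalize so the values sum to one
        if q > 0.0:
            entropy += -q * math.log(q)
    return entropy / math.log(n)


def select_frames(frame_posteriors, threshold=0.05):
    """Return indices of frames whose normalized entropy exceeds the threshold.

    Assumption: high entropy marks a frame that lies near the decision
    boundary; the default threshold is an arbitrary illustrative value.
    """
    return [
        t
        for t, posteriors in enumerate(frame_posteriors)
        if normalized_entropy(posteriors) > threshold
    ]


# Toy posteriors for three frames with two competing Gaussians each.
frames = [[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]]
selected = select_frames(frames, threshold=0.5)
```

In this toy input, the first frame is fully confident (normalized entropy 0), the second is maximally ambiguous (normalized entropy 1), and the third falls in between. Raising or lowering the threshold trades the amount of training data kept against how close the kept frames are to the decision boundary.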