Graduate student: | 陳燦輝 Tzan-hwei Chen |
---|---|
Thesis title: | 信心度評估於中文大詞彙連續語音辨識之研究 Exploring the Use of Confidence Measures for Mandarin Large Vocabulary Continuous Speech Recognition |
Advisors: | 王新民 Hsin-Min Wang, 陳柏琳 Berlin Chen |
Degree: | Master |
Department: | Department of Computer Science and Information Engineering |
Year of publication: | 2006 |
Academic year of graduation: | 94 |
Language: | Chinese |
Number of pages: | 80 |
Chinese keywords: | 信心度評估, 熵值, 最小化貝氏風險法則, 中文大詞彙連續語音辨識 |
English keywords: | Confidence Measures, Entropy, Minimum Bayes Risk, Large Vocabulary Continuous Speech Recognition |
Thesis type: | Academic thesis |
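This thesis presents a preliminary study of confidence measures for Mandarin large vocabulary continuous speech recognition. Besides the conventional use of confidence measures to judge whether recognition results (for example, candidate words) are correct, it also applies confidence measures to word graph rescoring and N-best list reranking. The experiments use utterances of field reporters and interviewees from the Mandarin broadcast news corpus (MATBN) to examine whether confidence measures perform differently on speech that is closer to read speech and speech that is closer to spontaneous speech. First, the thesis combines entropy information with the posterior-probability-based confidence measure. On the MATBN field reporter (read speech) and interviewee (spontaneous speech) test sets, the best results give relative reductions in confidence error rate of 16.37% and 12.00%, respectively, compared with the conventional confidence measure based on posterior probability alone. Second, in experiments that use minimum time frame error search to improve word graph search accuracy, the thesis combines the word graph information produced from two different kinds of speech features: Mel-frequency cepstral coefficients (MFCC), and features obtained by heteroscedastic linear discriminant analysis (HLDA) together with maximum likelihood linear transformation (MLLT). Minimum time frame error search is then applied to reduce the character error rate of the recognition system. The experiments show a relative character error rate reduction of 4.6% on the field reporter test set and 4.8% on the interviewee test set, which is better than applying minimum time frame error search to the word graph built from HLDA and MLLT features alone. Finally, the thesis adds a feature-based confidence measure to the conventional minimum Bayes risk decoding rule that uses the Levenshtein distance as its cost function. Although the experiments show no marked improvement or degradation in recognition error rate on the field reporter and interviewee data, the results are better than those of the conventional minimum Bayes risk decoding rule with the Levenshtein distance as the cost function.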
This thesis investigated the use of various kinds of confidence measures for Mandarin large vocabulary continuous speech recognition (LVCSR). These confidence measures were used not only as a post-processor to verify the correctness of final recognition hypotheses, but were also integrated directly into the word graph rescoring and N-best list reranking procedures to generate better recognition hypotheses. All experiments were carried out on the Mandarin broadcast news corpus (MATBN), using the speech utterances of field reporters and interviewees, which correspond to read speech and spontaneous speech, respectively. Several approaches to utilizing confidence measures for Mandarin LVCSR were presented and extensively studied in this thesis. First, the entropy information and the posterior probability based confidence measure were tightly combined, and the experimental results showed that such an approach could give relative confidence error rate reductions of 16.37% and 12.00%, respectively, for the field reporters' speech and the interviewees' speech, compared to those obtained by using the posterior probability based confidence measure alone. Second, we attempted to jointly consider the information inherent in the word graph constructed by using the Mel-frequency cepstral coefficients (MFCC) and the word graph constructed by using the discriminant acoustic features resulting from the heteroscedastic linear discriminant analysis and maximum likelihood linear transformation (HLDA+MLLT). The minimum time frame error decoding was conducted on these two word graphs simultaneously to find the best word sequence among them. The experimental results showed that such an approach could achieve relative character error rate reductions of 4.6% and 4.8%, respectively, for the field reporters' speech and the interviewees' speech, which were better than the results obtained by conducting the minimum time frame error decoding on the word graph of HLDA+MLLT alone. Finally, we incorporated the feature-based confidence measure into the minimum Bayes risk decoding. Compared to the conventional minimum Bayes risk decoding, the proposed approach demonstrated slight but consistent performance gains.
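To make the minimum Bayes risk step concrete, the following is a minimal sketch of conventional N-best reranking with the Levenshtein distance as the cost function. It is not the thesis's implementation: the function names `levenshtein` and `mbr_rerank`, the toy tokens, and the posterior values are all hypothetical, and the feature-based confidence measure that the thesis adds to the cost is not modeled here.

```python
from typing import List, Sequence, Tuple


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Edit distance between two token sequences with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def mbr_rerank(nbest: List[Tuple[List[str], float]]) -> List[str]:
    """Pick the hypothesis with the lowest expected Levenshtein cost.

    nbest holds (token sequence, posterior probability) pairs. The
    posteriors are assumed to be normalized over the list.
    """
    def expected_cost(candidate: List[str]) -> float:
        return sum(p * levenshtein(candidate, other) for other, p in nbest)

    return min((hyp for hyp, _ in nbest), key=expected_cost)


# Hypothetical 3-best list: the top-posterior hypothesis is not the MBR choice.
nbest = [
    (["a", "b", "c"], 0.4),
    (["a", "x", "c"], 0.3),
    (["a", "x", "d"], 0.3),
]
best = mbr_rerank(nbest)
```

In this toy list the second hypothesis has the lowest expected cost (0.7, against 0.9 and 1.1), so the rule prefers it over the hypothesis with the highest posterior. This is the behavior that distinguishes minimum Bayes risk decoding from picking the single most probable sequence.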