
Author: 許庭瑋 (TingWei Hsu)
Thesis title: 英文連續語音辨識之初步研究 (An Initial Study on English Continuous Speech Recognition)
Advisor: 陳柏琳 (Chen, Berlin)
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2007
Academic year of graduation: 95 (ROC calendar)
Language: Chinese
Number of pages: 89
Keywords (Chinese): 連續語音辨識, 詞內三連音素模型, 狀態連結, 音素模糊矩陣
Keywords (English): Continuous Speech Recognition, Intra Triphone, State tying, Confusion Matrix
Thesis type: Academic thesis
  • This thesis presents a preliminary study of English continuous speech recognition. We implemented an English continuous speech recognizer and investigated its major components: speech feature extraction, acoustic modeling, and language modeling. First, for speech feature extraction, we compared linear discriminant analysis (LDA) and heteroscedastic linear discriminant analysis (HLDA) against conventional Mel-frequency cepstral coefficients (MFCC). Second, for acoustic modeling, we explored intra-word triphone models, state tying, the phone confusion matrix, and unsupervised acoustic model training to improve recognition accuracy. Finally, for language modeling, we used count merging and model interpolation to combine background and in-domain language model training corpora, enabling better prediction of word occurrences during recognition. The experiments were conducted on the Voice of America (VOA) corpus and the English Across Taiwan (EAT) corpus of Taiwanese-accented English, and yielded some preliminary observations and findings.
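    The abstract names two ways of combining a background corpus with an in-domain corpus for language modeling. The sketch below illustrates the general difference between them on unigram counts only. It is not the thesis's implementation: the function names, the toy corpora, the weights `beta` and `lam`, and the absence of smoothing are all assumptions made for illustration.

```python
from collections import Counter


def unigram_counts(tokens):
    """Count word occurrences in a list of tokens."""
    return Counter(tokens)


def count_merging(bg_counts, in_counts, beta):
    """Merge counts first, then normalize: P(w) ~ c_bg(w) + beta * c_in(w)."""
    vocab = set(bg_counts) | set(in_counts)
    merged = {w: bg_counts[w] + beta * in_counts[w] for w in vocab}
    total = sum(merged.values())
    return {w: c / total for w, c in merged.items()}


def model_interpolation(bg_counts, in_counts, lam):
    """Normalize each model first, then mix:
    P(w) = lam * P_in(w) + (1 - lam) * P_bg(w)."""
    vocab = set(bg_counts) | set(in_counts)
    bg_total = sum(bg_counts.values())
    in_total = sum(in_counts.values())
    return {
        w: lam * in_counts[w] / in_total + (1 - lam) * bg_counts[w] / bg_total
        for w in vocab
    }


# Hypothetical toy corpora: a larger background set and a small in-domain set.
background = unigram_counts("the news the weather the market".split() * 10)
in_domain = unigram_counts("the recognizer the lattice".split())

# beta and lam are illustrative values, not taken from the thesis.
p_merge = count_merging(background, in_domain, beta=5.0)
p_interp = model_interpolation(background, in_domain, lam=0.5)
```

    In this toy setting, count merging lets the relative corpus sizes affect how much the in-domain data contributes, whereas interpolation fixes the contribution through `lam` regardless of corpus size. A practical system would apply this to higher-order n-grams with smoothing and tune the weights on held-out data.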

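    Chapter 1 Introduction
    1.1 Research Motivation
    1.2 Speech Recognition Process
    1.2.1 Feature Extraction
    1.2.2 Acoustic Model
    1.2.3 Language Model
    1.2.4 Linguistic Decoding
    1.3 Scope of the Study
    1.4 Thesis Outline
    Chapter 2 Literature Review
    2.1 Current Research on English Speech Recognition
    2.1.1 BBN Technologies (USA)
    2.1.2 IBM Watson Research Center (USA)
    2.1.3 University of Cambridge (UK)
    2.1.4 Overall Discussion
    2.2 Similarity Measures for Acoustic Model Phone Units
    2.2.1 Data-driven Methods
    2.2.2 Knowledge-based Methods
    Chapter 3 Experimental Corpora and Setup
    3.1 Experimental Lexicon and English Phone Definitions
    3.2 Experimental Corpora
    3.2.1 English Across Taiwan (EAT)
    3.2.2 The Voice of America (VOA)
    3.2.3 British National Corpus (BNC)
    3.3 The NTNU Large Vocabulary Continuous Speech Recognition System
    3.3.1 Speech Feature Extraction
    3.3.2 Acoustic Model Construction
    3.3.3 Language Model Construction
    3.3.4 Lexicon Construction
    3.3.5 Linguistic Decoding
    Chapter 4 Baseline Experiments on English Speech Recognition
    4.1 Baseline Experiments on the VOA Corpus
    4.1.1 Experimental Setup
    4.1.2 Baseline Speech Feature Extraction
    4.1.3 Baseline Triphone Acoustic Models
    4.1.4 Baseline Language Model
    4.2 Baseline Experiments on the EAT Corpus
    4.2.1 Experimental Setup
    4.2.2 Baseline Speech Feature Extraction
    4.2.3 Baseline Triphone Acoustic Models
    4.2.4 Baseline Language Model
    4.3 Discussion of Experiments
    Chapter 5 Experiments on Improving English Recognition
    5.1 Discriminative Feature Extraction
    5.2 Language Model Adaptation
    5.2.1 Count Merging
    5.2.2 Linear Interpolation
    5.3 Use of the Confusion Matrix
    5.3.1 Use During Acoustic Model Training
    5.3.2 Use During Recognizer Search
    5.4 Unsupervised Acoustic Model Training
    5.4.1 Confidence Measures
    5.4.2 Experimental Setup and Results
    5.5 Discussion of Experiments
    Chapter 6 Conclusions and Future Work
    References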

