
Author: 張志豪
Thesis title: 強健性和鑑別力語音特徵擷取技術於大詞彙連續語音辨識之研究
Robust and Discriminative Feature Extraction Techniques for Large Vocabulary Continuous Speech Recognition
Advisor: 陳柏琳
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2005
Academic year of graduation: 93
Language: Chinese
Number of pages: 90
Keywords (Chinese): 資料相關線性特徵轉換, 主成份分析, 線性鑑別分析, 異質性線性鑑別分析, 異質性鑑別分析, 最大相似度線性轉換
Keywords (English): Data-driven Linear Feature Transformation, Principal Component Analysis, Linear Discriminant Analysis, Heteroscedastic Linear Discriminant Analysis, Heteroscedastic Discriminant Analysis, Maximum Likelihood Linear Transformation
Thesis type: Academic thesis
  • Speech is one of the primary and most convenient means of human communication. With the successful development of small electronic devices such as mobile phones and personal digital assistants (PDAs), together with the spread of wireless communication and networking, it is widely believed that speech will soon serve as the main human-machine interface between people and many kinds of smart devices. Research on automatic speech recognition (ASR) has therefore received growing attention, and many discriminative and robust feature extraction techniques have been proposed over the past two decades so that ASR can be deployed in real and varied environments.
    With this in mind, this thesis studies auditory-perception-based feature extraction and data-driven linear feature transformation for robust speech recognition. For auditory-perception-based feature extraction, we extensively compare the conventional Mel-frequency cepstral coefficients (MFCC) with perceptual linear prediction coefficients (PLPC), and compare several ways of deriving and combining their time-trajectory information. For data-driven linear feature transformation, we first attempt to verify that linear discriminant analysis (LDA) outperforms principal component analysis (PCA) as a feature-space transformation for speech recognition. We then investigate several improvements to LDA, such as heteroscedastic linear discriminant analysis (HLDA) and heteroscedastic discriminant analysis (HDA), which remove the assumption made in conventional LDA that every class distribution has the same variation. We further propose optimizing the transformation matrices with the minimum classification error (MCE) and maximum mutual information (MMI) criteria, and compare both with the conventional maximum likelihood (ML) criterion. Finally, we combine the maximum likelihood linear transformation (MLLT) with other robustness techniques such as feature mean subtraction and feature (variance) normalization. All experiments were carried out on the Mandarin broadcast news corpus (MATBN) and cover both Mandarin free syllable decoding and large vocabulary continuous speech recognition (LVCSR). Preliminary experimental results indicate that the proposed approaches yield a considerable improvement in recognition accuracy.
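    As a hedged illustration of two ideas named in the abstract, feature mean subtraction with variance normalization and the use of time-trajectory information, the sketch below normalizes each feature dimension over one utterance and then computes regression-based delta features. It is not the thesis's implementation. The function names `cmvn` and `deltas`, the window width of 2, the replication of edge frames, and the toy input are all assumptions made for illustration.

```python
import math


def cmvn(frames):
    """Per-utterance mean subtraction and variance normalization.

    `frames` is a list of feature vectors (lists of floats), one per frame.
    Each dimension is shifted to zero mean and scaled to unit variance.
    """
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = []
    for d in range(dim):
        var = sum((f[d] - means[d]) ** 2 for f in frames) / n
        # Assumption: a constant dimension is left unscaled to avoid dividing by zero.
        stds.append(math.sqrt(var) if var > 0 else 1.0)
    return [[(f[d] - means[d]) / stds[d] for d in range(dim)] for f in frames]


def deltas(frames, width=2):
    """Regression-based delta (time-trajectory) features.

    Uses d_t = sum_k k * (c_{t+k} - c_{t-k}) / (2 * sum_k k^2), k = 1..width.
    Assumption: frames beyond the utterance edges are replaced by the
    first or last frame.
    """
    n = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(k * k for k in range(1, width + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(dim):
            acc = 0.0
            for k in range(1, width + 1):
                ahead = frames[min(t + k, n - 1)][d]
                behind = frames[max(t - k, 0)][d]
                acc += k * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


if __name__ == "__main__":
    # Toy 2-dimensional "cepstral" frames, invented for this example.
    utterance = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
    normalized = cmvn(utterance)
    # Append the delta features to the normalized static features.
    augmented = [s + d for s, d in zip(normalized, deltas(normalized))]
    for frame in augmented:
        print(frame)
```

    Appending the delta features to the static features is one common way of combining time-trajectory information. The abstract compares several such methods, and this sketch does not claim to reproduce any particular one of them.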

    Abstract (Chinese)
    Abstract
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Research Motivation
      1.2 Research Objectives
      1.3 Research Content
      1.4 Research Contributions
      1.5 Chapter Outline
    Chapter 2 Literature Review
      2.1 Human Auditory Perception
        2.1.1 Mel-frequency Cepstral Coefficients
        2.1.2 Perceptual Linear Prediction
      2.2 Data-driven Linear Feature Transformation
        2.2.1 Principal Component Analysis (PCA)
        2.2.2 Linear Discriminant Analysis (LDA)
        2.2.3 Heteroscedastic Linear Discriminant Analysis (HLDA)
          2.2.3.1 Kumar's Method
          2.2.3.2 Gales's Method
        2.2.4 Heteroscedastic Discriminant Analysis
        2.2.5 Maximum Likelihood Linear Transformation
      2.3 Robust Speech Feature Techniques
        2.3.1 Cepstral Mean Subtraction
        2.3.2 Cepstral Normalization
    Chapter 3 Improvements to Data-driven Linear Feature Transformation
      3.1 Improved Estimation of the Linear Transformation Matrix
        3.1.1 Minimum Classification Error Estimation of Diagonalized Heteroscedastic Linear Discriminant Analysis
        3.1.2 Maximum Mutual Information Estimation of Heteroscedastic Linear Discriminant Analysis
      3.2 Use of Kernel Functions
        3.2.1 Kernel Principal Component Analysis
        3.2.2 Kernel Linear Discriminant Analysis
    Chapter 4 Experimental Environment and Setup
      4.1 Experimental Corpus
      4.2 NTNU Broadcast News Transcription System
        4.2.1 Front-end Processing
        4.2.2 Acoustic Models
        4.2.3 Lexicon Construction and Language Model Training
        4.2.4 Lexical Tree Copy Search
      4.3 Evaluation Method
      4.4 Frequency-domain and Time-domain Feature Extraction
    Chapter 5 Experiments
      5.1 Comparison of Mel-frequency Cepstral Coefficients and Perceptual Linear Prediction
      5.2 Input Selection for Frequency-domain and Time-domain Feature Extraction
      5.3 Discussion of Parameters for Data-driven Linear Feature Transformation
      5.4 Comprehensive Analysis of Data-driven Linear Feature Transformation
      5.5 Robustness Experiments
      5.6 Mandarin Large Vocabulary Continuous Speech Recognition
    Chapter 6 Conclusions and Future Work
    References

