Author: | Hung-Bin Chen (陳鴻彬) |
---|---|
Thesis title: | On the Study of Energy-Based Speech Feature Normalization and Application to Voice Activity Detection (以能量為基礎之語音正規化方法研究及其於語音端點偵測之應用) |
Advisor: | Berlin Chen (陳柏琳) |
Degree: | Master |
Department: | Department of Computer Science and Information Engineering |
Year of publication: | 2007 |
Academic year of graduation: | 95 |
Language: | Chinese |
Number of pages: | 74 |
Keywords (Chinese): | 語音正規化, 語音端點偵測 |
Keywords (English): | Speech Feature Normalization, Voice Activity Detection |
Thesis type: | Academic thesis |
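This thesis examines how robust speech recognition techniques behave under different noise conditions and, working in the time domain, studies ways to reconstruct the log-energy of clean speech from the log-energy of noisy speech. Based on the distribution of the log-energy feature values within each utterance, we seek an effective method for rescaling the log-energy of noisy speech, so as to reduce the mismatch caused by the noise environment and achieve better recognition accuracy.
Observation of speech signals in the time domain shows that low-energy speech frames are more easily affected by additive noise than high-energy frames, and that under severe additive noise the log-energy feature is raised across almost the entire utterance. We therefore propose a simple but effective method, called Log Energy Rescaling Normalization (LERN), which appropriately rescales the log-energy feature values of noisy speech so that they approach those of clean speech.
The speech recognition experiments use the Aurora-2.0 corpus released by the European Telecommunications Standards Institute (ETSI), which covers a variety of noise conditions. The corpus is a small-vocabulary task of connected digit strings spoken in English, with eight noise sources and seven signal-to-noise ratio (SNR) conditions. The results show that LERN outperforms other normalization methods applied to energy or log-energy. A further set of experiments on large vocabulary continuous speech recognition (LVCSR) with the Mandarin broadcast news corpus (MATBN) demonstrates that LERN still improves recognition accuracy effectively.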
This thesis studies robust speech recognition in various noise environments, with a special focus on reconstructing clean time-domain log-energy features from noise-contaminated ones. Based on the distribution of the log-energy features within each utterance, we aim to develop an efficient approach that rescales the log-energy features of a noisy utterance so as to alleviate the mismatch caused by environmental noise and improve recognition performance.
Time-domain observations of speech signals reveal that lower-energy speech frames are more vulnerable to additive noise than higher-energy ones, and that the log-energy features of an utterance tend to be lifted as a whole when the utterance is severely corrupted by additive noise. We therefore propose a simple but effective approach, named log-energy rescaling normalization (LERN), that rescales the log-energy features of noisy speech toward those of the corresponding clean speech.
The speech recognition experiments were conducted under various noise conditions using the European Telecommunications Standards Institute (ETSI) Aurora-2.0 database, which contains connected digit utterances spoken in English and offers eight noise sources and seven signal-to-noise ratios (SNRs). The experimental results show that the proposed LERN approach performs considerably better than other conventional energy or log-energy feature normalization methods. Another set of experiments on large vocabulary continuous speech recognition (LVCSR) of Mandarin broadcast news also confirms the effectiveness of LERN.
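The observation behind LERN, that additive noise lifts the log-energy of low-energy frames far more than that of high-energy frames, can be illustrated with a small synthetic sketch. Everything below is an assumption made for illustration: the two-level sine "utterance", the Gaussian noise level, the frame length and hop, and the helper names `frame_log_energy` and `rescale_linear`. In particular, `rescale_linear` is a generic linear range mapping that uses the clean contour's range as an oracle target; it is not the LERN formula, which the abstract does not spell out.

```python
import math
import random


def frame_log_energy(samples, frame_len=200, hop=80, floor=1e-10):
    """Return the natural-log energy of each frame (sum of squared samples)."""
    values = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        energy = sum(x * x for x in samples[start:start + frame_len])
        values.append(math.log(max(energy, floor)))  # floor avoids log(0)
    return values


def rescale_linear(log_e, target_min, target_max):
    """Linearly map a log-energy contour onto [target_min, target_max].

    Generic illustration only; this is NOT the LERN formula from the thesis.
    """
    lo, hi = min(log_e), max(log_e)
    if hi == lo:  # flat contour: nothing to stretch
        return [target_min] * len(log_e)
    scale = (target_max - target_min) / (hi - lo)
    return [target_min + (v - lo) * scale for v in log_e]


def mean_abs_diff(a, b):
    """Mean absolute difference between two equal-length contours."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Synthetic "utterance" (assumption): quiet first half, loud second half.
N = 1600
rng = random.Random(0)
clean = [(0.01 if i < N // 2 else 1.0) * math.sin(2 * math.pi * 0.05 * i)
         for i in range(N)]
noisy = [c + rng.gauss(0.0, 0.1) for c in clean]  # additive white noise

clean_le = frame_log_energy(clean)
noisy_le = frame_log_energy(noisy)
lift = [n - c for n, c in zip(noisy_le, clean_le)]

# Frames 0-7 lie entirely in the quiet half, frames 10-17 in the loud half.
quiet_lift = sum(lift[:8]) / 8
loud_lift = sum(lift[10:]) / 8

# Oracle target range taken from the clean contour, purely for illustration.
rescaled = rescale_linear(noisy_le, min(clean_le), max(clean_le))

print(f"mean log-energy lift, quiet frames: {quiet_lift:.2f}")
print(f"mean log-energy lift, loud frames:  {loud_lift:.2f}")
print(f"mismatch before rescaling: {mean_abs_diff(noisy_le, clean_le):.2f}")
print(f"mismatch after rescaling:  {mean_abs_diff(rescaled, clean_le):.2f}")
```

In this toy setting the quiet frames are lifted by several nats while the loud frames barely move, which compresses the dynamic range of the noisy contour. A practical method cannot see the clean contour, so it would have to estimate the target range or rescaling factor from the noisy utterance itself or from training data.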