
Author: Chang, Ting-Hao (張庭豪)
Thesis title: Several Refinements of Modulation Spectrum Factorization for Robust Speech Recognition (調變頻譜分解之改良於強健性語音辨識)
Advisor: Chen, Berlin (陳柏琳)
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2015
Academic year of graduation: 103 (ROC calendar)
Language: Chinese
Number of pages: 98
Keywords: speech recognition (語音辨識), noise (雜訊), robustness (強健性), modulation spectrum (調變頻譜), nonnegative matrix factorization (非負矩陣分解)
Thesis type: Academic thesis
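  • Automatic speech recognition (ASR) systems often suffer severe performance degradation caused by environmental variability, so the development of robustness techniques has long been an important and active research area. This thesis investigates speech robustness techniques, aiming to obtain more robust speech features through effective processing of the modulation spectra of speech features. To this end, we use nonnegative matrix factorization (NMF) and several refinements of it to factorize the magnitude components of the modulation spectrum. The thesis makes the following contributions. First, we incorporate the notion of sparseness, aiming to obtain NMF basis vectors that capture localized modulation-spectrum information and overlap less with one another. Second, based on the notion of local invariance, we encourage the modulation spectrum magnitudes of utterances with similar spoken content to have closer vector representations in the NMF space, so as to preserve the relationships among utterances. Third, the NMF encoding vectors are normalized at test time to further improve the robustness of the speech features. Finally, we also combine the above NMF refinements. All experiments are conducted on the internationally used Aurora-2 continuous-digit corpus; the results show that, compared with the baseline that uses only Mel-frequency cepstral features, all of the proposed refinements significantly reduce the speech recognition error rate. In addition, we compare and combine the proposed refinements with several well-known feature robustness techniques to verify their practicality. Two speech recognition toolkits, HTK and Kaldi, serve as the experimental platforms. The former is used to evaluate the proposed NMF refinements; the latter is used to evaluate neural-network acoustic models and to explore the effect of combining them with modulation spectrum normalization.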

    The performance of an automatic speech recognition (ASR) system often deteriorates severely under interference from varying environmental noise. As such, the development of effective and efficient robustness techniques has long been a challenging research subject in the ASR community. In this thesis, we attempt to obtain noise-robust speech features through modulation spectrum processing of the original speech features. To this end, we explore the use of nonnegative matrix factorization (NMF) and its extensions on the magnitude modulation spectra of speech features so as to distill the most important and noise-resistant information cues that can benefit ASR performance. The main contributions are threefold: 1) we leverage the notion of sparseness to obtain more localized and parts-based representations of the magnitude modulation spectra with fewer basis vectors; 2) prior knowledge of the similarities among training utterances is taken into account as an additional constraint during the NMF derivation; and 3) the resulting NMF encoding vectors are normalized to further enhance the robustness of the representation. A series of experiments conducted on the Aurora-2 benchmark task demonstrates that our methods deliver remarkable improvements over the baseline NMF method and achieve performance on par with or better than several widely used robustness methods.
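The basic idea of factorizing magnitude modulation spectra with NMF can be sketched as follows. This is a minimal illustration, not the thesis's implementation. It assumes plain NMF with the standard multiplicative updates for squared error, a fixed FFT length `n_fft` no shorter than any utterance, and reuse of the original phase at reconstruction. The function names and parameters (`nmf`, `encode`, `enhance_trajectory`, `rank`, `n_iter`) are hypothetical, and none of the sparseness, graph-regularization, or encoding-normalization refinements described above are included.

```python
import numpy as np

def magnitude_modulation_spectra(trajectories, n_fft):
    """Stack |FFT| of each 1-D feature time series as one column (bins x utterances)."""
    return np.stack([np.abs(np.fft.rfft(x, n=n_fft)) for x in trajectories], axis=1)

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Plain NMF, V ~= W @ H, via multiplicative updates for squared error."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps   # nonnegative basis vectors
    H = rng.random((rank, V.shape[1])) + eps   # nonnegative encoding vectors
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

def encode(W, V, n_iter=200, eps=1e-9, seed=0):
    """Encode new magnitude spectra V with the basis W held fixed."""
    rng = np.random.default_rng(seed)
    H = rng.random((W.shape[1], V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
    return H

def enhance_trajectory(x, W, n_fft):
    """Replace the magnitude modulation spectrum of x by its NMF reconstruction.

    Assumes len(x) <= n_fft; the original phase is kept unchanged.
    """
    X = np.fft.rfft(x, n=n_fft)
    mag, phase = np.abs(X), np.angle(X)
    h = encode(W, mag[:, None])                # encoding vector for this utterance
    new_mag = (W @ h)[:, 0]                    # reconstructed magnitude spectrum
    y = np.fft.irfft(new_mag * np.exp(1j * phase), n=n_fft)
    return y[:len(x)]                          # trim zero-padding
```

In this sketch the basis `W` would be learned once from the training utterances (for example, one basis per feature dimension) and then held fixed when test utterances are encoded.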

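    Chapter 1 Introduction
      1.1 Research Background
      1.2 Research Motivation
      1.3 Robust Speech Recognition
      1.4 Research Content and Contributions
      1.5 Thesis Organization
    Chapter 2 Literature Review
      2.1 Mel-Frequency Cepstral Feature Extraction
      2.2 Robust Speech Feature Techniques
        2.2.1 Model-Based Robustness Techniques
        2.2.2 Feature-Based Robustness Techniques
        2.2.3 Hybrid Techniques
      2.3 Introduction to Temporal Processing Techniques for Speech Feature Sequences
      2.4 Modulation Spectrum Normalization Methods
    Chapter 3 Nonnegative Matrix Factorization
      3.1 Nonnegative Sparse Coding (NNSC)
      3.2 Sparse Nonnegative Matrix Factorization (SNMF)
      3.3 Local Nonnegative Matrix Factorization (LNMF)
      3.4 Nonnegative Matrix Factorization with Sparseness Constraints (NMFSC)
      3.5 Nonsmooth Nonnegative Matrix Factorization (NSNMF)
      3.6 Graph Regularized Nonnegative Matrix Factorization (GNMF)
    Chapter 4 Survey of Neural Network Research
      4.1 Introduction to Neural Networks
      4.2 Multilayer Neural Networks
      4.3 Error Backpropagation Algorithm
      4.4 Convolutional Neural Networks
    Chapter 5 Corpus, Experimental Setup, and Baseline Results
      5.1 The AURORA-2 Corpus
      5.2 Experimental Setup
      5.3 Evaluation of Recognition Performance
      5.4 Baseline Experimental Results
    Chapter 6 Nonnegative Matrix Factorization of Modulation Spectra
      6.1 NMF-Based Modulation Spectrum Normalization
      6.2 Nonsmooth NMF with Improved Sparseness
      6.3 Graph Regularized NMF Applied to Modulation Spectra
      6.4 NMF with Histogram Equalization
      6.5 Neural Networks as Acoustic Models for Speech Recognition
    Chapter 7 Conclusions and Future Work
    References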

