Author: Yu-chen Kao (高予真)
Thesis title: Distribution-based Feature Normalization with Temporal-Structural Information on Robust Speech Recognition
Advisor: Berlin Chen (陳柏琳)
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2013
Academic year of graduation: 101 (ROC calendar)
Language: English
Number of pages: 81
Keywords: modulation spectrum, histogram equalization, context information, polynomial fitting, robust speech recognition
Thesis type: Academic thesis
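  • In recent years, histogram equalization (HEQ) has become a popular research topic in robust speech recognition because it is simple yet performs well. In this thesis, we propose two improved HEQ techniques: one uses polynomial regression to improve the performance of HEQ on the modulation spectrum, and the other uses spatial and temporal contextual information to relax the assumptions of conventional HEQ applied to Mel-frequency cepstral coefficient features. These methods have two main characteristics. First, they normalize speech features with higher-order polynomials and incorporate contextual information across time and space (different feature dimensions), relaxing the conventional HEQ assumption that time and space are independent. Second, they introduce temporal difference information into feature normalization, which makes better use of contextual information and yields some improvement in recognition performance. We use the Aurora-2 corpus to evaluate different robust feature extraction techniques on a small-vocabulary speech recognition task, and the Aurora-4 corpus to further evaluate them on a large-vocabulary task. The experimental results confirm that the two proposed techniques effectively reduce the word error rate and also benefit other advanced features such as the ETSI advanced front end (AFE).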

    Histogram equalization (HEQ) of speech features has recently received considerable attention in robust speech recognition because of its relative simplicity and good empirical performance. In this thesis, we present a polynomial variant of spectral histogram equalization (SHE) that operates on the modulation spectra of speech features, and a novel extension of the conventional HEQ approach in the cepstral domain. Our HEQ methods have at least two attractive properties. First, polynomial regression of various orders is employed to perform feature normalization efficiently, building upon the notion of HEQ. Second, both the contextual distributional statistics and the dynamics of feature values are taken as inputs to the regression functions for better normalization performance. By doing so, we can to some extent relax the dimension-independence and bag-of-frames assumptions made by conventional HEQ methods. All experiments were carried out on the Aurora-2 corpus and task and further verified on the Aurora-4 corpus and task. The results demonstrate that the proposed methods achieve considerable word error rate reductions over the baseline systems and offer additional performance gains for features processed by the ETSI advanced front end (AFE).
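The idea of polynomial regression "building upon the notion of HEQ" can be illustrated with a minimal sketch. It assumes a single feature dimension, a rank-based estimate of the cumulative distribution function (CDF), and a least-squares polynomial that maps CDF values back to clean feature values. All function names here (`empirical_cdf`, `fit_polynomial`, `train_pheq`, `apply_pheq`) are hypothetical, the data are synthetic, and the sketch does not reproduce the PSHE or EPHEQ formulations of the thesis, which additionally involve modulation spectra, contextual statistics, and feature dynamics.

```python
import random

def empirical_cdf(values):
    """Rank-based CDF estimate (rank - 0.5) / N for each value, in input order."""
    n = len(values)
    order = sorted(range(n), key=values.__getitem__)
    cdf = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        cdf[idx] = (rank - 0.5) / n
    return cdf

def fit_polynomial(xs, ys, order):
    """Least-squares coefficients a_0..a_order, solved via the normal equations."""
    m = order + 1
    aug = []  # augmented matrix [A^T A | A^T y]
    for i in range(m):
        row = [sum(x ** (i + j) for x in xs) for j in range(m)]
        row.append(sum(y * x ** i for x, y in zip(xs, ys)))
        aug.append(row)
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, m + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        tail = sum(aug[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (aug[i][m] - tail) / aug[i][i]
    return coeffs

def evaluate(coeffs, x):
    """Evaluate the polynomial at x with Horner's rule."""
    result = 0.0
    for a in reversed(coeffs):
        result = result * x + a
    return result

def train_pheq(train_values, order=3):
    """Fit a polynomial approximating the inverse CDF of the training features."""
    return fit_polynomial(empirical_cdf(train_values), train_values, order)

def apply_pheq(coeffs, utterance_values):
    """Map each frame's within-utterance CDF value through the fitted polynomial."""
    return [evaluate(coeffs, u) for u in empirical_cdf(utterance_values)]

# Synthetic stand-ins: "clean" training features and a shifted, scaled test utterance.
rng = random.Random(0)
clean = [rng.gauss(0.0, 1.0) for _ in range(2000)]
noisy = [0.5 * rng.gauss(0.0, 1.0) + 2.0 for _ in range(300)]
coeffs = train_pheq(clean, order=3)
normalized = apply_pheq(coeffs, noisy)
print("mean before:", sum(noisy) / len(noisy))
print("mean after: ", sum(normalized) / len(normalized))
```

Because only the polynomial coefficients need to be stored, this kind of mapping avoids keeping a full reference histogram or lookup table. The normal-equation solver is chosen here only to keep the sketch dependency-free; a numerically more robust least-squares routine would be preferable for high polynomial orders.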

    LIST OF FIGURES
    LIST OF TABLES
    CHAPTER 1 INTRODUCTION
      1.1 Automatic Speech Recognition
      1.2 Environmental Variations
      1.3 Robust ASR
      1.4 Main Contributions
      1.5 Outline of this Thesis
    CHAPTER 2 RELATED WORK
      2.1 Speech Feature Extraction
      2.2 Distribution-based Feature Normalization
      2.3 Modulation Spectrum
    CHAPTER 3 EXPERIMENT SETUP
      3.1 Speech Corpus
      3.2 Experiment Design
      3.3 Baseline Results
    CHAPTER 4 POLYNOMIAL-FIT SPECTRAL HISTOGRAM EQUALIZATION
      4.1 Formulation of PSHE
      4.2 Results
    CHAPTER 5 EXTENDED POLYNOMIAL-FIT HISTOGRAM EQUALIZATION
      5.1 Formulation of EPHEQ
      5.2 Results
    CHAPTER 6 CONCLUSIONS AND FUTURE WORKS
    BIBLIOGRAPHY

