Graduate student: Yan, Bi-Cheng (顏必成)
Thesis title: Exploring Low-dimensional Structures of Modulation Spectra for Robust Speech Recognition (探索調變頻譜特徵之低維度結構應用於強健性語音辨識)
Advisor: Chen, Berlin (陳柏琳)
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2017
Academic year of graduation: 105
Language: English
Number of pages: 83
Keywords (Chinese): 調變頻譜, 強健性語音辨識, 流形學習法, 稀疏表示法, 低秩表示法
Keywords (English): robust speech recognition, manifold learning, sparse representation, low-rank representation, modulation spectrum
DOI URL: https://doi.org/10.6345/NTNU202202490
Thesis type: Academic thesis
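  • Robustness techniques play an important role in automatic speech recognition (ASR) systems, and their importance is especially apparent under environmental effects. Recent studies indicate that exploring the low-dimensional structure of speech features helps to extract more robust speech features. In light of this, we study several approaches that consider the intrinsic low-dimensional structures of speech features and seek subspaces with a specific structure that cover the original high-dimensional speech feature space, in the hope of obtaining more robust speech features.
    In this thesis, we explore a series of low-dimensional structure methods and apply them to the modulation spectra of speech, in the hope of distilling robust speech features. First, we use sparse representation based methods to broadly analyze high-dimensional speech features, and then identify and remove the residual bases. Next, we propose a low-rank representation based approach to explore the subspace structures of speech modulation spectra, thereby alleviating the negative effects of noise. Finally, we explore the intrinsic geometric low-dimensional manifold structures of the modulation spectra of speech features, in the hope of projecting noisy speech onto this manifold structure to obtain more robust speech features. In addition, to achieve better recognition performance, we combine the proposed methods with common speech feature normalization methods, and all of the results show good performance. All experiments were conducted and verified on the Aurora-4 database and task.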

    The development of noise robustness techniques is vital to the success of automatic speech recognition (ASR) systems in the face of varying sources of environmental interference. Recent studies have shown that exploiting the low dimensionality of speech features can yield good robustness. In this vein, research on low-dimensional structures, which considers the intrinsic structures of speech features residing in low-dimensional subspaces, has gained considerable interest from the ASR community. In this thesis, we explore a family of low-dimensional structure methods in the modulation domain, in the hope of obtaining more noise-robust speech features. The research is divided into three main aspects. First, sparse representation based methods are used to remove residual bases from the modulation spectra of speech features. Second, we propose a novel use of the low-rank representation (LRR) based method to discover the subspace structures of modulation spectra, thereby alleviating the negative effects of noise interference. Third, we explore the intrinsic geometric low-dimensional manifold structures inherent in the modulation spectra of speech features, again with the aim of obtaining more noise-robust speech features. Furthermore, we extensively compare our approaches with several well-practiced feature-based normalization methods. All experiments were conducted and verified on the Aurora-4 database and task.
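
As a rough illustration of what working in the modulation domain can look like, the sketch below takes a feature matrix for one utterance, computes the modulation spectrum of each feature dimension (the DFT of its time trajectory), replaces the magnitude with a low-rank approximation, and transforms back. It is a minimal sketch under stated assumptions and not the thesis's method: a plain truncated SVD stands in for the LRR formulation, the processing is applied to a single utterance, and the function names (`modulation_spectrum`, `low_rank_magnitude`, `enhance`) and the random test data are hypothetical.

```python
import numpy as np


def modulation_spectrum(features):
    """Complex modulation spectrum of a (num_frames, num_dims) feature matrix.

    Each column is the time trajectory of one feature dimension; the DFT is
    taken along the time axis, giving a (num_bins, num_dims) complex matrix.
    """
    return np.fft.rfft(features, axis=0)


def low_rank_magnitude(magnitude, rank):
    """Rank-limited approximation of a magnitude matrix via truncated SVD.

    Assumption: truncated SVD is used here only as a simple stand-in for a
    low-rank model; it is not the LRR optimization described in the thesis.
    """
    u, s, vt = np.linalg.svd(magnitude, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


def enhance(features, rank):
    """Low-rank smoothing of modulation magnitudes, keeping the original phase."""
    spec = modulation_spectrum(features)
    magnitude, phase = np.abs(spec), np.angle(spec)
    # Clip at zero because a truncated SVD can produce small negative values.
    smoothed = np.maximum(low_rank_magnitude(magnitude, rank), 0.0)
    new_spec = smoothed * np.exp(1j * phase)
    # Back to the time domain with the original number of frames.
    return np.fft.irfft(new_spec, n=features.shape[0], axis=0)


# Hypothetical data: 100 frames of 13-dimensional features.
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 13))
enhanced = enhance(feats, rank=3)
```

Keeping the phase and modifying only the magnitude is one common design choice for modulation-domain processing; the rank is a free parameter that would need tuning on development data.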

    LIST OF FIGURES
    LIST OF TABLES
    CHAPTER 1 INTRODUCTION
      1.1 AUTOMATIC SPEECH RECOGNITION
      1.2 ROBUSTNESS TECHNIQUE FOR ASR
        1.2.1 Feature normalization
        1.2.2 Feature enhancement
        1.2.3 Model adaptation
      1.3 MAIN CONTRIBUTION
      1.4 OUTLINE OF THIS THESIS
    CHAPTER 2 RELATED WORK
      2.1 SPEECH FEATURE EXTRACTION
        2.1.1 Spectral Shaping
        2.1.2 Spectral Analysis
        2.1.3 Coefficient Transformation
      2.2 DISTRIBUTION-BASED FEATURE NORMALIZATION
        2.2.1 Statistical Moments Normalization
        2.2.2 Histogram Equalization
      2.3 MODULATION SPECTRUM
        2.3.1 Definition and Properties of Modulation Spectrum
        2.3.2 A Normalization Framework of Modulation Spectrum
    CHAPTER 3 EXPERIMENT SETUP
      3.1 SPEECH DATASET
      3.2 ACOUSTIC MODEL SETTINGS
      3.3 PERFORMANCE EVALUATION
      3.4 BASELINE EXPERIMENTS
        3.4.1 Clean-condition training and multi-condition training
        3.4.2 Further evaluation for widely-used feature normalization methods
    CHAPTER 4 SPARSE REPRESENTATION
      4.1 FORMULATION OF SPARSE REPRESENTATION
        4.1.1 Dictionary learning
        4.1.2 Sparse coding
      4.2 THE EXPERIMENTS OF SPARSE REPRESENTATION
      4.3 THE RESULTS OF SPARSE REPRESENTATION
    CHAPTER 5 LOW-RANK REPRESENTATION
      5.1 FORMULATION OF LOW-RANK REPRESENTATION
      5.2 THE EXPERIMENTS OF LOW-RANK REPRESENTATION
      5.3 THE RESULTS OF LOW-RANK REPRESENTATION
    CHAPTER 6 MANIFOLD LEARNING
      6.1 FORMULATION OF GRAPH REGULARIZED-BASED MANIFOLD LEARNING
      6.2 THE EXPERIMENTS OF GRAPH REGULARIZED-BASED MANIFOLD LEARNING
      6.3 THE RESULTS OF GRAPH REGULARIZED-BASED MANIFOLD LEARNING
    CHAPTER 7 CONCLUSIONS AND FUTURE WORKS
    BIBLIOGRAPHY

