Graduate student: | 洪孝宗 |
---|---|
Thesis title: | 聲調特徵擷取技術與其在中文聲調辨識應用之研究 An Empirical Study on Tonal Feature Extraction Techniques and Their Applications in Mandarin Tone Recognition |
Advisor: | 陳柏琳 |
Degree: | Master |
Department: | 資訊工程學系 Department of Computer Science and Information Engineering |
Year of publication: | 2014 |
Academic year of graduation: | 102 |
Language: | Chinese |
Number of pages: | 53 |
Chinese keywords: | 聲調辨識、聲調特徵擷取、線性預估係數 |
English keywords: | Tone recognition, Tonal feature, Linear Predictive Coefficients |
Thesis type: | Academic thesis |
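This thesis investigates how extracting tone features at different levels affects applications related to Mandarin tone recognition. Tone features are broadly composed of a frame level and various pronunciation-unit levels; tonal information at the frame level is mostly represented by fundamental frequency values, and statistics computed over intervals such as phones or syllables then serve as the tone features.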
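To use pitch information more robustly, this thesis examines several pitch representations and normalization methods. The pitch representations include the fundamental frequency variation spectrum (FFV Spectrum), probability of voicing (POV), and high-order Mel-frequency cepstral coefficients (HMFCC); the normalization methods include mean and variance normalization (MVN) and histogram equalization (HEQ). The thesis also proposes approximating the curve of the normalized cross correlation function (NCCF) with linear predictive coefficients (LPC), so as to represent frame-level pitch information completely. In addition, it compares several pitch statistics computed within and across sub-intervals, including the sub-interval pitch skewness and kurtosis features proposed in this thesis. Finally, different machine-learning classifiers, such as the support vector machine (SVM) and the deep neural network (DNN), are combined with the above tone features for tone recognition.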
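As a rough illustration of the sub-interval pitch statistics mentioned above, the following sketch splits a syllable's pitch contour into equal parts and computes the mean, skewness, and kurtosis of each part. It is a minimal sketch only: the function names, the choice of three sub-intervals, and the use of population (biased) moments are assumptions for illustration and are not taken from the thesis.

```python
import numpy as np

def mvn(values):
    """Mean and variance normalization: shift to zero mean, scale to unit variance."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0.0:
        # A constant contour has no variance; return zeros instead of dividing by 0.
        return np.zeros_like(values)
    return (values - values.mean()) / std

def skewness(values):
    """Population skewness: the third standardized moment."""
    return float(np.mean(mvn(values) ** 3))

def kurtosis(values):
    """Population kurtosis: the fourth standardized moment (3.0 for a Gaussian)."""
    return float(np.mean(mvn(values) ** 4))

def subinterval_features(pitch, num_subintervals=3):
    """Return (mean, skewness, kurtosis) for each equal-sized part of a pitch contour.

    `num_subintervals=3` is an illustrative default, not a value from the thesis.
    """
    pitch = np.asarray(pitch, dtype=float)
    if len(pitch) < num_subintervals:
        raise ValueError("pitch contour is shorter than the number of sub-intervals")
    parts = np.array_split(pitch, num_subintervals)
    return [(float(p.mean()), skewness(p), kurtosis(p)) for p in parts]
```

Calling `subinterval_features(contour)` on a per-syllable array of pitch values yields one tuple per sub-interval; these tuples could be concatenated into a feature vector for a classifier. Whether the contour is normalized beforehand (for example with `mvn` over the whole utterance) is a separate design choice.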
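Experiments are conducted on the public television broadcast news corpus (MATBN Corpus) and the National Taiwan Normal University corpus of Mandarin learners' speech (NTNU-MAS Corpus); the results show that the proposed methods perform well in tone recognition applications.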
This thesis delves into the extraction of tonal features with different levels of granularity, as well as their applications to Mandarin tone recognition. In the most general sense, tonal features can be extracted at either the frame level or the pronunciation-interval level. For the former, tonal features are usually embodied by the instantaneous pitch information of each frame, while for the latter, tonal features are typically represented as an ensemble of different pitch-related statistical features calculated from the pronunciation interval of interest (such as a phone, a syllable, or sub-intervals thereof).
In order to robustly derive the pitch information of each frame for use in Mandarin tone recognition, we investigate not only various pitch estimation methods (such as fundamental frequency variation spectrum (FFV Spectrum), probability of voicing (POV) and high-order Mel-frequency Cepstral Coefficients (HMFCC)) but also various pitch normalization mechanisms (such as mean and variance normalization (MVN) and histogram equalization (HEQ)). In particular, we present a novel use of linear predictive coefficients (LPC) to approximate the curve of the normalized cross correlation function (NCCF) so that the frame-level pitch information can be more faithfully rendered. In addition, we compare the utilities of several pitch-related statistical features calculated within or among sub-intervals of a syllable, including our proposed features that are derived based on the skewness and kurtosis of pitch values. Furthermore, we also leverage different machine-learning techniques, such as support vector machine (SVM) and deep neural network (DNN), to work in concert with the aforementioned tonal features for Mandarin tone recognition.
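For readers unfamiliar with the NCCF curve referred to above, the following sketch computes it for a single frame using the commonly cited definition: the cross-correlation between a reference window and a lagged window, divided by the square root of the product of their energies. It covers only the NCCF itself and does not attempt the LPC approximation proposed in the thesis; the function name and parameters are assumptions made for illustration.

```python
import numpy as np

def nccf(signal, start, window, max_lag):
    """Normalized cross correlation for lags 0..max_lag of one frame.

    The frame's reference window is signal[start:start + window]; for each
    lag k it is compared with the window shifted k samples to the right.
    """
    x = np.asarray(signal, dtype=float)
    if start < 0 or start + window + max_lag > len(x):
        raise ValueError("signal too short for the requested frame and lags")
    ref = x[start:start + window]
    ref_energy = np.dot(ref, ref)
    values = np.zeros(max_lag + 1)
    for k in range(max_lag + 1):
        seg = x[start + k:start + k + window]
        denom = np.sqrt(ref_energy * np.dot(seg, seg))
        if denom > 0.0:
            # Values lie in [-1, 1]; silent windows are left at 0.
            values[k] = np.dot(ref, seg) / denom
    return values
```

For a periodic signal the curve peaks near lags equal to multiples of the pitch period. For instance, for a 100 Hz sine sampled at 8000 Hz, `nccf(sig, 0, 160, 120)` has its largest value beyond lag 40 at lag 80, which is one period.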
Empirical evaluations performed on the MATBN corpus and the NTNU-MAS corpus seem to demonstrate that our presented tonal feature extraction methods hold good promise for Mandarin tone recognition and are very competitive with existing methods.