| Field | Value |
|---|---|
| Graduate student | 陳黃威 |
| Thesis title | 改善豐富文脈模型於中文語音合成之研究 (A Study of Enhanced Rich Context Modeling Techniques for Mandarin Speech Synthesis) |
| Advisor | 陳柏琳 |
| Degree | Master |
| Department | Department of Computer Science and Information Engineering |
| Year of publication | 2014 |
| Academic year of graduation | 102 |
| Language | Chinese |
| Pages | 75 |
| Keywords (Chinese) | 基於隱藏式馬可夫模型之語音合成、豐富文脈模型之語音合成、起始語音參數序列、潛藏語意分析、空間向量模型 |
| Keywords (English) | Hidden Markov Model Based Speech Synthesis, Rich Context Models Based Speech Synthesis, Initial Speech Parameter Sequence, Latent Semantic Analysis, Vector Space Model |
| Thesis type | Academic thesis |
In this thesis, we first review three different synthesis techniques: concatenative speech synthesis, statistical model-based speech synthesis, and hybrid speech synthesis. Taking statistical model-based synthesis as the main research direction, we introduce two techniques: hidden Markov model-based speech synthesis (HMM-based speech synthesis) and HMM-based speech synthesis with rich context models. We apply both techniques to Mandarin speech synthesis and propose an improvement to the rich context modeling approach: latent semantic analysis (LSA) is used to uncover the latent prosody of a context, and these latent prosodic cues guide the selection of prosodically similar models from the training corpus. This yields a better initial speech parameter vector sequence, from which the speech parameter generation algorithm produces the speech parameter vector sequence of the target sentence for actual synthesis. Our experiments use the newly released NTUT Chinese e-book speech corpus (NTUT-AB01-CH) as training data, and a series of subjective and objective evaluations assesses the strengths of the proposed method against existing methods within the statistical speech synthesis framework.
In this thesis, we first provide a brief review of three mainstream frameworks for speech synthesis, namely concatenative speech synthesis, statistical model-based speech synthesis, and hybrid speech synthesis. We then focus our attention on two important instantiations of the statistical model-based framework and their application to Mandarin Chinese speech synthesis: the hidden Markov model-based method and the rich context model-based method. In addition, we explore the use of latent semantic analysis (LSA) to discover lexical and prosodic cues inherent in the contextual descriptions of training speech utterances, with the hope that they can subsequently be used to obtain a good initialization for estimating the observation vector sequence of an utterance to be synthesized. A series of subjective and objective evaluations is conducted on the newly released NTUT-AB01-CH corpus to validate the performance merits of the aforementioned methods stemming from the statistical model-based framework.
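The LSA-based selection step described in the abstract can be illustrated with a toy example. The sketch below is not the thesis's actual implementation: a hypothetical term-by-utterance count matrix stands in for the context descriptions of training utterances, a truncated SVD projects each utterance into a low-rank latent space, and cosine similarity (the standard vector space model measure) picks the prosodically closest training utterance.

```python
import numpy as np

# Toy term-by-utterance matrix: rows are hypothetical context features
# (e.g. tone, phone identity, position-in-phrase labels), columns are
# four training utterances. Utterances 0 and 2 share features; so do 1 and 3.
X = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 2.0],
])

# LSA: a truncated SVD projects each utterance into a k-dimensional
# latent space intended to capture its latent prosody.
U, s, Vt = np.linalg.svd(X, full_matrices=False)  # s sorted descending
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one row per utterance

def cosine(a, b):
    """Cosine similarity between two latent vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Select the training utterance whose latent vector is closest to the
# target (here, utterance 0's own latent vector, purely for illustration).
target = doc_vecs[0]
sims = [cosine(target, d) for d in doc_vecs]
best = int(np.argmax(sims))
```

In the thesis's setting, the target vector would instead be derived from the context description of the sentence to be synthesized, and the selected models would supply the initial speech parameter sequence.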
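The speech parameter generation algorithm of Tokuda et al. then turns per-frame static and delta statistics into a smooth trajectory by solving the weighted least-squares system W'Σ⁻¹W c = W'Σ⁻¹μ, where W stacks the static and delta regression windows. The following is a minimal one-dimensional sketch with made-up means and variances; only the structure of W and the normal equations follows the algorithm.

```python
import numpy as np

T = 4  # number of frames; one-dimensional static feature for illustration

# Hypothetical per-frame means and variances for the static and delta
# streams, as would come from the selected context-dependent models.
mu_static = np.array([0.0, 1.0, 1.0, 0.0])
mu_delta = np.array([1.0, 0.5, -0.5, -1.0])
var_static = np.ones(T)
var_delta = np.ones(T) * 2.0

# Window matrix W: the first T rows extract statics, the last T rows
# compute the delta window delta_t = 0.5 * (c[t+1] - c[t-1]).
W = np.zeros((2 * T, T))
for t in range(T):
    W[t, t] = 1.0                  # static window
    if t > 0:
        W[T + t, t - 1] = -0.5     # delta window, left neighbor
    if t < T - 1:
        W[T + t, t + 1] = 0.5      # delta window, right neighbor

mu = np.concatenate([mu_static, mu_delta])
prec = np.concatenate([1.0 / var_static, 1.0 / var_delta])  # diag of Sigma^-1

# Normal equations: W' Sigma^-1 W c = W' Sigma^-1 mu.
A = W.T @ (prec[:, None] * W)
b = W.T @ (prec * mu)
c = np.linalg.solve(A, b)  # maximum-likelihood static trajectory
```

Because the delta rows couple neighboring frames, the solved trajectory is smooth rather than a stepwise readout of the static means; a good initialization, as proposed in this thesis, matters because the means themselves come from the selected rich context models.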