| Field | Value |
|---|---|
| Author | 林奕儒 (Lin, Yi-Ju) |
| Thesis title | 結合韻律特徵與聲學特徵於錯誤發音檢測與診斷之研究 (Mispronunciation Detection and Diagnosis Combining Prosodic Features and Phonetic Features) |
| Advisor | 陳柏琳 (Chen, Berlin) |
| Degree | Master |
| Department | 資訊工程學系 (Department of Computer Science and Information Engineering) |
| Year of publication | 2019 |
| Graduation academic year | 107 (ROC calendar) |
| Language | Chinese |
| Pages | 70 |
| Keywords (Chinese) | computer-assisted pronunciation training, multi-task learning, automatic speech recognition, mispronunciation detection, mispronunciation diagnosis, prosodic features, deep neural networks |
| Keywords (English) | computer assisted pronunciation training, mispronunciation detection, mispronunciation diagnosis, acoustic models, deep neural networks, multi-task learning, prosodic features |
| DOI | http://doi.org/10.6345/THE.NTNU.DCSIE.003.2019.B02 |
| Thesis type | Academic thesis |
Chinese abstract: This thesis investigates the application of prosodic features and multi-task deep neural network models to mispronunciation detection and diagnosis (MDD). The aim of computer-assisted pronunciation training (CAPT) is to automatically point out the pronunciation problems of foreign-language learners; procedurally, it can be roughly divided into two stages, mispronunciation detection and mispronunciation diagnosis. This thesis mainly examines 1) the benefit of combining prosodic features with acoustic features for mispronunciation detection and diagnosis, 2) the use of multi-task deep neural network models to address the imbalance between positive and negative training examples, and 3) the combination of likelihood-based scoring (GOP) and classification-based scoring to achieve better detection and diagnosis results. Experimental results show that acoustic features are more helpful for the mispronunciation detection task, while prosodic features provide a greater benefit for the mispronunciation diagnosis task.
English abstract: This thesis investigates how a multi-task deep neural network model and prosodic features can aid mispronunciation detection and diagnosis (MDD). The purpose of computer-assisted pronunciation training (CAPT) is to automatically detect and correct the mispronunciations of second-language (L2) learners. CAPT can be divided into two stages: mispronunciation detection and mispronunciation diagnosis. This thesis focuses on three aspects. First, we explore the benefits of combining prosodic and phonetic features for the mispronunciation detection and diagnosis task. Second, we use multi-task learning models to alleviate the problem of imbalanced positive and negative training data. Finally, we combine the likelihood-based scoring (GOP) method with a classification-based scoring method to achieve better detection and diagnosis results. Experimental results show that phonetic features are more effective for mispronunciation detection, whereas prosodic features are more helpful for the mispronunciation diagnosis task.
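The likelihood-based GOP (Goodness of Pronunciation) score referred to in both abstracts is, in its standard formulation, a duration-normalized log-likelihood ratio computed over the speech segment force-aligned to the canonical phone; the notation below follows that common definition rather than any thesis-specific variant:

$$
\mathrm{GOP}(p) \;=\; \frac{1}{T_p}\,\log\frac{P\big(\mathbf{O}^{(p)} \mid p\big)}{\max_{q \in Q} P\big(\mathbf{O}^{(p)} \mid q\big)}
$$

where \(\mathbf{O}^{(p)}\) is the observation sequence aligned to phone \(p\), \(T_p\) is its number of frames, and \(Q\) is the phone inventory. A phone whose GOP falls below a threshold is flagged as mispronounced, and this score can then be combined with the output of a classification-based scorer, for instance by a weighted sum or a second-stage classifier.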
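To make the multi-task idea concrete, the sketch below shows a shared-trunk network over concatenated phonetic and prosodic segment features with separate detection and diagnosis heads. It is a minimal PyTorch illustration under assumed feature dimensions, phone-set size, and loss weighting; it is not the architecture or configuration reported in the thesis.

```python
# Minimal sketch of a multi-task MDD model: a shared trunk over concatenated
# phonetic + prosodic segment features, with a binary detection head and a
# multi-class diagnosis head. All layer sizes, feature dimensions, and the
# phone-set size are illustrative assumptions.
import torch
import torch.nn as nn


class MultiTaskMDD(nn.Module):
    def __init__(self, phonetic_dim=43, prosodic_dim=6, hidden_dim=256, num_phones=39):
        super().__init__()
        input_dim = phonetic_dim + prosodic_dim  # concatenated feature vector
        # Shared layers learned jointly by both tasks (multi-task learning).
        self.shared = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Task 1: mispronunciation detection (correct vs. mispronounced).
        self.detect_head = nn.Linear(hidden_dim, 2)
        # Task 2: mispronunciation diagnosis (which phone was actually produced).
        self.diagnose_head = nn.Linear(hidden_dim, num_phones)

    def forward(self, x):
        h = self.shared(x)
        return self.detect_head(h), self.diagnose_head(h)


def multitask_loss(det_logits, diag_logits, det_labels, diag_labels, lambda_aux=0.5):
    # Joint loss: the diagnosis task provides an auxiliary training signal;
    # lambda_aux is a tunable weight (an assumed value here).
    ce = nn.CrossEntropyLoss()
    return ce(det_logits, det_labels) + lambda_aux * ce(diag_logits, diag_labels)


if __name__ == "__main__":
    model = MultiTaskMDD()
    feats = torch.randn(8, 49)              # batch of 8 segment-level feature vectors
    det_labels = torch.randint(0, 2, (8,))  # 0 = correct, 1 = mispronounced
    diag_labels = torch.randint(0, 39, (8,))  # realised phone identity
    det_logits, diag_logits = model(feats)
    loss = multitask_loss(det_logits, diag_logits, det_labels, diag_labels)
    loss.backward()
    print(loss.item())
```

Training both heads against a joint loss lets the diagnosis labels act as an auxiliary signal for the shared layers, which is one common way multi-task learning is used to compensate for the scarcity of mispronounced (positive) examples.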