
Graduate student: 許曜麒 (Hsu, Yao-Chi)
Thesis title: 錯誤發音檢測使用評估尺度相關訓練準則
Mispronunciation Detection with Evaluation Metric-related Training Criteria
Advisor: 陳柏琳 (Chen, Berlin)
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2016
Academic year of graduation: 104 (ROC calendar, i.e., the 2015-2016 academic year)
Language: Chinese
Number of pages: 80
Keywords: computer assisted pronunciation training, mispronunciation detection, mispronunciation diagnosis, acoustic models, deep neural networks
DOI URL: https://doi.org/10.6345/NTNU202203621
Thesis type: Academic thesis
  • Mispronunciation detection and mispronunciation diagnosis are integral components of a computer assisted pronunciation training (CAPT) system. Together they help second-language (L2) learners pinpoint the erroneous pronunciations in a given utterance and thereby improve their spoken proficiency. This thesis continues this line of research, and its contributions are three-fold. First, we estimate the parameters of both the deep neural network based acoustic models and the decision function used for mispronunciation detection by optimizing training criteria directly linked to the evaluation metric. Second, we find that acoustic models trained in this way also improve the performance of the subsequent mispronunciation diagnosis task. Third, we recast mispronunciation diagnosis as a classification problem and leverage a rich set of informative features proposed in earlier work to improve diagnosis. A series of experiments on a Mandarin mispronunciation detection and diagnosis task demonstrates the merits of the proposed methods.
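    The idea of training toward the evaluation metric can be illustrated with a smoothed F-measure. The following is only a minimal sketch and not the thesis's actual formulation: the function names, the `sharpness` parameter, and the toy scores are all hypothetical. It assumes each phone segment has a pronunciation score (higher means better pronounced) and is flagged as mispronounced when the score falls below a threshold. Replacing the hard comparison with a sigmoid makes the F-measure vary smoothly with the threshold, so it can serve as a training objective.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def soft_f1(scores, labels, threshold, sharpness=10.0):
    """Smoothed F1 for thresholding-based mispronunciation detection.

    scores: hypothetical per-segment pronunciation scores (higher = better).
    labels: 1 if the segment is annotated as mispronounced, else 0.
    sharpness: assumed smoothing parameter; larger values approach the hard step.
    """
    # Soft decision: close to 1 when the score is well below the threshold.
    soft_flags = [sigmoid(sharpness * (threshold - s)) for s in scores]
    true_pos = sum(f * y for f, y in zip(soft_flags, labels))
    predicted = sum(soft_flags)  # soft count of flagged segments
    actual = sum(labels)         # number of annotated mispronunciations
    if predicted + actual == 0:
        return 0.0
    # F1 = 2TP / (2TP + FP + FN) = 2TP / (predicted + actual)
    return 2.0 * true_pos / (predicted + actual)


def hard_f1(scores, labels, threshold):
    """Ordinary F1 with a hard threshold, for comparison."""
    flags = [1 if s < threshold else 0 for s in scores]
    true_pos = sum(f * y for f, y in zip(flags, labels))
    denom = sum(flags) + sum(labels)
    return 2.0 * true_pos / denom if denom else 0.0


def best_threshold(scores, labels, candidates):
    """Pick the candidate threshold with the highest smoothed F1."""
    return max(candidates, key=lambda t: soft_f1(scores, labels, t))


# Toy data: two well-pronounced segments and two mispronounced ones.
toy_scores = [-0.5, -3.0, -0.2, -4.0]
toy_labels = [0, 1, 0, 1]
chosen = best_threshold(toy_scores, toy_labels, [-5.0, -1.5, 0.0])
```

    In this toy setting a simple search over candidate thresholds is enough. Because the smoothed objective is differentiable, the same quantity could in principle be maximized by gradient-based updates of the threshold and of the model that produces the scores.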

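    Chapter 1 Introduction
        1.1 Research Background and Motivation
        1.2 Automatic Speech Recognition
            1.2.1 Feature Extraction
            1.2.2 Acoustic Models
            1.2.3 Language Models
            1.2.4 Decoding
        1.3 Computer Assisted Pronunciation Training
            1.3.1 Types of Mispronunciation
            1.3.2 Mispronunciation Detection Based on Acoustic-Model Pronunciation Features
            1.3.3 Mispronunciation Detection Based on Prosodic Features
            1.3.4 Feedback
            1.3.5 Evaluation Criteria
        1.4 Research Content and Contributions of This Thesis
        1.5 Thesis Organization
    Chapter 2 Literature Review
        2.1 Goodness of Pronunciation
        2.2 Log Phone Posterior
        2.3 Log Senone Posterior
        2.4 Acoustic-Model-Based Feature Extraction for Pronunciation Detection
        2.5 Classification Models for Mispronunciation Detection
            2.5.1 Logistic Regression Classifier
            2.5.2 Multi-layer Logistic Regression Classifier
            2.5.3 Support Vector Machine
        2.6 Mispronunciation Diagnosis
    Chapter 3 Discriminative Training That Maximizes Mispronunciation Detection Evaluation Metrics
        3.1 F-measure Objective Function
        3.2 Maximum F-measure Discriminative Training
        3.3 R-measure Objective Function
        3.4 Maximum R-measure Discriminative Training
    Chapter 4 Mispronunciation Diagnosis
        4.1 Minimum Entropy Regularization Term
        4.2 Supervised Training for Mispronunciation Diagnosis
    Chapter 5 Experimental Setup
        5.1 Mandarin Learner Spoken Corpus
        5.2 Acoustic Model Training
        5.3 Evaluation of Mispronunciation Detection
    Chapter 6 Results and Discussion of the Pronunciation Detection Experiments
        6.1 Experiments on Pronunciation Detection Features with Classification Models
        6.2 Thresholding-based Maximum F-measure Discriminative Training
        6.3 Thresholding-based Maximum R-measure Discriminative Training
        6.4 Classification-based Maximum F-measure Discriminative Training
        6.5 Investigation of Additional Features
        6.6 Mispronunciation Diagnosis Experiments
    Chapter 7 Conclusions and Future Work
    References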

