Graduate Student: 羅天宏 Lo, Tien-Hong
Thesis Title: 探討聲學模型化技術與半監督鑑別式訓練於語音辨識之研究 (Investigating Acoustic Modeling and Semi-supervised Discriminative Training for Speech Recognition)
Advisor: 陳柏琳 Chen, Berlin
Degree: Master
Department: 資訊工程學系 Department of Computer Science and Information Engineering
Year of Publication: 2019
Graduating Academic Year: 107
Language: Chinese
Pages: 102
Keywords (Chinese): 半監督式學習、鑑別式訓練、整體學習、遷移學習、自動語音辨識、聲學模型、LF-MMI
Keywords (English): semi-supervised training, discriminative training, transfer learning, ensemble learning, automatic speech recognition, acoustic model, LF-MMI
DOI URL: http://doi.org/10.6345/THE.NTNU.DCSIE.004.2019.B02
Document Type: Academic thesis
In recent years, the discriminative training objective lattice-free maximum mutual information (LF-MMI) has achieved major breakthroughs in acoustic model training for automatic speech recognition (ASR). Although LF-MMI attains the best results under supervised settings, research on its use in semi-supervised settings remains limited. In self-training, the most common semi-supervised method, the seed model often performs poorly because the transcribed data is limited. Moreover, because LF-MMI is a discriminative training criterion, it is especially sensitive to the correctness of the labels. Accordingly, this thesis decomposes semi-supervised training into two problems: 1) how to improve the performance of the seed model, and 2) how to exploit untranscribed (manually unlabeled) data. For the first problem, we employ two approaches that correspond to whether additional data is available: transfer learning, realized by weight transfer and multitask learning; and model combination, realized by hypothesis-level combination and frame-level combination. For the second problem, building on the LF-MMI objective, we introduce negative conditional entropy (NCE) and lattice supervision, which preserves a larger hypothesis space. A series of experiments on the Augmented Multi-Party Interaction (AMI) meeting corpus shows that both transfer learning with out-of-domain data (OOD) and model combination exploiting complementary diversity improve the performance of the seed model, while NCE and lattice supervision exploit the untranscribed data to improve the word error rate (WER) and the WER recovery rate (WRR).
Recently, a novel objective function for discriminative acoustic model training, namely lattice-free maximum mutual information (LF-MMI), has been proposed and has achieved new state-of-the-art results in automatic speech recognition (ASR). Although LF-MMI shows excellent performance on various ASR tasks in supervised training settings, its performance is often significantly degraded in semi-supervised settings. This is because LF-MMI shares a common deficiency of discriminative training criteria: it is sensitive to the accuracy of the transcripts of the training utterances. In view of the above, this thesis explores two questions concerning LF-MMI in a semi-supervised training setting: first, how to improve the seed model, and second, how to use untranscribed training data. For the former, we investigate several transfer learning approaches (e.g., weight transfer and multitask learning) as well as model combination approaches (e.g., hypothesis-level combination and frame-level combination); the distinction between the two lies in whether extra training data is used. For the latter, we introduce negative conditional entropy (NCE) and lattice supervision in conjunction with the LF-MMI objective function. A series of experiments was conducted on the Augmented Multi-Party Interaction (AMI) benchmark corpus. The results show that transfer learning using out-of-domain data (OOD) and model combination based on complementary diversity can effectively improve the performance of the seed model, and that the pairing of NCE and lattice supervision improves the word error rate (WER) and WER recovery rate (WRR).
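As background for the objectives named in the abstract (a sketch of standard material from the LF-MMI and semi-supervised ASR literature, not an excerpt from the thesis), the MMI criterion maximizes the log posterior of the reference transcription \(W_u\) of each utterance \(X_u\); LF-MMI evaluates the denominator over a phone-level graph rather than word lattices:

\[
\mathcal{F}_{\mathrm{MMI}}(\theta) = \sum_{u} \log \frac{p_{\theta}(X_u \mid W_u)\, P(W_u)}{\sum_{W} p_{\theta}(X_u \mid W)\, P(W)}.
\]

For an untranscribed utterance there is no reference \(W_u\); the NCE objective instead weights every hypothesis \(W\) in the decoded lattice by its own posterior, so that confident hypotheses dominate the gradient:

\[
\mathcal{F}_{\mathrm{NCE}}(\theta) = -H(W \mid X_u) = \sum_{W} P_{\theta}(W \mid X_u) \log P_{\theta}(W \mid X_u).
\]

The WER recovery rate reported in the abstract is conventionally defined as the fraction of the oracle improvement that the semi-supervised system recovers, i.e. \(\mathrm{WRR} = (\mathrm{WER}_{\mathrm{seed}} - \mathrm{WER}_{\mathrm{semi}}) / (\mathrm{WER}_{\mathrm{seed}} - \mathrm{WER}_{\mathrm{oracle}})\), where the oracle model is trained with all transcripts available.

The frame-level combination mentioned for improving the seed model can be illustrated with a minimal sketch: interpolating the per-frame senone posteriors of two acoustic models before decoding. The function below is a hypothetical illustration under assumed array shapes and an assumed equal-weight default, not code from the thesis.

```python
import numpy as np

def frame_level_combination(post_a: np.ndarray, post_b: np.ndarray,
                            weight: float = 0.5) -> np.ndarray:
    """Interpolate two (frames x senones) posterior matrices and renormalize.

    Hypothetical sketch of frame-level combination: both models score the
    same utterance, and their per-frame posteriors are linearly interpolated
    before being passed to the decoder.
    """
    assert post_a.shape == post_b.shape, "both models must score the same frames/senones"
    combined = weight * post_a + (1.0 - weight) * post_b
    # Renormalize each frame so every row remains a valid distribution.
    return combined / combined.sum(axis=1, keepdims=True)
```

Hypothesis-level combination, by contrast, merges the decoded outputs of the individual systems (e.g., ROVER-style voting over aligned word hypotheses) rather than their frame scores.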