| Graduate Student: | 蔡孟庭 TSAI, Meng-Ting |
|---|---|
| Thesis Title: | 語音合成技術應用於電腦輔助發音訓練之研究 (A Study of Speech Synthesis Techniques for Computer-Assisted Pronunciation Training) |
| Advisor: | 陳柏琳 Chen, Berlin |
| Oral Defense Committee: | 陳柏琳 Chen, Berlin; 陳冠宇 Chen, Kuan-Yu; 曾厚強 Tseng, Hou-Chiang |
| Oral Defense Date: | 2024/01/24 |
| Degree: | 碩士 Master |
| Department: | 資訊工程學系 Department of Computer Science and Information Engineering |
| Year of Publication: | 2024 |
| Academic Year: | 112 |
| Language: | Chinese |
| Number of Pages: | 73 |
| Chinese Keywords: | 語音合成 (speech synthesis), 電腦輔助發音訓練 (computer-assisted pronunciation training), 黃金語音 (golden speech), 自動發音評估 (automatic pronunciation assessment) |
| English Keywords: | text-to-speech, computer-assisted pronunciation training, golden speaker, automatic pronunciation assessment |
| Research Method: | Experimental design |
| DOI URL: | http://doi.org/10.6345/NTNU202400327 |
| Document Type: | Academic thesis |
In the field of language learning, listening and speaking training methods can be broadly grouped into three approaches: shadowing, listen-and-repeat, and echoing. All three share the same core idea: the learner listens to the pronunciation of the target language and then imitates its intonation while restating the same content. In practical learning environments, however, finding native (L1) speakers to serve as learning models is subject to many constraints, such as a shortage of qualified instructors in rural areas, high costs, and difficulty accommodating personalized learning schedules. In addition, research indicates that when second-language (ESL/L2) learners listen to and imitate standard pronunciations, training of pronunciation skills is more effective when the voice characteristics of the model speaker are closer to the learner's own. Speech that combines an L2 learner's voice characteristics with an L1 speaker's accent is referred to as a "golden speaker". To address these issues, this study chooses English as the target language for synthesis and generates golden-speaker reference speech for L2 English learners. Beyond attempting to improve the synthesis results and proposing an evaluation framework for synthesized speech in pronunciation-learning scenarios, the study confirms that synthesized speech can help correct mispronunciations. The synthesized speech is further applied to computer-assisted pronunciation training: the study verifies that the dynamic time warping (DTW) cost between an L2 learner's original speech and the synthesized speech can serve as an effective predictive feature for pronunciation assessment, and that synthesized speech improves the accuracy of automatic pronunciation assessment, thereby enhancing the learning and work efficiency of both learners and instructors in computer-assisted pronunciation training settings.
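The abstract rests on two technical ideas: synthesizing a "golden speaker" reference in the learner's own voice with a zero-shot multi-speaker TTS model, and using the DTW cost between the learner's recording and that reference as a pronunciation-assessment feature. The minimal Python sketch below illustrates both under stated assumptions: the open-source Coqui TTS YourTTS checkpoint stands in for the thesis's synthesis system, librosa MFCCs and DTW stand in for its feature extraction and alignment, and the file paths and prompt text are hypothetical. It is not the thesis's actual pipeline.

```python
# Hedged sketch: golden-speaker synthesis + DTW-based comparison.
# Assumptions (not taken from the thesis): Coqui TTS's YourTTS checkpoint for
# zero-shot voice cloning, librosa MFCC + DTW for the distance feature, and
# hypothetical file paths / prompt text.
import librosa
import numpy as np
from TTS.api import TTS

LEARNER_WAV = "learner_utterance.wav"   # L2 learner's own recording (hypothetical path)
GOLDEN_WAV = "golden_reference.wav"     # synthesized golden-speaker output (hypothetical path)
PROMPT_TEXT = "Please call Stella."     # text the learner read aloud (hypothetical)

# (a) Zero-shot multi-speaker TTS: synthesize the prompt conditioned on the
# learner's voice, yielding a native-like "golden speaker" reference.
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts", progress_bar=False)
tts.tts_to_file(
    text=PROMPT_TEXT,
    speaker_wav=LEARNER_WAV,   # conditions the model on the learner's voice
    language="en",
    file_path=GOLDEN_WAV,
)

# (b) DTW cost between the learner's original speech and the golden reference,
# used as a predictive feature for automatic pronunciation assessment.
def dtw_cost(path_a: str, path_b: str, sr: int = 16000, n_mfcc: int = 13) -> float:
    """Length-normalized DTW alignment cost over MFCC sequences."""
    y_a, _ = librosa.load(path_a, sr=sr)
    y_b, _ = librosa.load(path_b, sr=sr)
    mfcc_a = librosa.feature.mfcc(y=y_a, sr=sr, n_mfcc=n_mfcc)
    mfcc_b = librosa.feature.mfcc(y=y_b, sr=sr, n_mfcc=n_mfcc)
    acc_cost, warp_path = librosa.sequence.dtw(X=mfcc_a, Y=mfcc_b, metric="cosine")
    return float(acc_cost[-1, -1] / len(warp_path))  # normalize by warping-path length

feature = dtw_cost(LEARNER_WAV, GOLDEN_WAV)
print(f"DTW-based mispronunciation feature: {feature:.4f}")
```

Normalizing the accumulated cost by the warping-path length keeps the feature comparable across utterances of different durations, which matters if the score is fed to a downstream assessment model.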