
Author: 吳姿儀 (Wu, Tzu-I)
Thesis Title: 英語口說精熟度之自動化評測技術研究
Automated Speaking Assessment Technology: Beyond Holistic Grading
Advisor: 陳柏琳 (Chen, Berlin)
Committee Members: 陳柏琳 (Chen, Berlin), 陳冠宇 (Chen, Kuan-Yu), 曾厚強 (Tseng, Hou-Chiang)
Oral Defense Date: 2024/01/24
Degree: Master's
Department: 資訊工程學系 (Department of Computer Science and Information Engineering)
Year of Publication: 2024
Academic Year of Graduation: 112
Language: English
Number of Pages: 53
English Keywords: Automated Speaking Assessment, prototypical network, English as a Medium of Instruction, BERT, wav2vec 2.0, loss reweighting
Research Method: Experimental design
DOI URL: http://doi.org/10.6345/NTNU202400358
Document Type: Academic thesis

The surge in English Medium Instruction (EMI) in higher education across Taiwan aims to prepare students for a competitive international environment. However, this shift introduces challenges: students must grasp complex academic concepts in English, a non-native language, which may misrepresent their academic capabilities, and instructors have difficulty discerning whether students' learning obstacles stem from language barriers or from a lack of subject understanding. To address these concerns, we develop a tailored Automated Speaking Assessment (ASA) system with a focus on Taiwanese students, one that reflects the unique linguistic and academic requirements of Taiwanese EMI settings. We investigate several models, including traditional feature-based machine learning models and large pre-trained models fine-tuned on a Taiwanese EMI-focused dataset. We further propose approaches to mitigate the scarcity of relevant data using prototypical networks and to address class imbalance via a loss re-weighting technique (see the sketch after this paragraph). By aligning assessment techniques closely with the specific needs of Taiwanese EMI students, our ASA system offers a more effective and contextually appropriate tool for language proficiency assessment in academic settings. Experimental results demonstrate the effectiveness of these methodologies.
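
The abstract names two modeling ideas that recur in Chapter 3: metric-based classification with class prototypes (Section 3.4.1, cf. prototypical networks [59]) and loss re-weighting for imbalanced score distributions (Section 3.4.2). The following minimal PyTorch sketch illustrates both ideas in their generic form; the function names, tensor shapes, and the inverse-frequency weighting scheme are illustrative assumptions, not the thesis's actual implementation.

    import torch
    import torch.nn.functional as F

    def class_prototypes(support_emb, support_labels, num_classes):
        # Prototype = mean embedding of the support examples at each level.
        return torch.stack([support_emb[support_labels == c].mean(dim=0)
                            for c in range(num_classes)])

    def prototype_logits(query_emb, protos):
        # Negative squared Euclidean distance to each prototype acts as a logit.
        return -torch.cdist(query_emb, protos).pow(2)

    def reweighted_loss(logits, labels, class_counts):
        # Inverse-frequency weights: rarer proficiency levels weigh more
        # (one common re-weighting scheme, assumed here for illustration).
        weights = class_counts.sum() / (len(class_counts) * class_counts.float())
        return F.cross_entropy(logits, labels, weight=weights)

    # Illustrative usage with random tensors standing in for BERT / wav2vec 2.0
    # response embeddings (all dimensions are arbitrary assumptions).
    num_levels, dim = 4, 768
    support_emb, query_emb = torch.randn(40, dim), torch.randn(8, dim)
    support_labels = torch.arange(num_levels).repeat(10)  # every level covered
    query_labels = torch.randint(0, num_levels, (8,))
    protos = class_prototypes(support_emb, support_labels, num_levels)
    loss = reweighted_loss(prototype_logits(query_emb, protos),
                           query_labels,
                           torch.bincount(support_labels, minlength=num_levels))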

List of Tables
List of Figures
Abstract
1. Introduction
2. Related Work
  2.1 Mispronunciation Detection and Diagnosis (MDD)
  2.2 Automatic Pronunciation Assessment (APA)
  2.3 Automated Speaking Assessment (ASA)
3. Methodology
  3.1 End-to-End Based ASR
  3.2 Traditional Feature-based Grader
    3.2.1 Handcrafted Features
      3.2.1.1 Audio Features: Delivery
        3.2.1.1.1 Pronunciation
          3.2.1.1.1.1 Segmental Pronunciation
          3.2.1.1.1.2 Suprasegmental Pronunciation
        3.2.1.1.2 Fluency
          3.2.1.1.2.1 Speech Rate
          3.2.1.1.2.2 Hesitation
      3.2.1.2 Text Features: Language Use
        3.2.1.2.1 Vocabulary
        3.2.1.2.2 Grammar
          3.2.1.2.2.1 Part-of-Speech (POS) / Dependency Parsing (DEP) / Morphology (MORPH)
          3.2.1.2.2.2 Grammar Label Information
  3.3 Wav2vec 2.0-based Grader
  3.4 BERT-based Grader
    3.4.1 Metric-based Classification
    3.4.2 Loss Re-Weighting
4. Experimental Setup and Results
  4.1 EMI (English as a Medium of Instruction) Dataset
  4.2 Experimental Results and Analysis
    4.2.1 Evaluation Metrics
    4.2.2 Traditional Feature-based Grader
      4.2.2.1 Experimental Setup
      4.2.2.2 Feature Importance Analysis
      4.2.2.3 Experimental Results
    4.2.3 Wav2vec 2.0-based Grader
      4.2.3.1 Experimental Setup
      4.2.3.2 Experimental Results
    4.2.4 BERT-based Grader
      4.2.4.1 Experimental Setup
      4.2.4.2 Experimental Results
5. Conclusion
6. References
7. Appendix

    [1] N. Li and J. Wu, “Exploring Assessment for Learning Practices in the EMI Classroom in the Context of Taiwanese Higher Education,” Language Education & Assessment, vol. 1, no. 1, pp. 28–44, 2018, doi: 10.29140/lea.v1n1.46.

    [2] W. Hu, Y. Qian, F. K. Soong, and Y. Wang, “Improved mispronunciation detection with deep neural network trained acoustic models and transfer learning based logistic regression classifiers,” Speech Communication, vol. 67, pp. 154–166, 2015.

    [3] W.-K. Lo, S. Zhang, and H. Meng, “Automatic derivation of phonological rules for mispronunciation detection in a computer-assisted pronunciation training system,” in Proceedings of Interspeech, 2010.

    [4] W. Li, S. M. Siniscalchi, N. F. Chen, and C.-H. Lee, “Improving non-native mispronunciation detection and enriching diagnostic feedback with DNN-based speech attribute modeling,” in Proceedings of ICASSP, 2016.

    [5] N. F. Chen and H. Li, “Computer-assisted pronunciation training: From pronunciation scoring towards spoken language learning,” in Proceedings of APSIPA, 2016, doi: 10.1109/APSIPA.2016.7820782.

    [6] Y. Gong, Z. Chen, I.-H. Chu, P. Chang, and J. Glass, “Transformer-based multi-aspect multi-granularity non-native English speaker pronunciation assessment,” in Proceedings of ICASSP, 2022.

    [7] F.-A. Chao, T.-H. Lo, T.-I. Wu, Y.-T. Sung, and B. Chen, “3M: An effective multi-view, multi-granularity, and multi-aspect modeling approach to English pronunciation assessment,” in Proceedings of APSIPA ASC, 2022.

    [8] W.-H. Peng, H.-W. Wang, S. Chen, and B. Chen, “Enhancing automated English speaking assessment for L2 speakers with BERT and wav2vec2.0 fusion,” in Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023), 2023.

    [9] Y. Qian, P. Lange, K. Evanini, R. Pugh, R. Ubale, M. Mulholland, and X. Wang, “Neural Approaches to Automated Speech Scoring of Monologue and Dialogue Responses,” in Proceedings of ICASSP, 2019, doi: 10.1109/ICASSP.2019.8683717.

    [10] L. Chen, K. Zechner, S.-Y. Yoon, K. Evanini, X. Wang, A. Loukina, J. Tao, L. Davis, C. M. Lee, M. Ma, R. Mundkowsky, C. Lu, C. W. Leong, and B. Gyawali, “Automated scoring of nonnative speech using the SpeechRaterSM v. 5.0 engine,” ETS Research Report Series, 2018, doi: 10.1002/ets2.12198.

    [11] A. Loukina, K. Zechner, L. Chen, and M. Heilman, “Feature selection for automated speech scoring,” in Proceedings of the Tenth Workshop on Innovative Use of NLP for Building Educational Applications (BEA), 2015, doi: 10.3115/v1/W15-0602.

    [12] X. Xi, D. Higgins, K. Zechner, and D. M. Williamson, “Automated scoring of spontaneous speech using SpeechRaterSM v1.0,” ETS Research Report Series, 2008, doi: 10.1002/j.2333-8504.2008.tb02148.x.

    [13] S. Xie, K. Evanini, and K. Zechner, “Exploring content features for automated speech scoring,” in Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2012.

    [14] Z. Yu, V. Ramanarayanan, D. Suendermann-Oeft, X. Wang, K. Zechner, L. Chen, J. Tao, A. Ivanou, and Y. Qian, “Using bidirectional LSTM recurrent neural networks to learn high-level abstractions of sequential features for automated scoring of non-native spontaneous speech,” in Proceedings of 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 2015.

    [15] H. Craighead, A. Caines, P. Buttery, and H. Yannakoudakis, “Investigating the effect of auxiliary objectives for the automated grading of learner English speech transcriptions,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020.

    [16] R. van Dalen, K. M. Knill, and M. J. F. Gales, “Automatically grading learners' English using a Gaussian process,” in Proceedings of Speech and Language Technology in Education (SLaTE 2015), 2015.

    [17] K. Knill, M. Gales, K. Kyriakopoulos, A. Malinin, A. Ragni, Y. Wang, and A. Caines, “Impact of ASR performance on free speaking language assessment,” in Proceedings of Interspeech, 2018.

    [18] V. Raina, M. J. F. Gales, and K. M. Knill, “Universal adversarial attacks on spoken language assessment systems,” in Proceedings of Interspeech, 2020.

    [19] S. Bannò, K. Knill, M. Matassoni, V. Raina, and M. Gales, “Assessment of L2 oral proficiency using self-supervised speech representation learning,” in Proceedings of the 9th Workshop on Speech and Language Technology in Education (SLaTE), 2023.

    [20] S. Bannò and M. Matassoni, “Proficiency assessment of L2 spoken English using wav2vec 2.0,” in Proceedings of 2022 IEEE Spoken Language Technology Workshop (SLT), 2023.

    [21] A. Caines, L. Benedetto, S. Taslimipoor, C. Davis, Y. Gao, O. Andersen, Z. Yuan, M. Elliott, R. Moore, C. Bryant, M. Rei, H. Yannakoudakis, A. Mullooly, D. Nicholls, and P. Buttery, “On the application of Large Language Models for language teaching and assessment technology,” in Proceedings of AIED2023 Workshop: Empowering Education with LLMs - the Next-Gen Interface and Content Generation, 2023.

    [22] S. Ishikawa, “Design of the ICNALE Spoken: A new database for multi-modal contrastive interlanguage analysis,” Learner Corpus Studies in Asia and the World, vol. 2, pp. 63–76, 2014.

    [23] W.-K. Leung, X. Liu, and H. Meng, “CNN-RNN-CTC Based End-to-end Mispronunciation Detection and Diagnosis,” in Proceedings of ICASSP, 2019.

    [24] M. Wu, K. Li, W.-K. Leung, and H. Meng, “Transformer Based End-to-End Mispronunciation Detection and Diagnosis,” in Proceedings of Interspeech, 2021, doi: 10.21437/Interspeech.2021-1467.

    [25] T.-H. Lo, S.-Y. Weng, H.-J. Chang, and B. Chen, “An effective end-to-end modeling approach for mispronunciation detection,” in Proceedings of Interspeech, 2020, doi: 10.21437/Interspeech.2020-1605.

    [26] B.-C. Yan, M.-C. Wu, H.-T. Hung, and B. Chen, “An end-to-end mispronunciation detection system for L2 English speech leveraging novel anti-phone modeling,” in Proceedings of Interspeech, 2020.

    [27] B.-C. Yan, H.-W. Wang, and B. Chen, “Peppanet: Effective mispronunciation detection and diagnosis leveraging phonetic, phonological, and acoustic cues,” in Proceedings of 2022 IEEE Spoken Language Technology Workshop (SLT), 2023.

    [28] B.-C. Yan, H.-W. Wang, Y.-C. Wang, and B. Chen, “Effective graph-based modeling of articulation traits for mispronunciation detection and diagnosis,” in Proceedings of ICASSP, 2023, doi: 10.1109/ICASSP49357.2023.10097226.

    [29] H. Ryu, S. Kim, and M. Chung, “A joint model for pronunciation assessment and mispronunciation detection and diagnosis with multi-task learning,” in Proceedings of Interspeech, 2023.

    [30] Y. Liang, K. Song, S. Mao, H. Jiang, L. Qiu, Y. Yang, D. Li, L. Xu, and L. Qiu, “End-to-end word-level pronunciation assessment with MASK pre-training,” in Proceedings of Interspeech, 2023.

    [31] W. Liu, K. Fu, X. Tian, S. Shi, W. Li, Z. Ma, and T. Lee, “An ASR-Free Fluency Scoring Approach with Self-Supervised Learning,” in Proceedings of ICASSP, 2023.

    [32] E. Kim, J.-J. Jeon, H. Seo, and H. Kim, “Automatic pronunciation assessment using self-supervised speech representation learning,” in Proceedings of Interspeech, 2022.

    [33] F.-A. Chao, T.-H. Lo, T.-I. Wu, Y.-T. Sung, and B. Chen, “A hierarchical context-aware modeling approach for multi-aspect and multi-granular pronunciation assessment,” in Proceedings of Interspeech, 2023.

    [34] J. Park and S. Choi, “Addressing cold start problem for end-to-end automatic speech scoring,” in Proceedings of Interspeech, 2023.

    [35] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proceedings of the 40th International Conference on Machine Learning, 2023.

    [36] M. Bain, J. Huh, T. Han, and A. Zisserman, “WhisperX: Time-accurate speech transcription of long-form audio,” in Proceedings of Interspeech, 2023.

    [37] H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz, and M.-P. Gill, “pyannote.audio: neural building blocks for speaker diarization,” in Proceedings of ICASSP, 2020.

    [38] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020.

    [39] B. McFee, C. Raffel, D. Liang, D. P. W. Ellis, M. McVicar, E. Battenberg, and O. Nieto, “librosa: Audio and music signal analysis in Python,” in Proceedings of the 14th Python in Science Conference, 2015.

    [40] T.-I. Wu, T.-H. Lo, F.-A. Chao, Y.-T. Sung, and B. Chen, “A preliminary study on automated speaking assessment of English as a second language (ESL) students,” in Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022), 2022.

    [41] P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees,” Machine Learning, 2006, doi: 10.1007/s10994-006-6226-1.

    [42] J. H. Friedman, “Greedy Function Approximation: A Gradient Boosting Machine,” The Annals of Statistics, vol. 29, no. 5, 2001, doi: 10.1214/aos/1013203451.

    [43] J. H. Friedman, “Stochastic Gradient Boosting,” Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367–378, 2002.

    [44] C.-N. Hsieh, K. Zechner, and X. Xi, “Features measuring fluency and pronunciation,” in Automated Speaking Assessment: Using Language Technologies to Score Spontaneous Speech, K. Zechner and K. Evanini, Eds., 2019, pp. 101–122.

    [45] S.-Y. Yoon, X. Lu, and K. Zechner, “Features measuring vocabulary and grammar,” in Automated Speaking Assessment: Using Language Technologies to Score Spontaneous Speech, K. Zechner and K. Evanini, Eds., 2019, pp. 123–137.

    [46] C. Zhu, T. Kunihara, D. Saito, N. Minematsu, and N. Nakanishi, “Automatic Prediction of Intelligibility of Words and Phonemes Produced Orally by Japanese Learners of English,” in Proceedings of 2022 IEEE Spoken Language Technology Workshop (SLT), 2023.

    [47] S. M. Witt and S. J. Young, “Phone-level pronunciation scoring and assessment for interactive language learning,” Speech Communication, vol. 30, no. 2–3, pp. 95–108, 2000, doi: 10.1016/S0167-6393(99)00044-8.

    [48] Y. Wang, M. J. F. Gales, K. M. Knill, K. Kyriakopoulos, A. Malinin, R. C. van Dalen, and M. Rashid, “Towards automatic assessment of spontaneous spoken English,” Speech Communication, 2018, doi: 10.1016/j.specom.2018.09.002.

    [49] K. Zechner, X. Xi, and L. Chen, “Evaluating prosodic features for automated scoring of non-native read speech,” in Proceedings of 2011 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), 2011.

    [50] J. Zhu, C. Zhang, and D. Jurgens, “Phone-to-Audio Alignment without Text: A Semi-Supervised Approach,” in Proceedings of ICASSP, 2021.

    [51] B. W. Lee and J. Lee, “LFTK: handcrafted features in computational linguistics,” in Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023), 2023.

    [52] S. Uchida and M. Negishi, “Assigning CEFR-J levels to English texts based on textual features,” in Proceedings of the 4th Asia Pacific Corpus Linguistics Conference (APCLC 2018), 2018.

    [53] S. Uchida, “A CEFR-based Textbook Corpus: An attempt to reveal linguistic features of CEFR levels (original in Japanese),” English Corpus Studies, vol. 22, pp. 87–99, 2015.

    [54] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.

    [55] H.-S. Tsai, H.-J. Chang, W.-C. Huang, Z. Huang, K. Lakhotia, S.-W. Yang, S. Dong, A. Liu, C.-I. Lai, J. Shi, X. Chang, P. Hall, H.-J. Chen, S.-W. Li, S. Watanabe, A. Mohamed, and H.-Y. Lee, “SUPERB-SG: Enhanced speech processing universal performance benchmark for semantic and generative capabilities,” in Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022.

    [56] T.-I. Wu, T.-H. Lo, F.-A. Chao, Y.-T. Sung, and B. Chen, “Effective neural modeling leveraging readability features for automated essay scoring,” in Proceedings of the 9th Workshop on Speech and Language Technology in Education (SLaTE), 2023.

    [57] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019, pp. 4171–4186, doi: 10.18653/v1/N19-1423.

    [58] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush, “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (EMNLP), 2020, pp. 38–45, doi: 10.18653/v1/2020.emnlp-demos.6.

    [59] J. Snell, K. Swersky, and R. S. Zemel, “Prototypical networks for few-shot learning,” in Proceedings of NeurIPS, 2017.

    [60] S. Sun, Q. Sun, K. Zhou, and T. Lv, “Hierarchical attention prototypical networks for few-shot text classification,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, doi: 10.18653/v1/D19-1045.

    [61] Y. Arase, S. Uchida, and T. Kajiwara, “CEFR-Based Sentence Difficulty Annotation and Assessment,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022.

    [62] J. S. Chung, J. Huh, S. Mun, M. Lee, H.-S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee, and I. Han, “In Defence of Metric Learning for Speaker Recognition,” in Proceedings of Interspeech, 2020.

    [63] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proceedings of International Conference on Learning Representations (ICLR), 2019.

    [64] W. Wang and A. Nakajima, “Detecting Syllable-Level Pronunciation Stress with A Self-Attention Model,” 2023. [Online]. Available: arXiv:2311.00301.
