Graduate Student: 郭承暘 (Kuo, Cheng-Yang)
Thesis Title: 中英文文句相似度比對效果評估–以學位論文中英文摘要為例 / Evaluating Chinese-English Sentence Similarity Methods: A Case Study of Abstracts in Academic Theses
Advisor: 曾元顯 (Tseng, Yuen-Hsien)
Oral Examination Committee: 曾元顯 (Tseng, Yuen-Hsien); 林頌堅 (Lin, Sung-Chien); 陳舜德 (Chen, Shun-Der)
Oral Examination Date: 2025/01/06
Degree: Master
Department: Graduate Institute of Library and Information Studies
Year of Publication: 2025
Graduation Academic Year: 113 (ROC calendar)
Language: Chinese
Pages: 58
Chinese Keywords: 文本相似性計算, Sentence Transformer, 跨語言文本比對
English Keywords: Text Similarity Computation, Sentence Transformer, Cross-Lingual Text Comparison
DOI: http://doi.org/10.6345/NTNU202500465
Document Type: Academic thesis
Abstract:
This study combines a Sentence Transformer model with cosine similarity to detect similarity in Chinese-English academic texts, and compares the consistency between manual evaluation and GPT-4o mini model scoring.
The study used the National Digital Library of Theses and Dissertations in Taiwan (NDLTD) as its corpus, collecting 900 theses with complete Chinese and English abstracts and extracting 7,478 Chinese sentences and 11,047 English sentences for analysis. A Sentence Transformer model converted the sentences into vector representations, and cosine similarity was computed to match Chinese sentences with their English counterparts. Manual evaluation results were then compared against GPT-4o mini scores to assess the model's accuracy and consistency.
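This record does not reproduce the thesis's code, so the following is only a minimal sketch of the described matching step. It assumes the sentence-transformers library and an arbitrary multilingual checkpoint (paraphrase-multilingual-MiniLM-L12-v2 here; the model actually used is not named in this record):

```python
# Minimal sketch of the embedding + cosine-similarity matching step.
# Assumption: this multilingual checkpoint stands in for whichever
# Sentence Transformer model the thesis actually used.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

zh_sentences = ["本研究探討中英文句子的語義相似度。"]
en_sentences = [
    "This study investigates semantic similarity between Chinese and English sentences.",
    "The weather was unusually warm that year.",
]

# Encode both sides into the same multilingual vector space.
zh_emb = model.encode(zh_sentences, convert_to_tensor=True)
en_emb = model.encode(en_sentences, convert_to_tensor=True)

# Cosine-similarity matrix: rows are Chinese sentences, columns English ones.
cos_sim = util.cos_sim(zh_emb, en_emb)

# Match each Chinese sentence to its highest-scoring English counterpart.
for i, row in enumerate(cos_sim):
    j = int(row.argmax())
    print(f"{zh_sentences[i]} -> {en_sentences[j]} (cosine = {row[j].item():.3f})")
```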
The results show that without a threshold, manual evaluation judged 79.94% of sentence pairs highly similar, versus 69.67% for GPT-4o mini. After applying optimal thresholds of 0.75 (manual evaluation) and 0.79 (GPT-4o mini), the proportions fell to 76.8% and 63.9%, respectively. The overall correlation between GPT-4o mini and manual evaluation was moderate (Pearson correlation coefficient = 0.76, Spearman correlation coefficient = 0.69); when a degree of error tolerance was allowed, the Custom Weighted Kappa rose to 0.62, demonstrating the GPT-4o mini model's potential to approximate manual evaluation.
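The exact formulation of the Custom Weighted Kappa is not given in this record; in the sketch below, which uses hypothetical 1-5 ratings, a linearly weighted Cohen's kappa stands in as one tolerance-aware agreement measure, next to the Pearson and Spearman correlations and the thresholding step:

```python
# Sketch of the reported agreement metrics on hypothetical ratings.
# The thesis's "Custom Weighted Kappa" weighting is not specified here,
# so sklearn's linearly weighted Cohen's kappa is used as a stand-in.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 similarity ratings for the same sentence pairs.
human = np.array([5, 4, 5, 3, 2, 5, 4, 1, 3, 5])
gpt   = np.array([5, 5, 4, 3, 1, 4, 4, 2, 3, 5])

print("Pearson r:   ", pearsonr(human, gpt)[0])
print("Spearman rho:", spearmanr(human, gpt)[0])
# Linear weights penalize a 4-vs-5 disagreement less than 1-vs-5,
# which is one way to build in the "error tolerance" mentioned above.
print("Weighted kappa:", cohen_kappa_score(human, gpt, weights="linear"))

# Thresholding cosine scores: share of pairs counted as highly similar.
cosines = np.array([0.82, 0.77, 0.91, 0.58, 0.74, 0.88])  # hypothetical values
print("Share >= 0.75:", (cosines >= 0.75).mean())
```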
This study demonstrates that a Sentence Transformer model combined with cosine similarity can effectively detect semantic similarity in Chinese-English academic texts, and that GPT-4o mini, as an auxiliary tool, offers the potential to improve evaluation efficiency. Future research should broaden dataset diversity, strengthen models' semantic extraction capabilities, and explore methods that combine human and machine evaluation to further improve the accuracy and comprehensiveness of academic text similarity detection.
References:

Agirre, E., Banea, C., Cer, D., Diab, M., Gonzalez-Agirre, A., Mihalcea, R., ... & Wiebe, J. (2016). SemEval-2016 Task 1: Semantic textual similarity, monolingual and cross-lingual evaluation. Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), 497-511.
Artetxe, M., & Schwenk, H. (2019). Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7, 597–610. https://doi.org/10.1162/tacl_a_00288
Conneau, A., Kiela, D., Schwenk, H., Barrault, L., & Bordes, A. (2017). Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 670-680). Association for Computational Linguistics.
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171-4186. https://doi.org/10.18653/v1/N19-1423
Hershcovich, D., Frank, S., Lent, H., de Lhoneux, M., Abdou, M., Brandl, S., ... & Søgaard, A. (2022). Challenges and strategies in cross-cultural NLP. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). https://aclanthology.org/2022.acl-long.482
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
Hosny, M. I., & Shameem, F. (2014). Attitude of students towards cheating and plagiarism: University case study. Journal of Applied Sciences, 14(8), 748-757. https://doi.org/10.3923/jas.2014.748.757
Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053. https://doi.org/10.48550/arXiv.1405.4053
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
Nguyen, M. V., Lai, V. D., Pouran Ben Veyseh, A., & Nguyen, T. H. (2021). Trankit: A light-weight transformer-based toolkit for multilingual natural language processing. arXiv preprint arXiv:2101.03289. https://arxiv.org/abs/2101.03289
OpenAI. (2024). GPT-4o mini: Advancing cost-efficient intelligence. Retrieved from https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/
Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532-1543. https://doi.org/10.3115/v1/D14-1162
Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (pp. 3980-3990). https://doi.org/10.18653/v1/D19-1410
Roig, M. (2015). Avoiding plagiarism, self-plagiarism, and other questionable writing practices: A guide to ethical writing. Retrieved from https://ori.hhs.gov/sites/default/files/plagiarism.pdf
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215.
Tian, L., Wong, D. F., Chao, L. S., Quaresma, P., Oliveira, F., & Yi, L. (2014). UM-Corpus: A large English-Chinese parallel corpus for statistical machine translation. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC-2014), 1837-1842.
Tiedemann, J. (2012). Parallel data, tools and interfaces in OPUS. Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012), 2216-2222. Retrieved from http://www.lrec-conf.org/proceedings/lrec2012/pdf/230_Paper.pdf
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762.
Feng, F., Yang, Y., Cer, D., Arivazhagan, N., & Wang, W. (2020). Language-agnostic BERT sentence embedding. arXiv preprint arXiv:2007.01852. https://arxiv.org/abs/2007.01852
Weber-Wulff, D. (2014). False Feathers: A Perspective on Academic Plagiarism. Springer. https://doi.org/10.1007/978-3-642-39961-9
Williams, A., Nangia, N., & Bowman, S. (2018). A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1112-1122).