Author: | 劉憶年 Liu, Yi-Nian |
---|---|
Thesis Title: | 應用可讀性預測於中小學國語文教科書及優良課外讀物分類之研究 A Study of Readability Prediction on Elementary and Secondary Chinese Textbook and Excellent Extracurricular Reading Materials Classification |
Advisor: | 陳柏琳 Chen, Berlin |
Degree: | 碩士 Master |
Department: | 資訊工程學系 Department of Computer Science and Information Engineering |
Thesis Publication Year: | 2016 |
Academic Year: | 104 |
Language: | Chinese |
Number of pages: | 50 |
Keywords (in Chinese): | 可讀性 、文本特徵 、逐步迴歸 、支持向量機 |
Keywords (in English): | Readability, Textual Features, Stepwise Regression, Support Vector Machine |
DOI URL: | https://doi.org/10.6345/NTNU202204961 |
Thesis Type: | Academic thesis/dissertation |
Readability is the degree to which readers can understand a given text: the higher the readability of a document, the more easily it is understood. Readability depends on many factors, such as document length, word difficulty, syntactic structure, and whether the content matches the reader's prior knowledge. Shallow surface linguistic features alone cannot adequately capture these factors. Building on previous research, this thesis examines a broader range of features, including syntactic analysis, part-of-speech (POS) tags, word embeddings, semantic information, and well-written (writing quality) features, and analyzes how each type of feature relates to readability. The experimental data consist of two parts. The first is Chinese-language textbooks for elementary and junior high schools in Taiwan, drawn from the approved editions for grades 1 to 9 (18 volumes in total) published by Taiwan's three major publishers for the 2009 academic year (ROC year 98). The second is excellent extracurricular reading materials, drawn from the books selected in successive rounds of the Ministry of Culture's "Excellent Extracurricular Reading Materials for Elementary and Secondary School Students" program. Two readability prediction models, stepwise regression and support vector machine (SVM), are built and compared; the two are then combined to further improve prediction accuracy. Experimental results show that the proposed readability features yield consistently better performance than traditional surface linguistic features in the task of assessing text difficulty.