Graduate student: 黃幀祥
Thesis title: 使用潛在語意分析建構文本分類模型- 以國小社會科課文為例
Text Classification Model Based on Latent Semantic Analysis: A Case Study of Textbook for Social Studies in Elementary School
Advisors: 張國恩 (Chang, Kuo-En); 宋曜廷 (Sung, Yao-Ting); 張道行 (Chang, Tao-Hsing)
Degree: Master
Department: Graduate Institute of Information and Computer Education
Year of publication: 2011
Academic year of graduation: 100 (ROC calendar)
Language: Chinese
Number of pages: 72
Keywords (Chinese): 潛在語意分析, 可讀性, 文本分類
Keywords (English): Latent Semantic Analysis, Readability, Text Classification
Thesis type: Academic thesis
  • With the spread of the internet and personal computers, students routinely search online for information. Search results, however, are usually large and span many topics, so students spend considerable time reviewing them repeatedly before they find articles that match their ability and goals. Readability-based text classification identifies the difficulty level of a text, which lets students choose texts suited to their level and reduces the time they spend searching. Most previous readability studies feed surface features of a text into a linear formula to obtain a difficulty score, but in Chinese, semantic features matter more than surface features. This study therefore uses Latent Semantic Analysis (LSA) to extract the semantic features of a text and classifies its readability on that basis. The data are passages from elementary school social studies textbooks. Exploiting the fact that each semester covers different themes, we use LSA to build a semantic space model for social studies, and then use that model to assign social studies articles of unknown level to the semester level they belong to.
    With elementary school social studies texts classified by semester, the model reaches an accuracy of 79.06%, a good classification result. LSA offers readability research another perspective: it takes the meaning a text conveys as the basis for analysis, which makes it especially suitable for Chinese, where semantics carries more weight.
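    The abstract describes the pipeline only at a high level, so the following is a minimal sketch of one way such an LSA-based level classifier could work, not the thesis's actual procedure. It assumes raw term counts, whitespace tokenisation (Chinese text would first need word segmentation), a rank-2 semantic space, and nearest-centroid classification by cosine similarity. The toy English passages and the labels `grade3` and `grade6` are hypothetical, and a plain power iteration stands in for a library SVD routine so the example has no external dependencies.

```python
import math
from collections import Counter

def term_doc_matrix(docs):
    """Raw-count term-by-document matrix from whitespace-tokenised documents."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = [[0.0] * len(docs) for _ in vocab]
    for j, d in enumerate(docs):
        for w, c in Counter(d.split()).items():
            A[index[w]][j] = float(c)
    return A, index

def top_eigenpairs(G, k, iters=500):
    """Top-k eigenpairs of a symmetric PSD matrix: power iteration with deflation."""
    n = len(G)
    G = [row[:] for row in G]  # work on a copy; deflation modifies it
    pairs = []
    for _ in range(k):
        v = [1.0 + 0.01 * i for i in range(n)]  # fixed start vector for reproducibility
        for _step in range(iters):
            w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        vv = sum(x * x for x in v)
        Gv = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * Gv[i] for i in range(n)) / vv  # Rayleigh quotient
        v = [x / math.sqrt(vv) for x in v]
        pairs.append((lam, v))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                G[i][j] -= lam * v[i] * v[j]
    return pairs

def build_lsa(docs, k):
    """Return the k left singular vectors U_k of the term-document matrix."""
    A, index = term_doc_matrix(docs)
    m, n = len(A), len(docs)
    # Eigenvectors of the Gram matrix A^T A are the right singular vectors of A.
    G = [[sum(A[t][i] * A[t][j] for t in range(m)) for j in range(n)] for i in range(n)]
    U = []
    for lam, v in top_eigenpairs(G, k):
        if lam > 1e-12:
            sigma = math.sqrt(lam)
            U.append([sum(A[t][j] * v[j] for j in range(n)) / sigma for t in range(m)])
    return U, index

def fold_in(text, U, index):
    """Project a text into the semantic space (d^T U_k); unknown words are ignored."""
    counts = Counter(w for w in text.split() if w in index)
    return [sum(u[index[w]] * c for w, c in counts.items()) for u in U]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def classify(text, U, index, centroids):
    """Assign the level whose centroid is most similar to the projected text."""
    q = fold_in(text, U, index)
    return max(centroids, key=lambda level: cosine(q, centroids[level]))

# Hypothetical toy training passages, each labelled with a level.
train = [
    ("family school neighborhood home", "grade3"),
    ("school teacher classmates family", "grade3"),
    ("government economy trade globalization", "grade6"),
    ("economy trade law government", "grade6"),
]
U, index = build_lsa([text for text, _ in train], k=2)
centroids = {}
for level in sorted({label for _, label in train}):
    vecs = [fold_in(text, U, index) for text, label in train if label == level]
    centroids[level] = [sum(col) / len(vecs) for col in zip(*vecs)]

print(classify("family teacher home", U, index, centroids))
```

    In practice a library SVD and a term-weighting scheme would likely replace the hand-written pieces; the sketch only illustrates how an unseen text can be projected into a learned semantic space and matched to the closest level.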

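    List of Tables
    List of Figures
    Chapter 1 Introduction
        Section 1 Research Background and Motivation
        Section 2 Research Objectives
        Section 3 Research Limitations
    Chapter 2 Literature Review
        Section 1 Readability
        Section 2 Research on Classification Problems
        Section 3 Latent Semantic Analysis
        Section 4 Synthesis
    Chapter 3 Research Method
        Section 1 Data Preprocessing Stage
        Section 2 Training and Testing Stage
        Section 3 Method for Building Key Concept Vocabulary for Each Semester
    Chapter 4 Experimental Design
        Section 1 Experimental Tools
        Section 2 Experimental Data
        Section 3 Experimental Procedure
        Section 4 Experimental Results
        Section 5 Discussion of Experimental Results
    Chapter 5 Conclusion and Future Work
        Section 1 Conclusion
        Section 2 Future Work
    References
        I. Chinese
        II. English
    Appendix 1 Randomly Selected Texts for Each Fold
    Appendix 2 Key Concept Vocabulary for Each Semester
    Appendix 3 Grade 1-9 Curriculum Competence Indicators for Elementary Social Studies
    Appendix 4 Unit Titles by Semester in the Three Textbook Editions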

