
Graduate student: 洪琴婷 (Chin-Ting Hung)
Thesis title: LSI-based Document Retrieval
Advisor: 邱貴發 (Chiou, Guey-Fa)
Degree: Master
Department: Graduate Institute of Information and Computer Education
Year of publication: 2002
Academic year of graduation: 90
Language: English
Number of pages: 105
Keywords: latent semantic indexing, information retrieval, singular value decomposition, relevance feedback
Thesis type: Academic thesis
  • Latent Semantic Indexing (LSI) is a retrieval technique that applies Singular Value Decomposition (SVD) to map each document vector into a lower-dimensional space, so that documents are matched by concept rather than by exact terms. LSI has been shown to outperform traditional lexical searching methods and to mitigate the problems of synonymy and polysemy. Our purposes were to construct an LSI model that facilitates the retrieval process and to propose potential uses of LSI in education.
    We used five test collections, two Chinese and three English, to verify our LSI model. The standard test collection, MED, was used to verify the correctness of our system, and the ERIC and English educational abstracts collections were used to test the feasibility of LSI on educational materials. In addition, two Chinese test collections were used to examine the usability of LSI on Chinese documents. Our major concerns in the tests were term weighting, stemming, the number of reduced dimensions, and relevance feedback.
    Results showed that the LSI system model worked well not only for English documents but also for character-based Chinese documents. The LSI method effectively grouped semantically relevant documents. The better-performing weighting types were log idf, log entropy, log gfidf, tf idf, and tf gfidf. Results also indicated a significant improvement in retrieval after stemming. Relevance feedback worked well across different weighting ratios. The best dimension value for the ERIC documents was around 50 or 60. In conclusion, we believe that LSI is a suitable system model for retrieving relevant documents.
    Keywords: latent semantic indexing (LSI), information retrieval (IR), singular value decomposition (SVD), relevance feedback.
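
The pipeline summarized in the abstract (term weighting, SVD, reduction to a few dimensions, query matching) can be illustrated with a minimal sketch. This is not the thesis system: the toy term-by-document matrix, the function names, and the choice of k = 2 are assumptions made for illustration, and the log-entropy formula is the commonly used definition, which may differ in detail from the variant tested in the thesis.

```python
import numpy as np

# Hypothetical toy collection (not from the thesis): rows are terms, columns are documents.
terms = ["car", "automobile", "engine", "wheel", "flower", "petal", "garden"]
counts = np.array([
    [1, 0, 0, 0, 0],  # car
    [0, 1, 0, 0, 0],  # automobile
    [1, 1, 0, 0, 0],  # engine
    [1, 1, 0, 0, 0],  # wheel
    [0, 0, 1, 1, 1],  # flower
    [0, 0, 1, 1, 0],  # petal
    [0, 0, 1, 0, 1],  # garden
], dtype=float)


def log_entropy_weight(tf):
    """Weight a term-by-document count matrix with a log-entropy scheme."""
    n_docs = tf.shape[1]
    local = np.log1p(tf)                                   # local weight: log(1 + tf_ij)
    gf = tf.sum(axis=1, keepdims=True)                     # global frequency of each term
    p = np.divide(tf, gf, out=np.zeros_like(tf), where=gf > 0)
    safe_p = np.where(p > 0, p, 1.0)                       # avoids log(0); such cells add 0
    entropy = (p * np.log(safe_p)).sum(axis=1, keepdims=True)
    global_w = 1.0 + entropy / np.log(n_docs)              # equals 1 for single-document terms
    return local * global_w


def build_lsi(weighted, k):
    """Truncated SVD: keep only the k largest singular triplets."""
    u, s, vt = np.linalg.svd(weighted, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :].T                    # U_k, sigma_k, V_k (one row per document)


def project_query(q, u_k, s_k):
    """Fold a term-space query into the k-dimensional space: q^T U_k S_k^{-1}."""
    return (q @ u_k) / s_k


def cosine_scores(q_hat, v_k):
    """Cosine similarity between the projected query and every document row."""
    norms = np.linalg.norm(v_k, axis=1) * np.linalg.norm(q_hat)
    return (v_k @ q_hat) / np.maximum(norms, 1e-12)


u_k, s_k, v_k = build_lsi(log_entropy_weight(counts), k=2)

query = np.zeros(len(terms))
query[terms.index("car")] = 1.0                            # one-word query: "car"

lexical_overlap = query @ counts                           # plain term matching per document
scores = cosine_scores(project_query(query, u_k, s_k), v_k)
print("lexical overlap:", lexical_overlap)
print("LSI cosine scores:", np.round(scores, 3))
```

In this toy collection, document 1 contains "automobile" but not "car", so plain term matching gives it no overlap with the query. Its LSI score should nevertheless be close to 1, because both vehicle documents share "engine" and "wheel" and collapse onto the same reduced dimension, while the three gardening documents should score near 0.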


    Contents
    1. Introduction
        1.1 Problem Statement
        1.2 Goals
        1.3 Overview of Latent Semantic Indexing
    2. Related work
        2.1 Retrieval Techniques
            2.1.1 Information Retrieval and Information Filtering
            2.1.2 Lexical Pattern Matching Model
            2.1.3 Boolean Model
            2.1.4 Probabilistic Model
            2.1.5 Basic Vector Model
        2.2 Vector Space Model
        2.3 Latent Semantic Indexing
            2.3.1 The Fundamental Idea
            2.3.2 Advantages of Latent Semantic Indexing
            2.3.3 Representation of Term-Document Matrix
            2.3.4 Singular Value Decomposition
            2.3.5 Query Projection
            2.3.6 Updating
            2.3.7 Downdating
            2.3.8 Relevance Feedback
        2.4 Potential Uses of LSI for Teaching and Learning
            2.4.1 Optimal Text for Learning
            2.4.2 Coherence and Comprehensibility Measurement
            2.4.3 Connecting Students with Each Other and with Relevant Experts
            2.4.4 Recommendation System
            2.4.5 Cross Language Retrieval
            2.4.6 Automatic Writing Assessment
            2.4.7 Constructing a Summary
            2.4.8 Portfolio Assessment
        2.5 LSI-based IR in Chinese Language
        2.6 Evaluation of LSI-based IR
    3. The System
        3.1 System Environments
        3.2 System Architecture and Processes
            3.2.1 Basic Scheme of the System
            3.2.2 The System Architecture of Retrieving Information
            3.2.3 Indexing and Clustering Processes
        3.3 System Components
    4. Results and Discussion
        4.1 Test Collections
        4.2 Performance Evaluation in English Collections
            4.2.1 Different Weighting Types for the original MED document
            4.2.2 Different Weighting Types for the stemmed MED document
            4.2.3 Comparing original and stemmed MED in Different Weighting Types
            4.2.4 Different Weighting Types for the original ERIC document
            4.2.5 Different Weighting Types for the stemmed ERIC document
            4.2.6 Comparing original and stemmed ERIC in Different Weighting Types
            4.2.7 Reduction of Dimensions in ERIC document
            4.2.8 English Educational Abstracts
            4.2.9 Original and Stemmed English Educational Abstracts
        4.3 Performance Evaluation in Chinese Collections
            4.3.1 Different Weighting Types for the Chinese Document about Health
            4.3.2 Different Weighting Types for the Chinese Document about environment education
        4.4 Experiments on Relevance Feedback
    5. Conclusions and Future Studies
        5.1 Conclusions
        5.2 Future Studies
    References
    Appendix A: Samples of Test Collections

