Graduate student: 陳俊諭
Thesis title: A Study on Integrating Document Relatedness and Query Clarity Information for Improved Pseudo-Relevance Feedback
Original title (Chinese): 融入文件關聯與查詢清晰度資訊於虛擬關聯回饋之研究
Advisor: 陳柏琳
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2014
Academic year of graduation: 102
Language: Chinese
Number of pages: 70
Keywords (Chinese): 虛擬關聯回饋, 虛擬關聯文件選取, 馬可夫隨機漫步, 查詢清晰度, 查詢模型
Keywords (English): pseudo-relevance feedback, pseudo-relevant document selection, Markov random walk, query clarity, query model
Thesis type: Academic thesis
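  • Pseudo-relevance feedback selects effective pseudo-relevant documents for query reformulation and is widely used in information retrieval systems. Most information retrieval systems choose the pseudo-relevant documents for query reformulation simply by the query-document relevance scores obtained from the initial retrieval. This thesis instead selects pseudo-relevant documents by considering both the relatedness among documents and the relevance between the query and each document. The notion of a Markov random walk lets us estimate these relationships and find better pseudo-relevant documents. Once the documents have been selected, and building on the query models used in information retrieval, we also investigate how to combine the original query model effectively with the new query model estimated from the pseudo-relevant documents, with the combination weight determined by the so-called query clarity. The experiments are conducted mainly on the Topic Detection and Tracking collections (TDT-2 and TDT-3) and the Wall Street Journal (WSJ) corpus, and the results show that the proposed improvements to pseudo-relevance feedback raise retrieval performance.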

    Pseudo-relevant document selection plays a central role in query reformulation with pseudo-relevance feedback (PRF) in an information retrieval (IR) system. Most conventional IR systems select pseudo-relevant documents for query reformulation based solely on the query-document relevance scores returned by the initial round of retrieval. In this thesis, we propose a method for pseudo-relevant document selection that considers not only the query-document relevance scores but also the relatedness cues among documents. To this end, we adopt and formalize the notion of Markov random walk (MRW) to glean the relatedness cues among documents, which are then used in concert with the query-document relevance scores to select representative documents for PRF. Furthermore, within the language modeling (LM) framework for IR, we investigate how to combine the original query model with the new query model estimated from the selected pseudo-relevant documents, using the so-called query clarity measure to set the combination weight. Experiments conducted on the TDT (Topic Detection and Tracking) collections and the WSJ (Wall Street Journal) collection indicate that the proposed methods improve retrieval performance.
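    The two ideas summarized in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the damping factor, the similarity matrix, and in particular the mapping from clarity to an interpolation weight (`beta = c / (c + scale)`) are assumptions made only for illustration. The random walk uses the initial query-document scores as its restart distribution, and clarity is computed as the KL divergence (in bits) between a query model and the collection model.

```python
import math


def random_walk_scores(sim, prior, damping=0.85, iters=100, tol=1e-10):
    """Score documents with a Markov random walk (illustrative sketch).

    sim[i][j] is a non-negative similarity between documents i and j, and
    prior[i] is the initial query-document relevance score. Both inputs and
    the damping value are hypothetical choices, not taken from the thesis.
    """
    n = len(sim)
    # Row-normalise similarities into transition probabilities. A document
    # with no neighbours jumps uniformly to any document.
    trans = []
    for row in sim:
        total = sum(row)
        trans.append([v / total for v in row] if total > 0 else [1.0 / n] * n)
    z = sum(prior)
    restart = [p / z for p in prior]  # query-document scores as restart mass
    scores = restart[:]
    for _ in range(iters):
        new = [
            (1 - damping) * restart[j]
            + damping * sum(scores[i] * trans[i][j] for i in range(n))
            for j in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new, scores))
        scores = new
        if delta < tol:
            break
    return scores


def select_top_k(scores, k):
    """Return the indices of the k highest-scoring documents."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


def clarity(model, collection):
    """KL divergence in bits between a query model and the collection model.

    Assumes collection[w] > 0 for every word w in the query model.
    """
    return sum(p * math.log2(p / collection[w]) for w, p in model.items() if p > 0)


def combine(original, feedback, collection, scale=1.0):
    """Interpolate the original and feedback query models.

    Assumption: a clearer feedback model receives more weight through
    beta = c / (c + scale). The thesis may map clarity to a weight differently.
    """
    c = clarity(feedback, collection)
    beta = c / (c + scale)
    vocab = set(original) | set(feedback)
    return {
        w: (1 - beta) * original.get(w, 0.0) + beta * feedback.get(w, 0.0)
        for w in vocab
    }


if __name__ == "__main__":
    # Documents 0-2 are mutually similar; document 3 is isolated even though
    # it has the highest initial retrieval score.
    sim = [[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
    prior = [0.3, 0.2, 0.1, 0.4]
    print(select_top_k(random_walk_scores(sim, prior), 2))
    collection = {"a": 0.5, "b": 0.5}
    print(combine({"a": 0.5, "b": 0.5}, {"a": 1.0}, collection))
```

    In the toy data, the isolated document starts with the highest retrieval score but receives little support from the walk, so the mutually related documents are preferred for feedback. This is the intuition behind combining document relatedness with query-document relevance.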

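    1. Introduction
        1.1. Research Motivation
        1.2. Related Work
        1.3. Contributions
        1.4. Thesis Organization
    2. Language-Model-Based Information Retrieval Framework
        2.1. Unigram Language Model
        2.2. Kullback-Leibler Divergence Measure
    3. Overview of Query Models
        3.1. Relevance Model (RM)
        3.2. Simple Mixture Model (SMM)
        3.3. Query-Regularized Mixture Model (RMM)
        3.4. Topic Relevance Model (TRM)
    4. Overview of Pseudo-Relevant Document Selection Methods
        4.1. Active Relevance, Diversity, and Density Learning (Active-RDD)
        4.2. Advanced Active Relevance, Non-Relevance, Diversity, and Density Learning (Advanced Active-RDD)
        4.3. Resampling with Overlapping Clusters
    5. Pseudo-Relevant Document Selection Incorporating Document Relatedness
        5.1. A Discussion of Document Relatedness
        5.2. Estimating Inter-Document Relatedness with Markov Random Walk
    6. Experimental Design and Results
        6.1. Experimental Corpora
        6.2. Results for Baseline Query Models
        6.3. Results for Pseudo-Relevant Document Selection Methods
        6.4. Novel Query Models and the Use of Additional Information
            This thesis examines many of the methods used in pseudo-relevance feedback, several of which leave room for improvement or discussion. This section therefore also explores improvements to steps other than pseudo-relevant document selection and evaluates them experimentally.
            6.4.1. Improvements to SMM and Experimental Results
            6.4.2. Improvements to RMM and Experimental Results
            6.4.3. Query Model Weight Adjustment Based on Query Clarity
    7. Conclusions and Future Work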

