簡易檢索 / 詳目顯示

研究生: 彭紹峻
Peng, Shao-Chun
論文名稱: 以MapReduce分散式架構有效率進行相似資料配對搜尋之研究
MapReduce Approach for Efficient Computation of All-Pairs Similarity Search
指導教授: 柯佳伶
Koh, Jia-Ling
學位類別: 碩士
Master
系所名稱: 資訊工程學系
Department of Computer Science and Information Engineering
論文出版年: 2015
畢業學年度: 103
語文別: 中文
論文頁數: 67
中文關鍵詞: 相似資料配對問題分群篩除方法MapReduce演算法
英文關鍵詞: All Pair Similarity Search(APSS), partition based data filtering, MapReduce algorithm
論文種類: 學術論文
相關次數: 點閱:54下載:7
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 為了有效率計算相似資料配對搜尋問題,本論文提供一個有效篩除不相似配對的方法,並設計成對應的MapReduce平行處理架構。我們提出以最大值出現維度對資料預先分群,並設計改變分群數的方式及平衡各群間資料個數的分群調整方法。根據分群結果,將相似配對計算工作分成群內相似配對以及跨群相似配對;群內相似配對運用反向串列索引方式快速計算,在跨群配對方面,本研究先以計算群代表向量的方式對進行群配對的篩除,計算可能產生相似配對的群配對,再對群配對中資料進行候選配對的組合。此外本研究也運用MapReduce平行處理架構,以平行處理上述各步驟所需執行工作。實驗結果顯示此方法適合採用MapReduce平行處理架構,可比集中式處理有效減少相似資料配對問題的回覆時間。

    For solving the All Pair Similarity Search (APSS) problem efficiently, this thesis aims to provides an effective approach to filter non-similar pairs. Besides, the MapReduce version of the approach is provided to solve the problem in parallel. At first, for each data point, the dimension with the maximum value is used to decide the corresponding group of data partition. An adjusting method is designed to further balance the number of elements in each data group. The similar pairs could be inter-group similar pairs or intra-group similar pairs. For finding the intra-group similar pairs, we apply the inverted list index to improve the efficiency of computing the similarity of data pairs in each group. In addition, for speeding up the computation of finding inter-group similar pairs, the leader-vector is used to represent each group for estimating the upper bound of similarity between each group pair. Only the pairs of groups, whose upper bounds of similarity are larger than the given similarity threshold, need to exactly compute similarity of the inter-group data pairs. Finally, we design the corresponding MapReduce algorithm to perform the proposed approach in parallel. The experimental results show that the proposed partition-based method can fit into the MapReduce programing scheme properly. Moreover, the proposed MapReduce algorithm can effectively reduce the response time of solving the APSS problem than the centralized approach.

    摘要 I Abstract II 誌謝 III 目錄 IV 圖目錄 VI 表目錄 VII 第一章 緒論 1 1.1 研究動機 1 1.2 研究目的 2 1.3論文做法 4 1.4論文架構 5 第二章 文獻探討 6 2.1 運用篩減策略之技術研究 6 2.2 對相似資料進行分群之研究技術 11 2.3 以MapReduce平行架構處理資料相似配對問題 18 第三章 問題定義與方法 21 3.1 問題定義 21 3.2 相似資料配對搜尋處理步驟 21 3.3 系統架構 22 3.4系統方法 24 第四章 結合MapReduce之相似資料配對 37 4.1 MapReduce架構 37 4.2 相似資料配對問題結合MapReduce架構之探討 37 4.3 群內相似配對計算步驟 38 4.4 群候選配對選取步驟 43 4.5 跨群候選配對相似度計算步驟 44 第五章 實驗 48 5.1 實驗及資料集實驗環境 48 5.2 實驗比較方法介紹 49 5.3 實驗與討論 50 5.4實驗結果總結 63 第六章 結論及未來研究方向 65 6.1 結論 65 6.2未來方向 65 參考文獻 66

    [1] M. Alabduljalil, X. Tang and T.Yang. Optimizing Parallel Algorithms for All Pairs Similarity Search. WSDM, 2013.
    [2] D. C. Anastasiu and G. Karypis. L2AP: Fast Cosine Similarity Search With Prefix L-2 Norm Bounds. ICDE, 2014.
    [3] A. Arasu, V. Ganti and R. Kaushik. Efficient Exact Set-Similarity Joins. VLDB, 2006.
    [4] A. Awekar, N. F. Samatova1 and P. Breimyer. Incremental All Pairs Similarity Search for Varying Similarity Thresholds with Reduced I/O Overhead. 3rd SNA-KDD workshop, 2009.
    [5] R. J. Bayardo, Y. Ma and R. Srikant. Scaling Up All Pairs Similarity Search. WWW 2007.
    [6] S. Chaudhuri, V. Ganti and R. Kaushik. A Primitive Operator for Similarity Joins in Data Cleaning. ICDE, 2006.
    [7] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI, 2004.
    [8] G. D. Francisci, C. Lucchese and R. Baraglia. Scaling Out All Pairs Similarity Search with MapReduce. LSDS-IR, 2010.
    [9] A. Gionis and P. Indyky. Similarity Search in High Dimensions via Hashing. 25th VLDB Conference, 1999.
    [10] J. Lin. Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce. SIGIR, 2009.
    [11] A. Metwally and C. Faloutsos. V-SMART-Join: A Scalable MapReduce Framework for All-Pair Similarity Joins of Multisets and Vectors. VLDB Endowment, 2012.
    [12] L. A. Ribeiro and T. H¨arder. Efficient Set Similarity Joins Using Min-prefixes. ADBIS, 2009.
    [13] V. Satuluri. Bayesian Locality Sensitive Hashing for Fast Similarity Search. VLDB, 2012.
    [14] X. Tang, M. Alabduljalil, X. Jin, Tao Yang. Load Balancing for Partition-based Similarity Search. SIGIR, 2014
    [15] Y.Wang, A. Metwally and S. Parthasarathy. Scalable All-Pairs Similarity Search in Metric Spaces. KDD, 2013.
    [16] J. Wang, G. Li and Jianhua Feng Department. Can We Beat the Prefix Filtering An Adaptive Framework for Similarity Join and Search. SIGMOD, 2012.

    下載圖示
    QR CODE