簡易檢索 / 詳目顯示

研究生: 林真伊
Chen-Yi Lin
論文名稱: 以樹首遞迴投影探勘常見子樹結構之研究
A Tree-Projected Pattern Growing Method for Mining Frequent Sub-Trees Efficiently
指導教授: 柯佳伶
Koh, Jia-Ling
學位類別: 碩士
Master
系所名稱: 資訊教育研究所
Graduate Institute of Information and Computer Education
論文出版年: 2004
畢業學年度: 92
語文別: 中文
論文頁數: 95
中文關鍵詞: 具項目標示有序樹常見子樹封閉常見子樹
英文關鍵詞: rooted, labeled, and ordered trees, frequent sub-trees, closed sub-trees
論文種類: 學術論文
相關次數: 點閱:183下載:1
分享至:
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報
  • 本論文針對由具項目標示有序樹結構的資料庫中探勘常見子樹的問題,提出以樹首遞迴投影方式的方法,稱為Tree-Projection演算法,可在探勘過程避免產生不必要之候選子樹。此外,我們擴展Tree-Projection演算法,根據使用者所給定的初始最小支持度門檻值及K值,探勘出資料庫內前K個具最大支持度計數值的封閉常見子樹。依選取常見子樹為樹首做投影的探勘順序不同,及探勘過程中最小支持度門檻值提高與否,發展出DFS、DFS-TC、及SS-TC三種Tree-Projection演算法的變型演算法。由實驗結果顯示,在資料庫特定大小的範圍內,Tree-Projection演算法執行效率不易受最小支持度門檻值及資料產生參數值設定的影響,且效率皆優於之前已提出的常見子樹探勘演算法(TreeMiner演算法) 。此外,以支持度計數值由大而小選取常見子樹為樹首做投影,並結合在探勘過程中提高最小支持度門檻值的SS-TC法,在K遠小於所有封閉常見子樹的情況下,可大量減少資料庫投影次數,大幅提昇Tree-Projection演算法探勘前K個具最大支持度計數值的封閉常見子樹之執行效率。

    A new approach, called Tree-Projection algorithm, is proposed in this thesis for mining embedded frequent sub-trees in a forest of rooted, labeled, and ordered trees. This approach greatly reduces the efforts of generating candidate sub-trees. Moreover, Tree-Projection algorithm is extended to solve an additional problem: mining top-K frequent closed sub-trees, where K is the desired number of closed sub-trees to be mined specified by users. According to the various orders of performing tree-projec-tions and whether the dynamic minimum-support raising strategy is applied, three va-rieties of Tree-Projection algorithm are designed: they are named DFS, DFS-TC, and SS-TC, respectively. The experimental results show that, within a limited database size, the increment of runtime for Tree-Projection algorithm keeps stable under the various minimum-supports setting at run-time and arguments setting of data sets. In most cases, Tree-Projection algorithm outperforms the existing TreeMiner algorithm in terms of the execution time. Furthermore, the SS-TC variety of Tree-Projection al-gorithm adopts support descending order to perform tree-projections and raises minimum-support dynamically. When K is small proportional to the number of all closed sub-trees, the SS-TC variety could discover the top-K frequent closed sub-trees by pruning the unnecessary tree-projections dramatically and provide excellent mining efficiency.

    附表目錄…………………………………………………………………ii 附圖目錄………………………………………………………………...iv 第一章 緒論……………………………………………………………1 1.1 背景與研究動機…………………………………………………………...1 1.2 相關文獻探討……………………………………………………………...2 1.3 論文方法………………………………………………………………….10 1.4 論文架構………………………………………………………………….12 第二章 相關名詞與問題定義………………………………………..13 2.1 名詞定義………………………………………………………………….13 2.2 問題描述………………………………………………………………….20 第三章 樹首遞迴投影之探勘方法…………………………………..21 3.1 前序字串-層級表示法………………………………………………….21 3.2 樹首投影………………………………………………………………….24 3.3 以前序-層級表示法做樹首遞迴投影………………………………….27 3.4 砍除投影資料庫中非常見項目節點之規則…………………………….32 3.5 Tree-Projection演算法探勘流程………………………………………...35 第四章 前K個封閉常見子樹之探勘方法…………………………..51 4.1 判別子樹關係…………………………………………………………….51 4.2 前K個封閉常見子樹的探勘流程………………………………………57 第五章 演算法執行效率評估………………………………………..75 5.1 實驗環境說明…………………………………………………………….75 5.2 資料的產生方式………………………………………………………….75 5.3 實驗評估………………………………………………………………….77 第六章 結論及未來研究方向………………………………………..91 參考文獻………………………………………………………………..93 附表目錄 表3.1 範例樹T轉換成前序字串-層級表示法的結果……………………….…22 表3.2 以各項目節點為樹首對範例樹T做根節點投影的結果………………….28 表3.3 以樹首為(A)投影後第二層遞迴投影的結果……………………………...30 表3.4 以樹首為(A B)投影後第三層遞迴投影的結果…………………………...31 表3.5 資料庫DB經刪除非常見項目節點並轉為前序字串-層級表示法……..38 表3.6 DB以(A)為樹首的投影結果……………………………………………….39 表3.7 經砍除非常見項目後,DB以(A)為樹首的投影結果……………………...40 表3.8 DB以(A B)為樹首的投影結果…………………………………………….40 表3.9 經砍除非常見項目後,DB以(A B)為樹首的投影結果…………………..41 表3.10 DB以(A B C)為樹首的投影結果…………………………………………41 表3.11 DB以(A B C –1 –1 D)為樹首的投影結果………………………………..42 表3.12 DB經砍除非常見項目後,以(A B C –1 –1 D)為樹首的投影結果……..43 表3.13 DB以(A B C –1 –1 D F)為樹首的投影結果……………………………...43 表5.1 實驗資料庫的參數說明……………………………………………………76 表5.2 最小支持度門檻值對Tree-Projection及TreeMiner執行時間的比較….....78 表5.3 資料筆數對Tree-Projection及TreeMiner執行時間的比較………….……80 表5.4 最大分支數對Tree-Projection及TreeMiner執行時間的比較………….....82 表5.5 最大深度對Tree-Projection及TreeMiner執行時間的比較………….……83 表5.6 節點項目的種類個數對Tree-Projection及TreeMiner執行時間的比較.…84 表5.7 K值對DFS、DFS-TC及SS-TC法執行時間的比較………….……………86 表5.8 K值對DFS、DFS-TC及SS-TC法投影次數的比較………….……………86 表5.9 資料筆數對DFS、DFS-TC及SS-TC法執行時間的比較…………………88 表5.10 資料筆數對DFS、DFS-TC及SS-TC法投影次數的比較………………..88 表5.11 最小支持度門檻值對DFS、DFS-TC及SS-TC法執行時間的比較…..….90 表5.12 最小支持度門檻值對DFS、DFS-TC及SS-TC法投影次數的比較.…....90 附圖目錄 圖2.1 子樹定義範例………………………………………………………...…….15 圖2.2 支持度計數值計算範例……………………………………………...…….17 圖3.1 範例樹T…………………………………………………………………….22 圖3.2 樹首投影結果與投影基礎節點的關聯範例………………………………26 圖3.3  Xn’與Y的關係圖………………………………………………………….29 圖3.4 樹首為(A B)時的3-子樹………………………………………………...….31 圖3.5 樹首為(A B –1 C)時的4-子樹……………………………………………...31 圖3.6 刪除節點造成錯誤子樹的範例…………………………………………....33 圖3.7 Tree-Projection演算法的完整探勘流程圖………………………………..36 圖3.8 原始資料庫DB與其編碼方式…………………………………….……….37 圖3.9 DB中以(A)為樹首的常見2-子樹…………………………………...……..39 圖3.10 DB中以(A B)為樹首的常見3-子樹………………………………...…….41 圖3.11 DB中以(A B C)為樹首的常見4-子樹……………………………………42 圖3.12 DB以(A B C –1 –1 D)為樹首的常見5-子樹………………………….….42 圖3.13 DB以(A B –1 D)為樹首的投影結果……………………………………..44 圖3.14 DB以(A B –1 D F)為樹首的投影結果…………………………….……..45 圖3.15 DB以(A B –1 F)為樹首的投影結果……………………………………..45 圖3.16 DB以(A C)為樹首的投影結果……………………………………..…….46 圖3.17 DB以(A C –1 D)為樹首的投影結果……………………………………..47 圖3.18 DB以(A C –1 D F)為樹首的投影結果………………….………………..47 圖3.19 DB以(A C –1 F)為樹首的投影結果………………………………..…….48 圖3.20 DB以(A D)為樹首的投影結果…………………………………………...48 圖3.21 DB以(A D F)為樹首的投影結果……………………………………...….49 圖3.22 DB以(A F)為樹首的投影結果………………………………………..…..49 圖4.1 比對堆疊內所存放的資訊…………………………………………………53 圖4.2 子樹的判別範例…………………………………………………………....53 圖4.3 S1與T的子樹判別過程之堆疊內容……….………………………..…….54 圖4.4 S2與T的子樹判別過程之堆疊內容…………………………………..…..55 圖4.5 S3與T的子樹判別過程之堆疊內容………………………………………56 圖4.6 範例資料庫DB………………………………………………………….….58 圖4.7 完整探勘的常見子樹………………………………………………..……..58 圖4.8 深先搜尋列舉法的探勘流程………………………………………..……..64 圖4.9 前6個支持度計數值最大的封閉常見子樹……………………………….64 圖4.10 深先搜尋並提高最小支持度門檻值法的探勘流程……………………..69 圖4.11 排序搜尋並提高最小支持度門檻值法的探勘流程……………………..73 圖5.1 資料產生範例………………………………………………………………76 圖5.2 最小支持度門檻值對Tree-Projection及TreeMiner執行時間的影響…….78 圖5.3 資料筆數對Tree-Projection及TreeMiner執行時間的影響…………….....80 圖5.4 最大分支數的調整對Tree-Projection及TreeMiner執行時間的影響…….82 圖5.5 最大深度的調整對Tree-Projection及TreeMiner執行時間的影響…….....83 圖5.6 節點項目的種類個數之調整對Tree-Projection及TreeMiner執行時間的影響……………………………………………………………………………84 圖5.7 K值對DFS、DFS-TC及SS-TC法執行時間的影響……………………….86 圖5.8 資料筆數對DFS、DFS-TC及SS-TC法執行時間的影響…………………88 圖5.9 最小支持度門檻值對DFS、DFS-TC及SS-TC法執行時間的影響………90

    [1] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami, “Mining Association Rules between Sets of Items in Large Databases,” In Proceedings of the ACM SIGMOD Conference on Management of Data, 1993.
    [2] Rakesh Agrawal, and Ramakrishnan Srikant, “Fast Algorithms for Mining Asso-ciation Rules,” In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, pp. 487-499, 1994.
    [3] Jiawei Han, Jian Pei, and Yiwen Yin, “Mining Frequent Patterns without Candi-date Generation,” In Proceedings of the 2000 ACM-SIGMOD International Con-ference on Management of Data (SIGMOD'00), Dallas, TX, pp. 1-12, 2000.
    [4] Rakesh Agrawal, and Ramakrishnan Srikant, “Mining Sequential Patterns,” In Proceedings of the 11th International Conference on Data Engineering (ICDE’95), Taiwan, pp. 3-14, 1995.
    [5] Mohammed J. Zaki, “SPADE: An Efficient Algorithm for Mining Frequent Se-quences,” Machine Learning Journal, special issue on Unsupervised Learning (Doug Fisher, ed.), Vol. 42 No. 1/2, pp 31-60, 2001.
    [6] Jay Ayres, J. E. Gehrke, Tomi Yiu, and Jason Flannick, “Sequential PAttern Min-ing Using Bigmaps,” In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Can-ada, 2002.
    [7] Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umesh-war Dayal, and Mei-Chun Hsu, “PrefixSpan: Mining Sequential Patterns Effi-ciently by Prefix-Projected Pattern Growth,” In Proceedings of the 2001 Interna-tional Conference on Data Engineering (ICDE'01), Heidelberg, Germany, pp. 215-224, 2001.
    [8] Mohammed J. Zaki, “Efficiently Mining Frequent Trees in a Forest,” In Proceed-ings of the eighth ACM SIGKDD international conference on Knowledge discov-ery and data mining, pp. 71-80, 2002.
    [9] Tatsuya Asai, Kenji Abe, Shinji Kawasoe, Hiroki Arimura, Hiroshi Sakamoto, and Setsuo Arikawa, “Efficient Substructure Discovery from Large Semi-structured Data,” In Proceedings of the Second SIAM International Conference on Data Min-ing (SDM'02), pp.158-174, 2002.
    [10] K. Wang, and H.Q. Liu, “Discovering Typical Structures of Documents: A Road Map Approach,” In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 146-154, 1998.
    [11] J. Pei, J. Han, and R. Mao, “CLOSET: An Efficient Algorithm for Mining Fre-quent Closed Itemsets,” In Proceedings of the 2000 ACM-SIGMOD International Workshop on Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, 2000.
    [12] X. Yan, J. Han, and R. Afshar, “CloSpan: Mining Closed Sequential Patterns in Large Datasets,” In Proceedings of the 2003 SIAM International Conference on Data Mining (SDM'03), San Fransisco, CA, May 2003.
    [13] P. Tzvetkov, X. Yan, and J. Han, “TSP: Mining Top-K Closed Sequential Pat-terns,” In Proceedings of the 2003 International Conference on Data Mining (ICDM'03), Melbourne, FL, 2003.

    QR CODE