研究生: |
林真伊 Chen-Yi Lin |
---|---|
論文名稱: |
以樹首遞迴投影探勘常見子樹結構之研究 A Tree-Projected Pattern Growing Method for Mining Frequent Sub-Trees Efficiently |
指導教授: |
柯佳伶
Koh, Jia-Ling |
學位類別: |
碩士 Master |
系所名稱: |
資訊教育研究所 Graduate Institute of Information and Computer Education |
論文出版年: | 2004 |
畢業學年度: | 92 |
語文別: | 中文 |
論文頁數: | 95 |
中文關鍵詞: | 具項目標示有序樹 、常見子樹 、封閉常見子樹 |
英文關鍵詞: | rooted, labeled, and ordered trees, frequent sub-trees, closed sub-trees |
論文種類: | 學術論文 |
相關次數: | 點閱:183 下載:1 |
分享至: |
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
本論文針對由具項目標示有序樹結構的資料庫中探勘常見子樹的問題,提出以樹首遞迴投影方式的方法,稱為Tree-Projection演算法,可在探勘過程避免產生不必要之候選子樹。此外,我們擴展Tree-Projection演算法,根據使用者所給定的初始最小支持度門檻值及K值,探勘出資料庫內前K個具最大支持度計數值的封閉常見子樹。依選取常見子樹為樹首做投影的探勘順序不同,及探勘過程中最小支持度門檻值提高與否,發展出DFS、DFS-TC、及SS-TC三種Tree-Projection演算法的變型演算法。由實驗結果顯示,在資料庫特定大小的範圍內,Tree-Projection演算法執行效率不易受最小支持度門檻值及資料產生參數值設定的影響,且效率皆優於之前已提出的常見子樹探勘演算法(TreeMiner演算法) 。此外,以支持度計數值由大而小選取常見子樹為樹首做投影,並結合在探勘過程中提高最小支持度門檻值的SS-TC法,在K遠小於所有封閉常見子樹的情況下,可大量減少資料庫投影次數,大幅提昇Tree-Projection演算法探勘前K個具最大支持度計數值的封閉常見子樹之執行效率。
A new approach, called Tree-Projection algorithm, is proposed in this thesis for mining embedded frequent sub-trees in a forest of rooted, labeled, and ordered trees. This approach greatly reduces the efforts of generating candidate sub-trees. Moreover, Tree-Projection algorithm is extended to solve an additional problem: mining top-K frequent closed sub-trees, where K is the desired number of closed sub-trees to be mined specified by users. According to the various orders of performing tree-projec-tions and whether the dynamic minimum-support raising strategy is applied, three va-rieties of Tree-Projection algorithm are designed: they are named DFS, DFS-TC, and SS-TC, respectively. The experimental results show that, within a limited database size, the increment of runtime for Tree-Projection algorithm keeps stable under the various minimum-supports setting at run-time and arguments setting of data sets. In most cases, Tree-Projection algorithm outperforms the existing TreeMiner algorithm in terms of the execution time. Furthermore, the SS-TC variety of Tree-Projection al-gorithm adopts support descending order to perform tree-projections and raises minimum-support dynamically. When K is small proportional to the number of all closed sub-trees, the SS-TC variety could discover the top-K frequent closed sub-trees by pruning the unnecessary tree-projections dramatically and provide excellent mining efficiency.
[1] Rakesh Agrawal, Tomasz Imielinski, and Arun Swami, “Mining Association Rules between Sets of Items in Large Databases,” In Proceedings of the ACM SIGMOD Conference on Management of Data, 1993.
[2] Rakesh Agrawal, and Ramakrishnan Srikant, “Fast Algorithms for Mining Asso-ciation Rules,” In Proceedings of the 20th International Conference on Very Large Databases, Santiago, Chile, pp. 487-499, 1994.
[3] Jiawei Han, Jian Pei, and Yiwen Yin, “Mining Frequent Patterns without Candi-date Generation,” In Proceedings of the 2000 ACM-SIGMOD International Con-ference on Management of Data (SIGMOD'00), Dallas, TX, pp. 1-12, 2000.
[4] Rakesh Agrawal, and Ramakrishnan Srikant, “Mining Sequential Patterns,” In Proceedings of the 11th International Conference on Data Engineering (ICDE’95), Taiwan, pp. 3-14, 1995.
[5] Mohammed J. Zaki, “SPADE: An Efficient Algorithm for Mining Frequent Se-quences,” Machine Learning Journal, special issue on Unsupervised Learning (Doug Fisher, ed.), Vol. 42 No. 1/2, pp 31-60, 2001.
[6] Jay Ayres, J. E. Gehrke, Tomi Yiu, and Jason Flannick, “Sequential PAttern Min-ing Using Bigmaps,” In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Can-ada, 2002.
[7] Jian Pei, Jiawei Han, Behzad Mortazavi-Asl, Helen Pinto, Qiming Chen, Umesh-war Dayal, and Mei-Chun Hsu, “PrefixSpan: Mining Sequential Patterns Effi-ciently by Prefix-Projected Pattern Growth,” In Proceedings of the 2001 Interna-tional Conference on Data Engineering (ICDE'01), Heidelberg, Germany, pp. 215-224, 2001.
[8] Mohammed J. Zaki, “Efficiently Mining Frequent Trees in a Forest,” In Proceed-ings of the eighth ACM SIGKDD international conference on Knowledge discov-ery and data mining, pp. 71-80, 2002.
[9] Tatsuya Asai, Kenji Abe, Shinji Kawasoe, Hiroki Arimura, Hiroshi Sakamoto, and Setsuo Arikawa, “Efficient Substructure Discovery from Large Semi-structured Data,” In Proceedings of the Second SIAM International Conference on Data Min-ing (SDM'02), pp.158-174, 2002.
[10] K. Wang, and H.Q. Liu, “Discovering Typical Structures of Documents: A Road Map Approach,” In Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 146-154, 1998.
[11] J. Pei, J. Han, and R. Mao, “CLOSET: An Efficient Algorithm for Mining Fre-quent Closed Itemsets,” In Proceedings of the 2000 ACM-SIGMOD International Workshop on Data Mining and Knowledge Discovery (DMKD'00), Dallas, TX, 2000.
[12] X. Yan, J. Han, and R. Afshar, “CloSpan: Mining Closed Sequential Patterns in Large Datasets,” In Proceedings of the 2003 SIAM International Conference on Data Mining (SDM'03), San Fransisco, CA, May 2003.
[13] P. Tzvetkov, X. Yan, and J. Han, “TSP: Mining Top-K Closed Sequential Pat-terns,” In Proceedings of the 2003 International Conference on Data Mining (ICDM'03), Melbourne, FL, 2003.