研究生: |
劉育君 Liu Yu-Jiun |
---|---|
論文名稱: |
以資料樣式分群進行變動資料流分類探勘之研究 Using Clustering Method to Construct A Classifier on Concept Drifting Streams |
指導教授: | 柯佳伶 |
學位類別: |
碩士 Master |
系所名稱: |
資訊教育研究所 Graduate Institute of Information and Computer Education |
論文出版年: | 2007 |
畢業學年度: | 95 |
語文別: | 中文 |
論文頁數: | 51 |
中文關鍵詞: | 分類 、分群 、變動資料流 、概念改變 |
論文種類: | 學術論文 |
相關次數: | 點閱:236 下載:4 |
分享至: |
查詢本校圖書館目錄 查詢臺灣博碩士論文知識加值系統 勘誤回報 |
不同於靜態資料庫中保存的歷史資料,資料流上的資料會以高速、連續、沒有限量的方式進入,資料概念與分布還可能隨著時間而改變。所以在資料流環境上進行分類工作必須兼顧速度與準確度,並能夠適時地偵測到概念改變發生,調整原有的分類模型,以符合最近的資料概念和分布趨勢。因此本論文先利用最近鄰居演算法對訓練資料做分群,再從每個群中挑選一個代表樣式來產生分類規則,以降低分類模型的大小。此外,由於近期資料通常比較符合變化的趨勢,所以在分類的過程中,我們使用一個移動視窗來觀察視窗內分類準確度的變化,並記錄在移動視窗中無法被合適預測的資料點,當分類準確度低於設定的門檻值時,則利用被記錄的資料點調整原分類模型。本論文可透過分群時所設定的距離門檻值來調整分類規則數,使分類模型所佔之空間彈性化。並能有效偵測資料概念改變,且動態調整分類模型,以維持預測新進資料類別的準確度。
Differ from the static database for storing history data, the data stream is continuously and unlimited produced in high-speed. Moreover, the implicit concept and the distribution of data may change as time goes by. Accordingly, the classification model is not only required to perform the predictions correctly and efficiently, but also to detect concept changes for adjusting the classification rules to catch recent trends in time. In this thesis, a clustering based classification method is provided for reducing the number of classification rules. First, the nearest neighbor algorithm is adopted to cluster the training data. Then a representational pattern is chosen from each cluster to construct a classification rule. In the process of predicting, the exception data in a sliding window is recorded and the accuracy in the window is monitored. When the accuracy is below a user-defined threshold value, the recorded data is used to adjust the classifier. The proposed method controls the number of classification rules by setting different distance threshold when performing data clustering. Therefore, the required storage of the constructed classifier is adaptable. Furthermore, the constructed classifier model is adjustable for concept changes to keep accurate predictions.
參考文獻
[1] B. Liu, W. Hsu, and Y. Ma, “Integrating Classification and Association Rule Mining,” in Proc. of the International Conference on Knowledge Discovery and Data Mining, pages 80-86, 1998.
[2] D. Aha, “Lazy Learning,” Artificial Intelligence Review, 11:1-5, 1997.
[3] E.-H. Han, G. Karypis, and V. Kumar, “Tex Categorization Using Weight Adjusted k-Nearest Neighbor Classification,” in Proc. of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2001.
[4] F. J. Ferrer-Troyano, J. S. Aguilar-Ruiz, and J. C. Riquelme, “Incremental Rule Learning based on Example Nearness from Numerical Data Streams,” in Proc. of the 20 ACM Symposium on Applied Computing, ACM SAC, 2005.
[5] G. Hulten, L. Spencer, and P. Domingos, “Mining Time-Changing Data Streams,” in Proc. of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 97-106, 2001.
[6] H. Wang, W. Fan, P. S. Yu, and J. Han, “Mining Concept-Drifting Data Streams Using Ensemble Classifiers,” in Proc. of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
[7] J. Gama, P. Medas, and R. Rocha, “Forest Trees for On-Line Data,” in Proc. of the 19 ACM Symposium on Applied Computing, ACM SAC, pages 649-653, 2004.
[8] J. Gama, P. Medas, and P. Rodrigues, “Learning Decision Trees from Dynamic Data Streams,” in Proc. of the 20 ACM Symposium on Applied Computing , ACM SAC, 2005.
[9] J. Gehrek, R. Ramakrishnan, and V. Ganti, “RainForest – A Framework for Fast Decision Tree Construction of Large Datasets,” Data Mining and Knowledge Discovery, 4(2/3):127-162, 2000.
[10] J. Gehrek, V. Ganti, R. Ramakrishnan, and W. Loh, “BOAT- Optimistic Decision Tree Construction,” in Proc. of the International Conference Management of Data, SIGMOD, 1999.
[11] J. R. Quinlan, “C4.5: Programs for Machine Learning,” Morgan-Kaufmann Publishers, San Mateo, CA, 1993.
[12] L. Breiman, “Bagging Predictors,” Machine Learning, 24(2):123-140, 1996.
[13] L. Breiman, J. H. Friedman, R. Olshen, and C. J. Stone, “Classification and Regression Trees,” Chapman & Hall, New York, 1984.
[14] P.-N. Tan, M. Steinbach, and V. Kumar, “Introduction to Data Mining,” Addison Wesley, 2005.
[15] P. Wang, H. Wang, X. Wu, W. Wang, and B. Shi, “On Reducing Classifier Granularity in Mining Concept-Drifting Data Streams,” in Proc. of the 5th IEEE International Conference on Data Mining (ICDM), 2005.
[16] R. Jin and G. Agrawal, “Efficient Decision Tree Construction on Streaming Data”, in Proc. of the 9 Int. Conf. on Knowledge Discovery and Data Mining, ACM SIGKDD, 2003.
[17] S. Guha, N. Koudas, and K. Shim, “Data Streams and Histograms: ACM Symposium on Theory of Computing,” 2001.
[18] W. Hoeffding, “Probability Inequalities for Sums of Bounded Random Variables,” Journal of the American Statistical Association, 58:13-30, 1963.
[19] W. Street and Y. Kim, “A Streaming Ensemble Algorithm SEA for Large-Scale Classification,” in Proc. of the 7 International Conference on Knowledge Discovery and Data Mining, ACM SIGKDD, pages 377-382, 2001.
[20] W. W. Cohen, “Fast Effective Rule Induction,” in Proc. of the 12 International Conference on Machine Learning, pages 115-123, 1995.
[21] Y. Freund and R. E. Schapire, “A Decision-Theoretic Generalization of On-Line Learning and An Application to Boosting,” Journal of Computer and System Science, 55(1):119-139, 1997.
[22] Y. Yang, X. Wu, and X.Zhu, “Mining in Anticipation for Concept Change: Proactive-Reactive Prediction in Data Streams,” in Data Mining and Knowledge Discovery (DMKD), Volume 13, Number 3, 2006.
[23] http://www.datasetgenerator.com/