
Author: Liu Yu-Jiun (劉育君)
Title: Using a Clustering Method to Construct a Classifier on Concept-Drifting Streams (以資料樣式分群進行變動資料流分類探勘之研究)
Advisor: 柯佳伶
Degree: Master
Department: Graduate Institute of Information and Computer Education
Publication Year: 2007
Graduation Academic Year: 95 (2006-2007)
Language: Chinese
Pages: 51
Keywords: classification, clustering, concept-drifting data streams, concept drift
Document Type: Academic thesis
Unlike the historical data stored in a static database, the data in a stream arrive continuously, at high speed, and without bound, and both the underlying concept and the distribution of the data may change over time. Accordingly, a classifier for the stream environment must make predictions both accurately and efficiently, and it must detect concept changes in time to adjust the existing classification model to the most recent concept and distribution. In this thesis, a clustering-based classification method is proposed to reduce the number of classification rules. First, a nearest-neighbor algorithm is adopted to cluster the training data; then a representative pattern is chosen from each cluster to construct a classification rule, which keeps the classification model small. Because recent data better reflect the current trend, the prediction process uses a sliding window: the exception data that cannot be predicted well within the window are recorded, and the accuracy within the window is monitored. When the accuracy falls below a user-defined threshold, the recorded data points are used to adjust the original classification model. The proposed method controls the number of classification rules through the distance threshold used in clustering, so the storage required by the constructed classifier is adaptable. Furthermore, the classifier can effectively detect concept changes and adjust itself dynamically, maintaining accurate predictions for newly arriving data.

Table of Contents
  List of Tables
  List of Figures
  Chapter 1: Introduction
    1-1 Background and Research Motivation
    1-2 Review of Related Work
    1-3 Proposed Method
    1-4 Organization of the Thesis
  Chapter 2: Problem Definition
  Chapter 3: Constructing the Classification Model
    3-1 Clustering the Training Data
    3-2 Extracting Classification Rules
  Chapter 4: Detecting Concept Change
    4-1 Data Classification and Prediction Method
    4-2 Detecting Concept Change
    4-3 Adjusting the Classification Model
  Chapter 5: Evaluation of Algorithm Efficiency
    5-1 Data Generation
    5-2 Experimental Evaluation
  Chapter 6: Conclusion
  References

List of Tables
  Table 2.1 Car dataset
  Table 2.2 Example dataset D
  Table 2.3 Classification error rates of the attribute values in dataset D
  Table 2.4 Overall entropy of dataset D and the composite entropy of each attribute
  Table 2.5 Classification information value of each attribute in dataset D
  Table 3.1 Example of computing the distance between data patterns
  Table 4.1 Test dataset S
  Table 5.1 Parameter settings of the experimental datasets
  Table 5.2 Classification accuracy on the datasets

List of Figures
  Figure 1.1 Architecture of the ensemble approach
  Figure 1.2 Flowchart of the proposed method
  Figure 3.1 Algorithm for clustering the training data
  Figure 3.2 Clustering process of the training data
  Figure 3.3 Algorithm for extracting classification rules
  Figure 3.4 Process of extracting classification rules
  Figure 3.5 Classification rules generated from dataset D
  Figure 4.1 Format of the data buffer
  Figure 4.2 Flowchart for predicting test data
  Figure 4.3 Algorithm for predicting data classes
  Figure 4.4 Change of the storage structure after predicting the class of e1
  Figure 4.5 Change of the storage structure after predicting the class of e2
  Figure 4.6 Change of the storage structure after predicting the class of e3
  Figure 4.7 Change of the storage structure after predicting the class of e4
  Figure 4.8 Concept of the sliding window
  Figure 5.1 Effect of window size on the reaction time to concept-drifting data
  Figure 5.2 Effect of the distance threshold on the reaction time to concept-drifting data
  Figure 5.3 Effect of the accuracy threshold on the reaction time to concept-drifting data
  Figure 5.4 Effect of window size on the reaction time to changes in data distribution
  Figure 5.5 Effect of the distance threshold on the reaction time to changes in data distribution

    References
    [1] B. Liu, W. Hsu, and Y. Ma, “Integrating Classification and Association Rule Mining,” in Proc. of the International Conference on Knowledge Discovery and Data Mining, pages 80-86, 1998.
    [2] D. Aha, “Lazy Learning,” Artificial Intelligence Review, 11:1-5, 1997.
    [3] E.-H. Han, G. Karypis, and V. Kumar, “Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification,” in Proc. of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2001.
    [4] F. J. Ferrer-Troyano, J. S. Aguilar-Ruiz, and J. C. Riquelme, “Incremental Rule Learning based on Example Nearness from Numerical Data Streams,” in Proc. of the 20th ACM Symposium on Applied Computing (SAC), 2005.
    [5] G. Hulten, L. Spencer, and P. Domingos, “Mining Time-Changing Data Streams,” in Proc. of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 97-106, 2001.
    [6] H. Wang, W. Fan, P. S. Yu, and J. Han, “Mining Concept-Drifting Data Streams Using Ensemble Classifiers,” in Proc. of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
    [7] J. Gama, P. Medas, and R. Rocha, “Forest Trees for On-Line Data,” in Proc. of the 19th ACM Symposium on Applied Computing (SAC), pages 649-653, 2004.
    [8] J. Gama, P. Medas, and P. Rodrigues, “Learning Decision Trees from Dynamic Data Streams,” in Proc. of the 20th ACM Symposium on Applied Computing (SAC), 2005.
    [9] J. Gehrke, R. Ramakrishnan, and V. Ganti, “RainForest: A Framework for Fast Decision Tree Construction of Large Datasets,” Data Mining and Knowledge Discovery, 4(2/3):127-162, 2000.
    [10] J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh, “BOAT: Optimistic Decision Tree Construction,” in Proc. of the ACM SIGMOD International Conference on Management of Data, 1999.
    [11] J. R. Quinlan, “C4.5: Programs for Machine Learning,” Morgan Kaufmann Publishers, San Mateo, CA, 1993.
    [12] L. Breiman, “Bagging Predictors,” Machine Learning, 24(2):123-140, 1996.
    [13] L. Breiman, J. H. Friedman, R. Olshen, and C. J. Stone, “Classification and Regression Trees,” Chapman & Hall, New York, 1984.
    [14] P.-N. Tan, M. Steinbach, and V. Kumar, “Introduction to Data Mining,” Addison Wesley, 2005.
    [15] P. Wang, H. Wang, X. Wu, W. Wang, and B. Shi, “On Reducing Classifier Granularity in Mining Concept-Drifting Data Streams,” in Proc. of the 5th IEEE International Conference on Data Mining (ICDM), 2005.
    [16] R. Jin and G. Agrawal, “Efficient Decision Tree Construction on Streaming Data,” in Proc. of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
    [17] S. Guha, N. Koudas, and K. Shim, “Data Streams and Histograms,” in Proc. of the ACM Symposium on Theory of Computing (STOC), 2001.
    [18] W. Hoeffding, “Probability Inequalities for Sums of Bounded Random Variables,” Journal of the American Statistical Association, 58:13-30, 1963.
    [19] W. Street and Y. Kim, “A Streaming Ensemble Algorithm (SEA) for Large-Scale Classification,” in Proc. of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 377-382, 2001.
    [20] W. W. Cohen, “Fast Effective Rule Induction,” in Proc. of the 12th International Conference on Machine Learning, pages 115-123, 1995.
    [21] Y. Freund and R. E. Schapire, “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting,” Journal of Computer and System Sciences, 55(1):119-139, 1997.
    [22] Y. Yang, X. Wu, and X. Zhu, “Mining in Anticipation for Concept Change: Proactive-Reactive Prediction in Data Streams,” Data Mining and Knowledge Discovery, 13(3), 2006.
    [23] http://www.datasetgenerator.com/
