Author: | 洪坊瑜 Hong, Fang-Yu |
---|---|
Thesis Title: | 利用隨機交互森林預測模型之應用 Applications of Predictive Models Using Random Interaction Forests |
Advisor: | 程毅豪 Chen, Yi-Hau; 呂翠珊 Lu, Tsui-Shan |
Committee: | 林惠文 Lin, Hui-Wen; 程毅豪 Chen, Yi-Hau; 呂翠珊 Lu, Tsui-Shan |
Approval Date: | 2023/06/20 |
Degree: | Master |
Department: | Department of Mathematics |
Thesis Publication Year: | 2023 |
Academic Year: | 111 |
Language: | Chinese |
Number of pages: | 43 |
Keywords (in Chinese): | 交互作用、隨機森林、隨機交互森林、機器學習、迴歸分析 |
Keywords (in English): | interaction effect, random forests, random interaction forests, machine learning, regression analysis |
Research Methods: | experimental design, secondary data analysis, survey research, thematic analysis, comparative research, observational research, content analysis |
DOI URL: | http://doi.org/10.6345/NTNU202300737 |
Thesis Type: | Academic thesis/dissertation |
根據生物、工業,以及商業統計資料,對於不同領域下的預測分析,舉例客戶行為、消費者需求或股票價格波動以及診斷病人等等,從中探討重要變數之間的交互作用,達到模型更準確的預測結果,本研究套用了隨機森林演算法,考慮交互效應予以改善模型並允許對解釋變數做交互作用進行有價值的洞察效果,而隨機交互作用森林(Random Interaction Forest, RIF)是隨機森林(Random Forest, RF)所衍生出來的一種新策略演算法,適合用於類別、連續變數或存活等資料型態加以預測,並明確地模擬建構森林中的決策樹所執行變數之間定性與定量的相互作用。
在模擬研究中,使用了R包套件中"vivid"(Variable Importance and Variable Interactions Displays),呈現了機器學習模型中變數之間的重要性以及交互作用的可視覺化工具,同時也使用了R包中"diversityForest",透過投票分割抽樣,在隨機森林中進行複雜的分類程序,使用雙變數拆分對定量和定性交互效應進行建模。
交互森林(Interaction Forest, IF)帶有效果重要性度量(Effect Importance Measure, EIM),可用於識別具有高預測相關性的定量和定性交互作用的變數做應對。IF和EIM專注於易於解釋的交互形式。透過新的隨機交互森林結構,檢驗了線性迴歸模型、邏輯迴歸模型,增添了機器學習預測模型的能力。研究結果表明,當RIF模型存在交互作用時,不僅優於隨機森林和邏輯、迴歸分析方法。同時,證實RIF在執行許多情況下比傳統統計方法所創建的模型識別來的更為準確。並且交互作用為顯著時,RIF的性能也顯得更加優越表現,表示使用此方法不但可以提高業務流程和科學研究的效率。而且RIF在預測建模中的辨識度以及利用交互效果的部分都相對容易解釋,這是一項具有挑戰性且合適的工具。本文將透過這些方法的檢測應用於2012~2016年台北市死亡數實際資料進行評估。
Predictive analysis is used throughout biological, industrial, and commercial statistics, for example to model customer behavior, consumer demand, stock price fluctuations, or patient diagnosis. Exploring the interactions between important variables can make such models more accurate. This thesis applies the random forest algorithm and accounts for interaction effects, both to improve the model and to gain insight into how the explanatory variables interact. The random interaction forest (RIF) is a new algorithmic strategy derived from the random forest (RF). It is suitable for predicting categorical, continuous, and survival outcomes, and it explicitly models the qualitative and quantitative interactions between variables in the decision trees that make up the forest.
In the simulation study, the R package "vivid" (Variable Importance and Variable Interactions Displays) was used to visualize variable importance and variable interactions in machine learning models. The R package "diversityForest" was also used; it carries out complex classification procedures in random forests through split sampling by vote and models quantitative and qualitative interaction effects using bivariable splits.
The interaction forest (IF) comes with an effect importance measure (EIM), which can be used to identify variables involved in quantitative and qualitative interactions of high predictive relevance. IF and EIM focus on easily interpretable forms of interaction. Using the new random interaction forest structure, the thesis tests linear regression and logistic regression models and thereby extends the capability of machine learning prediction models. The simulation results show that, when interactions are present, the RIF model outperforms the random forest as well as logistic and linear regression, and that in many settings it is more accurate than models built with traditional statistical methods. The more significant the interaction, the greater the advantage of RIF, which indicates that the method can improve the efficiency of business processes and scientific research. Moreover, the way RIF identifies and uses interaction effects in predictive modeling is relatively easy to interpret. We believe that it is a challenging but suitable tool for future work. Finally, these methods are applied to and evaluated on real data on the number of deaths in Taipei City from 2012 to 2016.
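The idea behind the bivariable splits mentioned above can be illustrated with a small, self-contained sketch. It is not the thesis code and does not use the "diversityForest" package: the data-generating model, the fixed cut points at 0, and all function names (`simulate`, `univariable_split_sse`, `bivariable_split_sse`) are assumptions made for illustration. The sketch simulates a purely qualitative interaction, where the effect of `x1` on the outcome reverses sign depending on `x2`, and compares how much residual error remains after a single-variable split and after a split that uses both variables jointly.

```python
import random


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def simulate(n=2000, seed=0):
    """Simulate rows (x1, x2, y) with a purely qualitative interaction.

    Assumed model: y is +1 when x1 and x2 have the same sign and -1
    otherwise, plus Gaussian noise. Neither variable has a main effect.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.uniform(-1, 1)
        x2 = rng.uniform(-1, 1)
        signal = 1.0 if (x1 > 0) == (x2 > 0) else -1.0
        rows.append((x1, x2, signal + rng.gauss(0, 0.5)))
    return rows


def univariable_split_sse(rows, var, cut=0.0):
    """Residual SSE after an ordinary split on one variable (index 0 or 1)."""
    left = [row[2] for row in rows if row[var] <= cut]
    right = [row[2] for row in rows if row[var] > cut]
    return sse(left) + sse(right)


def bivariable_split_sse(rows, cut1=0.0, cut2=0.0):
    """Residual SSE after a split that uses both variables jointly.

    One child holds observations where x1 and x2 fall on the same side of
    their cut points; the other child holds the rest.
    """
    same = [y for x1, x2, y in rows if (x1 > cut1) == (x2 > cut2)]
    diff = [y for x1, x2, y in rows if (x1 > cut1) != (x2 > cut2)]
    return sse(same) + sse(diff)


if __name__ == "__main__":
    data = simulate()
    print("no split:        ", round(sse([y for _, _, y in data]), 1))
    print("split on x1 only:", round(univariable_split_sse(data, 0), 1))
    print("split on x2 only:", round(univariable_split_sse(data, 1), 1))
    print("bivariable split:", round(bivariable_split_sse(data), 1))
```

Under these assumptions, a split on either variable alone leaves the residual error almost unchanged, because the two effects cancel within each child node, whereas the joint split separates the two outcome groups directly. A real interaction forest would also search over cut points and variable pairs rather than fixing them in advance.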