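Graduate student: Chen, Po-Wei (陳柏瑋)
Thesis title: Comparison of multiple machine-learning methods of imputation (利用機器學習填補遺漏值的比較與研究)
Advisor: Lu, Tsui-Shan (呂翠珊)
Oral defense committee: Tsai, Pi-Wen (蔡碧紋); Wu, Chung-Hsuen (吳宗軒); Lu, Tsui-Shan (呂翠珊)
Oral defense date: 2022/06/23
Degree: Master
Department: Department of Mathematics
Year of publication: 2022
Academic year of graduation: 110 (ROC calendar)
Language: English
Number of pages: 33
Keywords (translated from Chinese): missing values, machine learning, K-Nearest Neighbor, Multivariate Imputation by Chained Equations, MissForest
Keywords (English): Imputation of missing values, K-Nearest Neighbor, Multivariate Imputation by Chained Equations, MissForest
DOI URL: http://doi.org/10.6345/NTNU202201080
Thesis type: Academic thesis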

This study compares several machine-learning methods for imputing missing values. Imputation of missing values is an important step in data analysis: if missing values are arbitrarily deleted or naively substituted, the subsequent statistical analysis may be substantially biased. Choosing effectively among the available imputation methods is therefore crucial.

We consider three recently popular machine-learning imputation methods: K-Nearest Neighbor (KNN), Multivariate Imputation by Chained Equations (MICE), and MissForest. We conduct simulation studies on all-continuous, all-categorical, and mixed data sets to evaluate each method under various settings of randomly missing values. The results show that MissForest performs best in terms of both the normalized root mean squared error (NRMSE) and the categorical misclassification rate (PFC). We also apply the three methods to several real data sets, where MissForest again outperforms the other two methods.
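The abstract names NRMSE and PFC as the evaluation criteria but does not define them in this record. The sketch below shows one way such an evaluation could be set up, assuming the definitions commonly used with MissForest (Stekhoven and Bühlmann, 2012): NRMSE is the root mean squared error over the masked continuous entries divided by the standard deviation of their true values, and PFC is the proportion of masked categorical entries that were imputed with the wrong label. The `knn_impute` helper, the toy data, and the masked positions are hypothetical illustrations, not the thesis's code or data, and the helper is a simplified stand-in for a K-nearest-neighbour imputer.

```python
import math
import statistics


def nrmse(true_vals, imputed_vals):
    """Normalized RMSE over masked continuous entries.

    Assumption: population variance is used for normalization; some
    implementations use the sample variance instead. The true values
    must not all be equal, otherwise the variance is zero.
    """
    mse = sum((t - p) ** 2 for t, p in zip(true_vals, imputed_vals)) / len(true_vals)
    return math.sqrt(mse / statistics.pvariance(true_vals))


def pfc(true_labels, imputed_labels):
    """Proportion of masked categorical entries imputed with the wrong label."""
    wrong = sum(t != p for t, p in zip(true_labels, imputed_labels))
    return wrong / len(true_labels)


def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest rows (continuous data only).

    Distances use only the columns observed in both rows. This is a
    hypothetical, simplified helper written for illustration.
    """
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                shared = [c for c in range(len(row))
                          if row[c] is not None and other[c] is not None]
                if not shared:
                    continue
                dist = math.sqrt(
                    sum((row[c] - other[c]) ** 2 for c in shared) / len(shared)
                )
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:  # leave the entry as None if no usable neighbour exists
                filled[i][j] = sum(nearest) / len(nearest)
    return filled


# Toy complete data (hypothetical), then mask two entries in the second column.
complete = [[1.0, 2.0], [2.0, 4.5], [3.0, 6.0], [4.0, 7.0], [5.0, 10.0]]
masked = [(1, 1), (3, 1)]
observed = [list(r) for r in complete]
for i, j in masked:
    observed[i][j] = None

imputed = knn_impute(observed, k=2)
score = nrmse([complete[i][j] for i, j in masked],
              [imputed[i][j] for i, j in masked])
print("NRMSE on masked entries:", score)
print("PFC example:", pfc(["a", "b", "a", "c"], ["a", "b", "c", "c"]))
```

Lower values are better for both criteria. Scoring only the deliberately masked entries is what makes the comparison possible, since the true values at those positions are known.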

Chapter 1 Introduction
Chapter 2 Statistical Inference Method
  2.1 K-Nearest Neighbours (KNN)
  2.2 Multiple Imputation by Chained Equation (MICE)
  2.3 MissForest
  2.4 Comparison of Three Methods
Chapter 3 Simulation Study
  3.1 Data generation
    3.1.1 Continuous data
    3.1.2 Categorical data
    3.1.3 Mixed data
  3.2 Results
Chapter 4 Real Data Analysis
  4.1 Continuous Data
  4.2 Categorical Data
  4.3 Mixed Data
  4.4 Results
Chapter 5 Conclusions and Discussions
References

Electronic full text: public release delayed until 2027/08/10