
Author: 陳柏瑋
Chen, Po-Wei
Thesis Title: 利用機器學習填補遺漏值的比較與研究
Comparison of multiple machine-learning methods of imputation
Advisor: 呂翠珊
Lu, Tsui-Shan
Committee: 蔡碧紋
Tsai, Pi-Wen
吳宗軒
Wu, Chung-Hsuen
呂翠珊
Lu, Tsui-Shan
Approval Date: 2022/06/23
Degree: Master
Department: Department of Mathematics
Thesis Publication Year: 2022
Academic Year: 110
Language: English
Number of pages: 33
Keywords (in Chinese): 遺漏值, 機器學習, K-鄰近算法, 鏈式方程多重填補法, 缺失森林
Keywords (in English): Imputation of missing values, K-Nearest Neighbor, Multivariate Imputation by Chained Equations, MissForest
DOI URL: http://doi.org/10.6345/NTNU202201080
Thesis Type: Academic thesis/dissertation
  • 本研究主要探討具有遺漏值的數據通過多種機器學習方法填補後之比較。遺漏值的填補是進行資料分析的重要過程,若隨意刪除或簡易替換,可能會導致後續的統計分析出現重大偏差,因此,在可用的填補方法中進行有效的選擇至關重要。

    我們利用近期熱門的機器學習填補法 K-鄰近算法 (K-Nearest Neighbor)、鏈式方程多重填補法 (Multivariate Imputation by Chained Equations) 及缺失森林 (MissForest) 等三種方法進行了模擬研究。在各種隨機遺漏設置下,當數據是完全連續、完全類別或混合型數據集時,以評估每種方法的各自結果,結果表明,利用缺失森林 (MissForest) 方法來對資料進行填補時,其正規化方根均差 (NRMSE) 或是類別錯誤率 (PFC) 都有著最好的表現。我們還將三種方法應用於幾個實徵數據集上,結果顯示缺失森林皆優於其他兩種機器學習填補法。

    This study compares several machine-learning methods for imputing missing values. Imputation is an important step in data analysis: if missing values are arbitrarily deleted or naively substituted, the subsequent statistical analysis may be substantially biased. Choosing effectively among the available imputation methods is therefore crucial.

    We consider three recent machine-learning imputation methods: K-Nearest Neighbor (KNN), Multivariate Imputation by Chained Equations (MICE), and MissForest. We conduct simulation studies on all-continuous, all-categorical, and mixed data to evaluate each method under various random-missingness settings. The results show that MissForest performs best in terms of both the normalized root mean squared error (NRMSE) and the proportion of falsely classified entries (PFC). We also apply the three methods to several real data sets, where MissForest again outperforms the other two methods.
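
    The two evaluation metrics named in the abstract can be sketched as follows. This is a minimal illustration that assumes the definitions given by Stekhoven and Bühlmann (2012) for MissForest, computed only over the entries that were made missing; the thesis itself may define or normalize them differently. The function and variable names (nrmse, pfc, true_vals, imputed_vals) are hypothetical and chosen only for this example.

```python
import math


def nrmse(true_vals, imputed_vals):
    """Normalized RMSE over the originally missing continuous entries.

    Assumed definition: sqrt(mean((true - imputed)^2) / var(true)),
    with the mean and variance taken over the missing entries only.
    """
    n = len(true_vals)
    mse = sum((t - i) ** 2 for t, i in zip(true_vals, imputed_vals)) / n
    mean_true = sum(true_vals) / n
    var_true = sum((t - mean_true) ** 2 for t in true_vals) / n
    return math.sqrt(mse / var_true)


def pfc(true_labels, imputed_labels):
    """Proportion of falsely classified entries among the originally
    missing categorical entries."""
    wrong = sum(t != i for t, i in zip(true_labels, imputed_labels))
    return wrong / len(true_labels)


# Toy values standing in for the held-out (artificially removed) entries.
true_vals = [1.0, 2.0, 3.0, 4.0]
mean_imputed = [2.5, 2.5, 2.5, 2.5]   # naive mean imputation
print(nrmse(true_vals, mean_imputed))

true_labels = ["a", "b", "a", "c"]
imputed_labels = ["a", "b", "b", "c"]  # one of four labels is wrong
print(pfc(true_labels, imputed_labels))
```

    Under this definition, imputing every missing continuous entry with the mean of the true values gives an NRMSE of 1, so values well below 1 indicate an imputation that does better than that naive baseline. For both metrics, lower is better.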

    Chapter 1 Introduction
    Chapter 2 Statistical Inference Method
      2.1 K-Nearest Neighbours (KNN)
      2.2 Multiple Imputation by Chained Equation (MICE)
      2.3 MissForest
      2.4 Comparison of Three Methods
    Chapter 3 Simulation Study
      3.1 Data generation
        3.1.1 Continuous data
        3.1.2 Categorical data
        3.1.3 Mixed data
      3.2 Results
    Chapter 4 Real Data Analysis
      4.1 Continuous Data
      4.2 Categorical Data
      4.3 Mixed Data
      4.4 Results
    Chapter 5 Conclusions and Discussions
    References


    Public on Internet date: 2027/08/10