
Author: 楊昊頤
Yang, Hao-Yi
Thesis title: 用於大規模科學數據處理的高效且可移植的分布建模
Efficient and Portable Distribution Modeling for Large-Scale Scientific Data Processing with Data-Parallel Primitives
Advisor: 王科植
Wang, Ko-Chih
Oral defense committee: 張鈞法
Chang, Chun-Fa
紀明德
Chi, Ming-Te
王科植
Wang, Ko-Chih
Oral defense date: 2021/09/02
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2021
Academic year of graduation: 109
Language: English
Number of pages: 55
Keywords (translated from Chinese): big data processing, scientific data, parallel computing
Keywords (English): Data-Parallel Primitives, large-scale data processing, scientific dataset, distribution-based approach, parallel algorithm
Research method: Experimental design
DOI URL: http://doi.org/10.6345/NTNU202101344
Thesis type: Academic thesis
  • Distribution-based data representation is an emerging and promising approach for handling large-scale scientific datasets.
    Distribution-based approaches typically transform a scientific dataset into many distributions, each computed from a small number of samples.
    Most existing parallel algorithms focus on efficiently fitting a single distribution to many input samples. This may not suit large-scale scientific data processing, because these algorithms cannot fully utilize the computing resources in that setting.
    Histograms and Gaussian Mixture Models (GMMs) are the most popular distribution representations for modeling scientific datasets.
    We therefore propose multi-set histogram and GMM modeling algorithms for large-scale scientific data processing. Our algorithms are built on data-parallel primitives to achieve portability across different hardware architectures.
    We evaluate the performance of the proposed algorithms in detail and demonstrate use cases in scientific data processing.
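
    The multi-set histogram idea in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: it assumes one plausible composition of data-parallel primitives (Map, Sort, ReduceByKey, Scatter), and plain Python stands in for a portable primitive library. All function and parameter names (`map_to_keys`, `reduce_by_key`, `multi_set_histogram`, `set_ids`, `lo`, `hi`) are hypothetical. The point it shows is that encoding each sample as a combined (set, bin) key lets every histogram be built in one pass over all samples, instead of fitting the sets one at a time.

```python
from itertools import groupby


def map_to_keys(samples, set_ids, num_bins, lo, hi):
    """Map primitive: encode each sample as one flat (set, bin) key."""
    width = (hi - lo) / num_bins
    keys = []
    for value, set_id in zip(samples, set_ids):
        # Clamp so that out-of-range values and value == hi stay in a valid bin.
        bin_id = max(0, min(int((value - lo) / width), num_bins - 1))
        keys.append(set_id * num_bins + bin_id)
    return keys


def reduce_by_key(sorted_keys):
    """ReduceByKey primitive: count the run length of each distinct key."""
    return [(key, sum(1 for _ in run)) for key, run in groupby(sorted_keys)]


def multi_set_histogram(samples, set_ids, num_sets, num_bins, lo, hi):
    """Build one histogram per sample set in a single pass over all samples."""
    keys = map_to_keys(samples, set_ids, num_bins, lo, hi)  # Map
    keys.sort()                                             # Sort
    counts = reduce_by_key(keys)                            # ReduceByKey
    # Scatter: write the per-key counts into dense per-set histograms.
    hists = [[0] * num_bins for _ in range(num_sets)]
    for key, count in counts:
        hists[key // num_bins][key % num_bins] = count
    return hists


# Two sets of three samples each, two bins over the range [0, 1].
samples = [0.1, 0.9, 0.4, 0.6, 0.2, 1.0]
set_ids = [0, 0, 0, 1, 1, 1]
histograms = multi_set_histogram(samples, set_ids, num_sets=2, num_bins=2,
                                 lo=0.0, hi=1.0)
print(histograms)  # [[2, 1], [1, 2]]
```

    Because each step is a whole-array primitive, the same structure could in principle be mapped onto a library backend for CPUs or GPUs, which is the portability argument the abstract makes.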

    1. Introduction
    2. Background and Related Work
      2.1 Background
        2.1.1 Scientific Dataset
        2.1.2 Distribution-based Scientific Data Modeling
        2.1.3 Data Parallel Primitives
      2.2 Related Work
        2.2.1 Distribution-based Large Data Processing and Analysis
        2.2.2 Parallelization of Modeling Distribution
        2.2.3 Data Parallel Primitives
    3. Histogram Modeling using Data-Parallel Primitives
    4. Gaussian Mixture Model Modeling using Data-Parallel Primitives
      4.1 Input and Output Arrays
      4.2 M-step
        4.2.1 Weight Estimation
        4.2.2 Mean Vector Estimation
        4.2.3 Covariance Matrix Estimation
      4.3 E-step
      4.4 Responsibility Update
      4.5 EM Termination Conditions
      4.6 Improvement for the Shared Memory Environment
      4.7 Covariance Matrix Computation Simplification
    5. K-means algorithm using Data-Parallel Primitives
      5.1 Input and Output Arrays
      5.2 Initialization of the Cluster Centers Array
      5.3 Cluster Assignment
      5.4 K-Mean Termination Conditions
      5.5 Cluster Center Update
    6. Experiment
      6.1 Performance Analysis of the Algorithms
      6.2 Parameter Analysis of the Algorithms
        6.2.1 Histogram
        6.2.2 GMM
    7. Use Case
    8. Conclusion and Future Works
    Bibliography

