
Graduate Student: Yu, Song-Tien (余松恬)
Thesis Title: A Study on the Applicability and Performance Evaluation of a General-Purpose Systolic Array AI Accelerator (通用型脈動陣列 AI 加速器:評估適用性與效能研究)
Advisor: Hwang, Wen-Jyi (黃文吉)
Committee Members: Yeh, Tso-Zen (葉佐任); Tung, Yi-Chih (董一志); Hwang, Wen-Jyi (黃文吉)
Oral Defense Date: 2023/07/17
Degree: Master
Department: Department of Computer Science and Information Engineering (資訊工程學系)
Year of Publication: 2023
Academic Year of Graduation: 111 (ROC calendar, 2022-2023)
Language: Chinese
Number of Pages: 51
Chinese Keywords: systolic array (脈動陣列), hardware accelerator (硬體加速器), edge computing (邊緣運算), neural network model (神經網路模型)
English Keywords: Gemmini, RISC-V
DOI URL: http://doi.org/10.6345/NTNU202301110
Thesis Type: Academic thesis
  • This thesis evaluates the applicability and performance of a general-purpose systolic array AI hardware accelerator on different types of neural network models. As deep learning sees wide use in edge computing, hardware accelerator design has become key to improving edge-computing efficiency. However, provisioning a dedicated hardware accelerator for every kind of neural network is impractical: if the accelerator configuration must change frequently with each model architecture, the cost burden becomes prohibitive.
    This thesis proposes a configuration scheme for a general-purpose systolic array AI hardware accelerator, aiming to solve the hardware-fit problem in neural network applications so that a single accelerator can serve many different network architectures. It also builds an SoC platform that integrates the general-purpose AI accelerator with a RISC-V core, implemented on an FPGA board; this SoC provides a realistic evaluation platform.
    Gemmini is selected as the representative general-purpose systolic array AI hardware accelerator. Experiments are run under different hardware configurations on two representative neural network models: a component-recognition model for images (automated optical inspection) based on 2-D convolutional neural networks, and a gesture recognition model based on 1-D convolution. By combining performance evaluation with measurements of FPGA hardware resource usage, the thesis proposes a selection scheme for suitable general-purpose systolic array accelerator hardware configurations, as a reference for researchers in the AI field.
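
    The sketch below is not part of the thesis; it only illustrates the mechanism implied by the abstract, namely why one GEMM-oriented systolic array accelerator can serve both a 2-D-convolution model and a 1-D-convolution model: a convolution layer can be lowered via im2col to a single matrix multiplication, which is the kind of operation a weight-stationary array such as Gemmini executes. The function names (im2col_1d, conv1d_as_gemm) and the input shapes are illustrative assumptions written in plain NumPy, not the Gemmini software stack used in the thesis.

```python
import numpy as np

def im2col_1d(x, kernel_size, stride=1):
    """Unfold a 1-D signal (channels, length) so that each row holds one
    flattened receptive field; the convolution then becomes one GEMM."""
    channels, length = x.shape
    out_len = (length - kernel_size) // stride + 1
    cols = np.empty((out_len, channels * kernel_size), dtype=x.dtype)
    for i in range(out_len):
        cols[i] = x[:, i * stride:i * stride + kernel_size].reshape(-1)
    return cols  # shape: (out_len, channels * kernel_size)

def conv1d_as_gemm(x, weights, stride=1):
    """weights: (out_channels, in_channels, kernel_size).
    The matrix product below is what a weight-stationary systolic array
    computes: the weight matrix stays resident in the PEs while the
    unfolded input rows stream through."""
    out_ch, in_ch, k = weights.shape
    cols = im2col_1d(x, k, stride)                # (out_len, in_ch * k)
    w_mat = weights.reshape(out_ch, in_ch * k).T  # (in_ch * k, out_ch)
    return cols @ w_mat                           # (out_len, out_ch)

# Illustrative sizes only: an 8-channel, 128-sample input and sixteen
# kernel-size-5 filters reduce to a single (124 x 40) * (40 x 16) GEMM.
x = np.random.randn(8, 128).astype(np.float32)
w = np.random.randn(16, 8, 5).astype(np.float32)
print(conv1d_as_gemm(x, w).shape)  # (124, 16)
```

    A 2-D convolution lowers to a GEMM in the same way with a 2-D im2col, which is why a single set of accelerator hardware parameters (PE array size, scratchpad and accumulator capacity, L2 cache) can be evaluated against both models, as Chapter 4 of the thesis does.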

    Acknowledgments
    Abstract
    Table of Contents
    List of Figures
    List of Tables
    Chapter 1  Introduction
      1-1 Research Background
      1-2 Research Motivation
      1-3 Research Objectives
      1-4 Research Contributions
    Chapter 2  Theoretical Background
      2-1 Chipyard SoC Generators
        2-1-1 Rocket Chip
        2-1-2 Deep Learning Accelerator
        2-1-3 Components and Tools
      2-2 Gemmini AI Accelerator
        2-2-1 Processing Element and Controller
        2-2-2 Dataflow Type
        2-2-3 Software Support
      2-3 Systolic Array
      2-4 Weight Stationary
    Chapter 3  Research Methods
      3-1 Gemmini: Hardware Flexibility
      3-2 Model Architecture
        3-2-1 Automated Optical Inspection Model
        3-2-2 Gesture Recognition Model
      3-3 Quantization and Deployment
    Chapter 4  Experimental Results and Performance Analysis
      4-1 Experimental Environment
      4-2 Acceleration Performance Compared to CPU
        4-2-1 Automated Optical Inspection Model Evaluation
        4-2-2 Gesture Recognition Model Evaluation
        4-2-3 Execution Time and Speedup Ratio
      4-3 Gemmini Memory Subsystem and Performance
        4-3-1 Scratchpad Memory and Hardware Resources
        4-3-2 Accumulator Memory and Hardware Resources
        4-3-3 Performance of Different Memory Configurations
      4-4 Gemmini Systolic Array Size and Performance
        4-4-1 Systolic Array PE Size and Hardware Resources
        4-4-2 Performance of Different Systolic Array PE Sizes
      4-5 L2 Cache Capacity and Performance
        4-5-1 L2 Cache Capacity and Hardware Resources
        4-5-2 Performance of Different L2 Cache Capacities
      4-6 Hardware Configuration Selection Guide
    Chapter 5  Conclusion
    References

