National Taiwan Normal University Master's and Doctoral Theses Full-Text System

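Graduate student:	Yu, Song-Tien (余松恬)
Thesis title:	A Study on the Applicability and Performance Evaluation of a General-Purpose Systolic Array AI Accelerator (通用型脈動陣列 AI 加速器：評估適用性與效能研究)
Advisor:	Hwang, Wen-Jyi (黃文吉)
Oral examination committee:	Yeh, Tso-Zen (葉佐任); Tung, Yi-Chih (董一志); Hwang, Wen-Jyi (黃文吉)
Oral defense date:	2023/07/17
Degree:	Master
Department:	Department of Computer Science and Information Engineering
Year of publication:	2023
Academic year of graduation:	111 (ROC calendar)
Language:	Chinese
Number of pages:	51
Keywords (translated from Chinese):	systolic array hardware accelerator, edge computing, neural network model
Keywords (English):	Gemmini, RISC-V
DOI URL:	http://doi.org/10.6345/NTNU202301110
Thesis type:	Academic thesis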

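This thesis evaluates the applicability and performance of a general-purpose systolic array AI hardware accelerator across different types of neural network models. As deep learning is widely adopted in edge computing, hardware accelerator design has become key to improving edge computing efficiency. However, configuring a dedicated hardware accelerator for every kind of neural network is impractical: if the accelerator configuration must change frequently as the model architecture changes, the cost becomes prohibitive.
This thesis proposes a configuration for a general-purpose systolic array AI hardware accelerator that addresses the hardware adaptation problem in neural network applications, so that a single accelerator can serve multiple neural network architectures. It also builds an SoC platform that integrates a RISC-V core with the general-purpose AI hardware accelerator and implements it on an FPGA board, providing an evaluation platform that reflects real operating conditions.
Gemmini is chosen as the representative general-purpose systolic array AI hardware accelerator. Under different hardware configurations, experiments are run on two representative neural network models: an image-based component recognition model built on a two-dimensional convolutional neural network, and a gesture recognition model built on one-dimensional convolution. Combining the performance evaluation with measured FPGA hardware resource usage, the study proposes a guide for selecting suitable general-purpose systolic array accelerator configurations, as a reference for researchers in the AI field.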

Acknowledgments
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1-1 Research Background
1-2 Research Motivation
1-3 Research Objectives
1-4 Research Contributions
Chapter 2 Theoretical Background
2-1 Chipyard SoC Generators
2-1-1 Rocket Chip
2-1-2 Deep Learning Accelerator
2-1-3 Components and Tools
2-2 Gemmini AI Accelerator
2-2-1 Processing Element and Controller
2-2-2 Dataflow Type
2-2-3 Software Support
2-3 Systolic Array
2-4 Weight Stationary
Chapter 3 Research Method
3-1 Gemmini: Hardware Flexibility
3-2 Model Architecture
3-2-1 Automated Optical Inspection Model
3-2-2 Gesture Recognition Model
3-3 Quantization and Deployment
Chapter 4 Experimental Results and Performance Analysis
4-1 Experimental Environment
4-2 Acceleration Performance Compared to CPU
4-2-1 Automated Optical Inspection Model Evaluation
4-2-2 Gesture Recognition Model Evaluation
4-2-3 Execution Time and Speedup Ratio
4-3 Gemmini Memory Subsystem and Performance
4-3-1 Scratchpad Memory and Hardware Resources
4-3-2 Accumulator Memory and Hardware Resources
4-3-3 Performance of Different Memory Configurations
4-4 Gemmini Systolic Array Size and Performance
4-4-1 Systolic Array PE Size and Hardware Resources
4-4-2 Performance of Different Systolic Array PE Size
4-5 L2 Cache Capacity and Performance
4-5-1 L2 Cache Capacity and Hardware Resources
4-5-2 Performance of Different L2 Cache Capacity
4-6 Hardware Configuration Selection Guide
Chapter 5 Conclusion
References

