Graduate student: | 周柏永 Chou, Po-Yung |
---|---|
Thesis title: | 以快慢雙流圖卷積神經網路架構實現骨架動作辨識 SlowFast-GCN: A Novel Skeleton-Based Action Recognition Framework |
Advisor: | 林政宏 Lin, Cheng-Hung |
Oral defense committee: | 賴穎暉 Lau, Ying-Hui; 陳勇志 Chen, Yung-Zhi |
Oral defense date: | 2021/08/10 |
Degree: | Master |
Department: | Department of Electrical Engineering |
Publication year: | 2021 |
Academic year of graduation: | 109 (ROC calendar) |
Language: | Chinese |
Number of pages: | 45 |
Keywords (Chinese): | 動作辨識 (action recognition), 骨架分析 (skeleton analysis), 圖卷積神經網路 (graph convolutional neural network) |
Keywords (English): | Action Recognition, Skeletons, Graph Convolutional Network |
Research method: | Experimental design |
DOI URL: | http://doi.org/10.6345/NTNU202101013 |
Thesis type: | Academic thesis |
This thesis addresses skeleton-based action recognition. Most prior work on this task has concentrated on learning better spatial features and has rarely examined how temporal features are learned, even though experience from action recognition in general indicates that the time dimension strongly affects accuracy. We therefore focus on the influence of the time dimension and propose a two-stream network, SlowFast-GCN, that takes inputs at different time scales, extracts static and dynamic features simultaneously, and fuses them. We further modify the adjacency matrix inside the graph convolution so that it can be learned separately for different time segments, which yields more accurate skeleton correlations. Experimental results show that mixing features of different time scales effectively increases recognition accuracy: SlowFast-GCN achieves 94.8% accuracy on NTU RGB+D, and 95.2% with the improved adjacency matrix. These results indicate that features at different time scales are important for skeleton-based action recognition.
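The two ideas in the abstract, sampling the same skeleton sequence at two time scales and using a different adjacency matrix for each time segment, can be illustrated with a minimal pure-Python sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: the stride values, the equal-length segments, the temporal averaging, the fusion by concatenation, and all function names are hypothetical, and the adjacency matrices are fixed rather than learned.

```python
# Illustrative sketch only. Strides, segmentation, pooling and fusion are
# assumptions for this example, not details taken from the thesis.

def subsample(frames, stride):
    """Keep every `stride`-th frame of a skeleton sequence."""
    return frames[::stride]

def graph_conv(frame, adjacency):
    """Aggregate joint features over neighbours: out[i] = sum_j A[i][j] * x[j]."""
    num_joints = len(frame)
    dim = len(frame[0])
    return [
        [sum(adjacency[i][j] * frame[j][d] for j in range(num_joints))
         for d in range(dim)]
        for i in range(num_joints)
    ]

def segment_graph_conv(frames, adjacencies):
    """Apply a separate adjacency matrix to each contiguous temporal segment."""
    num_segments = len(adjacencies)
    seg_len = -(-len(frames) // num_segments)  # ceiling division
    return [graph_conv(f, adjacencies[t // seg_len]) for t, f in enumerate(frames)]

def temporal_mean(frames):
    """Average each joint's features over time."""
    num_frames = len(frames)
    return [
        [sum(f[i][d] for f in frames) / num_frames
         for d in range(len(frames[0][i]))]
        for i in range(len(frames[0]))
    ]

def slowfast_features(frames, adjacencies, slow_stride=4, fast_stride=1):
    """Fuse a sparsely sampled (slow) and a densely sampled (fast) stream."""
    slow = temporal_mean(segment_graph_conv(subsample(frames, slow_stride), adjacencies))
    fast = temporal_mean(segment_graph_conv(subsample(frames, fast_stride), adjacencies))
    # Fusion by per-joint concatenation (an assumption for this sketch).
    return [s + f for s, f in zip(slow, fast)]

# Toy input: 8 frames, 2 joints, 1 feature per joint.
frames = [[[float(t)], [float(2 * t)]] for t in range(8)]
# One adjacency matrix per time segment: identity first, then a joint swap.
adjacencies = [[[1, 0], [0, 1]], [[0, 1], [1, 0]]]

print(slowfast_features(frames, adjacencies))  # [[4.0, 6.25], [2.0, 4.25]]
```

In this toy setup the slow stream sees only frames 0 and 4 while the fast stream sees all eight, so the two halves of each fused joint feature summarize the same motion at different temporal resolutions.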