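| Graduate student: | 黃弘智 Huang, Hung-Chih |
|---|---|
| Thesis title: | 基於半監督式骨架動作辨識模型之圖資料增強方法 Graph-Based Augmentation for Semi-supervised Learning in Skeleton-based Action Recognition |
| Advisor: | 林政宏 Lin, Cheng-Hung |
| Oral defense committee: | 陳勇志 Chen, Yung-Chih; 賴穎暉 Lai, Ying-Hui; 林政宏 Lin, Cheng-Hung |
| Oral defense date: | 2023/07/24 |
| Degree: | Master |
| Department: | Department of Electrical Engineering |
| Year of publication: | 2023 |
| Academic year of graduation: | 111 |
| Language: | Chinese |
| Number of pages: | 51 |
| Keywords (Chinese): | 動作辨識 (action recognition), 圖卷積神經網路 (graph convolutional neural network), 半監督式學習 (semi-supervised learning), 圖資料強化 (graph data augmentation) |
| Keywords (English): | action recognition, Graph Convolution Network, Semi-Supervised Learning, graph data augmentation |
| Research method: | Experimental design |
| DOI URL: | http://doi.org/10.6345/NTNU202301345 |
| Thesis type: | Academic thesis |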
In recent years, skeleton-based action recognition has improved significantly with the introduction of graph convolutional networks. Unlike conventional RGB video-based action recognition, skeleton-based action recognition takes the locations of human joints as input. This input is less susceptible to real-world background noise, which leads to more efficient and more accurate recognition. However, preparing joint data requires substantial human effort, so labeled samples for training are scarce in practical applications. In addition, tuning the parameters of a pre-trained model costs considerable time and effort. Together, these issues are bottlenecks for applying skeleton-based action recognition.
To overcome these problems, we propose several data augmentation methods designed specifically for skeleton action data, and we combine them with a semi-supervised learning strategy that exploits unlabeled samples, improving recognition when only a small amount of labeled data is available. The proposed augmentation methods increase data diversity at low additional computational cost, allowing the model to extract a wider variety of features. In the semi-supervised strategy, we apply data augmentation of two different strengths to the two streams of unlabeled input and use the similarity between the resulting predictions as a consistency loss that forms part of the model loss. We expect this loss to make the model's predictions more consistent and more robust, and to help it learn more information relevant to the recognition decision. Moreover, we adjust the point in training at which unlabeled data is introduced, which preserves accuracy while significantly reducing training time.
Experimental results show that, in a low-data experiment on the large-scale NTU RGB+D dataset, the proposed framework achieves 84.16% accuracy, an improvement of 6.66 percentage points over the 77.5% accuracy of the original method. These results indicate that the proposed method effectively improves recognition accuracy and generalization with a small amount of labeled data, offering an effective solution to data scarcity and to the cost of model adjustment in practical applications.
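The semi-supervised strategy described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, the augmentations, or the exact schedule, so everything concrete here is an assumption made for illustration: the mean squared error between softmax outputs stands in for the similarity-based consistency loss, a simple step function stands in for the schedule that delays unlabeled data, and the names `consistency_loss`, `unlabeled_weight`, `total_loss`, and `start_epoch` are hypothetical. The logits are assumed to come from the same model applied to a weakly and a strongly augmented view of the same unlabeled skeleton sequence.

```python
import math


def softmax(logits):
    """Convert raw class scores into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def consistency_loss(weak_logits, strong_logits):
    """Mean squared error between the two predicted distributions.

    Assumption: MSE is used here only as one possible measure of
    (dis)similarity between the predictions for the two views.
    """
    p = softmax(weak_logits)
    q = softmax(strong_logits)
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


def unlabeled_weight(epoch, start_epoch, max_weight=1.0):
    """Weight of the unlabeled term: zero until start_epoch is reached.

    Assumption: a hard switch; the thesis may use a different schedule.
    """
    return max_weight if epoch >= start_epoch else 0.0


def total_loss(supervised_loss, weak_logits, strong_logits, epoch, start_epoch):
    """Supervised loss plus the scheduled consistency term."""
    weight = unlabeled_weight(epoch, start_epoch)
    if weight == 0.0:
        # Unlabeled data has not been introduced yet, so the
        # consistency term (and its forward passes) can be skipped.
        return supervised_loss
    return supervised_loss + weight * consistency_loss(weak_logits, strong_logits)


if __name__ == "__main__":
    weak = [2.0, 0.5, -1.0]    # hypothetical logits for the weak view
    strong = [0.3, 1.8, -0.5]  # hypothetical logits for the strong view
    for epoch in (0, 10):
        print(epoch, total_loss(0.7, weak, strong, epoch, start_epoch=10))
```

Skipping the unlabeled branch before `start_epoch` is what saves computation in this sketch: until that point each training step only processes labeled samples, which mirrors the abstract's claim that delaying the unlabeled data reduces training time.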