
Graduate Student: Hsieh, Jih-Tang (謝日棠)
Thesis Title: Online Human Action Recognition Using Deep Learning for Indoor Smart Mobile Robots (以深度學習技術為基礎之線上人體動作辨識應用於室內移動型智慧機器人)
Advisor: Fang, Chiung-Yao (方瓊瑤)
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of Publication: 2020
Academic Year of Graduation: 108
Language: English
Number of Pages: 54
Keywords (Chinese, translated): Online human action recognition, indoor smart mobile robot, moving camera, deep learning, long short-term memory, bi-directional long short-term memory, temporal enhancement long short-term memory, spatial feature, temporal feature, structural feature
Keywords (English): Online human action recognition, Indoor smart mobile robot, Bi-directional long short-term memory, Temporal enhancement long short-term memory, Temporal feature, Structural feature
DOI URL: http://doi.org/10.6345/NTNU202001028
Thesis Type: Academic thesis

This study proposes a deep-learning-based online human action recognition system for indoor smart mobile robots. The system performs online human action recognition from visual input while the camera moves toward the target person, with the main goal of giving intelligent human-robot interaction an additional interface beyond voice control and touch screens.
The system takes three kinds of visual input: colour image information, short-term dynamic information, and human skeleton information. Recognition proceeds in five stages: human detection, human tracking, feature extraction, action classification, and fusion. The system first applies a two-dimensional pose estimation method to locate the people in each image and then tracks them with the Deep SORT tracking method. Human action features are then extracted from every tracked person for subsequent action recognition. Three kinds of action features are extracted: spatial features, short-term dynamic features, and skeletal features. In the action classification stage, the three feature types are fed into three separately trained long short-term memory (LSTM) networks for action classification. Finally, the outputs of the three networks are fused and reported as the system's classification result in order to achieve the best performance.
In addition, this study constructs a human action dataset recorded with a moving camera, the CVIU Moving Camera Human Action dataset. The dataset contains 3,646 human action videos covering 11 single-person actions and 5 two-person interactive actions, each captured from three different camera angles. The single-person actions are drinking while standing, drinking while sitting, eating while standing, eating while sitting, playing with a phone, sitting down, standing up, using a laptop, walking straight, walking sideways, and reading. The interactive actions are kicking, hugging, carrying an object, walking toward each other, and walking away from each other. The videos in this dataset are also used to train and evaluate the proposed system. Experimental results show that the classifier based on spatial features achieves a recognition rate of 96.64%, the classifier based on short-term dynamic features 81.87%, and the classifier based on skeletal features 68.10%. Fusing the three feature types raises the recognition rate to 96.84%.
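For reference, the 16 action categories described above can be collected into a simple label table. The Python mapping below is purely illustrative: the index order and label strings are assumptions, since the dataset's actual label encoding is not given in this record.

```python
# Illustrative label mapping for the 16 CVIU action categories listed above.
# Index order and label strings are assumptions, not the dataset's encoding.
CVIU_ACTIONS = {
    0: "drink (standing)",   1: "drink (sitting)",
    2: "eat (standing)",     3: "eat (sitting)",
    4: "play with a phone",  5: "sit down",
    6: "stand up",           7: "use a laptop",
    8: "walk straight",      9: "walk sideways",
    10: "read",
    # Two-person interactive actions
    11: "kick",              12: "hug",
    13: "carry object",      14: "walk toward each other",
    15: "walk away from each other",
}

SINGLE_PERSON = set(range(0, 11))   # 11 single-person actions
INTERACTIVE = set(range(11, 16))    # 5 interactive actions
```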

This research proposes a vision-based online human action recognition system. The system uses deep learning methods to recognise human actions while the camera itself is moving. The proposed system consists of five stages: human detection, human tracking, feature extraction, action classification and fusion. The system uses three kinds of input information: colour intensity, short-term dynamic information and skeletal joints.
In the human detection stage, a two-dimensional (2D) pose estimation method is used to detect humans. In the human tracking stage, the Deep SORT tracking method is used to track each detected person. In the feature extraction stage, three kinds of features, spatial, temporal and structural, are extracted to analyse human actions. In the action classification stage, the three kinds of human action features are classified by three separate long short-term memory (LSTM) classifiers. In the fusion stage, a fusion method combines the three outputs of the LSTM classifiers.
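To make the five-stage flow concrete, the following is a minimal Python/PyTorch sketch of how detection, tracking, per-track feature buffering, the three LSTM branches, and score-level fusion could fit together at inference time. All names, feature dimensions, the 32-frame window, and the mean-score fusion rule are assumptions for illustration, not the thesis's actual implementation; the detector, tracker, and feature extractors are passed in as placeholders.

```python
# Minimal sketch of the five-stage online pipeline (hypothetical names, feature
# dimensions, window length, and fusion rule; not the thesis's implementation).
from collections import defaultdict, deque

import torch
import torch.nn as nn
import torch.nn.functional as F

WINDOW = 32        # assumed sliding-window length in frames
NUM_CLASSES = 16   # 11 single-person + 5 interactive actions

class LSTMBranch(nn.Module):
    """One feature-specific classifier: an LSTM over a feature sequence."""
    def __init__(self, feat_dim, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, NUM_CLASSES)

    def forward(self, seq):                            # seq: (batch, time, feat_dim)
        out, _ = self.lstm(seq)
        return F.softmax(self.fc(out[:, -1]), dim=-1)  # class scores

# One branch per feature type; the dimensions below are assumptions.
branches = {
    "spatial":    LSTMBranch(2048),  # colour/appearance features
    "temporal":   LSTMBranch(1024),  # short-term dynamic (motion) features
    "structural": LSTMBranch(36),    # e.g. 18 two-dimensional skeletal joints
}

def fuse(scores):
    """Score-level fusion: here simply the mean of the three branch outputs."""
    return torch.stack(list(scores.values())).mean(dim=0)

# Per-track sliding windows of features, filled frame by frame.
buffers = defaultdict(lambda: {name: deque(maxlen=WINDOW) for name in branches})

def process_frame(frame, detector, tracker, extractors):
    """One online step: detect people, track them, buffer features per track,
    and classify every track whose window is full. `detector` stands in for the
    2D pose estimator, `tracker` for Deep SORT, and `extractors` maps each
    feature name to a callable returning a 1-D feature tensor."""
    tracks = tracker(detector(frame))        # [(track_id, person), ...]
    results = {}
    for track_id, person in tracks:
        for name, extract in extractors.items():
            buffers[track_id][name].append(extract(frame, person))
        window = buffers[track_id]
        if all(len(window[name]) == WINDOW for name in branches):
            scores = {name: branches[name](
                          torch.stack(list(window[name])).unsqueeze(0))
                      for name in branches}
            results[track_id] = int(fuse(scores).argmax(dim=-1))
    return results
```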
This study constructs the Computer Vision and Image Understanding (CVIU) Moving Camera Human Action dataset (CVIU dataset), containing 3,646 human action sequences that cover 11 types of single-person actions and 5 types of interactive actions. The single-person actions are drinking in sitting and standing positions, eating in sitting and standing positions, playing with a phone, sitting down, standing up, using a laptop, walking straight, walking sideways, and reading. The interactive actions are kicking, hugging, carrying an object, walking toward each other, and walking away from each other. This dataset was used to train and evaluate the proposed system. Experimental results showed that the recognition rates of the spatial, temporal and structural features were 96.64%, 81.87% and 68.10%, respectively, and fusing the three features raised the recognition rate of the proposed system for indoor smart mobile robots to 96.84%.
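For context on how per-feature and fused recognition rates such as those above could be measured, here is a small evaluation sketch under the same assumptions as the pipeline sketch (simple score fusion, a test set yielding pre-extracted feature windows and labels); the thesis's actual fusion rule and evaluation protocol may differ.

```python
# Hypothetical evaluation loop computing per-branch and fused recognition rates.
# `test_set` is assumed to yield (feature_windows, label) pairs, where
# `feature_windows` maps each feature name to a (time, feat_dim) tensor.
import torch

def evaluate(branches, fuse, test_set):
    correct = {name: 0 for name in branches}
    correct["fused"] = 0
    total = 0
    with torch.no_grad():
        for feature_windows, label in test_set:
            scores = {name: branch(feature_windows[name].unsqueeze(0))
                      for name, branch in branches.items()}
            for name, score in scores.items():
                if int(score.argmax(dim=-1)) == label:
                    correct[name] += 1
            if int(fuse(scores).argmax(dim=-1)) == label:
                correct["fused"] += 1
            total += 1
    return {name: hits / total for name, hits in correct.items()}
```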

Chapter 1 Introduction
  1.1 Research Motivation
  1.2 Background and Difficulty
  1.3 Research Contribution
  1.4 Thesis Framework
Chapter 2 Related Work
  2.1 Features of Human Action Recognition
  2.2 Models of Human Action Recognition
Chapter 3 Online Human Action Recognition System
  3.1 Research Purpose
  3.2 System Flowchart
    3.2.1 Human Detection
    3.2.2 Human Tracking
    3.2.3 Feature Extraction
    3.2.4 Action Classification
    3.2.5 Fusion
Chapter 4 Experimental Results
  4.1 Research Environment and Equipment Setup
  4.2 CVIU Moving Camera Human Action Dataset
  4.3 Action Classification Results of Three Types of Features
  4.4 Fusion Results
  4.5 Multi-Human Action Classification Results
Chapter 5 Conclusions and Future Works
  5.1 Conclusions
  5.2 Future Works
References

