Graduate student: Hsieh, Jih-Tang (謝日棠)
Thesis title: Online Human Action Recognition Using Deep Learning for Indoor Smart Mobile Robots (以深度學習技術為基礎之線上人體動作辨識應用於室內移動型智慧機器人)
Advisor: Fang, Chiung-Yao (方瓊瑤)
Degree: Master's
Department: Department of Computer Science and Information Engineering (資訊工程學系)
Year of publication: 2020
Academic year of graduation: 108
Language: English
Number of pages: 54
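Keywords (translated from Chinese): Online human action recognition, Indoor smart mobile robot, Moving camera, Deep learning, Long short-term memory, Bi-directional long short-term memory, Temporal enhancement long short-term memory, Spatial feature, Temporal feature, Structural feature
English keywords: Online human action recognition, Indoor smart mobile robot, Bi-directional long short-term memory, Temporal enhancement long short-term memory, Temporal feature, Structural feature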
DOI URL: http://doi.org/10.6345/NTNU202001028
Thesis type: Academic thesis
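  • This study proposes a deep-learning-based online human action recognition system for indoor smart mobile robots. The system uses visual input to recognise human actions online while the camera moves toward the target person. Its main purpose is to give intelligent human-robot interaction more interface options beyond voice control and touchscreens.
    The system takes three kinds of visual input: colour image information, short-term dynamic information and human skeleton information. Recognition proceeds in five stages: human detection, human tracking, feature extraction, action recognition and result fusion. The system first uses a 2D pose estimation method to locate people in the image and then tracks them with Deep SORT. Next, human action features are extracted from each tracked person for subsequent recognition. Three kinds of features are extracted: spatial, short-term dynamic and skeletal. In the action recognition stage, the three feature types are fed into three separately trained neural networks (LSTM networks) for action classification. Finally, the outputs of the three networks are fused to produce the system's classification result, with the aim of achieving the best performance.
    In addition, this study builds a human action dataset captured with a moving camera (CVIU Moving Camera Human Action dataset). The dataset contains 3,646 human action videos covering 11 single-person actions and 5 two-person interactions recorded from three camera angles. The single-person actions are drinking while standing, drinking while sitting, eating while standing, eating while sitting, using a phone, sitting down, standing up, using a laptop, walking straight, walking horizontally and reading. The two-person interactions are kicking, hugging, carrying an object, walking toward each other and walking away from each other. The videos in the dataset are also used to train and evaluate the system. Experimental results show recognition rates of 96.64% for the spatial-feature classifier, 81.87% for the short-term dynamic-feature classifier and 68.10% for the skeleton-feature classifier. Fusing the three features gives a recognition rate of 96.84%.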

    This research proposes a vision-based online human action recognition system that uses deep learning methods to recognise human actions captured by a moving camera. The proposed system consists of five stages: human detection, human tracking, feature extraction, action classification and fusion. It uses three kinds of input information: colour intensity, short-term dynamic information and skeletal joints.
    In the human detection stage, a two-dimensional (2D) pose estimation method is used to detect each person. In the human tracking stage, the Deep SORT tracking method is used to track the detected person. In the feature extraction stage, three kinds of features (spatial, temporal and structural) are extracted to analyse human actions. In the action classification stage, each kind of feature is classified by its own long short-term memory (LSTM) classifier. In the fusion stage, a fusion method combines the three outputs of the LSTM classifiers.
    This study constructs the Computer Vision and Image Understanding (CVIU) Moving Camera Human Action dataset (CVIU dataset), which contains 3,646 human action sequences covering 11 types of single-person actions and 5 types of interactive actions. The single-person actions are drinking while sitting, drinking while standing, eating while sitting, eating while standing, playing with a phone, sitting down, standing up, using a laptop, walking straight, walking horizontally and reading. The interactive actions are kicking, hugging, carrying an object, walking toward each other and walking away from each other. This dataset was used to train and evaluate the proposed system. Experimental results showed that the recognition rates of the spatial, temporal and structural features were 96.64%, 81.87% and 68.10%, respectively. The recognition rate after fusing the three classifiers was 96.84%.
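
    The fusion stage described above can be illustrated with a minimal sketch. The abstract does not state which fusion rule the thesis uses, so the weighted averaging of per-class probabilities below is an assumption, and the names `ACTIONS`, `fuse_scores` and `predict_action`, as well as the classifier keys and weights, are hypothetical. Only the 16 action labels come from the abstract.

```python
from typing import Dict, List, Sequence

# The 16 action classes named in the abstract (11 single-person, 5 interactive).
ACTIONS: List[str] = [
    "drink (sitting)", "drink (standing)", "eat (sitting)", "eat (standing)",
    "play with phone", "sit down", "stand up", "use laptop",
    "walk straight", "walk horizontally", "read",
    "kick", "hug", "carry object",
    "walk toward each other", "walk away from each other",
]


def fuse_scores(scores: Dict[str, Sequence[float]],
                weights: Dict[str, float]) -> List[float]:
    """Weighted average of per-class probabilities (assumed fusion rule).

    `scores` maps a classifier name (e.g. "spatial") to its probability
    vector over ACTIONS; `weights` maps the same names to non-negative
    weights. Both are illustrative, not taken from the thesis.
    """
    total = sum(weights[name] for name in scores)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    fused = [0.0] * len(ACTIONS)
    for name, probs in scores.items():
        if len(probs) != len(ACTIONS):
            raise ValueError(f"{name}: expected {len(ACTIONS)} scores")
        w = weights[name] / total  # normalise so fused scores still sum to 1
        for i, p in enumerate(probs):
            fused[i] += w * p
    return fused


def predict_action(scores: Dict[str, Sequence[float]],
                   weights: Dict[str, float]) -> str:
    """Return the action label with the highest fused score."""
    fused = fuse_scores(scores, weights)
    best = max(range(len(fused)), key=fused.__getitem__)
    return ACTIONS[best]
```

    With equal weights, two classifiers that favour the same action outvote a third that disagrees, which is one plausible reason a fused result could exceed the best single classifier. Unequal weights could instead favour the stronger classifier, but whether the thesis does so is not stated in the abstract.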

    Chapter 1 Introduction 1
      1.1 Research Motivation 1
      1.2 Background and Difficulty 6
      1.3 Research Contribution 7
      1.4 Thesis Framework 8
    Chapter 2 Related Work 9
      2.1 Features of Human Action Recognition 9
      2.2 Models of Human Action Recognition 13
    Chapter 3 Online Human Action Recognition System 15
      3.1 Research Purpose 15
      3.2 System Flowchart 15
        3.2.1 Human Detection 16
        3.2.2 Human Tracking 17
        3.2.3 Feature Extraction 20
        3.2.4 Action Classification 25
        3.2.5 Fusion 30
    Chapter 4 Experimental Results 32
      4.1 Research Environment and Equipment Setup 32
      4.2 CVIU Moving Camera Human Action Dataset 32
      4.3 Action Classification Results of Three Types of Features 33
      4.4 Fusion Results 40
      4.5 Multi-Human Action Classification Results 43
    Chapter 5 Conclusions and Future Works 45
      5.1 Conclusions 45
      5.2 Future Works 46
    References 47

