Graduate student: 呂健維 (Lu, Chien-Wei)
Thesis title: Lightweight Deep Learning Emotion Recognition System Based on Facial and Speech Features (基於臉部及語音特徵之輕量化深度學習情感辨識系統)
Advisor: 呂成凱 (Lu, Cheng-Kai)
Oral defense committee: 呂成凱 (Lu, Cheng-Kai); Lin, Cheng-Hung; Lien, Chung-Yueh
Oral defense date: 2024/07/15
Degree: Master's
Department: Department of Electrical Engineering (電機工程學系)
Year of publication: 2024
Academic year of graduation: 112
Language: Chinese
Number of pages: 86
Keywords (Chinese): 深度學習, 雙模態情感識別, 輕量化模型, 卷積神經網路, 陪伴型機器人
Keywords (English): Deep Learning, Bimodal Emotion Recognition, Lightweight Models, Convolutional Neural Networks, Companion Robots
Research method: Experimental design
DOI URL: http://doi.org/10.6345/NTNU202401361
Thesis type: Academic thesis
    Population aging in recent years has led to a shortage of caregivers for the elderly. In response, this study proposes a lightweight emotion recognition model that integrates facial expressions and speech and can be deployed on a companion robot (Zenbo Junior II). Most recent human emotion recognition techniques are based on Convolutional Neural Networks (CNNs) and have achieved excellent results. However, these advanced techniques do not take computational cost into account, so they cannot run on devices with limited computing power, such as companion robots. This study therefore applies the lightweight GhostNet model to facial emotion recognition, uses a lightweight One-Dimensional Convolutional Neural Network (1D-CNN) for speech emotion recognition, and integrates the predictions of the two modalities using the geometric mean. The proposed model achieves accuracies of 97.56% and 82.33% on the RAVDESS and CREMA-D datasets, respectively. While maintaining high accuracy, the number of parameters is compressed to 0.92M and the number of floating-point operations is reduced to 0.77G, tens of times fewer than the known state-of-the-art techniques. Finally, the model was deployed on Zenbo Junior II. A comparison of the computational intensity of the model with that of the hardware shows that the model runs more smoothly on this hardware, and the inference times of the facial and speech emotion recognition models are only 1500 ms and 12 ms, respectively.
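    The geometric-mean integration of the two modalities mentioned in the abstract can be sketched as follows. This is only an illustration under stated assumptions, not the thesis implementation: the function name `fuse_geometric_mean`, the three class labels, the example probabilities, the `eps` floor, and the renormalization step are all hypothetical; the abstract states only that the two predictions are combined with the geometric mean.

```python
import math


def fuse_geometric_mean(face_probs, speech_probs, eps=1e-12):
    """Late fusion sketch: element-wise geometric mean of two
    class-probability vectors, renormalized to sum to 1.

    The eps floor and the renormalization are assumptions made for
    this illustration; the abstract does not specify them.
    """
    if len(face_probs) != len(speech_probs):
        raise ValueError("both modalities must score the same set of classes")
    fused = [
        math.sqrt(max(f, eps) * max(s, eps))
        for f, s in zip(face_probs, speech_probs)
    ]
    total = sum(fused)
    return [v / total for v in fused]


# Hypothetical softmax outputs over three illustrative classes.
labels = ["happy", "sad", "angry"]
face = [0.60, 0.30, 0.10]    # assumed output of the facial model
speech = [0.20, 0.70, 0.10]  # assumed output of the speech model

fused = fuse_geometric_mean(face, speech)
# Pick the class with the highest fused score.
predicted = labels[max(range(len(fused)), key=fused.__getitem__)]
print(predicted, [round(v, 3) for v in fused])
```

    One property of this kind of fusion is that a class scored near zero by either modality receives a low fused score, so the combined prediction favors classes on which both models agree.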

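    Acknowledgments
    Table of Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
        1.1 Background and Research Motivation
        1.2 Research Objectives
        1.3 Research Contributions
        1.4 Thesis Organization
    Chapter 2 Literature Review
        2.1 Theoretical Foundations of Emotion Recognition
        2.2 Emotion Recognition Systems
            2.2.1 Facial Emotion Recognition Systems
            2.2.2 Speech Emotion Recognition Systems
            2.2.3 Text Emotion Recognition Systems
            2.2.4 Multimodal Emotion Recognition Systems
            2.2.5 State-of-the-Art Emotion Recognition Systems
    Chapter 3 Methodology
        3.1 Datasets
            3.1.1 RAVDESS
            3.1.2 CREMA-D
        3.2 Facial Emotion Recognition
            3.2.1 Preprocessing of Facial Images
            3.2.2 GhostNet
            3.2.3 Frame Attention Network
        3.3 Speech Emotion Recognition
            3.3.1 Audio Preprocessing
            3.3.2 1D-CNN Architecture
        3.4 Integrating the Facial and Speech Emotion Recognition Systems
        3.5 Computational Cost Metrics for Neural Networks
            3.5.1 Number of Parameters
            3.5.2 Number of Floating-Point Operations
    Chapter 4 Experimental Results
        4.1 Neural Network Training Environment
        4.2 Experimental Parameters
        4.3 Results
        4.4 Ablation Study
        4.5 Deploying the Emotion Recognition System on Zenbo Junior II
        4.6 Computational Intensity
        4.7 CPU Usage
        4.8 Inference Time
    Chapter 5 Conclusion and Future Work
        5.1 Conclusion
        5.2 Future Work
    References
    Autobiography
    Academic Achievements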

