Graduate Student: | 俞柏丞 Yu, Po-Cheng |
---|---|
Thesis Title: | 基於深度學習之攝影指引系統──多面相評論和評分 Deep Learning-Based Photography Guidance System: Multi-Aspect Reviews and Ratings |
Advisors: | 方瓊瑤 Fang, Chiung-Yao; 吳孟倫 Wu, Meng-Luen |
Oral Defense Committee: | 方瓊瑤 Fang, Chiung-Yao; 陳世旺 Chen, Sei-Wang; 羅安鈞 Luo, An-Chun; 黃仲誼 Huang, Chung-I; 吳孟倫 Wu, Meng-Luen |
Oral Defense Date: | 2024/07/12 |
Degree: | 碩士 Master |
Department: | 資訊工程學系 Department of Computer Science and Information Engineering |
Publication Year: | 2024 |
Academic Year of Graduation: | 112 |
Language: | Chinese |
Number of Pages: | 49 |
Chinese Keywords: | 拍攝指引系統, 影像美學描述, 自然語言處理, 異質性特徵融合, 影像美學評估, 電腦視覺 |
English Keywords: | Photography Guidance System, Aesthetic Image Captioning, Natural Language Processing, Heterogeneous Feature Fusion, Image Aesthetics Assessment, Computer Vision |
Research Methods: | Experimental design, thematic analysis, comparative study, observational study |
DOI URL: | http://doi.org/10.6345/NTNU202401358 |
Thesis Type: | Academic thesis |
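In recent years, natural language processing and image processing have advanced rapidly, giving rise to numerous applications. As mobile phones have become an important tool for everyday photography, this study proposes a deep learning-based photography guidance system. The system combines natural language processing and image processing techniques to give users suggestions of emotional and aesthetic value while they shoot. It provides guidance through textual comments and aesthetic scores, helping users improve their photography skills and accurately capture the beauty in a scene.
The photography guidance system consists of two subsystems: an aesthetic scoring subsystem that outputs scores and an aesthetic critique subsystem that outputs text. The aesthetic scoring subsystem adopts a multi-scale image quality assessment model as this study's reference metric for objectively evaluating images. The aesthetic critique subsystem adopts a text generation model built on an Encoder-Decoder architecture: this study selects SwinV2 as the Encoder to extract image features and GPT-2 as the Decoder to learn text features, uses a cross-attention mechanism inside the Decoder to fuse the heterogeneous features, and finally generates comments. Because the cross-attention mechanism cannot fuse heterogeneous features effectively, this study introduces the Self-Resurrecting Activation Unit (SRAU) to control what is learned from the heterogeneous features. In addition, because the Multi-Layer Perceptron (MLP) in the GPT-2 block cannot learn to handle complex feature information, this study adopts the Feedforward Network Gaussian Error Gated Linear Units (FFN_GEGLU) architecture to improve the model's learning.
To address the scarcity of data, this study uses a weakly labeled dataset collected from the web, but the textual comments in weakly labeled data often contain errors. To improve dataset quality, this study applies two methods. The first is to collect and organize the weakly labeled dataset and raise its quality through data cleaning; the second is to add high-quality data to training and increase its amount through data augmentation. With these data processing methods, this study consolidates the data into a single high-quality dataset for training and testing. The results show that 33 of 35 evaluation metrics outperform the baseline model; that is, the improved model beats the baseline on 94% of the metrics across five aesthetic aspects, which demonstrates its effectiveness.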
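The cross-attention fusion described in this paragraph can be illustrated with a minimal, single-head sketch in plain Python. It is not the thesis implementation: the learned query/key/value projections, multiple heads, and the SRAU gate are all omitted, and the names `cross_attention`, `text_states`, and `image_feats` are hypothetical. The sketch only shows how each decoder (text) state forms a weighted mix of encoder (image) features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(text_states, image_feats):
    """Each text state (query) attends over all image features.

    Assumption for illustration: keys and values are the raw image
    features, with no learned projection matrices.
    """
    dim = len(image_feats[0])
    fused = []
    for q in text_states:
        # Scaled dot-product score between the query and every image feature.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in image_feats
        ]
        weights = softmax(scores)
        # Weighted sum of image features: the fused representation.
        fused.append(
            [sum(w * v[j] for w, v in zip(weights, image_feats)) for j in range(dim)]
        )
    return fused


# Toy inputs: two text states and three image patch features (made-up values).
text_states = [[1.0, 0.0], [0.0, 1.0]]
image_feats = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused = cross_attention(text_states, image_feats)
```

In this toy input, the first text state points along the first axis, so its fused vector leans toward the image feature that shares that direction.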
In recent years, advances in natural language processing and image processing have led to numerous applications. With smartphones becoming essential for daily photography, this study proposes a deep learning-based photography guidance system. Combining natural language processing and image processing, the system gives users emotionally and aesthetically valuable suggestions through textual comments and aesthetic scores, helping them improve their photography skills, capture and present beauty more accurately, and express emotional stories.
The photography guidance system is divided into two subsystems: an aesthetic scoring subsystem that outputs scores and an aesthetic critique subsystem that outputs text. The aesthetic scoring subsystem employs a multi-scale image quality assessment model as an objective reference for evaluating images. The aesthetic critique subsystem uses an Encoder-Decoder framework for text generation: SwinV2 serves as the Encoder to extract image features, and GPT-2 serves as the Decoder to learn text features. Inside the Decoder, a cross-attention mechanism fuses the heterogeneous features, and the model then generates reviews. Since the cross-attention mechanism cannot effectively fuse heterogeneous features, this study introduces the Self-Resurrecting Activation Unit (SRAU) to control the learning of heterogeneous features. Moreover, the Multi-Layer Perceptron (MLP) in the GPT-2 block is unable to learn and process complex feature information, so this study adopts the Feedforward Network Gaussian Error Gated Linear Units (FFN_GEGLU) architecture to enhance the model's learning effectiveness.
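As a rough illustration of the FFN_GEGLU layer named above, the following plain-Python sketch follows the bias-free form from the "GLU Variants Improve Transformer" paper, FFN_GEGLU(x) = (GELU(xW) ⊗ xV) W2. The weight matrices, their sizes, and the helper names (`matvec`, `ffn_geglu`, `w_gate`, `w_value`, `w_out`) are made up for the example; how the thesis wires this layer into the GPT-2 block is not stated in the abstract, so none of this should be read as the author's code.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matvec(matrix, vec):
    """Multiply a matrix (list of rows, one row per output unit) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def ffn_geglu(x, w_gate, w_value, w_out):
    """Bias-free GEGLU feedforward layer.

    The GELU-activated gate branch is multiplied element-wise with a
    linear value branch, then projected back to the model dimension.
    """
    gate = [gelu(g) for g in matvec(w_gate, x)]
    value = matvec(w_value, x)
    hidden = [g * v for g, v in zip(gate, value)]
    return matvec(w_out, hidden)


# Toy example: model dimension 2, hidden dimension 3 (illustrative weights).
x = [1.0, -1.0]
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_value = [[1.0, 1.0], [1.0, -1.0], [0.0, 1.0]]
w_out = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
out = ffn_geglu(x, w_gate, w_value, w_out)
```

Compared with a standard two-layer MLP, the only structural change is the extra value branch and the element-wise product, which lets the gate modulate each hidden unit per input.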
The dataset in this study mainly consists of weakly labeled data collected from the web, whose textual reviews are unverified and error-prone. To improve quality, two methods are proposed: cleaning and organizing the weakly labeled data, and augmenting high-quality data to address the shortage of aesthetic reviews. Together, these methods produce a high-quality dataset for training and testing. Results show that the proposed photography guidance system outperforms the baseline model on 33 of 35 evaluation metrics, that is, on 94% of the metrics across five aesthetic aspects, confirming its effectiveness.