
Graduate Student: Lai, Yu-Ting (來毓庭)
Thesis Title: Multi-Label Deep Visual-Semantic Embedding with Visual Transformer (利用視覺Transformer之多標籤深度視覺語義嵌入模型)
Advisor: Yeh, Mei-Chen (葉梅珍)
Oral Defense Committee: Chen, Chu-Song (陳祝嵩); Peng, Yan-Tsung (彭彥璁); Yeh, Mei-Chen (葉梅珍)
Defense Date: 2021/10/22
Degree: Master
Department: Department of Computer Science and Information Engineering (資訊工程學系)
Year of Publication: 2021
Academic Year of Graduation: 109
Language: Chinese
Number of Pages: 24
Chinese Keywords: 多標籤分類 (multi-label classification), 視覺語義嵌入模型 (visual-semantic embedding model), 關注機制 (attention mechanism)
English Keywords: multi-label classification, visual-semantic embedding, Transformer
Research Method: Experimental design
DOI URL: http://doi.org/10.6345/NTNU202101778
Thesis Type: Academic thesis
Abstract (Chinese): Multi-label image classification is a challenging task: the goal is to locate objects of different sizes and recognize their correct labels simultaneously. However, the common practice of extracting features from the entire image may dilute the information of smaller objects or turn it into noise, making recognition difficult. Previous studies have shown that attention mechanisms improve feature extraction and that label relations capture co-occurrence, both of which yield more robust information for the multi-label classification task.
In this work, we adopt a Transformer architecture that attends visual region features to the global feature while considering the co-occurrence relations among labels. The attention-weighted features are then used to generate a dynamic semantic classifier, which classifies within the semantic space to produce the predicted labels. Experiments show that our model achieves strong performance.

Abstract (English): Multi-label classification is a challenging task because many kinds of objects at different scales must be identified in a single image. Using only global image features may discard information about small objects; prior studies have shown that an attention mechanism improves feature extraction and that label relations reveal label co-occurrence, both of which benefit multi-label classification.
In this work, we use a Transformer to extract attended features from an image while simultaneously considering label co-occurrence. The attended features are then used to generate a classifier applied in the semantic space to predict the labels. Experiments validate the proposed method.
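To make the described pipeline concrete, the following is a minimal sketch in PyTorch of the kind of model the abstracts outline: visual region features attend to the global image feature through a Transformer-style attention block, and the attended features generate a dynamic classifier that scores labels by dot products in a word-embedding (semantic) space. All layer names, dimensions, and the choice of a 300-dimensional word2vec space are illustrative assumptions, the explicit modeling of label co-occurrence is omitted, and this is not the author's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicSemanticClassifier(nn.Module):
    """Illustrative sketch (not the thesis code): region features attend to the
    global feature; the attended features generate per-region classifiers that
    score labels in a fixed word-embedding (semantic) space."""

    def __init__(self, feat_dim=2048, embed_dim=512, num_heads=8, word_dim=300):
        super().__init__()
        self.region_proj = nn.Linear(feat_dim, embed_dim)   # project CNN region features
        self.global_proj = nn.Linear(feat_dim, embed_dim)   # project the pooled global feature
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                                 nn.Linear(embed_dim, embed_dim))
        self.generator = nn.Linear(embed_dim, word_dim)     # attended features -> classifier weights

    def forward(self, region_feats, global_feat, label_embeds):
        # region_feats: (B, R, feat_dim) local region features from a CNN backbone
        # global_feat:  (B, feat_dim)    globally pooled feature of the same image
        # label_embeds: (C, word_dim)    fixed label word vectors (e.g. word2vec)
        q = self.region_proj(region_feats)                  # queries: region features
        kv = self.global_proj(global_feat).unsqueeze(1)     # key/value: global feature
        attended, _ = self.attn(q, kv, kv)                  # regions attend to global context
        attended = attended + self.ffn(attended)            # feed-forward block with residual
        weights = self.generator(attended)                  # (B, R, word_dim) dynamic classifiers
        # dot products in the semantic space, then max over regions for each label
        logits = torch.einsum('brd,cd->brc',
                              F.normalize(weights, dim=-1),
                              F.normalize(label_embeds, dim=-1)).amax(dim=1)
        return logits                                       # (B, C) multi-label scores

# Example usage with random tensors (shapes are illustrative only)
model = DynamicSemanticClassifier()
regions = torch.randn(2, 49, 2048)        # e.g. a 7x7 grid of region features
global_feat = torch.randn(2, 2048)
labels = torch.randn(80, 300)             # e.g. 80 labels embedded in a word2vec space
scores = model(regions, global_feat, labels)   # -> (2, 80)
```

The max over regions is one plausible way to let small objects contribute a label score even when the global feature dilutes them; the thesis may aggregate region predictions differently.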

Table of Contents:
Chapter 1 Introduction 1
  1.1 Research Background 1
  1.2 Research Motivation 2
  1.3 Research Objectives 3
  1.4 Thesis Organization 3
Chapter 2 Related Work 4
  2.1 Multi-Label Classification 4
  2.2 Semantic Embedding 5
  2.3 Attention Mechanism and Transformer 6
Chapter 3 Method and Procedure 8
  3.1 Problem Definition 8
  3.2 Model Architecture 8
  3.3 Transformer 9
    3.3.1 Multi-Head Attention 9
    3.3.2 Feed-Forward Network 10
  3.4 Loss Function 10
Chapter 4 Experiments 11
  4.1 Datasets 11
  4.2 Implementation Details 12
  4.3 Evaluation Metrics 13
  4.4 Experiment 1: Comparison with Other Models 14
  4.5 Experiment 2: Comparison with the Baseline 17
  4.6 Experiment 3: Per-Class Average Precision Analysis 18
Chapter 5 Conclusion and Future Work 21


Full Text: Not authorized for public access