
Student: 黃柏穎 (Huang, Po-Ying)
Thesis Title: 基於 Transformer 用於物件狀態分析之關聯度計算模型 (A Transformer-based Relationship Computing Model for Object Status Analysis)
Advisor: 林政宏 (Lin, Cheng-Hung)
Committee Members: 賴穎暉 (Lai, Ying-Hui), 陳勇志 (Chen, Yung-Chih), 林政宏 (Lin, Cheng-Hung)
Oral Defense Date: 2023/07/24
Degree: Master
Department: Department of Electrical Engineering (電機工程學系)
Publication Year: 2023
Graduation Academic Year: 111
Language: Chinese
Pages: 47
Chinese Keywords: 持球者分析、自注意機制、骨架關節點、物體關聯性
English Keywords: ball handler analysis, self-attention, skeleton joints, object relationship
Research Method: Experimental design
DOI URL: http://doi.org/10.6345/NTNU202301095
Document Type: Academic thesis
Access counts: 80 views, 9 downloads
    Basketball analysis systems are essential tools in modern basketball, and identifying the ball handler is one of their most critical tasks. The traditional approach first obtains bounding boxes for the players and the ball through object detection, and then uses the geometric relationship between them, such as the Intersection over Union or the distance between box centers, to decide which player holds the ball. However, we found that this approach is prone to misjudgment: players frequently overlap in basketball games, so geometric relationships alone cannot reliably determine who is holding the ball. This poses a great challenge to ball-handler analysis.
    To solve this problem, we propose "A Transformer-based Relationship Computing Model for Object Status Analysis," which adds the players' skeleton information as action features and learns the relationship between the players and the ball through self-attention. Experimental results show that our model reaches a ball-handler accuracy of 92.3% with only a small amount of training data, surpassing the 85.1% accuracy of the traditional algorithm. It further achieves 95.4% accuracy on a test set disjoint from the training data.
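The traditional baseline described above can be sketched in a few lines. This is a minimal illustration, not the thesis's actual code: the box format `(x1, y1, x2, y2)` and the tie-breaking rule are assumptions made for the example.

```python
import math

def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def center_distance(box_a, box_b):
    """Euclidean distance between the two box centers."""
    ca = ((box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2)
    cb = ((box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2)
    return math.hypot(ca[0] - cb[0], ca[1] - cb[1])

def guess_ball_handler(player_boxes, ball_box):
    """Geometric baseline: the player whose box overlaps the ball most,
    with the center distance as a tie-breaker (hypothetical rule)."""
    return max(range(len(player_boxes)),
               key=lambda i: (iou(player_boxes[i], ball_box),
                              -center_distance(player_boxes[i], ball_box)))
```

As the abstract notes, this heuristic breaks down precisely when several player boxes overlap the ball at once, which is what motivates the learned relationship model.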

    Acknowledgments i
    Abstract (Chinese) ii
    ABSTRACT iii
    Table of Contents iv
    List of Tables vi
    Chapter 1  Introduction 1
      1.1 Research Background and Motivation 1
      1.2 Research Objectives 3
      1.3 Overview of the Research Method 3
      1.4 Contributions 4
      1.5 Thesis Organization 5
    Chapter 2  Literature Review 6
      2.1 Euclidean Distance 6
        2.1.1 Euclidean Distance Between Center Coordinates 6
      2.2 Intersection over Union (IoU) 8
        2.2.1 Ball-Possession Judgment with IoU 9
      2.3 Residual Learning 11
      2.4 Transformer 12
        2.4.1 Multi-Head Attention 12
        2.4.2 Position Encoding 13
      2.5 ViTPose 15
    Chapter 3  Research Method 17
      3.1 Transformer-based Relationship Computing Model for Object Status Analysis 17
        3.1.1 Input Feature Token 18
        3.1.2 Embedding Layer 20
        3.1.3 Object Relation Finder (ORF) 20
        3.1.4 Multi-Head Attention 22
        3.1.5 Layer Normalization 24
        3.1.6 Feed-Forward 25
    Chapter 4  Experimental Results 26
      4.1 Experimental Setup 26
        4.1.1 Ball-Handler Training Dataset 26
        4.1.2 Evaluation Metrics 29
        4.1.3 Hardware and Software Configuration 29
        4.1.4 Training Details 30
      4.2 Experimental Results of Each Method 30
        4.2.1 Traditional Algorithms 30
        4.2.2 Transformer-based Relationship Computing Model for Object Status Analysis 31
        4.2.3 Player-Ball Relationship Experiments 32
        4.2.4 Comparison and Conclusions Across Methods 33
        4.2.5 Visual Comparison of Results on the Test Set 34
      4.3 Results in Real-World Deployment 37
        4.3.1 Visual Comparison in Real-World Deployment 38
    Chapter 5  Conclusion and Future Work 41
      5.1 Conclusion 41
      5.2 Future Work 41
    References 42
    Autobiography 47

