
Graduate Student: Rifky Afriza (江瑞飛)
Thesis Title: A Novel Cooperative Deep Reinforcement Learning to Learn How to Communicate in Multi-Agent Cooperative Tasks (新型態合作型深度強化學習方法用於多智能個體協作任務)
Advisors: Jacky Baltes (包傑奇), Saeed Saeedvand (薩義德)
Oral Defense Committee: Jacky Baltes (包傑奇), Saeed Saeedvand (薩義德), Syuan-Yi Chen (陳瑄易), Tzuu-Hseng Li (李祖聖)
Oral Defense Date: 2024/07/01
Degree: Master
Department: Department of Electrical Engineering
Year of Publication: 2024
Academic Year of Graduation: 112 (2023–2024 academic year)
Language: English
Number of Pages: 39
Keywords: Multi-Agent Reinforcement Learning (多智能體強化學習), Deep Reinforcement Learning (深度強化學習)
Research Methods: Experimental design; action research
DOI URL: http://doi.org/10.6345/NTNU202400884
Thesis Type: Academic thesis


Abstract: Multi-Agent Reinforcement Learning (MARL) faces formidable challenges when tackling cooperative tasks due to the expansive state space. Traditional approaches, such as Independent Proximal Policy Optimization (IPPO), lack awareness of other agents, while centralized methods like Multi-Agent Proximal Policy Optimization (MAPPO) employ centralized learning with decentralized policies. In this study, we introduce a novel communication-centric approach in which agents encode their state information alongside action messages, creating dynamic channels of information exchange. By facilitating this exchange among agents, our approach bridges the gap between individual decision-making and collaborative task completion. Through empirical evaluations, we demonstrate the effectiveness of our method in improving convergence and performance across diverse cooperative MARL scenarios, thus pushing the boundaries of decentralized policy learning within a centralized framework.
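The abstract describes agents that encode their local state alongside the action messages they exchange. The following is a minimal, illustrative PyTorch sketch of what such a communication-centric actor could look like; the class and parameter names (CommActor, msg_dim, and so on) are assumptions made for illustration and are not taken from the thesis implementation.

```python
# Minimal sketch (not the thesis code): each agent's actor emits an action
# plus a learned message that encodes its local observation; at the next
# step every agent receives the other agents' messages as extra input.
import torch
import torch.nn as nn

class CommActor(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, msg_dim: int, n_agents: int, hidden: int = 64):
        super().__init__()
        # Input = own observation + concatenated messages from the other agents.
        in_dim = obs_dim + msg_dim * (n_agents - 1)
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.action_head = nn.Linear(hidden, act_dim)   # e.g. mean of a Gaussian policy head
        self.message_head = nn.Linear(hidden, msg_dim)  # message broadcast to teammates

    def forward(self, obs: torch.Tensor, incoming_msgs: torch.Tensor):
        h = self.encoder(torch.cat([obs, incoming_msgs], dim=-1))
        return self.action_head(h), torch.tanh(self.message_head(h))

# Toy rollout step for two agents: messages produced at step t are consumed at step t+1.
n_agents, obs_dim, act_dim, msg_dim = 2, 8, 4, 3
actors = [CommActor(obs_dim, act_dim, msg_dim, n_agents) for _ in range(n_agents)]
msgs = [torch.zeros(msg_dim) for _ in range(n_agents)]   # no messages before the first step
observations = [torch.randn(obs_dim) for _ in range(n_agents)]
actions, new_msgs = [], []
for i, actor in enumerate(actors):
    others = torch.cat([m for j, m in enumerate(msgs) if j != i], dim=-1)
    action, msg = actor(observations[i], others)
    actions.append(action)
    new_msgs.append(msg.detach())   # exchanged before the next decision step
msgs = new_msgs
```

Detaching the outgoing message here only keeps the toy rollout free of gradient bookkeeping; in a PPO-style trainer the message head could be optimized end-to-end through the policy loss.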

Table of Contents:
Acknowledgements
Abstract
Contents
List of Figures
List of Tables
List of Symbols
Chapter 1 Introduction
  1.1 Background
  1.2 Related Works
  1.3 Problem Statement
  1.4 Research Aim and Objectives
  1.5 Structure of the Thesis
Chapter 2 Literature Review
  2.1 Proximal Policy Optimization (PPO)
  2.2 IPPO and MAPPO
    2.2.1 Independent Proximal Policy Optimization (IPPO)
    2.2.2 Multi-Agent Proximal Policy Optimization (MAPPO)
Chapter 3 Methodology
  3.1 Pushing Box Environment
    3.1.1 Scenario
    3.1.2 Agent
    3.1.3 Reward Function
  3.2 Multi-Agent MuJoCo Environment
  3.3 Communication
Chapter 4 Experimental Results and Discussion
  4.1 Pushing Box Environment Testbed
    4.1.1 Hyperparameters
    4.1.2 Results
      4.1.2.1 Two Humanoids in Cooperative Task Scenarios
      4.1.2.2 Humanoid and Ant in Cooperative Task Scenarios
    4.1.3 Summary of Pushing Box Testbed
      4.1.3.1 Two Humanoids Performance
      4.1.3.2 Humanoid and Ant Performance
  4.2 MAMuJoCo Testbed
    4.2.1 Scenarios
    4.2.2 Hyperparameters
    4.2.3 Results
  4.3 Ablation Experiment
Chapter 5 Conclusion and Future Work
  5.1 Conclusion
References


Full Text: Not authorized for public release (not available for download)