Author: Chen, Pin-Yuan (陳品源)
Thesis title: Implement and Improve a MiniShogi Program Using the AlphaZero Framework (利用AlphaZero框架實作與改良MiniShogi程式)
Advisor: Lin, Shun-Shii (林順喜)
Degree: Master
Department: Department of Computer Science and Information Engineering
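Year of publication: 2020
Academic year of graduation: 108 (ROC calendar)
Language: Chinese
Number of pages: 49
Chinese keywords: 電腦對局, 5五將棋, 蒙地卡羅樹搜尋, 神經網路, 深度學習, 強化學習
English keywords: computer games, MiniShogi, Monte Carlo Tree Search, neural network, deep learning, reinforcement learning
DOI URL: http://doi.org/10.6345/NTNU202000411
Thesis type: Academic thesis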
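  • In March 2016, DeepMind's AlphaGo program defeated Lee Se-Dol, then a Korean 9-dan professional Go player, by 4:1, a major breakthrough for game-playing AI on the path of reinforcement learning. In October 2017, DeepMind proposed the AlphaGo Zero method, which beat the original AlphaGo Lee program 100:0 and showed that a Go program stronger than humans can be trained without human game records as prior knowledge. DeepMind eventually generalized AlphaGo Zero into the AlphaZero method and trained the strongest chess and shogi programs in the world at the time. In exchange, however, DeepMind used enormous computing resources for training to reach that strength.
    The game studied in this thesis is MiniShogi (5五將棋), invented by Shigenobu Kusumoto (楠本茂信) in 1970. MiniShogi is a shogi variant whose board is smaller than that of standard shogi: 5×5 rather than 9×9. This makes it well suited to implementing a game-playing AI program with limited hardware resources.
    This experiment uses the AlphaZero algorithm together with the AlphaZero General framework to implement an AI program trained with a neural network and reinforcement learning. We also add some known advantageous strategies as improvements, so that the training efficiency of the neural network model can be increased under limited hardware resources.
    In training for MiniShogi, we tried two improvements. The first samples training examples according to the importance of the board position, giving the middlegame a higher sampling probability than the endgame and the opening, in the hope that the neural network learns to play middlegame positions better than the plain version does.
    The second trains with a "win directly whenever you can" rule: by seeing the terminal position one turn ahead, it achieves the Winning Attack effect. When MCTS plays, even if a move that decides the game is available, it does not necessarily play that move, which makes the neural network weights converge very slowly. With this method, convergence can be faster than with ordinary training.
    Of the two methods adopted in this study, one succeeded and one failed. According to the experimental data, good sampling has a chance of improving playing strength, but apart from one set of data the results were unsatisfactory. The improvement in playing strength from Winning Attack, by contrast, was very significant. When the two methods were combined in training, playing strength also improved, but the two methods did not reinforce each other.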

    In March 2016, DeepMind's AlphaGo program defeated the Korean 9-dan professional Go player Lee Se-Dol by 4:1, a major breakthrough for game-playing AI in the field of reinforcement learning. In October 2017, DeepMind proposed the AlphaGo Zero method, which defeated the original AlphaGo Lee program 100:0 and showed that a Go program stronger than humans can be trained without human game records as prior knowledge. DeepMind then generalized AlphaGo Zero into the AlphaZero method and used it to train the strongest chess and shogi programs in the world at the time. However, DeepMind used enormous computing resources to reach that strength.
    The game studied in this thesis is MiniShogi, invented by Shigenobu Kusumoto (楠本茂信) in 1970. MiniShogi is a variant of Shogi played on a 5×5 board instead of the 9×9 board of standard Shogi, which makes it well suited to implementing a game-playing AI program with limited hardware resources.
    Our experiment uses the AlphaZero algorithm with the AlphaZero General framework to implement an AI program whose neural network is trained by reinforcement learning. We also add some known advantageous strategies to improve training efficiency under limited hardware resources.
    In training the MiniShogi program, we tried two improvements. The first samples training positions according to the importance of the board position: middlegame positions are given a higher sampling probability than opening and endgame positions, in the hope that the neural network learns to play the middlegame better than the original version does.
    The second is the Winning Attack training method. By looking one turn ahead for a terminal position, the program wins directly whenever it can. We observed that MCTS, even when a move that decides the game is available, does not necessarily play that move, which makes the weights of the neural network converge slowly. With our method, training can converge faster than with ordinary training.
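    The stage-weighted sampling described above could be sketched as follows. This is a minimal illustration, not the thesis's implementation: the stage boundaries (first and last quarter of a game), the 2:1 weight ratio, and every function and parameter name here are assumptions chosen for the example.

```python
import random


def stage_weight(ply, total_plies, opening_frac=0.25, endgame_frac=0.25,
                 mid_weight=2.0, edge_weight=1.0):
    """Return a sampling weight for the position at `ply` (0-indexed).

    Positions in the middle part of the game get `mid_weight`; opening and
    endgame positions get `edge_weight`. All thresholds are illustrative.
    """
    if total_plies <= 0:
        raise ValueError("total_plies must be positive")
    progress = ply / total_plies
    if opening_frac <= progress < 1.0 - endgame_frac:
        return mid_weight
    return edge_weight


def sample_batch(examples, batch_size, rng=None):
    """Draw a training batch (with replacement), favoring middlegame positions.

    `examples` is a list of (position, ply, total_plies) tuples.
    """
    rng = rng or random.Random()
    weights = [stage_weight(ply, total) for _, ply, total in examples]
    return rng.choices(examples, weights=weights, k=batch_size)


if __name__ == "__main__":
    # One hypothetical 40-ply self-play game; positions are just labels here.
    game = [(f"position-{ply}", ply, 40) for ply in range(40)]
    batch = sample_batch(game, batch_size=8, rng=random.Random(0))
    print([ply for _, ply, _ in batch])
```

    With these assumed settings a middlegame position is twice as likely to be drawn as an opening or endgame position; in practice the weights would be tuned experimentally.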
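    The idea of taking an immediate win before consulting the search could be sketched as below. This is only an illustration under stated assumptions: the callables `legal_moves`, `apply_move`, `mover_has_won`, and `search` are hypothetical stand-ins rather than the AlphaZero General API, and a toy stone-taking game replaces MiniShogi so that the example stays self-contained.

```python
from typing import Callable, Iterable, Optional, TypeVar

State = TypeVar("State")
Move = TypeVar("Move")


def find_winning_move(
    state: State,
    legal_moves: Callable[[State], Iterable[Move]],
    apply_move: Callable[[State, Move], State],
    mover_has_won: Callable[[State], bool],
) -> Optional[Move]:
    """Return the first legal move that wins the game at once, else None."""
    for move in legal_moves(state):
        if mover_has_won(apply_move(state, move)):
            return move
    return None


def choose_move(state, legal_moves, apply_move, mover_has_won, search):
    """Play an immediate win if one exists; otherwise defer to `search`."""
    winning = find_winning_move(state, legal_moves, apply_move, mover_has_won)
    return winning if winning is not None else search(state)


# Toy stand-in game (an assumption, not MiniShogi): players alternately take
# 1-3 stones, and whoever takes the last stone wins.
def nim_legal_moves(stones: int) -> list:
    return [take for take in (1, 2, 3) if take <= stones]


def nim_apply(stones: int, take: int) -> int:
    return stones - take


def nim_mover_has_won(stones_after_move: int) -> bool:
    return stones_after_move == 0


if __name__ == "__main__":
    fallback = lambda stones: 1  # placeholder for a real search such as MCTS
    # Three stones left: taking all three wins, so the search is skipped.
    print(choose_move(3, nim_legal_moves, nim_apply, nim_mover_has_won, fallback))
    # Five stones left: no immediate win exists, so the fallback decides.
    print(choose_move(5, nim_legal_moves, nim_apply, nim_mover_has_won, fallback))
```

    The one-ply check costs one pass over the legal moves, so it adds little overhead next to a full tree search.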
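    Of the two methods used in this research, one succeeded and one failed. The experimental data suggest that well-chosen sampling has a chance of improving playing strength, but all but one set of data were unsatisfactory. Winning Attack, in contrast, improved playing strength significantly. Combining the two methods in training also improved playing strength, but their effects did not reinforce each other.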

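    1. Introduction
        1.1 Research Background
        1.2 Research Motivation
        1.3 Research Objectives
    2. Literature Review
        2.1 Introduction to MiniShogi
        2.2 Piece Moves in MiniShogi
        2.3 Rules of MiniShogi
        2.4 Introduction to AlphaGo
        2.5 Introduction to AlphaGo Zero
        2.6 The AlphaGo Zero Training Process
        2.7 Monte Carlo Tree Search (MCTS) in AlphaGo Zero
        2.8 Introduction to AlphaZero
    3. Research Methods
        3.1 AlphaZero General
        3.2 Implementing MiniShogi in the AlphaZero General Framework
        3.3 Training Strategy: Sampling by Position Importance
        3.4 Training Strategy: Winning Attack
    4. Experiments and Results
        4.1 Environment and Parameter Settings
        4.2 Compression of Training Samples
        4.3 Training Results of the Original Version
        4.4 Effect of the Sampling-by-Position-Importance Training Strategy
        4.5 Effect of the Winning Attack Training Strategy
        4.6 Effect of Sampling by Position Importance Combined with Winning Attack
    5. Conclusions and Future Directions
    References
    Appendix A: TCGA 2019 Medal Photo
    Appendix B: Experience of Participating in Competitions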

    Electronic full text: delayed public release, 2025/03/02