Graduate student: | 蘇柏豪 Sue, Bo-Hao |
---|---|
Thesis title: | 基於機器學習預測有機分子之最高佔據分子軌域與最低未佔據分子軌域及其能隙 Predictions of HOMO, LUMO, and Energy Gap of Organic Molecules Based on Machine Learning Methods |
Advisor: | 蔡明剛 Tsai, Ming-Kang |
Oral defense committee: | 蔡明剛 Tsai, Ming-Kang; 葉丞豪 Yeh, Chen-Hao; 張鈞智 Chang, Chun-Chih |
Oral defense date: | 2023/07/14 |
Degree: | Master |
Department: | Department of Chemistry |
Publication year: | 2023 |
Academic year of graduation: | 111 |
Language: | Chinese |
Number of pages: | 103 |
Keywords (Chinese): | 機器學習 (machine learning), QM9資料集 (QM9 dataset), 聚類分群法 (clustering), 隨機森林 (random forest) |
Keywords (English): | machine learning, Quantum-Machine 9, K-means, random forest |
Research method: | Thematic analysis |
DOI URL: | http://doi.org/10.6345/NTNU202301435 |
Thesis type: | Academic thesis |
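In recent years technology has developed rapidly, and computer simulation research built on big data has risen with it. Using machine learning algorithms to predict results precisely, to support experimental progress, and to search for new possibilities has become a trend, whereas traditional quantum chemical calculations take a long time, cost relatively more, and can treat only a small number of molecules.
The HOMO, LUMO, and energy gap are used throughout chemistry and are widely applied in organic chemistry because of their connection to emission wavelength, electron transfer, chemical reactivity, and related characteristics. To address the problems above, this study builds models with clustering and with linear and nonlinear regression methods from machine learning, and analyzes and discusses a large variety of organic compounds step by step.
This study uses Lasso regression, K-means clustering, and the random forest algorithm to predict the HOMO, LUMO, and energy gap of 114,896 organic molecules. With these models, the MAE between the theoretical and predicted values of the HOMO, LUMO, and energy gap is less than 0.3 eV, and the adjusted R² of the nonlinear regression model is greater than 0.93, showing that the predictions agree closely with the chemical properties we expected.
The analysis shows that the models built in this study not only predict well, but also select descriptive features that match the general understanding of the chemistry community. The concepts and analysis methods of this study can be expected to contribute to numerical analysis in related fields in the future.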
Technology has advanced rapidly in recent years, and computational simulation research driven by big data has grown with it. Using machine learning algorithms to predict results accurately, support experimental work, and uncover new possibilities has become a clear trend. By comparison, traditional quantum chemical calculations are time-consuming and relatively costly, and they can handle only a small number of molecules.
The HOMO, LUMO, and energy gap are widely used in organic chemistry because they relate to emission wavelength, electron transfer, and chemical reactivity. Motivated by the limitations described above, this study builds models using clustering together with linear and nonlinear regression methods from machine learning, and uses them to analyze a large variety of organic molecules step by step.
This study uses Lasso regression, K-means clustering, and the random forest algorithm to predict the HOMO, LUMO, and energy gap of 114,896 organic molecules. For all three properties, the MAE between the theoretical and predicted values is below 0.3 eV, and the corrected (adjusted) R² of the nonlinear regression model exceeds 0.93, indicating that the predictions agree closely with the expected chemical properties.
The analysis shows that the models established in this study not only predict well, but also select descriptors consistent with established chemical understanding. The concepts and analysis methods of this study may contribute to numerical analysis in related fields in the future.
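The two figures of merit quoted in the abstract, MAE and the corrected R², can be illustrated with a short sketch. It assumes that "corrected R²" means the standard adjusted R², which penalizes R² for the number of descriptors used. The function names and the example energies are hypothetical and are not taken from the thesis.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference between reference and predicted values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def adjusted_r2(y_true: Sequence[float], y_pred: Sequence[float], n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    n is the number of samples and p (n_features) the number of descriptors.
    Assumption: this is the quantity the abstract calls "corrected R2".
    """
    n = len(y_true)
    if n != len(y_pred):
        raise ValueError("inputs must be of equal length")
    if n - n_features - 1 <= 0:
        raise ValueError("need more samples than descriptors plus one")
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    if ss_tot == 0:
        raise ValueError("reference values must not all be identical")
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)


if __name__ == "__main__":
    # Hypothetical HOMO energies in eV: reference (computed) vs. model prediction.
    homo_ref = [-6.0, -5.5, -7.0, -6.5, -5.0]
    homo_pred = [-6.1, -5.3, -7.2, -6.4, -5.1]
    print("MAE (eV):", mean_absolute_error(homo_ref, homo_pred))
    print("adjusted R^2:", adjusted_r2(homo_ref, homo_pred, n_features=2))
```

Because the adjustment grows with the number of descriptors, adjusted R² gives a fairer comparison than plain R² between models that use different descriptor counts.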