Author: | 董景華 Tung, Ching-Hua |
---|---|
Thesis Title: | 以機器學習方法預測溶劑對其有機螢光分子之放光波長 Machine learning description on the solvent effect to the emission wavelengths of organic fluorescent molecules |
Advisor: | 蔡明剛 Tsai, Ming-Kang |
Committee: | 張鈞智 Chang, Chun-Chih; 葉丞豪 Yeh, Chen-Hao; 蔡明剛 Tsai, Ming-Kang |
Approval Date: | 2023/07/14 |
Degree: | 碩士 Master |
Department: | 化學系 Department of Chemistry |
Thesis Publication Year: | 2023 |
Academic Year: | 111 |
Language: | Chinese |
Number of pages: | 93 |
Keywords (in Chinese): | 定量構效關係 、機器學習 、螢光分子 、溶劑效應 |
Keywords (in English): | QSAR, Machine learning, Fluorescent molecules, Solvent effect |
Research Methods: | Big data analysis |
DOI URL: | http://doi.org/10.6345/NTNU202301332 |
Thesis Type: | Academic thesis/ dissertation |
在19、20世紀,定量構效關係之方法逐漸發展,以機器學習方法對於預測化學分子的生物活性、藥物性質等的研究也與日俱增。許多軟體可以用於計算分子描述符,描述符為用於表示分子的物理化學性質。透過機器學習方法,我們可以預測有機螢光分子之放光波長,對於不同分子描述符以及溶劑效應之影響。
本研究中,為使用SKlearn作為機器學習的方法。並使用線性迴歸、LASSO、隨機森林三種不同的迴歸方法訓練模型,且搭配K-means分群法及聚合階層式分群法來探討其模型訓練之表現。
對於11146種SMILES分子,加入8種溶劑描述符後,以隨機森林迴歸方法進行模型訓練,或基於K-means分群及LASSO迴歸方法進行隨機森林迴歸方法之模型訓練,亦或是基於沃德法及LASSO迴歸方法進行隨機森林迴歸方法之模型訓練。其R^2分別有0.01至0.02的不等提升,且分別在各模型之重要性特徵,8種溶劑描述符有包含在其中,且有與共軛π鍵相關的描述符,對於預測放光波長有顯著的貢獻,與參考文獻結果具一致的解釋性。
During the 19th and 20th centuries, quantitative structure-activity relationship (QSAR) methods gradually developed, and research that uses machine learning to predict the biological activity and drug-related properties of molecules has grown steadily. Many software packages can compute molecular descriptors, which represent the physicochemical properties of molecules. With machine learning methods, we can predict the emission wavelengths of organic fluorescent molecules and examine how different molecular descriptors and solvent effects influence them.
In this study, scikit-learn (sklearn) is used as the machine learning toolkit. Three regression methods (linear regression, LASSO, and random forest) are used to train the models, together with K-means clustering and agglomerative hierarchical clustering, to examine model training performance.
For 11146 molecules represented as SMILES strings, 8 solvent descriptors were added and models were then trained in three ways: by random forest regression alone; by random forest regression based on K-means clustering and LASSO regression; or by random forest regression based on Ward's method and LASSO regression. The coefficient of determination (R^2) improved by between 0.01 and 0.02 in the respective cases. The important features of each model include the 8 solvent descriptors as well as descriptors related to conjugated π bonds; these contribute significantly to the prediction of emission wavelength, an interpretation consistent with the reference literature.
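As a rough illustration of the three regression methods named above, the following sketch fits each of them with scikit-learn and compares their test-set R^2. It is not the thesis code: the data are synthetic stand-ins for a descriptor table, and the sample size, coefficients, `alpha`, and `n_estimators` values are assumptions chosen only so that the example runs.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a descriptor table: 300 "molecules" x 10 "descriptors".
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
# Hypothetical target (a wavelength-like number) driven by three descriptors plus noise.
y = (500.0 + 40.0 * X[:, 0] - 25.0 * X[:, 1] + 10.0 * X[:, 2]
     + rng.normal(scale=5.0, size=300))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The three regression methods mentioned in the abstract; hyperparameters are assumed.
models = {
    "linear": LinearRegression(),
    "lasso": Lasso(alpha=1.0),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Fit each model and record the coefficient of determination on held-out data.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = r2_score(y_test, model.predict(X_test))

for name, r2 in scores.items():
    print(f"{name}: R^2 = {r2:.3f}")
```

On real descriptor data the ranking of the three methods can differ from what this linear toy target produces, so the printed values say nothing about the thesis results.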
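The abstract does not spell out how clustering, LASSO, and random forest are combined, so the sketch below shows only one plausible reading, stated as an assumption: molecules are grouped by K-means or by Ward-linkage agglomerative clustering, LASSO selects the descriptors with nonzero coefficients within each cluster, and a random forest is then trained on the selected descriptors. The data are synthetic, and the helper name `fit_per_cluster`, the cluster count, and all hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso

# Synthetic stand-ins: 12 molecular descriptors plus 8 solvent descriptors per sample.
rng = np.random.default_rng(1)
n_samples, n_mol, n_solv = 400, 12, 8
X_mol = rng.normal(size=(n_samples, n_mol))
X_solv = rng.normal(size=(n_samples, n_solv))
X = np.hstack([X_mol, X_solv])  # solvent descriptors occupy columns 12..19
# Hypothetical target depending on two molecular and one solvent descriptor.
y = (500.0 + 30.0 * X_mol[:, 0] - 20.0 * X_mol[:, 1] + 15.0 * X_solv[:, 0]
     + rng.normal(scale=5.0, size=n_samples))


def fit_per_cluster(X, y, labels, alpha=1.0):
    """Hypothetical helper: per cluster, LASSO feature selection then random forest."""
    fitted = {}
    for cluster in np.unique(labels):
        mask = labels == cluster
        lasso = Lasso(alpha=alpha).fit(X[mask], y[mask])
        keep = np.flatnonzero(lasso.coef_)  # indices of descriptors LASSO retained
        if keep.size == 0:                  # fall back to all descriptors
            keep = np.arange(X.shape[1])
        forest = RandomForestRegressor(n_estimators=100, random_state=0)
        forest.fit(X[mask][:, keep], y[mask])
        fitted[int(cluster)] = (keep, forest)
    return fitted


# Two alternative groupings of the molecules (cluster count of 3 is an assumption).
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_mol)
ward_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X_mol)

kmeans_models = fit_per_cluster(X, y, kmeans_labels)
ward_models = fit_per_cluster(X, y, ward_labels)

# Report the three most important retained descriptors in each K-means cluster.
for cluster, (keep, forest) in kmeans_models.items():
    order = np.argsort(forest.feature_importances_)[::-1][:3]
    print(f"cluster {cluster}: top descriptor columns {keep[order].tolist()}")
```

Inspecting `feature_importances_` in this way is how one could check whether solvent descriptor columns appear among the important features, which is the kind of observation the abstract reports.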
(1) Lichtman, J. W.; Conchello, J.-A. Fluorescence microscopy. Nature Methods 2005, 2 (12), 910-919. DOI: 10.1038/nmeth817
(2) Nguyen, V.-N.; Ha, J.; Cho, M.; Li, H.; Swamy, K. M. K.; Yoon, J. Recent developments of BODIPY-based colorimetric and fluorescent probes for the detection of reactive oxygen/nitrogen species and cancer diagnosis. Coord. Chem. Rev. 2021, 439, 213936. DOI: 10.1016/j.ccr.2021.213936
(3) Yang, J.; Chen, S.-W.; Zhang, B.; Tu, Q.; Wang, J.; Yuan, M.-S. Non-biological fluorescent chemosensors for pesticides detection. Talanta 2022, 240, 123200. DOI: 10.1016/j.talanta.2021.123200
(4) Geng, X.; Sun, Y.; Guo, Y.; Zhao, Y.; Zhang, K.; Xiao, L.; Qu, L.; Li, Z. Fluorescent Carbon Dots for in Situ Monitoring of Lysosomal ATP Levels. Anal. Chem. 2020, 92 (11), 7940-7946. DOI: 10.1021/acs.analchem.0c01335
(5) Reaxys. Elsevier. https://www.reaxys.com/ (accessed 2021-04-29).
(6) Todeschini, R.; Consonni, V. Molecular Descriptors for Chemoinformatics. 2009; p 512.
(7) Yap, C. W. PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J. Comput. Chem. 2011, 32 (7), 1466-1474. DOI: 10.1002/jcc.21707
(8) Danishuddin; Khan, A. U. Descriptors and their selection methods in QSAR analysis: paradigm for drug design. Drug Discovery Today 2016, 21 (8), 1291-1302. DOI: 10.1016/j.drudis.2016.06.013
(9) Weininger, D. SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28 (1), 31-36. DOI: 10.1021/ci00057a005
(10) Weininger, D.; Weininger, A.; Weininger, J. L. SMILES. 2. Algorithm for Generation of Unique SMILES Notation. J. Chem. Inf. Comput. Sci. 1989, 29 (2), 97-101. DOI: 10.1021/ci00062a008
(11) Snow, J. On chloroform and other anaesthetics: their action and administration. Wood Library Museum of Anesthesiology, 1989; pp 57-74.
(12) Brown, A. C.; Fraser, T. R. V.—On the Connection between Chemical Constitution and Physiological Action. Part. I.—On the Physiological Action of the Salts of the Ammonium Bases, derived from Strychnia, Brucia, Thebaia, Codeia, Morphia, and Nicotia. Earth and Environmental Science Transactions of The Royal Society of Edinburgh 1868, 25 (1), 151-203. DOI: 10.1017/S0080456800028155
(13) Brown, A. C.; Fraser, T. R. XX.—On the Connection between Chemical Constitution and Physiological Action. Part II.—On the Physiological Action of the Ammonium Bases derived from Atropia and Conia. Earth and Environmental Science Transactions of The Royal Society of Edinburgh 1869, 25 (2), 693-739. DOI: 10.1017/S0080456800035377
(14) Kubinyi, H. QSAR: Hansch Analysis and Related Approaches. Wiley-VCH, 1993; p 4.
(15) Hansen, O. R. Hammett Series with Biological Activity. Acta Chem. Scand. 1962, 16, 1593-1600.
(16) Hansch, C.; Maloney, P. P.; Fujita, T.; Muir, R. M. Correlation of Biological Activity of Phenoxyacetic Acids with Hammett Substituent Constants and Partition Coefficients. Nature 1962, 194 (4824), 178-180. DOI: 10.1038/194178b0
(17) Hansch, C. A Quantitative Approach to Biochemical Structure-Activity Relationships. Acc. Chem. Res. 1969, 2 (8), 232-239. DOI: 10.1021/ar50020a002
(18) Hansch, C.; Fujita, T. p-σ-π Analysis. A Method for the Correlation of Biological Activity and Chemical Structure. J. Am. Chem. Soc. 1964, 86 (8), 1616-1626. DOI: 10.1021/ja01062a035
(19) Free, S. M.; Wilson, J. W. A Mathematical Contribution to Structure-Activity Studies. J. Med. Chem. 1964, 7 (4), 395-399. DOI: 10.1021/jm00334a001
(20) Hansch, C.; Yoshimoto, M. Structure-activity relations in immunochemistry. 2. Inhibition of complement by benzamidines. J. Med. Chem. 1974, 17 (11), 1160-1167. DOI: 10.1021/jm00257a007
(21) Kubinyi, H. Quantitative structure-activity relationships. 2. A mixed approach, based on Hansch and Free-Wilson analysis. J. Med. Chem. 1976, 19 (5), 587-600. DOI: 10.1021/jm00227a004
(22) Kubinyi, H. Quantitative structure-activity relations. 7. The bilinear model, a new model for nonlinear dependence of biological activity on hydrophobic character. J. Med. Chem. 1977, 20 (5), 625-629. DOI: 10.1021/jm00215a002
(23) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825-2830.
(24) Hao, J.; Ho, T. K. Machine Learning Made Easy: A Review of Scikit-learn Package in Python Programming Language. Journal of Educational and Behavioral Statistics 2019, 44 (3), 348-361. DOI: 10.3102/1076998619832248
(25) Scikit-Learn. https://scikit-learn.org/stable/ (accessed 2021-04-29).
(26) Tibshirani, R. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological) 1996, 58 (1), 267-288.
(27) Pearson, K. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 1901, 2 (11), 559-572. DOI: 10.1080/14786440109462720
(28) Hotelling, H. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 1933, 24, 417-441. DOI: 10.1037/h0071325
(29) Steinhaus, H. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci 1956, 1 (804), 801.
(30) MacQueen, J. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symposium on Math., Stat., and Prob, 1965; p 281.
(31) Cauchy, A. Méthode générale pour la résolution des systemes d’équations simultanées. Comp. Rend. Sci. Paris 1847, 25 (1847), 536-538.
(32) Ho, T. K. Random decision forests. In Proceedings of 3rd International Conference on Document Analysis and Recognition, 14-16 Aug. 1995; Vol. 1, pp 278-282. DOI: 10.1109/ICDAR.1995.598994
(33) Zhang, H.; Zhang, L.; Jiang, Y. Overfitting and Underfitting Analysis for Deep Learning Based End-to-end Communication Systems. In 2019 11th International Conference on Wireless Communications and Signal Processing (WCSP), Xi'an, China, October 23-25, 2019; pp 1-6. DOI: 10.1109/WCSP.2019.8927876
(34) Minnesota Solvent Descriptor Database. https://comp.chem.umn.edu/solvation/ (accessed 2021-05-22).
(35) Ye, Z.-R.; Huang, I. S.; Chan, Y.-T.; Li, Z.-J.; Liao, C.-C.; Tsai, H.-R.; Hsieh, M.-C.; Chang, C.-C.; Tsai, M.-K. Predicting the emission wavelength of organic molecules using a combinatorial QSAR and machine learning approach. RSC Advances 2020, 10 (40), 23834-23841. DOI: 10.1039/D0RA05014H