Graduate student: | 周弈銘 Chou, Yi-Ming |
---|---|
Thesis title: | 以機器學習方法分析結構與螢光波長之關係 Analyzing the relationship between the structure and fluorescence: Machine learning method |
Advisor: | 蔡明剛 Tsai, Ming-Kang |
Degree: | Master |
Department: | Department of Chemistry |
Year of publication: | 2018 |
Academic year of graduation: | 106 |
Language: | Chinese |
Number of pages: | 70 |
Chinese keywords: | QSAR, 機器學習, 螢光 |
English keywords: | QSAR, Machine learning, fluorescence |
DOI URL: | http://doi.org/10.6345/THE.NTNU.DC.036.2018.B05 |
Thesis type: | Academic thesis |
In quantitative structure-activity relationship (QSAR) research, machine learning is used for data mining in a growing share of studies, and modeling a chemical property with a small number of descriptors has long been an important part of cheminformatics. Given a small set of samples and a large set of molecular structure and property descriptors obtained from the E-Dragon database, the aim of this work was to use machine learning to identify the descriptors and the algorithm that can fit the fluorescence wavelengths of naphthalene and coumarin compounds bearing different substituents. Through voting and comparison among four machine learning algorithms (decision tree regression, random forest regression, GBDT regression, and extreme tree regression), three descriptors, R3m, Ss, and R7u+, were selected from 1664 descriptors to fit the fluorescence wavelength. The test-set accuracy of the algorithms was then compared, and random forest regression, which handles nonlinear problems well, was chosen as the final modeling tool (19 layers and 65 weak learners). These three descriptors serve in this work as the descriptors for predicting fluorescence wavelength.
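The selection-by-voting step described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's actual procedure: the data are synthetic stand-ins for the E-Dragon descriptor table, the descriptor names `d0`, `d1`, ... are hypothetical, the voting rule (each model votes for its `TOP_K` most important descriptors) is assumed because the abstract does not specify one, and mapping "19 layers" and "65 weak learners" onto scikit-learn's `max_depth` and `n_estimators` is likewise an assumption.

```python
from collections import Counter

import numpy as np
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the descriptor table (NOT the thesis data):
# 80 "molecules" x 40 "descriptors"; only columns 0, 1, 2 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 40))
y = 450 + 40 * X[:, 0] - 30 * X[:, 1] + 20 * X[:, 2] + rng.normal(scale=2.0, size=80)
names = [f"d{i}" for i in range(X.shape[1])]  # hypothetical descriptor names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The four tree-based regressors named in the abstract.
voters = [
    DecisionTreeRegressor(random_state=0),
    RandomForestRegressor(n_estimators=65, max_depth=19, random_state=0),
    GradientBoostingRegressor(random_state=0),
    ExtraTreesRegressor(n_estimators=65, random_state=0),
]

TOP_K = 5  # assumed: each model votes for its 5 most important descriptors
votes = Counter()
for model in voters:
    model.fit(X_train, y_train)
    top = np.argsort(model.feature_importances_)[::-1][:TOP_K]
    votes.update(int(i) for i in top)

# Keep the three descriptors that collected the most votes.
selected = sorted(i for i, _ in votes.most_common(3))
print("selected descriptors:", [names[i] for i in selected])

# Refit a random forest on the selected descriptors only
# (assumed mapping: 19 layers -> max_depth=19, 65 weak learners -> n_estimators=65).
final_model = RandomForestRegressor(n_estimators=65, max_depth=19, random_state=0)
final_model.fit(X_train[:, selected], y_train)
print("test R^2:", final_model.score(X_test[:, selected], y_test))
```

Counting votes across several models, rather than trusting one model's importance ranking, reduces the chance that a descriptor is kept only because of a single algorithm's bias.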
After modeling, the mean absolute error and the percentage error of the training set and the test set were analyzed. The training set has a mean absolute error of 16 nm and a percentage error of 4%; the test set has a mean absolute error of 26 nm and a percentage error of 6%. The error analysis also showed that the degree of correlation between R3m and Ss depends on the complexity of the substituents, and that this complexity affects molecules in different spectral regions differently. When the correlation is high, that is, when the substituents contain multiple bonds and are structurally complex, the model predicts better in the short-wavelength range (especially violet light); when high correlation occurs in long-wavelength molecules, the model's predictive power is weaker.
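The two error measures reported above can be computed as in the short sketch below. It assumes that "percentage error" means the mean of |observed − predicted| / observed expressed in percent, which the abstract does not state explicitly; the wavelengths are made-up values for illustration and are not thesis data.

```python
from statistics import mean


def mean_absolute_error(observed, predicted):
    """Average of |observed - predicted|, in the same unit as the inputs (nm here)."""
    return mean(abs(o - p) for o, p in zip(observed, predicted))


def mean_percentage_error(observed, predicted):
    """Average of |observed - predicted| / observed, in percent (assumed definition)."""
    return 100 * mean(abs(o - p) / o for o, p in zip(observed, predicted))


# Made-up fluorescence wavelengths in nm, for illustration only.
observed = [400.0, 450.0, 500.0]
predicted = [410.0, 441.0, 520.0]

mae = mean_absolute_error(observed, predicted)
mpe = mean_percentage_error(observed, predicted)
print(f"MAE = {mae:.1f} nm, percentage error = {mpe:.2f}%")
```

Reporting both measures is useful because the same absolute error in nanometers corresponds to a larger relative error at short wavelengths than at long ones.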