
Graduate student: 鄭吉峰 (Cheng, Chi-Feng)
Thesis title: Combining Molecular Fingerprints and Atomic Embedding with a Self-Attention Model for Protein-Ligand Interaction Prediction (original title: 結合化學指紋輔助原子嵌入和自注意力模型進行蛋白質-配體交互作用預測)
Advisor: 蔡明剛 (Tsai, Ming-Kang)
Oral defense committee: 蔡明剛 (Tsai, Ming-Kang); 張鈞智 (Chang, Chun-Chih); 葉丞豪 (Yeh, Chen-Hao)
Oral defense date: 2023/07/14
Degree: Master
Department: Department of Chemistry
Year of publication: 2023
Academic year of graduation: 111
Language: Chinese
Number of pages: 72
Keywords: Deep learning, CPI, Chemical fingerprint, Transformer
Research methods: Experimental design, Data analysis
DOI URL: http://doi.org/10.6345/NTNU202301683
Thesis type: Academic thesis

    Compound-protein interaction (CPI) is a key area of drug development. It concerns how drug molecules interact with proteins, and these interactions largely determine a drug's activity and efficacy. Traditionally, CPI research has relied on laborious, time-consuming laboratory assays. Machine learning offers an alternative: it can process large, complex biological datasets efficiently and learn features and patterns automatically, which can accelerate drug development and reduce its cost.
    This study aims to improve the predictive capability of an existing CPI machine learning model. The original model (TransformerCPI) relies mainly on the self-attention mechanism of the Transformer, which can capture both local and global relationships between molecules and proteins. We hypothesize that additionally supplying molecular chemical fingerprints gives the model a richer description of molecular features and thereby improves its performance. To test this, we used the PaDEL tool to generate chemical fingerprints for all molecules in the GPCR dataset.
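To make the attention step concrete, here is a minimal sketch of scaled dot-product attention in plain Python, assuming atom vectors act as queries and protein residue vectors act as keys and values. The function names, the toy vectors, and the query/key roles are illustrative assumptions; this record does not describe the thesis's actual implementation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    weights = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    # Each output row is a weighted average of the value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights


# Toy inputs (hypothetical numbers): 2 atom vectors attend over 3 residue vectors.
atoms = [[1.0, 0.0], [0.0, 1.0]]
residues = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context, attn = attention(atoms, residues, residues)
```

Each row of `attn` sums to 1, so every atom receives a convex combination of residue vectors. Because every atom is scored against every residue, one attention layer can relate distant positions as easily as neighboring ones.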
    Using cluster analysis, we first examined how the different chemical fingerprints are distributed across the dataset, which helped us understand the similarities and differences in molecular structure and properties. We then introduced the fingerprints into model training in three successive ways, to test their effectiveness and to identify the most suitable integration method. First, we converted the chemical fingerprints into embedding vectors to provide more comprehensive information. Second, we supplied the chemical fingerprints as additional features, so that the model could make fuller use of the information they contain. Third, we applied TF-IDF weighting to the fingerprint values to increase their variability, so that the model could better distinguish between molecules.
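One common way to apply TF-IDF to count fingerprints is to treat each molecule as a document and each fingerprint position as a term. The sketch below assumes that reading and a smoothed IDF, log((1 + N) / (1 + df)) + 1. The exact formula used in the thesis is not stated in this record, and the function name `tfidf` and the toy counts are hypothetical.

```python
import math


def tfidf(fingerprints):
    """Weight each fingerprint entry by term frequency times inverse document frequency."""
    n_mols = len(fingerprints)
    n_bits = len(fingerprints[0])
    # Document frequency: number of molecules in which each position is non-zero.
    df = [sum(1 for fp in fingerprints if fp[j] > 0) for j in range(n_bits)]
    # Smoothed IDF (an assumed variant) avoids division by zero for unused positions.
    idf = [math.log((1 + n_mols) / (1 + d)) + 1.0 for d in df]
    weighted = []
    for fp in fingerprints:
        total = sum(fp)
        if total == 0:
            # A molecule with no set positions keeps an all-zero vector.
            weighted.append([0.0] * n_bits)
            continue
        # Term frequency: each count normalized by the molecule's total count.
        weighted.append([(c / total) * w for c, w in zip(fp, idf)])
    return weighted


# Toy count fingerprints (hypothetical): rows are molecules, columns are substructure counts.
counts = [
    [2, 1, 0],
    [1, 0, 0],
    [1, 0, 3],
]
weighted_fps = tfidf(counts)
```

Position 0 occurs in every molecule, so its IDF is the minimum value of 1, while positions 1 and 2 occur in only one molecule each and receive a larger weight per count. This is the sense in which the weighting spreads out values that a plain binary fingerprint would treat identically.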
    We compared the CPI prediction performance of these three models and analyzed how the results relate to the earlier cluster analysis. Introducing chemical fingerprints improved the model's predictive accuracy and stability for specific fingerprint types, and the improvement showed some correlation with the cluster analysis results.
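The thesis outline lists an inertia score and a silhouette score under its fingerprint analysis. As a rough illustration of what those two numbers measure, the sketch below runs a small k-means on toy binary fingerprints and computes both. It assumes Euclidean distance and k-means clustering; the helper names, the seeding rule, and the data are hypothetical, and the thesis may use a different distance or library.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, n_iter=20):
    """Plain Lloyd iterations, seeded with the first k points for determinism."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def inertia(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))


def silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all points; needs at least two clusters."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within the point's own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / sum(1 for l in labels if l == c)
            for c in clusters
            if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Toy binary fingerprints (hypothetical): two loose groups of two molecules each.
fingerprints = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 0], [0, 0, 0, 1]]
labels, centroids = kmeans(fingerprints, k=2)
wcss = inertia(fingerprints, labels, centroids)
sil = silhouette(fingerprints, labels)
```

Lower inertia means tighter clusters, but inertia always falls as k grows, so it is usually read alongside the silhouette score, which also penalizes clusters that sit close to each other.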

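    List of Tables
    List of Figures
    Abstract (Chinese)
    Abstract
    Chapter 1. Introduction
        1. Research motivation and objectives
        2. Introduction to the SMILES molecular notation
            (1) Elements and their charges
            (2) Bonds
            (3) Ring structures
            (4) Stereochemistry
        3. Introduction to chemical fingerprints
            (1) Molecular representation
            (2) Feature extraction
            (3) Feature encoding
            (4) Fingerprint storage and comparison
        4. Introduction to PaDEL
        5. Introduction to deep learning
            (1) Overview of machine learning
            (2) Overview of deep learning
            (3) Introduction to deep learning methods
        6. Overview of the Transformer and self-attention
    Chapter 2. Research methods
        1. GPCR dataset
            (1) What is a GPCR
            (2) Introduction to the GLASS database
            (3) Construction of the GPCR training dataset
        2. TransformerCPI model architecture
            (1) Protein sequence conversion block
            (2) Ligand molecule information processing block
            (3) Interaction block
            (4) Activity prediction block
        3. Optimization and testing of the original model
        4. Generation of chemical fingerprints
            (1) CDK fingerprint
            (2) CDK extended fingerprint
            (3) Estate fingerprint
            (4) CDK graph only fingerprint
            (5) MACCS fingerprint
            (6) Pubchem fingerprint
            (7) Substructure fingerprint
            (8) Substructure fingerprint count
            (9) Klekota-Roth fingerprint
            (10) Klekota-Roth fingerprint count
            (11) 2D atom pairs
            (12) 2D atom pairs count
        5. Methods for incorporating the chemical fingerprint embedding layer
            (1) Chemical fingerprint embedding cross-attention model
            (2) Chemical fingerprint linear-layer cross-attention model
            (3) Chemical fingerprint TF-IDF-weighted cross-attention model
    Chapter 3. Results
        1. Training and test results of the original model
        2. Numerical analysis of chemical fingerprints
            (1) inertia score
            (2) silhouette score
        3. Results of training the model with chemical fingerprints
            (1) Chemical fingerprint embedding cross-attention model
            (2) Chemical fingerprint linear-layer cross-attention model
            (3) Chemical fingerprint TF-IDF-weighted cross-attention model
    Chapter 4. Conclusion
    Chapter 5. References
    Chapter 6. Appendix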

