Graduate student: | 劉宇錚 Yu-Jeng Liu |
---|---|
Thesis title: | 利用相鄰句子資訊探討人類疾病與基因之關係 Using Adjacent Sentences Information for Finding Relationship between Diseases and Genes |
Advisor: | 侯文娟 Hou, Wen-Juan |
Degree: | 碩士 Master |
Department: | 資訊工程學系 Department of Computer Science and Information Engineering |
Year of publication: | 2013 |
Academic year of graduation: | 101 |
Language: | Chinese |
Number of pages: | 56 |
Chinese keywords: | 規則學習、疾病與基因關係、生物醫學文獻探勘 |
English keywords: | rule learning, gene-disease relationship, biomedical text mining |
Thesis type: | Academic thesis |
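This study attempts to determine the degree of association between human genetic diseases and genes in the biomedical literature, and to derive rules or associations between them. If the degree to which a disease and a gene are related can be predicted automatically from the literature, biomedical researchers consulting the literature on human genetic diseases and genes can use these associations or rules to understand the relationship between the two quickly. In addition to saving labor and time, we hope this work will speed up the development of biomedicine.
The data used in this study are the Mendelian Inheritance in Man (MIM) documents covered by the morbid file provided on the Online Mendelian Inheritance in Man (OMIM) website. We first locate sentences in which a human genetic disease and a gene listed in morbid occur together and treat them as correct sentences; sentences that do not contain a disease and gene listed in morbid are treated as incorrect sentences. The Memory-Based Shallow Parser (MBSP) is used to analyze the sentences in these paragraphs, yielding grammatical information such as parts of speech. The MBSP-tagged sentences are then passed to a learning system built for this study, which learns rules. Three files must be prepared before learning. The first holds the rule patterns, the detailed sentence information, and the elements the rules require; in this experiment the required element is the SVO-relation, the relation among subject, verb, and object. The second holds the identifying numbers of the correct sentences used in rule learning, and the third holds the incorrect sentences used in rule learning. The rules learned from these data are then combined with the multi-sentence mining algorithm proposed in this thesis, which extends the results of the original rules to obtain new relations. Finally, the disease-gene pairs produced by the experiments are evaluated by precision and recall, and the results at each threshold are recorded. With multi-sentence mining, the best F-score is 72.18%, with a precision of 72.66% and a recall of 71.71%; without multi-sentence mining, the best F-score is 67.32%, with a precision of 76.29% and a recall of 60.24%.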
In this study, we automatically identify relations between human genetic diseases and genes in the biomedical literature, and derive rules or associations between them by mining that literature. Biomedical researchers who study the literature on human genetic diseases and genes can then use these rules and associations to understand the relations between diseases and genes quickly. Beyond saving labor and time, we hope this work can help accelerate progress in the biomedical domain.
Our data are the Mendelian Inheritance in Man (MIM) documents covered by the morbid file provided by the Online Mendelian Inheritance in Man (OMIM) database. We first find the paragraphs that mention both a human genetic disease and its related gene as listed in the morbid file and treat them as correct paragraphs; the remaining paragraphs are treated as incorrect paragraphs. We then use the Memory-Based Shallow Parser (MBSP) to analyze the sentences and obtain syntactic information such as parts of speech. Rule learning requires three input files. The first contains the rule patterns, the sentence information, and the elements of the SVO-relation, that is, the relation among subject, verb, and object. The second lists the identifying numbers of the correct sentences used in rule learning, and the third lists the identifying numbers of the incorrect sentences. We then apply the multi-sentence mining algorithm to the learned rules to extend the results. Finally, we use precision and recall as the evaluation metrics and record the results at every threshold. With the multi-sentence mining algorithm, the best F-score is 72.18%, with a precision of 72.66% and a recall of 71.71%. Without it, the best F-score is 67.32%, with a precision of 76.29% and a recall of 60.24%.
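The abstract reports F-scores without stating the formula. As a minimal sketch, assuming the F-score is the balanced harmonic mean of precision and recall (an assumption, though the reported numbers are consistent with it), the two figures can be checked as follows. The function name `f_score` is illustrative and does not come from the thesis.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall.

    Assumes the standard F1 definition; the abstract does not state
    which F-score variant was used.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as reported in the abstract.
with_mining = f_score(72.66, 71.71)
without_mining = f_score(76.29, 60.24)

print(f"with multi-sentence mining:    {with_mining:.2f}%")
print(f"without multi-sentence mining: {without_mining:.2f}%")
```

Under this assumption the computed values round to 72.18% and 67.32%, matching the abstract. The comparison also makes the trade-off visible: multi-sentence mining lowers precision by 3.63 points but raises recall by 11.47 points, which is why the F-score improves.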