Graduate student: | 郭博元 |
---|---|
Thesis title: | 結合統計與規則探討生醫文件疾病與基因之關係 (A Hybrid Method for Discovering Disease-Gene Associations from Biomedical Texts) |
Advisor: | 侯文娟 |
Degree: | Master |
Department: | Department of Computer Science and Information Engineering |
Year of publication: | 2014 |
Academic year of graduation: | 102 (ROC calendar) |
Language: | Chinese |
Number of pages: | 37 |
Keywords (Chinese): | 規則學習、統計方法、疾病與基因關係、生物醫學文獻探勘 |
Keywords (English): | Rule learning, Statistical method, Gene-disease relationship, Biomedical text mining |
Thesis type: | Academic thesis |
This study automatically extracts relationships between human genetic diseases and genes from the biomedical literature. The experimental data are the Mendelian Inheritance in Man (MIM) entries listed in the morbid file of the Online Mendelian Inheritance in Man (OMIM) database. To build the corpus, sentences that mention both a human genetic disease and a related gene from the morbid file are collected first and treated as correct sentences. Sentences that mention neither the diseases nor the genes from the morbid file are then randomly selected and treated as incorrect sentences. Next, the Memory-Based Shallow Parser (MBSP) tags these sentences to obtain the information needed for rule learning, and rules are learned by simulating the ALEPH system. The learned rules are applied to the literature to capture disease-gene pairs that occur within a single sentence or across adjacent sentences. A statistical approach then decides whether each pair is valid, using a Z-score derived from the difference between the observed value and the expected value. Further experiments combine additional constraints and vary the number of rules. Precision, recall, and F-score serve as the evaluation metrics.
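The abstract does not give the exact Z-score formula, so the following is only a minimal sketch of how such a filter and the evaluation step might look. It assumes a binomial independence model for the expected co-occurrence count and a cutoff of 1.96; the function names, the threshold, and all counts are hypothetical illustrations, not the thesis's actual implementation or data.

```python
import math


def cooccurrence_z_score(pair_count, disease_count, gene_count, total_sentences):
    """Z-score of an observed co-occurrence count (observed minus expected,
    divided by the standard deviation) under an ASSUMED independence model."""
    # Probability that one sentence mentions both, if mentions were independent.
    p = (disease_count / total_sentences) * (gene_count / total_sentences)
    expected = total_sentences * p
    std = math.sqrt(total_sentences * p * (1 - p))
    if std == 0:
        return 0.0
    return (pair_count - expected) / std


def filter_valid_pairs(pair_counts, disease_counts, gene_counts,
                       total_sentences, threshold=1.96):
    """Keep pairs whose Z-score reaches the (assumed) threshold."""
    valid = set()
    for (disease, gene), count in pair_counts.items():
        z = cooccurrence_z_score(count, disease_counts[disease],
                                 gene_counts[gene], total_sentences)
        if z >= threshold:
            valid.add((disease, gene))
    return valid


def precision_recall_f_score(predicted, gold):
    """Precision, recall, and balanced F-score of predicted pairs against gold pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy counts, for illustration only.
total_sentences = 1000
disease_counts = {"cystic fibrosis": 20, "asthma": 100}
gene_counts = {"CFTR": 25, "TP53": 100}
pair_counts = {("cystic fibrosis", "CFTR"): 15, ("asthma", "TP53"): 10}
gold_pairs = {("cystic fibrosis", "CFTR"), ("Huntington disease", "HTT")}

valid_pairs = filter_valid_pairs(pair_counts, disease_counts, gene_counts,
                                 total_sentences)
print(valid_pairs)
print(precision_recall_f_score(valid_pairs, gold_pairs))
```

In this toy setting the second pair co-occurs about as often as chance predicts, so its Z-score is close to zero and it is filtered out, while the first pair occurs far more often than expected and is kept. A different null model or threshold would change which pairs pass.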
Adamic, Lada A., Wilkinson, Dennis, Huberman, Bernardo A. and Adar, Eytan (2002). “A Literature Based Method for Identifying Gene-Disease Connections,” Proceedings of IEEE Computer Society Bioinformatics Conference 2002, 1: 109-117, 2002.
Al-Mubaid, Hisham and Singh, Rajit K. (2005). “A New Text Mining Approach for Finding Protein-to-Disease Associations,” American Journal of Biochemistry and Biotechnology, 1(3): 145-152, 2005.
ALEPH. Available from http://www.cs.ox.ac.uk/activities/machlearn/Aleph/aleph.html
Cheung, Warren A., Ouellette, B.F. Francis and Wasserman, Wyeth W. (2012). “Inferring Novel Gene-Disease Associations Using Medical Subject Heading Over-Representation Profiles,” Genome Medicine, 4: 75, 2012.
GENIA Corpus. Available from http://www.nactem.ac.uk/genia/
Maglott, D., Ostell, J., Pruitt, K.D. and Tatusova, T. (2007). “Entrez Gene: Gene-centered Information at NCBI,” Nucleic Acids Research, 35 (Database issue): D26-31, 2007.
MBT (Memory-Based Tagger-Generator and Tagger). Available from http://ilk.uvt.nl/mbt/
Memory-Based Shallow Parser. Available from www.clips.ua.ac.be/pages/MBSP#server
MeSH (Medical Subject Headings). Available from https://www.nlm.nih.gov/mesh/meshhome.html
Mitchell, J.A., Aronson, A.R., Mork, J.G., Folk, L.C., Humphrey, S.M. and Ward, J.M. (2003). “Gene Indexing: Characterization and Analysis of NLM's GeneRIFs,” Proceedings of AMIA Annual Symposium, 460-464, 2003.
Muggleton, Stephen, and de Raedt, Luc (1994). “Inductive Logic Programming: Theory and methods,” The Journal of Logic Programming, 19-20: 629-679, 1994.
NLM (National Library of Medicine). Available from https://www.nlm.nih.gov/
OMIM (Online Mendelian Inheritance in Man). Available from http://www.ncbi.nlm.nih.gov/omim
P-value. Available from http://en.wikipedia.org/wiki/P-value
Pruitt, K.D. and Maglott, D.R. (2001). “RefSeq and LocusLink: NCBI Gene-centered Resources,” Nucleic Acids Research, 29(1): 137-40, 2001.
Srinivasan, Ashwin (2000). “The Aleph Manual,” Technical Report, Computing Laboratory, Oxford University, 2000. Available from http://www.cs.ox.ac.uk/activities/machlearn/Aleph/aleph.html
TiMBL (Tilburg Memory-Based Learner). Available from http://ilk.uvt.nl/timbl/
Wain, H.M., Lush, M., Ducluzeau, F. and Povey, S. (2002). “Genew: The Human Gene Nomenclature Database,” Nucleic Acids Research, 30(1): 169-71, 2002.
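陳孝源 (2012). “人類基因與疾病關係之規則擷取” [Rule Extraction for Relationships between Human Genes and Diseases], Master's thesis, Graduate Institute of Computer Science and Information Engineering, National Taiwan Normal University, 2012.
劉宇錚 (2013). “利用相鄰句子資訊探討人類疾病與基因之關係” [Exploring the Relationships between Human Diseases and Genes Using Adjacent-Sentence Information], Master's thesis, Graduate Institute of Computer Science and Information Engineering, National Taiwan Normal University, 2013.
蔡育霖 (2013). “以機率模型為基礎之生醫文件指代消解方法” [A Probabilistic-Model-Based Approach to Coreference Resolution in Biomedical Texts], Master's thesis, Graduate Institute of Computer Science and Information Engineering, National Taiwan Normal University, 2013.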