Graduate student: | 許宸瑋 HSU, CHEN-WEI |
---|---|
Thesis title: | 流行疾病中文新聞面向事實自動擷取之研究 Fact Extraction for Epidemic Disease from Chinese News Articles |
Advisor: | 柯佳伶 Koh, Jia-Ling |
Degree: | Master |
Department: | Department of Computer Science and Information Engineering |
Year of publication: | 2017 |
Academic year of graduation: | 105 (ROC calendar) |
Language: | Chinese |
Number of pages: | 59 |
Keywords (Chinese): | 關鍵字選取、面向事實擷取、資訊結構化 |
Keywords (English): | keyword extraction, facet retrieval, structural form |
DOI URL: | https://doi.org/10.6345/NTNU202202349 |
Thesis type: | Academic thesis |
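When an epidemic disease breaks out, users usually want more facet facts about it. This thesis takes Chinese online news about epidemic diseases as its data source and studies how to automatically extract epidemic-situation and symptom facet-fact sentences from such news, and how to extract semantic triples from those sentences for a structured representation. The goal is to support efficient queries on how an outbreak develops and how symptoms evolve, and to provide a basis for building a knowledge base. The proposed method builds one classification model each for epidemic-situation and symptom facet-fact sentences, which predicts and extracts the corresponding facet-fact sentences from the news. For effective classification, statistical analysis over labeled facet-fact and non-facet-fact sentences selects the facet keywords that are most useful for classification, and these keywords form the basis of each sentence's classification feature values. Because training data must be provided for each epidemic, the thesis also proposes a method for automatically labeling facet-fact sentences, which reduces the cost of manual labeling. In addition, based on a syntactic dependency analysis of the words in each sentence, the method extracts the semantic triples of a facet-fact sentence together with attributes such as time and place, yielding a structured representation of the facet facts. Experimental results show that the proposed methods perform well on both selecting facet-fact sentences and extracting semantic triples.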
When an epidemic breaks out, users want more information about its various facets. In this thesis, Chinese news articles about epidemic diseases collected from the Internet serve as the data source. We study how to extract sentences that describe the epidemic-situation or symptom facets of a disease from these articles. Semantic triples are then extracted from those sentences to support efficient queries about how an epidemic develops and how its symptoms evolve, and to provide a basis for constructing a knowledge base. In the proposed method, two classification models are constructed, one for extracting epidemic-situation facet sentences and one for extracting symptom facet sentences. To achieve effective classification, we use a statistical analysis method to select the keywords that best distinguish facet sentences from non-facet sentences. Feature values for classification are then computed for each sentence based on these keywords. In addition, to reduce the cost of manually labeling training data for each epidemic, we propose a method that automatically labels facet and non-facet training sentences. Finally, based on the grammatical analysis of each facet sentence, the semantic triples and the corresponding time and place are extracted to form a structured representation of the facet information. Experimental results show that the proposed methods perform well on both selecting facet sentences and retrieving semantic triples.
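The keyword-selection step can be illustrated with a minimal sketch. The abstract only says that a statistical analysis is used; the chi-square statistic below is an assumed choice, one common way to score how strongly a word separates two classes. The function name `chi_square_scores` and the toy English tokens are hypothetical, and real input would be word-segmented Chinese sentences.

```python
from collections import Counter


def chi_square_scores(facet_sentences, other_sentences):
    """Score each word with the chi-square statistic of a 2x2 table:
    word present/absent versus facet/non-facet sentence.

    Both arguments are lists of tokenized sentences (lists of words).
    This is an illustrative assumption, not the thesis's exact procedure.
    """
    n_facet, n_other = len(facet_sentences), len(other_sentences)
    total = n_facet + n_other
    # Sentence-level frequency: count each word at most once per sentence.
    facet_df = Counter(w for s in facet_sentences for w in set(s))
    other_df = Counter(w for s in other_sentences for w in set(s))

    scores = {}
    for word in set(facet_df) | set(other_df):
        a = facet_df[word]   # facet sentences containing the word
        b = other_df[word]   # non-facet sentences containing the word
        c = n_facet - a      # facet sentences without the word
        d = n_other - b      # non-facet sentences without the word
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[word] = total * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


# Hypothetical toy data standing in for labeled, segmented news sentences.
facet = [["cases", "reported", "today"],
         ["new", "cases", "confirmed"],
         ["cases", "rise"]]
other = [["minister", "visited", "school"],
         ["weather", "today", "sunny"],
         ["school", "reopens"]]

scores = chi_square_scores(facet, other)
top_keywords = sorted(scores, key=scores.get, reverse=True)[:3]
```

The statistic is symmetric: a word concentrated in non-facet sentences also scores high. If only facet-indicating keywords are wanted, one option is to keep words whose share of facet sentences exceeds their share of non-facet sentences.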