
Graduate student: 許宸瑋
HSU, CHEN-WEI
Thesis title: 流行疾病中文新聞面向事實自動擷取之研究
Fact Extraction for Epidemic Disease from Chinese News Articles
Advisor: 柯佳伶
Koh, Jia-Ling
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2017
Academic year of graduation: 105
Language: Chinese
Number of pages: 59
Keywords (Chinese): 關鍵字選取, 面向事實擷取, 資訊結構化
Keywords (English): keyword extraction, facet retrieval, structural form
DOI URL: https://doi.org/10.6345/NTNU202202349
Thesis type: Academic thesis
  • When an epidemic breaks out, people usually want more facts about its various facets. This thesis takes Chinese online news articles about epidemic diseases as its data source and studies how to automatically extract the sentences that state facts about two facets, the epidemic situation and the symptoms. Semantic triples are then extracted from these facet fact sentences to give a structured representation, which supports efficient queries about how an outbreak develops and how symptoms evolve, and which can serve as the basis for building a knowledge base. The proposed method builds one classification model for epidemic-situation facet sentences and one for symptom facet sentences, and uses them to predict which sentences in a news article belong to the corresponding facet. To make classification effective, statistical analysis of labeled facet and non-facet sentences is used to select the facet keywords that best distinguish the two, and the classification feature values of each sentence are built on these keywords. Because training data must be supplied for each disease, the thesis also proposes a method that automatically labels facet and non-facet training sentences, which reduces the cost of manual labeling. Finally, based on a dependency analysis of the words in each facet fact sentence, the method extracts the semantic triples together with attributes such as time and place to form the structured representation. Experimental results show that the proposed methods perform well both in selecting facet fact sentences and in extracting semantic triples.
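    The keyword selection step described in the abstract can be illustrated with a small sketch. The abstract only says "statistical analysis"; the chi-square statistic used here is an assumption, chosen because it is a common way to score terms against a binary label. The function names `facet_keywords` and `keyword_features`, the toy tokenized sentences, and the binary presence features are all hypothetical and are not taken from the thesis.

```python
from collections import Counter


def facet_keywords(facet_sents, other_sents, k):
    """Return the k terms most associated with facet sentences.

    Each sentence is a list of tokens. Terms are scored with the
    chi-square statistic of a 2x2 table (term present/absent versus
    facet/non-facet sentence). This scoring choice is an assumption.
    """
    n_pos, n_neg = len(facet_sents), len(other_sents)
    n = n_pos + n_neg
    # Sentence frequency: count each term at most once per sentence.
    pos_df = Counter(t for s in facet_sents for t in set(s))
    neg_df = Counter(t for s in other_sents for t in set(s))
    scored = []
    for term in set(pos_df) | set(neg_df):
        a, b = pos_df[term], neg_df[term]   # sentences containing the term
        c, d = n_pos - a, n_neg - b         # sentences lacking the term
        # Chi-square is symmetric, so skip terms that are not
        # positively associated with the facet class.
        if a * d <= b * c:
            continue
        # a * d > b * c implies a > 0 and d > 0, so denom is non-zero.
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scored.append((n * (a * d - b * c) ** 2 / denom, term))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]


def keyword_features(sentence, keywords):
    """Binary feature vector: 1 if the keyword occurs in the sentence."""
    tokens = set(sentence)
    return [1 if kw in tokens else 0 for kw in keywords]


# Toy, hand-made data (hypothetical, already tokenized).
facet = [["fever", "cases", "rise"], ["cases", "confirmed"],
         ["new", "cases", "reported"]]
other = [["minister", "visited", "hospital"], ["weather", "rise"],
         ["hospital", "opened"]]
keywords = facet_keywords(facet, other, 1)           # -> ['cases']
features = keyword_features(["new", "cases"], keywords)  # -> [1]
```

    Vectors like these could be passed to any standard classifier, one model per facet. The abstract does not say which features or which classifier the thesis actually uses, so this shows only the general idea.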

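    Table of Contents
    Abstract (Chinese)
    Abstract (English)
    Acknowledgements
    List of Figures
    List of Tables
    Chapter 1 Introduction
      1.1 Research Motivation
      1.2 Research Objectives
      1.3 Method
      1.4 Thesis Organization
    Chapter 2 Literature Review
      2.1 Keyword Feature Selection
      2.2 Fact Information Extraction
    Chapter 3 Facet Fact Sentence Selection Method
      3.1 Data Source Collection
      3.2 Data Preprocessing
      3.3 Feature Keyword Selection
      3.4 Sentence Classification Features
      3.5 Labeling Training Data for News Fact Sentences
      3.6 Building the Classification Models
    Chapter 4 Semantic Triple Extraction Method
      4.1 Semantic Triple Extraction
      4.2 Completing Semantic Triple Information
    Chapter 5 Experimental Evaluation
      5.1 Experimental Data
      5.2 Evaluation of Facet Fact Sentence Selection
      5.3 Evaluation of Semantic Triple Extraction
    Chapter 6 Conclusion and Future Work
      6.1 Conclusion
      6.2 Future Work
    References
    Appendix 1 Academia Sinica Part-of-Speech Tag List
    Appendix 2 Description of Directed Edges in Dependency Analysis
    Appendix 3 Semantic Role Labeling Definitions
    Appendix 4 LTP Part-of-Speech Tagging Description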

