
Graduate student: 陳煒鈞 (Chen, Wei-Jiun)
Thesis title: 基於集成學習方法進行謠言偵測
Using Ensemble Learning Methods on Social Media Rumours Detection
Advisor: 侯文娟 (Hou, Wen-Juan)
Committee members: 方瓊瑤 (Fang, Chiung-Yao)
郭俊桔 (Kuo, June-Jei)
侯文娟 (Hou, Wen-Juan)
Oral defense date: 2022/06/20
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2022
Academic year of graduation: 110
Language: Chinese
Number of pages: 44
Keywords (Chinese): 語言模型, 深度學習, 假新聞, 集成學習
Keywords (English): Language Model, Deep Learning, Fake News, Ensemble Learning
Research method: Experimental design
DOI URL: http://doi.org/10.6345/NTNU202200772
Thesis type: Academic thesis
  • Social media is full of false information; Oxford Dictionaries even named "post-truth" its Word of the Year in 2016. Because misinformation can harm people, building a system that can assess the veracity of the varied claims and news circulating online is an important problem. This thesis combines pre-trained language models with non-textual features to build a rumour detection system that judges the veracity of content posted by users on Twitter and Reddit.

    The dataset comes from Task B of SemEval 2019 RumourEval: Determining rumour veracity and support for rumours (SemEval 2019 Task 7), in which posts from Twitter and Reddit are manually annotated with one of three labels: True, False, or Unverified. We first enlarge the training data through data augmentation. We then train different language models (RoBERTa, ALBERT) and a traditional classifier (SVM) individually. Next, we combine the models through ensemble learning, with the weight assigned to each model learned during training. After post-processing, the system achieves a Macro F1 of 72% and an RMSE of 0.5879.
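
    The weighted combination of classifiers mentioned in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the fixed weights, and the example probabilities below are hypothetical, and the thesis learns its weights through training rather than setting them by hand. The sketch only shows the general idea of weighted soft voting over the three labels, plus a plain macro-averaged F1.

```python
from typing import Dict, List, Sequence

# The three veracity labels used in SemEval 2019 Task 7, Task B.
LABELS = ["true", "false", "unverified"]


def weighted_soft_vote(
    model_probs: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> str:
    """Average per-model class probabilities using per-model weights."""
    total = sum(weights[name] for name in model_probs)
    combined = [0.0] * len(LABELS)
    for name, probs in model_probs.items():
        for i, p in enumerate(probs):
            combined[i] += weights[name] * p / total
    # Return the label with the highest combined probability.
    best = max(range(len(LABELS)), key=lambda i: combined[i])
    return LABELS[best]


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-class F1; a class with no support scores 0."""
    scores = []
    for label in LABELS:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical weights and class probabilities, for illustration only.
WEIGHTS = {"roberta": 0.5, "albert": 0.3, "svm": 0.2}
example = {
    "roberta": [0.6, 0.3, 0.1],
    "albert": [0.2, 0.7, 0.1],
    "svm": [0.5, 0.2, 0.3],
}
prediction = weighted_soft_vote(example, WEIGHTS)
```

    In this toy example the two models that favour "true" outweigh the one that favours "false", so the combined prediction follows them; with learned weights the balance would depend on how well each model performs on held-out data.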

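    Chapter 1 Introduction
      Section 1 Research Motivation
      Section 2 Task Description
      Section 3 Dataset
      Section 4 Thesis Organization
    Chapter 2 Background
      Section 1 Ensemble Learning
        1.1 Bagging
        1.2 Boosting
        1.3 Model Stacking
      Section 2 Transformer
        2.1 Multi-Head Attention
        2.2 Positional Encoding
      Section 3 Contextualized Word Embeddings
    Chapter 3 Literature Review
      Section 1 Previous SemEval Competitions
        1.1 SemEval-2019 Task 7
        1.2 SemEval-2021
      Section 2 Pre-trained Models for Rumour Detection
        2.1 RoBERTa
        2.2 ALBERT
        2.3 Fine-tuning and Concatenating Results
    Chapter 4 Methods and Procedures
      Section 1 Introduction
      Section 2 Preprocessing
        2.1 Data Preprocessing
        2.2 Data Augmentation
      Section 3 Model Architecture
        3.1 Single Pre-trained Language Model
        3.2 GloVe + Support Vector Machine
        3.3 RoBERTa + ALBERT with Model Stacking
        3.4 Bagging-Inspired RoBERTa + ALBERT Model
      Section 4 Post-processing
    Chapter 5 Experimental Results and Discussion
      Section 1 Evaluation Methods
      Section 2 Experimental Results
      Section 3 Discussion and Analysis
    Chapter 6 Conclusions and Future Work
      Section 1 Conclusions
      Section 2 Future Work
    References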

    The full text is not authorized for public release.