Graduate student: | Wen, Hsuan (文宣) |
---|---|
Thesis title: | Research on how artificial intelligence can automatically recognize computer-generated news (人工智慧如何自動辨識電腦生成新聞之研究) |
Advisor: | Tseng, Yuen-Hsien (曾元顯) |
Oral defense committee: | Wu, I-Chin (吳怡瑾); Lee, Lung-Hao (李龍豪); Tseng, Yuen-Hsien (曾元顯) |
Oral defense date: | 2022/06/22 |
Degree: | Master |
Department: | Graduate Institute of Library and Information Studies |
Year of publication: | 2022 |
Academic year of graduation: | 110 |
Language: | Chinese |
Number of pages: | 85 |
Keywords (Chinese): | 人工智慧, 文字生成, 自然語言處理, 語言學 |
Keywords (English): | Artificial Intelligence, Text Generation, Natural Language Processing, Linguistics |
Research method: | Experimental design |
DOI URL: | http://doi.org/10.6345/NTNU202201269 |
Thesis type: | Academic thesis |
Artificial intelligence is developing rapidly, and machines can now generate news articles automatically. Because machine-generated news is not always accurate, examining the source and content of information has become essential. Machines can also help people classify articles, which raises the question of why they perform this task so well.
This study examines, within the scope of Chinese economic news, whether computer-generated articles share the characteristics reported in the related literature for computer-generated English articles. It also examines whether BERT can still accurately determine whether an article is computer-generated or human-written after the Chinese articles have been modified in five experiments designed around three linguistic elements (semantics, pragmatics, and syntax), and it seeks to identify the key factors behind BERT's judgment. The conclusions of the experiments are as follows:
1. Computer-generated articles have basically the same characteristics whether they are written in English or in Chinese.
2. When BERT judges whether a Chinese news article is human-written or computer-generated, its judgment may rest mainly on semantics and syntax.
3. For a Chinese news article of about 300 to 350 characters, changing only the semantic part, for example by shortening the sentences or by randomly swapping the positions of the clauses between commas, slightly reduces BERT's accuracy; further changing the syntactic part, for example by using Google Translate to disrupt the word structure of the article, greatly reduces BERT's accuracy.
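The two clause-level modifications named in conclusion 3 (swapping the text between commas and shortening a sentence) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names `shuffle_clauses` and `keep_first_clauses`, the choice of delimiters, the rule of keeping the first few clauses as a way of shortening, and the sample sentence are all assumptions made for illustration.

```python
import random
import re

# Assumption: both full-width and ASCII commas count as clause boundaries.
# The abstract does not say which punctuation marks were used.
_COMMA = re.compile(r"([,,])")
_SENTENCE_END = "。!?.!?"


def _split(sentence):
    """Return (clauses, commas, end_mark) for one sentence."""
    end = ""
    if sentence and sentence[-1] in _SENTENCE_END:
        sentence, end = sentence[:-1], sentence[-1]
    parts = _COMMA.split(sentence)   # the capturing group keeps the commas
    return parts[0::2], parts[1::2], end


def _join(clauses, commas, end):
    """Interleave clauses with commas and restore the end mark."""
    out = []
    for i, clause in enumerate(clauses):
        out.append(clause)
        if i < len(clauses) - 1:
            out.append(commas[i])
    return "".join(out) + end


def shuffle_clauses(sentence, rng):
    """Randomly reorder the clauses between commas; punctuation stays in place."""
    clauses, commas, end = _split(sentence)
    rng.shuffle(clauses)
    return _join(clauses, commas, end)


def keep_first_clauses(sentence, keep):
    """Hypothetical shortening rule: keep only the first `keep` clauses."""
    clauses, commas, end = _split(sentence)
    return _join(clauses[:keep], commas, end)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run is repeatable
    sample = "央行宣布升息,股市應聲下跌,投資人轉趨保守。"  # made-up sentence
    print(shuffle_clauses(sample, rng))
    print(keep_first_clauses(sample, 2))
```

Passing a `random.Random` instance rather than using the module-level generator keeps each perturbed test set reproducible, which matters when classifier accuracy is compared across modifications.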