
Author: Mei-Yu Chen (陳美瑜)
Thesis title: Chinese Authorship Identification: A case study based on Social Corpus -- Facebook (中文文本作者辨識研究: 以社群網站--臉書為例)
Advisor: Hsieh, Shu-Kai (謝舒凱)
Degree: Master
Department: Department of English
Year of publication: 2013
Academic year of graduation: 101
Language: English
Number of pages: 137
Keywords (Chinese): 作者辨識, 風格學, SVM向量機, 文件探勘, 個人化差異, 情緒, 自然語料, 社群短語
Keywords (English): authorship identification, stylometry, SVM, text mining, individual difference, emotion, naturalistic data, social media short text
Thesis type: Academic thesis
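  • Individual differences in writing style (stylometry) have long been a popular research topic. From a linguistic perspective, researchers have tried various quantitative methods and constructed various indices in the hope of quantifying "individual difference" (Tweedie & Baayen, 1998; Mosteller & Wallace, 1964; Burrows, 2002, 2003, 2007; Hoover, 2004). From the perspective of information science, today's society has a growing demand for "linguistic forensics" and "document author classification", because in the digital age people need this technology to help detect the increasing amount of anonymous online crime, or to help classify the authors of digital documents.
    This thesis first introduces the methods that the two disciplines use to study individual differences in writing style, and then carries out two experiments. The experiments use personal data from the popular social networking site Facebook to explore how much explanatory power Chinese characters and words have for individual writing differences, and to mine other stylistic attributes of text, such as structural, subjectivity, and emotion features, to see how much they help in determining the author of short social texts. The study also investigates which of the common feature weighting schemes (tf-idf, term frequency, ratio distribution) yields better accuracy. The experiments adopt a recent SVM package, LibLinear, as the author classifier; its special design makes it better suited to high-dimensional feature training, as in document classification tasks that need a very large number of words as features. Unlike typical classifiers, LibLinear can also provide the contribution score of each feature for each class, which helps researchers examine which features best represent a given author class.
    The results of Experiment 1 show that tf-idf weighting performs slightly better than the ratio distribution, but no better than term frequency. This indicates that in this kind of short social text, keywords are rarely repeated, whether within a single post or across the whole experimental corpus. The reason may be that short texts on social networking sites can contain only a few words, and that people tend to change topics constantly on such platforms. As a result, tf-idf, which lowers the weight of function words and raises the weight of a post's keywords, cannot show its strength on this kind of short text, and the simple term-frequency calculation performs better. This result may also suggest that, when function words and content words are compared as features, the assumption built into tf-idf that function-word features are unimportant for authorship identification is not appropriate.
    Experiment 2 shows the discriminating power for authorship offered by different levels of Chinese lexical units (e.g., characters, words, word bigrams, and a mix of characters and words). Another common issue in Chinese authorship identification is word segmentation. Unlike languages with alphabetic writing systems, Chinese has no spacing in its surface structure to separate words. Many previous studies of Chinese authorship identification therefore chose to classify without segmentation. The second experiment in this thesis segments Chinese text with CKIP and uses both the unsegmented and the segmented results as features, to explore the discriminating power of the different character and word units (including character-based and word-based unigrams, word-based bigrams, and a mix of characters and words). The results show that word-based features classify better than character-based features. The second experiment also adds feature sets beyond characters and words (structural, subjectivity, and emotion features). The results show the importance of subjectivity and emotion features in the genre of social media data.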

    Individual differences in writing style (stylometry) have long been a popular research topic. In linguistics, researchers have asked whether individual differences can be quantified and measured by particular indices or statistics (Tweedie & Baayen, 1998; Mosteller & Wallace, 1964; Burrows, 2002, 2003, 2007; Hoover, 2004). From the perspective of information technology, there is an increasing need for document forensics that can detect the authorship of anonymous documents, either to help investigate internet crimes or to serve various document classification purposes.
    This thesis introduces different ways of measuring individual writing differences from both the linguistics and information technology disciplines. Two experiments are carried out on individuals' texts collected from the prevalent social media platform Facebook, to investigate to what extent Chinese characters and words can capture individual writing differences, and to what extent other textual attributes, such as structural, subjectivity, and emotion cues, can contribute to authorship identification in this kind of short social text. The study also examines three feature weighting methods (tf-idf, frequency, and ratio) and compares their effectiveness in short-text classification. A recently released SVM classifier, LibLinear, is adopted. The design of this software package makes it well suited to document classification tasks, where the feature dimension is extremely high, and it can also provide a score for each feature with respect to each category, telling researchers which features in the feature set best discriminate and represent a specific category.
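    The per-feature, per-category scores mentioned above come from the weights of a linear model. The following sketch is not LibLinear and does not reproduce the thesis's setup: it swaps in a small multiclass perceptron, written with the standard library only, purely to illustrate how a linear classifier keeps one weight per (category, feature) pair and how those weights can be ranked. The function names `train_perceptron` and `top_features`, the toy posts, and the author labels "A" and "B" are all assumptions made for illustration.

```python
from collections import defaultdict


def train_perceptron(samples, labels, epochs=10):
    """Train a multiclass perceptron; returns {label: {feature: weight}}."""
    weights = {lab: defaultdict(float) for lab in sorted(set(labels))}
    for _ in range(epochs):
        for feats, gold in zip(samples, labels):
            # Score each author as the dot product of weights and features.
            scores = {
                lab: sum(w[f] * v for f, v in feats.items())
                for lab, w in weights.items()
            }
            # Ties are broken in favour of the alphabetically first label.
            pred = max(sorted(scores), key=scores.get)
            if pred != gold:
                # Move weight towards the true author, away from the guess.
                for f, v in feats.items():
                    weights[gold][f] += v
                    weights[pred][f] -= v
    return weights


def top_features(weights, label, k=1):
    """Return the k features with the largest weight for one author."""
    ranked = sorted(weights[label].items(), key=lambda kv: -kv[1])
    return [feature for feature, _ in ranked[:k]]


# Hypothetical toy data: token counts per post for two invented authors.
samples = [
    {"哈哈": 2, "我": 1},
    {"哈哈": 1, "今天": 1},
    {"唉": 1, "我": 1},
    {"唉": 2, "今天": 1},
]
labels = ["A", "A", "B", "B"]

weights = train_perceptron(samples, labels)
print(top_features(weights, "A"))
print(top_features(weights, "B"))
```

    In this toy run the shared tokens end up with zero weight for both authors, while the token each author uses exclusively receives the largest weight for that author, which is the kind of inspection the abstract describes.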
    In the first experiment, tf-idf weighting outperforms the ratio measure but does not outperform the frequency measure. This result suggests that in this kind of short social text, keywords seldom repeat, either locally (within a single post) or globally (across the whole corpus). This may be attributed to the relatively short length of a post, which limits the number of words it can contain, and to people's tendency to change topics frequently on social platforms. Therefore, tf-idf, which lowers the weight of function words while raising that of locally frequent content words, shows no extra discriminating power compared with the simpler frequency measure. Moreover, the assumption built into tf-idf that function words carry no information about an author's preferences may not be appropriate.
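    To make the three weighting schemes concrete, here is a minimal sketch that computes them for a few tokenized posts. The abstract does not state which tf-idf variant the thesis uses, so the plain form tf × log(N / df) is an assumption, as are the function name `weight_features` and the toy posts.

```python
import math
from collections import Counter


def weight_features(docs):
    """Return frequency, ratio, and tf-idf weights for each tokenized post."""
    n_docs = len(docs)
    # Document frequency: the number of posts that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        weighted.append({
            # Raw count of the token in the post.
            "frequency": dict(counts),
            # Count divided by the post length.
            "ratio": {t: c / total for t, c in counts.items()},
            # Assumed variant: raw count times log(N / df).
            "tfidf": {t: c * math.log(n_docs / df[t]) for t, c in counts.items()},
        })
    return weighted


# Hypothetical, hand-segmented posts.
posts = [
    ["我", "今天", "很", "開心"],
    ["我", "今天", "好", "累"],
    ["我", "想", "睡覺"],
]
vectors = weight_features(posts)
print(vectors[0])
```

    Two things are visible even in this toy corpus. A token that occurs in every post (here "我") receives a tf-idf weight of zero, so whatever it says about the author is discarded, whereas frequency keeps it. And because each token occurs only once per short post, the term-frequency part of tf-idf is always 1, leaving the inverse document frequency to do all the work.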
    Another common issue in Chinese authorship identification is the segmentation problem. Unlike alphabetic languages, Chinese does not mark word boundaries in its surface structure. Thus, much previous research chose to tackle the language with non-segmented approaches. The second experiment demonstrates the discriminating power of different levels of lexical units (i.e., character-based and word-based unigrams, word-based bigrams, and a mix of characters and words) in Chinese authorship identification. The results show that word-based features perform much better than character-based features. The second experiment also takes additional feature levels into account (i.e., the structural, subjectivity, and emotion levels). The results show the important role of subjectivity and emotion cues in the genre of social media texts.
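    The lexical feature levels compared in the second experiment can be sketched as follows. The input is assumed to be a post that has already been segmented into words; the example is segmented by hand rather than by a real segmenter. The function name `extract_features`, the "c:", "w:", and "w2:" prefixes, and the way the mixed set simply concatenates character and word unigrams are illustrative assumptions, since the abstract does not say how the thesis combines the two.

```python
def extract_features(segmented_post):
    """Build four lexical feature lists from one word-segmented post."""
    # Character unigrams ignore word boundaries entirely.
    chars = [ch for word in segmented_post for ch in word]
    char_unigrams = ["c:" + ch for ch in chars]
    # Word unigrams and bigrams depend on the segmentation.
    word_unigrams = ["w:" + w for w in segmented_post]
    word_bigrams = [
        "w2:" + a + "_" + b
        for a, b in zip(segmented_post, segmented_post[1:])
    ]
    return {
        "char_unigram": char_unigrams,
        "word_unigram": word_unigrams,
        "word_bigram": word_bigrams,
        # Prefixes keep a one-character word distinct from the same character.
        "mixed": char_unigrams + word_unigrams,
    }


# Hypothetical post, segmented by hand for illustration.
post = ["今天", "天氣", "真", "好"]
features = extract_features(post)
for name, values in features.items():
    print(name, values)
```

    The character "天" appears twice at the character level but belongs to two different words at the word level, which shows how segmentation changes the units a classifier sees.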

    Abstract
    Abstract (Chinese)
    Acknowledgments
    Contents
    List of Tables
    List of Figures
    Chapter 1 Introduction
    Chapter 2 Literature Review
      2.1 Stylometry
        2.1.1 Jayne (1980) and Opas (1996)
        2.1.2 Burrows (2002, 2003, 2007)
        2.1.3 Discussion on Quantitative-Statistical Approach
      2.2 Authorship Analysis
        2.2.1 Authorship Identification
        2.2.2 Characteristics of authorship identification
        2.2.3 Feature Selection
        2.2.4 Feature Weighting
        2.2.5 Machine Learning Technique
    Chapter 3 Methodology
      3.1 Rationale: Background and Motivation
      3.2 Corpus Construction
        3.2.1 Design criteria
        3.2.2 Data Collection
        3.2.3 Pre-Processing: Data Cleansing and Segmentation
      3.3 Machine Learning Process
        3.3.1 Choosing Classifier: LibLinear
        3.3.2 Machine Learning Process
      3.4 Experiment 1
        3.4.1 Purpose and Design
        3.4.2 Feature Set
      3.5 Experiment 2
        3.5.1 Purpose and Design
        3.5.2 Features Set
    Chapter 4 Result and Discussion
      4.1 Experiment 1
      4.2 Experiment 2
      4.3 In comparison with the quantitative-statistical approach
      4.4 In comparison with the machine learning approach
      4.5 Linguistic Discussion
    Chapter 5 Conclusion
    Bibliography
    Appendix A: The topmost IG-3000 words
    Appendix B: Error analysis
      Author 1
      Author 2
      Author 3
      Author 4
      Author 5
      Author 6

    Abbasi, A., & Chen, H. (2005). Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems, 20(5), 67-75. doi: 10.1109/MIS.2005.81
    Abbasi, Ahmed, & Chen, Hsinchun. (2008). Writeprints: A Stylometric Approach to Identity-Level Identification and Similarity Detection in Cyberspace. ACM Transactions on Information Systems, 26(2), 7:1-7:29.
    Argamon, Shlomo, Šarić, Marin, & Stein, Sterling S. (2003). Style mining of electronic messages for multiple authorship discrimination: first results. Paper presented at the Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining.
    Auria, L., & Moro, R. A. (2007). Advantages and Disadvantages of Support Vector Machines (SVMs). Paper presented at the Credit Risk Assessment Revisited: Methodological Issues and Practical Implications.
    Baayen, R.H., Van Halteren, H., & Tweedie, F.J. (1996). Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 11(3), 121-132. doi: 10.1093/llc/11.3.121
    Bennett, William Ralph. (1976). Scientific and engineering problem-solving with the computer: Prentice Hall PTR.
    Biber, Douglas. (1991). Variation across speech and writing: Cambridge University Press.
    Burrows, J.F. (1989). "An ocean where each kind...": Statistical analysis and some major determinants of literary style. Computers and the Humanities, 23, 309-321.
    Burrows, John. (2002). 'Delta': a Measure of Stylistic Difference and a Guide to Likely Authorship. Literary and Linguistic Computing, 17(3), 267-287. doi: 10.1093/llc/17.3.267
    Burrows, John. (2003). Questions of Authorship: Attribution and Beyond. Computers and the Humanities, 5-32.
    Burrows, John. (2007). All the Way Through: Testing for Authorship in Different Frequency Strata. Literary and Linguistic Computing, 22(1), 27-47.
    Burrows, John F. (1987). Word-patterns and story-shapes: The statistical analysis of narrative style. Literary and Linguistic Computing, 2(2), 61-70.
    Chaikin, David. (2006). Network investigations of cyber attacks: the limits of digital evidence. Crime, Law and Social Change, 46(4-5), 239-256. doi: 10.1007/s10611-007-9058-4
    Chang, Chih-Chung, & Lin, Chih-Jen. (2011). LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3), 27:1-27:27.
    Chen, Keh-Jiann, & Liu, Shing-Huan. (1992). Word identification for Mandarin Chinese sentences. Paper presented at the Proceedings of the 14th conference on Computational linguistics-Volume 1.
    D.L.Wallace, F. Mosteller and. (1998). Text categorization with support vector machines: Learning with many relevant features. Paper presented at the European Conference on Machine Learning (ECML).
    Diederich, J., Kindermann, J., Leopold, E., & Paass, G. (2003). Authorship attribution with support vector machines. Applied Intelligence, 19(1-2), 109-123.
    Fan, R.-E., Chang, K.-W., Hsieh, C.-J., Wang, X.-R., & Lin, C.-J. (2008). LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research, 9, 1871-1874.
    Hadjidj, Rachid, Debbabi, Mourad, Lounis, Hakim, Iqbal, Farkhund, Szporer, Adam, & Benredjem, Djamel. (2009). Towards an integrated e-mail forensic analysis framework. Digital Investigation, 5(3), 124-137.
    Holmes, David I. (1992). A stylometric analysis of Mormon scripture and related texts. Journal of the Royal Statistical Society. Series A (Statistics in Society), 91-120.
    Holmes, David I, & Forsyth, Richard S. (1995). The Federalist revisited: New directions in authorship attribution. Literary and Linguistic Computing, 10(2), 111-127.
    Hoover, David L. (2004). Testing Burrows's delta. Literary and Linguistic Computing, 19(4), 453-475.
    Hope, Jonathan. (1994). The Authorship of Shakespeare's Plays: A socio-linguistic study: Cambridge University Press.
    Houvardas, John, & Stamatatos, Efstathios. (2006). N-gram feature selection for authorship identification. In Artificial Intelligence: Methodology, Systems, and Applications (pp. 77-86): Springer.
    Iqbal, Farkhund, Binsalleeh, Hamad, Fung, Benjamin, & Debbabi, Mourad. (2010). Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, 7(1), 56-64.
    Iqbal, Farkhund, Hadjidj, Rachid, Fung, Benjamin, & Debbabi, Mourad. (2008). A novel approach of mining write-prints for authorship attribution in e-mail forensics. Digital Investigation, 5, S42-S51.
    Iqbal, Farkhund, Khan, Liaquat A., Fung, Benjamin C. M., & Debbabi, Mourad. (2010). E-mail authorship verification for forensic investigation. Paper presented at the Proceedings of the 2010 ACM Symposium on Applied Computing, Sierre, Switzerland.
    Jaynes, J. T. (1980). A search for trends in the poetic style of W.B. Yeats. Association for Literary and Linguistic Computing Journal, 1, 11-19.
    Ma, Jianbin, Teng, Guifa, Chang, Shuhui, Zhang, Xiaoru, & Xiao, Ke. (2011). Social Network Analysis Based on Authorship Identification for Cybercrime Investigation. In M. Chau, G. A. Wang, X. Zheng, H. Chen, D. Zeng & W. Mao (Eds.), Intelligence and Security Informatics (Vol. 6749, pp. 27-35): Springer Berlin Heidelberg.
    Ma, Jianbin, Li, Ying, & Teng, Guifa. (2008). Identifying Chinese E-mail Documents' Authorship for the Purpose of Computer Forensics.
    Ma, Jianbin, Li, Ying, Teng, Guifa, Wang, Fang, & Zhao, Yang. (2008). Sequential Pattern Mining for Chinese E-mail Authorship Identification. Paper presented at the Innovative Computing Information and Control, 2008. ICICIC '08. 3rd International Conference on.
    Joachims, Thorsten. (1998). Text categorization with support vector machines: Learning with many relevant features: Springer.
    Keselj, Vlado, Peng, Fuchun, Cercone, Nick, & Thomas, Calvin. (2003). N-gram-based author profiles for authorship attribution. Paper presented at the Proceedings of the Pacific Association for Computational Linguistics.
    Koppel, Moshe, Argamon, Shlomo, & Shimoni, Anat Rachel. (2002). Automatically Categorizing Written Texts by Author Gender. Literary and Linguistic Computing, 17(4), 401-412. doi: 10.1093/llc/17.4.401
    Ma, Jianbin, Teng, Guifa, Zhang, Yuxin, Li, Yueli, & Li, Ying. (2009). A cybercrime forensic method for Chinese web information authorship analysis. In Intelligence and Security Informatics (pp. 14-24): Springer.
    Martindale, C., & McKenzie, D. (1995). On the utility of content analysis in author attribution: The Federalist. Computers and the Humanities, 29, 259-270.
    Opas, Lisa Lena. (1996). A Multi-Dimensional Analysis of Style in Samuel Beckett's Prose Works. Research in Humanities Computing, 4, 81-114.
    Peng, Fuchun, Schuurmans, Dale, Wang, Shaojun, & Keselj, Vlado. (2003). Language independent authorship attribution using character level language models. Paper presented at the Proceedings of the tenth conference on European chapter of the Association for Computational Linguistics-Volume 1.
    Zheng, Rong, Li, Jiexun, Chen, Hsinchun, & Huang, Zan. (2006). A Framework for authorship identification of online messages: Writing-style features and classification techniques. Journal of the American Society for Information Science & Technology, 57(3), 378-393. doi: 10.1002/asi.v57:3
    Rudman, J. (1998). The state of authorship attribution studies: Some problems and solutions. Computers and the Humanities, 31, 351-365.
    Stamatatos, Efstathios, Fakotakis, Nikos, & Kokkinakis, George. (1999). Automatic authorship attribution. Paper presented at the Proceedings of the ninth conference on European chapter of the Association for Computational Linguistics.
    Stamatatos, Efstathios, Fakotakis, Nikos, & Kokkinakis, George. (2000). Automatic text categorization in terms of genre and author. Computational linguistics, 26(4), 471-495.
    Tsuboi, Yuta, & Matsumoto, Yuji. (2002). Authorship identification for heterogeneous documents. IPSJ SIG Notes, 17-24.
    Tweedie, F.J., & Baayen, R.H. (1998). How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities, 32, 323-352.
    Vel, O. de, Anderson, A., Corney, M., & Mohay, G. (2001). Mining e-mail content for author identification forensics. SIGMOD Rec., 30(4), 55-64. doi: 10.1145/604264.604272
    Whissell, Cynthia. (1996). Traditional and emotional stylometric analysis of the songs of Beatles Paul McCartney and John Lennon. Computers and the Humanities, 30(3), 257-265.
    Cavnar, William B., & Trenkle, John M. (1994). N-gram-based text categorization. Paper presented at the Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval.
    Yang, Yiming. (1999). An evaluation of statistical approaches to text categorization. Information retrieval, 1(1-2), 69-90.
    Yu, H.-F., Ho, C.-H., Juan, Y.-C., & Lin, C.-J. (2013). LibShortText: A Library for Short-text Classification and Analysis.
