Author: | 顏安孜 Yen, An-Zi |
---|---|
Thesis title: | 中文部落格文章之相關性擷取與意見傾向分析之研究 Topic-Relevant Document Extraction and Opinion Analysis in Chinese Blog Posts |
Advisor: | 侯文娟 Hou, Wen-Juan |
Degree: | Master |
Department: | Department of Computer Science and Information Engineering |
Publication year: | 2015 |
Academic year of graduation: | 103 |
Language: | Chinese |
Number of pages: | 93 |
Keywords (Chinese): | 情感分析、查詢詞擴充、主題相關文章擷取、意見極性分類 |
Keywords (English): | Sentiment Analysis, Query Expansion, Topic-relevant Document Retrieval, Opinion Polarity Classification |
Thesis type: | Academic thesis |
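As Internet technology develops, more and more people share their comments and opinions online. Automatically classifying the opinion orientation of articles within this vast body of online text is an important research direction in sentiment analysis. Focusing on political commentary articles, this thesis proposes methods that extract articles relevant to a specific topic and then analyze their opinion orientation, which is classified as positive, neutral, or negative.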
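To classify articles accurately, this study proposes both unsupervised and supervised learning methods. The experiments are divided into two parts: extracting topic-relevant articles, and analyzing the opinion orientation of those articles. In the unsupervised method, the Pointwise Mutual Information (PMI) formula is used to compute how strongly each noun in the text is associated with the topic, and strongly associated nouns are taken as query expansion terms. An article is considered topic-relevant if it contains a topic word or a query expansion term. The study then analyzes the sentence structure of the topic-relevant articles, assigns each sentence a polarity with a lexicon-based method, and examines how negation words, transition words, and a sentence-final question mark affect polarity.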
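The PMI-based query expansion step can be illustrated with a minimal sketch. It assumes document-level co-occurrence counts, pre-extracted nouns, and a fixed `top_k` cutoff; the function names, the toy corpus, and the cutoff are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter

def pmi_expansion_terms(docs, topic, top_k=2):
    """Rank nouns by PMI with the topic word, using document co-occurrence.

    docs: list of noun lists, one per document (assumed already extracted).
    Returns (top_k terms, dict of PMI scores).
    """
    n = len(docs)
    doc_sets = [set(d) for d in docs]
    # Document frequency of every noun.
    df = Counter(t for s in doc_sets for t in s)
    # Number of documents in which a noun co-occurs with the topic word.
    co = Counter(t for s in doc_sets if topic in s for t in s if t != topic)
    # PMI(t, topic) = log( P(t, topic) / (P(t) * P(topic)) )
    scores = {
        t: math.log((c / n) / ((df[t] / n) * (df[topic] / n)))
        for t, c in co.items()
    }
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_k], scores

def is_relevant(doc, topic, expansion_terms):
    """A document counts as relevant if it has the topic word or any expansion term."""
    return topic in doc or any(t in doc for t in expansion_terms)

# Toy corpus with placeholder English nouns (illustrative only).
corpus = [
    ["election", "candidate", "vote"],
    ["election", "candidate"],
    ["weather", "rain"],
    ["vote", "rain"],
]
terms, pmi = pmi_expansion_terms(corpus, "election", top_k=1)
print(terms)  # ['candidate']
print(is_relevant(["candidate", "debate"], "election", terms))  # True
```

PMI tends to favor rare terms, so a practical version would probably also require a minimum document frequency before accepting an expansion term.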
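In the supervised method, a support vector machine (SVM) is used to classify articles. In the topic-relevant article extraction experiment, the chi-square test (CHI) formula is used to score the association between each term in the training data and the class, and the 20 top-scoring terms are grouped into sets of two or three. The study finds that the co-occurrence of certain term combinations in the same article indicates topic relevance. The results of the opinion orientation experiment on topic-relevant articles show that selecting training terms by their frequency of occurrence in articles of different polarities works better than feature selection with the chi-square test, and that using each term's polarity in the training data as a feature gives better results than using term polarities from a sentiment lexicon.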
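As a rough illustration of chi-square term scoring, the sketch below uses the standard 2x2 contingency form of the statistic for a term and a class. The data layout, the function names, and the toy documents are assumptions made for the example; the thesis may compute or normalize the score differently.

```python
def chi_square(term, docs, labels, target):
    """Chi-square score of `term` for class `target` from a 2x2 contingency table.

    a: docs in class with term      b: docs outside class with term
    c: docs in class without term   d: docs outside class without term
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # Guard against empty rows or columns in the table.
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def top_terms(docs, labels, target, k=20):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    vocab = sorted({t for doc in docs for t in doc})
    return sorted(vocab, key=lambda t: -chi_square(t, docs, labels, target))[:k]

# Toy training data (illustrative only).
train_docs = [["election", "vote"], ["election", "rain"], ["rain"], ["weather", "rain"]]
train_labels = ["relevant", "relevant", "other", "other"]
print(chi_square("election", train_docs, train_labels, "relevant"))  # 4.0
print(top_terms(train_docs, train_labels, "relevant", k=1))          # ['election']
```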
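Finally, a comparison of the unsupervised and supervised methods on opinion orientation analysis of topic-relevant articles shows that the supervised method outperforms the unsupervised one. Precision varies with the experimental topic, with a maximum of 70.84% and a minimum of 65.49%.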
With the development of Internet technology, many people express their opinions as reviews or comments on the Internet. Automatically classifying the opinion polarity of documents has therefore become an important research direction in sentiment analysis.
In this thesis, the experimental data are political articles. Methods are designed to extract documents that are related to a given topic and to analyze the opinion polarity of those documents. The polarities are classified as positive, neutral, and negative.
To classify documents accurately, both unsupervised and supervised learning methods are adopted. The experiments consist of two parts: extracting topic-relevant documents and analyzing the opinion polarity of those documents. In the unsupervised learning method, the Pointwise Mutual Information (PMI) score of each noun phrase is computed in order to extract query expansion terms. The topic-relevant documents are then extracted using these query expansion terms together with the topic seed words. Next, we analyze the structure of the sentences, using a lexicon-based method to determine the opinion polarity of each sentence. In addition, we investigate whether negation words, transitional expressions, and question marks in a sentence influence its opinion polarity.
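A lexicon-based sentence scorer of the kind described here might look like the sketch below. The toy lexicon, the cue-word lists, and all three rules (negation flips the next sentiment word, only the clause after a transition word counts, a sentence-final question mark yields neutral) are illustrative assumptions; the abstract says only that these factors are investigated, not how they are handled.

```python
# Toy lexicon and cue lists; every entry is an illustrative assumption.
LEXICON = {"支持": 1, "贊成": 1, "反對": -1, "失望": -1}  # support, approve, oppose, disappointed
NEGATIONS = {"不", "沒有"}       # "not", "have not"
TRANSITIONS = {"但是", "然而"}   # "but", "however"
QUESTION_MARKS = {"?", "?"}

def sentence_polarity(tokens):
    """Return +1 (positive), 0 (neutral), or -1 (negative) for a segmented sentence."""
    # Assumed rule: a sentence ending in a question mark is treated as neutral.
    if tokens and tokens[-1] in QUESTION_MARKS:
        return 0
    score = 0
    negate = False
    for tok in tokens:
        if tok in TRANSITIONS:
            # Assumed rule: only the clause after a transition word counts.
            score = 0
            negate = False
        elif tok in NEGATIONS:
            # Assumed rule: a negation flips the next sentiment word found.
            negate = True
        elif tok in LEXICON:
            value = LEXICON[tok]
            score += -value if negate else value
            negate = False
    return (score > 0) - (score < 0)

print(sentence_polarity(["我", "支持", "這個", "政策"]))        # 1
print(sentence_polarity(["我", "不", "支持"]))                  # -1
print(sentence_polarity(["我", "支持", "但是", "很", "失望"]))  # -1
print(sentence_polarity(["你", "支持", "嗎", "?"]))            # 0
```

The input is assumed to be already word-segmented, for example by a Chinese segmenter, since the rules operate on tokens rather than raw characters.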
Furthermore, in the supervised learning method, a support vector machine (SVM) classifier is employed to classify documents. In the experiment on extracting topic-relevant documents, the relevance score between each word and the topic is computed with the chi-square test formula. Among the twenty top-ranked words, we find that certain pairs or triples of words, when they appear together in a document, indicate that the document is relevant to the topic. The experimental results on opinion polarity show that selecting training terms by a specific frequency condition performs better than feature selection based on the chi-square test. Moreover, using the polarity of each word in the training data as a feature performs better than using the polarity of sentiment words from a sentiment lexicon.
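One way to look for word pairs or triples whose co-occurrence signals relevance is sketched below. The selection criterion (a minimum support and a minimum share of relevant documents among those containing the whole combination) is an assumption made for illustration, as are the parameter names and thresholds; the abstract does not state how the combinations were chosen.

```python
from itertools import combinations

def reliable_combos(top_terms, docs, labels, sizes=(2, 3),
                    min_precision=0.9, min_support=2):
    """Find term pairs/triples whose co-occurrence mostly marks relevant documents.

    top_terms: top-ranked terms (e.g. from a chi-square ranking).
    docs: list of token collections; labels: True if the document is relevant.
    """
    kept = []
    for size in sizes:
        for combo in combinations(top_terms, size):
            # Labels of the documents that contain every term in the combination.
            hits = [lab for doc, lab in zip(docs, labels)
                    if all(t in doc for t in combo)]
            if len(hits) >= min_support and sum(hits) / len(hits) >= min_precision:
                kept.append(combo)
    return kept

# Toy data (illustrative only).
ranked = ["election", "candidate", "rain"]
train_docs = [
    {"election", "candidate"},
    {"election", "candidate", "rain"},
    {"election", "rain"},
    {"rain"},
]
train_labels = [True, True, False, False]
print(reliable_combos(ranked, train_docs, train_labels))  # [('election', 'candidate')]
```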
Finally, a comparison of the unsupervised and supervised learning methods on opinion polarity analysis shows that the supervised method performs better than the unsupervised one. Across the different experimental topics, the highest precision is 70.84% and the lowest is 65.49%.