Graduate student: | 林士翔 Shih-Hsiang Lin |
---|---|
Thesis title: | 語音文件摘要 - 特徵、模型與應用 Speech Summarization - Features, Models and Applications |
Advisors: | 葉耀明 Yeh, Yao-Ming; 陳柏琳 Chen, Berlin |
Degree: | Doctor |
Department: | 資訊工程學系 Department of Computer Science and Information Engineering |
Year of publication: | 2011 |
Academic year of graduation: | 99 (ROC calendar) |
Language: | English |
Number of pages: | 130 |
Keywords (translated from Chinese): | speech summarization, divergence measure, imbalanced training data, risk-aware, information retrieval |
Keywords (English): | speech summarization, Kullback-Leibler divergence, imbalanced data, risk-aware, information retrieval |
Thesis type: | Academic thesis |
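Speech summarization is easily affected by speech recognition errors, so conventional text summarization methods often fail to extract the important sentences of a spoken document. Compared with text documents, however, spoken documents offer additional information for summarization: prosodic features, acoustic features, speaker roles, and emotion can all serve as extra sentence-level features. This dissertation studies speech summarization from three aspects: features, models, and applications. For features, we investigate how different word graph structures can represent the recognition hypotheses, so as to reduce the impact of recognition errors that arises from relying only on the single best recognition hypothesis (1-best). For models, we propose an unsupervised summarization model based on the Kullback-Leibler (KL) divergence measure; the model can exploit information cues beyond the text to improve the accuracy of the divergence measure, thereby alleviating the problems caused by recognition errors. For supervised summarization models, we propose three training criteria that address the negative effects of imbalanced training data. Building on these two kinds of models, we further propose a risk-aware summarization framework that combines supervised and unsupervised models, retaining their individual strengths while overcoming their respective limitations. We also introduce different loss functions to account for the redundancy and coherence relationships between sentences and between the document and its sentences. For applications, we investigate how to integrate summarization techniques into information retrieval. All proposed methods were evaluated on a broadcast news corpus, and the experimental results show that they substantially improve on existing summarization methods.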
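As a rough illustration of the KL-divergence idea mentioned above, the following minimal sketch ranks sentences by how closely their word distribution matches that of the whole document. It is not the dissertation's model: the unigram representation, the additive smoothing constant `alpha`, the direction KL(document || sentence), and the toy sentences are all assumptions made for this example, and the non-textual cues (prosody, word graphs) are omitted.

```python
import math
from collections import Counter


def unigram_model(tokens, vocab, alpha=0.1):
    """Additively smoothed unigram distribution over a fixed vocabulary.

    alpha is an assumed smoothing constant, not a value from the thesis.
    """
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)


def rank_sentences(sentences):
    """Return sentence indices, lowest divergence from the document first."""
    vocab = sorted({w for s in sentences for w in s})
    doc_model = unigram_model([w for s in sentences for w in s], vocab)
    scores = [
        kl_divergence(doc_model, unigram_model(s, vocab)) for s in sentences
    ]
    return sorted(range(len(sentences)), key=lambda i: scores[i])


# Hypothetical tokenized "document" of three sentences.
sentences = [
    ["storm", "hits", "coast", "storm", "damage"],
    ["weather", "report"],
    ["storm", "damage", "coast", "report"],
]
ranking = rank_sentences(sentences)
```

A sentence whose smoothed distribution diverges least from the document model is treated as the most representative, so `ranking[0]` would be the first candidate for an extractive summary.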
Speech summarization is inevitably faced with the problem of incorrect information caused by recognition errors. However, it also presents opportunities that do not exist for text summarization; for example, information cues from prosodic analysis, including speaker emotions, can help determine the importance and structure of spoken documents. In this dissertation, we discuss the problem of speech summarization from three aspects: features, models and applications. For the feature aspect, we investigate various ways to robustly represent the recognition hypotheses of spoken documents beyond the top-scoring ones, in order to alleviate the negative effects caused by speech recognition errors. For the model aspect, we present an unsupervised Kullback-Leibler (KL) divergence based summarization method that can accommodate additional information cues to alleviate the problems caused by speech recognition errors. We also investigate three disparate training criteria for training a supervised summarizer in a preference-sensitive manner, to overcome the problem of imbalanced data in speech summarization. Building on these methods, we propose a risk-aware summarization framework that naturally combines supervised and unsupervised summarization models to inherit their individual merits and to overcome their inherent limitations. Various loss functions and modeling paradigms are introduced, providing a principled way to render the redundancy and coherence relationships among sentences and between sentences and the whole document, respectively. For the application aspect, we demonstrate the possibility of integrating summarization techniques into information retrieval tasks. Experimental results on the broadcast news summarization task suggest that our proposed methods can give substantial improvements over conventional summarization methods.
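The risk-aware idea can be sketched as a minimum expected-loss selection. The abstract does not give the formulation, so everything below is an assumption for illustration: the posterior over sentences is taken to be proportional to an unsupervised likelihood times a supervised prior, the loss is one minus cosine similarity, and the numeric scores are hypothetical placeholders rather than outputs of any trained model.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token lists using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def posterior(likelihoods, priors):
    """Normalize likelihood * prior into a distribution over sentences."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    z = sum(joint)
    return [j / z for j in joint]


def expected_risk(i, sentences, post):
    """Expected loss of picking sentence i, with loss = 1 - cosine."""
    return sum(
        (1.0 - cosine(sentences[i], sentences[j])) * post[j]
        for j in range(len(sentences))
    )


def select_min_risk(sentences, likelihoods, priors):
    """Return the index with the lowest expected risk and all risks."""
    post = posterior(likelihoods, priors)
    risks = [expected_risk(i, sentences, post) for i in range(len(sentences))]
    best = min(range(len(sentences)), key=lambda i: risks[i])
    return best, risks


# Hypothetical inputs: tokenized sentences plus made-up model scores.
sentences = [
    ["storm", "hits", "coast", "storm", "damage"],
    ["weather", "report"],
    ["storm", "damage", "coast", "report"],
]
likelihoods = [0.5, 0.1, 0.4]  # assumed unsupervised scores
priors = [0.3, 0.2, 0.5]       # assumed supervised scores
best, risks = select_min_risk(sentences, likelihoods, priors)
```

In this sketch the two model families enter only through `likelihoods` and `priors`, while the loss function is the place where redundancy or coherence constraints could be encoded by swapping in a different pairwise measure.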