National Taiwan Normal University Electronic Theses and Dissertations Full-Text System


Author:	Liu Cheng-Wei (劉成韋)
Title:	A Study on Feature Normalization and Other Improved Techniques for Robust Speech Recognition (強健性語音辨識上關於特徵正規化與其它改良技術的研究)
Advisor:	Chen, Berlin (陳柏琳)
Degree:	Master
Department:	Department of Computer Science and Information Engineering
Year of publication:	2005
Academic year of graduation:	93 (ROC calendar)
Language:	Chinese
Pages:	115
Keywords (translated from Chinese):	feature extraction, feature normalization, histogram equalization, spectral entropy, speech recognition, robustness
Keywords (English):	feature extraction, feature normalization, histogram equalization, spectral entropy, speech recognition, Aurora 2.0, robust
Document type:	Academic thesis

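Over thousands of years of evolution, human beings have continuously accumulated and passed on practical knowledge, so in the past the pace of civilizational change matched the pace of human evolution. Today, however, technology advances far faster than humans evolve, and ever more multimedia audio-visual information is available in daily life, such as broadcast radio and television programs, voice mail, recorded lectures, and digital archives. For this reason, handheld mobile devices that can access such multimedia information anytime and anywhere are receiving increasing attention. Clearly, in most of these media, speech is one of the principal carriers of semantic content. Moreover, speech has always been the most natural and direct means of human communication; using speech as the bridge between people and technology products would be friendly and effective, and would also eliminate cumbersome operating procedures. Products on the market today are generally becoming smaller, so touch-based operation is gradually becoming less convenient. In addition, traditional human-machine interfaces such as the mouse and keyboard cannot be used properly in every environment; in a moving car, for example, they are inconvenient. Using speech as the human-machine interface would therefore greatly improve convenience and bring technology and daily life closer together. However, speech recognition is usually disturbed by complicating factors such as background noise, channel effects, and speaker and linguistic variation, so recognition systems never reach their best performance and recognition rates are often unsatisfactory.
This thesis studies and compares a number of current robust speech techniques, improves on them, and finally integrates them into a new approach. The main method is table-based histogram equalization (THEQ), which is combined with other related techniques to improve robustness. THEQ is then refined into a modified histogram equalization method, in which the reference distribution is split according to frame type into silence and speech; following the characteristics of Chinese, speech is further subdivided into INITIAL and FINAL segments. The proposed modified histogram equalization improves the recognition rate by a relative 4.04% over conventional THEQ and by at least a relative 5.75% over the original (baseline) recognition rate. In addition, spectral entropy features extracted from the speech signal were combined with linear discriminant analysis (LDA) and then appended to the conventional speech features to form a new feature set, which improved the recognition rate by nearly a relative 1.00%. Combining the new features with THEQ, the other research topic of this thesis, yields an additive effect, raising the average relative improvement in recognition rate to 5.19%.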
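The idea of equalizing each frame against a reference distribution chosen by frame type can be sketched roughly as follows. This is a minimal NumPy illustration, not the thesis's implementation: the function names, the table size of 100, the rank-based estimate of the test CDF, and the assumption that per-frame class labels (for example silence versus speech) are already available are all hypothetical choices made for the example.

```python
import numpy as np

def build_reference_table(train_feats, table_size=100):
    """Store quantiles of the training (reference) distribution.

    train_feats: array of shape (frames, dims).
    Returns an array of shape (table_size, dims); row i holds the
    value at cumulative probability (i + 0.5) / table_size.
    """
    probs = (np.arange(table_size) + 0.5) / table_size
    return np.quantile(train_feats, probs, axis=0)

def equalize(feats, table):
    """Map each feature dimension onto the reference distribution."""
    n_frames, dims = feats.shape
    table_size = table.shape[0]
    probs = (np.arange(table_size) + 0.5) / table_size
    out = np.empty_like(feats, dtype=float)
    for d in range(dims):
        # Empirical CDF of the test data, estimated from ranks.
        ranks = np.argsort(np.argsort(feats[:, d]))
        cdf = (ranks + 0.5) / n_frames
        # Table lookup with linear interpolation between entries.
        out[:, d] = np.interp(cdf, probs, table[:, d])
    return out

def equalize_by_class(feats, labels, tables):
    """Equalize each frame against the table of its own class.

    labels: one class label per frame (assumed given).
    tables: dict mapping class label -> reference table.
    """
    out = np.array(feats, dtype=float)
    for label, table in tables.items():
        mask = labels == label
        if mask.any():
            out[mask] = equalize(feats[mask], table)
    return out
```

With a single table, `equalize` corresponds to ordinary histogram equalization; passing two tables (silence, speech) or three (silence, INITIAL, FINAL) to `equalize_by_class` mirrors the split of the reference distribution described above, under the stated assumptions.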

Over thousands of years of evolution, human beings have continuously acquired and accumulated knowledge from daily life, so that civilization and human evolution advanced at roughly the same pace. Today, however, the rapid development of technology has far outpaced human evolution. Huge quantities of multimedia information, such as broadcast radio and television programs, voice mail, and digital archives, are continuously growing and filling our computers, networks, and lives. Accessing multimedia information anytime and anywhere with small handheld mobile devices is therefore receiving more and more attention. Speech is the primary and most convenient means of communication between people, and it is expected to play a more active role as the major human-machine interface between people and smart devices in the near future. It would be far more convenient if we could use speech as the human-machine interface and automatically transcribe, retrieve, and summarize multimedia using the speech information inherent in it. However, speech recognition is usually degraded by complicating factors such as background and channel noise and speaker and linguistic variation, which leave current state-of-the-art recognition systems far from perfect.
With these observations in mind, this thesis makes several attempts to improve current speech robustness techniques and to find a way to integrate them. The experiments were carried out on the Aurora 2.0 database and on Mandarin broadcast news speech collected in Taiwan. Considering the phonetic characteristics of the Chinese language, a modified histogram equalization (MHEQ) approach was first proposed, in which separate reference histograms are established for the silence and speech segments (MHEQ-2) or, more precisely, for the silence, INITIAL, and FINAL segments (MHEQ-3) in Chinese. In clean environments, the proposed approach yields relative improvements of more than 5.75% over the baseline system and 4.04% over the conventional table-based histogram equalization (THEQ) approach. Furthermore, spectral entropy features obtained after linear discriminant analysis (LDA) were used to augment the Mel-frequency cepstral features, and initial results indicated considerable improvements. Finally, fusion of the proposed approaches was also investigated, with very promising results.
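For readers unfamiliar with the spectral entropy feature mentioned above, the following is a small sketch of one common way to compute it for a single frame: the power spectrum is normalized so that it sums to one and treated as a probability mass function, whose Shannon entropy is the feature. The function name, the Hamming window, and the FFT size of 256 are assumptions made for illustration; the abstract does not specify the configuration used in the thesis, and the LDA step is not shown.

```python
import numpy as np

def spectral_entropy(frame, n_fft=256, eps=1e-12):
    """Shannon entropy (in nats) of the normalized power spectrum of one frame."""
    # Window the frame and take the one-sided power spectrum.
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed, n=n_fft)) ** 2
    # Normalize so the spectrum behaves like a probability mass function.
    pmf = power / (power.sum() + eps)
    # A flat (noise-like) spectrum gives high entropy, a peaky one gives low entropy.
    return float(-np.sum(pmf * np.log(pmf + eps)))
```

A pure tone concentrates its energy in a few bins and therefore scores lower than white noise, which is the property that makes the measure useful for telling structured speech spectra from noise-like ones.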

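Abstract (in Chinese)
Abstract (in English)
Chapter 1  Introduction
1  Research Motivation
2  Research Objectives
3  Research Content
4  Research Results
5  Thesis Outline
Chapter 2  Literature Review
1  Cepstral Mean Subtraction and Cepstral Normalization
1.1 Cepstral Mean Subtraction (CMS)
1.2 Cepstral Normalization (CN)
2  Histogram Equalization
2.1 Table-Based Histogram Equalization (THEQ)
2.2 Quantile-Based Histogram Equalization (QHEQ)
3  Two-Stage Wiener Filter (TWF)
4  Spectral Subtraction (SS)
5  Spectral Entropy Feature
6  Higher-Order Cepstral Moment Normalization (HOCMN)
Chapter 3  Experimental Corpora and Setup
1  Speech Feature Extraction
2  Acoustic Models and Evaluation of Recognition Performance
3  Experimental Corpora
3.1 Aurora 2.0
3.2 Mandarin Broadcast News
4  Aurora 2.0 Baseline System Results
Chapter 4  Baseline Experimental Results
1  In-Depth Study of Table-Based Histogram Equalization
1.1 Comparison of Different Reference Models
1.2 Comparison of Different Table Sizes
1.3 Comparison With and Without Processing the Energy Dimension
1.4 Comparison of Application in the Spectral and Cepstral Domains
2  Comparison of the Two Histogram Equalization Methods
2.1 Comparison in the Cepstral Domain
2.2 Comparison in the Spectral Domain
3  Combining Robust Speech Feature Techniques
3.1 Higher-Order Cepstral Moment Normalization
3.2 Combining Table-Based Histogram Equalization with Higher-Order Cepstral Moment Normalization
3.3 Results on Aurora 2.0
4  Spectral Entropy Features
4.1 Raw Spectral Entropy as a Feature
4.2 Noise Robustness of Spectral Entropy in Noisy Environments
Chapter 5  Modified Histogram Equalization
1  Potential Problems of Conventional Histogram Equalization
2  Modified Histogram Equalization
2.1 Splitting the Original Histogram into Silence and Speech Histograms
2.2 Experimental Results
2.3 Combining Modified Histogram Equalization with Table-Based Histogram Equalization
2.4 Experimental Results
3  Advanced Histogram Equalization
3.1 Splitting the Original Histogram into Silence, INITIAL, and FINAL Histograms
3.2 Experimental Results
3.3 Combining Advanced Histogram Equalization with Table-Based Histogram Equalization
3.4 Experimental Results
3.5 Application to Broadcast News with Artificially Added Noise
3.6 Discussion of Modified Histogram Equalization
Chapter 6  Improved Spectral Entropy Features
1  In-Depth Study of Spectral Entropy Features
2  Linear Discriminant Analysis on Spectral Entropy
Chapter 7  Conclusions and Future Work
References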

