| Field | Value |
|---|---|
| Graduate Student | Tsao, Yu-Sheng (曹又升) |
| Thesis Title | A Study on Real-Time Single-Channel Speech Enhancement Techniques (即時單通道語音增強技術之研究) |
| Advisors | Chen, Berlin (陳柏琳); Hung, Jeih-Weih (洪志偉) |
| Oral Examination Committee | Chen, Berlin (陳柏琳); Hung, Jeih-Weih (洪志偉); Chen, Kuan-Yu (陳冠宇); Tseng, Hou-Chiang (曾厚強) |
| Oral Defense Date | 2022/06/22 |
| Degree | Master |
| Department | Department of Computer Science and Information Engineering |
| Year of Publication | 2022 |
| Academic Year of Graduation | 110 (ROC calendar; 2021–2022) |
| Language | Chinese |
| Number of Pages | 47 |
| Keywords | Speech Enhancement; Robust Automatic Speech Recognition; Real-Time Speech Enhancement; Discrete Cosine Transform; Sub-Band Processing |
| Research Methods | Experimental design; survey research |
| DOI URL | http://doi.org/10.6345/NTNU202201584 |
| Document Type | Academic thesis |
With the development of deep learning, speech enhancement (SE) techniques have become effective across a wide range of noise conditions. This thesis studies two sub-topics of speech enhancement: front-end pre-processing for automatic speech recognition (ASR), and online, streaming real-time speech enhancement. The primary goal of speech enhancement research is to improve perceptual quality; however, when a model trained to maximize perceptual quality is used as a pre-processor, it can distort the acoustic features that downstream tasks rely on, so the results may fall short of expectations or even degrade recognition. Beyond preserving acoustic features, this thesis also examines how to exploit spectral information more effectively and how to raise the computational efficiency of real-time speech enhancement models.

This thesis proposes improvements to two models. The first is DCT-TENET, built on the Time-reversal Enhancement NETwork (TENET). It adjusts the speech-enhancement training procedure so that, while retaining a solid level of enhancement, the model is more effective as a front-end for speech recognition and reduces the need to train additional acoustic models. The second is the Adaptive Full-band and Sub-band fusion Network (Adaptive-FSN), which extends the idea that sub-band processing is effective at capturing local spectral patterns: an adaptive sub-band mechanism compresses the useful information in a wide range of adjacent frequency bands to improve speech quality, combined with further refinements that speed up computation.

Both methods are evaluated on the VoiceBank-DEMAND dataset. 1) Compared with the original TENET, DCT-TENET further improves the recognition accuracy of a downstream ASR system on noisy speech: the word error rate (WER) drops by a relative 7.9% on the DEMAND-noise test set when decoded with a clean-condition acoustic model, and by a relative 10.6% on an additional unseen-noise test set when decoded with a multi-condition acoustic model. 2) Adaptive-FSN outperforms the baseline FullSubNet+ on speech-quality metrics while reducing the real-time factor (RTF) on a CPU by a relative 44%.
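For readers unfamiliar with the reported figures, relative WER reduction and real-time factor (RTF) are simple ratios. The short sketch below shows how such numbers are typically computed; the baseline and improved values used here are made-up placeholders, not results from the thesis.

```python
# Illustrative only: how relative WER reduction and real-time factor (RTF)
# are commonly computed. The numeric values below are placeholders, not
# figures taken from the thesis.

def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of a metric (e.g. WER or RTF), in percent."""
    return (baseline - improved) / baseline * 100.0

def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1 means the system processes audio faster than real time."""
    return processing_seconds / audio_seconds

# Hypothetical example numbers:
wer_baseline, wer_enhanced = 10.0, 9.21        # percent WER
print(f"relative WER reduction: {relative_reduction(wer_baseline, wer_enhanced):.1f}%")

rtf_baseline, rtf_proposed = 0.50, 0.28        # seconds of compute per second of audio
print(f"relative RTF reduction: {relative_reduction(rtf_baseline, rtf_proposed):.1f}%")
```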
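DCT-TENET operates on discrete-cosine-transform spectra rather than the usual complex STFT, which yields a real-valued representation with no separate phase to estimate. As a rough illustration of what a short-time DCT front end looks like, the following is a minimal sketch assuming a 32 ms frame, 16 ms hop, and a Hann window; these settings are illustrative and not the thesis's actual configuration.

```python
# A minimal short-time DCT (STDCT) analysis sketch. Frame length, hop size,
# and windowing are assumptions for illustration only.
import numpy as np
from scipy.fft import dct

def stdct(signal: np.ndarray, sr: int = 16000,
          frame_ms: float = 32.0, hop_ms: float = 16.0) -> np.ndarray:
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    window = np.hanning(frame)
    n_frames = 1 + max(0, (len(signal) - frame) // hop)
    frames = np.stack([signal[i * hop: i * hop + frame] * window
                       for i in range(n_frames)])
    # Type-II DCT per frame: a real-valued spectrum, so enhancement can be
    # done without explicit phase estimation.
    return dct(frames, type=2, norm="ortho", axis=-1)

if __name__ == "__main__":
    x = np.random.randn(16000).astype(np.float32)  # 1 s of noise as a stand-in
    print(stdct(x).shape)                          # (frames, coefficients)
```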
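The sub-band idea behind FullSubNet-style models, and the adaptive variant proposed here, is that each frequency band is processed together with a small neighborhood of adjacent bands so a per-band model can see local spectral context. The sketch below shows that band-unfolding step in isolation; the neighborhood size and the reflect padding are assumptions for illustration, not the thesis's actual design.

```python
# Illustrative sub-band "unfolding": for every frequency band, gather its
# 2*N adjacent bands as local spectral context. N and the padding mode are
# assumptions made for this sketch.
import numpy as np

def unfold_subbands(mag: np.ndarray, neighbors: int = 15) -> np.ndarray:
    """mag: (freq_bins, time_frames) magnitude spectrogram.
    Returns an array of shape (freq_bins, 2*neighbors + 1, time_frames)."""
    padded = np.pad(mag, ((neighbors, neighbors), (0, 0)), mode="reflect")
    f_bins = mag.shape[0]
    return np.stack([padded[f:f + 2 * neighbors + 1] for f in range(f_bins)])

if __name__ == "__main__":
    spec = np.abs(np.random.randn(257, 100))   # fake 257-bin spectrogram
    sub = unfold_subbands(spec)
    print(sub.shape)                           # (257, 31, 100)
```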