
Author: 施俊良 (Shih, Chun-Liang)
Title: 語者自動分段標記與後修正技術之研究 (A Study on the Development of Speaker Diarization and Post-Correction Techniques)
Advisor: 陳柏琳 (Chen, Berlin)
Oral Examination Committee: 陳柏琳 (Chen, Berlin); 洪志偉 (Hung, Jeih-weih); 江振宇 (Chiang, Chen-Yu)
Oral Defense Date: 2024/01/20
Degree: Master
Department: 資訊工程學系 (Department of Computer Science and Information Engineering)
Publication Year: 2024
Graduation Academic Year: 112
Language: Chinese
Pages: 41
Keywords (Chinese): 語者自動分段標記、端對端、語音活性檢測、後修正、重疊語音檢測
Keywords (English): Speaker Diarization, End-to-End, Speech Activity Detection, Post-Correction, Overlapped Speech Detection
Research Method: Experimental Design
DOI URL: http://doi.org/10.6345/NTNU202400423
Thesis Type: Academic Thesis
    Speech recognition can be applied in many fields, and researchers have found that segmenting audio by speaker and recognizing each speaker's speech separately yields better results; speaker recognition has therefore attracted growing attention. Within this area, speaker diarization is the ultimate goal: the task is to determine "who spoke when", that is, to automatically segment audio and label which segments belong to the same speaker even when little speaker or speech information is available. The task applies to many scenarios in which multiple speakers are present, such as meetings, live broadcasts, and telephone recordings. Although recent technical advances have produced a variety of methods and model architectures in this field, recognition results still tend to be erroneous on overlapping speech, so researchers have proposed various post-correction methods for handling overlapped speech segments.
    In this thesis, the LibriSpeech corpus is used, with simulated room impulse responses and MUSAN noise mixed in to approximate real environments. The experiments primarily use an end-to-end architecture as the pre-trained model, combined with the DiaCorrect post-correction method. Because the noisier training set leads to higher error rates in the pre-training results, the post-correction experiments retain more of the training data for training the decoder model, and results obtained with different amounts of training data are compared. In addition, the Pyannote 2.1 model, which performs remarkably well across corpora, is used as a pre-trained model; combining Pyannote 2.1 with DiaCorrect improves its weaker cases and thereby yields better speaker diarization results.
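    To make the data-synthesis step above concrete, the following is a minimal sketch of how reverberation and noise can be mixed into clean speech; the file paths and the 10 dB signal-to-noise ratio are placeholder assumptions, not the thesis' actual recipe.

    import numpy as np
    import soundfile as sf
    from scipy.signal import fftconvolve

    # Load clean LibriSpeech speech, a simulated room impulse response (RIR),
    # and a MUSAN noise sample (all paths are hypothetical, assumed mono).
    speech, sr = sf.read("librispeech_utterance.wav")
    rir, _ = sf.read("simulated_rir.wav")
    noise, _ = sf.read("musan_noise.wav")

    # Add reverberation by convolving the clean speech with the RIR.
    reverberant = fftconvolve(speech, rir)[: len(speech)]

    # Tile or trim the noise so it covers the whole utterance.
    noise = np.resize(noise, len(reverberant))

    # Scale the noise to a target SNR (10 dB here, an arbitrary choice).
    target_snr_db = 10.0
    scale = np.sqrt(np.mean(reverberant**2) / (np.mean(noise**2) * 10 ** (target_snr_db / 10)))

    sf.write("mixture.wav", reverberant + scale * noise, sr)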
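    DiaCorrect itself trains an error-correction back-end on top of a pre-trained diarization model; the toy PyTorch module below only illustrates the general idea of refining initial per-frame speaker activities with acoustic features, and is not the authors' architecture.

    import torch
    import torch.nn as nn

    class ToyCorrectionBackend(nn.Module):
        """Toy illustration of diarization post-correction (not DiaCorrect):
        refine initial speaker-activity estimates using acoustic features."""
        def __init__(self, feat_dim=80, n_speakers=2, hidden=128):
            super().__init__()
            self.rnn = nn.LSTM(feat_dim + n_speakers, hidden,
                               batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, n_speakers)

        def forward(self, feats, init_activities):
            # feats: (batch, frames, feat_dim) acoustic features
            # init_activities: (batch, frames, n_speakers) from the pre-trained model
            x = torch.cat([feats, init_activities], dim=-1)
            h, _ = self.rnn(x)
            # Corrected per-frame speaker-activity probabilities.
            return torch.sigmoid(self.out(h))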
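    For reference, the pre-trained Pyannote 2.1 pipeline mentioned above can be run as follows; the audio path and the Hugging Face access token are placeholders.

    from pyannote.audio import Pipeline

    # Load the pretrained Pyannote 2.1 diarization pipeline from Hugging Face.
    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization@2.1",
                                        use_auth_token="HF_TOKEN")  # placeholder token

    # Diarize a recording and print "who spoke when".
    diarization = pipeline("audio.wav")
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:.1f}s-{turn.end:.1f}s: {speaker}")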

    Chapter 1: Introduction
        1.1 Research Background
        1.2 Historical Development
        1.3 Speaker Diarization Architectures
            1.3.1 Traditional Multi-Stage Methods
            1.3.2 End-to-End Speaker Diarization Methods
        1.4 Research Motivation
        1.5 Contributions
            1.5.1 DiaCorrect
            1.5.2 Pyannote 2.1
        1.6 Thesis Organization
    Chapter 2: Literature Review
        2.1 Traditional Multi-Stage Methods
            2.1.1 i-vector and PLDA-Based Speaker Diarization
            2.1.2 d-vector-Based Speaker Diarization
            2.1.3 x-vector-Based Speaker Diarization
            2.1.4 Pyannote
        2.2 End-to-End Methods
            2.2.1 End-to-End Neural Diarization
            2.2.2 Permutation-Invariant Training
            2.2.3 Pyannote 2.1
            2.2.4 Encoder-Decoder Attractors
            2.2.5 Overlap-Aware
            2.2.6 DiaCorrect
            2.2.7 Unlimited Number of Speakers
        2.3 Other Methods
            2.3.1 Meta-Learning with Inputs of Different Lengths
    Chapter 3: Experimental Methods
        3.1 Experimental Corpora
            3.1.1 LibriSpeech Corpus
            3.1.2 Augmented Multi-party Interaction (AMI) Meeting Corpus
            3.1.3 Data Synthesis
        3.2 Experimental Design
        3.3 Evaluation Metrics
    Chapter 4: Experimental Results
        4.1 DiaCorrect
        4.2 Pyannote 2.1
    Chapter 5: Conclusions and Future Work
    References
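    The evaluation in Section 3.3 presumably reports the diarization error rate (DER), i.e., the sum of false-alarm, missed-speech, and speaker-confusion time divided by total speech time. Below is a minimal sketch using the pyannote.metrics toolkit cited in the references, with toy reference and hypothesis segments rather than real system output.

    from pyannote.core import Annotation, Segment
    from pyannote.metrics.diarization import DiarizationErrorRate

    # Ground-truth speaker turns (toy example).
    reference = Annotation()
    reference[Segment(0.0, 5.0)] = "spk1"
    reference[Segment(5.0, 9.0)] = "spk2"

    # System output; labels need not match, an optimal mapping is found internally.
    hypothesis = Annotation()
    hypothesis[Segment(0.0, 4.5)] = "A"
    hypothesis[Segment(4.5, 9.0)] = "B"

    print(f"DER = {DiarizationErrorRate()(reference, hypothesis):.2%}")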

    T. Kinnunen and H. Li, "An Overview of Text-Independent Speaker Recognition: from Features to Supervectors," Speech Communication, vol. 52, pp. 12-40, 2010.
    Z. Bai and X.-L. Zhang, "Speaker Recognition Based on Deep Learning: An Overview," Neural Networks, vol. 140, pp. 65-99, August 2021.
    J. S. Chung, J. Huh, S. Mun, M. Lee, H. S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee and I. Han, "In defence of metric learning for speaker recognition," arXiv preprint arXiv:2003.11982, 2020.
    Y. Wang, X. Fan, I.-F. Chen, Y. Liu, T. Chen and B. Hoffmeister, "End-to-end Anchored Speech Recognition," 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
    S. Sigtia, E. Marchi, S. Kajarekar, D. Naik and J. Bridle, "Multi-task Learning for Speaker Verification and Voice Trigger Detection," 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
    F. A. R. R. Chowdhury, Q. Wang, I. L. Moreno and L. Wan, "Attention-Based Models for Text-Dependent Speaker Verification," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
    D. Snyder, D. Garcia-Romero, D. Povey and S. Khudanpur, "Deep Neural Network Embeddings for Text-Independent Speaker Verification," Interspeech, 2017.
    C. Zhang, K. Koishida and J. H. L. Hansen, "Text-Independent Speaker Verification Based on Triplet Convolutional Neural Network Embeddings," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, pp. 1633-1644, 2018.
    S. Haykin and Z. Chen, "The Cocktail Party Problem," Neural Computation, vol. 17, pp. 1875-1902, 2005.
    Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu and S. Watanabe, "End-to-End Neural Speaker Diarization with Self-Attention," ASRU, pp. 296-303, 2019.
    S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue and K. Nagamatsu, "End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors," arXiv preprint arXiv:2005.09921, 2020.
    J. Han, F. Landini, J. Rohdin, M. Diez, L. Burget, Y. Cao, H. Lu and J. Cernocky, "DiaCorrect: Error Correction Back-end for Speaker Diarization," arXiv preprint arXiv:2309.08377, 2023.
    J. G. Fiscus, J. Ajot and J. S. Garofolo, "The rich transcription 2007 meeting recognition evaluation," in Multimodal Technologies for Perception of Humans, Springer, 2007, pp. 373-389.
    V. Panayotov, G. Chen, D. Povey and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
    S. Pruzansky and M. V. Mathews, "Talker-Recognition Procedure Based on Analysis of Variance," The Journal of the Acoustical Society of America, vol. 36, pp. 2041-2047, 1964.
    X. Huang, A. Acero, H.-W. Hon and R. Reddy, "Spoken Language Processing: A Guide to Theory, Algorithm, and System Development," Prentice Hall PTR, 2001.
    S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, pp. 357-366, 1980.
    H. Gish, M.-H. Siu and R. Rohlicek, "Segregation of speakers for speech recognition and speaker identification," 1991 International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1991.
    S. Chen, P. Gopalakrishnan and I. Watson, "Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion," Proc. DARPA Broadcast News Transcription and Understanding Workshop, 1998.
    S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, pp. 254-272, 1981.
    A. Gersho and R. M. Gray, "Vector Quantization and Signal Compression," The Springer International Series in Engineering and Computer Science, vol. 159, 2012.
    D. Reynolds, T. Quatieri and R. B. Dunn, "Speaker Verification Using Adapted Gaussian Mixture Models," Digital Signal Processing, vol. 10, pp. 19-41, 2000.
    D. Vijayasenan, F. Valente and H. Bourlard, "An Information Theoretic Approach to Speaker Diarization of Meeting Data," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, pp. 1382-1393, 2009.
    F. Valente, P. Motlicek and D. Vijayasenan, "Variational Bayesian speaker diarization of meeting recordings," 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010.
    P. Kenny, G. Boulianne, P. Ouellet and P. Dumouchel, "Joint Factor Analysis Versus Eigenchannels in Speaker Recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, pp. 1435-1447, 2007.
    N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel and P. Ouellet, "Front-End Factor Analysis for Speaker Verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, pp. 788-798, 2010.
    D. Snyder, D. Garcia-Romero, G. Sell, D. Povey and S. Khudanpur, "X-vectors: Robust DNN Embeddings for Speaker Recognition," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
    E. Variani, X. Lei, E. McDermott, I. L. Moreno and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
    D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding," Technical Report, Stanford University, 2006.
    A. Ng, M. Jordan and Y. Weiss, "On spectral clustering: Analysis and an algorithm," Advances in Neural Information Processing Systems, vol. 14, pp. 849-856, 2002.
    D. Müllner, "Modern hierarchical, agglomerative clustering algorithms," arXiv preprint arXiv:1109.2378, 2011.
    H. Bredin, R. Yin, J. M. Coria, G. Gelly, P. Korshunov, M. Lavechin, D. Fustes, H. Titeux, W. Bouaziz and M.-P. Gill, "Pyannote.Audio: Neural Building Blocks for Speaker Diarization," ICASSP, 2020.
    S. J. Prince and J. H. Elder, "Probabilistic Linear Discriminant Analysis for Inferences About Identity," 2007 IEEE 11th International Conference on Computer Vision, 2007.
    P. Kenny, T. Stafylakis, P. Ouellet, M. J. Alam and P. Dumouchel, "PLDA for speaker verification with utterances of arbitrary duration," 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
    H. Sak, A. Senior and F. Beaufays, "Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition," arXiv preprint arXiv:1402.1128, 2014.
    Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu and S. Watanabe, "End-to-end neural speaker diarization with permutation-free objectives," arXiv preprint arXiv:1909.05952, 2019.
    E. Han, C. Lee and A. Stolcke, "BW-EDA-EEND: Streaming End-to-End Neural Speaker Diarization for a Variable Number of Speakers," ICASSP, 2021.
    S. Horiguchi, P. Garcia, Y. Fujita, S. Watanabe and K. Nagamatsu, "End-to-End Speaker Diarization as Post-Processing," arXiv:2012.10055, 2020.
    D. Yu, M. Kolbæk, Z.-H. Tan and J. Jensen, "Permutation Invariant Training of Deep Models for Speaker-Independent Multi-talker Speech Separation," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
    H. Bredin and A. Laurent, "End-to-end speaker segmentation for overlap-aware resegmentation," arXiv:2104.04045, 2021.
    H. Bredin, "pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe," Interspeech, 2023.
    S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue and P. Garcia, "Encoder-Decoder Based Attractors for End-to-End Neural Diarization," arXiv preprint arXiv:2106.10654, 2021.
    J. M. Coria, H. Bredin, S. Ghannay and S. Rosset, "Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation," ASRU, 2021.
    N. Kanda, X. Xiao, Y. Gaur, X. Wang, Z. Meng, Z. Chen and T. Yoshioka, "Transcribe-to-Diarize: Neural Speaker Diarization for Unlimited Number of Speakers using End-to-End Speaker-Attributed ASR," ICASSP, 2022.
    S. M. Kye, Y. Jung, H. B. Lee, S. J. Hwang and H. Kim, "Meta-Learning for Short Utterance Speaker Recognition with Imbalance Length Pairs," Interspeech, 2020.
    I. McCowan, J. Carletta, W. Kraaij, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos and others, "The AMI meeting corpus," in Proceedings of the 5th International Conference on Methods and Techniques in Behavioral Research, 2005.
    F. Landini, J. Profant, M. Diez and L. Burget, "Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks," arXiv preprint arXiv:2012.14952, 2020.
    T. Ko, V. Peddinti, D. Povey, M. L. Seltzer and S. Khudanpur, "A study on data augmentation of reverberant speech for robust speech recognition," 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
    D. Snyder, G. Chen and D. Povey, "MUSAN: A Music, Speech, and Noise Corpus," arXiv preprint arXiv:1510.08484, 2015.
    H. Bredin, "pyannote.metrics: a toolkit for reproducible evaluation," Interspeech, 2017.
