
Graduate student: 鄭宇森 (Cheng, Yu-Sen)
Thesis title: 語者自動分段標記之改進方法 (Improved Methods for Speaker Diarization)
Advisor: 陳柏琳 (Chen, Berlin)
Oral defense committee: 陳冠宇, 曾厚強, 陳柏琳
Defense date: 2021/10/14
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of publication: 2021
Academic year of graduation: 109 (2020–2021)
Language: Chinese
Number of pages: 66
Chinese keywords: 語者自動分段標記 (speaker diarization), 自動語音辨識 (automatic speech recognition), 端對端 (end-to-end), 音素後驗圖 (phone posteriorgrams), 語音活性檢測 (speech activity detection)
English keywords: Speaker Diarization, End-to-End, Phone Posteriorgrams, Speech Activity Detection, DOVER-Lap
Research methods: experimental design, survey research, comparative research
DOI URL: http://doi.org/10.6345/NTNU202101736
Document type: Academic thesis
    Speaker diarization aims to automatically partition speech into segments belonging to different speakers and to determine which segments come from the same speaker, without any prior information. The task applies to settings in which several speakers converse within one recording, such as meetings, online classes, and live broadcasts, labeling each stretch of speech with the speaker it belongs to. In recent years, the growth in computing power brought by hardware advances has produced breakthrough gains on this task. However, current speaker diarization systems still fail to perform well on long, highly overlapped corpora such as the Augmented Multi-Party Interaction (AMI) corpus.
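
    Concretely, a diarization system's output is a list of speaker-attributed time segments, commonly exchanged in the NIST RTTM format; a hypothetical two-speaker excerpt (file name and times are illustrative, not from the thesis) looks like:

        SPEAKER meeting1 1 12.50 3.20 <NA> <NA> spk0 <NA> <NA>
        SPEAKER meeting1 1 15.70 1.10 <NA> <NA> spk1 <NA> <NA>
        SPEAKER meeting1 1 16.30 2.40 <NA> <NA> spk0 <NA> <NA>

    Each line names the recording and gives a segment's onset and duration in seconds followed by the speaker label; overlapped speech simply appears as overlapping segments, as in the last two lines.
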
    In view of this, this thesis studies the two major speaker diarization architectures, stage-wise and end-to-end, on the AMI corpus, and proposes corresponding improvements for each. For the stage-wise architecture, considering that speakers differ in their habitual word usage and that this affects speaker recognition, phone posteriorgrams generated by automatic speech recognition (ASR) are adopted as auxiliary information alongside the speaker features, improving the system's ability to distinguish speakers. For the end-to-end architecture, whereas the network in a traditional stage-wise system only needs to process short segments, an end-to-end network must analyze the entire recording at once; this burdens the model and degrades its performance on long recordings. Moreover, unlike the stage-wise architecture, the end-to-end architecture must additionally decide whether each stretch of audio is non-speech, and it is easily misled by noise, which sharply increases the difficulty of identification. This study therefore follows the stage-wise architecture and introduces oracle speech activity detection (SAD) to reduce the network's burden in distinguishing speakers and to shorten the audio it must model. Finally, DOVER-Lap (Diarization Output Voting Error Reduction with overlap) is used to fuse multiple results from the two architectures, combining the accuracy of the stage-wise architecture with the end-to-end architecture's ability to handle overlapped speech to obtain better performance.
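
    To make the stage-wise improvement concrete, the following minimal sketch (Python/NumPy; the names, shapes, and fusion-by-concatenation scheme are illustrative assumptions, not the thesis implementation) augments segment-level speaker embeddings with time-averaged ASR phone posteriorgrams before clustering:

        import numpy as np
        from sklearn.cluster import AgglomerativeClustering

        def augment_with_phone_posteriors(embeddings, posteriorgrams, weight=0.5):
            """Append pooled phone posteriors to speaker embeddings.

            embeddings:     (num_segments, emb_dim) segment-level speaker vectors
            posteriorgrams: list of (num_frames_i, num_phones) ASR posteriorgrams
            """
            # One fixed-length phonetic summary per segment: average the
            # frame-level phone posteriors over time.
            pooled = np.stack([p.mean(axis=0) for p in posteriorgrams])
            # Scale the auxiliary phonetic evidence and concatenate it
            # with the speaker embeddings.
            return np.hstack([embeddings, weight * pooled])

        # Hypothetical data: 200 segments, 256-dim embeddings, 40 phones.
        rng = np.random.default_rng(0)
        embs = rng.normal(size=(200, 256))
        posts = [rng.dirichlet(np.ones(40), size=50) for _ in range(200)]
        labels = AgglomerativeClustering(n_clusters=4).fit_predict(
            augment_with_phone_posteriors(embs, posts))

    The intuition is that the pooled posteriors summarize a speaker's habitual phone usage, so segments from the same speaker stay close in both the embedding subspace and the phonetic subspace.
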

    Speaker diarization solves the "who spoke when" problem by partitioning an audio file into homogeneous segments according to speaker identity. It can be applied to conversations in which multiple speakers are present at the same time. However, current diarization systems still cannot achieve good performance on long, highly overlapped corpora such as the Augmented Multi-Party Interaction (AMI) corpus.
    This thesis studies the two major diarization architectures, stage-wise and end-to-end (E2E), on the AMI corpus, and proposes improvements for each. For the stage-wise architecture, considering the influence of speakers' individual word usage on speaker recognition, phone posteriorgrams generated by automatic speech recognition are used as auxiliary information, improving the system's ability to distinguish speakers. Unlike the stage-wise architecture, which processes only short segments, the E2E architecture must analyze the entire audio at once; this burdens the model and degrades its performance on long recordings. In addition, the E2E architecture must additionally judge whether an audio segment contains speech, a judgment easily disturbed by noise that sharply increases the difficulty of identification. Therefore, this study follows the stage-wise architecture and introduces oracle speech activity detection, reducing the burden on the neural network when distinguishing speakers. Finally, the DOVER-Lap mechanism is used to fuse multiple experimental results from the two architectures; combining the accuracy of the stage-wise architecture with the E2E architecture's ability to handle overlapping speech yields a markedly better DER of 16.9%.
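
    As a rough illustration of the oracle-SAD step, the sketch below (Python; the interface and names are assumptions for illustration, not the thesis code) keeps only the frames that fall inside reference speech segments, shortening the audio the E2E network must model:

        import numpy as np

        def apply_oracle_sad(features, speech_segments, frame_shift=0.01):
            """Keep only frames inside reference speech segments.

            features:        (num_frames, feat_dim) acoustic features
            speech_segments: list of (start_sec, end_sec) pairs taken from
                             the reference annotation, i.e. oracle SAD
            """
            keep = np.zeros(len(features), dtype=bool)
            for start, end in speech_segments:
                keep[int(start / frame_shift):int(end / frame_shift)] = True
            return features[keep]

        # Hypothetical example: 60 s of 10 ms frames, two speech regions.
        feats = np.random.randn(6000, 23)
        trimmed = apply_oracle_sad(feats, [(0.5, 12.0), (20.0, 55.0)])

    For reference, the diarization error rate (DER) quoted above is the fraction of total speech time that is missed, falsely detected as speech, or attributed to the wrong speaker.
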

    Chapter 1  Introduction
      1.1  Research Background
      1.2  Historical Development
      1.3  Contributions
      1.4  Thesis Organization
    Chapter 2  Speaker Diarization Architectures
      2.1  Stage-wise Speaker Diarization
      2.2  End-to-End Speaker Diarization
        2.2.1  Permutation-Invariant Training
        2.2.2  Deep Clustering Loss
        2.2.3  Attractors
    Chapter 3  Related Techniques and Research
      3.1  i-vector
      3.2  Region Proposal Networks
      3.3  Unbounded Interleaved-State Recurrent Neural Network
      3.4  Target-Speaker Voice Activity Detection
    Chapter 4  Research Methods
      4.1  Agglomerative Hierarchical Clustering
      4.2  Spectral Clustering
      4.3  Variational Bayesian Hidden Markov Model
      4.4  End-to-End Speaker Diarization with Speech Activity Detection
      4.5  Speaker Diarization Using Phone Posteriorgrams
      4.6  Overlap-Aware Diarization Output Voting (DOVER-Lap)
    Chapter 5  Experimental Setup
      5.1  Corpora
        5.1.1  Augmented Multi-Party Interaction (AMI) Corpus
        5.1.2  LibriSpeech Corpus
        5.1.3  Data Synthesis
      5.2  System Architecture
        5.2.1  Stage-wise Experiments
        5.2.2  Speech Activity Detection Applied to the End-to-End Architecture
        5.2.3  Phone Auxiliary Information Experiments
    Chapter 6  Experimental Results and Analysis
      6.1  Evaluation Method
      6.2  Experimental Results
        6.2.1  AMI Headset-Mix Baseline
        6.2.2  AMI Beamformed Baseline
        6.2.3  End-to-End System Improvements
        6.2.4  Phone Posteriorgram Experiments
        6.2.5  System Fusion Results
        6.2.6  Overall Assessment
    Chapter 7  Conclusions and Future Work
    References

    [1] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Communication, vol. 52, p. 12–40, 2010.
    [2] Z. Bai and X.-L. Zhang, "Speaker recognition based on deep learning: An overview," Neural Networks, 2021.
    [3] T. J. Park, N. Kanda, D. Dimitriadis, K. J. Han, S. Watanabe and S. Narayanan, "A review of speaker diarization: Recent advances with deep learning," arXiv preprint arXiv:2101.09624, 2021.
    [4] S. Haykin and Z. Chen, "The cocktail party problem," Neural Computation, vol. 17, p. 1875–1902, 2005.
    [5] J. S. Chung, J. Huh, S. Mun, M. Lee, H. S. Heo, S. Choe, C. Ham, S. Jung, B.-J. Lee and I. Han, "In defence of metric learning for speaker recognition," arXiv preprint arXiv:2003.11982, 2020.
    [6] F. R. rahman Chowdhury, Q. Wang, I. L. Moreno and L. Wan, "Attention-based models for text-dependent speaker verification," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
    [7] D. Snyder, D. Garcia-Romero, D. Povey and S. Khudanpur, "Deep neural network embeddings for text-independent speaker verification," in Interspeech, 2017.
    [8] C. Zhang, K. Koishida and J. H. L. Hansen, "Text-independent speaker verification based on triplet convolutional neural network embeddings," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, p. 1633–1644, 2018.
    [9] Y. Wang, X. Fan, I.-F. Chen, Y. Liu, T. Chen and B. Hoffmeister, "End-to-end anchored speech recognition," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
    [10] S. Sigtia, E. Marchi, S. Kajarekar, D. Naik and J. Bridle, "Multi-task learning for speaker verification and voice trigger detection," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
    [11] T. Drugman, Y. Stylianou, Y. Kida and M. Akamine, "Voice activity detection: Merging source and filter-based information," IEEE Signal Processing Letters, vol. 23, p. 252–256, 2015.
    [12] L. Bullock, H. Bredin and L. P. Garcia-Perera, "Overlap-aware diarization: Resegmentation using neural end-to-end overlapped speech detection," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
    [13] D. Raj, Z. Huang and S. Khudanpur, "Multi-class spectral clustering with overlaps for speaker diarization," in 2021 IEEE Spoken Language Technology Workshop (SLT), 2021.
    [14] Y. LeCun, Y. Bengio and G. Hinton, "Deep learning," Nature, vol. 521, p. 436–444, 2015.
    [15] Y. Fujita, N. Kanda, S. Horiguchi, K. Nagamatsu and S. Watanabe, "End-to-end neural speaker diarization with permutation-free objectives," in Interspeech, 2019, p. 4300–4304.
    [16] A. McCree, G. Sell and D. Garcia-Romero, "Speaker Diarization Using Leave-One-Out Gaussian PLDA Clustering of DNN Embeddings," in Interspeech, 2019.
    [17] K. Karra and A. McCree, "Speaker Diarization using Two-pass Leave-One-Out Gaussian PLDA Clustering of DNN Embeddings," arXiv preprint arXiv:2104.02469, 2021.
    [18] T. J. Park and P. Georgiou, "Multimodal speaker segmentation and diarization using lexical and acoustic cues via sequence to sequence neural networks," arXiv preprint arXiv:1805.10731, 2018.
    [19] T. J. Park, K. J. Han, J. Huang, X. He, B. Zhou, P. Georgiou and S. Narayanan, "Speaker diarization with lexical information," arXiv preprint arXiv:2004.06756, 2020.
    [20] N. Flemotomos, P. Georgiou and S. Narayanan, "Linguistically aided speaker diarization using speaker role information," arXiv preprint arXiv:1911.07994, 2019.
    [21] I. McCowan, J. Carletta, W. Kraaij, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, J. Kadlec, V. Karaiskos and others, "The AMI meeting corpus," in Proceedings of the 5th International Conference on Methods and Techniques in Behavioral Research, 2005.
    [22] F. Landini, J. Profant, M. Diez and L. Burget, "Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks," arXiv preprint arXiv:2012.14952, 2020.
    [23] D. Raj, L. P. Garcia-Perera, Z. Huang, S. Watanabe, D. Povey, A. Stolcke and S. Khudanpur, "DOVER-Lap: A method for combining overlap-aware diarization outputs," in 2021 IEEE Spoken Language Technology Workshop (SLT), 2021.
    [24] "2000 NIST Speaker Recognition Evaluation Kernel Description," [Online]. Available: https://catalog.ldc. upenn.edu/LDC2001S97.
    [25] S. Pruzansky and M. V. Mathews, "Talker-recognition procedure based on analysis of variance," The Journal of the Acoustical Society of America, vol. 36, p. 2041–2047, 1964.
    [26] X. Huang, A. Acero, H.-W. Hon and R. Reddy, Spoken language processing: A guide to theory, algorithm, and system development, Prentice Hall PTR, 2001.
    [27] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech," The Journal of the Acoustical Society of America, vol. 87, p. 1738–1752, 1990.
    [28] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, p. 357–366, 1980.
    [29] A. Gersho and R. M. Gray, Vector quantization and signal compression, vol. 159, Springer Science & Business Media, 2012.
    [30] S. Furui, "Cepstral analysis technique for automatic speaker verification," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 29, p. 254–272, 1981.
    [31] H. Gish, M.-H. Siu and J. R. Rohlicek, "Segregation of speakers for speech recognition and speaker identification," in ICASSP, 1991.
    [32] S. Chen, P. Gopalakrishnan and others, "Speaker, environment and channel change detection and clustering via the Bayesian information criterion," in Proc. DARPA Broadcast News Transcription and Understanding Workshop, 1998.
    [33] D. A. Reynolds, T. F. Quatieri and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, p. 19–41, 2000.
    [34] W. M. Campbell, D. E. Sturim and D. A. Reynolds, "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Processing Letters, vol. 13, p. 308–311, 2006.
    [35] P. Kenny, G. Boulianne, P. Ouellet and P. Dumouchel, "Joint factor analysis versus eigenchannels in speaker recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, p. 1435–1447, 2007.
    [36] D. Vijayasenan, F. Valente and H. Bourlard, "An information theoretic approach to speaker diarization of meeting data," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, p. 1382–1393, 2009.
    [37] F. Valente, P. Motlicek and D. Vijayasenan, "Variational Bayesian speaker diarization of meeting recordings," in 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 2010.
    [38] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, p. 788–798, 2010.
    [39] G. Sell and D. Garcia-Romero, "Speaker diarization with PLDA i-vector scoring and unsupervised calibration," in 2014 IEEE Spoken Language Technology Workshop (SLT), 2014.
    [40] J. Ajmera and C. Wooters, "A robust speaker clustering algorithm," in 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No. 03EX721), 2003.
    [41] S. E. Tranter and D. A. Reynolds, "Speaker diarisation for broadcast news," in Odyssey 2004: The Speaker and Language Recognition Workshop, 2004.
    [42] D. A. Reynolds and P. Torres-Carrasquillo, "Approaches and applications of audio diarization," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'05), 2005.
    [43] X. Zhu, C. Barras, S. Meignier and J.-L. Gauvain, "Combining speaker identification and BIC for speaker diarization," in Interspeech'05, 2005.
    [44] S. Meignier, D. Moraru, C. Fredouille, J.-F. Bonastre and L. Besacier, "Step-by-step and integrated approaches in broadcast news speaker diarization," Computer Speech & Language, vol. 20, p. 303–330, 2006.
    [45] A. E. Rosenberg, A. L. Gorin, Z. Liu and S. Parthasarathy, "Unsupervised speaker segmentation of telephone conversations," in INTERSPEECH, 2002.
    [46] D. Liu and F. Kubala, "A cross-channel modeling approach for automatic segmentation of conversational telephone speech [automatic speech recognition applications]," in 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No. 03EX721), 2003.
    [47] S. E. Tranter, K. Yu, G. Evermann and P. C. Woodland, "Generating and evaluating segmentations for automatic speech recognition of conversational telephone speech," in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004.
    [48] P. Kenny, D. Reynolds and F. Castaldo, "Diarization of telephone conversations using factor analysis," IEEE Journal of Selected Topics in Signal Processing, vol. 4, p. 1059–1070, 2010.
    [49] J. Ajmera, G. Lathoud and L. McCowan, "Clustering and segmenting speakers and their locations in meetings," in 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004.
    [50] Q. Jin and T. Schultz, "Speaker segmentation and clustering in meetings," in INTERSPEECH, 2004.
    [51] X. Anguera, C. Wooters and J. Hernando, "Purity algorithms for speaker diarization of meetings data," in 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, 2006.
    [52] D. A. van Leeuwen and M. Konečný, "Progress in the AMIDA speaker diarization system for meeting data," in Multimodal Technologies for Perception of Humans, Springer, 2007, p. 475–483.
    [53] E. Variani, X. Lei, E. McDermott, I. L. Moreno and J. Gonzalez-Dominguez, "Deep neural networks for small footprint text-dependent speaker verification," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
    [54] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey and S. Khudanpur, "X-vectors: Robust dnn embeddings for speaker recognition," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
    [55] Q. Lin, R. Yin, M. Li, H. Bredin and C. Barras, "LSTM based similarity measurement with spectral clustering for speaker diarization," arXiv preprint arXiv:1907.10393, 2019.
    [56] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu and S. Watanabe, "End-to-end neural speaker diarization with self-attention," in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019.
    [57] S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue and K. Nagamatsu, "End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors," arXiv preprint arXiv:2005.09921, 2020.
    [58] Y.-C. Liu, E. Han, C. Lee and A. Stolcke, "End-to-end neural diarization: From Transformer to Conformer," arXiv preprint arXiv:2106, 2021.
    [59] Y. Xue, S. Horiguchi, Y. Fujita, S. Watanabe, P. García and K. Nagamatsu, "Online end-to-end neural diarization with speaker-tracing buffer," in 2021 IEEE Spoken Language Technology Workshop (SLT), 2021.
    [60] H. Sak, A. Senior and F. Beaufays, "Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition," arXiv preprint arXiv:1402.1128, 2014.
    [61] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. J. Lang, "Phoneme recognition using time-delay neural networks," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, p. 328–339, 1989.
    [62] K. He, X. Zhang, S. Ren and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
    [63] A. Y. Ng, M. I. Jordan, Y. Weiss and others, "On spectral clustering: Analysis and an algorithm," Advances in neural information processing systems, vol. 2, p. 849–856, 2002.
    [64] U. Von Luxburg, "A tutorial on spectral clustering," Statistics and computing, vol. 17, p. 395–416, 2007.
    [65] D. Müllner, "Modern hierarchical, agglomerative clustering algorithms," arXiv preprint arXiv:1109.2378, 2011.
    [66] M. Diez, L. Burget, S. Wang, J. Rohdin and J. Černocký, "Bayesian HMM Based x-Vector Clustering for Speaker Diarization," in Interspeech, 2019.
    [67] J. G. Fiscus, J. Ajot and J. S. Garofolo, "The rich transcription 2007 meeting recognition evaluation," in Multimodal Technologies for Perception of Humans, Springer, 2007, p. 373–389.
    [68] A. Stolcke and T. Yoshioka, "DOVER: A method for combining diarization outputs," in 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019.
    [69] A. Zhang, Q. Wang, Z. Zhu, J. Paisley and C. Wang, "Fully supervised speaker diarization," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.
    [70] D. Yu, M. Kolbæk, Z.-H. Tan and J. Jensen, "Permutation invariant training of deep models for speaker-independent multi-talker speech separation," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
    [71] J. R. Hershey, Z. Chen, J. Le Roux and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
    [72] Z. Chen, Y. Luo and N. Mesgarani, "Deep attractor network for single-microphone speaker separation," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
    [73] Y. Luo, Z. Chen and N. Mesgarani, "Speaker-independent speech separation with deep attractor network," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, p. 787–796, 2018.
    [74] S. Ren, K. He, R. Girshick and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in neural information processing systems, vol. 28, p. 91–99, 2015.
    [75] Z. Huang, S. Watanabe, Y. Fujita, P. García, Y. Shao, D. Povey and S. Khudanpur, "Speaker diarization with region proposal network," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
    [76] D. Arthur and S. Vassilvitskii, "k-means++: The advantages of careful seeding," 2006.
    [77] I. Medennikov, M. Korenevsky, T. Prisyach, Y. Khokhlov, M. Korenevskaya, I. Sorokin, T. Timofeeva, A. Mitrofanov, A. Andrusenko, I. Podluzhny and others, "Target-speaker voice activity detection: a novel approach for multi-speaker diarization in a dinner party scenario," arXiv preprint arXiv:2005.07272, 2020.
    [78] S. Watanabe, M. Mandel, J. Barker, E. Vincent, A. Arora, X. Chang, S. Khudanpur, V. Manohar, D. Povey, D. Raj and others, "CHiME-6 challenge: Tackling multispeaker speech recognition for unsegmented recordings," arXiv preprint arXiv:2004.09249, 2020.
    [79] R. R. Sokal, "A statistical method for evaluating systematic relationships.," Univ. Kansas, Sci. Bull., vol. 38, p. 1409–1438, 1958.
    [80] D. A. Van Leeuwen and N. Brümmer, "The distribution of calibrated likelihood-ratios in speaker recognition," arXiv preprint arXiv:1304.1199, 2013.
    [81] N. Brümmer and D. Garcia-Romero, "Generative modelling for unsupervised score calibration," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
    [82] Q. Wang, C. Downey, L. Wan, P. A. Mansfield and I. L. Moreno, "Speaker diarization with LSTM," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
    [83] T. J. Park, K. J. Han, M. Kumar and S. Narayanan, "Auto-tuning spectral clustering for speaker diarization using normalized maximum eigengap," IEEE Signal Processing Letters, vol. 27, p. 381–385, 2019.
    [84] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz and others, "The Kaldi speech recognition toolkit," in IEEE 2011 workshop on automatic speech recognition and understanding, 2011.
    [85] M. Diez, A. Varona, M. Penagarikano, L. J. Rodriguez-Fuentes and G. Bordel, "Using phone log-likelihood ratios as features for speaker recognition," evaluation, vol. 3, p. 15, 2013.
    [86] M. Diez, A. Varona, M. Penagarikano, L. J. Rodriguez-Fuentes and G. Bordel, "On the complementarity of phone posterior probabilities for improved speaker recognition," IEEE Signal Processing Letters, vol. 21, p. 649–652, 2014.
    [87] S. Cumani, P. Laface and F. Kulsoom, "Speaker recognition by means of acoustic and phonetically informed GMMs," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
    [88] X. Chen and C. Bao, "Phoneme-unit-specific time-delay neural network for speaker verification," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, p. 1243–1255, 2021.
    [89] L. Rokach, "Ensemble-based classifiers," Artificial intelligence review, vol. 33, p. 1–39, 2010.
    [90] J. G. Fiscus, "A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER)," in 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, 1997.
    [91] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval research logistics quarterly, vol. 2, p. 83–97, 1955.
    [92] X. Anguera, C. Wooters and J. Hernando, "Acoustic beamforming for speaker diarization of meetings," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, p. 2011–2022, 2007.
    [93] V. Panayotov, G. Chen, D. Povey and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
    [94] A. Nagrani, J. S. Chung, W. Xie and A. Zisserman, "VoxCeleb: Large-scale speaker verification in the wild," Computer Speech & Language, vol. 60, p. 101027, 2020.
    [95] D. Snyder, G. Chen and D. Povey, "Musan: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484, 2015.
    [96] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin, "Attention is all you need," in Advances in neural information processing systems, 2017.
    [97] S.-H. Chiu, T.-H. Lo and B. Chen, "Cross-sentence Neural Language Models for Conversational Speech Recognition," in 2021 International Joint Conference on Neural Networks (IJCNN), 2021.
    [98] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang and S. Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Interspeech, 2016.
    [99] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi and S. Khudanpur, "Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks," in Interspeech, 2018.
