
Graduate Student: Wu, Yu-Te (吳宥德)
Thesis Title: An Investigation of Multi-Instrument Automatic Music Transcription (多重樂器自動採譜之探討)
Advisors: Chen, Berlin (陳柏琳); Su, Li (蘇黎)
Degree: Master
Department: Department of Computer Science and Information Engineering
Year of Publication: 2020
Academic Year of Graduation: 108 (2019–2020)
Language: English
Number of Pages: 65
Chinese Keywords: 自動音樂採譜, 多音預測, 深度學習, 自注意力機制, 多音多樂器預測
English Keywords: automatic music transcription, multi-pitch estimation, multi-pitch streaming, deep learning, self-attention
DOI URL: http://doi.org/10.6345/NTNU202000964
Thesis Type: Academic thesis
Chinese Abstract:
Automatic music transcription (AMT) is one of the most important tasks in music information retrieval (MIR), and because of the complexity of its signals it has been regarded as one of the most challenging areas in signal processing. Among the many AMT tasks, multi-instrument transcription is a key step toward a general-purpose transcription system, yet little research has addressed it. A model must simultaneously identify multiple instruments and their corresponding pitches within a single piece of music; the diverse timbres of the different instruments and their rich harmonics can interfere with one another, creating a far more complicated situation, so multi-instrument transcription is a more advanced and complex problem than conventional single-instrument transcription. Beyond the inherent technical difficulties, integrating and coordinating transcription problems at different levels and handling their complex interactions also calls for a clearer and more explicit problem definition, as well as an effective evaluation methodology for the final results.

In this research, we propose a method for multi-instrument automatic transcription. By developing an end-to-end workflow that spans signal-level feature engineering through to the final evaluation, we integrate several techniques to better handle this complex problem. The approach combines signal processing techniques that make pitch features clearly salient, novel deep learning models, and concepts drawn from multi-object recognition, instance segmentation, and image-to-image translation in computer vision, further integrated with a newly developed post-processing algorithm. The proposed system is general, flexible, and efficient across all sub-tasks of multi-instrument transcription. Comprehensive evaluation on the different sub-tasks shows the best results to date on every metric, including the multi-instrument note-level transcription task, which had never been studied before.

English Abstract:
Automatic music transcription (AMT), one of the most important tasks in music information retrieval (MIR), has been regarded as one of the most challenging fields in signal processing because of the inherent complexity of its signals. Among the many AMT tasks, multi-instrument transcription is a critical step toward a general transcription system, yet it remains a little-investigated field. It requires identifying multiple instruments and their corresponding pitches in music performances, in which various timbres and rich harmonic information can interfere with each other, making it a more advanced problem than the conventional single-instrument AMT problem. Beyond these technical difficulties, orchestrating the different levels of this complex problem also calls for a clear definition of the problem scenarios and efficient evaluation approaches.
In this research, we propose a multi-instrument AMT approach with a complete end-to-end flow from signal-level feature engineering to the final evaluation. Combining signal processing techniques capable of highlighting pitch saliency, novel deep learning methods, and concepts inspired by multi-object recognition, instance segmentation, and image-to-image translation in computer vision, and integrating them with a newly developed post-processing algorithm, the proposed system is flexible and efficient for all the sub-tasks in multi-instrument AMT. Comprehensive evaluations on the different sub-tasks show state-of-the-art performance, including on the task of multi-instrument note tracking, which has not been investigated before.
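As a rough illustration of the end-to-end flow described above, the sketch below shows a heavily simplified pipeline of the same shape: a time-frequency feature stage (a plain STFT standing in for the CFP and harmonic features), a stubbed frame-level pitch-activation model standing in for the deep network, and a post-processing step that binarizes the activations and merges consecutive active frames into note events. All function names, the STFT stand-in, the random stub model, the 0.5 threshold, and the minimum note length are illustrative assumptions, not the thesis's actual implementation.

```python
# Minimal sketch of a frame-to-note transcription pipeline.
# Everything here is an illustrative assumption, not the thesis's code.

import numpy as np

def stft_magnitude(audio, sr, n_fft=2048, hop=512):
    """Magnitude spectrogram as a stand-in for the richer CFP/harmonic features."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (frames, bins)

def frame_level_model(features, n_pitches=88):
    """Placeholder for the deep model: per-frame pitch activations in [0, 1]."""
    rng = np.random.default_rng(0)
    return rng.random((features.shape[0], n_pitches))

def frames_to_notes(activations, threshold=0.5, hop=512, sr=44100, min_frames=3):
    """Post-processing: binarize activations and merge consecutive frames into notes."""
    active = activations >= threshold
    notes = []
    for pitch in range(active.shape[1]):
        onset = None
        for t in range(active.shape[0]):
            if active[t, pitch] and onset is None:
                onset = t
            elif not active[t, pitch] and onset is not None:
                if t - onset >= min_frames:
                    notes.append((pitch, onset * hop / sr, t * hop / sr))
                onset = None
        if onset is not None and active.shape[0] - onset >= min_frames:
            notes.append((pitch, onset * hop / sr, active.shape[0] * hop / sr))
    return notes  # list of (pitch index, onset seconds, offset seconds)

if __name__ == "__main__":
    sr = 44100
    audio = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)  # 1 s of A4 as dummy input
    feats = stft_magnitude(audio, sr)
    acts = frame_level_model(feats)
    print(len(frames_to_notes(acts)), "note events")
```

In the actual system, the stub model would be replaced by the trained network, and the resulting note events would be scored with the frame-level and note-level evaluation metrics discussed in the thesis.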

Table of Contents:
1. Introduction
   1.1. Background and Motivation
   1.2. Problem Scenarios
   1.3. Arrangement of the Thesis
2. Related Work
   2.1. Background of AMT
   2.2. Era of Deep Learning
        2.2.1. Dealing with Long-Term Sequences
   2.3. Data Representations
3. Method
   3.1. Data Representations
        3.1.1. CFP Representation
        3.1.2. Harmonic Representation
   3.2. Model
   3.3. Label Smoothing
   3.4. Post-Processing
4. Experiments
   4.1. Settings
   4.2. Datasets
        4.2.1. Single-instrument Datasets
        4.2.2. Multi-instrument Datasets
   4.3. Training
   4.4. Evaluation Metrics
5. Experiment Results
   5.1. CFP Feature Comparison
   5.2. Harmonic Feature Comparison
   5.3. MPE and NT with Different Models
   5.4. MPS and NS with Different Models
   5.5. MPS and NS Instrument-level Evaluation
   5.6. Confusion Matrix Analysis
   5.7. Effect of Post-processing
   5.8. Illustration
6. Discussion
7. Conclusion and Future Work
References

