MRNet: A multi-route convolutional neural network for robust music representation learning

  • Jungwoo Heo
  • Hyun Seo Shin
  • Chan Yeong Lim
  • Kyo Won Koo
  • Seung Bin Kim
  • Jisoo Son
  • Ha Jin Yu

Research output: Contribution to journal › Article › peer-review

Abstract

Music Information Retrieval (MIR) focuses on extracting semantic information embedded in audio signals, such as genre, artist identity, and tempo. These musical cues cover a wide range of temporal characteristics, from short-term features like pitch and timbre to long-term patterns such as melody and mood, and they require processing at multiple levels of abstraction. In this paper, we propose a Multi-Route Neural Network (MRNet) designed to capture musical representations that reflect both short-term and long-term characteristics, as well as different levels of abstraction. To achieve this, MRNet stacks several convolutional layers with different dilation rates, allowing the model to analyze audio patterns over multiple time scales. Additionally, it introduces a specialized module called the multi-route Res2Block, which separates the processing path into multiple branches. Each branch processes the input to a different depth, enabling the network to extract low-level, mid-level, and high-level features simultaneously. MRNet achieves classification accuracies of 94.5%, 56.6%, 63.2%, and 71.3% on the GTZAN, FMA Small, FMA Large, and Melon datasets, respectively, outperforming previous Convolutional Neural Network (CNN)-based approaches. These results demonstrate the effectiveness of MRNet in learning robust and hierarchical music representations for MIR tasks.
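The abstract describes two ideas: stacked dilated convolutions for multiple time scales, and a multi-route block whose parallel branches run to different depths so that shallow (low-level) and deep (high-level) features are produced side by side. The paper itself is not reproduced here, so the sketch below is only an illustration of that general pattern, not the authors' actual multi-route Res2Block; the class name `MultiRouteBlock`, the route count, and the 1x1 fusion convolution are all assumptions.

```python
import torch
import torch.nn as nn

class MultiRouteBlock(nn.Module):
    """Hypothetical sketch: parallel branches of increasing depth share one
    input, and their outputs are concatenated so the block emits shallow and
    deep features at once (illustrative only, not the paper's Res2Block)."""

    def __init__(self, channels: int, num_routes: int = 3, dilation: int = 1):
        super().__init__()
        self.routes = nn.ModuleList()
        for depth in range(1, num_routes + 1):
            layers = []
            for _ in range(depth):
                layers += [
                    # padding = dilation keeps the time length unchanged
                    # for a kernel of size 3
                    nn.Conv1d(channels, channels, kernel_size=3,
                              padding=dilation, dilation=dilation),
                    nn.ReLU(),
                ]
            self.routes.append(nn.Sequential(*layers))
        # 1x1 convolution fuses the concatenated routes back to `channels`
        self.fuse = nn.Conv1d(channels * num_routes, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([route(x) for route in self.routes], dim=1))

# Toy input: (batch, channels, time frames)
x = torch.randn(2, 64, 128)
block = MultiRouteBlock(64, num_routes=3, dilation=2)
y = block(x)
print(tuple(y.shape))  # (2, 64, 128) -- shape preserved by the padding choice
```

Increasing the `dilation` argument across stacked blocks widens the receptive field without extra parameters, which is the usual motivation for dilated convolutions over longer musical patterns.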

Original language: English
Pages (from-to): 516-523
Number of pages: 8
Journal: Journal of the Acoustical Society of Korea
Volume: 44
Issue number: 5
DOIs
State: Published - 2025

Keywords

  • Convolution Neural Network (CNN)
  • Deep learning
  • Music information retrieval
  • Music representation
