Machine Learning Applied to Music/Audio Signal Processing

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Circuit and Signal Processing".

Deadline for manuscript submissions: closed (31 August 2021) | Viewed by 62756

Special Issue Editors


Dr. Alexander Lerch
Guest Editor
Center for Music Technology, Georgia Institute of Technology, Atlanta, GA 30332, USA
Interests: audio content analysis; music information retrieval; semantic audio; audio signal processing; music generation

Dr. Peter Knees
Guest Editor
Faculty of Informatics, Institute of Information Systems Engineering, TU Wien Informatics, Favoritenstraße 9-11, 1040 Vienna, Austria
Interests: information retrieval; user interfaces; music information retrieval; recommender systems; artificial intelligence; machine learning

Special Issue Information

Dear Colleagues,

Applications of audio and music processing range from music discovery and recommendation systems, through speech enhancement, audio event detection, and music transcription, to creative applications such as sound synthesis and morphing.

The last decade has seen a paradigm shift from expert-designed algorithms to data-driven approaches. Machine learning methods, and deep neural networks in particular, have been shown to outperform traditional approaches on a large variety of tasks, including audio classification, source separation, enhancement, and content analysis. Data-driven approaches, however, bring a set of new challenges, two of which are training data and interpretability. As supervised machine learning models grow in complexity, their need for annotated training data often cannot be met by the data available. Moreover, a limited understanding of how neural networks model the data can lead to unexpected results and open vulnerabilities to adversarial attacks.

The main aim of this Special Issue is to seek high-quality submissions that present novel data-driven methods for audio/music signal processing and analysis and that address the main challenges of applying machine learning to audio signals. Within the general area of audio and music information retrieval as well as audio and music processing, the topics of interest include, but are not limited to, the following:

  • unsupervised and semi-supervised systems for audio/music processing and analysis
  • machine learning methods for raw audio signal analysis and transformation
  • approaches to understanding and controlling the behavior of audio processing systems such as visualization, auralization, or regularization methods
  • generative systems for sound synthesis and transformation
  • adversarial attacks and the identification of 'deepfakes' in audio and music
  • audio and music style transfer methods
  • audio recording and music production parameter estimation
  • data collection methods, active learning, and interactive machine learning for data-driven approaches

Dr. Peter Knees
Dr. Alexander Lerch
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • music information retrieval
  • machine learning for audio
  • intelligent audio signal processing
  • audio analysis and transformation

Published Papers (16 papers)

Editorial

3 pages, 157 KiB  
Editorial
Machine Learning Applied to Music/Audio Signal Processing
by Alexander Lerch and Peter Knees
Electronics 2021, 10(24), 3077; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10243077 - 10 Dec 2021
Cited by 4 | Viewed by 3946
Abstract
Over the past two decades, the utilization of machine learning in audio and music signal processing has dramatically increased [...]

Research

33 pages, 1611 KiB  
Article
Combining Real-Time Extraction and Prediction of Musical Chord Progressions for Creative Applications
by Tristan Carsault, Jérôme Nika, Philippe Esling and Gérard Assayag
Electronics 2021, 10(21), 2634; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10212634 - 28 Oct 2021
Cited by 5 | Viewed by 3724
Abstract
Recently, the field of musical co-creativity has gained some momentum. In this context, our goal is twofold: to develop an intelligent listening and predictive module of chord sequences, and to propose an adapted evaluation of the associated Music Information Retrieval (MIR) tasks, namely the real-time extraction of musical chord labels from a live audio stream and the prediction of a possible continuation of the extracted symbolic sequence. Indeed, this application case invites us to raise questions about the evaluation processes and methodology that are currently applied to chord-based MIR models. In this paper, we focus on musical chords since these mid-level features are frequently used to describe harmonic progressions in Western music. In the case of chords, there exist strong inherent hierarchical and functional relationships. However, most of the research in the field of MIR focuses mainly on the performance of chord-based statistical models, without considering music-based evaluation or learning. Indeed, usual evaluations are based on a binary qualification of the classification outputs (right chord predicted versus wrong chord predicted). Therefore, we present a specifically tailored chord analyser to measure the performance of chord-based models in terms of functional qualification of the classification outputs (by taking into account the harmonic function of the chords). Then, in order to introduce musical knowledge into the learning process for the automatic chord extraction task, we propose a specific musical distance for comparing predicted and labeled chords. Finally, we conduct investigations into the impact of including high-level metadata in chord sequence prediction learning (such as information on key or downbeat position). We show that a model can obtain better performances in terms of accuracy or perplexity, but output biased results. At the same time, a model with a lower accuracy score can output errors with more musical meaning. Therefore, performing a goal-oriented evaluation allows a better understanding of the results and a more adapted design of MIR models.
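
As an illustration of the kind of musically informed comparison the paper argues for, the sketch below scores a predicted chord against a reference by pitch-class overlap rather than by exact match. It is a minimal, hypothetical Python example; the chord vocabulary, function names, and the distance itself are assumptions and do not reproduce the authors' analyser or their musical distance.

```python
# Illustrative sketch only: a toy chord comparison that penalizes harmonically
# related substitutions less than unrelated ones. Names and values are made up.

# Pitch classes for a few common triads (C = 0).
CHORD_PITCH_CLASSES = {
    "C:maj": {0, 4, 7},
    "A:min": {9, 0, 4},
    "G:maj": {7, 11, 2},
    "E:min": {4, 7, 11},
}

def toy_chord_distance(predicted: str, reference: str) -> float:
    """0.0 for identical chords; otherwise 1 minus the Jaccard overlap of
    the two chords' pitch-class sets."""
    if predicted == reference:
        return 0.0
    p, r = CHORD_PITCH_CLASSES[predicted], CHORD_PITCH_CLASSES[reference]
    return 1.0 - len(p & r) / len(p | r)

if __name__ == "__main__":
    print(toy_chord_distance("A:min", "C:maj"))  # 0.5 -> related substitution
    print(toy_chord_distance("G:maj", "C:maj"))  # 0.8 -> only one shared pitch class
```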

19 pages, 5484 KiB  
Article
Automatic Melody Harmonization via Reinforcement Learning by Exploring Structured Representations for Melody Sequences
by Te Zeng and Francis C. M. Lau
Electronics 2021, 10(20), 2469; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10202469 - 11 Oct 2021
Cited by 8 | Viewed by 2301
Abstract
We present a novel reinforcement learning architecture that learns a structured representation for use in symbolic melody harmonization. Probabilistic models are predominant in melody harmonization tasks, most of which only treat melody notes as independent observations and do not take note of substructures [...] Read more.
We present a novel reinforcement learning architecture that learns a structured representation for use in symbolic melody harmonization. Probabilistic models are predominant in melody harmonization tasks, most of which only treat melody notes as independent observations and do not take note of substructures in the melodic sequence. To fill this gap, we add substructure discovery as a crucial step in automatic chord generation. The proposed method consists of a structured representation module that generates hierarchical structures for the symbolic melodies, a policy module that learns to break a melody into segments (whose boundaries concur with chord changes) and phrases (the subunits in segments) and a harmonization module that generates chord sequences for each segment. We formulate the structure discovery process as a sequential decision problem with a policy gradient RL method selecting the boundary of each segment or phrase to obtain an optimized structure. We conduct experiments on our preprocessed HookTheory Lead Sheet Dataset, which has 17,979 melody/chord pairs. The results demonstrate that our proposed method can learn task-specific representations and, thus, yield competitive results compared with state-of-the-art baselines. Full article
(This article belongs to the Special Issue Machine Learning Applied to Music/Audio Signal Processing)
Show Figures

Figure 1
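
To hint at how boundary selection can be cast as a sequential decision, the toy snippet below performs a single REINFORCE-style update for choosing one boundary from a set of candidates. The policy network, melody encoding, and reward are placeholders and do not reflect the paper's actual modules.

```python
import torch
import torch.nn as nn

# Policy over 8 candidate boundary positions; sizes are illustrative only.
policy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

melody_feats = torch.randn(1, 16)                  # stand-in melody encoding
logits = policy(melody_feats)                      # [1, 8]
dist = torch.distributions.Categorical(logits=logits)
boundary = dist.sample()                           # sampled boundary index
reward = torch.tensor(1.0)                         # e.g. a harmonization quality score
loss = -(dist.log_prob(boundary) * reward).mean()  # REINFORCE objective
optimizer.zero_grad()
loss.backward()
optimizer.step()
```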

20 pages, 813 KiB  
Article
Improving Semi-Supervised Learning for Audio Classification with FixMatch
by Sascha Grollmisch and Estefanía Cano
Electronics 2021, 10(15), 1807; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10151807 - 28 Jul 2021
Cited by 14 | Viewed by 4532
Abstract
Including unlabeled data in the training process of neural networks using Semi-Supervised Learning (SSL) has shown impressive results in the image domain, where state-of-the-art results were obtained with only a fraction of the labeled data. The commonality between recent SSL methods is that they strongly rely on the augmentation of unannotated data. This remains largely unexplored for audio data. In this work, SSL using the state-of-the-art FixMatch approach is evaluated on three audio classification tasks, including music, industrial sounds, and acoustic scenes. The performance of FixMatch is compared to Convolutional Neural Networks (CNN) trained from scratch, Transfer Learning, and SSL using the Mean Teacher approach. Additionally, a simple yet effective approach for selecting suitable augmentation methods for FixMatch is introduced. FixMatch with the proposed modifications always outperformed Mean Teacher and the CNNs trained from scratch. For the industrial sounds and music datasets, the CNN baseline performance using the full dataset was reached with less than 5% of the initial training data, demonstrating the potential of recent SSL methods for audio data. Transfer Learning outperformed FixMatch only for the most challenging dataset from acoustic scene classification, showing that there is still room for improvement.
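
The core FixMatch mechanism (confident predictions on weakly augmented clips become pseudo-labels for training on strongly augmented versions of the same clips) can be sketched as follows. Function names and the confidence threshold are illustrative and not the authors' implementation.

```python
import numpy as np

def fixmatch_pseudo_labels(probs_weak: np.ndarray, threshold: float = 0.95):
    """Given class probabilities predicted on weakly augmented unlabeled clips
    (shape [batch, n_classes]), keep only confident predictions as pseudo-labels.
    Returns (mask, labels): mask marks clips whose loss is used, labels are argmax."""
    confidence = probs_weak.max(axis=1)
    labels = probs_weak.argmax(axis=1)
    mask = confidence >= threshold
    return mask, labels

# The masked pseudo-labels would then supervise the model's predictions on
# *strongly* augmented versions of the same clips via cross-entropy.
if __name__ == "__main__":
    probs = np.array([[0.97, 0.02, 0.01],   # confident -> used
                      [0.40, 0.35, 0.25]])  # uncertain -> ignored
    mask, labels = fixmatch_pseudo_labels(probs)
    print(mask, labels)  # [ True False] [0 0]
```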

23 pages, 1503 KiB  
Article
User-Driven Fine-Tuning for Beat Tracking
by António S. Pinto, Sebastian Böck, Jaime S. Cardoso and Matthew E. P. Davies
Electronics 2021, 10(13), 1518; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10131518 - 23 Jun 2021
Cited by 6 | Viewed by 2394
Abstract
The extraction of the beat from musical audio signals represents a foundational task in the field of music information retrieval. While great advances in performance have been achieved due to the use of deep neural networks, significant shortcomings still remain. In particular, performance is generally much lower on musical content that differs from that which is contained in existing annotated datasets used for neural network training, as well as in the presence of challenging musical conditions such as rubato. In this paper, we positioned our approach to beat tracking from a real-world perspective where an end-user targets very high accuracy on specific music pieces and for which the current state of the art is not effective. To this end, we explored the use of targeted fine-tuning of a state-of-the-art deep neural network based on a very limited temporal region of annotated beat locations. We demonstrated the success of our approach via improved performance across existing annotated datasets and a new annotation-correction approach for evaluation. Furthermore, we highlighted the ability of content-specific fine-tuning to learn both what is and what is not the beat in challenging musical conditions.
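
The targeted fine-tuning idea can be sketched as a short optimization loop over one annotated excerpt. The model, input shape, and loss below are placeholders under the assumption of a frame-wise beat-activation output; they are not the network or training setup used in the paper.

```python
# Minimal PyTorch sketch: fine-tune a pretrained beat tracker on a short
# annotated excerpt. `pretrained_model` is assumed to map [1, T, n_bins]
# features to [1, T, 1] beat-activation logits.
import torch
import torch.nn as nn

def fine_tune_on_excerpt(pretrained_model: nn.Module,
                         excerpt_features: torch.Tensor,        # [1, T, n_bins]
                         beat_activation_target: torch.Tensor,  # [1, T], values in {0, 1}
                         steps: int = 50, lr: float = 1e-4) -> nn.Module:
    optimizer = torch.optim.Adam(pretrained_model.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()
    pretrained_model.train()
    for _ in range(steps):
        optimizer.zero_grad()
        logits = pretrained_model(excerpt_features).squeeze(-1)  # [1, T]
        loss = criterion(logits, beat_activation_target)
        loss.backward()
        optimizer.step()
    return pretrained_model
```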

30 pages, 3863 KiB  
Article
Analyzing and Visualizing Deep Neural Networks for Speech Recognition with Saliency-Adjusted Neuron Activation Profiles
by Andreas Krug, Maral Ebrahimzadeh, Jost Alemann, Jens Johannsmeier and Sebastian Stober
Electronics 2021, 10(11), 1350; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10111350 - 05 Jun 2021
Cited by 13 | Viewed by 3497
Abstract
Deep Learning-based Automatic Speech Recognition (ASR) models are very successful, but hard to interpret. To gain a better understanding of how Artificial Neural Networks (ANNs) accomplish their tasks, several introspection methods have been proposed. However, established introspection techniques are mostly designed for computer vision tasks and rely on the data being visually interpretable, which limits their usefulness for understanding speech recognition models. To overcome this limitation, we developed a novel neuroscience-inspired technique for visualizing and understanding ANNs, called Saliency-Adjusted Neuron Activation Profiles (SNAPs). SNAPs are a flexible framework to analyze and visualize Deep Neural Networks that does not depend on visually interpretable data. In this work, we demonstrate how to utilize SNAPs for understanding fully-convolutional ASR models. This includes visualizing acoustic concepts learned by the model and the comparative analysis of their representations in the model layers.
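
To make the general idea more concrete, here is a rough numpy sketch of averaging a layer's activations weighted by a saliency signal. The grouping of inputs, the saliency computation, and the normalization are simplifying assumptions and do not reproduce the authors' exact SNAP procedure.

```python
import numpy as np

def saliency_adjusted_profile(activations: np.ndarray,  # [n_inputs, n_frames, n_neurons]
                              saliency: np.ndarray       # [n_inputs, n_frames]
                              ) -> np.ndarray:
    """Average a layer's activations over all inputs of one group (e.g. one
    acoustic concept), weighting each time frame by its saliency."""
    w = saliency / (saliency.sum(axis=1, keepdims=True) + 1e-12)  # normalise per input
    weighted = (activations * w[..., None]).sum(axis=1)           # [n_inputs, n_neurons]
    return weighted.mean(axis=0)                                  # one profile value per neuron
```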

21 pages, 5686 KiB  
Article
Stochastic Restoration of Heavily Compressed Musical Audio Using Generative Adversarial Networks
by Stefan Lattner and Javier Nistal
Electronics 2021, 10(11), 1349; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10111349 - 05 Jun 2021
Cited by 7 | Viewed by 3830
Abstract
Lossy audio codecs compress (and decompress) digital audio streams by removing information that tends to be inaudible in human perception. Under high compression rates, such codecs may introduce a variety of impairments in the audio signal. Many works have tackled the problem of audio enhancement and compression artifact removal using deep-learning techniques. However, only a few works tackle the restoration of heavily compressed audio signals in the musical domain. In such a scenario, there is no unique solution for the restoration of the original signal. Therefore, in this study, we test a stochastic generator of a Generative Adversarial Network (GAN) architecture for this task. Such a stochastic generator, conditioned on highly compressed musical audio signals, could one day generate outputs indistinguishable from high-quality releases. Therefore, the present study may yield insights into more efficient musical data storage and transmission. We train stochastic and deterministic generators on MP3-compressed audio signals with 16, 32, and 64 kbit/s. We perform an extensive evaluation of the different experiments utilizing objective metrics and listening tests. We find that the models can improve the quality of the audio signals over the MP3 versions for 16 and 32 kbit/s and that the stochastic generators are capable of generating outputs that are closer to the original signals than those of the deterministic generators.
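
The "stochastic generator" idea can be illustrated with a placeholder PyTorch module that concatenates a noise vector to features of the compressed input, so repeated forward passes yield different plausible restorations. The architecture, dimensions, and class name below are invented for illustration and do not match the paper's GAN.

```python
import torch
import torch.nn as nn

class StochasticRestorer(nn.Module):
    """Toy generator: compressed-audio features plus random noise in, restored
    features out. Repeated calls with the same input give different outputs."""
    def __init__(self, feat_dim: int = 128, noise_dim: int = 32):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(feat_dim + noise_dim, 256), nn.ReLU(),
            nn.Linear(256, feat_dim),
        )

    def forward(self, compressed_feats: torch.Tensor) -> torch.Tensor:
        # compressed_feats: [batch, feat_dim] features of the MP3-degraded frame
        z = torch.randn(compressed_feats.shape[0], self.noise_dim,
                        device=compressed_feats.device)
        return self.net(torch.cat([compressed_feats, z], dim=1))
```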

14 pages, 404 KiB  
Article
Singing Voice Detection in Opera Recordings: A Case Study on Robustness and Generalization
by Michael Krause, Meinard Müller and Christof Weiß
Electronics 2021, 10(10), 1214; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10101214 - 20 May 2021
Cited by 9 | Viewed by 2545
Abstract
Automatically detecting the presence of singing in music audio recordings is a central task within music information retrieval. While modern machine-learning systems produce high-quality results on this task, the reported experiments are usually limited to popular music and the trained systems often overfit to confounding factors. In this paper, we aim to gain a deeper understanding of such machine-learning methods and investigate their robustness in a challenging opera scenario. To this end, we compare two state-of-the-art methods for singing voice detection based on supervised learning: A traditional approach relying on hand-crafted features with a random forest classifier, as well as a deep-learning approach relying on convolutional neural networks. To evaluate these algorithms, we make use of a cross-version dataset comprising 16 recorded performances (versions) of Richard Wagner’s four-opera cycle Der Ring des Nibelungen. This scenario allows us to systematically investigate generalization to unseen versions, musical works, or both. In particular, we study the trained systems’ robustness depending on the acoustic and musical variety, as well as the overall size of the training dataset. Our experiments show that both systems can robustly detect singing voice in opera recordings even when trained on relatively small datasets with little variety.
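
For context, the traditional baseline described above can be approximated by frame-wise hand-crafted features fed to a random forest. The sketch below uses MFCCs via librosa as a stand-in feature set, which is an assumption rather than the authors' exact feature configuration.

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def frame_features(y: np.ndarray, sr: int) -> np.ndarray:
    """Frame-wise MFCCs as a stand-in for hand-crafted singing-voice features."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)  # [20, n_frames]
    return mfcc.T                                       # [n_frames, 20]

# Hypothetical usage, with X_train as stacked frame features and
# y_train as frame labels (1 = singing, 0 = no singing):
# clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
# frame_predictions = clf.predict(frame_features(test_audio, sr))
```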

18 pages, 1205 KiB  
Article
Informing Piano Multi-Pitch Estimation with Inferred Local Polyphony Based on Convolutional Neural Networks
by Michael Taenzer, Stylianos I. Mimilakis and Jakob Abeßer
Electronics 2021, 10(7), 851; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10070851 - 02 Apr 2021
Cited by 6 | Viewed by 2930
Abstract
In this work, we propose considering information about the local polyphony for multi-pitch estimation (MPE) in piano music recordings. To that aim, we propose a method for local polyphony estimation (LPE), which is based on convolutional neural networks (CNNs) trained in a supervised fashion to explicitly predict the degree of polyphony. We investigate two feature representations as inputs to our method, in particular, the Constant-Q Transform (CQT) and its recent extension Folded-CQT (F-CQT). To evaluate the performance of our method, we conduct a series of experiments on real and synthetic piano recordings based on the MIDI Aligned Piano Sounds (MAPS) and the Saarland Music Data (SMD) datasets. We compare our approaches with a state-of-the-art piano transcription method by informing said method with the LPE knowledge in a postprocessing stage. The experimental results suggest that using explicit LPE information can refine MPE predictions. Furthermore, it is shown that, on average, the CQT representation is preferred over F-CQT for LPE.
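
As a pointer to the input representation, the following sketch computes a frame-wise CQT with librosa. The Folded-CQT extension and the LPE network itself are not reproduced, and the parameter choices are assumptions rather than the paper's settings.

```python
import numpy as np
import librosa

def cqt_frames(y: np.ndarray, sr: int) -> np.ndarray:
    """Return a log-magnitude CQT, one row per frame, as a possible CNN input."""
    C = np.abs(librosa.cqt(y=y, sr=sr, n_bins=84, bins_per_octave=12))  # [84, n_frames]
    return librosa.amplitude_to_db(C, ref=np.max).T                     # [n_frames, 84]
```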

23 pages, 2721 KiB  
Article
An Interpretable Deep Learning Model for Automatic Sound Classification
by Pablo Zinemanas, Martín Rocamora, Marius Miron, Frederic Font and Xavier Serra
Electronics 2021, 10(7), 850; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10070850 - 02 Apr 2021
Cited by 22 | Viewed by 6606
Abstract
Deep learning models have improved cutting-edge technologies in many research areas, but their black-box structure makes it difficult to understand their inner workings and the rationale behind their predictions. This may lead to unintended effects, such as being susceptible to adversarial attacks or the reinforcement of biases. There is still a lack of research in the audio domain, despite the increasing interest in developing deep learning models that provide explanations of their decisions. To reduce this gap, we propose a novel interpretable deep learning model for automatic sound classification, which explains its predictions based on the similarity of the input to a set of learned prototypes in a latent space. We leverage domain knowledge by designing a frequency-dependent similarity measure and by considering different time-frequency resolutions in the feature space. The proposed model achieves results that are comparable to those of state-of-the-art methods in three different sound classification tasks involving speech, music, and environmental audio. In addition, we present two automatic methods to prune the proposed model that exploit its interpretability. Our system is open source and it is accompanied by a web application for the manual editing of the model, which allows for a human-in-the-loop debugging approach.
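
The prototype-based decision rule can be sketched as follows: an input embedding is scored against learned prototype embeddings and the best per-class similarity serves as interpretable evidence. The frequency-dependent similarity and the pruning methods of the paper are not modeled here; all names and the Gaussian similarity are illustrative assumptions.

```python
import numpy as np

def prototype_scores(z: np.ndarray,            # [d] latent embedding of the input
                     prototypes: np.ndarray,   # [n_prototypes, d] learned prototypes
                     proto_class: np.ndarray,  # [n_prototypes] class index per prototype
                     n_classes: int) -> np.ndarray:
    """Per-class evidence scores; assumes every class owns at least one prototype."""
    sq_dist = ((prototypes - z) ** 2).sum(axis=1)       # distance to each prototype
    similarity = np.exp(-sq_dist)                       # closer prototype -> higher score
    scores = np.zeros(n_classes)
    for c in range(n_classes):
        scores[c] = similarity[proto_class == c].max()  # best-matching prototype per class
    return scores  # argmax gives the prediction; the scores explain it
```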

20 pages, 3830 KiB  
Article
Investigating the Effects of Training Set Synthesis for Audio Segmentation of Radio Broadcast
by Satvik Venkatesh, David Moffat and Eduardo Reck Miranda
Electronics 2021, 10(7), 827; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10070827 - 31 Mar 2021
Cited by 7 | Viewed by 4038
Abstract
Music and speech detection provides valuable information regarding the nature of the content in broadcast audio. It helps detect acoustic regions that contain speech, voice over music, only music, or silence. In recent years, there have been developments in machine learning algorithms to accomplish this task. However, broadcast audio is generally well-mixed and copyrighted, which makes it challenging to share across research groups. In this study, we address the challenges encountered in automatically synthesising data that resembles a radio broadcast. Firstly, we compare state-of-the-art neural network architectures such as CNN, GRU, LSTM, TCN, and CRNN. Secondly, we investigate how audio ducking of background music impacts the precision and recall of the machine learning algorithm. Thirdly, we examine how the quantity of synthetic training data impacts the results. Finally, we evaluate the effectiveness of synthesised, real-world, and combined approaches for training models, to understand if the synthetic data presents any additional value. Amongst the network architectures, CRNN was the best-performing network. Results also show that the minimum level of audio ducking preferred by the machine learning algorithm was similar to that of human listeners. After testing our model on in-house and public datasets, we observe that our proposed synthesis technique outperforms real-world data in some cases and serves as a promising alternative.
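
A toy version of the synthesis-with-ducking step might look like the sketch below, which attenuates a music bed wherever a speech-activity mask is on. The ducking level and the mixing scheme are assumptions, not the authors' pipeline.

```python
import numpy as np

def mix_with_ducking(speech: np.ndarray, music: np.ndarray,
                     speech_active: np.ndarray, duck_db: float = -15.0) -> np.ndarray:
    """Crude sketch of synthesising 'voice over music' training audio:
    attenuate the music bed by `duck_db` wherever speech is active.
    All signals are mono and equal length; `speech_active` is a 0/1 mask per sample."""
    duck_gain = 10.0 ** (duck_db / 20.0)
    music_gain = np.where(speech_active > 0, duck_gain, 1.0)
    return speech + music * music_gain
```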

16 pages, 914 KiB  
Article
A Comparison of Deep Learning Methods for Timbre Analysis in Polyphonic Automatic Music Transcription
by Carlos Hernandez-Olivan, Ignacio Zay Pinilla, Carlos Hernandez-Lopez and Jose R. Beltran
Electronics 2021, 10(7), 810; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10070810 - 29 Mar 2021
Cited by 16 | Viewed by 3561
Abstract
Automatic music transcription (AMT) is a critical problem in the field of music information retrieval (MIR). When AMT is approached with deep neural networks, the variety of timbres of different instruments can be an issue that has not yet been studied in depth. The goal of this work is to address AMT by first analyzing how timbre affects monophonic transcription, using an approach based on the CREPE neural network, and then to improve the results by performing polyphonic music transcription with different timbres, using a second approach based on the Deep Salience model, which performs polyphonic transcription based on the Constant-Q Transform. The results of the first method show that the timbre and envelope of the onsets have a high impact on the AMT results, and the second method shows that the developed model is less dependent on the strength of the onsets than other state-of-the-art models that deal with AMT on piano sounds, such as Google Magenta Onsets and Frames (OaF). Our polyphonic transcription model outperforms the state-of-the-art model for non-piano instruments, e.g., for bass instruments, with an F-score of 0.9516 versus 0.7102. In a final experiment, we also show how adding an onset detector to our model can improve upon the results given in this work.

11 pages, 297 KiB  
Article
Jazz Bass Transcription Using a U-Net Architecture
by Jakob Abeßer and Meinard Müller
Electronics 2021, 10(6), 670; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10060670 - 12 Mar 2021
Cited by 7 | Viewed by 2383
Abstract
In this paper, we adapt a recently proposed U-net deep neural network architecture from melody to bass transcription. We investigate pitch shifting and random equalization as data augmentation techniques. In a parameter importance study, we examine the influence of the skip connection strategy between the encoder and decoder layers, of the data augmentation strategy, and of the overall model capacity on the system’s performance. Using a training set that covers various music genres and a validation set that includes jazz ensemble recordings, we obtain the best transcription performance for a downscaled version of the reference algorithm combined with skip connections that transfer intermediate activations between the encoder and decoder. The U-net based method outperforms previous knowledge-driven and data-driven bass transcription algorithms by around five percentage points in overall accuracy. In addition to a pitch estimation improvement, the voicing estimation performance is clearly enhanced.
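
To illustrate the skip-connection design being ablated, here is a deliberately tiny two-level U-Net in PyTorch that concatenates encoder activations into the decoder. Layer sizes and the output head are arbitrary and do not match the paper's model.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal encoder-decoder with one skip connection, for illustration only."""
    def __init__(self, in_ch: int = 1, base: int = 16):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        # decoder input = upsampled features concatenated with the skip connection
        self.dec1 = nn.Sequential(nn.Conv2d(base * 2 + base, base, 3, padding=1), nn.ReLU())
        self.out = nn.Conv2d(base, 1, 1)  # e.g. a pitch-salience map

    def forward(self, x):                 # x: [batch, 1, freq, time]
        s1 = self.enc1(x)                 # skip connection source
        bottleneck = self.enc2(self.pool(s1))
        up = self.up(bottleneck)
        d1 = self.dec1(torch.cat([up, s1], dim=1))  # concatenate skip activations
        return self.out(d1)

# y = TinyUNet()(torch.randn(1, 1, 64, 128))  # -> [1, 1, 64, 128]
```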

14 pages, 1863 KiB  
Article
Vocal Melody Extraction via HRNet-Based Singing Voice Separation and Encoder-Decoder-Based F0 Estimation
by Yongwei Gao, Xulong Zhang and Wei Li
Electronics 2021, 10(3), 298; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10030298 - 26 Jan 2021
Cited by 18 | Viewed by 3417
Abstract
Vocal melody extraction is an important and challenging task in music information retrieval. One main difficulty is that, most of the time, various instruments and singing voices are mixed according to harmonic structure, making it hard to identify the fundamental frequency (F0) of a singing voice. Therefore, reducing the interference of accompaniment is beneficial to pitch estimation of the singing voice. In this paper, we first adopted a high-resolution network (HRNet) to separate vocals from polyphonic music, then designed an encoder-decoder network to estimate the vocal F0 values. Experimental results demonstrate the effectiveness of the HRNet-based singing voice separation method in reducing the interference of the accompaniment on the extraction of the vocal melody, and show that the proposed vocal melody extraction (VME) system outperforms other state-of-the-art algorithms in most cases.

26 pages, 880 KiB  
Article
Sigmoidal NMFD: Convolutional NMF with Saturating Activations for Drum Mixture Decomposition
by Len Vande Veire, Cedric De Boom and Tijl De Bie
Electronics 2021, 10(3), 284; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics10030284 - 25 Jan 2021
Cited by 5 | Viewed by 2355
Abstract
In many types of music, percussion plays an essential role to establish the rhythm and the groove of the music. Algorithms that can decompose the percussive signal into its constituent components would therefore be very useful, as they would enable many analytical and creative applications. This paper describes a method for the unsupervised decomposition of percussive recordings, building on the non-negative matrix factor deconvolution (NMFD) algorithm. Given a percussive music recording, NMFD discovers a dictionary of time-varying spectral templates and corresponding activation functions, representing its constituent sounds and their positions in the mix. We observe, however, that the activation functions discovered using NMFD do not show the expected impulse-like behavior for percussive instruments. We therefore enforce this behavior by specifying that the activations should take on binary values: either an instrument is hit, or it is not. To this end, we rewrite the activations as the output of a sigmoidal function, multiplied with a per-component amplitude factor. We furthermore define a regularization term that biases the decomposition to solutions with saturated activations, leading to the desired binary behavior. We evaluate several optimization strategies and techniques that are designed to avoid poor local minima. We show that incentivizing the activations to be binary indeed leads to the desired impulse-like behavior, and that the resulting components are better separated, leading to more interpretable decompositions.
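
The reparameterisation at the heart of the method can be sketched in a few lines: activations are written as a per-component amplitude times a sigmoid, and a penalty favors saturated (near-binary) values. The exact regularizer, the NMFD updates, and the function names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def activations(z: np.ndarray, amplitude: np.ndarray) -> np.ndarray:
    """Activations as amplitude * sigmoid(z);
    z: [n_components, n_frames] free parameters, amplitude: [n_components, 1]."""
    return amplitude * sigmoid(z)

def saturation_penalty(z: np.ndarray) -> float:
    """Smallest when sigmoid(z) is close to 0 or 1, i.e. when hits are binary."""
    s = sigmoid(z)
    return float(np.mean(s * (1.0 - s)))
```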

23 pages, 2043 KiB  
Article
Research on Singing Voice Detection Based on a Long-Term Recurrent Convolutional Network with Vocal Separation and Temporal Smoothing
by Xulong Zhang, Yi Yu, Yongwei Gao, Xi Chen and Wei Li
Electronics 2020, 9(9), 1458; https://0-doi-org.brum.beds.ac.uk/10.3390/electronics9091458 - 07 Sep 2020
Cited by 25 | Viewed by 3589
Abstract
Singing voice detection or vocal detection is a classification task that determines whether a given audio segment contains singing voices. This task plays a very important role in vocal-related music information retrieval tasks, such as singer identification. Although humans can easily distinguish between singing and nonsinging parts, it is still very difficult for machines to do so. Most existing methods focus on audio feature engineering with classifiers, which rely on the experience of the algorithm designer. In recent years, deep learning has been widely used in computer hearing. To extract essential features that reflect the audio content and characterize the vocal context in the time domain, this study adopted a long-term recurrent convolutional network (LRCN) to realize vocal detection. The convolutional layer in the LRCN functions in feature extraction, and the long short-term memory (LSTM) layer can learn the time sequence relationship. The preprocessing step of singing voice and accompaniment separation and the postprocessing step of time-domain smoothing were combined to form a complete system. Experiments on five public datasets investigated the impact of the different features used for fusion, the frame size, and the block size on the LRCN's temporal relationship learning, as well as the effects of preprocessing and postprocessing on performance. The results confirm that the proposed singing voice detection algorithm reaches the state-of-the-art level on public datasets.
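
A minimal PyTorch sketch of an LRCN-style classifier follows: a small 1-D CNN extracts frame features from a mel spectrogram and an LSTM models their temporal context. The layer sizes, input features, and output head are assumptions and not the paper's configuration.

```python
import torch
import torch.nn as nn

class TinyLRCN(nn.Module):
    """CNN feature extractor followed by an LSTM, producing per-frame
    singing/non-singing logits."""
    def __init__(self, n_mels: int = 80, hidden: int = 64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.lstm = nn.LSTM(input_size=64, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, mel):                # mel: [batch, n_mels, n_frames]
        feats = self.cnn(mel)              # [batch, 64, n_frames]
        feats = feats.transpose(1, 2)      # [batch, n_frames, 64]
        out, _ = self.lstm(feats)          # temporal context over frames
        return self.head(out).squeeze(-1)  # [batch, n_frames] logits

# logits = TinyLRCN()(torch.randn(2, 80, 300))  # -> [2, 300]
```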
