
Analog audio effects and synthesizers often owe their distinct sound to circuit nonlinearities. Faithfully modeling such a significant aspect of the original sound in virtual analog software can prove challenging. The current work proposes a generic data-driven approach to virtual analog modeling and applies it to the Fender Bassman 5F6-A vacuum-tube amplifier. Specifically, a feedforward variant of the WaveNet deep neural network is trained to carry out a regression on audio waveform samples from input to output of a SPICE model of the tube amplifier.
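As a rough illustration of the kind of model involved, the sketch below implements a tiny feedforward stack of dilated causal convolutions in NumPy — the building block that lets a WaveNet-style network map input samples to output samples while looking only backwards in time. This is not the paper's trained network; the layer count, kernel size, and weights are arbitrary placeholders.

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    """Causal dilated 1-D convolution: y[n] = sum_i w[i] * x[n - i*dilation],
    zero-padded so y[n] never depends on future samples."""
    pad = dilation * (len(w) - 1)
    xp = np.concatenate([np.zeros(pad), x])
    idx = pad + np.arange(len(x))
    return sum(w[i] * xp[idx - i * dilation] for i in range(len(w)))

def wavenet_like(x, weights, dilations):
    """Feedforward stack of dilated causal conv layers with tanh nonlinearities."""
    h = x
    for w, d in zip(weights, dilations):
        h = np.tanh(causal_dilated_conv(h, w, d))
    return h

rng = np.random.default_rng(0)
dilations = [1, 2, 4, 8]                      # receptive field grows with depth
weights = [rng.standard_normal(2) * 0.5 for _ in dilations]
x = rng.standard_normal(256)                  # stand-in for an input waveform
y = wavenet_like(x, weights, dilations)
```

Because every layer is causal, perturbing an input sample can only affect the output from that sample onwards, which is what makes such a stack usable for real-time sample-by-sample emulation.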


In this contribution, we consider a cross-modal retrieval scenario for Western classical music. Given a short monophonic musical theme in symbolic notation as a query, the objective is to find relevant audio recordings in a database. A major challenge of this retrieval task is the possible difference in the degree of polyphony between the monophonic query and the music recordings. Previous studies for popular music addressed this issue by performing the cross-modal comparison based on predominant melodies extracted from the recordings.
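One common way to compare a symbolic query against melodies extracted from recordings is dynamic time warping (DTW) over pitch sequences, which tolerates tempo differences. The sketch below is a generic illustration of that matching step only; the pitch values and recording names are made up, and melody extraction itself is not shown.

```python
import numpy as np

def dtw_distance(q, r):
    """Dynamic time warping distance between two pitch sequences (MIDI numbers)."""
    n, m = len(q), len(r)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(q[i - 1] - r[j - 1])          # semitone distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

query = [60, 62, 64, 65, 64, 62, 60]                 # monophonic theme (MIDI pitches)
melodies = {                                          # hypothetical extracted melodies
    "recording_a": [60, 60, 62, 64, 64, 65, 64, 62, 60],  # same theme, stretched
    "recording_b": [55, 57, 55, 52, 50, 52, 55, 57],       # unrelated melody
}
ranked = sorted(melodies, key=lambda k: dtw_distance(query, melodies[k]))
```

Here the time-stretched version of the theme ranks first because DTW aligns repeated notes at zero cost, while the unrelated melody accumulates large pitch differences.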


Recent deep learning methods offer state-of-the-art performance for Monaural Singing Voice Separation (MSVS). Among these methods, the recurrent neural network (RNN) is widely employed. This work proposes a novel type of Deep RNN (DRNN), namely the Proximal DRNN (P-DRNN), for MSVS, which improves the conventional Stacked RNN (S-RNN) by introducing a novel interlayer structure. The interlayer structure is derived from an optimization problem for Monaural Source Separation (MSS).
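For reference, the conventional S-RNN baseline that the proposed interlayer structure modifies can be sketched as a plain stack of Elman-style recurrent layers, each feeding its hidden-state sequence to the next. This is a generic NumPy illustration with made-up dimensions, not the proposed P-DRNN.

```python
import numpy as np

def stacked_rnn(x_seq, params):
    """Conventional stacked (Elman) RNN: each layer's hidden-state sequence
    becomes the next layer's input; no interlayer structure beyond stacking."""
    h_seq = x_seq
    for W_in, W_rec, b in params:
        h = np.zeros(W_rec.shape[0])
        out = []
        for x_t in h_seq:
            h = np.tanh(W_in @ x_t + W_rec @ h + b)
            out.append(h)
        h_seq = out
    return np.stack(h_seq)

rng = np.random.default_rng(0)
dims = [8, 16, 16]        # input dim, then hidden dims per layer (placeholders)
params = [(rng.standard_normal((dims[i + 1], dims[i])) * 0.1,
           rng.standard_normal((dims[i + 1], dims[i + 1])) * 0.1,
           np.zeros(dims[i + 1])) for i in range(len(dims) - 1)]
x_seq = rng.standard_normal((20, 8))   # 20 time frames of 8-dim features
y = stacked_rnn(x_seq, params)
```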


For many music analysis problems, we need to know the presence of instruments for each time frame in a multi-instrument musical piece. However, such a frame-level instrument recognition task remains difficult, mainly due to the lack of labeled datasets. To address this issue, we present in this paper a large-scale dataset that contains synthetic polyphonic music with frame-level pitch and instrument labels. Moreover, we propose a simple yet novel network architecture to jointly predict


In this paper, a phase-aware harmonic/percussive source separation (HPSS) method based on convex optimization is proposed. Building on two HPSS approaches (anisotropic smoothness and the sinusoidal model), the proposed method assumes that the complex-valued spectrogram of the harmonic components, calculated by iPC-STFT, is smooth in the time direction. On the other hand, the time-frame-wise sparsity of percussive spectrograms is treated as a phase-insensitive prior. Furthermore, the proposed method imposes the perfect reconstruction constraint in the time domain instead of on power spectrograms.
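For context, the classic phase-insensitive baseline that such methods improve on is median-filtering HPSS (Fitzgerald-style), which operates on the magnitude spectrogram alone: harmonic energy is smooth along time, percussive energy is smooth along frequency. A minimal sketch, with a toy test signal (the kernel size and signal parameters are arbitrary):

```python
import numpy as np
from scipy.ndimage import median_filter

def hpss_median(S_mag, kernel=17):
    """Median-filtering HPSS on a magnitude spectrogram (freq x time)."""
    H = median_filter(S_mag, size=(1, kernel))   # per-bin median over time
    P = median_filter(S_mag, size=(kernel, 1))   # per-frame median over frequency
    mask_h = H / (H + P + 1e-12)                 # soft (Wiener-like) masks
    return mask_h * S_mag, (1.0 - mask_h) * S_mag

# toy signal: a steady 500 Hz tone plus a sparse click train
sr, n_fft, hop = 8000, 256, 64
n = np.arange(sr)
x = np.sin(2 * np.pi * 500 * n / sr)
x[::800] += 5.0
win = np.hanning(n_fft)
S = np.abs(np.fft.rfft(np.stack([x[s:s + n_fft] * win
                                 for s in range(0, len(x) - n_fft, hop)]),
                       axis=1)).T
harm, perc = hpss_median(S)
```

On this toy input, the steady tone (bin 16 at this resolution) ends up mostly in the harmonic estimate and the broadband clicks mostly in the percussive one; the paper's contribution is to go beyond such magnitude-only reasoning by exploiting phase.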


Low-rankness of amplitude spectrograms has been effectively utilized in audio signal processing methods including non-negative matrix factorization. However, such methods have a fundamental limitation owing to their amplitude-only treatment, in which the phase of the observed signal is reused to resynthesize the estimated signal. In order to address this limitation, we directly treat a complex-valued spectrogram and show that the complex-valued spectrogram of a sum of sinusoids can be approximately low-rank by modifying its phase.
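The low-rank structure referred to here can be checked numerically in a simple special case: for steady sinusoids, each of the two complex exponentials making up a real sinusoid contributes a rank-one term to the complex STFT (a fixed spectral-leakage pattern over frequency times a rotating phase over time), so the matrix below has numerical rank at most four. This only illustrates the structure; it is not the paper's phase-modification algorithm.

```python
import numpy as np

sr, n_fft, hop = 16000, 512, 128
n = np.arange(sr)
# two steady sinusoids at frequencies NOT aligned to FFT bins
x = np.sin(2 * np.pi * 440.3 * n / sr) + 0.5 * np.sin(2 * np.pi * 1234.7 * n / sr)

win = np.hanning(n_fft)
starts = range(0, len(x) - n_fft, hop)
X = np.fft.rfft(np.stack([x[s:s + n_fft] * win for s in starts]), axis=1).T

# singular values: two real sinusoids = four complex exponentials,
# each a (frequency pattern) x (time phase) outer product -> rank <= 4
s = np.linalg.svd(X, compute_uv=False)
```

The first few singular values are significant while the fifth and beyond vanish to numerical precision, confirming the rank-four structure despite the off-bin frequencies.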


In this paper, we build a hybrid neural network (NN) for singing melody extraction from polyphonic music by imitating human pitch perception. In human hearing, there are two pitch perception models, the spectral model and the temporal model, depending on whether the harmonics are resolved or not. Here, we first use NNs to implement the individual models and evaluate their performance on the task of singing melody extraction.
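The two perceptual cues mentioned above have simple signal-processing analogues, sketched below for a synthetic harmonic tone: a spectral estimate that sums energy at candidate harmonic positions (resolved harmonics), and a temporal estimate from the autocorrelation of the waveform (periodicity). These are classical illustrations, not the paper's networks; all parameters are arbitrary.

```python
import numpy as np

sr = 16000
f0 = 220.0
n = np.arange(sr // 4)
# harmonic tone: three resolved harmonics of 220 Hz
x = sum(np.sin(2 * np.pi * f0 * h * n / sr) / h for h in (1, 2, 3))

# spectral model: pick the f0 whose first three harmonics collect the most energy
spec = np.abs(np.fft.rfft(x * np.hanning(len(x))))
freqs = np.fft.rfftfreq(len(x), 1.0 / sr)
def harmonic_sum(f):
    return sum(spec[np.argmin(np.abs(freqs - h * f))] for h in (1, 2, 3))
cands = np.arange(80.0, 400.0, 1.0)
f_spectral = cands[np.argmax([harmonic_sum(f) for f in cands])]

# temporal model: strongest autocorrelation lag inside the plausible pitch range
ac = np.correlate(x, x, mode="full")[len(x) - 1:]
lag_lo, lag_hi = int(sr / 400), int(sr / 80)
lag = lag_lo + np.argmax(ac[lag_lo:lag_hi])
f_temporal = sr / lag
```

Both estimates land near 220 Hz on this clean tone; the interesting cases, which motivate combining the two models, are signals where one cue is ambiguous.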


In this paper, we design a novel deep learning based hybrid system for automatic chord recognition. Currently, the amount of annotated data available for training robust acoustic models is a bottleneck, as hand-annotating time-synchronized chord labels requires professional musical skill and considerable labor. As a solution to this problem, we construct a large set of time-synchronized MIDI-audio pairs, and use these data to train a Deep Residual Network (DRN) feature extractor, which can then estimate pitch class activations of real-world music audio recordings.
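Once pitch-class activations are available, a standard final step (shown here as a generic illustration, not necessarily the paper's decoder) is to match the 12-dimensional activation vector against binary triad templates:

```python
import numpy as np

NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def chord_templates():
    """Binary templates for the 24 major and minor triads."""
    templates = {}
    for root in range(12):
        for name, intervals in (("maj", (0, 4, 7)), ("min", (0, 3, 7))):
            t = np.zeros(12)
            t[[(root + i) % 12 for i in intervals]] = 1.0
            templates[f"{NOTES[root]}:{name}"] = t
    return templates

def recognize(activations, templates):
    """Pick the triad whose (normalized) template best correlates with the activations."""
    a = activations / (np.linalg.norm(activations) + 1e-12)
    return max(templates, key=lambda c: a @ (templates[c] / np.linalg.norm(templates[c])))

# toy pitch-class activations dominated by C, E, and G (a C major triad)
act = np.array([1.0, 0.05, 0.1, 0.05, 0.9, 0.1,
                0.05, 0.95, 0.05, 0.1, 0.05, 0.1])
label = recognize(act, chord_templates())
```

Real systems typically replace this frame-wise argmax with temporal smoothing (e.g. an HMM), but the template-matching core is the same.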


The task of estimating the fundamental frequency of a monophonic sound recording, also known as pitch tracking, is fundamental to audio processing, with applications in speech processing and music information retrieval. To date, the best performing techniques, such as the pYIN algorithm, are based on a combination of DSP pipelines and heuristics. While such techniques perform very well on average, there remain many cases in which they fail to correctly estimate the pitch.
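The core of the YIN family of methods, which pYIN builds on, is the cumulative-mean-normalized difference function: the first lag at which it dips below a threshold gives the period. The sketch below is a minimal illustration of that core only (pYIN itself adds probabilistic thresholding and HMM-based tracking, not shown); the parameter values are arbitrary.

```python
import numpy as np

def yin_pitch(x, sr, fmin=80.0, fmax=500.0, threshold=0.1):
    """Minimal YIN-style pitch estimate for a single frame of audio."""
    tau_max = int(sr / fmin)
    # difference function d(tau) over a fixed-length comparison window
    d = np.array([np.sum((x[:len(x) - tau_max] - x[tau:tau + len(x) - tau_max]) ** 2)
                  for tau in range(tau_max + 1)])
    # cumulative mean normalization (d'(0) = 1 by convention)
    dn = np.ones_like(d)
    cum = np.cumsum(d[1:])
    dn[1:] = d[1:] * np.arange(1, tau_max + 1) / (cum + 1e-12)
    tau_min = int(sr / fmax)
    below = np.flatnonzero(dn[tau_min:] < threshold)
    if below.size == 0:                      # no dip: fall back to the global minimum
        return sr / (tau_min + np.argmin(dn[tau_min:]))
    tau = tau_min + below[0]
    while tau + 1 <= tau_max and dn[tau + 1] < dn[tau]:
        tau += 1                             # walk down to the bottom of the dip
    return sr / tau
```

On a clean 220 Hz sine this lands within a couple of hertz of the true pitch; the failure cases mentioned in the abstract (octave errors, noisy or inharmonic signals) are exactly where the fixed threshold and heuristics of such pipelines break down.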