
Recent advances in deep learning have led to drastic improvements in speech segregation models. Despite their success and growing applicability, few efforts have been made to analyze the underlying principles these networks learn in order to perform segregation. Here we analyze the role of harmonicity in two state-of-the-art deep neural network (DNN)-based models: Conv-TasNet and DPT-Net. We evaluate their performance on mixtures of natural speech versus inharmonic speech, in which the harmonics are slightly jittered in frequency.
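The jitter manipulation can be sketched as follows. This is a minimal illustration only; `harmonic_complex`, the 8% jitter amount, and the sampling settings are assumptions, not the paper's exact stimulus protocol:

```python
import numpy as np

def harmonic_complex(f0, n_harmonics, sr=16000, dur=0.5, jitter=0.0, seed=0):
    """Synthesize a harmonic complex tone; a nonzero `jitter` shifts each
    harmonic by a random fraction of f0, producing an inharmonic variant."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(sr * dur)) / sr
    x = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        # harmonic k sits at k*f0; the jittered version perturbs it slightly
        fk = k * f0 + jitter * f0 * rng.uniform(-1, 1)
        x += np.sin(2 * np.pi * fk * t)
    return x / n_harmonics

harmonic = harmonic_complex(200.0, 10, jitter=0.0)
inharmonic = harmonic_complex(200.0, 10, jitter=0.08)  # ~8% jitter (illustrative)
```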


We introduce a new Nonnegative Matrix Factorization (NMF) model called Nonnegative Unimodal Matrix Factorization (NuMF), which augments NMF with a unimodality constraint on the columns of the basis matrix. NuMF finds applications, for example, in analytical chemistry. We first propose a simple but naive brute-force heuristic based on accelerated projected gradient, and then improve it with a multi-grid approach, for which we prove that the restriction operator preserves unimodality.
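As a rough illustration of the unimodality constraint itself (not the paper's accelerated projected-gradient or multi-grid algorithm), the helpers below check and naively enforce unimodality on a single basis column:

```python
import numpy as np

def is_unimodal(v):
    """A vector is unimodal if it is nondecreasing up to some peak
    and nonincreasing afterwards."""
    p = int(np.argmax(v))
    return bool(np.all(np.diff(v[:p + 1]) >= 0) and np.all(np.diff(v[p:]) <= 0))

def project_unimodal(v):
    """Naive brute-force heuristic: try every peak position, enforce the
    rising/falling shape with cumulative maxima, keep the closest result."""
    best, best_err = None, np.inf
    for p in range(len(v)):
        u = v.copy()
        u[:p + 1] = np.maximum.accumulate(u[:p + 1])       # nondecreasing prefix
        u[p:] = np.maximum.accumulate(u[p:][::-1])[::-1]   # nonincreasing suffix
        err = np.linalg.norm(u - v)
        if err < best_err:
            best, best_err = u, err
    return best
```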


One of the leading single-channel speech separation (SS) models is based on TasNet with a dual-path segmentation technique, in which the size of each segment remains unchanged throughout all layers. In contrast, our key finding is that multi-granularity features are essential for enhancing contextual modeling and computational efficiency. We introduce a self-attentive network with a novel sandglass shape, named Sandglasset, which advances the state-of-the-art (SOTA) SS performance at a significantly smaller model size and computational cost.
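The contrast between a fixed segment size and a multi-granularity schedule can be sketched like this; the segmentation helper and the particular size schedule are illustrative, not Sandglasset's actual configuration:

```python
import numpy as np

def segment(x, seg_size):
    """Chop a 1-D feature sequence into fixed-size segments (zero-padded)."""
    pad = (-len(x)) % seg_size
    x = np.pad(x, (0, pad))
    return x.reshape(-1, seg_size)

x = np.arange(64, dtype=float)

# Dual-path style: the same segment size at every block
fixed = [segment(x, 16).shape for _ in range(5)]

# Sandglass-style: segment granularity varies across blocks and returns
# to its initial value (this schedule is made up for illustration)
sizes = [16, 8, 4, 8, 16]
varied = [segment(x, s).shape for s in sizes]
```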


Hand-crafted spatial features (e.g., the inter-channel phase difference, IPD) play a fundamental role in recent deep-learning-based multi-channel speech separation (MCSS) methods. However, these manually designed spatial features are hard to incorporate into an end-to-end optimized MCSS framework. In this work, we propose an integrated architecture that learns spatial features directly from the multi-channel speech waveforms within an end-to-end speech separation framework. In this architecture, time-domain filters spanning the signal channels are trained to perform adaptive spatial filtering.
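A minimal sketch of time-domain spatial filtering across channels, assuming a filter-and-sum structure; the random taps here merely stand in for filters that would be learned end to end:

```python
import numpy as np

def spatial_filter(x, w):
    """Filter-and-sum across microphones: x has shape (channels, samples),
    w has shape (channels, taps). Each channel is convolved with its own
    FIR filter and the results are summed into one output signal."""
    return sum(np.convolve(x[c], w[c], mode="same") for c in range(x.shape[0]))

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1000))       # toy 4-mic mixture
w = rng.standard_normal((4, 32)) * 0.1   # stand-in for learned taps
y = spatial_filter(x, w)
```

With single-tap unit filters this reduces to plain channel summation, which is the degenerate (non-spatial) case.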


Functional connectivity analysis, which detects neuronal coactivation in the brain, can be performed efficiently using resting-state functional magnetic resonance imaging (rs-fMRI). Most existing research in this area employs correlation-based group-averaging strategies with spatial smoothing and temporal normalization of fMRI scans, the reliability of which depends heavily on the voxel resolution of the fMRI scan as well as on the scanning duration. Most studies estimating the connectivity of brain networks have used scanning periods of 5 to 11 minutes.
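The correlation-based connectivity estimate described above can be sketched with toy data; the region count and scan length here are arbitrary, and a real analysis would use preprocessed BOLD time series:

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy rs-fMRI data: 10 regions of interest, 200 time points
ts = rng.standard_normal((10, 200))

# Correlation-based functional connectivity: pairwise Pearson correlation
# between the regional time series yields a symmetric connectivity matrix.
fc = np.corrcoef(ts)
```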


Deep Clustering (DC) and Deep Attractor Networks (DANs) are data-driven approaches to monaural blind source separation.
Both provide astonishing single-channel performance but have not yet been generalized to block-online processing.
When separating speech in a continuous stream with a block-online algorithm, it must be determined in each block which output stream belongs to which speaker.
In this contribution we solve this block permutation problem by adding a speaker identification embedding to the DAN model structure.
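The block permutation problem admits a small sketch: given speaker embeddings from consecutive blocks, assign the current block's output streams to the previous block's speakers by maximizing embedding similarity. The brute-force matching below is only a stand-in for the paper's DAN-based mechanism:

```python
import numpy as np
from itertools import permutations

def resolve_permutation(prev_emb, cur_emb):
    """Match current output streams to previous speakers by maximizing
    the total cosine similarity of their speaker embeddings."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    n = len(cur_emb)
    best = max(permutations(range(n)),
               key=lambda p: sum(cos(prev_emb[i], cur_emb[p[i]]) for i in range(n)))
    return list(best)

prev = np.array([[1.0, 0.0], [0.0, 1.0]])   # embeddings from block t-1
cur = np.array([[0.1, 0.9], [0.9, 0.1]])    # block t arrived with streams swapped
```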


Robust speech processing in multi-talker environments requires effective speech separation. Recent deep learning systems have made significant progress toward solving this problem, yet it remains challenging, particularly in real-time, short-latency applications. Most methods attempt to construct a mask for each source in a time-frequency representation of the mixture signal, which is not necessarily the optimal representation for speech separation.
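The time-frequency masking formulation mentioned above can be sketched with toy magnitude spectrograms; a real system estimates the mask from the mixture alone, whereas this soft ratio mask is computed from the known sources for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy magnitude spectrograms of two sources (freq bins x frames);
# in practice these would come from an STFT of the waveforms.
s1 = rng.random((129, 50))
s2 = rng.random((129, 50))
mix = s1 + s2  # magnitudes add only approximately in real mixtures

# Soft ("ratio") mask for source 1, applied to the mixture spectrogram
mask1 = s1 / (s1 + s2 + 1e-8)
est1 = mask1 * mix
```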