Sorry, you need to enable JavaScript to visit this website.

ICASSP is the world’s largest and most comprehensive technical conference focused on signal processing and its applications. The 2019 conference will feature world-class presentations by internationally renowned speakers, cutting-edge session topics and provide a fantastic opportunity to network with like-minded professionals from around the world. Visit website

Predicting the gaze of a user can have important applications in hu- man computer interactions (HCI). They find applications in areas such as social interaction, driver distraction, human robot interaction and education. Appearance based models for gaze estimation have significantly improved due to recent advances in convolutional neural network (CNN). This paper proposes a method to predict the gaze of a user with deep models purely based on CNNs.


The ability to identify speech with similar emotional content is valuable to many applications, including speech retrieval, surveil- lance, and emotional speech synthesis. While current formulations in speech emotion recognition based on classification or regression are not appropriate for this task, solutions based on preference learn- ing offer appealing approaches for this task. This paper aims to find speech samples that are emotionally similar to an anchor speech sample provided as a query. This novel formulation opens interest- ing research questions.


Recent studies have shown that Convolutional Neural Networks (CNN) are relatively easy to attack through the generation of so-called adversarial examples. Such vulnerability also affects CNN-based image forensic tools. Research in deep learning has shown that adversarial examples exhibit a certain degree of transferability, i.e., they maintain part of their effectiveness even against CNN models other than the one targeted by the attack. This is a very strong property undermining the usability of CNN’s in security-oriented applications.


We give a brief history of the performance analysis of LMS.
Using averaging theory we show when and why the ‘independence
assumption’ ‘works’; we preface this with a fast
heuristic explanation of averaging methods, clarifying their
connection to the ‘ODE’ method. We then extend the discussion
to more recent distributed versions such as diffusion
LMS and consensus. While single node LMS is a single timescale
algorithm it turns out that distributed versions are twotime
scale systems, something that is not yet widely understood.


Detection of depression from speech has attracted significant research attention in recent years but remains a challenge, particularly for speech from diverse smartphones in natural environments. This paper proposes two sets of novel features based on speech landmark bigrams associated with abrupt speech articulatory events for depression detection from smartphone audio recordings. Combined with techniques adapted from natural language text processing, the proposed features further exploit landmark bigrams by discovering latent articulatory events.


We model the sampling and recovery of clustered graph signals as a reinforcement learning (RL) problem. The signal sampling is carried out by an agent which crawls over the graph and selects the most relevant graph nodes to sample. The goal of the agent is to select signal samples which allow for the most accurate recovery. The sample selection is formulated as a multi-armed bandit (MAB) problem, which lends naturally to learning efficient sampling strategies using the well-known gradient MAB algorithm.


Conventional approaches to matrix completion are sensitive to outliers and impulsive noise. This paper develops robust and computationally efficient M-estimation based matrix completion algorithms. By appropriately arranging the observed entries, and then applying alternating minimization, the robust matrix completion problem is converted into a set of regression M-estimation problems. Making use of differ- entiable loss functions, the proposed algorithm overcomes a weakness of the lp-loss (p ≤ 1), which easily gets stuck in an inferior point.


Consider a network with N nodes in d dimensions, and M overlapping subsets P_1,...,P_M (subnetworks). Assume that the nodes in a given P_i are observed in a local coordinate system. We wish to register the subnetworks using the knowledge of the observed coordinates. More precisely, we want to compute the positions of the N nodes in a global coordinate system, given P_1,...,P_M and the corresponding local coordinates. Among other applications, this problem arises in divide-and-conquer algorithms for localization of adhoc sensor networks.


In this paper, we are interested in exploiting textual and acoustic data of an utterance for the speech emotion classification task. The baseline approach models the information from audio and text independently using two deep neural networks (DNNs). The outputs from both the DNNs are then fused for classification. As opposed to using knowledge from both the modalities separately, we propose a framework to exploit acoustic information in tandem with lexical data.


A differential acoustic OFDM technique is presented to embed data imperceptibly in existing music. The method allows playing back music containing the data with a speaker without users noticing the embedded data channel. Using a microphone, the data can be recovered from the recording. Experiments with smartphone microphones show that transmission distances of 24 meters are possible, while achieving bit error ratios of less than 10 percent, depending on the environment.