
ICASSP 2024, the IEEE International Conference on Acoustics, Speech and Signal Processing, is the world's largest and most comprehensive technical conference focused on signal processing and its applications. The conference features world-class presentations by internationally renowned speakers and cutting-edge session topics, and provides an excellent opportunity to network with like-minded professionals from around the world.

- MULTI-MODALITY ACTION RECOGNITION BASED ON DUAL FEATURE SHIFT IN VEHICLE CABIN MONITORING
Driver Action Recognition (DAR) is crucial in vehicle cabin monitoring systems. In real-world applications, it is common for vehicle cabins to be equipped with cameras featuring different modalities. However, multi-modality fusion strategies for the DAR task within car cabins have rarely been studied. In this paper, we propose a novel yet efficient multi-modality driver action recognition method based on dual feature shift, named DFS. DFS first integrates complementary features across modalities by performing modality feature interaction.
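The dual feature shift itself is specific to the paper, but the general idea of modality feature interaction via channel shifting can be sketched in a few lines of PyTorch. Everything below (the function name, the swap-the-first-channels scheme, the shift ratio) is an illustrative assumption, not the authors' exact design:

```python
import torch

def dual_feature_shift(feat_a: torch.Tensor, feat_b: torch.Tensor,
                       shift_ratio: float = 0.125):
    """Exchange a fraction of channels between two modality features.

    feat_a, feat_b: tensors of shape (batch, channels, ...), e.g. RGB
    and IR feature maps with matching shapes. A slice of channels is
    swapped so each stream sees complementary information from the other.
    """
    c = feat_a.shape[1]
    n = max(1, int(c * shift_ratio))   # number of channels to exchange
    a_out, b_out = feat_a.clone(), feat_b.clone()
    a_out[:, :n] = feat_b[:, :n]       # modality B -> stream A
    b_out[:, :n] = feat_a[:, :n]       # modality A -> stream B
    return a_out, b_out

# Example: two 64-channel feature maps from different cabin cameras
rgb = torch.randn(2, 64, 7, 7)
ir = torch.randn(2, 64, 7, 7)
rgb_mix, ir_mix = dual_feature_shift(rgb, ir)
```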

- Recent Advances in Scalable Energy-Efficient and Trustworthy Spiking Neural Networks: From Algorithms to Technology
Neuromorphic computing and, in particular, spiking neural networks (SNNs) have become an attractive alternative to deep neural networks for a broad range of signal processing applications, processing static and/or temporal inputs from different sensory modalities, including audio and vision sensors. In this paper, we start with a description of recent advances in algorithmic and optimization innovations for efficiently training and scaling low-latency, energy-efficient SNNs for complex machine learning applications.
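For readers new to the area, most SNN training methods build on some variant of the leaky integrate-and-fire (LIF) neuron. A minimal sketch of its discrete-time dynamics, with the leak factor, threshold, and reset rule chosen purely for illustration:

```python
import numpy as np

def lif_neuron(inputs, v_th=1.0, leak=0.9):
    """Simulate a single leaky integrate-and-fire neuron.

    inputs: 1-D array of input currents, one per time step.
    The membrane potential leaks, integrates the input, and emits
    a binary spike (followed by a soft reset) when it crosses v_th.
    """
    v, spikes = 0.0, []
    for x in inputs:
        v = leak * v + x               # leaky integration
        s = 1 if v >= v_th else 0      # threshold comparison
        v = v - v_th * s               # soft reset after a spike
        spikes.append(s)
    return np.array(spikes)

print(lif_neuron(np.array([0.4, 0.5, 0.3, 0.9, 0.1])))
```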

- Training Ultra-Low-Latency Spiking Neural Networks from Scratch
Spiking neural networks (SNNs) have emerged as an attractive spatio-temporal computing paradigm for a wide range of low-power vision tasks. However, state-of-the-art (SOTA) SNN models either incur multiple time steps, which hinders their deployment in real-time use cases, or significantly increase training complexity. To mitigate this concern, we present a training framework (from scratch) for SNNs with ultra-low (down to 1) time steps that leverages the Hoyer regularizer. We calculate the threshold for each BANN layer as the Hoyer extremum of a clipped version of its activation map.
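The Hoyer measure of a tensor z is commonly written as ||z||_1/||z||_2, and one natural reading of the "Hoyer extremum" of an activation map is the ratio ||z||_2^2/||z||_1. The sketch below uses that reading as an assumption; consult the paper for the exact threshold rule:

```python
import torch

def hoyer_threshold(act: torch.Tensor, clip_val: float = 1.0):
    """Layer threshold as the Hoyer extremum ||z||_2^2 / ||z||_1
    of the clipped activation map (assumed formulation)."""
    z = act.clamp(0.0, clip_val)       # clipped activation map
    l1 = z.abs().sum()
    return z.pow(2).sum() / l1 if l1 > 0 else torch.tensor(clip_val)

act = torch.rand(1, 8, 4, 4)
thr = hoyer_threshold(act)
spikes = (act >= thr).float()          # single-time-step binary spikes
```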

- EFFICIENT VIDEO AND AUDIO PROCESSING WITH LOIHI 2
Loihi 2 is a fully event-based neuromorphic processor that supports a wide range of synaptic connectivity configurations and temporal neuron dynamics. Loihi 2's temporal, event-based paradigm is naturally well suited to processing data from event-based sensors such as a Dynamic Vision Sensor (DVS) or a silicon cochlea. However, this raises the question: how general are the signal processing efficiency gains of Loihi 2 over conventional computer architectures?
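For context, a DVS emits a sparse stream of address-events rather than frames; conventional pipelines typically densify this stream before processing, which is the baseline against which event-based processors like Loihi 2 are compared. A small sketch of that densification step, with the event tuple format assumed:

```python
import numpy as np

def events_to_frame(events, height=128, width=128):
    """Accumulate DVS address-events into a 2-channel frame.

    events: iterable of (t, x, y, polarity) tuples, an assumed
    address-event representation (AER). Channel 0 counts ON
    events, channel 1 counts OFF events.
    """
    frame = np.zeros((2, height, width), dtype=np.int32)
    for t, x, y, p in events:
        frame[0 if p > 0 else 1, y, x] += 1
    return frame

demo = [(0.001, 5, 7, 1), (0.002, 5, 8, -1), (0.003, 6, 7, 1)]
print(events_to_frame(demo, 16, 16).sum(axis=(1, 2)))  # [2 1]
```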

- BNMTRANS: A BRAIN NETWORK SEQUENCE-DRIVEN MANIFOLD-BASED TRANSFORMER FOR COGNITIVE IMPAIRMENT DETECTION USING EEG
Early identification of mild cognitive impairment (MCI) is crucial for the prevention of Alzheimer's disease. As neurodegenerative diseases progress, the synchronous activity observed in electroencephalography (EEG), which indicates functional connectivity, …
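The abstract is truncated here, so the paper's exact connectivity measure is not stated. As one common way to turn multichannel EEG synchrony into a brain-network matrix, a Pearson-correlation sketch (the choice of metric is an assumption, not necessarily the paper's):

```python
import numpy as np

def functional_connectivity(eeg: np.ndarray) -> np.ndarray:
    """Estimate a functional-connectivity (brain network) matrix.

    eeg: array of shape (channels, samples). Pearson correlation
    between channel pairs is one standard synchrony measure.
    """
    return np.corrcoef(eeg)

rng = np.random.default_rng(0)
eeg = rng.standard_normal((19, 1000))   # e.g. a 19-channel recording
adj = functional_connectivity(eeg)      # (19, 19) network matrix
```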

- Towards ASR robust spoken language understanding through in-context learning with word confusion networks
In the realm of spoken language understanding (SLU), numerous natural language understanding (NLU) methodologies have been adapted by supplying large language models (LLMs) with transcribed speech instead of conventional written text. In real-world scenarios, prior to input into an LLM, an automatic speech recognition (ASR) system generates an output transcript hypothesis, in which inherent errors can degrade subsequent SLU tasks.
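A word confusion network compactly encodes competing ASR hypotheses as weighted alternatives per slot. One plausible way to expose this uncertainty to an LLM for in-context learning is to serialize the alternatives into the prompt; the serialization format below is an illustrative assumption, not the paper's:

```python
# A word confusion network (WCN): each slot holds competing words
# with posterior probabilities from the ASR system.
wcn = [
    [("i", 0.9), ("i'd", 0.1)],
    [("want", 0.6), ("won't", 0.4)],
    [("a", 1.0)],
    [("flight", 0.8), ("fright", 0.2)],
]

def wcn_to_prompt(wcn):
    """Serialize the WCN so the LLM sees the alternatives and their
    weights instead of only a single 1-best transcript."""
    slots = ["/".join(f"{w}({p:.1f})" for w, p in slot) for slot in wcn]
    return "ASR alternatives: " + " ".join(slots)

print(wcn_to_prompt(wcn))
# ASR alternatives: i(0.9)/i'd(0.1) want(0.6)/won't(0.4) a(1.0) flight(0.8)/fright(0.2)
```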

- Common-slope modeling of late reverberation
The decaying sound field in rooms is typically described by energy decay functions (EDFs). Late reverberation can deviate considerably from the ideal diffuse field, for example in multiple connected rooms or with non-uniform distributions of absorptive material. This paper proposes the common-slope model of late reverberation, which describes spatial and directional late reverberation as linear combinations of exponential decays called common slopes.
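In this spirit, an EDF under such a model can be written as EDF(t) = N + sum_k A_k exp(-t ln(10^6)/T_k), where the T_k are shared reverberation times (60 dB decay times), the A_k are position- and direction-dependent amplitudes, and N is a noise floor. A sketch under that assumed form, which may differ from the paper's exact parameterization:

```python
import numpy as np

def common_slope_edf(t, amplitudes, decay_times, noise_level=0.0):
    """Energy decay function as a linear combination of common slopes.

    Each reverberation time T_k (seconds, time for a 60 dB decay)
    defines a shared exponential exp(-t * ln(1e6) / T_k); the
    amplitudes A_k weight these common slopes per position/direction.
    """
    edf = np.full_like(t, noise_level, dtype=float)
    for a, T in zip(amplitudes, decay_times):
        edf += a * np.exp(-t * np.log(1e6) / T)
    return edf

t = np.linspace(0, 2.0, 1000)
# Two coupled rooms: a fast and a slow decay sharing the same slopes
edf = common_slope_edf(t, amplitudes=[1.0, 0.05], decay_times=[0.3, 1.5])
edf_db = 10 * np.log10(edf / edf[0])    # normalized decay in dB
```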

- VOXTLM: UNIFIED DECODER-ONLY MODELS FOR CONSOLIDATING SPEECH RECOGNITION/SYNTHESIS AND SPEECH/TEXT CONTINUATION TASKS
We propose a decoder-only language model, VoxtLM, that can perform four tasks: speech recognition, speech synthesis, text generation, and speech continuation. VoxtLM integrates the text vocabulary with discrete speech tokens derived from self-supervised speech features and uses special tokens to enable multitask learning. Compared to a single-task model, VoxtLM exhibits a significant improvement in speech synthesis, with speech intelligibility improving from 28.9 to 5.6 (lower is better) and objective quality from 2.68 to 3.90.
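The abstract does not spell out the token layout, but the general recipe of a decoder-only multitask LM over mixed text and speech tokens can be sketched as follows; all special-token names and the ordering here are hypothetical, not VoxtLM's actual vocabulary:

```python
# Hypothetical special tokens signaling segment type and task.
BOS, TEXT, SPEECH, SEP = "<s>", "<text>", "<speech>", "<sep>"

def build_tts_sequence(text_tokens, speech_units):
    """Lay out one TTS training example for a decoder-only LM.

    The task is signaled purely by special tokens, so the same
    model can also do ASR, text LM, and speech continuation by
    reordering the text and speech segments.
    """
    return ([BOS, TEXT] + text_tokens + [SEP, SPEECH]
            + [f"<unit{u}>" for u in speech_units])

seq = build_tts_sequence(["hello", "world"], [17, 4, 4, 92])
print(seq)
# ['<s>', '<text>', 'hello', 'world', '<sep>', '<speech>', '<unit17>', ...]
```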