IEEE ICASSP 2024

IEEE ICASSP 2024 - The IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) is the world’s largest and most comprehensive technical conference focused on signal processing and its applications. The IEEE ICASSP 2024 conference will feature presentations by internationally renowned speakers and cutting-edge session topics, and will provide an opportunity to network with like-minded professionals from around the world.

BOOSTING ZERO-SHOT HUMAN-OBJECT INTERACTION DETECTION WITH VISION-LANGUAGE TRANSFER

Human-Object Interaction (HOI) detection is a crucial task that involves localizing interacting human-object pairs and identifying the actions being performed. Most existing HOI detectors are supervised and cannot discover unseen interactions in a zero-shot manner. Recently, transformer-based methods have superseded traditional CNN detectors by aggregating image-wide context, but they still suffer from the long-tail distribution problem in HOI. In this work, our primary focus is improving HOI detection in images, particularly in zero-shot scenarios.

poster.pdf

Sarma_ZSHOI_ICASSP_2024_poster (198)

Categories: Image/Video Processing

54 Views

Flow Dynamics Correction for Action Recognition

Various research studies indicate that action recognition performance depends heavily on the types of motion being extracted and on how accurately the human actions are represented. In this paper, we investigate different optical flows, and the features extracted from them, that capture both short-term and long-term motion dynamics. We perform power normalization on the magnitude component of optical flow for flow dynamics correction, to boost subtle motions or dampen sudden ones.
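To make the power-normalization idea concrete, here is a minimal sketch. The abstract does not give the exact formula, so this assumes the common form in which each flow vector keeps its direction while its magnitude m is replaced by m raised to an exponent gamma. The function name `power_normalize_flow` and the parameters `gamma` and `eps` are hypothetical, chosen only for illustration.

```python
import math

def power_normalize_flow(flow, gamma=0.5, eps=1e-8):
    """Rescale each (u, v) flow vector so its magnitude m becomes m ** gamma.

    Assumption: this is a generic power normalization, not necessarily the
    exact variant used in the paper. With 0 < gamma < 1, magnitudes below 1
    are boosted and magnitudes above 1 are dampened; direction is unchanged.
    """
    corrected = []
    for u, v in flow:
        mag = math.hypot(u, v)
        if mag < eps:
            # A zero vector has no direction to preserve; keep it at zero.
            corrected.append((0.0, 0.0))
            continue
        scale = (mag ** gamma) / mag
        corrected.append((u * scale, v * scale))
    return corrected

# One subtle motion (magnitude 0.05) and one sudden motion (magnitude 5.0).
flow = [(0.03, 0.04), (3.0, 4.0)]
corrected = power_normalize_flow(flow, gamma=0.5)
magnitudes = [math.hypot(u, v) for u, v in corrected]
```

With `gamma=0.5`, the subtle vector's magnitude grows from 0.05 to the square root of 0.05, while the sudden vector's magnitude shrinks from 5 to the square root of 5, which compresses the dynamic range of the flow field.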

icassp24_hal_poster.pdf

Poster for Flow Dynamics Correction for Action Recognition (ICASSP’24) (177)

Categories: Image/Video Processing

41 Views

SEMANTIC DISTILLATION AND STRUCTURAL ALIGNMENT NETWORK FOR FAKE NEWS DETECTION

In recent years, the rapid proliferation of multi-modal fake news has posed potential harm across various sectors of society, making the detection of multi-modal fake news crucial. Most existing methods cannot effectively reduce redundant information while preserving both semantic and structural information. To address these problems, this paper proposes a semantic distillation and structural alignment (SDSA) network. We design a semantic distillation module for modality-specific features to preserve task-relevant semantic information and eliminate redundant information.

Semantic distillation and structural aligement network.pdf

Semantic distillation and structural aligement network.pdf (200)

Categories: Pattern recognition and classification (MLR-PATT)

40 Views

TNFORMER: SINGLE-PASS MULTILINGUAL TEXT NORMALIZATION WITH A TRANSFORMER DECODER MODEL

Text Normalization (TN) is a pivotal pre-processing procedure in speech synthesis systems, which converts diverse forms of text into a canonical form suitable for correct synthesis. This work introduces a novel model, TNFormer, which innovatively transforms the TN task into a next token prediction problem, leveraging the structure of GPT with only Transformer decoders for efficient, single-pass TN. The strength of TNFormer lies not only in its ability to identify Non-Standard Words that require normalization but also in its aptitude for context-driven normalization in a single pass.

ICASSP2024-poster.pdf

ICASSP2024-poster.pdf (483)

Categories: Speech Synthesis and Generation, including TTS (SPE-SYNT)

90 Views

TokenMotion: Motion-Guided Vision Transformer For Video Camouflaged Object Detection Via Learnable Token Selection

Video Camouflaged Object Detection (VCOD) presents unique challenges in computer vision due to texture similarities between target objects and their surroundings, as well as irregular motion patterns caused by both object and camera movement. In this paper, we introduce TokenMotion (TMNet), which employs a transformer-based model to enhance VCOD by extracting motion-guided features using learnable token selection. Evaluated on the challenging MoCA-Mask dataset, TMNet achieves state-of-the-art performance in VCOD.

ICASSP2024_ARL-ASU_Updated_April_9.pptx

ICASSP2024_ARL-ASU_Updated_April_9.pptx (172)

Categories: Image, Video, and Multidimensional Signal Processing

41 Views

ESA: Expert-and-Samples-Aware Incremental Learning under Longtail Distribution

Most works in class incremental learning (CIL) assume disjoint sets of classes as tasks. Although a few works deal with overlapping sets of classes, they assume either a balanced data distribution or a mildly imbalanced one. Instead, in this paper, we explore one of the understudied real-world CIL settings where (1) different tasks can share some classes but with new data samples, and (2) the training data of each task follows a long-tail distribution. We call this setting CIL-LT.

ICASSP 2024 poster.pdf

ICASSP 2024 poster.pdf (197)

Categories: Pattern recognition and classification (MLR-PATT)

50 Views

Multi-Source Dynamic Interactive Network Collaborative Reasoning Image Captioning

Rich image and text features can largely improve the training of image captioning tasks. However, rich image and text features mean the incorporation of a large amount of unnecessary information. In our work, in order to fully explore and utilize the key information in images and text, we view the combination of image and text features as a data screening problem. The combination of image and text features is dynamically screened through a series of inference strategies with the aim of selecting the optimal image and text features.

20240416苏强ICASSP工作汇报.pptx

20240416苏强ICASSP工作汇报.pptx (192)

Categories: Multimedia communications and networking

38 Views

Subspace-Based Co-Array Processing for Nested Arrays Without Eigendecomposition

For computational efficiency, we propose two subspace-based methods that avoid eigendecomposition to address two typical problems in nested array processing, i.e., direction-of-arrival (DOA) estimation and noise elimination. In detail, to estimate DOA parameters, we judiciously arrange the segments extracted from the co-array model and then introduce a novel co-array-based orthogonal propagator method (COPM).
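As background for how a propagator method obtains a noise subspace without eigendecomposition, the following is a hedged sketch of the classical propagator method on a plain uniform linear array (ULA). It is not the authors' COPM and does not use a nested array or a co-array model. The function names, the half-wavelength spacing, the source angles, and the noise level are all assumptions made for illustration. The sketch splits the covariance matrix column-wise into `G` and `H`, solves `H ≈ G P` by least squares, and builds `Q = [P; -I]`, whose columns are orthogonal to the steering vectors of the sources.

```python
import numpy as np

def steering(theta_deg, m, d=0.5):
    """Steering vectors of an m-element ULA; d is the spacing in wavelengths."""
    theta = np.deg2rad(np.atleast_1d(theta_deg))
    n = np.arange(m)[:, None]
    return np.exp(2j * np.pi * d * n * np.sin(theta)[None, :])

def propagator_spectrum(R, k, grid_deg, d=0.5):
    """Pseudo-spectrum of the classical propagator method for k sources."""
    m = R.shape[0]
    G, H = R[:, :k], R[:, k:]                  # column partition of R
    P = np.linalg.lstsq(G, H, rcond=None)[0]   # least-squares propagator, k x (m-k)
    Q = np.vstack([P, -np.eye(m - k)])         # Q^H a(theta) is ~0 at source angles
    proj = Q.conj().T @ steering(grid_deg, m, d)
    return 1.0 / np.sum(np.abs(proj) ** 2, axis=0)

def estimate_doas(R, k, grid_deg):
    """Return the k grid angles with the largest local spectrum peaks."""
    spec = propagator_spectrum(R, k, grid_deg)
    peaks = [i for i in range(1, len(spec) - 1)
             if spec[i] > spec[i - 1] and spec[i] >= spec[i + 1]]
    peaks.sort(key=lambda i: spec[i], reverse=True)
    return sorted(float(grid_deg[i]) for i in peaks[:k])

# Illustrative setup (assumed values): 8 sensors, two uncorrelated unit-power
# sources at -20 and 30 degrees, and a small white-noise term.
m, k = 8, 2
A = steering([-20.0, 30.0], m)
R = A @ A.conj().T + 0.01 * np.eye(m)
grid = np.arange(-90.0, 90.01, 0.1)
doas = estimate_doas(R, k, grid)
```

In this setup the two estimates in `doas` should land close to the assumed source angles. The only linear-algebra step is a least-squares solve, which is the source of the computational savings over eigendecomposition-based subspace methods such as MUSIC.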

document.pdf

Paper pre-print (517)

Categories: Sensor Array Processing

38 Views

UNCERTAINTY-GUIDED PERSON SEARCH MODEL WITH AUXILIARY SHALLOW FEATURE EXPLORATION

Person search is a unified system aimed at jointly localizing and identifying a person of interest from a gallery of whole scene images. Due to its inherent properties, person search faces significant challenges from large scale variations, inaccurate detection boxes, and crowded scenes. To address these issues, we propose an uncertainty-guided framework coupled with auxiliary shallow feature exploration, which includes a shallow feature fusion module and an uncertainty-guided module.

poster_icassp_lizongyi.pdf

poster_icassp_lizongyi.pdf (232)

Categories: Bio Imaging and Signal Processing

43 Views

DATA DRIVEN GRAPHEME-TO-PHONEME REPRESENTATIONS FOR A LEXICON-FREE TEXT-TO-SPEECH

Grapheme-to-Phoneme (G2P) conversion is an essential first step in any modern, high-quality Text-to-Speech (TTS) system. Most current G2P systems rely on carefully hand-crafted lexicons developed by experts. This poses a two-fold problem. Firstly, the lexicons are generated using a fixed phoneme set, usually ARPABET or IPA, which might not be the optimal way to represent phonemes for all languages. Secondly, the man-hours required to produce such an expert lexicon are very high.

20240118060700_952384_4931.pdf

paper (347)

Categories: Audio and Acoustic Signal Processing

75 Views