- Acoustic Modeling for Automatic Speech Recognition (SPE-RECO)
- General Topics in Speech Recognition (SPE-GASR)
- Large Vocabulary Continuous Recognition/Search (SPE-LVCR)
- Lexical Modeling and Access (SPE-LEXI)
- Multilingual Recognition and Identification (SPE-MULT)
- Resource Constrained Speech Recognition (SPE-RCSR)
- Robust Speech Recognition (SPE-ROBU)
- Speaker Recognition and Characterization (SPE-SPKR)
- Speech Adaptation/Normalization (SPE-ADAP)
- Speech Analysis (SPE-ANLS)
- Speech Coding (SPE-CODI)
- Speech Enhancement (SPE-ENHA)
- Speech Perception and Psychoacoustics (SPE-SPER)
- Speech Production (SPE-SPRD)
- Speech Synthesis and Generation, including TTS (SPE-SYNT)
The language patterns followed by different speakers who play specific roles in conversational interactions provide valuable cues for the task of Speaker Role Recognition (SRR). Given the speech signal, existing algorithms typically try to find such patterns in the output of an Automatic Speech Recognition (ASR) system. In this work we propose an alternative way of revealing role-specific linguistic characteristics, by making use of role-specific ASR outputs, which are built by suitably rescoring the lattice produced after a first pass of ASR decoding.
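The rescoring idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an N-best list in place of a full lattice, add-alpha smoothed unigram language models in place of whatever role-specific models the paper uses, and two hypothetical roles ("therapist" and "patient") with made-up training text and scores. Every function and variable name here is illustrative.

```python
import math
from collections import Counter


def train_unigram(sentences, vocab, alpha=1.0):
    """Add-alpha smoothed unigram log-probabilities over a fixed vocabulary."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: math.log((counts[w] + alpha) / total) for w in vocab}


def rescore(nbest, lm, lm_weight=1.0):
    """Return the hypothesis text with the best acoustic + weighted LM score."""
    def combined(hyp):
        text, acoustic_logprob = hyp
        return acoustic_logprob + lm_weight * sum(lm[w] for w in text.split())
    return max(nbest, key=combined)[0]


# Hypothetical role-specific training text (an assumption, not data from the paper).
therapist_text = ["how do you feel about that", "tell me more about that"]
patient_text = ["i feel tired", "i do not know"]

# Hypothetical first-pass output: (hypothesis, acoustic log-score) pairs.
nbest = [("how do you feel", -10.0), ("i do not feel", -10.0)]

# Shared vocabulary, so every hypothesis word has a smoothed probability.
vocab = {w for s in therapist_text + patient_text for w in s.split()}
vocab |= {w for text, _ in nbest for w in text.split()}

therapist_lm = train_unigram(therapist_text, vocab)
patient_lm = train_unigram(patient_text, vocab)

# One role-specific output per role, obtained from the same first-pass hypotheses.
role_outputs = {
    "therapist": rescore(nbest, therapist_lm),
    "patient": rescore(nbest, patient_lm),
}
```

In this toy setup the two role-specific models pick different hypotheses from the same first-pass list, which is the kind of role-dependent divergence a downstream role classifier could exploit.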
The ability to identify speech with similar emotional content is valuable to many applications, including speech retrieval, surveillance, and emotional speech synthesis. While current formulations of speech emotion recognition based on classification or regression are not appropriate for this task, solutions based on preference learning offer an appealing alternative. This paper aims to find speech samples that are emotionally similar to an anchor speech sample provided as a query. This novel formulation opens interesting research questions.
Audio-signal acquisition as part of wearable sensing adds an important dimension for applications such as understanding human behaviors. As part of a large study on workplace behaviors, we collected audio data from individual hospital staff using custom wearable recorders. The audio features collected were limited to preserve the privacy of the interactions in the hospital. A first step towards audio processing is to identify the foreground speech of the person wearing the audio badge.
The results of spoofing detection systems proposed during the ASVspoof Challenges 2015 and 2017 confirmed the prospects for detecting unforeseen spoofing trials in the microphone channel. However, the telephone channel presents much more challenging conditions for spoofing detection, due to limited bandwidth, various coding standards and channel effects. Research on the topic has thus far only made use of program codecs and other telephone channel emulations. Such emulations do not quite match real telephone spoofing attacks.
In this work, we consider the task of automatic classification of Amyotrophic Lateral Sclerosis (ALS) patients and healthy subjects based on acoustic and articulatory features, using speech tasks. In particular, we compare the roles of different types of speech tasks, namely rehearsed speech, spontaneous speech and repeated words, for this purpose. Simultaneous articulatory and speech data were recorded from 8 healthy controls and 8 ALS patients using AG501 for the classification experiments.
In this paper, we study the role of long-time analytic phase of speech
signals in spoken language recognition (SLR) and employ a set
of features termed instantaneous frequency cepstral coefficients
(IFCC). We extract IFCC from long-time analytic phase, in an effort
to capture long-range acoustic features from speech signals. These
features are used in combination with the traditional shifted delta
cepstral coefficients (SDCC) for SLR. As the SDCC are extracted
from spectral magnitude and IFCC are from analytic phase, they