ISCSLP 2016

Welcome to ISCSLP 2016 - October 17-20, 2016, Tianjin, China

The ISCSLP will be hosted by Tianjin University. Tianjin has a reputation throughout China for being extremely friendly, safe and a place of delicious food. Welcome to Tianjin to attend the ISCSLP2016. The 10th International Symposium on Chinese Spoken Language Processing (ISCSLP 2016) will be held on October 17-20, 2016 in Tianjin. ISCSLP is a biennial conference for scientists, researchers, and practitioners to report and discuss the latest progress in all theoretical and technological aspects of spoken language processing. While the ISCSLP is focused primarily on Chinese languages, works on other languages that may be applied to Chinese speech and language are also encouraged. The working language of ISCSLP is English.

Multi-Task Joint-Learning for Robust Voice Activity Detection

Read more about Multi-Task Joint-Learning for Robust Voice Activity Detection
Log in to post comments

Model based VAD approaches have been widely used and
achieved success in practice. These approaches usually cast
VAD as a frame-level classification problem and employ statistical
classifiers, such as Gaussian Mixture Model (GMM) or
Deep Neural Network (DNN) to assign a speech/silence label
for each frame. Due to the frame independent assumption classification,
the VAD results tend to be fragile. To address this
problem, in this paper, a new structured multi-frame prediction
DNN approach is proposed to improve the segment-level

zhuang-iscslp16-slides.pdf

zhuang-iscslp16-slides.pdf (349)

Categories:: Neural network learning (MLR-NNLR)

37 Views

Senone I-Vectors for Robust Speaker Verification

Read more about Senone I-Vectors for Robust Speaker Verification
Log in to post comments

ISCSLP16_SenoneIvector.pdf

ISCSLP16_SenoneIvector.pdf (698)

Categories:: Speaker Recognition and Characterization (SPE-SPKR)

4 Views

Investigation of the Spatiotemporal Dynamics of the Brain during Perceiving Words

Read more about Investigation of the Spatiotemporal Dynamics of the Brain during Perceiving Words
Log in to post comments

To investigate the temporal dynamics and spatial representations in speech perception and processing, electroencephalograph (EEG) signals were recorded when subjects listening Chinese words and pseudo words.

poster-syk.pdf

poster-syk.pdf (336)

Categories:: Speech Perception and Psychoacoustics (SPE-SPER)

21 Views

An Interface Research on Rhetorical Structure and Prosody Features in Chinese Reading Texts

This paper conducted an interface research on rhetorical and prosodic aspects of Chinese reading discourses within the Rhetorical Structure Theory (RST) framework. Ten discourses in 3 genres (Commentary, Narrative, and Descriptive) from the Annotated Speech Corpus of Chinese Discourse (ASCCD) were diagrammed in RST. The recordings from 5 males and 5 females were annotated and further analyzed acoustically and statistically by applying Praat and R.

Liang_O10-4-1016.pptx

Liang_O10-4-1016.pptx (352)

Liang_O10-4-1016.pptx

Liang_O10-4-1016.pptx (778)

Liang_O10-4-1018.pptx

Liang_O10-4-1018.pptx (763)

Categories:: Speech Production (SPE-SPRD)

9 Views

A Speaker-Dependent Deep Learning Approach to Joint Speech Separation and Acoustic Modeling for Multi-Talker Automatic Speech Recognition

We propose a novel speaker-dependent (SD) approach to joint training of deep neural networks (DNNs) with an explicit speech separation structure for multi-talker speech recognition in a single-channel setting. First, a multi-condition training strategy is designed for a SD-DNN recognizer in multi-talker scenarios, which can significantly reduce the decoding runtime and improve the recognition accuracy over the approaches that use speaker-independent DNN models with a complicated joint decoding framework.

Yanhui_ISCSLP2016_oral.pdf

Yanhui_ISCSLP2016_oral.pdf (690)

Categories:: Robust Speech Recognition (SPE-ROBU)

17 Views

Dialog State Tracking for Interview Coaching Using Two-Level LSTM

Read more about Dialog State Tracking for Interview Coaching Using Two-Level LSTM
Log in to post comments

This study presents an approach to dialog state tracking (DST) in an interview conversation by using the long short-term memory (LSTM) and artificial neural network (ANN). First, the techniques of word embedding are employed for word representation by using the word2vec model. Then, each input sentence is represented by a sentence hidden vector using the LSTM-based sentence model. The sentence hidden vectors for each sentence are fed to the LSTM-based answer model to map the interviewee’s answer to an answer hidden vector.

ISCSLP-2016-1012-MH.pdf

ISCSLP-2016-1012-MH.pdf (764)

Categories:: Spoken Language Understanding (SLP-UNDE)

4 Views

Dialog State Tracking for Interview Coaching Using Two-Level LSTM

Read more about Dialog State Tracking for Interview Coaching Using Two-Level LSTM
Log in to post comments

ISCSLP-2016-1012-MH.pdf

ISCSLP-2016-1012-MH.pdf (799)

Categories:: Spoken Language Understanding (SLP-UNDE)

7 Views

Digit-dependent Local I-Vector for Text-Prompted Speaker Verification with Random Digit Sequences

The widely adopted i-vector performances well in text-independent speaker verification with long speech duration. How to integrate the state-of-the-art i-vector framework into the text-prompted speaker verification is addressed in this paper. To take advantage of the lexical information and enhance the performance for speaker verification with random digit sequences, this paper proposes to extract a set of digit-dependent local i-vectors from the utterance instead of extracting a single i-vector. The digit-dependent local i-vector is considered

ISCSLP2016_PeixinChen.pdf

ISCSLP2016_PeixinChen.pdf (325)

Categories:: Speaker Recognition and Characterization (SPE-SPKR)

11 Views

Realizing Speech to Gesture Conversion by Keyword Spotting

Read more about Realizing Speech to Gesture Conversion by Keyword Spotting
Log in to post comments

The paper proposed a method to realize a speech-to-gesture conversion for communication between normal and speech-impaired people. Keyword spotting was employed to recognize the keywords from input speech signals. At the same time, the three dimensional gesture models of keywords were built by 3D modeling technology according to the "Chinese sign language". The speech-to-gesture conversion was finally realized by playing the corresponding 3D gestures with OpenGL from the results of keyword spotting.

ISCSLP 2016 - 139.pdf

keyword spotting, gesture modeling, speech to gesture conversion (67)

Categories:: Spoken and Multimodal Dialog Systems and Applications (SLP-SMMD)

12 Views

Perceptual Evaluation of Natural and Synthesized Speech with Prosodic Focus in Mandarin Production of American Learners

Natural and synthesized speech in L2 Mandarin produced by American English learners was evaluated by native Mandarin speakers to identify focus status and rate the naturalness of the speech. The results reveal that natural speech was recognized and rated better than synthesized speech, early learners’ speech better than late learners’ speech, focused sentences better than no-focus sentences, and initial focus and medial focus better than final focus. Tones of in-focus words interacted with focus status of the sentence and speaker group.

ChenEtAl._ISCSLP2016_poster.pdf

ChenEtAl._ISCSLP2016_poster.pdf (817)

Categories:: Speech Processing

4 Views

Pages