
VOICE CONVERSION THROUGH RESIDUAL WARPING IN A SPARSE, ANCHOR-BASED REPRESENTATION OF SPEECH

Citation Author(s):
Christopher Liberatore, Guanlong Zhao, Ricardo Gutierrez-Osuna
Submitted by:
Christopher Lib...
Last updated:
12 April 2018 - 7:47pm
Document Type:
Poster
Document Year:
2018
Event:
Presenters:
Christopher Liberatore
Paper Code:
1149
 

In previous work, we presented a Sparse, Anchor-Based Representation of speech (SABR) that uses phonemic “anchors” to represent an utterance with a set of sparse non-negative weights. SABR is speaker-independent: combining weights from a source speaker with anchors from a target speaker yields a voice conversion. Here, we present an extension of the original SABR that significantly improves voice conversion synthesis. Namely, we take the residual signal from the SABR decomposition of the source speaker’s utterance and warp it to the target speaker’s space using a weighted warping function learned from pairs of source-target anchors. Using subjective and objective evaluations, we examine the effect of adding the warped residual (SABR+Res) to the original synthesis (SABR). Listeners rated SABR+Res with a mean opinion score (MOS) of 3.6, a significant improvement over 2.2 MOS for SABR alone (p<0.01) and 2.5 MOS for a baseline GMM method (p<0.01). In an XAB speaker identity test, listeners correctly identified the speaker identity for SABR+Res (81%) and SABR (84%) about as frequently as for the GMM method (82%); neither difference was significant (p=0.70 and p=0.35, respectively). These results indicate that adding the warped residual can dramatically improve synthesis quality while retaining the desirable speaker-independent qualities of SABR models.
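The pipeline described in the abstract can be illustrated with a toy, single-frame sketch. This is not the authors' implementation: the abstract does not name the sparse solver or the form of the warping function, so both are assumptions here. The weights come from plain projected gradient descent, which enforces non-negativity but not sparsity, and the "weighted warping function" is stood in for by hypothetical per-anchor warp matrices blended by the normalized weights. All function and parameter names (`nonneg_weights`, `convert`, `warps`) are illustrative.

```python
# Toy sketch of SABR and SABR+Res on a single feature frame.
# ASSUMPTIONS: the solver and the matrix form of the warp are illustrative
# stand-ins; the abstract specifies neither.

def combine(weights, anchors):
    """Weighted sum of anchor vectors: sum_k weights[k] * anchors[k]."""
    dim = len(anchors[0])
    return [sum(w * a[d] for w, a in zip(weights, anchors)) for d in range(dim)]

def nonneg_weights(frame, anchors, steps=2000, lr=0.05):
    """Projected gradient descent on 0.5 * ||frame - sum_k w_k a_k||^2 with w >= 0.
    (Assumed solver; lr must be small relative to the anchor norms.)"""
    w = [0.0] * len(anchors)
    for _ in range(steps):
        err = [r - f for r, f in zip(combine(w, anchors), frame)]
        grad = [sum(e * a_d for e, a_d in zip(err, a)) for a in anchors]
        w = [max(0.0, wk - lr * g) for wk, g in zip(w, grad)]  # project onto w >= 0
    return w

def apply_matrix(matrix, vec):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def convert(frame, src_anchors, tgt_anchors, warps):
    """Return (weights, SABR frame, SABR+Res frame) for one source frame."""
    w = nonneg_weights(frame, src_anchors)
    sabr = combine(w, tgt_anchors)  # source weights applied to target anchors
    residual = [f - r for f, r in zip(frame, combine(w, src_anchors))]
    total = sum(w) or 1.0  # avoid division by zero when all weights vanish
    warped = [0.0] * len(frame)
    for wk, warp in zip(w, warps):  # blend hypothetical per-anchor warps
        warped = [acc + (wk / total) * x
                  for acc, x in zip(warped, apply_matrix(warp, residual))]
    sabr_res = [s + r for s, r in zip(sabr, warped)]
    return w, sabr, sabr_res
```

In this sketch, a call such as `convert(frame, src_anchors, tgt_anchors, warps)` takes one source frame, one anchor vector per phoneme for each speaker, and one warp matrix per anchor pair. With identity warp matrices, SABR+Res reduces to the SABR output plus the unmodified source residual.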
