VOICE CONVERSION THROUGH RESIDUAL WARPING IN A SPARSE, ANCHOR-BASED REPRESENTATION OF SPEECH

In previous work, we presented a Sparse, Anchor-Based Representation of speech (SABR) that uses phonemic “anchors” to represent an utterance with a set of sparse non-negative weights. SABR is speaker-independent: combining weights from a source speaker with anchors from a target speaker can be used for voice conversion. Here, we present an extension of the original SABR that significantly improves voice conversion synthesis. Namely, we take the residual signal from the SABR decomposition of the source speaker’s utterance and warp it to the target speaker’s space using a weighted warping function learned from pairs of source-target anchors. Using subjective and objective evaluations, we examine the effect of adding the warped residual (SABR+Res) to the original synthesis (SABR). Specifically, listeners rated SABR+Res with a mean opinion score (MOS) of 3.6, a significant improvement over 2.2 MOS for SABR alone (p<0.01) and 2.5 MOS for a baseline GMM method (p<0.01). In an XAB speaker identity test, listeners correctly identified the speaker identity for SABR+Res (81%) and SABR (84%) about as frequently as for the GMM method (82%); the differences were not significant (p=0.70 and p=0.35, respectively). These results indicate that adding the warped residual can dramatically improve synthesis while retaining the desirable speaker-independent qualities of SABR models.
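The per-frame pipeline described in the abstract (decompose, take the residual, warp it, add it back) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the poster: the projected-gradient solver stands in for whatever sparse solver SABR actually uses (it enforces non-negativity only, not sparsity), the per-anchor scaling factors `alphas` are a hypothetical form of the learned warping function, and all function names are invented.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def combine(weights, anchors):
    """Weighted sum of anchor vectors: sum_k weights[k] * anchors[k]."""
    dim = len(anchors[0])
    return [sum(w * a[d] for w, a in zip(weights, anchors)) for d in range(dim)]

def nonneg_weights(frame, anchors, steps=2000, lr=0.01):
    """Non-negative least squares by projected gradient descent.

    Assumption: a stand-in solver; the step size suits small,
    roughly unit-norm anchors only.
    """
    w = [0.0] * len(anchors)
    for _ in range(steps):
        err = [r - x for r, x in zip(combine(w, anchors), frame)]
        grad = [dot(err, a) for a in anchors]
        # Gradient step, then clip negative weights to zero.
        w = [max(0.0, wk - lr * g) for wk, g in zip(w, grad)]
    return w

def warp(vec, alpha):
    """Resample vec on a scaled axis: out[i] = vec at position i * alpha.

    Assumption: a linear axis scaling with linear interpolation is a
    hypothetical warping function, chosen for simplicity.
    """
    n = len(vec)
    out = []
    for i in range(n):
        pos = min(max(i * alpha, 0.0), n - 1.0)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append((1.0 - frac) * vec[lo] + frac * vec[hi])
    return out

def convert_frame(frame, src_anchors, tgt_anchors, alphas):
    """Return (weights, SABR output, SABR+Res output) for one frame."""
    w = nonneg_weights(frame, src_anchors)
    # Residual: the part of the source frame the anchors do not explain.
    residual = [x - r for x, r in zip(frame, combine(w, src_anchors))]
    # Frame-level warp: per-anchor factors averaged by the frame's weights.
    total = sum(w)
    alpha = sum(wk * ak for wk, ak in zip(w, alphas)) / total if total > 0 else 1.0
    sabr = combine(w, tgt_anchors)  # source weights, target anchors
    sabr_res = [s + r for s, r in zip(sabr, warp(residual, alpha))]
    return w, sabr, sabr_res
```

Calling `convert_frame(frame, src_anchors, tgt_anchors, alphas)` with one feature vector and matched lists of source and target anchors returns the weights together with both syntheses, so the contribution of the warped residual can be inspected directly by comparing the two outputs.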

ICASSP2018Poster.v4.pdf
