Exploring Phonetic Context-Aware Lip-Sync For Talking Face Generation
- DOI: 10.60864/524y-0977
- Citation Author(s):
- Submitted by: Se Jin Park
- Last updated: 6 June 2024 - 10:54am
- Document Type: Poster
- Document Year: 2024
- Presenters: Se Jin Park
- Paper Code: IVMSP-P21
- Categories:
Talking face generation is the challenging task of synthesizing a natural and realistic face whose lip movements are accurately synchronized with the given audio. Due to co-articulation, where an isolated phone is influenced by the preceding and following phones, the articulation of a phone varies with its phonetic context. Modeling lip motion with the phonetic context can therefore generate more spatio-temporally aligned lip movement. In this respect, we investigate the role of phonetic context in generating lip motion for talking face generation. We propose the Context-Aware Lip-Sync (CALS) framework, which explicitly leverages phonetic context to generate lip movement for the target face. CALS comprises an Audio-to-Lip module and a Lip-to-Face module. The former is pretrained with masked learning to map each phone to a contextualized lip motion unit; this contextualized lip motion unit then guides the latter in synthesizing the target identity with context-aware lip motion. Through extensive experiments, we verify that simply exploiting the phonetic context in the proposed CALS framework effectively enhances spatio-temporal alignment. We also demonstrate the extent to which phonetic context assists lip synchronization and find the effective window size for lip generation to be approximately 1.2 seconds.
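To make the two-module design concrete, below is a minimal PyTorch sketch of the pipeline described above. Everything in it is an illustrative assumption rather than the authors' implementation: the dimensions (`AUDIO_DIM`, `LIP_DIM`), the 25 fps frame rate behind `CONTEXT_FRAMES`, the Transformer encoder, and the linear fusion standing in for the face renderer are all hypothetical, since this page does not describe the actual architecture.

```python
import torch
import torch.nn as nn

# Hypothetical sizes -- none of these values appear on this page.
AUDIO_DIM = 80                   # e.g., mel-spectrogram bins per frame
LIP_DIM = 256                    # size of a contextualized lip motion unit
FPS = 25                         # assumed video frame rate
CONTEXT_FRAMES = int(1.2 * FPS)  # ~1.2 s, the effective window reported


class AudioToLip(nn.Module):
    """Audio-to-Lip module: maps per-frame audio (phone) features to
    contextualized lip motion units. In CALS this module is pretrained
    with masked learning, so each output unit reflects neighboring
    phones (co-articulation), not just the current one."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(AUDIO_DIM, LIP_DIM)
        layer = nn.TransformerEncoderLayer(
            d_model=LIP_DIM, nhead=8, dim_feedforward=4 * LIP_DIM,
            batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.mask_token = nn.Parameter(torch.zeros(LIP_DIM))

    def forward(self, audio, mask=None):
        # audio: (batch, frames, AUDIO_DIM)
        # mask:  (batch, frames) bool, True where the frame is hidden
        x = self.proj(audio)
        if mask is not None:
            # Replace masked frames with a learned token; the encoder must
            # recover their lip units from the surrounding phonetic context.
            x = torch.where(mask.unsqueeze(-1),
                            self.mask_token.expand_as(x), x)
        return self.encoder(x)  # (batch, frames, LIP_DIM)


class LipToFace(nn.Module):
    """Lip-to-Face module: fuses lip motion units with a target identity
    embedding. A real implementation would decode to face frames; here a
    linear fusion stands in for the renderer."""

    def __init__(self, id_dim=128):
        super().__init__()
        self.fuse = nn.Linear(LIP_DIM + id_dim, LIP_DIM)

    def forward(self, lip_units, identity):
        # lip_units: (batch, frames, LIP_DIM); identity: (batch, id_dim)
        ident = identity.unsqueeze(1).expand(-1, lip_units.size(1), -1)
        return self.fuse(torch.cat([lip_units, ident], dim=-1))


if __name__ == "__main__":
    audio = torch.randn(2, CONTEXT_FRAMES, AUDIO_DIM)  # ~1.2 s window
    mask = torch.rand(2, CONTEXT_FRAMES) < 0.3         # hide 30% of frames
    lip_units = AudioToLip()(audio, mask)
    faces = LipToFace()(lip_units, torch.randn(2, 128))
    print(lip_units.shape, faces.shape)
```

The masking step captures the spirit of the masked-learning pretraining: hiding frames forces the encoder to predict lip units from the surrounding window of phones, which is exactly where the reported ~1.2-second effective context comes into play.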