Abstract
This disclosure describes a data collection approach for obtaining a high-quality, paired prosody dataset to support cross-lingual prosody translation in voice dubbing and other speech generation applications. Existing datasets lack the nuanced prosodic features essential for natural, expressive cross-lingual speech synthesis, so synthetic speech often sounds flat or lacks emotional resonance. Per techniques of this disclosure, audio spoken by professional voice actors or other individuals proficient in a language is recorded in multiple languages, yielding a dataset of paired recordings with matching content. The recordings capture nuanced prosody such as rhythm, emotion, and emphasis. English recordings serve as prosodic references, allowing target-language audio to accurately mirror the expressive elements of the source. The collected data is usable to train speech generation models that synthesize speech retaining the expressive qualities of the source audio across languages. The dataset includes appropriate metadata, such as prosody markers and speaker attributes.
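To make the shape of such a dataset concrete, the following is a minimal sketch of how one paired record with prosody metadata might be represented. The disclosure does not specify a schema; every class, field name, and value here (ProsodyMarker, PairedRecording, the file paths, and the sample transcripts) is an illustrative assumption, not part of the published design.

```python
from dataclasses import dataclass, field

@dataclass
class ProsodyMarker:
    """One prosodic annotation on a recording (hypothetical schema)."""
    kind: str       # e.g. "emphasis", "emotion", "rhythm" (assumed categories)
    start_s: float  # annotation start time, in seconds
    end_s: float    # annotation end time, in seconds
    label: str      # free-form label, e.g. the emphasized word

@dataclass
class PairedRecording:
    """One cross-lingual pair: an English prosodic reference plus a
    target-language rendition of the same content (hypothetical schema)."""
    pair_id: str
    speaker_id: str
    source_lang: str   # prosodic reference language, e.g. "en"
    target_lang: str   # e.g. "es"
    source_audio_path: str
    target_audio_path: str
    transcript_source: str
    transcript_target: str
    prosody_markers: list[ProsodyMarker] = field(default_factory=list)
    speaker_attributes: dict[str, str] = field(default_factory=dict)

# Example record (all values invented for illustration)
rec = PairedRecording(
    pair_id="pair-0001",
    speaker_id="va-017",
    source_lang="en",
    target_lang="es",
    source_audio_path="audio/en/pair-0001.wav",
    target_audio_path="audio/es/pair-0001.wav",
    transcript_source="What a wonderful surprise!",
    transcript_target="¡Qué sorpresa tan maravillosa!",
    prosody_markers=[ProsodyMarker("emphasis", 0.8, 1.4, "wonderful")],
    speaker_attributes={"gender": "female", "age_range": "30-39"},
)
```

A record like this pairs each target-language utterance with its English reference, so a training pipeline can condition target-language synthesis on the reference's prosody markers.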
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Klein, Daniel V. and McCartney, Paul, "Cross-Lingual Prosody Preservation in Speech Generation Applications", Technical Disclosure Commons, (April 03, 2025)
https://www.tdcommons.org/dpubs_series/7966