Abstract
This disclosure describes a data collection approach for obtaining a high-quality, paired prosody dataset to support cross-lingual prosody translation in voice dubbing and other speech generation applications. Existing datasets lack the nuanced prosodic features essential for natural, expressive cross-lingual speech synthesis, so synthetic speech often sounds flat or lacks emotional resonance. Per techniques of this disclosure, audio spoken by professional voice actors or other individuals proficient in a language is recorded in multiple languages, yielding a dataset of paired recordings with matching content. The recordings capture nuanced prosody such as rhythm, emotion, and emphasis. English recordings serve as prosodic references, allowing target-language audio to accurately mirror the expressive elements of the source. The collected data is usable to train speech generation models that synthesize speech retaining the expressive qualities of the source audio across languages. The dataset includes appropriate metadata, such as prosody markers and speaker attributes.
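To make the shape of such a dataset concrete, the following is a minimal sketch of how one paired record with prosody metadata might be represented. The disclosure does not specify a schema; every class, field name, and value here (ProsodyMarker, PairedRecording, the file paths, and the sample transcripts) is an illustrative assumption, not part of the published design.

```python
from dataclasses import dataclass, field

@dataclass
class ProsodyMarker:
    """One prosodic annotation on a recording (hypothetical schema)."""
    kind: str       # e.g. "emphasis", "emotion", "rhythm" (assumed categories)
    start_s: float  # annotation start time, in seconds
    end_s: float    # annotation end time, in seconds
    label: str      # free-form label, e.g. the emphasized word

@dataclass
class PairedRecording:
    """One cross-lingual pair: an English prosodic reference plus a
    target-language rendition of the same content (hypothetical schema)."""
    pair_id: str
    speaker_id: str
    source_lang: str   # prosodic reference language, e.g. "en"
    target_lang: str   # e.g. "es"
    source_audio_path: str
    target_audio_path: str
    transcript_source: str
    transcript_target: str
    prosody_markers: list[ProsodyMarker] = field(default_factory=list)
    speaker_attributes: dict[str, str] = field(default_factory=dict)

# Example record (all values invented for illustration)
rec = PairedRecording(
    pair_id="pair-0001",
    speaker_id="va-017",
    source_lang="en",
    target_lang="es",
    source_audio_path="audio/en/pair-0001.wav",
    target_audio_path="audio/es/pair-0001.wav",
    transcript_source="What a wonderful surprise!",
    transcript_target="¡Qué sorpresa tan maravillosa!",
    prosody_markers=[ProsodyMarker("emphasis", 0.8, 1.4, "wonderful")],
    speaker_attributes={"gender": "female", "age_range": "30-39"},
)
```

A record like this pairs each target-language utterance with its English reference, so a training pipeline can condition target-language synthesis on the reference's prosody markers.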
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Klein, Daniel V. and McCartney, Paul, "Cross-Lingual Prosody Preservation in Speech Generation Applications", Technical Disclosure Commons, (April 03, 2025)
https://www.tdcommons.org/dpubs_series/7966