Abstract

A text-to-speech (TTS) converter typically pairs a prosodic model, which generates acoustic parameters from linguistic features, with a neural vocoder. In such a configuration, some feature values can be difficult for the neural vocoder to process, resulting in audio artifacts. This disclosure describes techniques to improve neural vocoder performance, e.g., to reduce audio artifacts, to make the vocoder more robust to unusual acoustic feature variations, and to make it more forgiving of errors made by the feature generator. The techniques entail the use of an auxiliary training path driven by synthetic training examples, which are generated by running CHiVE inference with random samples drawn sufficiently far from the mean (zero).
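As an illustration of the idea of sampling "far enough from the mean", the following minimal Python sketch draws zero-mean random vectors and keeps only those beyond a distance threshold, then passes them to an inference function to produce synthetic examples. It is a sketch under stated assumptions, not the method of the disclosure: the function names, the standard-normal distribution, the Euclidean-norm threshold, and the `prosody_inference` callable (a hypothetical stand-in for CHiVE inference) are all illustrative. How the resulting examples feed the auxiliary training path is not covered here.

```python
import math
import random


def sample_far_from_mean(dim, min_norm, rng, max_tries=10_000):
    """Draw a standard-normal vector whose Euclidean norm is at least min_norm.

    Uses rejection sampling: redraw until the sample lies far enough from the
    mean (zero). The norm-based criterion is an assumption for illustration.
    """
    for _ in range(max_tries):
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if math.sqrt(sum(v * v for v in z)) >= min_norm:
            return z
    raise RuntimeError("no sample met the distance threshold; lower min_norm")


def make_synthetic_examples(prosody_inference, linguistic_features,
                            dim, min_norm, seed=0):
    """Build (linguistic features, sample, acoustic features) triples.

    prosody_inference is a hypothetical callable standing in for the prosodic
    model's inference step; it takes (features, sample) and returns acoustic
    features.
    """
    rng = random.Random(seed)
    examples = []
    for feats in linguistic_features:
        z = sample_far_from_mean(dim, min_norm, rng)
        examples.append((feats, z, prosody_inference(feats, z)))
    return examples
```

Raising `min_norm` pushes the generated acoustic features further toward unusual values, at the cost of more rejected draws; a suitable threshold would have to be chosen empirically.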

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Finkelstein, Lev; Wan, Vincent; Clark, Rob; and Zen, Heiga, "Improving Neural Vocoder Stability Using Artificial Training Data", Technical Disclosure Commons, (June 26, 2023)
https://www.tdcommons.org/dpubs_series/5998

Technical Disclosure Commons

Defensive Publications Series

Improving Neural Vocoder Stability Using Artificial Training Data
