A text-to-speech (TTS) converter typically comprises a prosodic model, which generates acoustic parameters from linguistic features, paired with a neural vocoder. In such a configuration, some feature values can be difficult for the neural vocoder to process, resulting in audio artifacts. This disclosure describes techniques to improve neural vocoder performance, e.g., to reduce audio artifacts, to make the vocoder more robust to unusual variations in acoustic features, and to make it more forgiving of errors made by the feature generator. The techniques use an auxiliary training path driven by synthetic training examples, which are generated by running CHiVE inference with random samples drawn far enough from the mean (zero).
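As a rough illustration of what sampling "far enough from the mean (zero)" could look like, the sketch below draws vectors from a standard normal distribution and keeps only those whose root-mean-square magnitude exceeds a threshold. The standard normal prior, the RMS criterion, the threshold value, and the names `sample_far_latents` and `min_rms` are all assumptions made for this example; the disclosure does not specify them, and the step that feeds each vector to CHiVE inference is left out because its interface is not described here.

```python
import math
import random


def sample_far_latents(rng, num_samples, dim, min_rms=1.5, max_tries=1_000_000):
    """Rejection-sample vectors that lie far from the zero mean.

    Hypothetical helper: assumes a standard normal prior and treats a
    vector as "far" when its RMS magnitude is at least `min_rms`.
    """
    kept = []
    for _ in range(max_tries):
        # Draw one candidate vector from N(0, I).
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        rms = math.sqrt(sum(v * v for v in z) / dim)
        # Keep only candidates that are far enough from the mean.
        if rms >= min_rms:
            kept.append(z)
            if len(kept) == num_samples:
                return kept
    raise RuntimeError("too few samples accepted; lower min_rms or raise max_tries")


rng = random.Random(0)
far_latents = sample_far_latents(rng, num_samples=16, dim=8)
```

In a setup like the one described, each accepted vector would drive one inference pass of the prosodic model, and the resulting acoustic features would serve as synthetic inputs for the auxiliary training path. Rejection sampling keeps the shape of the tail of the assumed prior; rescaling ordinary samples to a chosen magnitude would be a cheaper alternative.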

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.