Abstract

This disclosure describes techniques that use generative artificial intelligence (AI) to compress speech efficiently and resynthesize it accurately. Compression relies on personalized adaptation and incurs low perceptual loss. A raw audio recording of a user is transcribed to text and is also used to fine-tune a generative AI speech synthesis model, producing a personalized model of the user’s speech. A low-fidelity, quantized representation of information such as prosody, amplitude, and inflection is also generated. The transcript, the quantized audio data, and the personalized, fine-tuned model are compressed and saved. To reproduce the original raw audio, the compressed data is provided to a multimodal speech-synthesis model.
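
The storage step can be illustrated with a minimal Python sketch. It is not the disclosure's implementation: the function names, the 16-level uniform quantizer, the JSON layout, the use of zlib, and the `model_id` string standing in for the saved fine-tuned model are all assumptions made for illustration. Transcription, fine-tuning, and synthesis are outside the sketch.

```python
import json
import zlib


def quantize(values, levels=16):
    """Uniformly map floats to integer codes in [0, levels - 1] (assumed scheme)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat contour
    codes = [round((v - lo) / span * (levels - 1)) for v in values]
    return codes, lo, hi


def dequantize(codes, lo, hi, levels=16):
    """Recover approximate values from the integer codes."""
    return [lo + c / (levels - 1) * (hi - lo) for c in codes]


def pack(transcript, pitch_hz, model_id):
    """Bundle transcript, quantized prosody, and a model reference, then compress.

    model_id is a hypothetical placeholder for the personalized model.
    """
    codes, lo, hi = quantize(pitch_hz)
    payload = {
        "transcript": transcript,
        "pitch": {"codes": codes, "lo": lo, "hi": hi},
        "model_id": model_id,
    }
    return zlib.compress(json.dumps(payload).encode("utf-8"))


def unpack(blob):
    """Decompress and rebuild the approximate pitch contour for a synthesizer."""
    payload = json.loads(zlib.decompress(blob).decode("utf-8"))
    p = payload["pitch"]
    payload["pitch_hz"] = dequantize(p["codes"], p["lo"], p["hi"])
    return payload


if __name__ == "__main__":
    blob = pack("hello world", [110.0, 120.0, 180.0, 150.0], "user-123-voice")
    restored = unpack(blob)
    print(restored["transcript"], restored["pitch_hz"])
```

In this sketch the per-frame pitch values lose precision by design: each restored value differs from the original by at most half a quantization step, which mirrors the low-fidelity representation the abstract describes.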

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
