Abstract

Whispered speech presents challenges in voice communication because it lacks the vocal cord vibration, fundamental frequency, and harmonic structure found in voiced speech. This results in reduced intelligibility and an unnatural sound during transmission. To address these limitations, a generative spectral mapping method is disclosed. The method utilizes a deep neural network to map the formant structure of whispered audio to a reconstructed harmonic structure. Missing pitch information is inferred from intensity dynamics and semantic context, while speaker identity is maintained through conditioning on a speaker embedding vector. A frame-based generative vocoder processes audio in small segments to allow for real-time conversion. This technology enables the transformation of whispered input into fully voiced speech, improving privacy and clarity in shared environments without requiring specialized hardware sensors.
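The frame-based pipeline described above can be sketched in a few lines. The sketch below is illustrative only: `spectral_map` is a hypothetical placeholder for the disclosed deep neural network (which would also consume intensity dynamics and semantic context to infer pitch), and the frame size, hop size, and embedding dimension are assumptions, not values from the disclosure. What it does show accurately is the frame-based structure that enables real-time conversion: windowed segments are transformed independently and reconstructed by overlap-add.

```python
import numpy as np

FRAME = 256  # samples per frame (assumed value)
HOP = 128    # hop size for 50% overlap (assumed value)

def spectral_map(mag, speaker_emb):
    """Placeholder for the disclosed DNN: maps a whispered-speech
    magnitude spectrum, conditioned on a speaker embedding, to a
    harmonically structured spectrum. Here a trivial gain stands in
    for the learned model."""
    gain = 1.0 + 0.1 * np.tanh(speaker_emb.mean())
    return mag * gain

def convert(whisper, speaker_emb):
    """Frame-based conversion: process small windowed segments so the
    system can run in a streaming, real-time fashion, then reconstruct
    the output with windowed overlap-add."""
    window = np.hanning(FRAME)
    out = np.zeros(len(whisper))
    norm = np.zeros(len(whisper))
    for start in range(0, len(whisper) - FRAME + 1, HOP):
        frame = whisper[start:start + FRAME] * window
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec), np.angle(spec)
        mag = spectral_map(mag, speaker_emb)       # model inference per frame
        voiced = np.fft.irfft(mag * np.exp(1j * phase), FRAME)
        out[start:start + FRAME] += voiced * window
        norm[start:start + FRAME] += window ** 2
    return out / np.maximum(norm, 1e-8)            # overlap-add normalization
```

Because each frame depends only on the current segment and a fixed speaker embedding, latency is bounded by the frame length plus model inference time, which is what makes real-time operation feasible.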

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
