Abstract
Whispered speech is difficult to use in voice communication because it lacks the vocal cord vibration, and therefore the fundamental frequency and harmonic structure, found in voiced speech. As a result, it is less intelligible and sounds unnatural when transmitted. To address these limitations, a generative spectral mapping method is disclosed. The method uses a deep neural network to map the formant structure of whispered audio to a reconstructed harmonic structure. The missing pitch information is inferred from intensity dynamics and semantic context, and speaker identity is preserved by conditioning on a speaker embedding vector. A frame-based generative vocoder processes the audio in small segments so that conversion can run in real time. The method transforms whispered input into fully voiced speech, improving privacy and clarity in shared environments without requiring specialized hardware sensors.
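The frame-based processing mentioned in the abstract can be illustrated with a minimal sketch. It is not the disclosed implementation: the function names (`hann`, `stream_convert`), the 16 kHz sample rate, the 320-sample frame, and the 160-sample hop are all assumptions chosen for illustration, and `convert_frame` is a hypothetical stand-in for the neural mapping and vocoder. The sketch only shows how audio might be cut into short overlapping frames, converted one frame at a time, and reassembled by overlap-add.

```python
import math
from typing import Callable, List, Sequence

SAMPLE_RATE = 16_000   # assumed sample rate in Hz (not stated in the disclosure)
FRAME_LEN = 320        # assumed frame length: 20 ms at 16 kHz
HOP = FRAME_LEN // 2   # assumed 50% overlap between consecutive frames


def hann(n: int) -> List[float]:
    """Periodic Hann window; copies shifted by n/2 sum to exactly 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def stream_convert(
    samples: Sequence[float],
    convert_frame: Callable[[List[float]], List[float]],
    frame_len: int = FRAME_LEN,
    hop: int = HOP,
) -> List[float]:
    """Window each frame, convert it, and overlap-add the results.

    convert_frame is a hypothetical placeholder for the whisper-to-voiced
    model; it must return a frame of the same length as its input.
    """
    window = hann(frame_len)
    out = [0.0] * len(samples)
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [s * w for s, w in zip(samples[start:start + frame_len], window)]
        converted = convert_frame(frame)
        for i, value in enumerate(converted):
            out[start + i] += value
    return out


# Buffering delay contributed by one frame, before any model compute time.
frame_latency_ms = 1000.0 * FRAME_LEN / SAMPLE_RATE
```

With an identity function in place of the model, samples covered by two overlapping frames are reconstructed unchanged, which is a convenient sanity check before a real model is plugged in. Under the assumed settings, each frame adds 20 ms of buffering delay, which is why short frames matter for real-time use.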
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Yakar, Tamar and Labzovsky, Ilia, "Real-Time Whisper-to-Voiced Speech Conversion using Generative Spectral Mapping", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/9590