Video and audio conferencing hardware and software can implement noise filters to remove non-speech background noise captured when a participant is in a noisy environment. However, even with such filters in place, the audio signal can still contain unwanted background speech from parties co-located with the speaker. The quality of a speaker's audio can also be degraded by hardware and/or software issues that noise filters cannot address. This disclosure describes the use of generative voice models to clean up degraded audio of a user's speech. Recovery of the user's speech is based on a dynamically updated ambient characterization of that speech. With user permission, speaker embeddings are obtained by automatically segmenting conversations to identify the portions that contain a particular user's speech, without requiring the user to engage in a long calibration session. The noisy acoustic signal containing the user's speech and a rolling average of speaker embeddings characterizing the user's typical speech are provided as input to a suitable vocoder-type neural network to obtain clean audio of the speaker's original speech.
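The pipeline described above can be sketched in code. The following is a minimal illustration, not the disclosed implementation: the `segmenter`, `embedder`, and `vocoder` components are hypothetical stand-ins for a diarization model, a speaker-embedding model, and a generative vocoder-type network, and the exponential moving average is one plausible way to maintain the rolling speaker profile.

```python
import numpy as np


def update_speaker_embedding(avg, new, alpha=0.1):
    """Maintain a rolling average of speaker embeddings via an
    exponential moving average (one possible realization of the
    'rolling average' described in the disclosure)."""
    if avg is None:
        return new
    return (1 - alpha) * avg + alpha * new


class SpeechRecoveryPipeline:
    """Hypothetical sketch of the described flow: characterize the
    user's speech from segmented conversation audio, then condition a
    generative vocoder-type model on that profile to recover clean
    speech from a noisy signal."""

    def __init__(self, segmenter, embedder, vocoder):
        self.segmenter = segmenter  # yields segments attributed to the target user
        self.embedder = embedder    # maps a speech segment to a fixed-size vector
        self.vocoder = vocoder      # (noisy audio, speaker embedding) -> clean audio
        self.avg_embedding = None   # rolling speaker profile

    def observe(self, audio):
        # With user permission, update the rolling speaker profile from
        # conversation segments identified as the target user's speech.
        for segment in self.segmenter(audio):
            e = self.embedder(segment)
            self.avg_embedding = update_speaker_embedding(self.avg_embedding, e)

    def recover(self, noisy_audio):
        # Feed the noisy signal and the rolling embedding to the
        # generative model to obtain cleaned-up speech.
        return self.vocoder(noisy_audio, self.avg_embedding)
```

In a real system the three components would be neural models; here they can be exercised with simple stand-in functions, which is enough to show how the rolling profile accumulates across calls to `observe` before `recover` is invoked.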
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Shin, D., "Recovering Clean Speech from Noisy Audio Input via Ambient Speech Characterization," Technical Disclosure Commons (July 13, 2023).