Abstract

The standard approach to overlaying text-to-speech (TTS) on background audio is to attenuate the volume of the entire background track while the TTS plays. Because the attenuation is applied uniformly, it degrades the user experience by stripping the rhythmic energy from the background audio along with everything else. This disclosure describes an audio-mixing architecture that combines real-time audio source separation with dynamic spatial manipulation. Instead of attenuating the master volume, the incoming stereo audio is separated into distinct stems (e.g., vocals, drums). When a system audio event (such as a voice assistant speaking) is triggered, a mixing matrix is applied as follows. The vocal stem is ducked to reduce frequency masking of the inserted voice. The bass stem is boosted to maintain the rhythmic energy of the track. The stereo width of the remaining instrumental stems is increased. This pushes the background music to the edges of the spatial soundscape, carving out a clear acoustic pocket in the center that accommodates the inserted voice. When the event ends, the mix transitions seamlessly back to the original.
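The mixing matrix described above can be sketched per stereo sample as follows. This is a minimal illustration, not the disclosed implementation: the stem names, the gain and width values in `DUCK_MIX`, the mid/side widening method, and the linear `amount` crossfade are all assumptions chosen for the example, and the source separation step itself is assumed to have already produced the stems.

```python
# Hypothetical per-stem settings used while the voice plays.
# All names and numbers here are illustrative assumptions.
DUCK_MIX = {
    "vocals": {"gain_db": -18.0, "width": 1.0},  # ducked to reduce masking
    "bass":   {"gain_db": 3.0,   "width": 1.0},  # boosted to keep rhythmic energy
    "drums":  {"gain_db": 0.0,   "width": 1.6},  # widened toward the edges
    "other":  {"gain_db": 0.0,   "width": 1.6},  # widened toward the edges
}


def db_to_linear(db):
    """Convert a gain in decibels to a linear amplitude factor."""
    return 10.0 ** (db / 20.0)


def widen(left, right, width):
    """Scale the side (L-R) component; width=1.0 leaves the signal unchanged."""
    mid = 0.5 * (left + right)
    side = 0.5 * (left - right) * width
    return mid + side, mid - side


def mix_frame(stems, tts, amount):
    """Mix one stereo sample.

    stems:  dict mapping stem name to a (left, right) sample pair
    tts:    mono voice sample, placed in the center
    amount: 0.0 = original mix, 1.0 = fully ducked mix; ramping this
            value over time gives the transition in and out
    """
    out_l = 0.0
    out_r = 0.0
    for name, (left, right) in stems.items():
        cfg = DUCK_MIX.get(name, {"gain_db": 0.0, "width": 1.0})
        # Interpolate linearly between the original and the ducked settings.
        gain = 1.0 + amount * (db_to_linear(cfg["gain_db"]) - 1.0)
        width = 1.0 + amount * (cfg["width"] - 1.0)
        left, right = widen(left, right, width)
        out_l += gain * left
        out_r += gain * right
    # The voice is added equally to both channels, i.e. at the center.
    return out_l + tts, out_r + tts
```

With `amount` at 0.0 the function returns the plain sum of the stems, so ramping `amount` from 0.0 to 1.0 when the voice starts and back to 0.0 when it ends is one way to approximate the transition to and from the original mix. Content that is identical in both channels has no side component, so widening leaves it in place and only pushes the already-stereo content outward.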

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
