Abstract

In remote communication, audio and visual signals are often unsynchronized. A mismatch between the visuals a listener perceives and the corresponding audio leads to speech recognition errors and reduced audio quality. This disclosure describes an architecture for visual-audio speech correction. A mouth landmark encoder, an audio encoder, and a visual encoder extract features, which an on-device visual-audio transformer maps to a target audio signal. The transformer processes signals from the visual and audio modalities simultaneously. Device latency and conferencing platform latency are integrated into the transformer encoders during a preprocessing stage. The enhanced audio signal is then decoded so that it aligns with the perceived facial expressions. This approach improves speech perception and recognition accuracy in unsynchronized virtual environments.
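The data flow described above can be illustrated with a minimal NumPy sketch. It is not the disclosed implementation: the encoders are stand-in linear projections, the transformer is reduced to one unweighted self-attention step, and every name, dimension, and the sinusoidal latency encoding (`latency_embedding`, `init_params`, `correct_audio`, `D`) is an assumption made for illustration.

```python
import numpy as np

D = 16  # shared embedding width (assumed, not from the disclosure)

def latency_embedding(latency_ms, d=D):
    """Sinusoidal encoding of a scalar latency in milliseconds (assumed scheme)."""
    freqs = 1.0 / (10000 ** (np.arange(0, d, 2) / d))
    angles = latency_ms * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def self_attention(tokens):
    """Single-head scaled dot-product self-attention with no learned weights."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

def init_params(rng, landmark_dim, audio_dim, video_dim, d=D):
    """Random stand-ins for the three encoders and the audio decoder."""
    return {
        "w_landmark": rng.normal(scale=0.1, size=(landmark_dim, d)),
        "w_audio": rng.normal(scale=0.1, size=(audio_dim, d)),
        "w_video": rng.normal(scale=0.1, size=(video_dim, d)),
        "w_decode": rng.normal(scale=0.1, size=(d, audio_dim)),
    }

def correct_audio(landmarks, audio, video, device_ms, platform_ms, params):
    """Fuse the three feature streams and decode corrected audio frames."""
    # Preprocessing: fold both latency sources into every encoder token.
    latency = latency_embedding(device_ms) + latency_embedding(platform_ms)
    tokens = np.concatenate(
        [
            landmarks @ params["w_landmark"],
            audio @ params["w_audio"],
            video @ params["w_video"],
        ],
        axis=0,
    ) + latency
    # Joint processing: each token attends over both modalities at once.
    fused = self_attention(tokens)
    # Decode only the positions that correspond to audio frames.
    start = landmarks.shape[0]
    audio_tokens = fused[start:start + audio.shape[0]]
    return audio_tokens @ params["w_decode"]
```

Given per-frame inputs of shape `(frames, feature_dim)` for each stream, `correct_audio` returns an array with the same shape as the input audio features. A real system would replace the projections with trained encoders and stack several attention layers with learned weights.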

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
