Abstract
This disclosure describes a system for synchronizing audio and video streams in video conferencing. Network limitations often cause synchronization issues in video conferencing, resulting in a disconnect between a speaker’s lips and voice. To enhance the viewer experience, this system employs machine learning models to generate representative vectors for audio and video segments. Upon detecting network limitations, these vectors are compared, and segment timing is adjusted to achieve synchronization. The system may include an audio-video stream processing module with video and audio representation submodules, a proximity calculator, and an audio-video synchronizer. The video and audio representation submodules may employ neural network architectures, including a convolutional neural network (CNN) and a bi-directional long short-term memory (LSTM) network, that generate representative vectors. The proximity calculator calculates a proximity score between a pair of audio and video representative vectors and provides the score to the audio-video synchronizer, which adjusts the timing of audio and video segments to enhance the viewer experience. The audio-video stream processing module may also include a lip localizer to enhance video segment quality by focusing on lip movements.
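The core idea above — comparing audio and video representative vectors and shifting segment timing to maximize their proximity — can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the disclosure does not specify the proximity metric or the search strategy, so this sketch assumes cosine similarity as the proximity score and a brute-force search over candidate segment offsets; the function names (`proximity_score`, `best_audio_offset`) are hypothetical.

```python
import numpy as np

def proximity_score(audio_vec, video_vec):
    # Cosine similarity between an audio and a video representative vector.
    # (Assumed metric; the disclosure leaves the proximity measure unspecified.)
    a = audio_vec / np.linalg.norm(audio_vec)
    v = video_vec / np.linalg.norm(video_vec)
    return float(np.dot(a, v))

def best_audio_offset(audio_vecs, video_vecs, max_offset=5):
    # Brute-force search for the audio-segment offset that maximizes the
    # mean proximity score against the video segments. The synchronizer
    # would then shift audio playback by this many segments.
    best_off, best_score = 0, -np.inf
    n = min(len(audio_vecs), len(video_vecs))
    for off in range(-max_offset, max_offset + 1):
        pairs = [(audio_vecs[i + off], video_vecs[i])
                 for i in range(n) if 0 <= i + off < n]
        if not pairs:
            continue
        score = np.mean([proximity_score(a, v) for a, v in pairs])
        if score > best_score:
            best_off, best_score = off, score
    return best_off, best_score
```

In practice the representative vectors would come from the CNN/bi-directional LSTM submodules described above (here they are assumed to be given), and the synchronizer would translate the chosen segment offset into a playback-timing adjustment.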
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Shin, Dongeek, "Audio-Video Stream Synchronization Using Representative Vectors", Technical Disclosure Commons, (August 29, 2024)
https://www.tdcommons.org/dpubs_series/7314