Abstract

Audio and video components of a video can fall out of alignment for a variety of reasons, e.g., because the streams are recorded separately or due to encoding errors. The viewing experience is degraded when audio and video are out of sync, and manual synchronization is expensive. This disclosure describes the use of a machine learning model to determine synchronization information for audio and video streams that are to be synchronized. The audio and video streams are provided as input to a multimodal model, e.g., a large language model that can handle both audio and video. The model is tasked with outputting the synchronization information, e.g., a set of timestamps in the different streams that are to be lined up for synchronization. The multimodal model performs this task via audio and video understanding, picking out specific landmark points in the audio and video streams that correspond to each other. Synchronization information from the multimodal model can be provided to video editing tools or playback software to perform the necessary actions, e.g., trimming and offsetting the streams, to produce a synchronized video.
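The sketch below illustrates one way such a pipeline might be wired together. It is illustrative only: the model call (call_multimodal_model), the prompt wording, and the JSON landmark format are assumptions rather than part of the disclosure, and ffmpeg's -itsoffset option is used here as just one possible mechanism for applying the resulting offset.

    import json
    import subprocess

    PROMPT = (
        "Find one landmark event that is both visible in the video and "
        "audible in the audio (e.g., a clap or a door closing). Reply with "
        'JSON only: {"video_ts": <seconds>, "audio_ts": <seconds>}'
    )

    def call_multimodal_model(prompt: str, video: str, audio: str) -> str:
        # Hypothetical stand-in: replace with a call to any multimodal LLM
        # API that accepts audio and video inputs alongside a text prompt.
        raise NotImplementedError("plug in a multimodal model here")

    def sync_offset_seconds(video_path: str, audio_path: str) -> float:
        """Return how many seconds the audio landmark leads (+) or lags (-)
        the corresponding video landmark, per the model's output."""
        reply = call_multimodal_model(PROMPT, video_path, audio_path)
        landmarks = json.loads(reply)
        return landmarks["audio_ts"] - landmarks["video_ts"]

    def write_synchronized(video_path: str, audio_path: str,
                           out_path: str) -> None:
        """Remux the streams with the audio shifted by the derived offset."""
        offset = sync_offset_seconds(video_path, audio_path)
        subprocess.run(
            ["ffmpeg",
             "-i", video_path,                # input 0: video stream
             "-itsoffset", f"{-offset:.3f}",  # shift the next input's timestamps
             "-i", audio_path,                # input 1: audio stream
             "-map", "0:v:0", "-map", "1:a:0",
             "-c", "copy",                    # remux without re-encoding
             out_path],
            check=True,
        )

A playback application could instead consume the offset directly and delay one stream at render time rather than rewriting the file.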

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
