Abstract
Generative text-to-video models often exhibit temporal disjointedness, in which high-impact visual events fail to align with their corresponding audio transients. The resulting drift reduces the perceived impact and immersion of the generated content. To mitigate this, a reinforcement learning framework uses a signal-based reward mechanism to synchronize visual motion with audio energy. By shifting focus from high-level semantic guidance to fine-grained signal correlation, the system enables precise temporal alignment in synthetic video production.
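The disclosure does not specify the reward formula, but the idea of a transient-locking reward can be sketched as follows: detect peaks in the audio energy envelope, detect peaks in a frame-wise visual motion signal, and reward the generator in proportion to how closely each audio transient is matched by a nearby motion peak. All names, thresholds, and the linear falloff below are illustrative assumptions, not the published method.

```python
def find_peaks(signal, threshold):
    # Indices that exceed the threshold and are local maxima
    # (hypothetical peak picker; a real system might use an onset detector).
    return [i for i in range(1, len(signal) - 1)
            if signal[i] > threshold
            and signal[i] >= signal[i - 1]
            and signal[i] > signal[i + 1]]

def transient_locking_reward(audio_energy, motion_energy,
                             threshold=0.5, tolerance=3):
    """Reward in [0, 1]: mean closeness of each audio transient to the
    nearest visual motion peak, with linear falloff over `tolerance` frames.
    `audio_energy` and `motion_energy` are assumed to be per-frame values
    sampled on the same time grid."""
    audio_peaks = find_peaks(audio_energy, threshold)
    motion_peaks = find_peaks(motion_energy, threshold)
    if not audio_peaks:
        return 1.0  # no transients to lock onto
    if not motion_peaks:
        return 0.0  # transients present but no visual motion at all
    total = 0.0
    for a in audio_peaks:
        offset = min(abs(a - m) for m in motion_peaks)
        total += max(0.0, 1.0 - offset / tolerance)
    return total / len(audio_peaks)

# Perfectly locked signals score 1.0; a motion peak two frames late scores lower.
aligned = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
late = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
print(transient_locking_reward(aligned, aligned))  # → 1.0
```

In an RL fine-tuning loop, this scalar would serve as (part of) the reward signal for each generated clip, pushing the policy toward videos whose motion spikes coincide with audio energy spikes.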
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Weisz, Ágoston, "Reinforcement Learning With a Transient Locking Reward for Audio-Visual Synchronization", Technical Disclosure Commons, (March 30, 2026)
https://www.tdcommons.org/dpubs_series/9656