Abstract
Standard audio description tracks are manually scripted and pre-recorded, and they exist for only a small fraction of premium media content, which leaves vast amounts of uncurated media inaccessible to visually impaired users. To address this limitation, the disclosed technology describes a process for generating real-time audio descriptions for media content. An on-device vision language model analyzes video frames locally during silence windows, or gaps, identified within the media content. A text summary of the visual action is generated and then converted to speech. The playback speed of the synthesized narration is dynamically adjusted so that it fits within the available silence window. This improves accessibility for varied media content that lacks pre-recorded tracks, and visual descriptions are delivered without disrupting the original dialogue or immersive audio elements.
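The two timing steps named in the abstract, identifying silence windows and fitting the narration into one, can be illustrated with a minimal Python sketch. It is not the disclosed implementation. The abstract does not say how silence is detected or how far speech may be sped up, so the per-frame loudness input, the names `SilenceWindow`, `find_silence_windows`, and `fit_playback_rate`, and the values of `threshold`, `min_duration`, and `max_rate` are all assumptions made for illustration. The vision language model and text-to-speech stages are omitted.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class SilenceWindow:
    """A gap in the original audio, in seconds (hypothetical helper type)."""
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def find_silence_windows(
    levels: Sequence[float],
    frame_sec: float = 0.5,      # assumed length of one loudness frame
    threshold: float = 0.02,     # assumed loudness below which a frame is "quiet"
    min_duration: float = 1.5,   # assumed shortest gap worth narrating into
) -> List[SilenceWindow]:
    """Group consecutive quiet frames into windows of at least min_duration."""
    windows: List[SilenceWindow] = []
    run_start: Optional[int] = None
    # A sentinel loud frame closes a quiet run that reaches the end of the track.
    for i, level in enumerate(list(levels) + [float("inf")]):
        if level < threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            window = SilenceWindow(run_start * frame_sec, i * frame_sec)
            if window.duration >= min_duration:
                windows.append(window)
            run_start = None
    return windows


def fit_playback_rate(
    narration_sec: float,
    window_sec: float,
    max_rate: float = 1.5,       # assumed cap to keep speech intelligible
) -> Optional[float]:
    """Return the playback rate (1.0 = normal speed) that makes the narration
    fit the window, or None if that would require exceeding max_rate."""
    if narration_sec <= 0 or window_sec <= 0:
        return None
    # Never slow speech down; only speed it up when it would overrun the gap.
    rate = max(1.0, narration_sec / window_sec)
    return rate if rate <= max_rate else None
```

In this sketch, a narration that cannot fit even at `max_rate` yields `None`, so a caller could ask for a shorter summary or skip the window instead of talking over the original dialogue. That fallback is a design choice of the sketch, not something the abstract specifies.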
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Agarwal, Nikita, "Real-Time Generative Audio Description Insertion Using Vision Language Analysis", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10433