Abstract

Standard audio description tracks are manually scripted and pre-recorded, so they exist for only a small fraction of premium media content, leaving vast amounts of uncurated media inaccessible to visually impaired users. To address this limitation, the disclosed technology generates real-time audio descriptions for media content. An on-device vision language model analyzes video frames locally during silence windows, or gaps, identified within the media content. A text summary of the visual action is generated and then converted to speech. The playback speed of the synthesized narration is dynamically adjusted so that it fits within the available silence window. This makes media content that lacks pre-recorded tracks more accessible, and the visual descriptions are delivered without disrupting original dialogue or immersive audio elements.
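
The gap-finding and speed-fitting steps described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation. Every name in it (`SilenceWindow`, `find_silence_windows`, `fit_playback_rate`) is hypothetical, and so are the parameters: the per-frame loudness input in dB, the -40 dB threshold, the 1.5 s minimum gap, and the 1.5x speed cap. The sketch also assumes narration is never slowed below normal speed and is skipped when it cannot fit; the abstract does not state either behavior.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class SilenceWindow:
    """A gap in the original audio, in seconds from the start of the media."""
    start_s: float
    end_s: float

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def find_silence_windows(levels_db: Sequence[float], frame_s: float,
                         threshold_db: float = -40.0,
                         min_duration_s: float = 1.5) -> List[SilenceWindow]:
    """Group consecutive quiet audio frames into windows long enough to narrate in.

    levels_db holds one loudness value per audio frame of length frame_s.
    The threshold and minimum duration are assumed values, not from the source.
    """
    windows: List[SilenceWindow] = []
    run_start: Optional[int] = None
    # A loud sentinel frame closes any quiet run that reaches the last frame.
    for i, level in enumerate(list(levels_db) + [float("inf")]):
        if level < threshold_db:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            window = SilenceWindow(run_start * frame_s, i * frame_s)
            if window.duration_s >= min_duration_s:
                windows.append(window)
            run_start = None
    return windows


def fit_playback_rate(narration_s: float, window_s: float,
                      max_rate: float = 1.5) -> Optional[float]:
    """Return the speed factor that makes the narration fit the window.

    Returns None when the required factor exceeds max_rate (an assumed
    intelligibility cap), signalling that the description should be
    shortened or skipped.
    """
    if narration_s <= 0 or window_s <= 0:
        return None
    rate = max(1.0, narration_s / window_s)  # assumption: never slow down
    return rate if rate <= max_rate else None
```

For example, a 2.5 s synthesized narration and a 2.0 s silence window give a rate of 1.25, while a 4.0 s narration in the same window would need 2.0x and is rejected under the assumed cap.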

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Agarwal, Nikita, "Real-Time Generative Audio Description Insertion Using Vision Language Analysis", Technical Disclosure Commons, (June 14, 2026)
https://www.tdcommons.org/dpubs_series/10433

Technical Disclosure Commons

Defensive Publications Series

Real-Time Generative Audio Description Insertion Using Vision Language Analysis