Abstract
Visually impaired individuals can find it difficult to understand their surroundings as they navigate the world. In other cases, individuals may be in situations where they cannot look at their surroundings but still need to understand their context. This disclosure leverages the video understanding capabilities of large language models to assist users who may not be able to view their surroundings. Per the techniques, video captured by a user device is provided to a multimodal model along with a narration prompt that instructs the model to generate a narration of what is depicted in the video. For use cases where the user has a visual impairment, the prompt can include instructions to tailor the output accordingly. The prompt can be fine-tuned on a dataset of known videos and corresponding narrations for different audiences. The output generated by the multimodal model can be an audio narration or text that is converted into audio via text-to-speech techniques.
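The pipeline described above could be sketched roughly as follows. This is a minimal illustration only; `call_multimodal_model`, `synthesize_speech`, and the example narration prompt are hypothetical placeholders for whatever multimodal-model and text-to-speech services are actually used, and do not reflect any specific implementation from the disclosure.

```python
# Hypothetical sketch of the narration pipeline: streamed video chunks are
# sent to a multimodal model with a narration prompt, and the resulting
# text is converted to audio via text-to-speech.

NARRATION_PROMPT = (
    "Describe what is happening in this video for a listener who cannot "
    "see it. Mention spatial layout, obstacles, people, and signage. "
    "Keep each update to one or two short sentences."
)


def call_multimodal_model(prompt: str, video_chunk: bytes) -> str:
    """Placeholder: send the prompt and a video chunk to a multimodal
    large language model and return its text narration."""
    raise NotImplementedError("Replace with the actual model API call.")


def synthesize_speech(text: str) -> bytes:
    """Placeholder: convert narration text to audio using a
    text-to-speech service."""
    raise NotImplementedError("Replace with the actual TTS API call.")


def narrate_stream(video_chunks):
    """Yield spoken narration audio for a stream of captured video chunks."""
    for chunk in video_chunks:
        narration_text = call_multimodal_model(NARRATION_PROMPT, chunk)
        yield synthesize_speech(narration_text)
```

If the multimodal model can produce audio directly, the separate text-to-speech step can be dropped and the model's audio output played back as-is.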
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Weisz, Ágoston and Goodman, Andrew, "Generating Audio Narration for Streaming Video Using Large Language Model", Technical Disclosure Commons, (March 13, 2025)
https://www.tdcommons.org/dpubs_series/7900