Abstract
Visually impaired individuals can find it difficult to understand their surroundings as they navigate the world. In other cases, individuals may be in situations where they cannot look at their surroundings but still need to understand their context. This disclosure leverages the video understanding capabilities of large language models to assist users who may not be able to view their surroundings. Per the techniques, video captured by a user device is provided to a multimodal model along with a narration prompt that instructs the model to generate a narration of what is depicted in the video. For use cases where the user has a visual impairment, the prompt can include instructions to tailor the output accordingly. The prompt can be fine-tuned on a dataset of known videos and corresponding narrations for different audiences. The output generated by the multimodal model can be an audio narration or text that is converted into audio via text-to-speech techniques.
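The pipeline described above could be sketched roughly as follows. This is a minimal illustration only; `call_multimodal_model`, `synthesize_speech`, and the example narration prompt are hypothetical placeholders for whatever multimodal-model and text-to-speech services are actually used, and do not reflect any specific implementation from the disclosure.

```python
# Hypothetical sketch of the narration pipeline: streamed video chunks are
# sent to a multimodal model with a narration prompt, and the resulting
# text is converted to audio via text-to-speech.

NARRATION_PROMPT = (
    "Describe what is happening in this video for a listener who cannot "
    "see it. Mention spatial layout, obstacles, people, and signage. "
    "Keep each update to one or two short sentences."
)


def call_multimodal_model(prompt: str, video_chunk: bytes) -> str:
    """Placeholder: send the prompt and a video chunk to a multimodal
    large language model and return its text narration."""
    raise NotImplementedError("Replace with the actual model API call.")


def synthesize_speech(text: str) -> bytes:
    """Placeholder: convert narration text to audio using a
    text-to-speech service."""
    raise NotImplementedError("Replace with the actual TTS API call.")


def narrate_stream(video_chunks):
    """Yield spoken narration audio for a stream of captured video chunks."""
    for chunk in video_chunks:
        narration_text = call_multimodal_model(NARRATION_PROMPT, chunk)
        yield synthesize_speech(narration_text)
```

If the multimodal model can produce audio directly, the separate text-to-speech step can be dropped and the model's audio output played back as-is.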
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Weisz, Ágoston and Goodman, Andrew, "Generating Audio Narration for Streaming Video Using Large Language Model", Technical Disclosure Commons, (March 13, 2025)
https://www.tdcommons.org/dpubs_series/7900