Abstract
Automatically curated photo stories generated by photo library applications are popular. However, such stories currently do not include audio or text that enhances the story. This disclosure describes the use of a visual language model (VLM), a multimodal model, to generate a text and/or audio description that provides context for the photographs included in a photo story. The VLM is provided with a system-generated prompt to generate story text that is semantically consistent across a set of selected photographs. The selected photographs are processed by a vision tokenizer, and the resulting tokens are fed into the VLM context window. A live user prompt describing the type of story the user wants from the set of photos is also provided to the VLM. Based on the prompts and the tokens, the VLM's auto-regressive text completion outputs a text description that effectively summarizes the serial context embedded in the selected photos. Optionally, the text description can be converted to audio. The text description and/or audio can be added to the photo story to provide a rich audiovisual experience for the user.
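The flow described above (vision tokenizer output plus system and user prompts assembled into the VLM context window) can be sketched as follows. This is a minimal illustrative sketch, not an implementation from the disclosure: the tokenizer, the token format, and both prompt strings are assumptions chosen for clarity.

```python
# Hypothetical sketch of the photo-story pipeline. The vision tokenizer,
# token format, and prompts below are illustrative assumptions.

def vision_tokenize(photo: str) -> list[str]:
    # Stand-in for a real vision tokenizer: in practice, each photo
    # would be encoded into a sequence of visual tokens/embeddings.
    return [f"<img:{photo}:{i}>" for i in range(4)]

def build_vlm_context(system_prompt: str, user_prompt: str,
                      photos: list[str]) -> list[str]:
    """Assemble the VLM context window: the system-generated prompt,
    followed by the visual tokens for each selected photo, followed by
    the live user prompt."""
    context = [system_prompt]
    for photo in photos:
        context.extend(vision_tokenize(photo))
    context.append(user_prompt)
    return context

photos = ["beach.jpg", "sunset.jpg", "bonfire.jpg"]
context = build_vlm_context(
    "Write a story that is semantically consistent across all photos.",
    "Tell a cheerful story about our beach day.",
    photos,
)
# The assembled context would then be fed to the VLM for auto-regressive
# text completion; the resulting text description can optionally be
# passed to a text-to-speech engine to produce audio.
```

The ordering (system prompt, then photo tokens, then user prompt) follows the sequence described in the abstract; a production system would replace the stand-in tokenizer with the model's actual vision encoder.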
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Shin, D., "Generating Text Description for a Photo Story Using a Visual Language Model", Technical Disclosure Commons, (November 21, 2024)
https://www.tdcommons.org/dpubs_series/7580