Inventor(s)

D Shin

Abstract

Image captioning models receive an image as input and generate captions that describe objects, scenes, and other visual aspects of the image. However, such models are not personalized to a user and cannot generate captions tailored to specific user intent or context. This disclosure describes techniques to generate personalized image captions using a large language model (LLM). With user permission, user intent data such as context, preferences, notes, or tags (e.g., stored in association with user images) is provided as input to an LLM to refine image captions generated by a vision-to-caption extractor. The LLM is given a prompt that includes the user intent data and is tasked with refining the captions generated by the vision-to-caption extractor. This configuration, in which the LLM performs text refinement to obtain personalized captions, enables personalization without retraining the vision-to-caption extractor (or any existing captioning model). The output of the LLM is image descriptions that are more relevant and meaningful to the user.
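The disclosure does not provide an implementation; the following Python sketch illustrates the described pipeline under stated assumptions. The extract_base_caption and call_llm callables are hypothetical placeholders for a vision-to-caption extractor and an LLM API, and the UserIntent schema is an assumed structure for the user-permitted intent data; none of these names come from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class UserIntent:
    """User-permitted intent data stored with an image (assumed schema)."""
    context: str = ""
    preferences: List[str] = field(default_factory=list)
    notes: str = ""
    tags: List[str] = field(default_factory=list)


def build_refinement_prompt(base_caption: str, intent: UserIntent) -> str:
    """Assemble a prompt asking the LLM to refine a generic caption
    using the user intent data, per the disclosed configuration."""
    return (
        "Refine the following image caption so it is personalized to the user.\n"
        f"Base caption: {base_caption}\n"
        f"User context: {intent.context}\n"
        f"User preferences: {', '.join(intent.preferences)}\n"
        f"User notes: {intent.notes}\n"
        f"User tags: {', '.join(intent.tags)}\n"
        "Return a single refined caption."
    )


def personalized_caption(
    image_bytes: bytes,
    intent: UserIntent,
    extract_base_caption: Callable[[bytes], str],
    call_llm: Callable[[str], str],
) -> str:
    """End-to-end flow: the frozen vision-to-caption extractor produces a
    generic caption; the LLM refines it with user intent data. Neither
    model is retrained."""
    base_caption = extract_base_caption(image_bytes)
    prompt = build_refinement_prompt(base_caption, intent)
    return call_llm(prompt)


if __name__ == "__main__":
    # Demo with stub models (illustrative stand-ins, not real APIs).
    demo_intent = UserIntent(
        context="beach vacation with family",
        preferences=["casual tone"],
        tags=["Rex", "sunset"],
    )
    print(personalized_caption(
        b"<image bytes>",
        demo_intent,
        extract_base_caption=lambda img: "a dog running on a beach",
        call_llm=lambda p: "Rex on a sunset run during our family beach trip",
    ))
```

Injecting the two models as callables mirrors the key point of the disclosure: personalization happens purely through text refinement in the prompt, so the sketch works with any captioning model and any LLM without retraining either.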

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
