Abstract
Some photo library applications provide a conversational interface that enables users to retrieve photos matching their query. However, current interfaces are single-turn and limited to retrieval; they cannot automatically perform generation tasks. This disclosure describes an interactive, conversational interface for a digital photo library that lets users explore their photos in natural language and receive generative, insightful answers. The exploration is multi-turn and multimodal, and is powered by a multimodal large language model (LLM). With user permission, the LLM is called in sequence to interpret both the user's query (e.g., entered as free-form text) and the content of their photos.
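The chained-call idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the disclosed implementation. `Photo`, `PhotoExplorer`, and `fake_model` are all hypothetical. The model is any callable that takes a prompt and a list of photos and returns text. Tag overlap stands in for whatever retrieval a real library would use, and a photo's caption stands in for the image content a multimodal model would actually see. The first call turns the free-form query, plus prior turns, into search terms. The second call receives the query together with the matching photos and produces the answer. Each turn is appended to a history, which is what makes the exchange multi-turn.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Photo:
    """Hypothetical photo record; a real library would hold image data too."""
    photo_id: str
    caption: str      # stand-in for the image content a multimodal model sees
    tags: set[str]


# Hypothetical model interface: (prompt, photos) -> text.
ModelFn = Callable[[str, list[Photo]], str]


@dataclass
class PhotoExplorer:
    model: ModelFn
    library: list[Photo]
    history: list[tuple[str, str]] = field(default_factory=list)

    def ask(self, query: str) -> str:
        # Prior turns are passed to every call so follow-ups keep their context.
        context = "\n".join(f"User: {q}\nAgent: {a}" for q, a in self.history)
        # Call 1: interpret the free-form query into comma-separated search terms.
        terms_text = self.model(f"{context}\nExtract search terms: {query}", [])
        terms = {t.strip().lower() for t in terms_text.split(",") if t.strip()}
        # Retrieval (assumed tag overlap): keep photos sharing any extracted term.
        matches = [p for p in self.library if p.tags & terms]
        # Call 2: give the query and the retrieved photos to the model to answer.
        answer = self.model(f"{context}\nAnswer using these photos: {query}", matches)
        self.history.append((query, answer))
        return answer


def fake_model(prompt: str, photos: list[Photo]) -> str:
    """Deterministic stand-in for an LLM, used only to exercise the chain."""
    last_line = prompt.splitlines()[-1].lower()
    if last_line.startswith("extract search terms:"):
        return ", ".join(w for w in ("beach", "dog", "snow") if w in last_line)
    return f"{len(photos)} photo(s): " + ", ".join(p.photo_id for p in photos)


library = [
    Photo("p1", "sunset on the sand", {"beach"}),
    Photo("p2", "puppy in the snow", {"dog", "snow"}),
    Photo("p3", "dog running on the beach", {"beach", "dog"}),
]
explorer = PhotoExplorer(fake_model, library)
print(explorer.ask("Show my beach photos"))  # 2 photo(s): p1, p3
print(explorer.ask("Any with snow?"))        # 1 photo(s): p2
```

Splitting interpretation and answering into separate calls keeps each prompt narrow. It also lets the retrieval step run against the library without sending every photo to the model.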
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Shin, D, "Interactive Photo Exploration Agent with Chained Language Model Calls", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/8995