Abstract

Augmenting textual responses from large language models (LLMs) with relevant images can present challenges at scale, for example due to the computational resources and latency that may be associated with some integrated multimodal models or co-embedding techniques. The disclosed method can address this by first classifying a user query, for instance to determine whether it is "image-seeking." If a query is identified as image-seeking, the system can forward the query and associated metadata to multiple systems that can operate in parallel. For example, the query can be sent to an LLM for generating a text response and also to a separate image retrieval system, which may be pre-indexed, for sourcing corresponding images. The text and image outputs may then be combined into a cohesive response. This decoupled, parallel architecture can facilitate the augmentation of LLM output with images, potentially reducing training and serving costs compared to some fully multimodal approaches.
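The following is a minimal sketch of the decoupled, parallel flow described above, not the disclosed implementation. All names (is_image_seeking, generate_text, retrieve_images, AugmentedResponse) are hypothetical placeholders; the classifier, LLM call, and image retrieval are stubbed so the example is self-contained and runnable.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class AugmentedResponse:
    # Combined output: the LLM's text plus any retrieved image references.
    text: str
    images: list[str] = field(default_factory=list)


async def is_image_seeking(query: str) -> bool:
    # Placeholder classifier. A real system might use a lightweight text
    # classifier; simple keyword heuristics stand in for it here.
    return any(k in query.lower() for k in ("show", "picture", "image", "look like"))


async def generate_text(query: str) -> str:
    # Stand-in for a call to the LLM serving endpoint.
    await asyncio.sleep(0.1)
    return f"Text answer for: {query}"


async def retrieve_images(query: str, metadata: dict) -> list[str]:
    # Stand-in for a lookup against a pre-indexed image retrieval system.
    await asyncio.sleep(0.05)
    return ["https://example.com/placeholder.jpg"]


async def answer(query: str, metadata: dict | None = None) -> AugmentedResponse:
    metadata = metadata or {}
    if not await is_image_seeking(query):
        # Non-image-seeking queries go to the LLM alone.
        return AugmentedResponse(text=await generate_text(query))

    # Image-seeking queries fan out to the LLM and the image retrieval
    # system in parallel; the two outputs are combined afterwards.
    text, images = await asyncio.gather(
        generate_text(query),
        retrieve_images(query, metadata),
    )
    return AugmentedResponse(text=text, images=images)


if __name__ == "__main__":
    result = asyncio.run(answer("What does a puffin look like?"))
    print(result.text, result.images)
```

Running the LLM call and the retrieval lookup concurrently (here via asyncio.gather) illustrates why the decoupled design can keep end-to-end latency close to that of the slower of the two components rather than their sum.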

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
