Abstract

Augmenting textual responses from large language models (LLMs) with relevant images can present challenges at scale, for example due to the computational resources and latency that may be associated with some integrated multimodal models or co-embedding techniques. The disclosed method can address this by first classifying a user query, for instance to determine whether it is "image-seeking." If a query is identified as image-seeking, the system can forward the query and associated metadata to multiple systems that can operate in parallel. For example, the query can be sent to an LLM for generating a text response and also to a separate image retrieval system, which may be pre-indexed, for sourcing corresponding images. The text and image outputs may then be combined into a cohesive response. This decoupled, parallel architecture can facilitate the augmentation of LLM output with images, potentially reducing training and serving costs compared to some fully multimodal approaches.
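The following is a minimal sketch of the decoupled, parallel flow described above, not the disclosed implementation. All names (is_image_seeking, generate_text, retrieve_images, AugmentedResponse) are hypothetical placeholders; the classifier, LLM call, and image retrieval are stubbed so the example is self-contained and runnable.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class AugmentedResponse:
    # Combined output: the LLM's text plus any retrieved image references.
    text: str
    images: list[str] = field(default_factory=list)


async def is_image_seeking(query: str) -> bool:
    # Placeholder classifier. A real system might use a lightweight text
    # classifier; simple keyword heuristics stand in for it here.
    return any(k in query.lower() for k in ("show", "picture", "image", "look like"))


async def generate_text(query: str) -> str:
    # Stand-in for a call to the LLM serving endpoint.
    await asyncio.sleep(0.1)
    return f"Text answer for: {query}"


async def retrieve_images(query: str, metadata: dict) -> list[str]:
    # Stand-in for a lookup against a pre-indexed image retrieval system.
    await asyncio.sleep(0.05)
    return ["https://example.com/placeholder.jpg"]


async def answer(query: str, metadata: dict | None = None) -> AugmentedResponse:
    metadata = metadata or {}
    if not await is_image_seeking(query):
        # Non-image-seeking queries go to the LLM alone.
        return AugmentedResponse(text=await generate_text(query))

    # Image-seeking queries fan out to the LLM and the image retrieval
    # system in parallel; the two outputs are combined afterwards.
    text, images = await asyncio.gather(
        generate_text(query),
        retrieve_images(query, metadata),
    )
    return AugmentedResponse(text=text, images=images)


if __name__ == "__main__":
    result = asyncio.run(answer("What does a puffin look like?"))
    print(result.text, result.images)
```

Running the LLM call and the retrieval lookup concurrently (here via asyncio.gather) illustrates why the decoupled design can keep end-to-end latency close to that of the slower of the two components rather than their sum.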

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
