Abstract

Chatbot interfaces powered by multimodal large language models (LLMs) can enable users to submit queries that include both images and text. Image pre-processing analyzes the image in such a query to confirm that the query is valid, to extract text and entities from the image, and to tokenize the image. These operations introduce a delay in generating a response to multimodal queries compared to text-only queries. This disclosure describes the use of preemptive image analysis, with user permission, to reduce the time to first token and thereby the overall latency of an LLM's response to a multimodal query that includes an image and text. Per techniques described herein, image analysis (validation, text and entity extraction, and image tokenization) is initiated as soon as the user adds an image to the chatbot interface (e.g., uploads an image), even before the user has initiated text entry. Additionally, validation, text and entity extraction, and image tokenization can be performed in parallel to further reduce latency and improve the user experience of interacting with the chatbot.
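The following is a minimal sketch, not the disclosed implementation, of how preemptive, parallel image analysis could be structured. All names (PreemptiveImageAnalyzer, validate_image, extract_text_and_entities, tokenize_image) are hypothetical, and the analysis steps are stubs that only simulate latency; a real system would call the corresponding validation, OCR/entity-extraction, and tokenization services.

```python
import asyncio

# Hypothetical analysis steps; each stands in for a real service call
# and simulates its latency with a sleep.
async def validate_image(image: bytes) -> bool:
    await asyncio.sleep(0.2)  # stand-in for a validity/safety check
    return len(image) > 0

async def extract_text_and_entities(image: bytes) -> dict:
    await asyncio.sleep(0.4)  # stand-in for OCR and entity extraction
    return {"text": "...", "entities": []}

async def tokenize_image(image: bytes) -> list:
    await asyncio.sleep(0.3)  # stand-in for image tokenization
    return [101, 102, 103]

class PreemptiveImageAnalyzer:
    """Starts image analysis as soon as the user attaches an image,
    before any text is entered, running the three steps in parallel."""

    def __init__(self) -> None:
        self._analysis = None

    def on_image_uploaded(self, image: bytes) -> None:
        # Kick off validation, extraction, and tokenization concurrently
        # the moment the image is added to the chat interface.
        self._analysis = asyncio.gather(
            validate_image(image),
            extract_text_and_entities(image),
            tokenize_image(image),
        )

    async def on_text_submitted(self, text: str) -> dict:
        # By the time the user finishes typing, the image analysis has
        # typically completed, so awaiting it adds little or no delay.
        is_valid, extracted, image_tokens = await self._analysis
        return {
            "valid": is_valid,
            "extracted": extracted,
            "image_tokens": image_tokens,
            "text": text,
        }

async def main() -> None:
    analyzer = PreemptiveImageAnalyzer()
    analyzer.on_image_uploaded(b"<image bytes>")  # user attaches an image
    await asyncio.sleep(0.5)                      # user types their question
    result = await analyzer.on_text_submitted("What does this image show?")
    print(result["valid"], len(result["image_tokens"]))

asyncio.run(main())
```

In this sketch, the time the user spends typing overlaps with the image analysis, so the multimodal query can be forwarded to the LLM almost as quickly as a text-only query.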

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
