Abstract

Chatbot interfaces powered by multimodal large language models (LLMs) can enable users to submit queries that include both images and text. Image pre-processing analyzes the image in such a query to confirm that the query is valid, to extract text and entities from the image, and to tokenize the image. These operations introduce a delay in generating a response to multimodal queries compared to text-only queries. This disclosure describes the use of preemptive image analysis, with user permission, to reduce the time to first token and thereby the overall latency of an LLM's response to a multimodal query that includes an image and text. Per techniques described herein, image analysis (validation, text and entity extraction, and image tokenization) is initiated as soon as the user adds an image to the chatbot interface (e.g., uploads an image), even before the user has initiated text entry. Additionally, validation, text and entity extraction, and image tokenization can be performed in parallel to further reduce latency and improve the user experience of interacting with the chatbot.
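The following is a minimal sketch, not the disclosed implementation, of how preemptive, parallel image analysis could be structured. All names (PreemptiveImageAnalyzer, validate_image, extract_text_and_entities, tokenize_image) are hypothetical, and the analysis steps are stubs that only simulate latency; a real system would call the corresponding validation, OCR/entity-extraction, and tokenization services.

```python
import asyncio

# Hypothetical analysis steps; each stands in for a real service call
# and simulates its latency with a sleep.
async def validate_image(image: bytes) -> bool:
    await asyncio.sleep(0.2)  # stand-in for a validity/safety check
    return len(image) > 0

async def extract_text_and_entities(image: bytes) -> dict:
    await asyncio.sleep(0.4)  # stand-in for OCR and entity extraction
    return {"text": "...", "entities": []}

async def tokenize_image(image: bytes) -> list:
    await asyncio.sleep(0.3)  # stand-in for image tokenization
    return [101, 102, 103]

class PreemptiveImageAnalyzer:
    """Starts image analysis as soon as the user attaches an image,
    before any text is entered, running the three steps in parallel."""

    def __init__(self) -> None:
        self._analysis = None

    def on_image_uploaded(self, image: bytes) -> None:
        # Kick off validation, extraction, and tokenization concurrently
        # the moment the image is added to the chat interface.
        self._analysis = asyncio.gather(
            validate_image(image),
            extract_text_and_entities(image),
            tokenize_image(image),
        )

    async def on_text_submitted(self, text: str) -> dict:
        # By the time the user finishes typing, the image analysis has
        # typically completed, so awaiting it adds little or no delay.
        is_valid, extracted, image_tokens = await self._analysis
        return {
            "valid": is_valid,
            "extracted": extracted,
            "image_tokens": image_tokens,
            "text": text,
        }

async def main() -> None:
    analyzer = PreemptiveImageAnalyzer()
    analyzer.on_image_uploaded(b"<image bytes>")  # user attaches an image
    await asyncio.sleep(0.5)                      # user types their question
    result = await analyzer.on_text_submitted("What does this image show?")
    print(result["valid"], len(result["image_tokens"]))

asyncio.run(main())
```

In this sketch, the time the user spends typing overlaps with the image analysis, so the multimodal query can be forwarded to the LLM almost as quickly as a text-only query.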

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
