Abstract

Chatbots and virtual assistants powered by a multimodal large language model (LLM) can respond to a wide range of queries and can process input in text, audio, video, or other formats. However, when such a model is deployed on a user device such as a smartphone, it can be difficult to determine which data to capture and analyze and which modality to use for the response. This disclosure describes the use of contextual signals, such as device capabilities, device settings and state, and connections with other devices, obtained and used with user permission, to determine the input to be analyzed by the LLM and the format of its output. For example, an on-device chatbot can detect input such as screen sharing along with user commands and use the screen content as context when generating a response.
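
The sketch below is not part of the disclosure; it is a minimal illustration of how contextual signals, gathered only with user permission, might drive the choice of LLM input sources and output modality. All names (DeviceContext, select_modalities, respond) and the specific signals and rules are hypothetical assumptions, not the described implementation.

```python
from dataclasses import dataclass, field

@dataclass
class DeviceContext:
    """Contextual signals obtained with explicit user permission (hypothetical)."""
    has_display: bool = True
    has_microphone: bool = True
    screen_sharing_active: bool = False
    silent_mode: bool = False
    connected_devices: list = field(default_factory=list)  # e.g. ["earbuds", "smartwatch"]

def select_modalities(ctx: DeviceContext):
    """Decide which inputs to capture and which output format to use."""
    inputs = ["text"]
    if ctx.screen_sharing_active:
        inputs.append("screen_content")  # shared screen serves as added context
    if ctx.has_microphone and "earbuds" in ctx.connected_devices:
        inputs.append("audio")
    # Prefer spoken output when no display is available and sound is permitted.
    output = "audio" if (not ctx.has_display and not ctx.silent_mode) else "text"
    return inputs, output

def respond(user_command: str, ctx: DeviceContext, capture, llm_generate):
    """Assemble a multimodal prompt and request a response in the chosen format.

    `capture` and `llm_generate` are placeholder callables supplied by the caller.
    """
    inputs, output_format = select_modalities(ctx)
    prompt_parts = [user_command] + [capture(m) for m in inputs if m != "text"]
    return llm_generate(prompt_parts, output_format=output_format)
```

In this sketch the modality decision is a simple rule over the permitted signals; an actual on-device assistant could make the same decision with richer signals or a learned policy.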

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
