Conversational interfaces enable users to pose queries to and receive answers from a large language model (LLM). With server-hosted LLMs, there can be substantial latency between query completion and receipt of a response due to network delays and server computation time. This disclosure describes provisioning a smaller, distilled model on an edge device to reduce latency by generating responses on-device. Owing to its smaller scope and size, the on-device model's responses may not be as high quality as those obtained from a server LLM. The query and the locally generated response are therefore also provided to a server LLM that is instruction-tuned to correct errors in the on-device response. The higher-accuracy response from the server LLM can then be used as context by the on-device model when generating subsequent responses. This background interaction can be repeated for subsequent queries, producing high-quality responses with low latency.
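The interaction described above can be sketched as follows. This is a minimal illustration, not an implementation from the disclosure: the callables `on_device_generate` and `server_correct` are hypothetical stand-ins for the distilled on-device model and the instruction-tuned server LLM, and the server correction is shown synchronously for simplicity rather than as a true background task.

```python
class EdgeAssistant:
    """Sketch of the edge/server flow: fast on-device draft, server-side correction."""

    def __init__(self, on_device_generate, server_correct):
        self.on_device_generate = on_device_generate  # small distilled on-device model (stub)
        self.server_correct = server_correct          # instruction-tuned server LLM (stub)
        self.context = []  # corrected responses, reused as context for later turns

    def ask(self, query):
        # 1. Low-latency draft from the on-device model, conditioned on
        #    previously corrected responses.
        draft = self.on_device_generate(query, self.context)
        # 2. The server LLM corrects the draft (in practice, a background call).
        corrected = self.server_correct(query, draft)
        # 3. The higher-accuracy response becomes context for subsequent queries.
        self.context.append((query, corrected))
        return draft, corrected


# Toy stubs that only demonstrate the interaction pattern.
def toy_on_device(query, context):
    return f"draft answer to: {query}"

def toy_server(query, draft):
    return f"corrected version of ({draft})"

assistant = EdgeAssistant(toy_on_device, toy_server)
draft, corrected = assistant.ask("What is the capital of France?")
```

In a real deployment, step 2 would run asynchronously so the user sees the draft immediately, with the corrected response arriving (and updating the context) in the background.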

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.