Inventor(s)

D Shin

Abstract

While chatbots powered by large language models (LLMs) provide rich conversational experiences, a typical setup requires sending a user query (prompt) to a server, executing a server-side LLM to generate a response, and receiving the response over the network, all of which introduce latency. This disclosure describes a split compute framework that provides low-latency responses to user queries by using a local LLM on the client device. The output entropy of each token generated by the on-device language model is calculated and compared to a threshold to selectively trigger a larger, server-based language model. The detailed response generated by the server LLM is stitched onto the on-device response in a semantically natural way; to match context, the server LLM is provided the response generated by the on-device LLM as an instruction-tuned input.
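The entropy-gated handoff described above can be summarized in a minimal Python sketch. The local_step and server_llm callables, the threshold value, the token cap, and the "<eos>" marker are illustrative assumptions rather than details of the disclosure; the sketch only shows per-token entropy computation, the threshold check that triggers the server LLM, and the stitching of the server response onto the on-device draft.

import math
from typing import Callable, List, Tuple

ENTROPY_THRESHOLD = 2.5  # assumed value (in nats); tuned per deployment
MAX_TOKENS = 256         # illustrative cap on on-device generation

def token_entropy(logits: List[float]) -> float:
    """Shannon entropy (nats) of the softmax distribution over the logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # numerically stable softmax
    total = sum(exps)
    return -sum((e / total) * math.log(e / total) for e in exps)

def respond(prompt: str,
            local_step: Callable[[str], Tuple[str, List[float]]],
            server_llm: Callable[[str, str], str]) -> str:
    """Generate on-device; hand off to the server LLM when entropy spikes."""
    draft = ""
    for _ in range(MAX_TOKENS):
        token, logits = local_step(prompt + draft)
        if token_entropy(logits) > ENTROPY_THRESHOLD:
            # Low confidence: escalate. The draft is passed along so the
            # server output stitches naturally onto the on-device prefix.
            return draft + server_llm(prompt, draft)
        draft += token
        if token == "<eos>":
            break
    return draft

In such a setup, the threshold trades latency against response quality: a lower threshold escalates to the server more often, while a higher threshold keeps more of the generation on-device.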

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
