Abstract
While chatbots powered by large language models (LLMs) provide conversational experiences, the typical setup requires sending a user query (prompt) to a server, executing a server-side LLM to generate a response, and receiving the response over the network, all of which introduce latency. This disclosure describes a split compute framework that provides responses to user queries with low latency by using a local LLM on the client device. The output entropy of each token generated by the on-device language model is calculated and compared to a threshold to selectively trigger the larger server-based language model. The detailed response generated by the server LLM is stitched together with the on-device response in a semantically natural way. To that end, the server LLM is provided the response generated by the on-device LLM as part of an instruction-tuned input so that its output matches the existing context.
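The following is a minimal sketch of the entropy-gated escalation described in the abstract, under assumed interfaces and an assumed threshold value; the model methods (generate_with_probs, generate) and the threshold are illustrative placeholders, not the published implementation.

import math

ENTROPY_THRESHOLD = 2.5  # assumed value; tuned per model and application in practice

def token_entropy(probs):
    # Shannon entropy (in bits) of a next-token probability distribution.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def respond(prompt, local_model, server_model):
    # Generate with the on-device LLM; escalate to the server LLM when the
    # local model's per-token output entropy exceeds the threshold.
    tokens = []
    for token, probs in local_model.generate_with_probs(prompt):
        if token_entropy(probs) > ENTROPY_THRESHOLD:
            # Local model is uncertain: hand off to the server LLM, passing the
            # partial on-device response so the server output matches its context
            # and can be stitched to it naturally.
            partial = "".join(tokens)
            return partial + server_model.generate(prompt, prefix=partial)
        tokens.append(token)
    return "".join(tokens)

In this sketch, low-entropy (confident) tokens are served entirely on-device, so simple queries never incur a network round trip; the server LLM is invoked only when the local model's uncertainty rises.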
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Shin, D., "Selectively Triggering Server LLM Based on Local LLM Response Entropy to Provide a Low Latency Conversational Experience", Technical Disclosure Commons, (November 15, 2024)
https://www.tdcommons.org/dpubs_series/7532