Abstract
While chatbots powered by large language models (LLMs) provide conversational experiences, the typical setup requires sending a user query (prompt) to a server, executing a server-side LLM to generate a response, and receiving the response over the network, all of which introduce latency. This disclosure describes a split compute framework that provides responses to user queries with low latency by using a local LLM on the client device. The output entropy of each token generated by the on-device language model is calculated and compared to a threshold to selectively trigger the larger server-based language model. The detailed response generated by the server LLM is stitched together with the on-device response in a semantically natural way. To that end, the server LLM is provided the response generated by the on-device LLM as part of an instruction-tuned input so that its output matches the existing context.
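The following is a minimal sketch of the entropy-gated escalation described in the abstract, under assumed interfaces and an assumed threshold value; the model methods (generate_with_probs, generate) and the threshold are illustrative placeholders, not the published implementation.

import math

ENTROPY_THRESHOLD = 2.5  # assumed value; tuned per model and application in practice

def token_entropy(probs):
    # Shannon entropy (in bits) of a next-token probability distribution.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def respond(prompt, local_model, server_model):
    # Generate with the on-device LLM; escalate to the server LLM when the
    # local model's per-token output entropy exceeds the threshold.
    tokens = []
    for token, probs in local_model.generate_with_probs(prompt):
        if token_entropy(probs) > ENTROPY_THRESHOLD:
            # Local model is uncertain: hand off to the server LLM, passing the
            # partial on-device response so the server output matches its context
            # and can be stitched to it naturally.
            partial = "".join(tokens)
            return partial + server_model.generate(prompt, prefix=partial)
        tokens.append(token)
    return "".join(tokens)

In this sketch, low-entropy (confident) tokens are served entirely on-device, so simple queries never incur a network round trip; the server LLM is invoked only when the local model's uncertainty rises.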
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Shin, D., "Selectively Triggering Server LLM Based on Local LLM Response Entropy to Provide a Low Latency Conversational Experience", Technical Disclosure Commons, (November 15, 2024)
https://www.tdcommons.org/dpubs_series/7532