Abstract
This disclosure describes a system for continuous token streaming and inference for large language models (LLMs). LLM inference can be a high-latency operation, which can result in noticeable wait times before users receive an output. The described system can reduce latency by streaming tokens from a user’s device to an LLM in real time as the user enters them and by performing partial inference operations on the received tokens to generate intermediate inference states. Once a final token is received, the system can use these intermediate inference states to finalize inference and generate an output in less time than a full inference operation would require, thereby reducing latency and improving the user experience. The system can further support token modifications by reverting a partial inference operation to a previous intermediate inference state.
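A minimal Python sketch of the streaming pattern described above, assuming a hypothetical session object that consumes one token at a time, checkpoints an intermediate inference state after each token (for example, an attention key-value cache prefix), and can revert checkpoints when the user edits earlier tokens. All class and method names are illustrative, not part of the disclosure:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IntermediateState:
    """Snapshot of partial-inference state after consuming a token prefix."""
    tokens: List[str] = field(default_factory=list)
    # Stand-in for model-internal state (e.g., a key-value cache prefix).
    cache: List[str] = field(default_factory=list)


class StreamingInferenceSession:
    """Consumes tokens as the user types and checkpoints intermediate states,
    so finalizing an output needs only the remaining (not the full) computation."""

    def __init__(self) -> None:
        self._checkpoints: List[IntermediateState] = [IntermediateState()]

    def push_token(self, token: str) -> None:
        """Run a partial inference step on one new token and checkpoint the result."""
        prev = self._checkpoints[-1]
        new_state = IntermediateState(
            tokens=prev.tokens + [token],
            cache=prev.cache + [f"encoded({token})"],  # placeholder for real model state
        )
        self._checkpoints.append(new_state)

    def revert_to(self, prefix_length: int) -> None:
        """Discard checkpoints beyond the given prefix length, supporting token edits."""
        self._checkpoints = self._checkpoints[: prefix_length + 1]

    def finalize(self) -> str:
        """Complete inference from the latest intermediate state (the fast path)."""
        latest = self._checkpoints[-1]
        return f"output generated from {len(latest.tokens)} precomputed token states"


# Example: tokens stream in as the user types; the last token is then corrected.
session = StreamingInferenceSession()
for tok in ["Plan", "a", "trip", "to", "Paris"]:
    session.push_token(tok)
session.revert_to(4)        # user deletes "Paris"
session.push_token("Rome")  # and types a replacement
print(session.finalize())
```

In this sketch, reverting simply drops later checkpoints rather than recomputing from scratch, which is what allows token modifications without repeating a full inference pass.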
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Shin, Dongeek, "Continuous Input Streaming and Inference for Low-latency Large Language Models", Technical Disclosure Commons, (November 03, 2024)
https://www.tdcommons.org/dpubs_series/7493