Abstract
This disclosure describes a system for continuous token streaming and inference for large language models (LLMs). LLM inference can be a high-latency operation, which can result in noticeable wait times before users receive an output. The described system can reduce latency by streaming tokens from a user’s device to an LLM in real time as the user enters them and by performing partial inference operations on the received tokens to generate intermediate inference states. Once a final token is received, the system can use these intermediate inference states to finalize inference and generate an output in less time than a full inference operation would require, thereby reducing latency and improving the user experience. The system can further support token modifications by reverting a partial inference operation to a previous intermediate inference state.
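A minimal Python sketch of the streaming pattern described above, assuming a hypothetical session object that consumes one token at a time, checkpoints an intermediate inference state after each token (for example, an attention key-value cache prefix), and can revert checkpoints when the user edits earlier tokens. All class and method names are illustrative, not part of the disclosure:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IntermediateState:
    """Snapshot of partial-inference state after consuming a token prefix."""
    tokens: List[str] = field(default_factory=list)
    # Stand-in for model-internal state (e.g., a key-value cache prefix).
    cache: List[str] = field(default_factory=list)


class StreamingInferenceSession:
    """Consumes tokens as the user types and checkpoints intermediate states,
    so finalizing an output needs only the remaining (not the full) computation."""

    def __init__(self) -> None:
        self._checkpoints: List[IntermediateState] = [IntermediateState()]

    def push_token(self, token: str) -> None:
        """Run a partial inference step on one new token and checkpoint the result."""
        prev = self._checkpoints[-1]
        new_state = IntermediateState(
            tokens=prev.tokens + [token],
            cache=prev.cache + [f"encoded({token})"],  # placeholder for real model state
        )
        self._checkpoints.append(new_state)

    def revert_to(self, prefix_length: int) -> None:
        """Discard checkpoints beyond the given prefix length, supporting token edits."""
        self._checkpoints = self._checkpoints[: prefix_length + 1]

    def finalize(self) -> str:
        """Complete inference from the latest intermediate state (the fast path)."""
        latest = self._checkpoints[-1]
        return f"output generated from {len(latest.tokens)} precomputed token states"


# Example: tokens stream in as the user types; the last token is then corrected.
session = StreamingInferenceSession()
for tok in ["Plan", "a", "trip", "to", "Paris"]:
    session.push_token(tok)
session.revert_to(4)        # user deletes "Paris"
session.push_token("Rome")  # and types a replacement
print(session.finalize())
```

In this sketch, reverting simply drops later checkpoints rather than recomputing from scratch, which is what allows token modifications without repeating a full inference pass.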
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Shin, Dongeek, "Continuous Input Streaming and Inference for Low-latency Large Language Models", Technical Disclosure Commons, (November 03, 2024)
https://www.tdcommons.org/dpubs_series/7493