Inventor(s)

Dongeek Shin

Abstract

This disclosure describes a system for continuous token streaming and inference for large language models. Large language model inference can be a high-latency operation, which can result in noticeable wait times for users to receive an LLM output. This system can reduce latency by streaming tokens from a user’s device to an LLM in real time as the user inputs new tokens, and by performing partial inference operations on received tokens to generate intermediate inference states. Once a final token is received, the system can use the intermediate inference states to finalize inference operations and generate an output in less time than a full inference operation, resulting in reduced latency and improved user experience. This system can further support token modifications by reverting a partial inference operation to a previous intermediate inference state.
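The abstract's mechanism can be illustrated with a minimal sketch. The `step` function below is a hypothetical stand-in for one partial inference operation (e.g., extending a model's attention cache by one token); the class names and APIs are illustrative assumptions, not the disclosure's actual implementation. The sketch shows the two key ideas: intermediate states are cached as tokens stream in, so finalizing only processes work not already done, and a token edit reverts to a previous intermediate state instead of recomputing from scratch.

```python
import hashlib

def step(state: bytes, token: str) -> bytes:
    """Hypothetical stand-in for one partial inference operation
    (e.g., folding one new token into a cached model state)."""
    return hashlib.sha256(state + token.encode()).digest()

class StreamingInference:
    """Streams tokens as the user types, caching an intermediate
    state after each one so the final step is cheap."""

    def __init__(self):
        # states[i] = intermediate inference state after i tokens.
        self.states = [b""]

    def push_token(self, token: str) -> None:
        # Partial inference performed in real time as tokens arrive.
        self.states.append(step(self.states[-1], token))

    def revert_to(self, n: int) -> None:
        # Token modification: discard states past position n rather
        # than rerunning inference on the whole prefix.
        del self.states[n + 1:]

    def finalize(self) -> bytes:
        # The output derives from the cached state; no full pass over
        # the entire token sequence is needed at this point.
        return self.states[-1]

def full_inference(tokens) -> bytes:
    """Baseline: a single non-streaming pass over all tokens."""
    state = b""
    for t in tokens:
        state = step(state, t)
    return state
```

Streaming "a", "b", "c", then reverting to position 1 and streaming "x", "y" yields the same final state as a full pass over ["a", "x", "y"], but at finalization time only the un-cached suffix had to be processed.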

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
