Abstract

On-device large language model (LLM) pipelines that use hardware accelerators are often limited in the maximum context size they can process on-device. This constraint arises because neural network accelerators favor the execution of statically-shaped graphs, whereas varying context sizes require dynamic dimensions in the attention layers of the network. Conventional solutions are computationally inefficient and do not scale. This disclosure addresses the context-size limitation of on-device LLM inference by partitioning the attention graph into multiple reusable, statically-shaped graphs that are mathematically equivalent to the original computation. This approach enables inference with arbitrary context sizes using a fixed set of statically-shaped graphs, providing the flexibility and scalability needed to support long contexts for on-device LLMs.
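
To illustrate the general idea, the sketch below shows one way attention over an arbitrary-length context can be decomposed into identical fixed-shape pieces whose partial results are combined exactly. It is not taken from the disclosure: the chunk size, the function names (attend_chunk, attention_arbitrary_context), and the use of the standard online-softmax recurrence to merge chunk results are all illustrative assumptions; an accelerator implementation would compile the per-chunk computation as a single statically-shaped graph and reuse it.

```python
# Minimal NumPy sketch (assumptions noted above): attention over any context
# length computed with one fixed-shape per-chunk kernel plus exact merging.
import numpy as np

CHUNK = 128  # static chunk size the per-chunk graph would be compiled for
D = 64       # head dimension

def attend_chunk(q, k_chunk, v_chunk, mask_chunk):
    """Fixed-shape partial attention over one CHUNK-sized slice of the KV cache.

    Returns the un-normalized weighted sum of values, the chunk-local logit
    max, and the chunk-local sum of exponentials, so chunks merge exactly.
    """
    logits = q @ k_chunk.T / np.sqrt(D)            # shape (CHUNK,)
    logits = np.where(mask_chunk, logits, -np.inf)  # ignore padded positions
    m = logits.max()
    p = np.exp(logits - m)                          # numerically stable
    return p @ v_chunk, m, p.sum()

def attention_arbitrary_context(q, k, v):
    """Exact attention for any context length using only CHUNK-shaped calls."""
    n = k.shape[0]
    pad = (-n) % CHUNK
    k = np.pad(k, ((0, pad), (0, 0)))
    v = np.pad(v, ((0, pad), (0, 0)))
    mask = np.arange(n + pad) < n

    out = np.zeros(D)
    m_run, s_run = -np.inf, 0.0
    for start in range(0, n + pad, CHUNK):
        o_c, m_c, s_c = attend_chunk(
            q, k[start:start + CHUNK], v[start:start + CHUNK],
            mask[start:start + CHUNK])
        m_new = max(m_run, m_c)
        # Rescale previously accumulated partial results to the new running max.
        out = out * np.exp(m_run - m_new) + o_c * np.exp(m_c - m_new)
        s_run = s_run * np.exp(m_run - m_new) + s_c * np.exp(m_c - m_new)
        m_run = m_new
    return out / s_run
```

Because the rescaling step recovers the global softmax normalization, the result matches dense attention over the full, dynamic-length context while every accelerator call operates on tensors of a single static shape.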

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
