Abstract
On-device large language model (LLM) pipelines that rely on hardware accelerators are often limited in the maximum context size they can process. The constraint arises because neural network accelerators favor execution of statically-shaped graphs, while varying context sizes require dynamic dimensions in the attention layers of the network. Conventional solutions are computationally inefficient and scale poorly. This disclosure addresses the context size limitation of on-device LLM inference by partitioning the attention graph into multiple reusable, statically-shaped graphs that are mathematically equivalent to the original computation. The approach enables inference with arbitrary context sizes using a fixed set of statically-shaped graphs, providing the flexibility and scalability needed to support long contexts for on-device LLMs.
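
The disclosure describes the partitioning only at a high level. The following is a minimal sketch, not the authors' implementation, assuming a single query vector, a NumPy setting, and an online-softmax (block-wise) accumulation; the names chunked_attention and block_size are illustrative. It shows how attention over an arbitrary context can be evaluated as a sequence of identical, statically-shaped block computations, with the last block padded and masked, while remaining mathematically equivalent to full attention.

import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / np.sum(e)

def full_attention(q, k, v):
    """Reference: scaled dot-product attention over the full context."""
    d = q.shape[-1]
    scores = (k @ q) / np.sqrt(d)          # shape [n]
    return softmax(scores) @ v             # shape [d]

def chunked_attention(q, k, v, block_size):
    """Attention computed over fixed-size KV blocks.

    Every block is processed with the same static shapes
    (q: [d], k_blk and v_blk: [block_size, d]); the final block is padded
    and masked so no dynamic dimension is needed. Partial results are
    combined with an online-softmax accumulation, so the output is
    mathematically equivalent to full_attention.
    """
    d = q.shape[-1]
    n = k.shape[0]
    m = -np.inf           # running maximum of attention logits
    denom = 0.0           # running softmax denominator
    acc = np.zeros(d)     # running weighted sum of value vectors
    for start in range(0, n, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        pad = block_size - k_blk.shape[0]
        if pad:  # pad the last block so every block has the same static shape
            k_blk = np.pad(k_blk, ((0, pad), (0, 0)))
            v_blk = np.pad(v_blk, ((0, pad), (0, 0)))
        scores = (k_blk @ q) / np.sqrt(d)  # shape [block_size]
        if pad:  # mask padded positions so they contribute nothing
            scores[-pad:] = -np.inf
        m_new = max(m, float(scores.max()))
        scale = np.exp(m - m_new)          # rescale earlier partial results
        w = np.exp(scores - m_new)
        denom = denom * scale + w.sum()
        acc = acc * scale + w @ v_blk
        m = m_new
    return acc / denom

# Equivalence check for a context length that is not a multiple of the block size.
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal((250, 64))
v = rng.standard_normal((250, 64))
assert np.allclose(chunked_attention(q, k, v, block_size=64),
                   full_attention(q, k, v))

Because each per-block computation uses only fixed shapes, a single compiled accelerator graph can be reused for every block; supporting a longer context means more invocations of the same graph rather than a new graph compilation.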
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Wang, Xusong and Wang, Miao, "On-device Large Language Model Inference with Unlimited Context Size", Technical Disclosure Commons, (October 24, 2025)
https://www.tdcommons.org/dpubs_series/8783