Abstract

On-device large language model (LLM) pipelines that use hardware accelerators are often limited in the maximum context size they can process on-device. This constraint arises because neural network accelerators favor the execution of statically-shaped graphs, whereas varying context sizes require dynamic dimensions in the attention layers of the network. Conventional solutions are computationally inefficient and do not scale. This disclosure addresses the context-size limitation of on-device LLM inference by partitioning the attention graph into multiple reusable, statically-shaped graphs that are mathematically equivalent to the original computation. This approach enables inference with arbitrary context sizes using a fixed set of statically-shaped graphs, providing the flexibility and scalability needed to support long contexts for on-device LLMs.
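
To illustrate the general idea, the sketch below shows one way attention over an arbitrary-length context can be decomposed into identical fixed-shape pieces whose partial results are combined exactly. It is not taken from the disclosure: the chunk size, the function names (attend_chunk, attention_arbitrary_context), and the use of the standard online-softmax recurrence to merge chunk results are all illustrative assumptions; an accelerator implementation would compile the per-chunk computation as a single statically-shaped graph and reuse it.

```python
# Minimal NumPy sketch (assumptions noted above): attention over any context
# length computed with one fixed-shape per-chunk kernel plus exact merging.
import numpy as np

CHUNK = 128  # static chunk size the per-chunk graph would be compiled for
D = 64       # head dimension

def attend_chunk(q, k_chunk, v_chunk, mask_chunk):
    """Fixed-shape partial attention over one CHUNK-sized slice of the KV cache.

    Returns the un-normalized weighted sum of values, the chunk-local logit
    max, and the chunk-local sum of exponentials, so chunks merge exactly.
    """
    logits = q @ k_chunk.T / np.sqrt(D)            # shape (CHUNK,)
    logits = np.where(mask_chunk, logits, -np.inf)  # ignore padded positions
    m = logits.max()
    p = np.exp(logits - m)                          # numerically stable
    return p @ v_chunk, m, p.sum()

def attention_arbitrary_context(q, k, v):
    """Exact attention for any context length using only CHUNK-shaped calls."""
    n = k.shape[0]
    pad = (-n) % CHUNK
    k = np.pad(k, ((0, pad), (0, 0)))
    v = np.pad(v, ((0, pad), (0, 0)))
    mask = np.arange(n + pad) < n

    out = np.zeros(D)
    m_run, s_run = -np.inf, 0.0
    for start in range(0, n + pad, CHUNK):
        o_c, m_c, s_c = attend_chunk(
            q, k[start:start + CHUNK], v[start:start + CHUNK],
            mask[start:start + CHUNK])
        m_new = max(m_run, m_c)
        # Rescale previously accumulated partial results to the new running max.
        out = out * np.exp(m_run - m_new) + o_c * np.exp(m_c - m_new)
        s_run = s_run * np.exp(m_run - m_new) + s_c * np.exp(m_c - m_new)
        m_run = m_new
    return out / s_run
```

Because the rescaling step recovers the global softmax normalization, the result matches dense attention over the full, dynamic-length context while every accelerator call operates on tensors of a single static shape.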

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
