Abstract
Large language models (LLMs) operating in a server-client framework can incur high latency, particularly for complex queries. This disclosure describes a framework that serially distributes LLM inference layers across compute nodes based on just-in-time query optimization. The division of the transformer layers is parameterized by the number of compute nodes and the query frequencies, enabling optimal asynchronous operation for LLM task completion. Because the distributed computation and inference are asynchronous, latent and direct queries do not compete with each other, which reduces the latency of LLM task completion.
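The disclosure itself contains no code; the following is a minimal Python sketch of the cascaded, asynchronous layer-pipeline idea described above, assuming an even split of transformer layers across nodes (the disclosure additionally parameterizes the split by query frequency, which is not modeled here). All names (partition_layers, node_worker, cascaded_inference) are hypothetical illustrations, not the disclosed implementation.

```python
import asyncio


def partition_layers(num_layers: int, num_nodes: int) -> list[range]:
    """Split transformer layers into contiguous shards, one per compute node.

    A simple even split; the disclosure's frequency-based parameterization
    would instead size shards using observed query frequencies (assumption).
    """
    base, extra = divmod(num_layers, num_nodes)
    shards, start = [], 0
    for i in range(num_nodes):
        size = base + (1 if i < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


async def node_worker(shard: range, inbox: asyncio.Queue, outbox: asyncio.Queue):
    """Run one node's layer shard on each incoming activation, then forward it."""
    while True:
        query_id, activations = await inbox.get()
        # Stand-in for executing this shard's transformer layers.
        for layer in shard:
            activations = f"{activations}->L{layer}"
        await outbox.put((query_id, activations))


async def cascaded_inference(queries, num_layers: int = 12, num_nodes: int = 3):
    """Push queries through a chain of nodes; stages overlap asynchronously,
    so queries at different pipeline depths do not compete with each other."""
    shards = partition_layers(num_layers, num_nodes)
    queues = [asyncio.Queue() for _ in range(num_nodes + 1)]
    workers = [
        asyncio.create_task(node_worker(shards[i], queues[i], queues[i + 1]))
        for i in range(num_nodes)
    ]
    for qid, q in enumerate(queries):
        await queues[0].put((qid, q))
    results = [await queues[-1].get() for _ in queries]
    for w in workers:
        w.cancel()
    return results


if __name__ == "__main__":
    print(asyncio.run(cascaded_inference(["q0", "q1", "q2"])))
```

In this sketch, each node holds one contiguous shard of layers and processes queries independently of the other stages, so a new query can enter node 0 while earlier queries are still progressing through downstream nodes, mirroring the asynchronous, non-competing behavior the abstract attributes to the framework.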
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
NA, "Multi-Device Asynchronous Distributed Compute Framework of LLM Layers for Cascaded Inference", Technical Disclosure Commons, (September 04, 2025)
https://www.tdcommons.org/dpubs_series/8559