Abstract
Large language models (LLMs) operating in a server-client framework can incur high latency, particularly for complex queries. This disclosure describes a framework that serially distributes LLM inference layers across compute nodes based on just-in-time query optimization. The division of the transformer layers is parameterized by the number of compute nodes and the query frequencies, enabling optimal asynchronous operation for LLM task completion. Because the distributed computation and inference are asynchronous, latent and direct queries do not compete with each other, which reduces the latency of LLM task completion.
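The disclosure itself contains no code; the following is a minimal Python sketch of the cascaded, asynchronous layer-pipeline idea described above, assuming an even split of transformer layers across nodes (the disclosure additionally parameterizes the split by query frequency, which is not modeled here). All names (partition_layers, node_worker, cascaded_inference) are hypothetical illustrations, not the disclosed implementation.

```python
import asyncio


def partition_layers(num_layers: int, num_nodes: int) -> list[range]:
    """Split transformer layers into contiguous shards, one per compute node.

    A simple even split; the disclosure's frequency-based parameterization
    would instead size shards using observed query frequencies (assumption).
    """
    base, extra = divmod(num_layers, num_nodes)
    shards, start = [], 0
    for i in range(num_nodes):
        size = base + (1 if i < extra else 0)
        shards.append(range(start, start + size))
        start += size
    return shards


async def node_worker(shard: range, inbox: asyncio.Queue, outbox: asyncio.Queue):
    """Run one node's layer shard on each incoming activation, then forward it."""
    while True:
        query_id, activations = await inbox.get()
        # Stand-in for executing this shard's transformer layers.
        for layer in shard:
            activations = f"{activations}->L{layer}"
        await outbox.put((query_id, activations))


async def cascaded_inference(queries, num_layers: int = 12, num_nodes: int = 3):
    """Push queries through a chain of nodes; stages overlap asynchronously,
    so queries at different pipeline depths do not compete with each other."""
    shards = partition_layers(num_layers, num_nodes)
    queues = [asyncio.Queue() for _ in range(num_nodes + 1)]
    workers = [
        asyncio.create_task(node_worker(shards[i], queues[i], queues[i + 1]))
        for i in range(num_nodes)
    ]
    for qid, q in enumerate(queries):
        await queues[0].put((qid, q))
    results = [await queues[-1].get() for _ in queries]
    for w in workers:
        w.cancel()
    return results


if __name__ == "__main__":
    print(asyncio.run(cascaded_inference(["q0", "q1", "q2"])))
```

In this sketch, each node holds one contiguous shard of layers and processes queries independently of the other stages, so a new query can enter node 0 while earlier queries are still progressing through downstream nodes, mirroring the asynchronous, non-competing behavior the abstract attributes to the framework.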
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
NA, "Multi-Device Asynchronous Distributed Compute Framework of LLM Layers for Cascaded Inference", Technical Disclosure Commons, (September 04, 2025)
https://www.tdcommons.org/dpubs_series/8559