Inventor(s)

N/A

Abstract

The technology described in this paper relates to maximizing resource utilization for transformer-based large language models (LLMs), specifically during the autoregressive decode phase. In the decode phase, each layer of a transformer model is primarily composed of two distinct computational blocks: an attention operator block and a feed-forward network (FFN) operator block. The attention operator block is overwhelmingly memory-bandwidth bound. In contrast, the FFN operator block is largely compute-bound, and its computations are dominated by large dense matrix multiplications. A chip architecture may be provided that contains at least two types of specialized co-located cores: a first core optimized for arithmetic-intensive operations and a second core optimized for bandwidth-intensive operations. The chip architecture may then be leveraged to perform disaggregated execution of attention operators and FFN operators on heterogeneous compute cores co-located on the same chip, enabling fine-grained, intra-layer pipelining to maximize resource utilization during LLM decoding.
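
The following is a minimal, hedged sketch of the intra-layer pipelining idea described above, not the publication's implementation. It assumes a micro-batch split of the decode batch, placeholder attention() and ffn() functions, and models the two co-located core types as two single-worker executors; all names and signatures are illustrative assumptions.

```python
# Sketch only: intra-layer pipelining of attention (bandwidth-bound) and
# FFN (compute-bound) operators across two co-located core types.
from concurrent.futures import ThreadPoolExecutor

# One single-worker executor per assumed specialized core type on the chip.
bandwidth_core = ThreadPoolExecutor(max_workers=1)  # attention: memory-bandwidth bound
compute_core = ThreadPoolExecutor(max_workers=1)    # FFN: compute bound

def attention(layer, chunk):
    # Placeholder for the KV-cache-heavy attention operator on one micro-batch.
    return f"attn(layer={layer}, chunk={chunk})"

def ffn(layer, attn_out):
    # Placeholder for the dense-matmul-heavy feed-forward operator.
    return f"ffn(layer={layer}, {attn_out})"

def decode_layer(layer, chunks):
    """Pipeline micro-batches through one transformer layer: while the compute
    core runs the FFN for chunk i, the bandwidth core already runs the
    attention operator for chunk i + 1."""
    outputs = []
    pending_attn = bandwidth_core.submit(attention, layer, chunks[0])
    for nxt in chunks[1:]:
        attn_out = pending_attn.result()                              # wait for chunk i attention
        pending_ffn = compute_core.submit(ffn, layer, attn_out)       # compute core busy with chunk i
        pending_attn = bandwidth_core.submit(attention, layer, nxt)   # overlaps: chunk i + 1 attention
        outputs.append(pending_ffn.result())
    outputs.append(compute_core.submit(ffn, layer, pending_attn.result()).result())
    return outputs

if __name__ == "__main__":
    # One decode step for a hypothetical 2-layer model, batch split into 4 micro-batches.
    chunks = [f"mb{i}" for i in range(4)]
    for layer in range(2):
        print(decode_layer(layer, chunks))
```

The overlap in the sketch is what lets the bandwidth-optimized core and the compute-optimized core stay busy simultaneously within a single layer, rather than alternating between attention and FFN phases.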

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
