Inventor(s)

N/A

Abstract

The technology described in this paper relates to maximizing resource utilization for transformer-based large language models (LLMs), specifically during the autoregressive decode phase. In the decode phase, each layer of a transformer model is primarily composed of two distinct computational blocks: an attention operator block and a feed-forward network (FFN) operator block. The attention operator block is overwhelmingly memory-bandwidth bound. In contrast, the FFN operator block is largely compute-bound, and its computations are dominated by large dense matrix multiplications. A chip architecture may be provided that contains at least two types of specialized co-located cores: a first core optimized for arithmetic-intensive operations and a second core optimized for bandwidth-intensive operations. The chip architecture may then be leveraged to perform disaggregated execution of attention operators and FFN operators on heterogeneous compute cores co-located on the same chip, enabling fine-grained, intra-layer pipelining to maximize resource utilization during LLM decoding.
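
The following is a minimal, hedged sketch of the intra-layer pipelining idea described above, not the publication's implementation. It assumes a micro-batch split of the decode batch, placeholder attention() and ffn() functions, and models the two co-located core types as two single-worker executors; all names and signatures are illustrative assumptions.

```python
# Sketch only: intra-layer pipelining of attention (bandwidth-bound) and
# FFN (compute-bound) operators across two co-located core types.
from concurrent.futures import ThreadPoolExecutor

# One single-worker executor per assumed specialized core type on the chip.
bandwidth_core = ThreadPoolExecutor(max_workers=1)  # attention: memory-bandwidth bound
compute_core = ThreadPoolExecutor(max_workers=1)    # FFN: compute bound

def attention(layer, chunk):
    # Placeholder for the KV-cache-heavy attention operator on one micro-batch.
    return f"attn(layer={layer}, chunk={chunk})"

def ffn(layer, attn_out):
    # Placeholder for the dense-matmul-heavy feed-forward operator.
    return f"ffn(layer={layer}, {attn_out})"

def decode_layer(layer, chunks):
    """Pipeline micro-batches through one transformer layer: while the compute
    core runs the FFN for chunk i, the bandwidth core already runs the
    attention operator for chunk i + 1."""
    outputs = []
    pending_attn = bandwidth_core.submit(attention, layer, chunks[0])
    for nxt in chunks[1:]:
        attn_out = pending_attn.result()                              # wait for chunk i attention
        pending_ffn = compute_core.submit(ffn, layer, attn_out)       # compute core busy with chunk i
        pending_attn = bandwidth_core.submit(attention, layer, nxt)   # overlaps: chunk i + 1 attention
        outputs.append(pending_ffn.result())
    outputs.append(compute_core.submit(ffn, layer, pending_attn.result()).result())
    return outputs

if __name__ == "__main__":
    # One decode step for a hypothetical 2-layer model, batch split into 4 micro-batches.
    chunks = [f"mb{i}" for i in range(4)]
    for layer in range(2):
        print(decode_layer(layer, chunks))
```

The overlap in the sketch is what lets the bandwidth-optimized core and the compute-optimized core stay busy simultaneously within a single layer, rather than alternating between attention and FFN phases.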

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
