Inventor(s)

NA

Abstract

The performance and throughput of machine learning (ML) models such as transformers and large language models (LLMs) are limited by the bandwidth of high-bandwidth memory (HBM). This disclosure describes tensor decomposition techniques that leverage the sparsity of tensors and optimize their quantization to reduce memory and bandwidth requirements during ML model operation. Low-rank matrices used in transformers are decomposed into sums of n-dimensional vectors. The design space and ML serving accuracy inform the hardware-software codesign and acceleration parameters. Advantageously, hardware-accelerated model training can be performed while the model weights are in a compressed state (the tensor is in rank-decomposed form), eliminating offline decomposition/quantization. Inference, however, is performed with normal matrices, i.e., matrices rebuilt from their decomposed form. Tensor operations (such as multiplications) can be performed conventionally by a hardware accelerator to maintain backward compatibility with existing algorithms.
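The sketch below illustrates the general idea of the training/inference split described above: a weight matrix is stored and updated as a sum of rank-1 outer products (the compressed, rank-decomposed state), then rebuilt into a dense matrix so that inference-time tensor operations run unchanged. The disclosure does not specify a particular decomposition; SVD-based truncation and the function names here are illustrative assumptions, and quantization of the factors is omitted for brevity.

```python
import numpy as np

def decompose(W, rank):
    """Illustrative rank decomposition (assumption: truncated SVD).
    Returns factors A (m x r) and B (r x n) so that
    W ~= A @ B = sum_k outer(A[:, k], B[k, :])."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]  # fold singular values into left factors
    B = Vt[:rank, :]
    return A, B

def rebuild(A, B):
    """Rebuild the dense matrix from its decomposed form so that
    downstream tensor operations (e.g., matmuls on an accelerator)
    remain backward compatible with existing algorithms."""
    return A @ B

# Hypothetical usage: train/store in compressed form, infer with the
# rebuilt dense matrix.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 64)) @ rng.standard_normal((64, 512))
A, B = decompose(W, rank=64)   # compressed, rank-decomposed state
W_hat = rebuild(A, B)          # dense matrix used for inference
print("storage ratio:", (A.size + B.size) / W.size)
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

Because the factors are smaller than the dense matrix whenever the rank is low relative to the matrix dimensions, keeping weights in this form reduces the HBM footprint and bandwidth during training, while the rebuild step preserves the conventional dense-matrix interface at inference time.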

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
