Abstract
The performance and throughput of machine learning (ML) models such as transformers and large language models (LLMs) are limited by the bandwidth of high-bandwidth memory (HBM). This disclosure describes tensor decomposition techniques that leverage the sparsity of tensors and optimize their quantization to reduce memory and bandwidth requirements during ML model operation. Low-rank matrices used in transformers are decomposed into sums of n-dimensional vectors. The design space and ML serving accuracy inform the hardware-software codesign and acceleration parameters. Advantageously, hardware-accelerated model training can be performed while the model weights are in a compressed state (i.e., with tensors in rank-decomposed form), eliminating the need for offline decomposition and quantization. Inference, however, is performed with normal matrices, i.e., matrices rebuilt from their decomposed form. Tensor operations such as multiplications can be performed conventionally by a hardware accelerator, maintaining backward compatibility with existing algorithms.
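As a minimal sketch of the idea described above, the snippet below decomposes a weight matrix into a sum of rank-one vector outer products and rebuilds a normal matrix from the decomposed form for inference. The use of truncated SVD, the function names, and the matrix sizes are illustrative assumptions, not the specific decomposition or quantization scheme of the disclosure.

```python
import numpy as np

def decompose_low_rank(weight, rank):
    # Truncated SVD (illustrative choice): approximate the weight matrix as a
    # sum of `rank` outer products of vectors, i.e., a rank-decomposed,
    # compressed form of the tensor.
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    left_factors = u[:, :rank] * s[:rank]   # left vectors scaled by singular values
    right_factors = vt[:rank, :]             # right vectors
    return left_factors, right_factors

def rebuild(left_factors, right_factors):
    # Reconstruct a normal (dense) matrix from its decomposed form so that
    # inference can use conventional matrix multiplication kernels.
    return left_factors @ right_factors

# Hypothetical example: compress a 4096x4096 projection weight to rank 64.
weight = np.random.randn(4096, 4096).astype(np.float32)
u_f, v_f = decompose_low_rank(weight, rank=64)
dense_again = rebuild(u_f, v_f)  # rebuilt matrix used for standard inference
```

In this sketch, only the two factor matrices need to be stored and moved through HBM during training, while inference operates on the rebuilt dense matrix, preserving backward compatibility with existing accelerator kernels.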
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
NA, "Compression of Large Language Model (LLM) Weights via Tensor Decomposition", Technical Disclosure Commons, (October 16, 2024)
https://www.tdcommons.org/dpubs_series/7436