Abstract
The performance and throughput of machine learning (ML) models such as transformers and large language models (LLMs) are limited by the bandwidth of high-bandwidth memory (HBM). This disclosure describes tensor decomposition techniques that leverage the sparsity of tensors and optimize their quantization to reduce memory and bandwidth requirements during ML model operation. Low-rank matrices used in transformers are decomposed into sums of n-dimensional vectors. The design space and ML serving accuracy inform the hardware-software codesign and acceleration parameters. Advantageously, hardware-accelerated model training can be performed while the model weights are in a compressed state (i.e., with tensors in rank-decomposed form), eliminating the need for offline decomposition and quantization. Inference, however, is performed with normal matrices, i.e., matrices rebuilt from their decomposed form. Tensor operations such as multiplications can be performed conventionally by a hardware accelerator, maintaining backward compatibility with existing algorithms.
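As a minimal sketch of the idea described above, the snippet below decomposes a weight matrix into a sum of rank-one vector outer products and rebuilds a normal matrix from the decomposed form for inference. The use of truncated SVD, the function names, and the matrix sizes are illustrative assumptions, not the specific decomposition or quantization scheme of the disclosure.

```python
import numpy as np

def decompose_low_rank(weight, rank):
    # Truncated SVD (illustrative choice): approximate the weight matrix as a
    # sum of `rank` outer products of vectors, i.e., a rank-decomposed,
    # compressed form of the tensor.
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    left_factors = u[:, :rank] * s[:rank]   # left vectors scaled by singular values
    right_factors = vt[:rank, :]             # right vectors
    return left_factors, right_factors

def rebuild(left_factors, right_factors):
    # Reconstruct a normal (dense) matrix from its decomposed form so that
    # inference can use conventional matrix multiplication kernels.
    return left_factors @ right_factors

# Hypothetical example: compress a 4096x4096 projection weight to rank 64.
weight = np.random.randn(4096, 4096).astype(np.float32)
u_f, v_f = decompose_low_rank(weight, rank=64)
dense_again = rebuild(u_f, v_f)  # rebuilt matrix used for standard inference
```

In this sketch, only the two factor matrices need to be stored and moved through HBM during training, while inference operates on the rebuilt dense matrix, preserving backward compatibility with existing accelerator kernels.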
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
NA, "Compression of Large Language Model (LLM) Weights via Tensor Decomposition", Technical Disclosure Commons, (October 16, 2024)
https://www.tdcommons.org/dpubs_series/7436