Large Language Model (LLM) deployment approaches are proposed herein that may facilitate deploying LLMs on mobile, edge, and low-resource hardware by enabling high accuracy and speed in real-time applications. By leveraging advanced techniques, such as selective pruning, a fine-tuning methodology referred to herein as Quantized Low-Rank Adaptation-Blend (QLoRA-Blend) fine-tuning, and efficient quantization, the deployment approaches proposed herein may deliver top-tier artificial intelligence (AI) performance, at a level unmatched by existing deployment strategies, directly on consumer-grade hardware.
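The quantization and low-rank-adaptation ideas named above can be illustrated with a minimal NumPy sketch: frozen base weights are quantized to int8 (shrinking memory for low-resource hardware), while a small full-precision low-rank adapter carries the fine-tuned update. This is a generic QLoRA-style sketch under stated assumptions, not the proposed QLoRA-Blend method itself, whose blend-specific details are not described in this passage; all names and shapes here are illustrative.

```python
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantization: map floats into [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximate float32 weight from the int8 codes.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)

# Frozen base weight, stored quantized (4x smaller than float32).
W = rng.standard_normal((64, 64)).astype(np.float32)
Wq, s = quantize_int8(W)

# Low-rank adapter of rank r; B starts at zero so the adapter
# initially contributes nothing, the usual LoRA initialization.
r = 4
A = rng.standard_normal((64, r)).astype(np.float32) * 0.01
B = np.zeros((r, 64), dtype=np.float32)

def forward(x):
    # Effective weight is dequantized base plus low-rank update:
    # y = x @ (dequant(Wq) + A @ B), computed without forming A @ B.
    return x @ dequantize(Wq, s) + (x @ A) @ B
```

During fine-tuning only `A` and `B` would receive gradients, so the memory cost of adaptation scales with the rank `r` rather than with the full weight matrix.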

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.