Inventor(s)

Nadhem J. Al-Fardan

Abstract

Presented herein are novel techniques for integrating low-rank adaptation (LoRA) adapters in a distributed Graphics Processing Unit (GPU) environment while addressing privacy concerns. By hosting the base model and the adapters on separate GPUs, the techniques balance computational efficiency with data security. Through the techniques proposed herein, a dynamic system for AI model inference can be provided that separates the base model and the LoRA adapter onto different GPUs, or onto two groups of clustered GPUs connected via a network. A cluster is defined as a set of GPUs that run the same binary, such as the base model or a LoRA adapter. Specifically, the base model, or a quantized version of the base model, can be loaded and processed on a service provider's infrastructure, while the LoRA adjustments can be loaded on a client's GPU, preserving the precision of task-specific fine-tuning. This setup optimizes data transfer and maintains privacy by allowing the service provider access only to the query, and not to the final output, with potential enhancements to improve privacy through model parallelism.
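The following minimal sketch illustrates the split-inference idea for a single adapted linear layer, under simplifying assumptions: the "provider" and "client" are in-process functions standing in for GPUs communicating over a network, and all names, dimensions, and initializations are illustrative rather than part of the disclosed system.

```python
import torch

torch.manual_seed(0)

d_in, d_out, rank = 16, 16, 4

# --- Service provider side: hosts the frozen (possibly quantized) base model ---
W = torch.randn(d_out, d_in)  # frozen base weight of one linear layer

def provider_forward(x: torch.Tensor) -> torch.Tensor:
    """Runs the base model on the provider's infrastructure; sees only the query."""
    return x @ W.T

# --- Client side: hosts the private LoRA adapter (A, B) ---
A = torch.randn(rank, d_in) * 0.01   # low-rank down-projection
B = torch.randn(d_out, rank) * 0.01  # low-rank up-projection (stands in for trained values)
scaling = 1.0

def client_combine(x: torch.Tensor, base_out: torch.Tensor) -> torch.Tensor:
    """Adds the LoRA correction B(Ax) locally; the final output never leaves the client."""
    return base_out + scaling * ((x @ A.T) @ B.T)

query = torch.randn(1, d_in)
base_out = provider_forward(query)            # sent over the network to the client
final_out = client_combine(query, base_out)   # provider never observes this
```

In a real deployment of the disclosed approach, this exchange would occur at each adapted layer, with the provider's GPU cluster and the client's GPU communicating over the network, so the provider observes only the query and base-model activations, never the LoRA-corrected output.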

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
