Abstract

Current techniques to secure large language models (LLMs) against prompt attacks (e.g., prompt injection, prompt hacking) are typically static and do not adapt to the continuously evolving prompt attack landscape. They are expensive to operate, introduce latency, and add management and monitoring overhead because they require ongoing tuning. This disclosure describes techniques to dynamically adjust the security configuration of an LLM based on current risk evaluations of prompt attacks and other vulnerabilities. An attack classification model, trained on various types of prompt attacks, infers a control profile appropriate to the current threat situation, the input prompt sequence, and the response. Based on this inference, the security controls are adjusted to harden the system against prompt attacks. By using an AI model to automatically tune the security controls of an LLM to match the threat presented, the need for manual monitoring and tuning is obviated. Users enjoy improved efficiency, throughput, and responsiveness, while the LLM operator enjoys hardened, automated protection at reduced cost.
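
A minimal sketch of the core idea, in Python, is shown below. It maps the per-attack-type scores of an attack classification model to a tiered security control profile. All names here (ControlProfile, select_profile, the control fields, and the score thresholds) are illustrative assumptions, not part of the disclosed design; a deployed system would learn and tune these dynamically.

```python
# Hypothetical sketch: map attack-classifier scores to a security
# control profile. All class/field names and thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class ControlProfile:
    """Security controls applied on the LLM serving path (illustrative)."""
    input_filter_strictness: float   # 0.0 (permissive) .. 1.0 (strict)
    output_scan_enabled: bool        # scan responses for leaked data
    max_context_tokens: int          # shrink usable context under attack
    require_human_review: bool       # route suspect sessions for review

# Assumed threat tiers; thresholds below would be tuned automatically.
PROFILES = {
    "low":    ControlProfile(0.2, False, 8192, False),
    "medium": ControlProfile(0.6, True, 4096, False),
    "high":   ControlProfile(0.9, True, 1024, True),
}

def select_profile(attack_scores: dict[str, float]) -> ControlProfile:
    """Pick a control profile from per-attack-type classifier scores.

    `attack_scores` maps attack classes (e.g., "prompt_injection",
    "jailbreak") to probabilities emitted by the attack classification
    model over the input prompt sequence and response.
    """
    risk = max(attack_scores.values(), default=0.0)
    if risk >= 0.8:
        return PROFILES["high"]
    if risk >= 0.4:
        return PROFILES["medium"]
    return PROFILES["low"]

# Example: the classifier flags a likely prompt injection, so the
# strictest profile is selected and controls are hardened accordingly.
profile = select_profile({"prompt_injection": 0.87, "jailbreak": 0.12})
assert profile.require_human_review
```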

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
