Abstract

Current techniques to secure large language models (LLMs) against prompt attacks (e.g., prompt injection, prompt hacking) are typically static and do not adapt to the continuously evolving prompt attack landscape. They are expensive to operate, introduce latency, and add management and monitoring overhead because they require ongoing tuning. This disclosure describes techniques to dynamically adjust the security configuration of an LLM based on current risk evaluations of prompt attacks and other vulnerabilities. An attack classification model, trained on various types of prompt attacks, infers a control profile appropriate to the current threat situation, the input prompt sequence, and the response. Based on this inference, the security controls are adjusted to harden the system against prompt attacks. By using an AI model to automatically tune the security controls of an LLM to match the threat presented, the need for manual monitoring and tuning is obviated. Users enjoy improved efficiency, throughput, and responsiveness, while the LLM operator enjoys hardened, automated protection at reduced cost.
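
A minimal sketch of the core idea, in Python, is shown below. It maps the per-attack-type scores of an attack classification model to a tiered security control profile. All names here (ControlProfile, select_profile, the control fields, and the score thresholds) are illustrative assumptions, not part of the disclosed design; a deployed system would learn and tune these dynamically.

```python
# Hypothetical sketch: map attack-classifier scores to a security
# control profile. All class/field names and thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class ControlProfile:
    """Security controls applied on the LLM serving path (illustrative)."""
    input_filter_strictness: float   # 0.0 (permissive) .. 1.0 (strict)
    output_scan_enabled: bool        # scan responses for leaked data
    max_context_tokens: int          # shrink usable context under attack
    require_human_review: bool       # route suspect sessions for review

# Assumed threat tiers; thresholds below would be tuned automatically.
PROFILES = {
    "low":    ControlProfile(0.2, False, 8192, False),
    "medium": ControlProfile(0.6, True, 4096, False),
    "high":   ControlProfile(0.9, True, 1024, True),
}

def select_profile(attack_scores: dict[str, float]) -> ControlProfile:
    """Pick a control profile from per-attack-type classifier scores.

    `attack_scores` maps attack classes (e.g., "prompt_injection",
    "jailbreak") to probabilities emitted by the attack classification
    model over the input prompt sequence and response.
    """
    risk = max(attack_scores.values(), default=0.0)
    if risk >= 0.8:
        return PROFILES["high"]
    if risk >= 0.4:
        return PROFILES["medium"]
    return PROFILES["low"]

# Example: the classifier flags a likely prompt injection, so the
# strictest profile is selected and controls are hardened accordingly.
profile = select_profile({"prompt_injection": 0.87, "jailbreak": 0.12})
assert profile.require_human_review
```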

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
