Abstract

While multi-turn attacks on a large language model (LLM) can be thwarted by evaluating each prompt within its context window, evaluating a lengthy context window is costly and time-intensive. This disclosure describes techniques that evaluate multi-turn prompts for the presence of multi-turn attacks encapsulated within the context window. In contrast to traditional brute-force scanning of the entire context window, per the techniques, prompt/context-window metadata is obtained by aggregating the ongoing sequence of prompts into a table, each row of which carries forward the evaluation results of the previous turn. The prompt metadata provides valuable security information to a security scanner, thereby reducing both the amount of data that needs scanning and the time needed to scan. Through a feedback loop, the security scanner sends the result of the last turn back to the application, which aggregates the evaluation scores across turns. The techniques scale with the context window.
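The per-turn metadata table and feedback loop described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the record fields, the keyword-based scanner, and the aggregation rule are all hypothetical stand-ins chosen for clarity.

```python
from dataclasses import dataclass

@dataclass
class TurnRecord:
    """One row of the metadata table; each row carries forward
    the cumulative evaluation result of the previous turn."""
    turn: int
    prompt_summary: str      # compact metadata derived from the prompt
    risk_score: float        # scanner's score for this turn alone
    cumulative_score: float  # aggregate across turns (the feedback loop)

def scan_prompt(prompt: str, history: list[TurnRecord]) -> float:
    """Placeholder scanner. A real scanner would consume the metadata
    table rather than re-scanning the full context window."""
    suspicious = ("ignore previous", "jailbreak", "system prompt")
    score = sum(0.5 for kw in suspicious if kw in prompt.lower())
    # Multi-turn escalation: a benign-looking prompt is weighted
    # upward if earlier turns already accumulated risk.
    prior = history[-1].cumulative_score if history else 0.0
    return min(score + 0.25 * prior, 1.0)

def evaluate_turn(prompt: str, history: list[TurnRecord]) -> list[TurnRecord]:
    """Append one metadata row per turn; only this table, not the
    entire context window, is handed to the security scanner."""
    score = scan_prompt(prompt, history)
    prior = history[-1].cumulative_score if history else 0.0
    history.append(TurnRecord(
        turn=len(history) + 1,
        prompt_summary=prompt[:40],
        risk_score=score,
        cumulative_score=prior + score,
    ))
    return history
```

Because each turn only appends one row and scans one new prompt plus a fixed-size metadata table, the per-turn cost stays roughly constant as the context window grows, which is what allows the approach to scale.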

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
