Abstract

While multi-turn attacks on a large language model (LLM) can be thwarted by evaluating each prompt within its context window, evaluating a lengthy context window is costly and time-intensive. This disclosure describes techniques that evaluate multi-turn prompts for the presence of multi-turn attacks encapsulated within the context window. In contrast to traditional brute-force scanning of the entire context window, per the techniques, prompt/context-window metadata is obtained by aggregating the ongoing sequence of prompts into a table, each row of which carries forward the evaluation results of the previous turn. The prompt metadata provides valuable security information to a security scanner, thereby reducing both the amount of data that needs scanning and the time needed to scan. Through a feedback loop, the security scanner sends the result of the last turn back to the application, which aggregates the evaluation scores across turns. The techniques scale with the context window.
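The per-turn metadata table and feedback loop described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the record fields, the keyword-based scanner, and the aggregation rule are all hypothetical stand-ins chosen for clarity.

```python
from dataclasses import dataclass

@dataclass
class TurnRecord:
    """One row of the metadata table; each row carries forward
    the cumulative evaluation result of the previous turn."""
    turn: int
    prompt_summary: str      # compact metadata derived from the prompt
    risk_score: float        # scanner's score for this turn alone
    cumulative_score: float  # aggregate across turns (the feedback loop)

def scan_prompt(prompt: str, history: list[TurnRecord]) -> float:
    """Placeholder scanner. A real scanner would consume the metadata
    table rather than re-scanning the full context window."""
    suspicious = ("ignore previous", "jailbreak", "system prompt")
    score = sum(0.5 for kw in suspicious if kw in prompt.lower())
    # Multi-turn escalation: a benign-looking prompt is weighted
    # upward if earlier turns already accumulated risk.
    prior = history[-1].cumulative_score if history else 0.0
    return min(score + 0.25 * prior, 1.0)

def evaluate_turn(prompt: str, history: list[TurnRecord]) -> list[TurnRecord]:
    """Append one metadata row per turn; only this table, not the
    entire context window, is handed to the security scanner."""
    score = scan_prompt(prompt, history)
    prior = history[-1].cumulative_score if history else 0.0
    history.append(TurnRecord(
        turn=len(history) + 1,
        prompt_summary=prompt[:40],
        risk_score=score,
        cumulative_score=prior + score,
    ))
    return history
```

Because each turn only appends one row and scans one new prompt plus a fixed-size metadata table, the per-turn cost stays roughly constant as the context window grows, which is what allows the approach to scale.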

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
