Large language models (LLMs) are susceptible to security risks: malicious attackers can manipulate an LLM by poisoning its training data or by crafting prompts or queries that cause it to return sensitive or confidential information, e.g., data that is part of the LLM's training dataset. This disclosure describes the use of a data loss prevention (DLP) system to protect LLMs against data exfiltration. The DLP system is configured to detect specific data types that are to be prevented from leaking. The LLM output, generated in response to a query from an application or user, is passed through the DLP system, which generates a risk score for the output. If the risk score is above a predefined threshold, the output is provided to an additional pre-trained model that has been trained to detect sensitive or confidential data. The output is then modified to block, mask, redact, or otherwise remove the sensitive data, and the modified output is provided to the application or user. In certain cases, the output may instead indicate that no response can be provided due to a policy violation.
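The gating flow described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the regex patterns stand in for the DLP system's configured data types, the redaction step stands in for the additional pre-trained sensitive-data model, and all names (`dlp_risk_score`, `filter_llm_output`, `RISK_THRESHOLD`) are hypothetical.

```python
import re

# Illustrative stand-ins for DLP-configured data types (assumptions,
# not part of the disclosure).
SENSITIVE_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

RISK_THRESHOLD = 0.5  # the disclosure's predefined threshold


def dlp_risk_score(text: str) -> float:
    """Crude stand-in for the DLP risk score: rises with the number
    of sensitive-data matches found in the LLM output."""
    hits = sum(len(p.findall(text)) for p in SENSITIVE_PATTERNS.values())
    return min(1.0, 0.4 * hits)


def redact(text: str) -> str:
    """Stand-in for the second-stage model: locate and remove
    sensitive spans from the output."""
    for label, pattern in SENSITIVE_PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text


def filter_llm_output(text: str) -> str:
    """Gate LLM output: pass low-risk responses through unchanged;
    send high-risk responses to the redaction stage, or refuse."""
    if dlp_risk_score(text) <= RISK_THRESHOLD:
        return text
    redacted = redact(text)
    if redacted != text:
        return redacted
    # High risk but nothing redactable: refuse per policy.
    return "No response can be provided due to a policy violation."
```

A low-risk response passes through unchanged, while a response containing, say, an email address and an SSN exceeds the threshold and is returned with those spans masked.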
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Namer, Assaf; Miller, Jim; Vagts, Hauke; and Maltzman, Brandon, "A Cost-Effective Method to Prevent Data Exfiltration from LLM Prompt Responses", Technical Disclosure Commons (November 13, 2023).