Abstract
Guardrails are a set of limitations, guidelines, and operational protocols designed to govern the behavior and outputs of Large Language Models (LLMs). Current guardrail creation methods often face limitations such as lack of transparency, overly restrictive rules, and difficulty keeping pace with the evolving threat landscape. To overcome these limitations, techniques are proposed herein that provide automation for the generation of guardrails, or safeguarding rules, for LLMs using Reinforcement Learning (RL).
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Shrivastava, Ritvik, "AUTOMATING UNSUPERVISED SECURITY GUARDRAIL CREATION FOR LARGE LANGUAGE MODELS", Technical Disclosure Commons, (July 26, 2024)
https://www.tdcommons.org/dpubs_series/7239