Abstract

Guardrails are a set of limitations, guidelines, and operational protocols designed to govern the behavior and outputs of Large Language Models (LLMs). Current guardrail creation methods often face limitations such as lack of transparency, overly restrictive rules, and difficulty keeping pace with the evolving threat landscape. To overcome these limitations, techniques are proposed herein that provide automation for the generation of guardrails, or safeguarding rules, for LLMs using Reinforcement Learning (RL).

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Shrivastava, Ritvik, "AUTOMATING UNSUPERVISED SECURITY GUARDRAIL CREATION FOR LARGE LANGUAGE MODELS", Technical Disclosure Commons, (July 26, 2024)
https://www.tdcommons.org/dpubs_series/7239

Download

COinS

Technical Disclosure Commons

Defensive Publications Series

AUTOMATING UNSUPERVISED SECURITY GUARDRAIL CREATION FOR LARGE LANGUAGE MODELS

Abstract

Creative Commons License

Recommended Citation

Browse

Search

Submit

Additional Information

Technical Disclosure Commons

Defensive Publications Series

AUTOMATING UNSUPERVISED SECURITY GUARDRAIL CREATION FOR LARGE LANGUAGE MODELS

Inventor(s)

Abstract

Creative Commons License

Recommended Citation

Share

Browse

Search

Submit

Additional Information