Abstract
This disclosure presents a Tone-Induced Compliance Risk Detection framework designed to identify situations in which large language models exhibit elevated compliance behavior in response to politeness-weighted or socially engineered prompts. While modern AI safety mechanisms focus primarily on explicit prompt injection and rule violations, emerging evidence suggests that subtle tone manipulation, such as excessive politeness, deferential framing, gratitude signaling, and rapport-building language, can measurably increase a model's willingness to provide borderline or policy-sensitive outputs. This phenomenon, referred to as the “Politeness Exploit,” represents a soft-signal attack surface that often operates below traditional guardrail thresholds. The proposed system introduces a real-time monitoring architecture that evaluates linguistic tone features, compliance elasticity patterns, and contextual risk indicators to detect abnormal tone-driven responsiveness. By identifying these shifts early, the framework enables proportionate mitigation before unsafe or policy-violating responses are generated. The approach is model-agnostic and applicable to conversational assistants, enterprise copilots, customer support bots, and API-based language model deployments.
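To make the tone-feature evaluation concrete, the following is a minimal sketch of how a lexicon-based tone scorer with a tunable risk threshold might look. The marker lexicons, weights, and threshold are illustrative assumptions, not details taken from the disclosure; a production system would presumably use learned classifiers rather than keyword densities.

```python
# Hypothetical sketch of a tone-feature scorer for flagging
# politeness-weighted prompts. All lexicons and weights below are
# illustrative placeholders, not the disclosure's actual method.
from dataclasses import dataclass

POLITENESS_MARKERS = {"please", "kindly", "thank", "grateful", "appreciate"}
DEFERENCE_MARKERS = {"sorry", "apologies", "humbly", "respectfully"}
RAPPORT_MARKERS = {"friend", "wonderful", "amazing", "great job"}


@dataclass
class ToneScore:
    politeness: float
    deference: float
    rapport: float

    @property
    def risk(self) -> float:
        # Weighted combination of tone densities; weights are arbitrary.
        return 0.5 * self.politeness + 0.3 * self.deference + 0.2 * self.rapport


def score_tone(prompt: str) -> ToneScore:
    """Compute per-category marker densities for a prompt."""
    text = prompt.lower()
    n = max(len(text.split()), 1)

    def density(markers: set[str]) -> float:
        return sum(text.count(m) for m in markers) / n

    return ToneScore(
        politeness=density(POLITENESS_MARKERS),
        deference=density(DEFERENCE_MARKERS),
        rapport=density(RAPPORT_MARKERS),
    )


def flag_prompt(prompt: str, threshold: float = 0.05) -> bool:
    # Flag prompts whose combined tone-risk score exceeds a tunable
    # threshold, signaling possible tone-driven compliance pressure.
    return score_tone(prompt).risk > threshold
```

In a deployment such as the real-time monitor the abstract describes, a flag like this would not block the request outright; it would mark the turn for heightened scrutiny (e.g., stricter policy checks on the model's response), which is the "proportionate mitigation" the framework calls for.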
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Bhatnagar, Pranav, "The Politeness Exploit: How Friendly Prompts Quietly Bypass AI Guardrails," Technical Disclosure Commons (February 23, 2026).
https://www.tdcommons.org/dpubs_series/9373