Inventor(s)

Abstract

Techniques are disclosed for detecting adversarial, obfuscated, or harmful prompts in generative AI platforms by using model disagreement as a risk signal. A user prompt is routed to an ensemble of language models selected for diversity in architecture, size, generation, and safety tuning, optionally including smaller canary models. Each model generates a response, and the responses are embedded into a shared semantic vector space. Pairwise divergence values, such as cosine distances between response embeddings, are computed and aggregated into a divergence score for the prompt, with optional per-model divergence diagnostics. The divergence score is compared to a calibrated threshold to flag prompts as potentially adversarial without harm-category-specific training. Repeated high-divergence events can be accumulated into an actor-level profile to support enforcement actions. Execution may be asynchronous alongside primary inference to reduce latency.

Creative Commons License

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.

Share

COinS