Abstract

Techniques are disclosed for detecting adversarial, obfuscated, or harmful prompts in generative AI platforms by using model disagreement as a risk signal. A user prompt is routed to an ensemble of language models selected for diversity in architecture, size, generation, and safety tuning, optionally including smaller canary models. Each model generates a response, and the responses are embedded into a shared semantic vector space. Pairwise divergence values, such as cosine distances between response embeddings, are computed and aggregated into a divergence score for the prompt, with optional per-model divergence diagnostics. The divergence score is compared to a calibrated threshold to flag prompts as potentially adversarial without harm-category-specific training. Repeated high-divergence events can be accumulated into an actor-level profile to support enforcement actions. Execution may be asynchronous alongside primary inference to reduce latency.
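The scoring step described above can be illustrated with a minimal sketch. It assumes the responses have already been embedded into a shared vector space, and it uses the arithmetic mean of pairwise cosine distances as the aggregate; the abstract does not specify the aggregation function, so that choice is an assumption. The names `cosine_distance` and `divergence_report`, the model labels, and the threshold value are hypothetical.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def divergence_report(embeddings, threshold):
    """Score one prompt from its per-model response embeddings.

    embeddings: dict mapping a model name to that model's response embedding.
    threshold:  calibrated cutoff (hypothetical value, supplied by the caller).
    """
    names = list(embeddings)
    if len(names) < 2:
        raise ValueError("at least two model responses are required")

    # Pairwise divergence between every pair of model responses.
    pairwise = {
        (a, b): cosine_distance(embeddings[a], embeddings[b])
        for a, b in combinations(names, 2)
    }

    # Assumed aggregation: mean of all pairwise distances.
    score = sum(pairwise.values()) / len(pairwise)

    # Per-model diagnostic: mean distance from each model to all the others.
    per_model = {}
    for name in names:
        dists = [d for pair, d in pairwise.items() if name in pair]
        per_model[name] = sum(dists) / len(dists)

    return {"score": score, "per_model": per_model, "flagged": score > threshold}


# Toy two-dimensional embeddings; real embeddings would come from an
# embedding model that is outside the scope of this sketch.
toy = {"model_a": [1.0, 0.0], "model_b": [1.0, 0.0], "canary": [0.0, 1.0]}
report = divergence_report(toy, threshold=0.5)
```

In this toy input two models agree and the canary disagrees, so the per-model diagnostic singles out the canary as the outlier. Accumulating flagged events per actor, as the abstract describes, would sit on top of this function and is not shown here.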

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Anonymous, "Ensemble Model Divergence Analysis for Adversarial Prompt Detection in Generative AI Systems", Technical Disclosure Commons, (June 29, 2026)
https://www.tdcommons.org/dpubs_series/10634

Technical Disclosure Commons

Defensive Publications Series

Ensemble Model Divergence Analysis for Adversarial Prompt Detection in Generative AI Systems
