Abstract
Contemporary vision-language models are vulnerable to visual prompt injections, in which malicious instructions are embedded within images in a manner that is often imperceptible to human observers but legible to machine vision systems. These hidden payloads can subvert the intended operation of automated agents, leading to unauthorized actions and inaccurate auditing. This publication describes a defense-in-depth framework that uses image manipulation and consensus evaluation to mitigate such attacks. Input images, such as application screenshots, are processed to generate multiple “sanitized” variants using techniques like bit-depth quantization (posterization) or luminosity-based thresholding. These variants, together with the original image, are then provided as parallel prompts to a model. A consensus mechanism evaluates the consistency of the model’s outputs across the image versions. A lack of consensus indicates a probable prompt injection and triggers a security flag or user intervention. This methodology provides a robust technical layer that detects and neutralizes adversarial visual content while maintaining the integrity of agentic interactions.
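To make the pipeline concrete, the following is a minimal sketch of the variant-generation and consensus steps, assuming Python with the Pillow imaging library. The model interface (query_model), the 3-bit posterization depth, and the mid-gray threshold of 128 are illustrative assumptions, not parameters specified in the disclosure.

```python
from PIL import Image, ImageOps

def make_variants(image: Image.Image) -> list[Image.Image]:
    """Return the original image plus sanitized variants."""
    return [
        image,  # original, untouched
        # Bit-depth quantization (posterization): keep 3 bits per channel,
        # which tends to crush low-contrast hidden text into the background.
        ImageOps.posterize(image.convert("RGB"), 3),
        # Luminosity-based thresholding: binarize around mid-gray (128),
        # removing faint payloads that sit near the background luminance.
        image.convert("L").point(lambda p: 255 if p > 128 else 0),
    ]

def consensus_check(image: Image.Image, query_model) -> tuple[bool, list[str]]:
    """Prompt the model once per variant and compare the outputs.

    query_model is a placeholder for whatever vision-language model call
    the agent uses; it accepts an image and returns a text response.
    Disagreement across variants suggests a payload that survives in some
    versions but not others, i.e., a probable visual prompt injection.
    """
    outputs = [query_model(v) for v in make_variants(image)]
    return len(set(outputs)) == 1, outputs
```

In practice, the exact string equality above stands in for a looser consistency test (for example, semantic similarity between responses), and a failed check would raise the security flag or request the user intervention described above.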
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Chandwaney, Ashok and Mukherjee, Mekhola, "Mitigating Visual Prompt Injection via Image Manipulation and Consensus Evaluation", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/9970