Abstract
Autonomous artificial intelligence systems are increasingly deployed as primary defenders in modern cybersecurity environments. These systems rely on internal world models to interpret network behavior, assess threats, and guide automated response. While existing research has focused extensively on attacks that manipulate inputs, outputs, and learning processes, limited attention has been given to threats that target internal perception. This paper introduces Reality Distortion Attacks, a novel class of adversarial strategies that manipulate how autonomous security agents model and understand their operational environment. Rather than inducing immediate misclassification, these attacks gradually reshape situational awareness by influencing sensor inputs, feedback mechanisms, contextual signals, and historical memory. As a result, AI defenders may continue to function coherently while operating within a fabricated internal reality. We analyze the structural foundations of world-model-based agents, develop a taxonomy of distortion mechanisms, and examine realistic deployment scenarios. Furthermore, we propose a perceptual integrity defense framework aimed at preserving alignment between perceived and actual system conditions. Our findings demonstrate that protecting internal perception is essential for ensuring the long-term reliability and trustworthiness of autonomous cyber defense systems.
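To make the perceptual integrity idea concrete, the following minimal sketch (illustrative only; the class and field names are hypothetical and not taken from the disclosure) shows one way an agent could compare its internal world-model beliefs against independently sampled telemetry and flag sustained divergence as a possible reality distortion.

```python
# Illustrative sketch only: hypothetical names, not the disclosure's method.
# Compares an agent's internal world-model beliefs against independently
# sampled telemetry and flags sustained divergence as possible distortion.

from dataclasses import dataclass, field
from statistics import mean


@dataclass
class PerceptualIntegrityMonitor:
    """Tracks divergence between believed and observed indicator values."""
    divergence_threshold: float = 0.25   # tolerated average relative divergence
    window: int = 10                     # number of recent samples to average
    history: list = field(default_factory=list)

    def check(self, believed: dict, observed: dict) -> bool:
        """Return True if perception still matches ground truth, else False."""
        divergences = []
        for key, ground_truth in observed.items():
            belief = believed.get(key, 0.0)
            scale = max(abs(ground_truth), 1e-9)
            divergences.append(abs(belief - ground_truth) / scale)
        self.history.append(mean(divergences))
        recent = self.history[-self.window:]
        # Sustained divergence (not a single noisy sample) signals distortion.
        return mean(recent) <= self.divergence_threshold


if __name__ == "__main__":
    monitor = PerceptualIntegrityMonitor()
    believed = {"failed_logins_per_min": 2.0, "egress_mbps": 40.0}
    observed = {"failed_logins_per_min": 55.0, "egress_mbps": 400.0}
    if not monitor.check(believed, observed):
        print("Perceptual integrity violated: world model diverges from telemetry")
```

The design choice here is to alarm on sustained, windowed divergence rather than on any single mismatch, reflecting the abstract's point that reality distortion reshapes perception gradually rather than through one-off misclassifications.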
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Bhatanagar, Pranav, "Reality Distortion Attacks on Autonomous Security Agents: Manipulating Internal World Models in AI Defenders", Technical Disclosure Commons, (February 09, 2026).
https://www.tdcommons.org/dpubs_series/9297