Abstract

Artificial intelligence systems are increasingly deployed as autonomous defenders in modern cybersecurity environments. These systems continuously analyze network behavior, evaluate threat intelligence, assess risk levels, and recommend or execute defensive actions. Central to their effectiveness is the formation of internal beliefs regarding attacker intent, infrastructure trustworthiness, vulnerability severity, and operational priority. While existing research has focused on attacks that manipulate detection accuracy, evade classifiers, or bypass response mechanisms, far less attention has been given to vulnerabilities in the belief formation process itself. This paper introduces Belief Hijacking Attacks, a novel class of adversarial strategies that target how AI defenders construct, update, and reinforce internal beliefs about security-relevant phenomena. Rather than inducing immediate technical failure, these attacks operate through sustained interaction, biased feedback, and selective information exposure to gradually reshape defensive reasoning. By exploiting memory components, learning pipelines, and trust calibration mechanisms, adversaries can redirect system judgment toward systematically flawed conclusions. We develop a formal threat model for belief hijacking, propose a comprehensive taxonomy of manipulation techniques, and analyze realistic deployment scenarios across enterprise, cloud, and critical infrastructure environments. Our analysis demonstrates that compromised belief systems can persist even after adversarial activity ceases, rendering technically robust defenses ineffective over time. Finally, we outline an initial epistemic security framework aimed at preserving belief integrity in autonomous defense platforms. This work establishes belief protection as a foundational requirement for trustworthy AI-driven cybersecurity.

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
