Abstract
Explainable Artificial Intelligence (XAI) systems are widely promoted as mechanisms for increasing transparency, trust, and accountability in automated decision-making. By providing human-readable explanations for model outputs, these systems are intended to support oversight, regulatory compliance, and informed human judgment. However, the growing reliance on automated explanations has created a previously overlooked security risk: the explanations themselves can be manipulated, exploited, and weaponized. This paper introduces Interpretation Attacks, a class of adversarial strategies that target how AI systems generate, present, and justify their decisions. Rather than manipulating model predictions directly, these attacks exploit the interpretability layer to influence human perception, distort accountability processes, and legitimize unsafe or biased outcomes. Through selective input framing, context shaping, and explanation steering, attackers can induce systems to produce convincing yet misleading justifications that obscure underlying risks. We analyze how contemporary explainability techniques, including feature attribution methods, attention-based explanations, and post-hoc rationalization models, are vulnerable to systematic manipulation. Using realistic deployment scenarios from cybersecurity operations, enterprise automation, and decision-support systems, we demonstrate how interpretation attacks can erode trust, amplify automation bias, and undermine governance mechanisms without triggering conventional security alerts. The paper further examines why existing defensive frameworks fail to address these risks, highlighting limitations in static validation, output-centric auditing, and human-in-the-loop oversight. Finally, we outline research directions for building resilient explanation systems that remain robust under adversarial pressure, emphasizing longitudinal consistency, provenance tracking, and semantic integrity monitoring. By reframing explainability as a potential attack surface rather than a purely protective feature, this work contributes to a more comprehensive understanding of security in transparent AI systems.
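
As a minimal illustration of the kind of explanation steering described above, consider the toy sketch below: a hand-built linear risk scorer with an occlusion-style attribution, in which an attacker stages benign-looking context features so that the top-k explanation shown to an analyst no longer surfaces the genuinely risky feature, even though the flag decision itself is unchanged. All feature names, weights, and thresholds are hypothetical and chosen purely for illustration; they are not drawn from the disclosure's experiments or methods.

# Minimal, self-contained sketch (hypothetical names and weights): a toy linear
# "risk scorer" with an occlusion-style attribution, illustrating how selective
# input framing can steer which features a top-k explanation surfaces without
# changing the decision.

WEIGHTS = {
    "encoded_payload": 2.0,            # the genuinely risky signal
    "off_hours_login": 0.8,
    "matches_backup_allowlist": -2.5,  # reassuring context the attacker can stage
    "large_transfer_volume": 2.4,      # noisy-but-expected activity the attacker adds
    "signed_binary": -0.5,
}
FLAG_THRESHOLD = 1.0


def risk_score(features):
    """Linear score; above FLAG_THRESHOLD the toy system flags the event."""
    return sum(WEIGHTS.get(name, 0.0) * value for name, value in features.items())


def occlusion_attribution(features):
    """Post-hoc explanation: drop each feature and record the score change."""
    base = risk_score(features)
    return {
        name: base - risk_score({k: v for k, v in features.items() if k != name})
        for name in features
    }


def top_k_explanation(features, k=2):
    """The short justification an analyst would actually read."""
    attrib = occlusion_attribution(features)
    return sorted(attrib.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]


# Honest event: the top-2 explanation correctly blames the encoded payload.
honest = {"encoded_payload": 1.0, "off_hours_login": 1.0}

# Framed event: the attacker keeps the payload but adds benign-looking context
# with large attribution magnitudes, so the payload drops out of the top-2
# explanation while the flag decision (score above threshold) is unchanged.
framed = {
    "encoded_payload": 1.0,
    "off_hours_login": 1.0,
    "matches_backup_allowlist": 1.0,
    "large_transfer_volume": 1.0,
}

for label, event in (("honest", honest), ("framed", framed)):
    score = risk_score(event)
    print(f"{label}: score={score:.2f} flagged={score > FLAG_THRESHOLD} "
          f"top-2 explanation={top_k_explanation(event)}")

The point of the sketch is narrow: because human reviewers typically read only the top few attributions, shifting attribution mass across staged features is enough to distort the justification presented to them without altering the underlying prediction.
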
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Bhatanagar, Pranav, "Interpretation Attacks: Exploiting How AI Explains and Justifies Decisions (Turning Explainability Itself into an Attack Surface)", Technical Disclosure Commons (February 09, 2026)
https://www.tdcommons.org/dpubs_series/9295