Abstract
Techniques are described for an agent-mediated content injection firewall in recommendation systems. Candidate recommendation content such as descriptions, reviews, and metadata is evaluated using a dual pipeline that yields a human-oriented quality score and an injection risk score. The injection risk score may be computed as a calibrated ensemble of instruction-pattern matching, obfuscation detection, a transformer-based instruction classifier, and an adversarial judge based on behavioral divergence of a language model when conditioned on the content versus a neutralized version. The injection risk score is integrated into ranking using an agent-adjusted penalty scaled by an agent vulnerability profile to produce soft demotion of risky items. Content delivered to agent-facing APIs may be wrapped with trust metadata and instruction-hierarchy controls, optionally with integrity hashing. Provenance and seller reputation are tracked to adjust baseline risk, and a red-team loop generates and tests adversarial variants to retrain detectors over time.
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Anonymous, "Content Injection Firewall for Detecting and Neutralizing Adversarial Instructions in Agent-Facing Recommendation Content", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10739