Inventor(s)

Abstract

Techniques are described for an agent-mediated content injection firewall in recommendation systems. Candidate recommendation content such as descriptions, reviews, and metadata is evaluated using a dual pipeline that yields a human-oriented quality score and an injection risk score. The injection risk score may be computed as a calibrated ensemble of instruction-pattern matching, obfuscation detection, a transformer-based instruction classifier, and an adversarial judge based on behavioral divergence of a language model when conditioned on the content versus a neutralized version. The injection risk score is integrated into ranking using an agent-adjusted penalty scaled by an agent vulnerability profile to produce soft demotion of risky items. Content delivered to agent-facing APIs may be wrapped with trust metadata and instruction-hierarchy controls, optionally with integrity hashing. Provenance and seller reputation are tracked to adjust baseline risk, and a red-team loop generates and tests adversarial variants to retrain detectors over time.

Creative Commons License

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.

Share

COinS