Abstract
Disclosed techniques perform post-training repair of reinforcement-learning pricing agents by targeted editing of action-value information to eliminate high-price attractor cycles. A frozen policy graph is constructed from a greedy policy derived from Q-values, and the graph is decomposed to identify attractor cycles. Each cycle is scored using an elevation metric relative to benchmark prices and compared to a threshold to classify high-price attractors. For states on a high-price cycle, the greedy collusive action is replaced with a one-shot best-response action by demoting Q(s,a_collusive) and promoting Q(s,a_BR) by an editing margin that flips the argmax, while leaving all other Q-entries unchanged. The policy graph is globally re-enumerated and re-verified after edits, and iterations continue until no high-price attractor remains. Outputs may include edited Q-values and a repair report documenting edits and verification results.
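The repair loop described in the abstract can be illustrated with a minimal Python sketch. It is not the disclosed implementation, and several details are assumptions made for illustration: a deterministic transition table `next_state[s, a]`, an elevation metric normalized between a low benchmark price `p_low` and a high benchmark price `p_high`, a precomputed `best_response` action per state, and an edit rule that sets Q(s,a_BR) to the old maximum plus the margin and Q(s,a_collusive) to the old maximum minus the margin. All function and parameter names are hypothetical.

```python
import numpy as np

def find_cycles(succ):
    """Return every cycle of the functional graph s -> succ[s]."""
    color = [0] * len(succ)  # 0 = unvisited, 1 = on current walk, 2 = done
    cycles = []
    for start in range(len(succ)):
        path, s = [], start
        while color[s] == 0:
            color[s] = 1
            path.append(s)
            s = succ[s]
        if color[s] == 1:  # the walk closed on itself: a new cycle
            cycles.append(path[path.index(s):])
        for v in path:
            color[v] = 2
    return cycles

def repair(Q, next_state, price, best_response, p_low, p_high,
           threshold, margin=1e-3):
    """Edit Q until no attractor cycle exceeds the elevation threshold.

    Returns (edited Q, list of (state, old action, new action), verified).
    """
    Q = np.array(Q, dtype=float)  # work on a copy
    edits = []
    while True:
        # Freeze the greedy policy and rebuild the whole policy graph.
        greedy = Q.argmax(axis=1)
        succ = [int(next_state[s, greedy[s]]) for s in range(len(Q))]
        high_found = changed = False
        for cyc in find_cycles(succ):
            mean_price = np.mean([price[greedy[s]] for s in cyc])
            # Assumed elevation metric: 0 at p_low, 1 at p_high.
            if (mean_price - p_low) / (p_high - p_low) <= threshold:
                continue
            high_found = True
            for s in cyc:
                a_col, a_br = int(greedy[s]), int(best_response[s])
                if a_col == a_br:
                    continue  # already playing the best response
                top = Q[s, a_col]
                Q[s, a_br] = top + margin   # promote the best response
                Q[s, a_col] = top - margin  # demote the collusive action
                edits.append((s, a_col, a_br))
                changed = True
        if not high_found:
            return Q, edits, True   # verified: no high-price attractor
        if not changed:
            return Q, edits, False  # a high cycle remains but cannot be edited

# Toy setup (illustrative values): 3 prices, state = last action played.
price = [1.0, 1.5, 2.0]
Q0 = np.tile([0.0, 0.5, 1.0], (3, 1))    # every state prefers the top price
next_state = np.tile([0, 1, 2], (3, 1))  # next_state[s, a] = a
Q_new, edits, verified = repair(Q0, next_state, price, [0, 0, 0],
                                p_low=1.0, p_high=2.0, threshold=0.25)
```

In this toy run only states that lie on a high-price cycle are edited. State 1 is transient in every pass, so its Q-row is left unchanged, and the returned `edits` list can serve as the raw material for a repair report.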
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Anonymous, "Attractor Surgery for Post-Training Repair of Collusive Pricing Policies via Targeted Q-Entry Editing", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10615