Inventor(s)

Abstract

Disclosed techniques perform post-training repair of reinforcement-learning pricing agents by targeted editing of action-value information to eliminate high-price attractor cycles. A frozen policy graph is constructed from a greedy policy derived from Q-values, and the graph is decomposed to identify attractor cycles. Each cycle is scored using an elevation metric relative to benchmark prices and compared to a threshold to classify high-price attractors. For states on a high-price cycle, the greedy collusive action is replaced with a one-shot best-response action by demoting Q(s,a_collusive) and promoting Q(s,a_BR) by an editing margin that flips the argmax, while leaving all other Q-entries unchanged. The policy graph is globally re-enumerated and re-verified after edits, and iterations continue until no high-price attractor remains. Outputs may include edited Q-values and a repair report documenting edits and verification results.
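The repair loop described in the abstract can be illustrated with a minimal sketch. It is not the disclosed implementation. It assumes a tabular Q array of shape (states, actions), a deterministic transition table `next_state`, a precomputed `best_response` table, and an elevation metric defined as the mean cycle price normalized between a benchmark price and a monopoly price. All function names, parameters, the form of the metric, and the toy numbers are hypothetical.

```python
import numpy as np

def find_cycles(succ):
    """Return the attractor cycles of a functional graph (succ[s] = next state)."""
    color = [0] * len(succ)  # 0 = unvisited, 1 = on current walk, 2 = finished
    cycles = []
    for start in range(len(succ)):
        path, s = [], start
        while color[s] == 0:
            color[s] = 1
            path.append(s)
            s = succ[s]
        if color[s] == 1:  # the walk closed on itself, so a new cycle was found
            cycles.append([int(v) for v in path[path.index(s):]])
        for v in path:
            color[v] = 2
    return cycles

def repair(Q, next_state, action_price, best_response, p_bench, p_mono,
           tau=0.5, margin=1e-3, max_iters=100):
    """Edit Q until no cycle of the frozen greedy policy graph scores above tau."""
    Q = Q.copy()  # leave the caller's Q-values untouched
    report = []
    for _ in range(max_iters):
        greedy = Q.argmax(axis=1)                       # frozen greedy policy
        succ = next_state[np.arange(len(Q)), greedy]    # policy graph edges
        found_high, edited = False, False
        for cycle in find_cycles(succ):
            # Assumed elevation metric: 0 at the benchmark price, 1 at p_mono.
            mean_price = action_price[greedy[cycle]].mean()
            score = (mean_price - p_bench) / (p_mono - p_bench)
            if score <= tau:
                continue
            found_high = True
            for s in cycle:
                a_col, a_br = int(greedy[s]), int(best_response[s])
                if a_col == a_br:
                    continue  # greedy action is already the best response
                top = Q[s, a_col]           # current row maximum
                Q[s, a_col] = top - margin  # demote the collusive action
                Q[s, a_br] = top + margin   # promote the best response above all others
                report.append((s, a_col, a_br, float(score)))
                edited = True
        if not found_high:
            return Q, report, True   # verified: no high-price attractor remains
        if not edited:
            break                    # a high-price cycle remains that edits cannot break
    return Q, report, False

# Toy example: the state is the previously chosen action, with three price levels.
action_price = np.array([1.0, 1.5, 2.0])
next_state = np.tile(np.arange(3), (3, 1))   # next_state[s, a] = a
Q0 = np.array([[1.0, 0.5, 0.2],
               [0.3, 0.4, 0.9],
               [0.1, 0.6, 1.0]])
best_response = np.array([0, 0, 1])          # hypothetical one-shot best responses
Q1, report, ok = repair(Q0, next_state, action_price, best_response,
                        p_bench=1.0, p_mono=2.0)
```

In this toy run the first pass breaks the self-loop at the highest price, the re-enumeration then exposes a new two-state cycle that is still elevated, and a second edit removes it. This is why the graph is rebuilt and re-verified after every round of edits rather than only the originally detected cycles being checked.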

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
