Abstract
Techniques are described for computing online advertisement pricing adjustments using user feedback signals while separating user/context propensity from ad-candidate effects. A first machine learning model generates a state-dependent prediction V(s) of a feedback event probability from state features that exclude ad-candidate features. A second machine learning model generates a candidate-dependent prediction Q(s,a) from the state features and action features describing the ad candidate. An advantage-style combination, such as A(s,a)=Q(s,a)-V(s), produces an ad-attributable signal that controls for inherent user/context feedback propensity. The advantage output is converted to a pricing adjustment, for example UDV×Scalar×(Q-V), and provided to an auction-stage pricing layer. A multi-task neural network with feature routing may implement separate heads for V(s) and Q(s,a), with the combination performed in a pricing layer. The approach may be applied to multiple feedback types and ad formats.
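The advantage-style combination and the example pricing formula from the abstract can be illustrated with a minimal sketch. It is not the disclosed implementation: the two logistic scorers stand in for the first and second machine learning models, and the function names, weights, and feature vectors are hypothetical. UDV and Scalar are treated as opaque numeric inputs because the abstract does not define them further.

```python
import math
from typing import Sequence


def _sigmoid(x: float) -> float:
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def _dot(weights: Sequence[float], features: Sequence[float]) -> float:
    """Plain dot product over two equal-length sequences."""
    return sum(w * f for w, f in zip(weights, features))


def predict_v(state: Sequence[float], w_state: Sequence[float], bias: float = 0.0) -> float:
    """Hypothetical stand-in for V(s): uses state features only, no ad-candidate features."""
    return _sigmoid(_dot(w_state, state) + bias)


def predict_q(
    state: Sequence[float],
    action: Sequence[float],
    w_state: Sequence[float],
    w_action: Sequence[float],
    bias: float = 0.0,
) -> float:
    """Hypothetical stand-in for Q(s,a): uses state features plus ad-candidate (action) features."""
    return _sigmoid(_dot(w_state, state) + _dot(w_action, action) + bias)


def advantage(q: float, v: float) -> float:
    """A(s,a) = Q(s,a) - V(s): the ad-attributable part of the predicted feedback probability."""
    return q - v


def pricing_adjustment(udv: float, scalar: float, q: float, v: float) -> float:
    """Example conversion from the abstract: UDV x Scalar x (Q - V)."""
    return udv * scalar * advantage(q, v)


if __name__ == "__main__":
    # Illustrative values only; none of these numbers come from the disclosure.
    state = [1.0, 0.5]
    action = [1.0]
    v = predict_v(state, w_state=[0.2, -0.1])
    q = predict_q(state, action, w_state=[0.2, -0.1], w_action=[0.4])
    print(pricing_adjustment(udv=2.0, scalar=0.5, q=q, v=v))
```

Under these assumptions, the adjustment is positive when the candidate is predicted to raise the feedback probability above the user/context baseline, negative when it lowers it, and zero when Q(s,a) equals V(s), so a user with a high inherent feedback propensity does not shift the price by that propensity alone.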
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Anonymous, "Dual-Model Advantage Function Architecture for Debiased User Feedback Signals in Digital Advertisement Pricing", Technical Disclosure Commons, (June 30, 2026)
https://www.tdcommons.org/dpubs_series/10653