Inventor(s)

Abstract

A shadow scoring framework evaluates LLM-based recommendation ranking without serving LLM-ranked results to users. For enrolled requests, a production ranker serves a primary ranking while an LLM ranker computes a shadow ranking asynchronously off the serving critical path. The system logs paired observations including the primary ranking, the shadow ranking, and observed engagement with the served slate. Multi-level enrollment controls cost using persistent user hashing at a configurable rate, per-user request subsampling, and a daily token budget cap that pauses shadow scoring when exceeded. Counterfactual metrics are computed from logs, including NDCG comparison using observed engagement as relevance, promoted item engagement rate for items ranked in top-K by the shadow ranker but outside top-K by production, and top-K ranking agreement. Experiment management integrates token cost tracking, cost estimation, and cost-per-metric-improvement reporting with automatic budget-based pausing.
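
The enrollment gate and the three counterfactual metrics named in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the disclosure: the function names, the `salt` parameter, the use of SHA-256 for persistent hashing, binary engagement as the relevance signal, and the log2 discount in NDCG.

```python
import hashlib
import math


def is_enrolled(user_id: str, rate: float, salt: str = "shadow-v1") -> bool:
    """Persistent user hashing: a given user always gets the same decision.

    The salt and hash function are hypothetical choices for this sketch.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


def dcg(ranking, relevance, k):
    """Discounted cumulative gain over the top-k items of a ranking."""
    return sum(
        relevance.get(item, 0.0) / math.log2(pos + 2)
        for pos, item in enumerate(ranking[:k])
    )


def ndcg(ranking, relevance, k):
    """NDCG@k, using observed engagement as the relevance label."""
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(pos + 2) for pos, rel in enumerate(ideal))
    return dcg(ranking, relevance, k) / idcg if idcg > 0 else 0.0


def topk_agreement(primary, shadow, k):
    """Fraction of the top-k positions filled by items both rankers chose."""
    return len(set(primary[:k]) & set(shadow[:k])) / k


def promoted_engagement_rate(primary, shadow, engaged, k):
    """Engagement rate of items in the shadow top-k but outside the primary top-k.

    Returns None when the shadow ranker promoted nothing.
    """
    primary_top = set(primary[:k])
    promoted = [item for item in shadow[:k] if item not in primary_top]
    if not promoted:
        return None
    return sum(1 for item in promoted if item in engaged) / len(promoted)


# One paired observation: the served (primary) ranking, the shadow ranking,
# and the items the user engaged with on the served slate.
primary = ["a", "b", "c", "d"]
shadow = ["c", "a", "d", "b"]
engaged = {"c", "d"}
relevance = {item: 1.0 for item in engaged}

if is_enrolled("user-123", rate=1.0):
    print("NDCG@2 primary:", ndcg(primary, relevance, 2))
    print("NDCG@2 shadow: ", ndcg(shadow, relevance, 2))
    print("top-2 agreement:", topk_agreement(primary, shadow, 2))
    print("promoted engagement:", promoted_engagement_rate(primary, shadow, engaged, 2))
```

Engagement is only observed for items on the served slate, so a promoted item can be credited only if production also served it somewhere below its top-K. Per-user request subsampling and the daily token budget cap are omitted here; they would sit as further checks after `is_enrolled`.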

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
