Abstract

A shadow scoring framework evaluates LLM-based recommendation ranking without serving LLM-ranked results to users. For enrolled requests, a production ranker serves a primary ranking while an LLM ranker computes a shadow ranking asynchronously off the serving critical path. The system logs paired observations including the primary ranking, the shadow ranking, and observed engagement with the served slate. Multi-level enrollment controls cost using persistent user hashing at a configurable rate, per-user request subsampling, and a daily token budget cap that pauses shadow scoring when exceeded. Counterfactual metrics are computed from logs, including NDCG comparison using observed engagement as relevance, promoted item engagement rate for items ranked in top-K by the shadow ranker but outside top-K by production, and top-K ranking agreement. Experiment management integrates token cost tracking, cost estimation, and cost-per-metric-improvement reporting with automatic budget-based pausing.
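The enrollment gate and the three log-derived metrics named in the abstract can be illustrated with a minimal sketch. This is not the disclosed implementation. All function names, parameter names, default rates, the budget value, and the salt are hypothetical. The sketch assumes that both rankers order the same served slate, that the slate is longer than K (so items promoted by the shadow ranker were still shown and have observed engagement), and that engagement is logged as a per-item non-negative gain.

```python
import hashlib
import math


def hash_fraction(key: str) -> float:
    """Map a string to a stable value in [0, 1) using the first 48 bits of SHA-256."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:6], "big") / 2**48


def should_shadow_score(user_id, request_id, tokens_used_today,
                        user_rate=0.05, request_rate=0.2,
                        daily_token_budget=1_000_000, salt="shadow-exp-1"):
    """Hypothetical three-level gate: budget cap, user hash, request subsample."""
    if tokens_used_today >= daily_token_budget:
        return False  # daily budget reached: shadow scoring is paused
    if hash_fraction(f"{salt}:user:{user_id}") >= user_rate:
        return False  # user not enrolled; the decision is stable across requests
    # Enrolled user: score only a deterministic subsample of their requests.
    return hash_fraction(f"{salt}:req:{user_id}:{request_id}") < request_rate


def ndcg_at_k(ranking, engagement, k):
    """NDCG@k using observed engagement (item -> gain) as the relevance label."""
    dcg = sum(engagement.get(item, 0.0) / math.log2(pos + 2)
              for pos, item in enumerate(ranking[:k]))
    ideal = sorted(engagement.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(pos + 2) for pos, gain in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


def top_k_agreement(primary, shadow, k):
    """Fraction of the k top slots whose items appear in both top-k sets."""
    return len(set(primary[:k]) & set(shadow[:k])) / k


def promoted_item_engagement_rate(primary, shadow, engagement, k):
    """Share of items in the shadow top-k but outside the primary top-k
    that received any engagement. Returns None when nothing was promoted."""
    promoted = set(shadow[:k]) - set(primary[:k])
    if not promoted:
        return None
    engaged = sum(1 for item in promoted if engagement.get(item, 0.0) > 0)
    return engaged / len(promoted)


# One paired observation from the log (illustrative data only).
primary = ["a", "b", "c", "d", "e"]   # ranking that was served
shadow = ["c", "a", "e", "b", "d"]    # LLM ranking computed off the serving path
engagement = {"a": 1.0, "c": 1.0}     # items the user engaged with

print(ndcg_at_k(primary, engagement, 2), ndcg_at_k(shadow, engagement, 2))
print(top_k_agreement(primary, shadow, 2))
print(promoted_item_engagement_rate(primary, shadow, engagement, 2))
```

Hashing a salted user ID keeps enrollment persistent without storing state, and changing the salt reshuffles the enrolled population for a new experiment. Because engagement is observed only on the served slate, these metrics are subject to position and exposure bias and are best read as comparative signals, not as estimates of what users would have done under the shadow ranking.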

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Anonymous, "Shadow Scoring System for Cost-Effective LLM Recommendation Experimentation", Technical Disclosure Commons, (June 30, 2026)
https://www.tdcommons.org/dpubs_series/10714

Technical Disclosure Commons

Defensive Publications Series

Shadow Scoring System for Cost-Effective LLM Recommendation Experimentation