Abstract
A hybrid online serving architecture for recommendation systems selectively applies large language model (LLM) reasoning at a late ranking stage. Each request first passes through retrieval and early-stage ranking, which reduce the candidate set and produce confidence statistics. A cost-benefit routing orchestrator computes a routing score from traditional-model confidence, user value tier, budget utilization, latency constraints, and surface priority, and compares that score to a threshold that adapts dynamically to budget pressure, quality gap, and tail-latency pressure. Before the LLM is invoked, a return-on-investment (ROI) estimate is computed from expected quality lift and token cost, and the LLM is invoked only when the estimated ROI is positive. Under constraints, a partial LLM mode scores only a top-K subset of candidates and produces calibrated scores that can be merged with traditional scores. The system outputs a final slate from LLM or traditional scoring, according to the routing decision.
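The routing decision described above can be illustrated with a minimal Python sketch. The abstract does not give formulas, so everything concrete here is an assumption made for illustration: the linear form of the routing score, all weights, the normalization of each signal to [0, 1], the threshold adjustment coefficients, the ROI expressed as value of lift minus token cost, and the names `RequestSignals`, `routing_score`, `adapted_threshold`, `estimated_roi`, and `route`.

```python
from dataclasses import dataclass


@dataclass
class RequestSignals:
    """Hypothetical per-request inputs, each assumed normalized to [0, 1]."""
    model_confidence: float    # traditional-model confidence
    user_value_tier: float     # higher means a more valuable user
    budget_utilization: float  # fraction of the LLM budget already spent
    latency_headroom: float    # fraction of the latency budget still free
    surface_priority: float    # importance of the serving surface


def routing_score(s: RequestSignals) -> float:
    """Weighted sum; a higher score favors LLM scoring. Weights are assumed."""
    return (0.40 * (1.0 - s.model_confidence)
            + 0.20 * s.user_value_tier
            + 0.15 * (1.0 - s.budget_utilization)
            + 0.15 * s.latency_headroom
            + 0.10 * s.surface_priority)


def adapted_threshold(base: float, budget_pressure: float,
                      quality_gap: float, tail_latency_pressure: float) -> float:
    """Raise the bar under budget or tail-latency pressure, lower it when the
    quality gap is large. The coefficients (0.3) are assumed."""
    t = (base + 0.3 * budget_pressure
         - 0.3 * quality_gap
         + 0.3 * tail_latency_pressure)
    return min(1.0, max(0.0, t))


def estimated_roi(expected_quality_lift: float, value_per_unit_lift: float,
                  tokens: int, cost_per_token: float) -> float:
    """Assumed ROI form: monetized expected lift minus token cost."""
    return expected_quality_lift * value_per_unit_lift - tokens * cost_per_token


def route(s: RequestSignals, threshold: float, roi: float,
          constrained: bool = False) -> str:
    """Invoke the LLM only if the score clears the threshold and ROI is positive.
    When constrained, fall back to partial mode (top-K subset only)."""
    if routing_score(s) < threshold or roi <= 0.0:
        return "traditional"
    return "partial_llm" if constrained else "llm"


if __name__ == "__main__":
    signals = RequestSignals(0.3, 1.0, 0.2, 0.8, 0.5)
    threshold = adapted_threshold(0.5, budget_pressure=0.2,
                                  quality_gap=0.1, tail_latency_pressure=0.0)
    roi = estimated_roi(0.02, 10.0, tokens=1500, cost_per_token=0.00002)
    print(route(signals, threshold, roi))
```

In this sketch a low-confidence request from a high-value user clears the threshold and, with positive estimated ROI, is routed to the LLM; the same request with the `constrained` flag set is routed to partial mode, and any request with non-positive ROI stays on traditional scoring.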
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Anonymous, "Hybrid Serving Architecture with Cost-Benefit Routing for LLM-Based Recommendation Systems", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10715