Inventor(s)

Abstract

A hybrid online serving architecture for recommendation systems selectively applies large language model (LLM) reasoning at a late ranking stage. Each request first passes through retrieval and early-stage ranking, which reduce the candidate set and produce confidence statistics. A cost-benefit routing orchestrator computes a routing score from traditional-model confidence, user value tier, budget utilization, latency constraints, and surface priority, and compares that score to a threshold that adapts dynamically to budget pressure, quality gap, and tail-latency pressure. Before any LLM invocation, a return-on-investment (ROI) estimate is computed from expected quality lift and token cost, and the LLM is invoked only when the ROI is positive. Under budget or latency constraints, a partial LLM mode scores only a top-K subset of candidates and calibrates those scores so they can be merged with traditional scores. The system outputs a final slate from LLM or traditional scoring, according to the routing decision.
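The routing flow described in the abstract can be illustrated with a minimal sketch. The abstract names the inputs but gives no formulas, so everything concrete here is an assumption made for illustration: the linear weights, the 0-to-1 normalization of each signal, the threshold adjustment coefficients, the ROI expressed as lift value minus token cost, and all names (`RequestContext`, `routing_score`, `adaptive_threshold`, `roi`, `route`, `merge_partial`).

```python
from dataclasses import dataclass


@dataclass
class RequestContext:
    # All fields are assumed to be normalized to the range 0..1.
    model_confidence: float    # confidence of the traditional early-stage ranker
    user_value_tier: float     # higher means a more valuable user
    budget_utilization: float  # fraction of the LLM budget already spent
    latency_headroom: float    # fraction of the latency budget still available
    surface_priority: float    # higher means a more important surface


def routing_score(ctx: RequestContext) -> float:
    # Hypothetical linear blend; the weights are illustrative, not from the source.
    return (0.35 * (1.0 - ctx.model_confidence)
            + 0.20 * ctx.user_value_tier
            + 0.15 * (1.0 - ctx.budget_utilization)
            + 0.15 * ctx.latency_headroom
            + 0.15 * ctx.surface_priority)


def adaptive_threshold(base: float, budget_pressure: float,
                       quality_gap: float, tail_latency_pressure: float) -> float:
    # Assumed direction: budget and tail-latency pressure raise the bar,
    # while a larger quality gap lowers it. Result is clamped to 0..1.
    t = base + 0.2 * budget_pressure + 0.2 * tail_latency_pressure - 0.2 * quality_gap
    return min(1.0, max(0.0, t))


def roi(expected_lift_value: float, tokens: int, cost_per_token: float) -> float:
    # Assumed form: value of the expected quality lift minus the token cost.
    return expected_lift_value - tokens * cost_per_token


def route(ctx: RequestContext, threshold: float,
          roi_estimate: float, constrained: bool) -> str:
    # The LLM is used only if the score clears the threshold and ROI is positive.
    if routing_score(ctx) < threshold or roi_estimate <= 0:
        return "traditional"
    return "partial_llm" if constrained else "full_llm"


def merge_partial(traditional: dict, calibrated_llm: dict, k: int) -> list:
    # Partial mode: only the top-K items by traditional score take their
    # calibrated LLM score; the rest keep their traditional score.
    ranked = sorted(traditional, key=traditional.get, reverse=True)
    merged = dict(traditional)
    for item in ranked[:k]:
        merged[item] = calibrated_llm[item]
    return sorted(merged, key=merged.get, reverse=True)
```

In this sketch the partial mode simply substitutes calibrated LLM scores for the top-K items, which presumes calibration has already put both score families on a common scale; a weighted blend of the two scores would be an equally plausible reading of the abstract.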

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
