Abstract

A hybrid online serving architecture for recommendation systems selectively applies large language model (LLM) reasoning at a late ranking stage. Each request first passes through retrieval and early-stage ranking, which reduce the candidate set and produce confidence statistics. A cost-benefit routing orchestrator then computes a routing score from traditional-model confidence, user value tier, budget utilization, latency constraints, and surface priority, and compares that score to a dynamically adapted threshold that responds to budget pressure, quality gap, and tail-latency pressure. Before any LLM invocation, a return-on-investment (ROI) estimate is computed from expected quality lift and token cost, and the LLM is invoked only when the estimated ROI is positive. Under constraints, a partial LLM mode scores only a top-K subset of candidates and produces calibrated scores that are merged with the traditional scores. The system outputs a final slate from LLM or traditional scoring, depending on the routing decision.
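The routing flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the disclosed implementation: the abstract names the inputs but gives no formulas, so the linear weights, the [0, 1] normalization of every signal, the base threshold of 0.5, the conversion of quality lift into the same cost unit as tokens, and all names (RequestSignals, SystemPressure, routing_score, adaptive_threshold, roi, decide, merge_partial) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RequestSignals:
    # All fields are assumed to be normalized to [0, 1] for this sketch.
    model_confidence: float    # confidence of the traditional early-stage ranker
    user_value_tier: float     # higher means a more valuable user
    budget_utilization: float  # fraction of the LLM budget already spent
    latency_headroom: float    # fraction of the latency budget still available
    surface_priority: float    # higher means a more important product surface


@dataclass
class SystemPressure:
    budget_pressure: float        # assumed in [0, 1]
    quality_gap: float            # assumed in [0, 1]
    tail_latency_pressure: float  # assumed in [0, 1]


def routing_score(s: RequestSignals) -> float:
    # Hypothetical linear combination: low traditional confidence, high user
    # value, spare budget, spare latency, and high priority all favor the LLM.
    return (0.35 * (1.0 - s.model_confidence)
            + 0.20 * s.user_value_tier
            + 0.15 * (1.0 - s.budget_utilization)
            + 0.15 * s.latency_headroom
            + 0.15 * s.surface_priority)


def adaptive_threshold(p: SystemPressure, base: float = 0.5) -> float:
    # Assumed adaptation rule: budget and tail-latency pressure raise the bar,
    # a larger quality gap lowers it. The result is clamped to [0, 1].
    t = (base + 0.2 * p.budget_pressure + 0.2 * p.tail_latency_pressure
         - 0.2 * p.quality_gap)
    return min(max(t, 0.0), 1.0)


def roi(expected_lift_value: float, tokens: int, cost_per_token: float) -> float:
    # Assumes the expected quality lift is already expressed in cost units.
    return expected_lift_value - tokens * cost_per_token


def decide(s: RequestSignals, p: SystemPressure, expected_lift_value: float,
           tokens: int, cost_per_token: float, constrained: bool) -> str:
    if routing_score(s) < adaptive_threshold(p):
        return "traditional"
    if roi(expected_lift_value, tokens, cost_per_token) <= 0:
        return "traditional"  # the LLM is invoked only when ROI is positive
    return "partial_llm" if constrained else "full_llm"


def merge_partial(traditional: dict, llm_calibrated: dict, top_k: int) -> list:
    # Partial mode: only the top-K traditional candidates receive LLM scores.
    # Assumed merge rule: a calibrated LLM score replaces the traditional one.
    head = sorted(traditional, key=traditional.get, reverse=True)[:top_k]
    merged = dict(traditional)
    for item in head:
        if item in llm_calibrated:
            merged[item] = llm_calibrated[item]
    return sorted(merged, key=merged.get, reverse=True)
```

In this sketch the threshold check runs before the ROI check so that the cheaper test filters most requests first; the abstract does not say whether the two checks are ordered this way. Replacing traditional scores outright in merge_partial is only meaningful if the calibrated LLM scores are on the same scale as the traditional scores, which is the role calibration is assumed to play here.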

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Anonymous, "Hybrid Serving Architecture with Cost-Benefit Routing for LLM-Based Recommendation Systems", Technical Disclosure Commons, (June 30, 2026)
https://www.tdcommons.org/dpubs_series/10715

Technical Disclosure Commons

Defensive Publications Series

Hybrid Serving Architecture with Cost-Benefit Routing for LLM-Based Recommendation Systems
