Abstract
A method and system for reducing tail latency in machine learning inference pipelines are disclosed. The system includes a primary orchestrator configured to receive an inference request from a client-tier component and dispatch the request to a primary execution pool. An adaptive delay controller computes a dynamic hedge delay by optimizing a joint objective that balances tail latency against duplicate execution cost. During the delay interval, a latency risk index (LRI) module computes an LRI from telemetry signals including queue gradients, stage timing deviations, peer load metrics, and network jitter. Upon delay expiry, if stall risk is elevated, the primary orchestrator forwards a sealed hedge request to a secondary orchestrator. An arbitration and single-commit component selects the first successful response and ensures exactly-once semantics. Outcome metrics stored in a monitoring/metrics store are used to refine hedge timing and reduce p99.9 latency.
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
QIN, HAO, "ADAPTIVE PEER HEDGING FOR DYNAMIC TAIL LATENCY CONTROL IN ML INFERENCE PIPELINES", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/9551