Abstract
A method and system for reducing tail latency in machine learning inference pipelines are disclosed. The system includes a primary orchestrator configured to receive an inference request from a client-tier component and dispatch the request to a primary execution pool. An adaptive delay controller computes a dynamic hedge delay by optimizing a joint objective that balances tail latency against duplicate execution cost. During the delay interval, a latency risk index (LRI) module computes an LRI from telemetry signals including queue gradients, stage timing deviations, peer load metrics, and network jitter. Upon delay expiry, if stall risk is elevated, the primary orchestrator forwards a sealed hedge request to a secondary orchestrator. An arbitration and single-commit component selects the first successful response and ensures exactly-once semantics. Outcome metrics stored in a monitoring/metrics store are used to refine hedge timing and reduce p99.9 latency.
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
QIN, HAO, "ADAPTIVE PEER HEDGING FOR DYNAMIC TAIL LATENCY CONTROL IN ML INFERENCE PIPELINES", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/9551