Inventor(s)

HAO QIN, VISA

Abstract

A method and system for reducing tail latency in machine learning inference pipelines are disclosed. The system includes a primary orchestrator configured to receive an inference request from a client-tier component and dispatch it to a primary execution pool. An adaptive delay controller computes a dynamic hedge delay by optimizing a joint objective that balances tail latency against duplicate-execution cost. During the delay interval, a latency risk index (LRI) module computes an LRI from telemetry signals, including queue gradients, stage timing deviations, peer load metrics, and network jitter. Upon delay expiry, if stall risk is elevated, the primary orchestrator forwards a sealed hedge request to a secondary orchestrator. An arbitration and single-commit component selects a successful response and ensures exactly-once semantics. Outcome metrics stored in a monitoring/metrics store are used to refine hedge timing and reduce p99.9 latency.
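The control flow described above can be sketched in Python. This is a minimal illustration, not the disclosed implementation: the LRI weights, the threshold, and the helper names (`latency_risk_index`, `hedged_inference`) are assumptions, and the "sealed hedge request" and arbitration are approximated by submitting a duplicate task and committing the first completed result.

```python
import concurrent.futures
import time

def latency_risk_index(queue_gradient, timing_deviation, peer_load, jitter):
    """Toy LRI: weighted sum of normalized telemetry signals.
    The weights are illustrative assumptions, not from the disclosure."""
    return (0.4 * queue_gradient + 0.3 * timing_deviation
            + 0.2 * peer_load + 0.1 * jitter)

def hedged_inference(request, primary, secondary, hedge_delay_s,
                     telemetry, lri_threshold=0.5):
    """Dispatch to the primary pool; after the hedge delay expires, if the
    stall risk (LRI) is elevated, dispatch a hedge to the secondary pool.
    Arbitration commits exactly one response: the first to complete."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(primary, request)]
        # Wait out the adaptive hedge delay before considering a duplicate.
        done, _ = concurrent.futures.wait(futures, timeout=hedge_delay_s)
        if not done and latency_risk_index(**telemetry) >= lri_threshold:
            # Elevated stall risk: forward a hedge request to the secondary.
            futures.append(pool.submit(secondary, request))
        # Single commit: take the first successful response, drop the rest.
        done, pending = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        result = next(iter(done)).result()
        for f in pending:
            f.cancel()
        return result

if __name__ == "__main__":
    def slow_primary(req):
        time.sleep(0.5)          # simulated straggler
        return "primary:" + req

    def fast_secondary(req):
        return "secondary:" + req

    high_risk = dict(queue_gradient=1.0, timing_deviation=1.0,
                     peer_load=1.0, jitter=1.0)
    print(hedged_inference("q1", slow_primary, fast_secondary,
                           hedge_delay_s=0.05, telemetry=high_risk))
```

In a real pipeline the hedge delay would itself be tuned from the outcome metrics in the monitoring store, rather than passed in as a constant as it is here.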

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
