Inventor(s)

Hao Qin, VISA

Abstract

The present disclosure relates to a method and system for adaptive routing of machine learning inference requests using composite health scores. The method involves maintaining a dynamic health score for each inference engine instance based on multiple performance factors including latency, success rate, timeout occurrence, error rate, saturation, and staleness. Instances are organized in a priority queue according to their health scores, with the instance having the lowest score selected for processing requests. A fairness adjustment is applied to prevent overloading any single instance. Instances that meet specific quarantine criteria are isolated and subject to periodic health assessments, while successfully recovered instances are reintegrated with adjusted penalty parameters. The system further includes a health score engine and a quarantine manager to facilitate these operations, enhancing the efficiency of the inference routing process while minimizing latency and redundancy.
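The routing loop described in the abstract can be illustrated with a minimal sketch. The abstract names the score factors but not how they are combined, so the weighted sum, the weights, the quarantine threshold, the fairness step, and the recovery penalty below are all assumptions chosen for illustration, as are the names `Instance`, `Router`, `select`, and `probe`. Lower scores are treated as healthier, matching the abstract's statement that the lowest-scoring instance is selected.

```python
import heapq
from dataclasses import dataclass

# Assumed weights: the disclosure lists these factors but not their weighting.
# "failure_rate" stands in for the inverse of the success rate.
WEIGHTS = {
    "latency": 0.30, "failure_rate": 0.20, "timeout_rate": 0.15,
    "error_rate": 0.15, "saturation": 0.10, "staleness": 0.10,
}
QUARANTINE_THRESHOLD = 0.8  # assumed cutoff on the base score
FAIRNESS_STEP = 0.05        # assumed score bump per recent selection
RECOVERY_PENALTY = 0.2      # assumed handicap applied after reintegration


@dataclass
class Instance:
    name: str
    metrics: dict            # each factor normalized to [0, 1]; higher is worse
    recent_picks: int = 0
    penalty: float = 0.0
    quarantined: bool = False

    def base_score(self) -> float:
        # Composite health score: weighted sum of the normalized factors.
        return sum(w * self.metrics.get(k, 0.0) for k, w in WEIGHTS.items())

    def routing_score(self) -> float:
        # Fairness adjustment: each recent pick makes the instance less attractive.
        return self.base_score() + self.penalty + FAIRNESS_STEP * self.recent_picks


class Router:
    def __init__(self, instances):
        self.instances = {inst.name: inst for inst in instances}

    def select(self):
        # Isolate any instance whose base score meets the quarantine criterion.
        for inst in self.instances.values():
            if inst.base_score() >= QUARANTINE_THRESHOLD:
                inst.quarantined = True
        # Priority queue keyed by routing score; the lowest score wins.
        heap = [(inst.routing_score(), inst.name)
                for inst in self.instances.values() if not inst.quarantined]
        if not heap:
            return None
        heapq.heapify(heap)
        chosen = self.instances[heap[0][1]]
        chosen.recent_picks += 1
        return chosen

    def probe(self, name, fresh_metrics) -> bool:
        # Periodic health assessment of a quarantined instance.
        inst = self.instances[name]
        inst.metrics = fresh_metrics
        if inst.base_score() < QUARANTINE_THRESHOLD:
            # Reintegrate with an adjusted penalty so traffic ramps up gradually.
            inst.quarantined = False
            inst.penalty = RECOVERY_PENALTY
            inst.recent_picks = 0
        return not inst.quarantined
```

In this sketch the queue is rebuilt on every call for clarity; a production router would more likely update entries incrementally and decay both `recent_picks` and the recovery penalty over time.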

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
