Generally, the present disclosure is directed to optimizing the use of computing resources in a system. In particular, in some implementations, the systems and methods of the present disclosure can include or otherwise leverage one or more machine-learned models to predict the task allocation for a job that serves a plurality of machine-learned models, based on the current system state and queries-per-second (QPS) data for the plurality of models. Alternatively, tasks can be allocated according to one or more rules (e.g., new tasks are allocated to a job until the compute usage for the job falls below a scaling threshold). Thus, the systems and methods of the present disclosure are able to efficiently serve a mix of high-QPS and low-QPS machine-learned models at low latency while minimizing waste of compute resources (e.g., CPU, GPU, TPU, etc.) and memory (e.g., RAM).
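
The rule-based alternative mentioned above can be illustrated with a minimal sketch. It is not the disclosed implementation: the function name `tasks_needed`, its parameters, and the definition of compute usage as total load divided by total task capacity are all assumptions made for illustration only.

```python
def tasks_needed(
    total_load: float,
    task_capacity: float,
    scaling_threshold: float,
    current_tasks: int = 1,
    max_tasks: int = 1000,
) -> int:
    """Return a task count for a job under a simple threshold rule.

    Hypothetical rule: keep allocating one more task to the job until the
    job's compute usage (assumed here to be load / total capacity) falls
    below the scaling threshold. This sketch only scales up, never down.
    """
    if task_capacity <= 0:
        raise ValueError("task_capacity must be positive")
    if not 0 < scaling_threshold <= 1:
        raise ValueError("scaling_threshold must be in (0, 1]")

    tasks = max(current_tasks, 1)
    # Add tasks while usage is still at or above the threshold.
    while tasks < max_tasks and total_load / (tasks * task_capacity) >= scaling_threshold:
        tasks += 1
    return tasks


if __name__ == "__main__":
    # A job with 10 units of load, tasks of 4 units each, threshold of 80%.
    print(tasks_needed(total_load=10.0, task_capacity=4.0, scaling_threshold=0.8))
```

In this sketch the load is an abstract number; in practice it might be derived from per-model QPS and per-query cost, which the abstract does not specify.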

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.