Machine-learning models consume a growing fraction of the world's computing resources, and computing inferences with some of these models is extremely expensive. Provisioning computing resources for peak performance, e.g., high availability and quality of service, requires headroom for traffic spikes (increases in demand) and for possible outages (decreases in capacity). Executing computer applications that utilize machine-learning models, also known as machine-learned models, can therefore require significant capital and operational expenses.
This disclosure describes techniques to optimize the use of computing resources for a machine-learning model. The techniques utilize multi-resolution models and/or models with recurrence, which can compute inferences to varying degrees of quality (resolution). The multi-resolution models are served in an elastic manner, such that a model of a resolution that fits the available computing resources is utilized to compute inferences.
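The elastic selection described above can be illustrated with a minimal sketch. It is not the disclosure's implementation: the `ModelVariant` type, the `pick_variant` helper, the variant names, and all quality and cost figures are hypothetical, and it assumes that each resolution has a known per-query compute cost and that capacity is measured in the same units.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ModelVariant:
    """One resolution of a hypothetical multi-resolution model."""
    name: str
    quality: float         # assumed relative inference quality; higher is better
    cost_per_query: float  # assumed compute units consumed per inference


def pick_variant(
    variants: Sequence[ModelVariant],
    capacity: float,
    queries_per_sec: float,
) -> Optional[ModelVariant]:
    """Return the highest-quality variant whose total cost fits the capacity.

    Returns None when even the cheapest variant does not fit.
    """
    affordable = [
        v for v in variants
        if v.cost_per_query * queries_per_sec <= capacity
    ]
    if not affordable:
        return None
    return max(affordable, key=lambda v: v.quality)


# Illustrative figures only; none of these come from the disclosure.
VARIANTS = [
    ModelVariant("low", quality=0.70, cost_per_query=1.0),
    ModelVariant("medium", quality=0.85, cost_per_query=4.0),
    ModelVariant("high", quality=0.95, cost_per_query=10.0),
]

if __name__ == "__main__":
    # Normal load: the full-resolution variant fits.
    print(pick_variant(VARIANTS, capacity=1000, queries_per_sec=50).name)   # high
    # Traffic spike: demand rises, so a cheaper resolution is served.
    print(pick_variant(VARIANTS, capacity=1000, queries_per_sec=200).name)  # medium
    # Outage: capacity drops, so the lowest resolution is served.
    print(pick_variant(VARIANTS, capacity=150, queries_per_sec=100).name)   # low
```

Under these assumptions, a traffic spike or a loss of capacity lowers the served resolution instead of requiring idle headroom provisioned in advance.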
Olston, Christopher; Fiedel, Noah; Chi, Ed H.; and Beutel, Alexander, "Elastic multi-resolution model-serving to compute inferences", Technical Disclosure Commons, (September 15, 2017)