Abstract

On-device machine learning models can provide inferences quickly; however, such models are often large and must be downloaded to the device. In many domains, such as optical character recognition or translation, the nature of the problem requires that several on-device models be made available.

This disclosure describes techniques to split inference computation between the device and a remote server by leveraging the observation that many ML models have domain-independent layers that are shared between multiple models and domain-specific layers that are unique to individual models. Per the techniques described herein, the smallest available intermediate representation from the shared layers is transmitted to the server for inference. Alternatively, a bottleneck layer is inserted between the domain-independent and domain-specific layers of the model. The shared layers of the model run on-device while the domain-specific layers run on the server. The use of the smallest intermediate representation eliminates the need to store large models locally, reduces the cost and delay of network communication by minimizing the data transmitted during inference, and allows leveraging the power and flexibility of server-based machine-learning models.
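The sketch below illustrates one possible shape of such a split, assuming PyTorch and hypothetical layer sizes, class names, and serialization; the disclosure does not prescribe a specific framework, transport, or architecture. The on-device portion holds the shared, domain-independent layers followed by a narrow bottleneck, and only the bottleneck output is sent to the server, where a domain-specific head completes the inference.

```python
# Minimal sketch, assuming PyTorch; all names, dimensions, and the byte-level
# transport are illustrative assumptions, not part of the disclosure.
import torch
import torch.nn as nn

class OnDeviceEncoder(nn.Module):
    """Domain-independent layers shared across models, kept on the device.
    A narrow bottleneck layer yields the smallest intermediate representation."""
    def __init__(self, in_dim=1024, hidden=512, bottleneck=64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Compresses the activations that must cross the network.
        self.bottleneck = nn.Linear(hidden, bottleneck)

    def forward(self, x):
        return self.bottleneck(self.shared(x))

class ServerHead(nn.Module):
    """Domain-specific layers hosted on the server (e.g., one head per language or domain)."""
    def __init__(self, bottleneck=64, hidden=512, num_classes=100):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(bottleneck, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, z):
        return self.head(z)

# Device side: run the shared layers and transmit only the compact representation.
encoder = OnDeviceEncoder()
z = encoder(torch.randn(1, 1024))          # e.g., 64 floats instead of the raw input
payload = z.detach().numpy().tobytes()     # serialized and sent to the server (transport unspecified)

# Server side: deserialize and finish inference with the domain-specific head.
z_received = torch.frombuffer(bytearray(payload), dtype=torch.float32).reshape(1, 64)
logits = ServerHead()(z_received)
```

In this sketch the bottleneck dimension controls the network payload size directly, which is the lever the abstract refers to when it mentions minimizing the data transmitted during inference.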

This work is licensed under a Creative Commons Attribution 4.0 License.
