Abstract

The present disclosure relates to systems and methods for serving machine learning models by partitioning model components for distributed execution across heterogeneous hardware resources. In particular, the disclosure describes techniques for splitting models into independently executable subgraphs deployed across a cluster of machines with varying hardware configurations (e.g., CPUs, GPUs, TPUs). The system leverages the distinct strengths of each hardware type by mapping memory-intensive or preprocessing-heavy components (such as large embedding-table lookups) to CPU machines while assigning compute-intensive subgraphs to accelerators such as TPUs. A runtime system manages inter-machine communication and orchestrates inference tasks across the partitions, enabling improved hardware utilization, scalability, and performance when serving large, embedding-heavy models.
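To make the partitioning and orchestration idea concrete, the following is a minimal, self-contained Python sketch of the general technique. The disclosure does not specify an API; all names here (Subgraph, assign_device, Orchestrator) are hypothetical, placement is reduced to a single memory-bound flag, and the inter-machine RPC boundary is simulated with an in-process call.

```python
"""Illustrative sketch of heterogeneous model partitioning and serving.

All class and function names are hypothetical; this is not the disclosed
implementation, only a toy model of the described architecture.
"""

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Subgraph:
    """An independently executable partition of a larger model graph."""
    name: str
    run: Callable[[Dict], Dict]  # executes this subgraph on a batch
    memory_bound: bool = False   # e.g., large embedding-table lookups


def assign_device(sg: Subgraph) -> str:
    """Heuristic placement: memory-intensive subgraphs go to CPU hosts
    with ample RAM; compute-intensive ones go to accelerators (TPU here)."""
    return "cpu-host" if sg.memory_bound else "tpu-worker"


class Orchestrator:
    """Runs partitioned subgraphs in sequence, forwarding intermediate
    activations between the (simulated) machines that host them."""

    def __init__(self, subgraphs: List[Subgraph]):
        self.plan = [(sg, assign_device(sg)) for sg in subgraphs]

    def infer(self, inputs: Dict) -> Dict:
        activations = inputs
        for sg, device in self.plan:
            # In a real system this boundary would be an RPC to `device`;
            # here the subgraph is simply invoked in-process.
            activations = sg.run(activations)
        return activations


# Example: an embedding-heavy model split into two stages.
embedding_stage = Subgraph(
    name="embedding_lookup",
    run=lambda x: {"embedded": [v * 2 for v in x["ids"]]},  # toy lookup
    memory_bound=True,   # large tables -> CPU host
)
dense_stage = Subgraph(
    name="dense_layers",
    run=lambda x: {"score": sum(x["embedded"])},  # toy compute stage
    memory_bound=False,  # matmul-heavy -> TPU worker
)

orchestrator = Orchestrator([embedding_stage, dense_stage])
print(orchestrator.infer({"ids": [3, 1, 4]}))  # {'score': 16}
```

In this sketch the placement decision and the execution plan are computed once, ahead of serving; a production runtime of the kind described would additionally handle batching, failure recovery, and transport of tensors between machines.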

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.