Minimizing latency is critical to high-performance computing (HPC). Network monitoring tools that rely on out-of-band data to control latency cannot assist latency-sensitive network workloads in real time. This disclosure describes techniques that combine in-band network telemetry (INT) with the software-defined network (SDN) controller used by the cloud platform to mitigate HPC latency. INT gathers hardware-level information about buffer and queue utilization. The cloud SDN controller uses this information to make changes to the virtual environment. The SDN controller can also directly influence the HPC master node's decisions about assigning tasks to worker nodes. In this way, the techniques use deep, hardware-level signals of potential latency problems, namely buffer accumulations, to inform cloud-HPC scheduling algorithms.
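As a rough illustration of how queue-utilization telemetry might feed a scheduling decision, the following Python sketch scores the network path to each worker node by its most heavily loaded hop and ranks the workers accordingly. It is a minimal sketch under stated assumptions, not the method of the disclosure: the record type `IntHopReport`, the functions `path_congestion` and `rank_workers`, the 0.8 utilization threshold, and the sample switch and worker names are all hypothetical and do not correspond to any real INT or SDN controller API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class IntHopReport:
    """One hop's telemetry record (hypothetical, heavily simplified)."""
    switch_id: str
    queue_depth: int      # packets currently queued at the egress port
    queue_capacity: int   # maximum queue size for that port

    @property
    def utilization(self) -> float:
        """Fraction of the queue that is currently occupied."""
        return self.queue_depth / self.queue_capacity


def path_congestion(hops: List[IntHopReport]) -> float:
    """Score a path by its most heavily loaded hop (the bottleneck)."""
    return max((hop.utilization for hop in hops), default=0.0)


def rank_workers(paths: Dict[str, List[IntHopReport]],
                 threshold: float = 0.8) -> List[str]:
    """Return workers whose path score is below the (assumed) threshold,
    least congested first; ties are broken by worker name."""
    scores = {worker: path_congestion(hops) for worker, hops in paths.items()}
    eligible = [worker for worker, score in scores.items() if score < threshold]
    return sorted(eligible, key=lambda worker: (scores[worker], worker))


# Hypothetical sample data: the path to worker-a crosses a nearly full queue.
example_paths = {
    "worker-a": [IntHopReport("s1", 10, 100), IntHopReport("s2", 90, 100)],
    "worker-b": [IntHopReport("s1", 10, 100), IntHopReport("s3", 30, 100)],
}

print(rank_workers(example_paths))
```

With the sample data, the path to worker-a is excluded because one of its hops exceeds the assumed threshold, so only worker-b remains a candidate. Taking the maximum rather than the average utilization is a design choice in this sketch: a single saturated buffer can dominate end-to-end latency even when every other hop is idle.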
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Iorga, Radu and Legrand, Aurélien, "Using Low-level Telemetry in Cloud Platforms to Mitigate Latency Risks for High-Performance Computing", Technical Disclosure Commons, (May 08, 2022)