Abstract

For Artificial Intelligence (AI), Machine Learning (ML), and other deep learning applications, the network fabric is typically deployed as a Layer 3 (L3) fabric without overlays. The fabric carries heterogeneous, high-throughput application traffic of two kinds: Remote Direct Memory Access (RDMA) over Converged Ethernet version 2 (RoCEv2) traffic, which runs over User Datagram Protocol (UDP) (e.g., distributed deep learning applications), and Transmission Control Protocol (TCP) traffic (e.g., storage, Hadoop Distributed File System (HDFS), and Network File System (NFS) applications). In such environments, as the number of Graphics Processing Units (GPUs) grows, deployment completion time starts to increase, so adding GPUs typically yields no benefit. The cause is congestion and microbursts, which can arise from application traffic and quality of service (QoS) configuration options. Presented herein is a solution that improves application performance for RDMA/AI/ML fabric deployments by allowing compute and storage clusters to scale horizontally without causing the network to generate PAUSE frames, which reduce throughput, increase latency, and defeat horizontal scaling. Broadly, the solution involves monitoring local and remote events at the data path level and moving flows using weighted path selection. Flows may be moved by modifying access control lists (ACLs) and access control entries (ACEs) for load balancing and redirection components. Further, the weights can be auto-adjusted based on various system-level events.
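
To make the idea of event-driven weighted path selection concrete, the following is a minimal illustrative sketch, not the disclosed implementation. The class name `WeightedPathSelector`, the path names, the halving and additive-recovery constants, and the 10,000-bucket hash space are all assumptions chosen for illustration. The sketch models only the weight adjustment and the flow-to-path mapping; it does not model the ACL/ACE programming that the abstract describes as the mechanism for actually moving flows.

```python
class WeightedPathSelector:
    """Hypothetical sketch: map flows to paths in proportion to path weights."""

    BUCKETS = 10_000  # assumed size of the flow-hash space

    def __init__(self, path_names, min_weight=0.1, max_weight=1.0):
        self.min_weight = min_weight
        self.max_weight = max_weight
        # Every path starts at full weight (dicts preserve insertion order).
        self.weights = {name: max_weight for name in path_names}

    def on_event(self, path_name, congested):
        """Auto-adjust a path's weight from a monitored event.

        Assumed policy: halve the weight on a congestion event and
        recover it additively when the path reports clear.
        """
        w = self.weights[path_name]
        if congested:
            w = max(self.min_weight, w * 0.5)
        else:
            w = min(self.max_weight, w + 0.1)
        self.weights[path_name] = w

    def select(self, flow_hash):
        """Pick a path by projecting the flow hash onto cumulative weights."""
        total = sum(self.weights.values())
        point = (flow_hash % self.BUCKETS) / self.BUCKETS * total
        cumulative = 0.0
        for name, w in self.weights.items():
            cumulative += w
            if point < cumulative:
                return name
        return name  # guard against floating-point rounding at the top edge


selector = WeightedPathSelector(["spine1", "spine2"])
before = selector.select(4000)                 # equal weights
selector.on_event("spine1", congested=True)    # e.g. a microburst on spine1
after = selector.select(4000)                  # same flow, reduced spine1 weight
print(before, "->", after)
```

In this sketch a congestion event on one path shrinks its share of the hash space, so some flows that previously mapped to it are steered to the other path, while a flow hash that still falls inside the reduced share stays put. A real deployment would additionally need to decide when to move an in-progress flow, since remapping can reorder packets.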

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
