Abstract

For Artificial Intelligence (AI), Machine Learning (ML), or other Deep Learning applications, the network fabric is typically deployed as a Layer 3 (L3) fabric without overlays. It carries heterogeneous, high-throughput application traffic of two kinds: remote direct memory access (RDMA) over Converged Ethernet version 2 (RoCEv2) traffic (e.g., User Datagram Protocol (UDP) traffic from Distributed Deep Learning applications) and Transmission Control Protocol (TCP) traffic (e.g., from storage, Hadoop Distributed File System (HDFS), or Network File System (NFS) applications). In such environments, as more Graphics Processing Units (GPUs) are added, deployment completion time begins to increase, so adding GPUs typically yields no benefit. The cause is congestion and microbursts, which can arise from application traffic and from quality of service (QoS) configuration options. Presented herein is a solution that improves application performance for RDMA/AI/ML fabric deployments by allowing computing and storage clusters to scale horizontally without causing the network to generate PAUSE frames, since PAUSE frames reduce throughput, increase latency, and prevent horizontal scaling from working. Broadly, the solution involves monitoring local and remote events at the data path level and moving flows using weighted path selection. Flows may be moved by modifying access control lists (ACLs) and access control entries (ACEs) for load balancing and redirection components. Further, the weights can be adjusted automatically based on various system-level events.
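The weighted path selection idea in the abstract can be illustrated with a minimal Python sketch. It is not the disclosed implementation. The class name `WeightedPathSelector`, the path names, the halving and additive-recovery rule for weights, and the 0.5 rebalance threshold are all assumptions made for illustration. A plain dictionary stands in for the ACL/ACE updates that would actually redirect a flow on a switch.

```python
class WeightedPathSelector:
    """Toy model: assign flows to paths by weight, shift them on congestion."""

    def __init__(self, paths, min_weight=0.1):
        # Every path starts with full weight (hypothetical scale: 0.1 to 1.0).
        self.weights = {p: 1.0 for p in paths}
        self.min_weight = min_weight
        # Stand-in for ACL/ACE state: flow id -> path currently carrying it.
        self.flow_to_path = {}

    def _load(self, path):
        # Number of flows currently mapped to this path.
        return sum(1 for p in self.flow_to_path.values() if p == path)

    def pick_path(self, exclude=None):
        # Weighted least-loaded choice: lowest flows-per-unit-weight wins,
        # with ties broken by path name so the result is deterministic.
        candidates = [p for p in self.weights if p != exclude]
        return min(candidates, key=lambda p: (self._load(p) / self.weights[p], p))

    def assign(self, flow_id):
        path = self.pick_path()
        self.flow_to_path[flow_id] = path
        return path

    def report_event(self, path, congested):
        # Assumed adjustment rule: halve the weight on a congestion event,
        # recover by 0.1 per healthy report, clamped to [min_weight, 1.0].
        w = self.weights[path]
        if congested:
            self.weights[path] = max(self.min_weight, w * 0.5)
        else:
            self.weights[path] = min(1.0, w + 0.1)

    def rebalance(self, threshold=0.5):
        # Move flows off any path whose weight has dropped below the threshold.
        if len(self.weights) < 2:
            return []
        moved = []
        for flow, path in sorted(self.flow_to_path.items()):
            if self.weights[path] < threshold:
                new_path = self.pick_path(exclude=path)
                self.flow_to_path[flow] = new_path  # real system: rewrite ACL/ACE
                moved.append((flow, path, new_path))
        return moved


selector = WeightedPathSelector(["spine1", "spine2"])
for flow in ["f1", "f2", "f3", "f4"]:
    selector.assign(flow)
selector.report_event("spine1", congested=True)
selector.report_event("spine1", congested=True)
print(selector.rebalance())
```

In this sketch, two congestion reports lower the weight of `spine1` below the threshold, so the flows it carried are remapped to `spine2`. A real deployment would derive such events from data path telemetry and would likely use a richer weight model than the one assumed here.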

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Thirumurthi, Rajendra Kumar and Kumar, Deepak, "DYNAMIC FLOW ASSIGNMENT IN AI/ML APPLICATIONS BASED ON OBSERVABILITY AND TELEMTRY INFORMATION", Technical Disclosure Commons, (May 26, 2022)
https://www.tdcommons.org/dpubs_series/5171

Technical Disclosure Commons

Defensive Publications Series

DYNAMIC FLOW ASSIGNMENT IN AI/ML APPLICATIONS BASED ON OBSERVABILITY AND TELEMTRY INFORMATION
