Abstract
Artificial intelligence (AI) fabrics for distributed training and inference often retain stale communication state after a job, worker, endpoint, or session has failed, completed, or stopped making progress. This stale state can waste resources, preserve bad path choices, and reduce accelerator utilization. Proposed herein are techniques to facilitate lifecycle-aware management of communication state in AI fabrics. The proposed techniques use cross-layer telemetry to classify communication state as active, dormant, degraded, or orphaned, then safely quarantine and reclaim only the state that is truly invalid. Unlike generic telemetry analytics or rerouting, the proposed techniques add workload-aware validation, post-remediation verification, and rollback to enable safe lifecycle-aware cleanup of AI-fabric communication state.
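The classify-then-quarantine flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `Telemetry` fields, the thresholds, and the `reclaim_if_safe` helper are all hypothetical names introduced here, and the abstract does not specify which signals or rules are used.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    DORMANT = "dormant"
    DEGRADED = "degraded"
    ORPHANED = "orphaned"


@dataclass
class Telemetry:
    # Hypothetical cross-layer signals (assumptions, not from the disclosure).
    owner_alive: bool    # workload layer: is the owning job/worker still listed?
    idle_seconds: float  # transport layer: time since last traffic on this state
    error_rate: float    # fabric layer: fraction of recent operations that failed


def classify(t: Telemetry, idle_limit: float = 60.0,
             error_limit: float = 0.05) -> State:
    """Map one telemetry sample to a lifecycle class (thresholds are assumed)."""
    if not t.owner_alive:
        return State.ORPHANED
    if t.error_rate > error_limit:
        return State.DEGRADED
    if t.idle_seconds > idle_limit:
        return State.DORMANT
    return State.ACTIVE


def reclaim_if_safe(first: Telemetry, recheck: Telemetry) -> str:
    """Quarantine state that looks orphaned, re-validate, then reclaim or roll back.

    `recheck` stands in for a second sample taken while the state is quarantined.
    """
    if classify(first) is not State.ORPHANED:
        return "kept"            # only orphaned state is a reclamation candidate
    if classify(recheck) is State.ORPHANED:
        return "reclaimed"       # still invalid after validation: safe to free
    return "rolled_back"         # owner reappeared: restore the quarantined state


if __name__ == "__main__":
    gone = Telemetry(owner_alive=False, idle_seconds=300.0, error_rate=0.0)
    back = Telemetry(owner_alive=True, idle_seconds=2.0, error_rate=0.0)
    print(reclaim_if_safe(gone, gone))
    print(reclaim_if_safe(gone, back))
```

The point of the two-step shape is that dormant or degraded state is never freed directly; only state that stays orphaned across a second check is reclaimed, and anything else is restored.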
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
GOVINDRAJ, PRADEEP BHAGYA, "CROSS-LAYER DETECTION, QUARANTINE, AND SAFE RECLAMATION OF STALE COMMUNICATION STATE IN AI FABRICS", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10364