Abstract

The present disclosure is directed to an error-aware deduplication system and method for high-performance computational accelerators, such as application-specific integrated circuits (ASICs), graphics processing units (GPUs), and other hardware accelerators, particularly those supporting generative artificial intelligence (GenAI) and machine learning (ML) workloads. The disclosure introduces a filtering mechanism that evaluates the specific types of errors occurring within a deduplication period. By prioritizing error severity over chronological order, the system can ensure that high-priority failures, such as hardware (HW) errors, are surfaced even if they are preceded by filtered or less significant events. The disclosure details two primary embodiments: a two-scan process that uses a Boolean presence table to identify and then locate severe errors, and a computationally efficient single-scan approach that uses an integer tracking array to record the first occurrence of each error type in one pass. The system can also track environmental state, monitoring for workload rescheduling, reboots, and repairs, so that deduplication is applied only to continuous error propagations. By analyzing error cascades in this way, the disclosure improves the accuracy of root-cause diagnostics and the maintenance of cloud-based accelerator clusters.
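
The two embodiments named in the abstract can be illustrated with a minimal Python sketch. This is not the disclosure's implementation: the `ErrorType` enum, its three-level severity ordering, and the function names `two_scan` and `single_scan` are assumptions made for illustration, and each call is assumed to receive the events of one deduplication period. Both functions return the index of the first occurrence of the most severe error type in that period.

```python
from enum import IntEnum
from typing import Optional, Sequence


class ErrorType(IntEnum):
    # Hypothetical severity ordering: a higher value means a more severe error.
    INFO = 0
    SOFTWARE = 1
    HARDWARE = 2


def two_scan(events: Sequence[ErrorType]) -> Optional[int]:
    """Two-scan sketch: Boolean presence table, then a second pass to locate."""
    present = [False] * len(ErrorType)
    for event in events:  # scan 1: record which error types occurred
        present[event] = True

    worst = None
    for error_type in sorted(ErrorType, reverse=True):
        if present[error_type]:
            worst = error_type
            break
    if worst is None:  # empty period, nothing to surface
        return None

    for index, event in enumerate(events):  # scan 2: locate the severe error
        if event == worst:
            return index
    return None


def single_scan(events: Sequence[ErrorType]) -> Optional[int]:
    """Single-scan sketch: integer array of first-occurrence indices."""
    first_seen = [-1] * len(ErrorType)  # -1 marks "type not seen"
    for index, event in enumerate(events):
        if first_seen[event] == -1:
            first_seen[event] = index

    # Pick the most severe type that was seen; no second pass over events.
    for error_type in sorted(ErrorType, reverse=True):
        if first_seen[error_type] != -1:
            return first_seen[error_type]
    return None


period = [
    ErrorType.SOFTWARE,
    ErrorType.INFO,
    ErrorType.HARDWARE,
    ErrorType.SOFTWARE,
    ErrorType.HARDWARE,
]
surfaced_two_scan = two_scan(period)
surfaced_single_scan = single_scan(period)
```

In this sketch a purely chronological deduplicator would surface index 0 (the software error), while both severity-aware variants surface the hardware error at index 2. The single-scan variant trades the Boolean table for an integer array so that the position is already known when the pass ends.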

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
