Defensive Publications Series

Error Type Aware Deduping For GenAI/ML Workloads On Computational Accelerators

Abstract

The present disclosure is directed to an error-aware deduplication system and method for high-performance computational accelerators, such as application-specific integrated circuits (ASICs), graphics processing units (GPUs), and other hardware accelerators, particularly those supporting generative artificial intelligence (GenAI) and machine learning (ML) workloads. The disclosure introduces a filtering mechanism that evaluates the specific types of errors occurring within a deduplication period. By prioritizing error severity over chronological order, the system can ensure that high-priority failures, such as hardware (HW) errors, are surfaced even if they are preceded by filtered or less significant events. The disclosure details two primary embodiments: a two-scan process that uses a Boolean presence table to identify which error types occurred and then locate the most severe one, and a computationally efficient single-scan approach that uses an integer tracking array to record the first occurrence of each error type in one pass. The system can also track environmental state, monitoring for workload rescheduling, reboots, and repairs, so that deduplication is applied only to continuous error propagations. By analyzing error cascades in this way, the disclosure improves the accuracy of root-cause diagnostics and the overall maintenance of cloud-based accelerator clusters.
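
The two embodiments named in the abstract can be illustrated with a minimal Python sketch. It is not the disclosure's implementation. The `ErrorType` names, their severity ordering (lower value means more severe), and the function names `pick_two_scan` and `pick_single_scan` are assumptions introduced only for illustration. Each function takes the errors observed in one deduplication period, in chronological order, and returns the index of the error to surface.

```python
from enum import IntEnum
from typing import Optional, Sequence


class ErrorType(IntEnum):
    """Hypothetical error types; a lower value means higher severity."""
    HW = 0
    DRIVER = 1
    SOFTWARE = 2


def pick_two_scan(window: Sequence[ErrorType]) -> Optional[int]:
    """Two-scan variant: build a Boolean presence table, then locate the error."""
    present = [False] * len(ErrorType)
    for err in window:                        # scan 1: which types occurred?
        present[err] = True
    for err_type in ErrorType:                # iterate most severe type first
        if present[err_type]:
            for i, err in enumerate(window):  # scan 2: first index of that type
                if err == err_type:
                    return i
    return None                               # empty window: nothing to surface


def pick_single_scan(window: Sequence[ErrorType]) -> Optional[int]:
    """Single-scan variant: integer array holding first-occurrence indices."""
    first_seen = [-1] * len(ErrorType)        # -1 marks "type not seen"
    for i, err in enumerate(window):          # the only pass over the window
        if first_seen[err] == -1:
            first_seen[err] = i
    for err_type in ErrorType:                # iterate most severe type first
        if first_seen[err_type] != -1:
            return first_seen[err_type]
    return None


# A HW error preceded by less severe events is still the one surfaced.
window = [ErrorType.SOFTWARE, ErrorType.DRIVER, ErrorType.HW, ErrorType.SOFTWARE]
chosen = pick_single_scan(window)
```

Both functions select the same event. The single-scan variant avoids rereading the window because the tracking array already stores where each type first appeared. Environmental state tracking (rescheduling, reboots, repairs) is outside this sketch; under the same assumptions, it would decide where one window ends and the next begins.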

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

N/A and N/A, "Error Type Aware Deduping For GenAI/ML Workloads On Computational Accelerators", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10005
