Inventor(s)

Abstract

A system and method are described for a statistical testing and analysis framework configured for computational accelerators. The framework facilitates programmable error severity settings, firmware status reporting, and configuration percentage rollouts. Techniques include using a management controller to append firmware versions to error logs, employing error mask maps provided by a user-space driver to mask firmware-level errors, and utilizing configurable error severity levels to manage how exceptions are handled without stopping active machine learning workloads. Keywords: computational accelerator, firmware testing, error masking, configuration rollout, automated analytics.

Creative Commons License

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.

Share

COinS