Abstract
This disclosure describes a system and method for a statistical testing and analysis framework for computational accelerators. The framework supports programmable error severity settings, firmware status reporting, and percentage-based configuration rollouts. Techniques include using a management controller to append firmware versions to error logs, employing error mask maps provided by a user-space driver to mask firmware-level errors, and using configurable error severity levels to control how exceptions are handled without stopping active machine learning workloads.
Keywords: computational accelerator, firmware testing, error masking, configuration rollout, automated analytics.
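The interaction between an error mask map, configurable severity levels, and firmware-version tagging of error logs might be sketched as follows. This is only an illustration under stated assumptions: the names `Severity`, `ErrorPolicy`, and `handle_error`, the three severity levels, the bitmask encoding of the mask map, and the version string `fw-1.2.3` are all hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical levels; the disclosure does not enumerate them.
    INFO = 0      # record only
    WARNING = 1   # record and flag; the workload keeps running
    FATAL = 2     # record and signal that the workload should stop


@dataclass
class ErrorPolicy:
    # Assumed encoding: bit i set means error code i is masked.
    mask_map: int = 0
    # Per-error-code severity overrides; unlisted codes use the default.
    severity: dict = field(default_factory=dict)
    default_severity: Severity = Severity.FATAL


def handle_error(code, policy, firmware_version, log):
    """Process one error code; return True if the workload may keep running."""
    if policy.mask_map & (1 << code):
        return True  # masked: neither logged nor escalated
    level = policy.severity.get(code, policy.default_severity)
    # The firmware version is appended to every log entry.
    log.append({
        "code": code,
        "severity": level.name,
        "firmware_version": firmware_version,
    })
    return level < Severity.FATAL


# Mask error code 2 and downgrade error code 1 to a warning.
policy = ErrorPolicy(mask_map=0b0100, severity={1: Severity.WARNING})
log = []
results = [handle_error(c, policy, "fw-1.2.3", log) for c in (1, 2, 3)]
```

In this sketch, error code 1 is logged as a warning and the workload continues, error code 2 is masked and never reaches the log, and error code 3 falls back to the default severity and signals a stop.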
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
N/A and N/A, "Programmable Error Severity Setting and Firmware Status Reporting for Testing Accelerator Platforms", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10369