Inventor(s)

N/A

Abstract

Multi-node machines, often deployed in high-value cloud and machine learning (ML) or artificial intelligence (AI) server fleets, can incur high mean time-to-repair (MTTR) and downtime costs. This disclosure describes techniques to expedite repair validation in multi-node servers by tailoring installation and test procedures to the repair actions performed. Based on component-to-node maps, the repair actions proposed by a diagnoser are used to compute the subset of nodes to be installed and tested. Installation and testing are then performed at the submachine level, on only the subset of nodes relevant to the performed repair actions. Testing is split into a focused test, which targets the portion of the machine directly impacted by the repair actions, and a complementary test, which augments the focused test to provide full coverage. The techniques can significantly improve the availability (by reducing MTTR) and capacity of multi-node machines, especially those deployed for AI/ML computation, where a single machine has a significant impact on cluster-level availability.
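
The node-subset computation and the focused/complementary test split described in the abstract might be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosure's implementation: the component names, node names, test names, the function names `affected_nodes` and `split_tests`, and the fallback to all nodes for an unmapped component are all hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical component-to-node map: the nodes each component can affect.
COMPONENT_TO_NODES: Dict[str, Set[str]] = {
    "dimm_3": {"node0"},
    "nic_1": {"node1"},
    "interconnect_cable_a": {"node2", "node3"},
    "shared_psu": {"node0", "node1", "node2", "node3"},
}
ALL_NODES: Set[str] = {"node0", "node1", "node2", "node3"}


def affected_nodes(repaired_components: Iterable[str],
                   component_to_nodes: Dict[str, Set[str]],
                   all_nodes: Set[str]) -> Set[str]:
    """Return the union of nodes mapped to the repaired components.

    Assumption: a component missing from the map is treated conservatively,
    so the whole machine is selected.
    """
    nodes: Set[str] = set()
    for component in repaired_components:
        if component not in component_to_nodes:
            return set(all_nodes)
        nodes |= component_to_nodes[component]
    return nodes


def split_tests(tests: Dict[str, Set[str]],
                focus_nodes: Set[str]) -> Tuple[List[str], List[str]]:
    """Split tests (name -> nodes exercised) into focused and complementary.

    Focused tests touch at least one affected node; complementary tests are
    the remainder, so the two lists together cover every test.
    """
    focused = sorted(name for name, nodes in tests.items() if nodes & focus_nodes)
    complementary = sorted(name for name in tests if name not in focused)
    return focused, complementary


# Hypothetical repair: one DIMM and one interconnect cable were replaced.
nodes = affected_nodes(["dimm_3", "interconnect_cable_a"],
                       COMPONENT_TO_NODES, ALL_NODES)
tests = {
    "mem_node0": {"node0"},
    "mem_node1": {"node1"},
    "net_node1": {"node1"},
    "link_2_3": {"node2", "node3"},
}
focused, complementary = split_tests(tests, nodes)
```

In this sketch, the focused list would run first on the affected nodes, and the complementary list would supply the remaining coverage.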

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
