Abstract

For servers that run mission-critical workloads, downtime is not an option: an outage usually translates to reduced revenue, reduced profitability, and potential customer loss. Any interruption in the operation or availability of these workloads has a ripple effect throughout the organization. Gathering valid and necessary data about the failure event from all possible sources plays a significant role in determining how quickly and accurately the root cause of the server downtime is identified. The data required for such analysis is spread across the firmware and the operating system (OS) and comes from different sources on the server. It comprises data collected and logged by the firmware, such as error log buffers and event logs, as well as the state of the system at the time of failure, which the operating system captures in core dump files. The most common challenge is collecting this set of interdependent information, which originates and is stored at different locations on the system. The proposed solution enables a high-availability design by eliminating single points of failure in the log collection and retrieval process. This disclosure proposes a method and apparatus for faster, reliable, and consolidated logging of the necessary data from different sources when a system failure occurs.

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.
