In a microservice architecture, a user request can pass through many servers owned by several different teams before a response is returned, and a failure in any one of those servers can cause the request to fail. Troubleshooting an outage that affects the end-user experience can therefore involve multiple teams and take a substantial amount of time. This disclosure describes techniques to rapidly localize the root cause of a customer-facing failure to node(s) deep within the infrastructure of the service. Per the techniques, end-user product teams mark requests with metadata known as critical user interactions (CUI), and the metadata is propagated along with the request. Performance metrics are gathered from the servers that the requests pass through, and each metric is keyed by CUI, server node, and peer node for every adjacent pair of nodes. Taken together, these piecemeal metrics offer end-to-end visibility for a set of requests grouped by the CUI of the end product, enabling rapid and automatic triage of an outage to an interior server without requiring domain expertise in the product or the server.
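
The keying scheme described above can be illustrated with a minimal sketch. It is not the disclosure's implementation: the class names (`EdgeStats`, `CuiMetrics`), the choice of error rate as the performance metric, the 5% threshold, and the "follow the unhealthiest edge" triage heuristic are all assumptions made for illustration. Each metric is stored under a (CUI, server node, peer node) key, and triage walks from the entry node toward the deepest node whose incoming edge looks unhealthy for the given CUI.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EdgeStats:
    """Hypothetical per-edge counters; error rate stands in for the metric."""
    requests: int = 0
    errors: int = 0

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


class CuiMetrics:
    """Metrics keyed by (CUI, server node, peer node) for adjacent node pairs."""

    def __init__(self):
        self.edges = defaultdict(EdgeStats)

    def record(self, cui, server, peer, ok):
        # 'server' reports the outcome of one call it made to downstream 'peer'
        # on behalf of a request tagged with 'cui'.
        stats = self.edges[(cui, server, peer)]
        stats.requests += 1
        if not ok:
            stats.errors += 1

    def triage(self, cui, entry, threshold=0.05):
        """Follow the unhealthiest outgoing edge for this CUI until no
        downstream edge exceeds the threshold (an assumed heuristic)."""
        node = entry
        visited = {entry}
        while True:
            candidates = [
                (stats.error_rate, peer)
                for (c, server, peer), stats in self.edges.items()
                if c == cui and server == node and peer not in visited
                and stats.error_rate > threshold
            ]
            if not candidates:
                return node  # no unhealthy downstream edge: stop here
            _, node = max(candidates)
            visited.add(node)


# Hypothetical topology: frontend -> cart -> inventory -> db.
# The inventory node fails 4 of 10 "checkout" requests; its calls to db succeed.
metrics = CuiMetrics()
for i in range(10):
    failed = i < 4
    metrics.record("checkout", "frontend", "cart", ok=not failed)
    metrics.record("checkout", "cart", "inventory", ok=not failed)
    metrics.record("checkout", "inventory", "db", ok=True)

print(metrics.triage("checkout", "frontend"))  # inventory
```

Because every edge is keyed by CUI, the walk only considers traffic belonging to the affected user interaction, which is what lets the failure be narrowed to an interior node without knowledge of the product or the servers involved. If no edge for the CUI exceeds the threshold, this sketch simply returns the entry node.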

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.