Inventor(s)

N/A

Abstract

In cloud networks, numerous safety constraints are applied before network components that provide critical functions are taken out of service. To support these constraints, local health status is collected and monitored for network components and services. However, reliance on local health status leads to decisions that are locally optimal but not globally optimal. For instance, if all services are functioning with degradation, all services may declare themselves unhealthy. Depending on the policy of the load balancer, all or most servers may then be taken out of service, causing a global outage. The safety constraints can therefore fail to operate properly when a component or service reaches a conclusion about its own health that is locally correct but contributes to a global failure.
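The failure mode described above can be illustrated with a minimal sketch. The `in_service` function and the server names are hypothetical, and the policy shown (drop every server that reports unhealthy) is only one assumed load-balancer policy, not one specified by the disclosure.

```python
def in_service(local_health):
    """Assumed load-balancer policy: keep only servers that self-report healthy."""
    return [name for name, healthy in local_health.items() if healthy]


# Every server is degraded but still able to serve, and each one
# independently declares itself unhealthy based on its local view.
local_health = {"s1": False, "s2": False, "s3": False}

remaining = in_service(local_health)
print(remaining)  # [] : no server is left in service, i.e. a global outage
```

Each individual decision is defensible on its own, yet together they remove all capacity.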

Each component across multiple clusters can be randomly assigned to a group, and components within the same group can probe each other’s health checks. Then, for each member of the group, a consensus protocol can be applied to reach agreement on the health status of that member. In addition, all service instances can be dynamically assigned to a group of peers. Within each group of peers, each member can probe the self-reported health check of every other member. An agreement protocol can then be used to agree on a strategy for deciding which services are considered globally healthy. Each service can then apply the agreed-upon strategy to identify the subset of services that are globally healthy. Thus, each service can respond to probes from load balancers, monitors, and the like based on the global health of services. Because peers probe the same endpoint as that probed by load balancers and monitors, the decision on the global health of a service is observable by the rest of the peers in the group.
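The steps above (random grouping, peer probing, and applying an agreed strategy) might be sketched as follows. This is a simplified illustration, not the disclosed protocol: all function names are hypothetical, a plain majority vote stands in for the consensus protocol, and the "keep at least a minimum fraction in service" rule is one assumed example of a strategy the peers could agree on.

```python
import math
import random


def assign_groups(instances, group_size, seed=None):
    """Randomly partition instances into peer groups of about group_size."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    return [shuffled[i:i + group_size]
            for i in range(0, len(shuffled), group_size)]


def collect_votes(group, probe):
    """Each member probes every other member: {target: [bool, ...]}."""
    return {
        target: [probe(prober, target) for prober in group if prober != target]
        for target in group
    }


def globally_healthy(votes, min_fraction=0.5):
    """Majority vote per member, with a floor on how many stay in service."""
    healthy = {m for m, v in votes.items() if sum(v) * 2 > len(v)}
    floor = math.ceil(min_fraction * len(votes))
    if len(healthy) < floor:
        # Rank by healthy votes (descending), then by name, so that every
        # peer applying the same strategy selects the same subset.
        ranked = sorted(votes, key=lambda m: (-sum(votes[m]), m))
        for member in ranked:
            if len(healthy) >= floor:
                break
            healthy.add(member)
    return healthy


# Usage: all four peers are degraded and self-report unhealthy.
group = ["a", "b", "c", "d"]
self_reports = {"a": False, "b": False, "c": False, "d": False}
votes = collect_votes(group, lambda prober, target: self_reports[target])
print(sorted(globally_healthy(votes)))  # ['a', 'b']
```

Under this assumed strategy, half of the group keeps answering probes as healthy even though every member is locally degraded, so a load balancer probing the same endpoint would not remove all capacity at once.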

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
