Detecting problems in new releases of server software is difficult because some classes of errors can be caused by incorrect user actions yet are indistinguishable from true software bugs. This disclosure describes techniques to determine whether an elevated error rate is due to a bug in a new server release or to customer action, enabling administrators to gauge server health accurately, to roll back when a genuine bug is detected, and to roll out software reliably. Customers are aggregated into buckets by uniformly hashing customer identifiers. Errors generated while processing a customer's request are attributed to that customer's bucket, and the relative error rate of each bucket is tracked. A rise in error rate across a significant number of customer buckets that coincides with a new software release likely indicates a bug. Because customer identifiers are hashed uniformly, a single customer's errors land in a single bucket, so a rise across many buckets is unlikely to be attributable to deliberate or inadvertent actions by individual customers.
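
The bucketing and detection steps described above can be illustrated with a minimal Python sketch. It is not the disclosure's implementation. The bucket count, the choice of SHA-256 as the uniform hash, the function names (`bucket_for`, `bucket_error_rates`, `likely_release_bug`), and the thresholds `min_rise` and `min_fraction` are all assumptions made for illustration.

```python
import hashlib

NUM_BUCKETS = 16  # assumed bucket count; the disclosure does not fix one


def bucket_for(customer_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a customer identifier to a bucket with a uniform hash (SHA-256 here)."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def bucket_error_rates(requests, num_buckets: int = NUM_BUCKETS):
    """Compute per-bucket error rates from (customer_id, is_error) pairs."""
    totals = [0] * num_buckets
    errors = [0] * num_buckets
    for customer_id, is_error in requests:
        b = bucket_for(customer_id, num_buckets)
        totals[b] += 1
        if is_error:
            errors[b] += 1
    # Buckets with no traffic report a rate of 0.0.
    return [errors[b] / totals[b] if totals[b] else 0.0
            for b in range(num_buckets)]


def likely_release_bug(before, after, min_rise=0.05, min_fraction=0.5):
    """Flag a probable bug when enough buckets show a rise after a release.

    before/after: per-bucket error rates measured before and after the release.
    min_rise and min_fraction are illustrative thresholds, not source values.
    """
    elevated = sum(1 for b, a in zip(before, after) if a - b >= min_rise)
    return elevated / len(before) >= min_fraction
```

Under these assumptions, a single misbehaving customer raises the error rate of only one bucket, so `likely_release_bug` stays false, while a release bug that affects all customers raises most buckets at once and trips the check.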

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.