A mechanism is provided for debugging large-scale data pipelines by sampling their inputs and outputs with consistent hashing. Given the same input set and the same machine learning model, the mechanism can ensure the same output set across different pipelines. It computes a consistent hash of the inputs and uses it to produce a consistent sample (e.g., the same subset) of events in both the input and the output, over which alignment is computed. By tracking the alignment of the input and output sets throughout the pipeline, the mechanism identifies bugs and determines exactly where misalignment is introduced.
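The sampling and alignment idea can be illustrated with a minimal Python sketch. It is not taken from the disclosure: the hash function (SHA-256), the salt, the sampling rate, and the names `in_sample`, `sample_ids`, and `misaligned` are all assumptions chosen for illustration. The point it shows is that a deterministic hash of an event identifier selects the same subset of events at every stage, so the sampled sets of two stages can be compared directly.

```python
import hashlib


def in_sample(event_id: str, rate: float = 0.01, salt: str = "debug-sample") -> bool:
    """Decide deterministically whether an event belongs to the sample.

    Hypothetical helper: the salt and the choice of SHA-256 are assumptions.
    The decision depends only on the event ID, so every stage that sees the
    same event makes the same choice.
    """
    digest = hashlib.sha256(f"{salt}:{event_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < int(rate * 2**64)


def sample_ids(event_ids, rate: float = 0.01) -> set:
    """Return the sampled subset of event IDs."""
    return {e for e in event_ids if in_sample(e, rate)}


def misaligned(stage_a_ids, stage_b_ids, rate: float = 0.01) -> set:
    """Return sampled event IDs present at one stage but not the other."""
    return sample_ids(stage_a_ids, rate) ^ sample_ids(stage_b_ids, rate)
```

Running `misaligned` between each pair of consecutive stages would, under these assumptions, point to the first stage at which a sampled event is lost or unexpectedly added. An empty result means the two stages agree on the sample.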
Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.
Skvortsov, Evgeny and Nguyen, Long, "DEBUGGING LARGE-SCALE DATA PIPELINES WITH CONSISTENT HASHING", Technical Disclosure Commons, (March 22, 2018)