A Framework For Large-Scale Fault-Tolerant Data Pipelines

Long Nguyen
Zachary Frazier


A system and a method for frequent and fault-tolerant running of large-scale data pipelines across data centers are disclosed. The system includes one or more data centers, each having multiple servers, with a framework including a mutex design pattern for multi-binary data processing. The mutex design includes a minimal locking scheme where locks are acquired on the outputs of the stages on startup from a distributed locking service, and are released upon failure or successful writing of the output. A multi-homed pipeline may be utilized to make the system fault-tolerant. The method wraps each binary in the pipeline with locking on output data tables. The locking scheme guarantees that instances of the same binary will not interfere with each other. The disclosed system and method automatically coordinates pipelines in case of a failure or data delay to produce consistent and replicated output data tables on time.