Inventor(s)

ALOK ROY, VISA

Abstract

The present disclosure relates to a method and system for checkpointing and state synchronization to provide fault tolerance in long-running MapReduce jobs. The method integrates a checkpointing mechanism into both the Map and Reduce phases of a MapReduce job, capturing and storing critical data and metadata at key intervals. The method also replicates these checkpoints across all clusters in an active-active setup, so that any cluster can access the most recent checkpoint and resume the job after a failure or re-routing. Finally, the method synchronizes the checkpoints across clusters before the job proceeds, providing a consistent and reliable recovery point. The present disclosure thereby improves fault tolerance and enables more efficient processing of MapReduce jobs in distributed environments.
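The checkpoint, replicate, and synchronize steps named in the abstract can be illustrated with a small in-memory sketch. It is not the disclosed implementation: the names `Checkpoint`, `Cluster`, `replicate_and_sync`, and `run_word_count` are hypothetical, clusters are modeled as plain Python objects rather than real Hadoop clusters, and the Map and Reduce phases are collapsed into a single word-count loop for brevity. The `interval` and `fail_after` parameters are likewise assumptions, used only to show a job failing on one cluster and resuming on another from the last synchronized checkpoint.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Checkpoint:
    job_id: str
    phase: str                # which phase wrote the checkpoint, e.g. "map"
    sequence: int             # increasing checkpoint number for the job
    records_done: int         # input records fully processed so far
    partial: Dict[str, int]   # partial word counts at that point


@dataclass
class Cluster:
    """Hypothetical stand-in for one cluster's checkpoint store."""
    name: str
    store: Dict[str, Checkpoint] = field(default_factory=dict)

    def save(self, cp: Checkpoint) -> None:
        self.store[cp.job_id] = cp

    def latest(self, job_id: str) -> Optional[Checkpoint]:
        return self.store.get(job_id)


def replicate_and_sync(cp: Checkpoint, clusters: List[Cluster]) -> None:
    """Copy the checkpoint to every cluster, then verify all copies match."""
    for cluster in clusters:
        cluster.save(cp)
    # Barrier: the job may continue only if every cluster holds this checkpoint.
    if any(cluster.latest(cp.job_id) != cp for cluster in clusters):
        raise RuntimeError("checkpoint not synchronized across clusters")


def run_word_count(job_id: str, records: List[str], clusters: List[Cluster],
                   runner: Cluster, interval: int = 2,
                   fail_after: Optional[int] = None) -> Dict[str, int]:
    """Count words on `runner`, resuming from its latest checkpoint if any."""
    prev = runner.latest(job_id)
    start = prev.records_done if prev else 0
    counts = dict(prev.partial) if prev else {}
    seq = prev.sequence + 1 if prev else 0
    for i in range(start, len(records)):
        if fail_after is not None and i >= fail_after:
            # Simulated cluster failure before record i is processed.
            raise RuntimeError(f"{runner.name} failed at record {i}")
        for word in records[i].split():
            counts[word] = counts.get(word, 0) + 1
        if (i + 1) % interval == 0:
            cp = Checkpoint(job_id, "map", seq, i + 1, dict(counts))
            replicate_and_sync(cp, clusters)
            seq += 1
    return counts
```

In this sketch, work done after the last checkpoint is lost on failure and is redone by the resuming cluster, so a shorter `interval` trades more replication traffic for less repeated work.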

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
