Abstract

The present disclosure relates to distributed data processing, and in particular to storage-aware adaptive shuffle planning (SASP) for optimising shuffles in distributed computation engines. SASP provides a planning layer that restructures shuffle execution according to the capacity the storage backend is observed to deliver. The system collects object-store telemetry, including bandwidth, IOPS, latency, throttling, and concurrency, and uses it to construct a storage pressure vector for each prefix. In parallel, it estimates expected shuffle partition sizes from execution plans. From these inputs, SASP forecasts partition-level strain and generates a Shuffle Prefix Mapping Table (SPMT) that assigns partitions to output prefixes so as to balance I/O load. A plan rewriter modifies execution plans to incorporate storage-aware partitioning. During execution, a custom ShuffleWriter writes data to the assigned prefixes, and a ShuffleReader retrieves it using the same mapping. The system also adapts dynamically, redirecting partitions across prefixes in response to throttling, with the aim of improving performance and reliability.
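The mapping step described above can be illustrated with a minimal sketch. The abstract does not specify how the pressure vector is scored or how partitions are placed, so everything below is an assumption made for illustration: the `PrefixPressure` fields, the `effective_capacity` rule (bandwidth discounted by the throttling rate), the `build_spmt` function, and the prefix names are all hypothetical. The placement strategy shown is a plain greedy heuristic (largest partition first, onto the prefix with the lowest projected drain time), not necessarily the method used in the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrefixPressure:
    """Hypothetical, reduced storage pressure vector for one object-store prefix."""
    prefix: str
    bandwidth_mbps: float  # observed sustainable write bandwidth
    throttle_rate: float   # fraction of recent requests throttled, 0.0 to 1.0

    def effective_capacity(self) -> float:
        # Assumed scoring rule: discount bandwidth by the throttling rate.
        return self.bandwidth_mbps * (1.0 - self.throttle_rate)


def build_spmt(partition_sizes_mb, pressures):
    """Return a dict mapping partition id -> prefix (a toy SPMT).

    Greedy heuristic: take partitions from largest to smallest and place each
    on the prefix whose projected drain time (assigned MB / capacity) would be
    lowest after the placement.
    """
    capacity = {p.prefix: p.effective_capacity() for p in pressures
                if p.effective_capacity() > 0}
    if not capacity:
        raise ValueError("no prefix has usable capacity")

    assigned_mb = {prefix: 0.0 for prefix in capacity}
    spmt = {}
    # Sort by size descending, then by partition id for deterministic output.
    for pid, size in sorted(partition_sizes_mb.items(),
                            key=lambda kv: (-kv[1], kv[0])):
        best = min(capacity,
                   key=lambda pre: ((assigned_mb[pre] + size) / capacity[pre], pre))
        spmt[pid] = best
        assigned_mb[best] += size
    return spmt


# Usage: two prefixes with equal bandwidth, one of them throttled half the time.
pressures = [
    PrefixPressure("shuffle/a/", bandwidth_mbps=100.0, throttle_rate=0.0),
    PrefixPressure("shuffle/b/", bandwidth_mbps=100.0, throttle_rate=0.5),
]
sizes = {0: 400.0, 1: 200.0, 2: 200.0, 3: 100.0}
print(build_spmt(sizes, pressures))
# {0: 'shuffle/a/', 1: 'shuffle/b/', 2: 'shuffle/a/', 3: 'shuffle/b/'}
```

In this example the throttled prefix receives 300 MB and the healthy one 600 MB, so both are projected to drain in the same time. Under the same assumptions, dynamic adaptation could be approximated by re-running the function over the partitions not yet written, using refreshed telemetry.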

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
