Abstract
The present disclosure relates to a method and a system for an HDFS gRPC writer: a unified, high-performance data ingestion system designed to modernize the way both streaming and batch data are written into HDFS. The disclosure proposes a single, scalable gRPC-based ingestion interface capable of handling multiple data formats while efficiently managing batching, schema evolution, fault tolerance, and security. After receiving data, the HDFS gRPC writer addresses the small-files problem at its source by applying size-based and time-based batching, which yields optimal HDFS file sizes without the need for downstream compaction jobs. The system supports both file-based schemas and Schema Registry-based schemas, enabling schema evolution without service downtime. It operates as a long-running, stateful ingestion service optimized for high throughput and low latency, while providing enterprise-grade reliability. Benchmarks indicate a 35% reduction in ingestion latency, a 50% improvement in throughput, and a 90% reduction in file count, yielding significant improvements in HDFS cluster health and operational efficiency.
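The size-based and time-based batching described in the abstract can be illustrated with a minimal sketch. This is not the disclosed implementation: the class name SizeTimeBatcher, the parameters max_bytes and max_age_seconds, and the sink callback are all hypothetical, and the sink merely stands in for whatever component would write one HDFS file per flushed batch. The clock is injectable so the time-based path can be exercised without sleeping.

```python
import time
from typing import Callable, List


class SizeTimeBatcher:
    """Buffers records and flushes when a size or age threshold is reached.

    Hypothetical illustration only; names and thresholds are assumptions.
    """

    def __init__(
        self,
        max_bytes: int,
        max_age_seconds: float,
        sink: Callable[[bytes], None],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_bytes = max_bytes
        self._max_age = max_age_seconds
        self._sink = sink          # assumed to write one file per call
        self._clock = clock
        self._records: List[bytes] = []
        self._size = 0
        self._opened_at = 0.0

    def add(self, record: bytes) -> None:
        """Append a record; flush immediately if the size threshold is met."""
        if not self._records:
            # The batch age is measured from its first record.
            self._opened_at = self._clock()
        self._records.append(record)
        self._size += len(record)
        if self._size >= self._max_bytes:
            self.flush()

    def tick(self) -> None:
        """Flush a non-empty batch that has been open for too long.

        Intended to be called periodically, e.g. from a timer.
        """
        if self._records and self._clock() - self._opened_at >= self._max_age:
            self.flush()

    def flush(self) -> None:
        """Hand the buffered records to the sink as one payload."""
        if not self._records:
            return
        payload = b"".join(self._records)
        self._records = []
        self._size = 0
        self._sink(payload)
```

In a long-running service, add would presumably be driven by incoming gRPC messages and tick by a background timer, so that many small writes are merged into a few larger files while a quiet stream is still flushed within a bounded delay.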
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Khan, Raouf Mr and Yadav, Mahendra Mr, "HDFS gRPC WRITER: A UNIFIED HIGH-PERFORMANCE DATA INGESTION ENGINE", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10449