Abstract
Techniques are described for tiered embedding management in which embedding rows are evicted from GPU memory to NVMe SSDs and prefetched from the SSDs back into GPU memory using GPU-direct storage DMA that bypasses CPU memory. Two page-aligned GPU-resident buffers are registered with a GPU-direct storage interface and alternated by a double-buffer scheduler, such that one buffer serves embedding accesses for model computation while the other is used for in-flight I/O. Eviction uses GPU-direct writes from GPU memory to SSD, and prefetch uses GPU-direct reads into a GPU-resident buffer followed by a scatter to the target GPU addresses. A batch coalescing layer groups small per-row transfers by SSD offset, compacts the data into contiguous GPU regions, and issues fewer, larger I/O operations to reduce per-call overhead. Runtime detection selects the zero-copy transfer mode when GPU-direct storage is available and a fallback transfer mode when it is not.
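The batch coalescing step can be illustrated with a minimal host-side sketch in Python. It is not the disclosure's implementation: the names RowRequest, CoalescedIO, coalesce, and the max_io_bytes cap are hypothetical, and the sketch only plans the merged transfers (it issues no GPU-direct I/O). It assumes rows are merged only when they are exactly contiguous on the SSD.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RowRequest:
    """One small per-row transfer (hypothetical type for illustration)."""
    row_id: int
    ssd_offset: int  # byte offset of the row on the SSD
    length: int      # row size in bytes


@dataclass
class CoalescedIO:
    """One larger I/O covering a contiguous SSD range (hypothetical type)."""
    ssd_offset: int
    length: int
    # (row_id, byte offset of that row inside this I/O's contiguous region)
    rows: List[Tuple[int, int]] = field(default_factory=list)


def coalesce(requests: List[RowRequest],
             max_io_bytes: int = 1 << 20) -> List[CoalescedIO]:
    """Group per-row requests by SSD offset into fewer, larger I/Os.

    Assumption: a row joins the previous I/O only if it starts exactly
    where that I/O ends and the merged size stays within max_io_bytes.
    """
    ops: List[CoalescedIO] = []
    for req in sorted(requests, key=lambda r: r.ssd_offset):
        last = ops[-1] if ops else None
        contiguous = (last is not None
                      and last.ssd_offset + last.length == req.ssd_offset)
        if contiguous and last.length + req.length <= max_io_bytes:
            # Row lands directly after the rows already packed in this I/O.
            last.rows.append((req.row_id, last.length))
            last.length += req.length
        else:
            ops.append(CoalescedIO(req.ssd_offset, req.length,
                                   [(req.row_id, 0)]))
    return ops


if __name__ == "__main__":
    reqs = [
        RowRequest(3, 4096, 512),
        RowRequest(0, 0, 512),
        RowRequest(2, 1024, 512),
        RowRequest(1, 512, 512),
        RowRequest(4, 4608, 512),
    ]
    for op in coalesce(reqs):
        print(op.ssd_offset, op.length, op.rows)
```

In this example five 512-byte row transfers collapse into two I/O operations, one covering bytes 0 to 1535 and one covering bytes 4096 to 5119. The per-row offsets recorded in each operation are what a later scatter step would need to place rows at their target GPU addresses.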
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Anonymous, "Pipelined Double-Buffer Eviction Scheduling with GPUDirect Storage for Zero-Copy Tiered Embedding Management", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10695