Abstract
A GPU-storage I/O isolation architecture is described for distributed embedding training with SSD-backed embedding tables. Storage-related callbacks are registered using a host function launch mechanism that holds a GPU driver mutex only during enqueue, allowing blocking key-value store reads to execute on a CPU thread without stalling other CUDA streams. SSD read completions are delivered through a condition-variable-backed fill queue that wakes a filler thread without polling. Cache eviction is overlapped with prefetch using a double-buffer eviction manager that alternates buffers across training steps while a background thread writes dirty entries back to SSD. A stream isolation scheduler assigns dedicated CUDA streams for storage, computation, and communication with explicit priorities and event-based synchronization, enabling forward/backward kernels and collective communication to proceed concurrently with storage activity.
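The condition-variable-backed fill queue mentioned above can be illustrated with a minimal sketch. It is not the disclosed implementation: the names `FillQueue`, `run_filler`, and `demo`, the `(key, row)` completion format, and the use of a Python dictionary as a stand-in for the GPU-side cache are all assumptions made for illustration. The sketch shows only the wake-without-polling behavior: the filler thread sleeps on a condition variable and is woken when a read completion is pushed.

```python
import threading
from collections import deque


class FillQueue:
    """Hypothetical completion queue; a consumer blocks until work arrives."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()
        self._closed = False

    def push(self, completion):
        # Called by whichever thread observes an SSD read completing.
        with self._cond:
            self._items.append(completion)
            self._cond.notify()  # wake one waiting filler thread

    def close(self):
        # Signal that no further completions will be pushed.
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def pop(self):
        # Block (no busy-wait) until a completion exists or the queue is closed.
        with self._cond:
            while not self._items and not self._closed:
                self._cond.wait()
            if self._items:
                return self._items.popleft()
            return None  # closed and fully drained


def run_filler(queue, cache):
    # Filler thread body: copy each completed read into the cache stand-in.
    while True:
        item = queue.pop()
        if item is None:
            return
        key, row = item
        cache[key] = row


def demo():
    queue = FillQueue()
    cache = {}  # assumption: stands in for the embedding cache
    filler = threading.Thread(target=run_filler, args=(queue, cache))
    filler.start()
    for key in range(4):
        queue.push((key, [float(key)] * 2))  # simulated SSD read completions
    queue.close()
    filler.join()
    return cache
```

Calling `demo()` pushes four simulated completions and returns the filled cache once the filler thread has drained the queue. The `while` loop around `wait()` guards against spurious wakeups, and `pop()` drains remaining items before reporting closure so that completions pushed just before `close()` are not lost.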
Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
Recommended Citation
Anonymous, "GPU-Storage I/O Isolation for Distributed Embedding Training Systems", Technical Disclosure Commons, ()
https://www.tdcommons.org/dpubs_series/10697