Inventor(s)

Abstract

A GPU-storage I/O isolation architecture is described for distributed embedding training with SSD-backed embedding tables. Storage-related callbacks are registered using a host function launch mechanism that holds a GPU driver mutex only during enqueue, allowing blocking key-value store reads to execute on a CPU thread without stalling other CUDA streams. SSD read completions are delivered through a condition-variable-backed fill queue that wakes a filler thread without polling. Cache eviction is overlapped with prefetch using a double-buffer eviction manager that alternates buffers across training steps while a background thread writes dirty entries back to SSD. A stream isolation scheduler assigns dedicated CUDA streams for storage, computation, and communication with explicit priorities and event-based synchronization, enabling forward/backward kernels and collective communication to proceed concurrently with storage activity.
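As an illustration of the condition-variable-backed fill queue mentioned above, the following is a minimal host-side C++17 sketch, not the disclosed implementation. The names `ReadCompletion`, `FillQueue`, and `run_fill_demo` are hypothetical, the SSD reads are simulated by a loop on the calling thread, and no GPU or key-value store is involved. It shows only the wake-without-polling behavior: the filler thread sleeps inside `pop()` until a completion is pushed or the queue is closed.

```cpp
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <optional>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical record for one finished SSD read (assumed layout).
struct ReadCompletion {
    std::uint64_t key;        // embedding row id
    std::vector<float> row;   // row contents read from storage
};

// Blocking queue: producers push completions, the filler thread sleeps in
// pop() until there is work or the queue has been closed.
class FillQueue {
public:
    void push(ReadCompletion c) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            items_.push_back(std::move(c));
        }
        cv_.notify_one();  // wake the filler thread if it is waiting
    }

    void close() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            closed_ = true;
        }
        cv_.notify_all();
    }

    // Returns the next completion, or std::nullopt once the queue is
    // closed and fully drained.
    std::optional<ReadCompletion> pop() {
        std::unique_lock<std::mutex> lock(mu_);
        // The predicate form re-checks the condition after spurious wakeups.
        cv_.wait(lock, [this] { return closed_ || !items_.empty(); });
        if (items_.empty()) {
            return std::nullopt;
        }
        ReadCompletion c = std::move(items_.front());
        items_.pop_front();
        return c;
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<ReadCompletion> items_;
    bool closed_ = false;
};

// Pushes n simulated completions and returns how many the filler consumed.
std::size_t run_fill_demo(std::uint64_t n) {
    FillQueue queue;
    std::size_t filled = 0;
    std::thread filler([&] {
        while (auto c = queue.pop()) {
            ++filled;  // a real filler would copy c->row into the cache here
        }
    });
    for (std::uint64_t i = 0; i < n; ++i) {
        queue.push(ReadCompletion{i, std::vector<float>(4, 1.0f)});
    }
    queue.close();
    filler.join();
    return filled;
}
```

In a full system the producer side would presumably be the storage callback thread rather than a loop, and the filler would stage rows for transfer on the dedicated storage stream; both are outside the scope of this sketch.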

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
