Abstract

A GPU-storage I/O isolation architecture is described for distributed embedding training with SSD-backed embedding tables. Storage-related callbacks are registered using a host function launch mechanism that holds a GPU driver mutex only during enqueue, allowing blocking key-value store reads to execute on a CPU thread without stalling other CUDA streams. SSD read completions are delivered through a condition-variable-backed fill queue that wakes a filler thread without polling. Cache eviction is overlapped with prefetch using a double-buffer eviction manager that alternates buffers across training steps while a background thread writes dirty entries back to SSD. A stream isolation scheduler assigns dedicated CUDA streams for storage, computation, and communication with explicit priorities and event-based synchronization, enabling forward/backward kernels and collective communication to proceed concurrently with storage activity.
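The condition-variable-backed fill queue mentioned above can be illustrated with a minimal sketch. It is not the disclosure's implementation: the names `FillQueue`, `run_filler`, and `demo` are hypothetical, a plain dictionary stands in for the GPU-side embedding cache, and the SSD read completions are simulated by direct `push` calls.

```python
import threading
from collections import deque


class FillQueue:
    """Hypothetical completion queue that wakes a filler thread without polling."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()
        self._closed = False

    def push(self, key, row):
        # Called from the (simulated) SSD read-completion path.
        with self._cond:
            self._items.append((key, row))
            self._cond.notify()

    def close(self):
        # Signal that no further completions will arrive.
        with self._cond:
            self._closed = True
            self._cond.notify_all()

    def pop(self):
        # Sleeps on the condition variable until an item arrives or the
        # queue is closed; returns None once closed and fully drained.
        with self._cond:
            while not self._items and not self._closed:
                self._cond.wait()
            if self._items:
                return self._items.popleft()
            return None


def run_filler(queue, cache):
    # Filler thread body: move completed reads into the stand-in cache.
    while True:
        item = queue.pop()
        if item is None:
            break
        key, row = item
        cache[key] = row


def demo():
    # Assumption: four embedding rows of width 2, keyed 0..3.
    queue, cache = FillQueue(), {}
    filler = threading.Thread(target=run_filler, args=(queue, cache))
    filler.start()
    for key in range(4):
        queue.push(key, [float(key)] * 2)
    queue.close()
    filler.join()
    return cache
```

Because the filler thread sleeps inside `wait()` until `push` or `close` notifies it, it consumes no CPU while no reads are outstanding, which is the property the abstract attributes to the fill queue.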

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.

Recommended Citation

Anonymous, "GPU-Storage I/O Isolation for Distributed Embedding Training Systems", Technical Disclosure Commons, (June 30, 2026)
https://www.tdcommons.org/dpubs_series/10697

Technical Disclosure Commons

Defensive Publications Series

GPU-Storage I/O Isolation for Distributed Embedding Training Systems
